Cost-Effective Architectural Support for Rollback Recovery ...

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors Milos Prvulovic, Zheng Zhang*, Josep Torrellas University of Illinois at Urbana-Champaign *Hewlett-Packard Laboratories


Page 1: Cost-Effective Architectural Support for Rollback Recovery ...

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

Milos Prvulovic, Zheng Zhang*, Josep Torrellas

University of Illinois at Urbana-Champaign
*Hewlett-Packard Laboratories

Page 2: Cost-Effective Architectural Support for Rollback Recovery ...

Motivation

Availability & Reliability increasingly important

Frequency ↑, Feature Size ↓ ⇒ Errors ↑

Complexity ↑, Verification Cost ↑ ⇒ Errors ↑

Multiprocessors ⇒ Errors ↑

Global software-only recovery too slow

Can hardware help?

Prvulovic et al. ReVive: Cost Effective Rollback Recovery 2

Page 3: Cost-Effective Architectural Support for Rollback Recovery ...

Motivation

Cost vs. Performance vs. Availability

Low Cost
– Simple changes to a few key components

Low Performance Overhead
– Handle frequent operations in hardware

High Availability
– Fast recovery from a wide class of errors

Page 4: Cost-Effective Architectural Support for Rollback Recovery ...

Contribution: New Scheme

Low Cost
– HW changes only to directory controllers
– Memory overhead only 12.5% (with 7+1 parity)

Low Performance Overhead
– Only 6% performance overhead on average

High Availability
– Recovery from: system-wide transients, loss of one node
– Availability better than 99.999% (assuming 1 error/day)

Page 5: Cost-Effective Architectural Support for Rollback Recovery ...

Overview of ReVive

Entire main memory protected by distributed parity
– Like RAID-5, but in memory

Periodically establish a checkpoint
– Main memory is the checkpoint state
– Write back dirty data from caches, save processor context

Save overwritten data to enable restoring checkpoint
– When program execution modifies memory for 1st time

Page 6: Cost-Effective Architectural Support for Rollback Recovery ...

Distributed N+1 Parity

[Figure: the memories of Node 0, Node 1, ..., Node N; a parity group spans the nodes, with one page holding parity and the others holding data]

Parity is distributed to minimize contention

Allocation granularity: page

Update granularity: cache line
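The N+1 parity idea on this slide can be illustrated with a minimal sketch. It assumes plain bytewise XOR parity, as in RAID-5; the helper names (`xor_bytes`, `parity_of`, `reconstruct`) and the toy 4-byte "pages" are illustrative assumptions, not part of the ReVive hardware.

```python
from functools import reduce

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity_of(pages):
    """Parity page of a group: XOR of all of its data pages."""
    return reduce(xor_bytes, pages)

def reconstruct(surviving_pages, parity):
    """Rebuild the single missing data page from the survivors and the parity."""
    return reduce(xor_bytes, surviving_pages, parity)

# Toy 3+1 parity group with 4-byte "pages" (real pages are far larger).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = parity_of(data)

# Pretend the node holding data[1] is lost and rebuild its page.
lost = data[1]
rebuilt = reconstruct([data[0], data[2]], parity)
```

Because XOR is its own inverse, any one page of the group (data or parity) can be recomputed from the remaining ones, which is what makes the loss of a single node recoverable.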

Page 7: Cost-Effective Architectural Support for Rollback Recovery ...

Distributed Parity Update in HW

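[Figure: on a write-back (WB) of line X, the directory controller (Dir) at the home of line X reads the old content from memory (Rd), XORs it with the new content, writes the new content (Wr), and sends a parity update (Par) to the home of the parity for line X. The directory controller there reads the old parity (Rd), XORs it with the update, writes the new parity (Wr), and returns an acknowledgment (Ack).]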
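The parity update for a write-back can be sketched in software as follows. This is only a toy model of the message flow named in the figure (WB, Rd, XOR, Wr, Par, Ack); the classes `LineHome` and `ParityHome` and their methods are hypothetical names introduced for illustration.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))

class ParityHome:
    """Hypothetical model of the node that is home of the parity for line X."""
    def __init__(self, parity):
        self.parity = parity

    def parity_update(self, delta):
        # Rd old parity, XOR it with the update, Wr new parity, reply with Ack.
        self.parity = xor_bytes(self.parity, delta)
        return "Ack"

class LineHome:
    """Hypothetical model of the node that is home of line X."""
    def __init__(self, content, parity_home):
        self.content = content
        self.parity_home = parity_home

    def write_back(self, new_content):
        delta = xor_bytes(self.content, new_content)  # Rd old content, XOR with new
        self.content = new_content                    # Wr new content
        return self.parity_home.parity_update(delta)  # Par message, answered by Ack

other_line = b"\x0f\x0f"                  # another line in the same parity group
old_x, new_x = b"\x12\x34", b"\xab\xcd"   # content of line X before and after
parity_home = ParityHome(xor_bytes(old_x, other_line))
home_x = LineHome(old_x, parity_home)
reply = home_x.write_back(new_x)
```

Sending only the XOR of old and new content means the parity home never needs to read the other data lines of the group.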

Page 8: Cost-Effective Architectural Support for Rollback Recovery ...

ReVive: Checkpoint Creation Timeline

[Figure: timeline for processors P0 to P3. A checkpoint (<1ms for 2MB L2) is followed by 100 ms of execution. Checkpoint phases shown: Timer Interrupt, Sync, Save CPU Context, and Write-Back. Durations shown: <5μs, ~400μs/MB, <1μs, ~20μs.]
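As a rough sanity check of the "<1ms for 2MB L2" figure, the durations on the timeline can be added up. The sketch below assumes that the ~400μs/MB figure is the only cost that scales with cache size and that the other three durations are fixed costs taken at their stated bounds; that split is an assumption, not something the slide states.

```python
# Durations read off the timeline, in microseconds.
PER_MB_US = 400          # "~400μs/MB", assumed to scale with the L2 size
FIXED_US = 5 + 1 + 20    # "<5μs", "<1μs", "~20μs", assumed independent of cache size

def checkpoint_us(l2_size_mb):
    """Estimated time to establish one checkpoint, in microseconds."""
    return FIXED_US + PER_MB_US * l2_size_mb

cost_us = checkpoint_us(2)                 # 2MB L2, as on the slide
interval_us = 100 * 1000                   # 100 ms of execution per checkpoint
overhead = cost_us / (interval_us + cost_us)
```

Under these assumptions a checkpoint takes 826 μs, which is consistent with the slide's bound and is under 1% of a 100 ms interval.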

Page 9: Cost-Effective Architectural Support for Rollback Recovery ...

Logging in HW

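[Figure: a read-exclusive request (RdExcl) for line X arrives at the home of line X. The directory controller (Dir) reads line X from memory (Rd Line X), writes a copy to the log (Wr Log), and sends the data (Data) to the requester.]

Note: Wr Log also updates the parity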
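The logging step can be pictured with a small software model. This is a sketch under simplifying assumptions: one home node, a Python list as the log, and no parity; the class `HomeNode` and its methods are hypothetical names, not the ReVive hardware interface.

```python
class HomeNode:
    """Hypothetical model of the home node's memory and log."""
    def __init__(self, memory):
        self.memory = dict(memory)   # line address -> content
        self.log = []                # (address, old content) saved since the checkpoint

    def read_exclusive(self, addr):
        old = self.memory[addr]          # Rd Line X
        self.log.append((addr, old))     # Wr Log (in ReVive this also updates parity)
        return old                       # Data reply to the requester

    def write_back(self, addr, new_content):
        self.memory[addr] = new_content  # checkpoint content is overwritten here

    def rollback(self):
        # Undo newest first so each line ends up with its checkpoint content.
        while self.log:
            addr, old = self.log.pop()
            self.memory[addr] = old

home = HomeNode({0x40: b"checkpoint"})
home.read_exclusive(0x40)
home.write_back(0x40, b"modified")
modified = home.memory[0x40]
home.rollback()
restored = home.memory[0x40]
```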

Page 10: Cost-Effective Architectural Support for Rollback Recovery ...

Log Filtering

Add L bit to directory entry of each line
– Clear all L bits on each checkpoint
– Set when logged
– Do not log if already set

Not needed for correctness
– Can be only in directory cache
– Can be completely omitted
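A short sketch can show why the L bit is only an optimization. It assumes a single line written twice within one checkpoint interval and a log that is undone newest-first; the function `run` and its flag `use_l_bit` are hypothetical names used for illustration.

```python
def run(use_l_bit):
    """Write line 0 twice in one interval, then roll back.

    Returns (number of log entries, content of line 0 after rollback).
    """
    memory = {0: "A"}            # checkpoint content of line 0
    log, l_bits = [], set()
    for new_content in ("B", "C"):
        if not (use_l_bit and 0 in l_bits):
            log.append((0, memory[0]))   # save the content about to be overwritten
            l_bits.add(0)                # set when logged
        memory[0] = new_content
    log_len = len(log)
    for addr, old in reversed(log):      # undo newest first
        memory[addr] = old
    return log_len, memory[0]

with_filter = run(use_l_bit=True)
without_filter = run(use_l_bit=False)
```

Both runs restore the checkpoint content; the filtered run simply keeps a shorter log.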

Page 11: Cost-Effective Architectural Support for Rollback Recovery ...

Classes of Recoverable Errors

[Figure: a node with CPU, caches, directory controller (with the ReVive hardware), memory controller, DRAM modules, and network controller, attached to the interconnection network]

Can recover from:
– Transient and permanent errors in 1 node
– Transient errors in N nodes

Page 12: Cost-Effective Architectural Support for Rollback Recovery ...

Permanent Node Loss: Recovery

[Figure: recovery timeline for processors P0 to P3 after an error ("Bzzzt!"). Phases shown: Checkpoint, Execute, Detection, HW, Rollback, Repair Log, Repair Data. The machine is unavailable for ~840ms, followed by a degraded period. Durations shown: 100ms, 80ms, ~100ms, ~490ms, ~20s.]

Page 13: Cost-Effective Architectural Support for Rollback Recovery ...

Evaluation Setup

Splash-2 benchmarks

16 superscalar processors (6-issue at 1GHz)

16kB L1 cache, 512kB L2 cache

2-D torus network, virtual cut-through routing

100MHz DDR SDRAM

Using 7+1 distributed parity

Checkpoint interval: 10ms and infinite

Page 14: Cost-Effective Architectural Support for Rollback Recovery ...

Performance Overhead

[Figure: bar chart of execution time increase (0% to 25%) for Barnes, Cholesky, FFT, FMM, LU, Ocean, Radiosity, Radix, Raytrace, Volrend, Water-N2, Water-Sp, and the average, for four configurations: Cp10ms 7+1, CpInf 7+1, Cp10ms 1+1, CpInf 1+1. An inset shows L2 misses per 1000 instructions, with labels Par and Ckp+Log.]

Tolerable 6% performance overhead

Page 15: Cost-Effective Architectural Support for Rollback Recovery ...

Worst-Case Recovery Time

[Figure: stacked bar chart of worst-case recovery time in milliseconds for Barnes, Cholesky, FFT, FMM, LU, Ocean, Radiosity, Radix, Raytrace, Volrend, Water-N2, Water-Sp, and the average, broken into Reconstruct (Log), Reconstruct (Mem), and Undo. Annotations: Redo Work, HW Repair.]

Radix: 590ms + 180ms + 50ms = 820ms ⇒ 99.999% availability
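The step from 820ms to 99.999% availability can be checked with a few lines of arithmetic. The sketch assumes the error rate of 1 error per day stated earlier in the talk, and that the only downtime per error is the worst-case recovery time; the function name `availability` is an illustrative choice.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def availability(downtime_s_per_error, errors_per_day=1.0):
    """Fraction of time the machine is up, given the downtime caused by each error."""
    return 1.0 - downtime_s_per_error * errors_per_day / SECONDS_PER_DAY

# Radix worst case from the slide: 590ms + 180ms + 50ms.
radix_downtime_s = (590 + 180 + 50) / 1000
radix_availability = availability(radix_downtime_s)

# Downtime budget per day that still yields 99.999% availability.
five_nines_budget_s = SECONDS_PER_DAY * 1e-5
```

At one error per day, 99.999% availability allows about 0.864 s of downtime per day, so an 820ms worst-case recovery fits within that budget.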

Page 16: Cost-Effective Architectural Support for Rollback Recovery ...

Network Traffic

[Figure: stacked bar chart of network traffic in bytes per instruction (0 to 1.2) for Barnes, Cholesky, FFT, FMM, LU, Ocean, Radiosity, Radix, Raytrace, Volrend, Water-N2, Water-Sp, and H Mean, broken into RD/RDX, Exe WB, Ckp WB, and PAR]

Page 17: Cost-Effective Architectural Support for Rollback Recovery ...

Memory Traffic

[Figure: stacked bar chart of memory traffic in bytes per instruction (0 to 2.0) for Barnes, Cholesky, FFT, FMM, LU, Ocean, Radiosity, Radix, Raytrace, Volrend, Water-N2, Water-Sp, and H Mean, broken into RD/RDX, Exe WB, Ckp WB, LOG, and PAR]

Page 18: Cost-Effective Architectural Support for Rollback Recovery ...

Related Work

Device- or problem-specific schemes
– DIVA, Redundant Multithreading, Slipstream, ECC, etc.
– ReVive can handle errors that escape these schemes, improving overall availability at low additional cost

Other system-recovery schemes
– Plank et al. - N+1 parity in software
– Masubuchi et al. - logging with bus-snooper
– SafetyNet

Page 19: Cost-Effective Architectural Support for Rollback Recovery ...

Related Work: SafetyNet

Types of recoverable errors
– ReVive: Permanent (loss of a node) + Transient
– SafetyNet: Transient; perm only w/ redundant devices

HW modifications
– ReVive: Directory controller only
– SafetyNet: Memory, caches, coherence protocol

Performance Overhead
– 6% with ReVive, negligible with SafetyNet

Page 20: Cost-Effective Architectural Support for Rollback Recovery ...

Conclusions

Recovery from: system-wide transients, loss of 1 node

Availability better than 99.999%

Low performance overhead (6% on average)

HW changes only to directory controllers

Memory overhead 12.5% with 7+1 parity
– Overhead can be reduced by increasing parity group size
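The memory overhead figure follows directly from the parity group shape: in an N+1 group, one page out of every N+1 holds parity. The sketch below tabulates this for the 7+1 and 1+1 configurations used in the evaluation; the 15+1 entry is a hypothetical larger group added only for comparison.

```python
def parity_overhead(n_data):
    """Fraction of memory that holds parity in an N+1 parity group."""
    return 1 / (n_data + 1)

# 1+1 and 7+1 appear in the evaluation; 15+1 is a hypothetical larger group.
overheads = {n: parity_overhead(n) for n in (1, 7, 15)}
```

This gives 50% for 1+1, 12.5% for 7+1, and 6.25% for 15+1. A larger group lowers the memory overhead, but rebuilding a lost page then requires reading more surviving pages.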

Page 21: Cost-Effective Architectural Support for Rollback Recovery ...

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

Milos Prvulovic, Zheng Zhang, Josep Torrellas

http://[email protected]

Page 22: Cost-Effective Architectural Support for Rollback Recovery ...

Rollback Recovery in Multiprocessors

Checkpoint Consistency
– Global, Local Coordinated or Local Uncoordinated

Checkpoint Separation
– Full or Partial
– Partial can be with Logging, Renaming or Buffering

Checkpoint Storage
– Safe External, Safe Internal or for a Specialized Error Class

Page 23: Cost-Effective Architectural Support for Rollback Recovery ...

Checkpoint Consistency

Global [Synchronization is fast enough on shared-memory machines]
– All synchronize to make a single consistent checkpoint

Local Coordinated
– Synchronize as needed for a set of consistent checkpoints

Local Uncoordinated
– Do not synchronize
– Set of consistent checkpoints computed when recovering

Page 24: Cost-Effective Architectural Support for Rollback Recovery ...

Checkpoint Storage

Safe External (e.g. RAID) [Not fast enough]
– Recovery data on redundancy-protected disk

Safe Internal (e.g. DRAM)
– Recovery data in redundancy-protected memory

Unsafe Internal [Not general enough]
– Recovery data not protected by redundancy
– Assumes memory content survives errors

Page 25: Cost-Effective Architectural Support for Rollback Recovery ...

Checkpoint Separation

Full [Too much storage needed]
– Checkpoint and working data sets do not intersect

Partial with Buffering [Commit atomicity, overhead]
– Buffer non-checkpoint data, flush to commit

Partial with Renaming [Complex HW or coarse grain]
– Rename to avoid overwriting checkpoint data

Partial with Logging
– Save overwritten checkpoint data in a log

Page 26: Cost-Effective Architectural Support for Rollback Recovery ...

Log & Parity Update Races

Error while log update in progress
– Must fully perform log update before starting overwrite

Error while parity update in progress
– Assume a single node fails
– Can recover either old or new content
– Both result in consistent recovery (see paper)

Long error detection latency
– Keep sufficient logs to recover far enough into the past
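The first rule on this slide, that the log update must complete before the overwrite starts, can be illustrated by injecting a crash between the two steps. This is a toy model with one line and an in-memory log; the function `recover_after_crash` and its parameters are hypothetical and exist only for this sketch.

```python
def recover_after_crash(crash_after, log_first=True):
    """Run up to `crash_after` steps of a write, then recover from the log.

    Returns the content of line X after recovery.
    """
    memory = {"X": "checkpoint value"}
    log = []

    def do_log():
        log.append(("X", memory["X"]))   # save current content of line X

    def do_overwrite():
        memory["X"] = "new value"        # overwrite line X in memory

    steps = [do_log, do_overwrite] if log_first else [do_overwrite, do_log]
    for step in steps[:crash_after]:     # the error strikes after this many steps
        step()

    for addr, old in reversed(log):      # recovery: undo newest first
        memory[addr] = old
    return memory["X"]

safe = [recover_after_crash(n, log_first=True) for n in (0, 1, 2)]
unsafe = recover_after_crash(1, log_first=False)
```

With the log written first, every crash point recovers the checkpoint value. If the overwrite came first, a crash between the two steps would leave the new value in memory with no log entry to undo it.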

Page 27: Cost-Effective Architectural Support for Rollback Recovery ...

Availability vs Overhead

If checkpoint interval too short
– Lost work and hardware self-check dominate recovery
– Fault-free execution performance suffers

If checkpoint interval too long
– Low availability

Find a good balance
– Checkpoint intervals of 100ms to 1s

Page 28: Cost-Effective Architectural Support for Rollback Recovery ...

Analysis

Cache size vs. checkpoint interval
– 512kB caches with checkpoints every 10ms
– 5MB caches with checkpoints every 100ms

Log size vs. checkpoint interval
– Log will grow in sub-linear proportion to interval size
– 10ms: <3MB per node, only two apps >128kB per node

Parity overhead: 12.5% of system memory is parity