Resilient Theme Task # 5.5.3 GSRC Annual Symposium · SWAT_poster_draft_3.pptx Author: Pradeep...

1
GSRC Annual Symposium September 28, 2010 through October 1, 2010 Detection Results Low SDC rate for all apps <0.5% of injections SDCs Short detection latency >90% in <100K instr Low-cost symptom detection feasible for HW faults Diagnosis Results >95% successful diagnosis Latency <10M invisible µarch-level diagnosis for repair SWAT diagnoses faults in single and multi-core systems Recovery Results Pradeep Ramachandran, Siva Kumar SastryHari, Manlap Li, SwarupSahoo, Robert Smolinsk, Xin Fu, Lei Chen, SaritaAdve, VikramAdve Resilient Theme Task # 5.5.3 The Reliability Threat Technology scaling smaller devices vulnerable to failures Increased in-the-field failures in commodity systems Need low-cost detection, diagnosis, recovery, repair solutions Traditional solutions high area, performance, power SWAT: A Comprehensive Low Cost Solution Fault Detection [ASPLOS ʻ08, DSN ʻ08] Fault Recovery [submitted] Key Findings SWAT effective for permanent, transient faults in many apps Detection: <0.5% SDC rate in SPEC, server, media apps Low overheads during fault-free execution Recovery: Majority of faults recoverable in <100K instructions <5% perf, near-zero area impact from recovery operations Diagnosis: >95% of detected faults successfully diagnosed Faulty core identified without spare core TMR/DMR only for diagnosis does not impact fault-free exec Fault Diagnosis [DSN ʼ08, MICRO ʻ09] Transient errors Wear-out Design Bugs … and so on Goal: Effective, quick detection with minimal fault-free impact Use symptom detectors to monitor anomalous SW execution Simple hardware detectors with low area overheads Low-cost SW detectors to aid HW detectors Goal: Low-cost fault recovery in the presence of I/O HW checkpoint to restore system state Low-cost recovery for proc + memory Buffer external outputs in dedicated HW First low-cost implementation w/ simple HW Avoids commonly ignored output-commit problem Leverage SW support for device reset, input replay Goal: Diagnose fault source without affecting fault-free exec No spares for diagnosis Diagnose faulty core even when symptom from fault-free core Fatal Traps Div by zero, RED state, etc. Hangs Simple HW hang detector Kernel Panic OS panics due to fault High OS High contiguous OS activity App Abort App abort due to fault 0% 20% 40% 60% 80% 100% Full No-Device No-I/O Full No-Device No-I/O Full No-Device No-I/O Full No-Device No-I/O 100K 10M 100K 10M Permanents Transients Injected Faults Potential SDC DUE Recovered Masked 6.3% 2.5% 2.3% 1.5% Ongoing and Future Work Ongoing: Prototyping SWAT on FPGA Implement SWAT firmware in OpenSolaris Demonstrate SWAT on multicore OpenSPARC FPGA Leverage Univ. of Michigan CrashTest for fault injection Understand when/why SWAT works Evaluate SWAT for off-core faults, other fault models A B C D Challenges Multithreaded applications Full-system deterministic replay No known good core Isolated deterministic replay Emulated TMR Key Ideas T A T B T C T D T A T A T B T C T D T A T B T C T D T A T B T C T D 0% 20% 40% 60% 80% 100% Decoder INT ALU Reg Dbus Int reg ROB RAT AGEN Average Detected Faults CorrectlyDiagnosed Undiagnosed 99 100 99 87 100 78 99 95.9 1 10 100 10K 100K 1M 2M 5M 10M Client exec time with buffer/ without buffer Chkpt Interval (in instructions) apache sshd squid mysql Fault Out-of-Bounds HW/SW co-designed detector Monitor legal limit of addresses Low perf, area overhead iSWAT Compiler support to detect faults Use likely invariants as detectors Low false +ves, perf. impact 0% 20% 40% 60% 80% 100% SPEC Server Media SPEC Server Media Permanents Transients Total injections Masked Detected App-Tolerated SDC 0.1 0.1 0.2 0.2 0.3 0.5 * Does not include iSWAT detectors Low overheads @ 100K inst <5% perf, <2KB area Practical sol delay <1M inst High recovery at 100K interval Low perf, area impact SWAT effective for low-cost fault recovery

Transcript of Resilient Theme Task # 5.5.3 GSRC Annual Symposium · SWAT_poster_draft_3.pptx Author: Pradeep...

Page 1: Resilient Theme Task # 5.5.3 GSRC Annual Symposium · SWAT_poster_draft_3.pptx Author: Pradeep Created Date: 9/15/2010 8:03:22 PM ...

GSRC Annual Symposium

September 28, 2010 through

October 1, 2010

Detection Results

Low SDC rate for all apps

<0.5% of injections SDCs

Short detection latency

>90% in <100K instr

⇒ Low-cost symptom detection feasible for HW faults

Diagnosis Results >95% successful diagnosis

Latency <10M ⇒invisible

µarch-level diagnosis for repair

⇒ SWAT diagnoses faults in single and multi-core systems

Recovery Results

Pradeep Ramachandran, Siva Kumar SastryHari, Manlap Li, SwarupSahoo, Robert Smolinsk, Xin Fu, Lei Chen, SaritaAdve, VikramAdve

Resilient Theme Task # 5.5.3

The Reliability Threat

Technology scaling ⇒ smaller devices vulnerable to failures

Increased in-the-field failures in commodity systems

Need low-cost detection, diagnosis, recovery, repair solutions

Traditional solutions ⇒ high area, performance, power

SWAT: A Comprehensive Low Cost Solution

Fault Detection [ASPLOS ʻ08, DSN ʻ08] Fault Recovery [submitted]

Key Findings SWAT effective for permanent, transient faults in many apps

Detection: <0.5% SDC rate in SPEC, server, media apps

Low overheads during fault-free execution

Recovery: Majority of faults recoverable in <100K instructions

<5% perf, near-zero area impact from recovery operations

Diagnosis: >95% of detected faults successfully diagnosed

Faulty core identified without spare core

TMR/DMR only for diagnosis ⇒ does not impact fault-free exec

Fault Diagnosis [DSN ʼ08, MICRO ʻ09]

Transient errors Wear-out Design Bugs … and so on

Goal: Effective, quick detection with minimal fault-free impact

Use symptom detectors to monitor anomalous SW execution

Simple hardware detectors with low area overheads

Low-cost SW detectors to aid HW detectors

Goal: Low-cost fault recovery in the presence of I/O

HW checkpoint to restore system state

Low-cost recovery for proc + memory

Buffer external outputs in dedicated HW

First low-cost implementation w/ simple HW

Avoids commonly ignored output-commit problem

Leverage SW support for device reset, input replay

Goal: Diagnose fault source without affecting fault-free exec

⇒ No spares for diagnosis

Diagnose faulty core even when symptom from fault-free core

Fatal Traps

Div by zero, RED state, etc.

Hangs

Simple HW hang detector

Kernel Panic

OS panics due to fault

High OS

High contiguous OS activity

App Abort

App abort due to fault

0% 20% 40% 60% 80%

100%

Full

No-

Dev

ice

No-

I/O

Full

No-

Dev

ice

No-

I/O

Full

No-

Dev

ice

No-

I/O

Full

No-

Dev

ice

No-

I/O

100K 10M 100K 10M

Permanents Transients

Inje

cted

Fau

lts

Potential SDC DUE Recovered Masked

6.3% 2.5% 2.3% 1.5%

Ongoing and Future Work Ongoing: Prototyping SWAT on FPGA

Implement SWAT firmware in OpenSolaris

Demonstrate SWAT on multicore OpenSPARC FPGA

Leverage Univ. of Michigan CrashTest for fault injection

Understand when/why SWAT works

Evaluate SWAT for off-core faults, other fault models

A B C D Challenges

Multithreaded applications

Full-system deterministic

replay

No known good core

Isolated deterministic

replay

Emulated TMR

Key Ideas TA TB TC TD

TA

TA TB TC TD

TA TB TC TD

TA TB TC TD

0%

20%

40%

60%

80%

100%

Dec

oder

INT

ALU

Reg

Dbu

s

Int r

eg

RO

B

RAT

AG

EN

Aver

age

Det

ecte

d Fa

ults

CorrectlyDiagnosed Undiagnosed

99 100 99 87 100 78 99 95.9

1

10

100

10K 100K 1M 2M 5M 10M Clie

nt e

xec

time

with

buf

fer/

with

out b

uffe

r

Chkpt Interval (in instructions)

apache

sshd

squid

mysql

Fault

Out-of-Bounds HW/SW co-designed detector

Monitor legal limit of addresses Low perf, area overhead

iSWAT Compiler support to detect faults Use likely invariants as detectors

Low false +ves, perf. impact

0%

20%

40%

60%

80%

100%

SP

EC

Ser

ver

Med

ia

SP

EC

Ser

ver

Med

ia

Permanents Transients

Tota

l inj

ectio

ns

Masked Detected App-Tolerated SDC

0.1 0.1 0.2 0.2 0.3 0.5

* Does not include iSWAT detectors

Low overheads @ 100K inst

<5% perf, <2KB area

Practical sol ⇒delay <1M inst

High recovery at 100K interval

Low perf, area impact

⇒ SWAT effective for low-cost fault recovery