SWAT: Designing Resilient Hardware by Treating Software Anomalies
SWAT: Designing Resilient Hardware by Treating Software Anomalies
Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu,
Sarita Adve, Vikram Adve, Yuanyuan Zhou
Department of Computer Science, University of Illinois at Urbana-Champaign
2
Motivation
• Hardware failures will happen in the field
– Aging, soft errors, inadequate burn-in, design defects, …
→ Need in-field detection, diagnosis, recovery, repair
• Reliability problem pervasive across many markets
– Traditional redundancy (e.g., nMR) too expensive
– Piecemeal solutions for specific fault models too expensive
– Must incur low area, performance, power overhead
→ Today: low-cost solution for multiple failure sources
3
Observations
• Need to handle only hardware faults that propagate to software
• Fault-free case remains common, must be optimized
→ Watch for software anomalies (symptoms)
Hardware fault detection ~ software bug detection
Zero to low overhead “always-on” monitors
→ Diagnose cause after symptom detected
May incur high overhead, but rarely invoked
SWAT: SoftWare Anomaly Treatment
4
SWAT Framework Components
• Detection: Symptoms of S/W misbehavior, minimal backup H/W
• Recovery: Hardware/Software checkpoint and rollback
• Diagnosis: Rollback/replay on multicore
• Repair/reconfiguration: Redundant, reconfigurable hardware
• Flexible control through firmware
[Diagram: Fault → Error → Symptom detected → Diagnosis → Repair, with checkpoint-based Recovery]
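The hardware/software checkpoint-and-rollback recovery above can be sketched as a minimal Python model (all names hypothetical, not the SWAT implementation):

```python
# Hypothetical sketch of checkpoint-and-rollback recovery: state snapshots
# are taken periodically; when a symptom is detected, execution rolls back
# to the last checkpoint, which predates the fault's effects.
import copy

class CheckpointedMachine:
    def __init__(self, state):
        self.state = state        # architectural state, modeled as a dict
        self.checkpoints = []

    def checkpoint(self):
        """Take a snapshot of the current state."""
        self.checkpoints.append(copy.deepcopy(self.state))

    def rollback(self):
        """Symptom detected: restore the most recent checkpoint."""
        self.state = self.checkpoints.pop()
        return self.state
```

The deep copy stands in for whatever hardware buffering or software logging the real recovery substrate uses.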
5
SWAT
1. Detectors w/ Hardware support [ASPLOS ‘08]
2. Detectors w/ Software support [Sahoo et al., DSN ‘08]
3. Trace-Based Fault Diagnosis [Li et al., DSN ‘08]
4. Accurate Fault Modeling
[Diagram: Fault → Error → Symptom detected → Diagnosis → Repair, with checkpoint-based Recovery]
6
Hardware-Only Symptom-Based Detection
• Observe anomalous symptoms for fault detection
– Incur low overheads for “always-on” detectors
– Minimal support from hardware
• Fatal traps generated by hardware
– Division by zero, RED state, etc.
• Hangs detected using simple hardware hang detector
• High OS activity detected with performance counter
– Typical OS invocations take 10s or 100s of instructions
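A software model of the two custom detectors above, a hang detector and a high-OS-activity detector, might look like the following sketch (thresholds and interface are illustrative assumptions, not SWAT's actual hardware):

```python
# Hypothetical model of two "always-on" symptom monitors:
#  - hang: the retired PC stops changing for many instructions
#  - high-os-activity: one OS invocation retires far more instructions
#    than the typical 10s-100s of instructions.
HANG_WINDOW = 1000          # assumed threshold, for illustration only
OS_INSTR_THRESHOLD = 30000  # assumed threshold, for illustration only

class SymptomMonitor:
    def __init__(self):
        self.last_pc = None
        self.same_pc_count = 0
        self.os_instr_count = 0

    def retire(self, pc, in_os):
        """Observe one retired instruction; return a symptom name or None."""
        if pc == self.last_pc:
            self.same_pc_count += 1
            if self.same_pc_count >= HANG_WINDOW:
                return "hang"
        else:
            self.last_pc, self.same_pc_count = pc, 0
        if in_os:
            self.os_instr_count += 1
            if self.os_instr_count >= OS_INSTR_THRESHOLD:
                return "high-os-activity"
        else:
            self.os_instr_count = 0  # OS invocation ended
        return None
```

In hardware these would be a small comparator on the retired PC plus an existing performance counter, which is why the detectors are nearly free.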
7
Experimental Methodology
• Microarchitecture-level fault injection
– GEMS timing models + Simics full-system simulation
– SPEC workloads on Solaris-9 OS
• Permanent fault models
– Stuck-at, bridging faults in latches of 8 µarch structures
– 12,800 faults, <0.3% error @ 95% confidence
• Simulate impact of fault in detail for 10M instructions
[Timeline: functional simulation until the fault is injected → 10M instructions of detailed timing simulation → if no symptom in 10M instr, run to completion functionally; outcome = app-masked, symptom > 10M instr, or silent data corruption (SDC)]
8
Efficacy of Hardware-only Detectors
• Coverage: percentage of unmasked faults detected
– 98% of faults detected, 0.4% give SDC (w/o FPU)
→ Additional support required for FPU-like units
– 66% of detected faults corrupt OS state, need recovery
→ Despite low OS activity in fault-free execution
• Latency: number of instructions between activation and detection
– HW recovery for up to 100K instr; SW recovery for longer latencies
– App state in 87% of detections recoverable using HW
– OS state recoverable in virtually all detections using HW
→ OS recovery using SW is hard
9
Improving SWAT Detection Coverage
Can we improve coverage, SDC rate further?
• SDC faults primarily corrupt data values
– Illegal control/address values caught by other symptoms
– Need detectors that capture “semantic” information
• Software-level invariants capture program semantics
– Use when higher coverage desired
– Sound program invariants require expensive static analysis
– We use likely program invariants
10
Likely Program Invariants
• Likely program invariants
– Hold on all observed inputs, expected to hold on others
– But suffer from false positives
– Use SWAT diagnosis to detect false positives on-line
• iSWAT – compiler-assisted symptom detectors
– Range-based value invariants [Sahoo et al., DSN ‘08]
– Check MIN ≤ value ≤ MAX on data values
– Disable invariant when diagnosed as a false positive
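A range-based value invariant of the kind described above can be sketched in a few lines (class and method names are illustrative; iSWAT inserts the equivalent checks via an LLVM compiler pass):

```python
# Hypothetical sketch of one range-based likely invariant: training records
# MIN/MAX over observed values at a program point; monitoring flags any
# out-of-range value as a symptom. Diagnosis may later disable the
# invariant if the violation turns out to be a false positive.
class RangeInvariant:
    def __init__(self):
        self.lo = self.hi = None
        self.enabled = True

    def train(self, value):
        """Widen the range to include a value seen on a training input."""
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def check(self, value):
        """Return True if MIN <= value <= MAX is violated (a symptom)."""
        if not self.enabled or self.lo is None:
            return False
        return not (self.lo <= value <= self.hi)

    def disable(self):
        """Diagnosis classified a violation as a false positive."""
        self.enabled = False
```

Because the invariants are only *likely*, the `disable` path is essential: it is what lets iSWAT tolerate inputs never seen during training.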
11
iSWAT implementation
[Diagram, training phase: application + test/train/external inputs → Compiler Pass in LLVM adds invariant-monitoring code → per-input ranges (i/p #1 … i/p #n) → merged invariant ranges]
12
iSWAT implementation
[Diagram, training phase: application + test/train/external inputs → Compiler Pass in LLVM adds invariant-monitoring code → per-input ranges (i/p #1 … i/p #n) → merged invariant ranges]
[Diagram, fault-detection phase: invariant ranges + application → Compiler Pass in LLVM adds invariant-checking code → full-system simulation on ref input with injected faults → invariant violation → SWAT diagnosis → fault detection, or false positive (disable invariant)]
13
iSWAT Results
• Evaluated iSWAT with 5 apps on the previous methodology
• Undetected faults reduced by 30%
• Invariants reduce SDCs by 73% (33 to 9)
• Overheads: 5% on x86, 14% on UltraSPARC IIIi
– Reasonably low overheads on some machines
– Un-optimized invariants used; can be reduced further
• Exploring more sophisticated invariants for coverage, overheads
14
Fault Diagnosis
• Symptom-based detection is cheap, but
– High latency from fault activation to detection
– Difficult to diagnose root cause of fault
– How to diagnose SW bug vs. transient vs. permanent fault?
• For a permanent fault within a core
– Disable entire core? Wasteful!
– Disable/reconfigure µarch-level unit?
– How to diagnose faults to µarch-unit granularity?
• Key ideas
– Single-core fault model; fault-free core available in a multicore
– Checkpoint/replay for recovery → replay on good core, compare
– Synthesizing DMR, but only for diagnosis
15
SW Bug vs. Transient vs. Permanent
• Rollback/replay on same/different core
• Watch if symptom reappears
[Flowchart: Symptom detected → rollback/replay on the faulty core.
No symptom → transient fault or non-deterministic s/w bug → continue execution.
Symptom → false positive (iSWAT), deterministic s/w bug, or permanent h/w fault → rollback/replay on a good core.
Symptom reappears → false positive (iSWAT) or deterministic s/w bug → send to s/w layer.
No symptom → permanent h/w fault → needs repair!]
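The decision tree above can be written down directly (function names are hypothetical; the two replay callbacks stand in for re-executing from a checkpoint and reporting whether the symptom reappears):

```python
# Sketch of SWAT's rollback/replay diagnosis: replay on the faulty core
# separates transient from deterministic causes; replay on a known-good
# core then separates software bugs from permanent hardware faults.
def diagnose(replay_on_faulty_core, replay_on_good_core):
    if not replay_on_faulty_core():
        # Symptom did not recur on the same core.
        return "transient or non-deterministic s/w bug"
    if replay_on_good_core():
        # Symptom follows the software to a good core.
        return "false positive or deterministic s/w bug (send to s/w layer)"
    # Symptom recurs only on the original core.
    return "permanent h/w fault (needs repair)"
```

This is the sense in which SWAT "synthesizes DMR": the redundant execution exists only during diagnosis, not during fault-free operation.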
16
Diagnosis Framework
[Diagram: Symptom detected → Diagnosis → software bug | transient fault | permanent fault; a permanent fault triggers microarchitecture-level diagnosis → “Unit X is faulty”]
17
Trace-Based Fault Diagnosis (TBFD)
[Diagram: permanent fault detected → invoke TBFD; the diagnosis algorithm compares (=?) faulty-core execution against fault-free-core execution]
18
Trace-Based Fault Diagnosis (TBFD)
[Diagram, continued: permanent fault detected → invoke TBFD → rollback faulty core to checkpoint → replay execution, collect info; compare (=?) against fault-free-core execution]
19
Trace-Based Fault Diagnosis (TBFD)
[Diagram, continued: permanent fault detected → invoke TBFD → rollback faulty core to checkpoint, replay execution and collect info; load checkpoint on fault-free core, execute instructions fault-free; compare (=?). Open questions: What info to collect? What info to compare? What to do on divergence?]
20
Can a Divergent Instruction Lead to Diagnosis?
• Simpler case: ALU fault
[Figure: faulty vs. fault-free execution of “add r1,r3,r5” and “sub r6,r1,r2”, annotated with destination physical register, decoded ALU used, and results; the faulty core’s results diverge on both instructions]
→ Both divergent instructions used the same ALU → ALU1 faulty
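The "simpler case" above can be sketched as trace intersection (the trace format and function name are illustrative assumptions, not TBFD's actual data structures):

```python
# Hypothetical sketch of attributing divergences to a unit: each retired
# instruction records its result and which functional unit executed it.
# Units common to all divergent instructions are diagnosis candidates.
def diagnose_unit(faulty_trace, good_trace):
    """Traces are lists of (result, unit); return candidate faulty units."""
    candidates = None
    for (res_f, unit), (res_g, _) in zip(faulty_trace, good_trace):
        if res_f != res_g:  # divergent instruction
            candidates = {unit} if candidates is None else candidates & {unit}
    return candidates or set()
```

As the RAT example on the next slide shows, this simple intersection is not enough when the divergent instruction never touches the faulty hardware; that is why real TBFD also looks backward and forward in the trace.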
21
Can a Divergent Instruction Lead to Diagnosis?
• Complex example: fault in register alias table (RAT) entry
• Divergent instructions do not directly lead to faulty unit
• Instead, look backward/forward in instruction stream
– Need to collect and analyze instruction trace
[Figure: a fault corrupts the RAT mapping for r2, so IA: r3 ← r2 + r2 reads the wrong physical register and writes a wrong value to the register file; a later IB: r1 ← r5 * r2 then diverges (fault-free r1 = 12) even though IB does not use the faulty hardware]
22
Diagnosing Permanent Fault to µarch Granularity
• Trace-based fault diagnosis (TBFD)
– Compare instruction trace of faulty vs. good execution
– Divergence → faulty hardware used → diagnosis clues
• Diagnose faults to µarch units of the processor
– Check µarch-level invariants in several parts of the processor
– Front-end, meta-datapath, and datapath faults
– Diagnosis in out-of-order logic (meta-datapath) is complex
• Results
– 98% of faults detected by SWAT successfully diagnosed
– TBFD flexible for other detectors/granularities of repair
23
SWAT
1. Detectors w/ Hardware support [ASPLOS ‘08]
2. Detectors w/ Software support [Sahoo et al., DSN ‘08]
3. Trace-Based Fault Diagnosis [Li et al., DSN ‘08]
4. Accurate Fault Modeling
[Diagram: Fault → Error → Symptom detected → Diagnosis → Repair, with checkpoint-based Recovery]
24
SWAT-Sim: Fast and Accurate Fault Models
• Need accurate µarch-level fault models
– Gate-level injections accurate but too slow
– µarch (latch) level injections fast but inaccurate
• Can we achieve µarch-level speed at gate-level accuracy?
• Mixed-mode (hierarchical) simulation
– µarch-level + gate-level simulation
– Simulate only the faulty component at gate level, on demand
– Invoke gate-level simulation online for permanent faults
→ Simulates fault effects with real-world vectors
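The mixed-mode dispatch can be sketched as follows (all names are hypothetical; SWAT-Sim actually bridges a µarch simulator to NCVerilog via VPI):

```python
# Hypothetical sketch of mixed-mode simulation: the fast uarch-level model
# executes every operation, but when an operation uses the faulty unit its
# inputs become stimuli for a slow gate-level fault simulator, and the
# (possibly corrupted) response is used as the result.
def mixed_mode_execute(op, a, b, uses_faulty_unit, gate_level_sim):
    uarch_ops = {"add": lambda x, y: x + y, "sub": lambda x, y: x - y}
    if uses_faulty_unit:
        # On-demand gate-level simulation of only the faulty component.
        return gate_level_sim(op, a, b)
    return uarch_ops[op](a, b)  # fast uarch-level result
```

Because only the faulty unit's activations pay the gate-level cost, the common case runs at µarch speed, which is where the reported ~2x (rather than ~100,000x) slowdown comes from.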
25
SWAT-Sim: Gate-level Accuracy at µarch Speeds
[Diagram: during µarch simulation of r3 ← r1 op r2, check whether the faulty unit is used. No → continue µarch-level simulation. Yes → pass the inputs as stimuli to gate-level fault simulation; the response, with the fault propagated to the output, becomes r3]
26
Results from SWAT-Sim
• SWAT-Sim implemented within full-system simulation
– NCVerilog + VPI for gate-level simulation of ALU/AGEN modules
• SWAT-Sim: high accuracy at low overheads
– 100,000x faster than gate-level simulation, same modeling fidelity
– 2x slowdown over µarch-level simulation, at higher accuracy
• Accuracy of µarch models evaluated using SWAT coverage/latency
– µarch stuck-at models generally inaccurate
– Differences in activation rate, multi-bit flips
• Complex manifestations → hard to derive better models
– Need SWAT-Sim, at least for now
27
SWAT Summary
• SWAT: SoftWare Anomaly Treatment
– Handle all and only faults that matter
– Low, amortized overheads
– Holistic systems view enables novel solutions
– Customizable and flexible
• Prior results:
– Low-cost h/w detectors gave high coverage, low SDC rate
• This talk:
– iSWAT: higher coverage w/ software-assisted detectors
– TBFD: µarch-level fault diagnosis by synthesizing DMR
– SWAT-Sim: gate-level fault accuracy at µarch-level speed
28
Future Work
• Recovery: hybrid, application-specific
• Aggressive use of software reliability techniques
– Leverage diagnosis mechanism
• Multithreaded software
• Off-core faults
• Post-silicon debug and test
– Use faulty trace as fault-model-oblivious test vector
• Validation on FPGA (w/ Michigan)
• Hardware assertions to complement software symptoms
BACKUP SLIDES
30
Breakup of Detections by SW Symptoms
[Bar chart: per-structure breakdown of injection outcomes (Arch-Mask, App-Mask, FatalTrap-App/OS, Hang-App/OS, High-OS, Symp>10M, SDC) for Decoder, Int ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, FP ALU, and the average w/o FP; detection is 95–100% for all structures except FP ALU at 27%]
• 98% of unmasked faults detected within 10M instr (w/o FPU)
– Need HW support or SW monitoring for FPU
31
SW Components Corrupted
• 66% of faults corrupt system state before detection
– Need to recover system state
[Bar chart: per-structure percentage of injections corrupting “OS and maybe app”, “App only”, or “None”, for Decoder, INT ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, FP ALU]
32
Latency from Application Mismatch
[Bar chart: per-structure distribution of detection latency, in instructions, across log-scale buckets from 1 to 10M]
• 86% of faults detected under 100K instructions
– 42% detected under 10K
33
Latency from OS Mismatch
[Bar chart: per-structure distribution of detection latency, in instructions, across log-scale buckets from 1 to 10M]
• 99% of faults detected under 100K instructions
34
iSWAT Implementation
[Diagram: training phase and fault-detection phase, as on slide 12]
35
Trace-Based Fault Diagnosis (TBFD)
[Diagram: permanent fault detected → invoke diagnosis → rollback faulty core to checkpoint, replay execution and collect µarch info; load checkpoint on fault-free core and execute instructions fault-free; TBFD compares faulty trace vs. test trace (=?) to diagnose faults in the front-end, meta-datapath, and datapath]
36
Fault Diagnosability
[Bar chart: per-structure percentage of detected faults diagnosed to a unique unit (D-Unique), diagnosed to another unit (D-Other), with no mismatch, or diagnosed incorrectly, for Decoder, INT ALU, Reg Dbus, Int Reg, ROB, RAT, AGEN, and overall]
• 98% of detected faults are diagnosed
– 89% diagnosed to a unique unit/array entry
– Meta-datapath faults in out-of-order execution can mislead TBFD
37
Accuracy of Existing Fault Models
• SWAT-Sim implemented within full-system simulator
– NCVerilog + VPI to simulate gate-level ALU and AGEN
[Bar charts, Integer ALU and AGEN: percentage of injections per outcome (Uarch-Mask, Arch-Mask, App-Mask, Detected, Detected>10M, SDC) under µarch stuck-at-0/1 vs. gate-level stuck-at-0/1 and gate-level delay fault models; detection ranges from 89.4% to 100%]
• Existing µarch-level fault models inaccurate
– Differences in activation rate, multi-bit flips
• Accurate models hard to derive → need SWAT-Sim!
38
Summary: SWAT Advantages
• Handles all faults that matter
– Oblivious to low-level failure modes & masked faults
• Low, amortized overheads
– Optimize for common case, exploit s/w reliability solutions
• Holistic systems view enables novel solutions
– Invariant detectors use diagnosis mechanisms
– Diagnosis uses recovery mechanisms
• Customizable and flexible
– Firmware-based control affords hybrid, app-specific recovery (TBD)
• Beyond hardware reliability
– SWAT treats hardware faults as software bugs
– Long-term goal: unified system (h/w + s/w) reliability at lowest cost
– Potential applications to post-silicon test and debug
39
Transients Results
• 6,400 transient faults injected across 8 structures
• 83% of unmasked faults detected within 10M instr
• Only 0.4% of injected faults result in SDCs