SWAT: Designing Resilient Hardware by Treating Software Anomalies
SWAT: Designing Resilient Hardware by Treating Software Anomalies
Man-Lap (Alex) Li, Pradeep Ramachandran, Swarup K. Sahoo, Siva Kumar Sastry Hari, Rahmet Ulya Karpuzcu,
Sarita Adve, Vikram Adve, Yuanyuan Zhou
Department of Computer Science, University of Illinois at Urbana-Champaign
2
Motivation
• Hardware failures will happen in the field
– Aging, soft errors, inadequate burn-in, design defects, …
→ Need in-field detection, diagnosis, recovery, repair
• Reliability problem pervasive across many markets
– Traditional redundancy (e.g., nMR) too expensive
– Piecemeal solutions for specific fault models too expensive
– Must incur low area, performance, power overhead
→ Today: low-cost solution for multiple failure sources
3
Observations
• Need to handle only hardware faults that propagate to software
• Fault-free case remains common, must be optimized
→ Watch for software anomalies (symptoms)
Hardware fault detection ~ software bug detection
Zero to low overhead “always-on” monitors
→ Diagnose cause after symptom detected
May incur high overhead, but rarely invoked
SWAT: SoftWare Anomaly Treatment
4
SWAT Framework Components
• Detection: Symptoms of S/W misbehavior, minimal backup H/W
• Recovery: Hardware/Software checkpoint and rollback
• Diagnosis: Rollback/replay on multicore
• Repair/reconfiguration: Redundant, reconfigurable hardware
• Flexible control through firmware
[Diagram: Fault → Error → Symptom detected → Diagnosis → Repair, with checkpoint-based Recovery]
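The hardware/software checkpoint-and-rollback recovery above can be sketched as a minimal Python model (all names hypothetical, not the SWAT implementation):

```python
# Hypothetical sketch of checkpoint-and-rollback recovery: state snapshots
# are taken periodically; when a symptom is detected, execution rolls back
# to the last checkpoint, which predates the fault's effects.
import copy

class CheckpointedMachine:
    def __init__(self, state):
        self.state = state        # architectural state, modeled as a dict
        self.checkpoints = []

    def checkpoint(self):
        """Take a snapshot of the current state."""
        self.checkpoints.append(copy.deepcopy(self.state))

    def rollback(self):
        """Symptom detected: restore the most recent checkpoint."""
        self.state = self.checkpoints.pop()
        return self.state
```

The deep copy stands in for whatever hardware buffering or software logging the real recovery substrate uses.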
5
SWAT
1. Detectors w/ Hardware support [ASPLOS ‘08]
2. Detectors w/ Software support [Sahoo et al., DSN ‘08]
3. Trace-Based Fault Diagnosis [Li et al., DSN ‘08]
4. Accurate Fault Modeling
[Diagram: Fault → Error → Symptom detected → Diagnosis → Repair, with checkpoint-based Recovery]
6
Hardware-Only Symptom-Based Detection
• Observe anomalous symptoms for fault detection
– Incur low overheads for “always-on” detectors
– Minimal support from hardware
• Fatal traps generated by hardware
– Division by zero, RED state, etc.
• Hangs detected using simple hardware hang detector
• High OS activity detected with performance counter
– Typical OS invocations take 10s or 100s of instructions
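A software model of the two custom detectors above, a hang detector and a high-OS-activity detector, might look like the following sketch (thresholds and interface are illustrative assumptions, not SWAT's actual hardware):

```python
# Hypothetical model of two "always-on" symptom monitors:
#  - hang: the retired PC stops changing for many instructions
#  - high-os-activity: one OS invocation retires far more instructions
#    than the typical 10s-100s of instructions.
HANG_WINDOW = 1000          # assumed threshold, for illustration only
OS_INSTR_THRESHOLD = 30000  # assumed threshold, for illustration only

class SymptomMonitor:
    def __init__(self):
        self.last_pc = None
        self.same_pc_count = 0
        self.os_instr_count = 0

    def retire(self, pc, in_os):
        """Observe one retired instruction; return a symptom name or None."""
        if pc == self.last_pc:
            self.same_pc_count += 1
            if self.same_pc_count >= HANG_WINDOW:
                return "hang"
        else:
            self.last_pc, self.same_pc_count = pc, 0
        if in_os:
            self.os_instr_count += 1
            if self.os_instr_count >= OS_INSTR_THRESHOLD:
                return "high-os-activity"
        else:
            self.os_instr_count = 0  # OS invocation ended
        return None
```

In hardware these would be a small comparator on the retired PC plus an existing performance counter, which is why the detectors are nearly free.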
7
Experimental Methodology
• Microarchitecture-level fault injection
– GEMS timing models + Simics full-system simulation
– SPEC workloads on Solaris-9 OS
• Permanent fault models
– Stuck-at, bridging faults in latches of 8 µarch structures
– 12,800 faults, <0.3% error @ 95% confidence
• Simulate impact of fault in detail for 10M instructions
[Timeline: functional simulation until the fault is injected → 10M instructions of detailed timing simulation → if no symptom in 10M instr, run to completion functionally; outcome = app-masked, symptom > 10M instr, or silent data corruption (SDC)]
8
Efficacy of Hardware-only Detectors
• Coverage: percentage of unmasked faults detected
– 98% of faults detected, 0.4% give SDC (w/o FPU)
→ Additional support required for FPU-like units
– 66% of detected faults corrupt OS state, need recovery
→ Despite low OS activity in fault-free execution
• Latency: number of instructions between activation and detection
– HW recovery for up to 100K instr; SW recovery for longer latencies
– App state in 87% of detections recoverable using HW
– OS state recoverable in virtually all detections using HW
→ OS recovery using SW is hard
9
Improving SWAT Detection Coverage
Can we improve coverage, SDC rate further?
• SDC faults primarily corrupt data values
– Illegal control/address values caught by other symptoms
– Need detectors that capture “semantic” information
• Software-level invariants capture program semantics
– Use when higher coverage desired
– Sound program invariants require expensive static analysis
– We use likely program invariants
10
Likely Program Invariants
• Likely program invariants
– Hold on all observed inputs, expected to hold on others
– But suffer from false positives
– Use SWAT diagnosis to detect false positives on-line
• iSWAT – compiler-assisted symptom detectors
– Range-based value invariants [Sahoo et al., DSN ‘08]
– Check MIN ≤ value ≤ MAX on data values
– Disable invariant when diagnosed as a false positive
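A range-based value invariant of the kind described above can be sketched in a few lines (class and method names are illustrative; iSWAT inserts the equivalent checks via an LLVM compiler pass):

```python
# Hypothetical sketch of one range-based likely invariant: training records
# MIN/MAX over observed values at a program point; monitoring flags any
# out-of-range value as a symptom. Diagnosis may later disable the
# invariant if the violation turns out to be a false positive.
class RangeInvariant:
    def __init__(self):
        self.lo = self.hi = None
        self.enabled = True

    def train(self, value):
        """Widen the range to include a value seen on a training input."""
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def check(self, value):
        """Return True if MIN <= value <= MAX is violated (a symptom)."""
        if not self.enabled or self.lo is None:
            return False
        return not (self.lo <= value <= self.hi)

    def disable(self):
        """Diagnosis classified a violation as a false positive."""
        self.enabled = False
```

Because the invariants are only *likely*, the `disable` path is essential: it is what lets iSWAT tolerate inputs never seen during training.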
11
iSWAT implementation
[Diagram, training phase: application + test/train/external inputs → Compiler Pass in LLVM adds invariant-monitoring code → per-input ranges (i/p #1 … i/p #n) → merged invariant ranges]
12
iSWAT implementation
[Diagram, training phase: application + test/train/external inputs → Compiler Pass in LLVM adds invariant-monitoring code → per-input ranges (i/p #1 … i/p #n) → merged invariant ranges]
[Diagram, fault-detection phase: invariant ranges + application → Compiler Pass in LLVM adds invariant-checking code → full-system simulation on ref input with injected faults → invariant violation → SWAT diagnosis → fault detection, or false positive (disable invariant)]
13
iSWAT Results
• Evaluated iSWAT with 5 apps on the previous methodology
• Undetected faults reduced by 30%
• Invariants reduce SDCs by 73% (33 to 9)
• Overheads: 5% on x86, 14% on UltraSPARC IIIi
– Reasonably low overheads on some machines
– Un-optimized invariants used; can be reduced further
• Exploring more sophisticated invariants for coverage, overheads
14
Fault Diagnosis
• Symptom-based detection is cheap, but
– High latency from fault activation to detection
– Difficult to diagnose root cause of fault
– How to diagnose SW bug vs. transient vs. permanent fault?
• For a permanent fault within a core
– Disable entire core? Wasteful!
– Disable/reconfigure µarch-level unit?
– How to diagnose faults to µarch-unit granularity?
• Key ideas
– Single-core fault model; fault-free core available in a multicore
– Checkpoint/replay for recovery → replay on good core, compare
– Synthesizing DMR, but only for diagnosis
15
SW Bug vs. Transient vs. Permanent
• Rollback/replay on same/different core
• Watch if symptom reappears
[Flowchart: Symptom detected → rollback/replay on the faulty core.
No symptom → transient fault or non-deterministic s/w bug → continue execution.
Symptom → false positive (iSWAT), deterministic s/w bug, or permanent h/w fault → rollback/replay on a good core.
Symptom reappears → false positive (iSWAT) or deterministic s/w bug → send to s/w layer.
No symptom → permanent h/w fault → needs repair!]
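The decision tree above can be written down directly (function names are hypothetical; the two replay callbacks stand in for re-executing from a checkpoint and reporting whether the symptom reappears):

```python
# Sketch of SWAT's rollback/replay diagnosis: replay on the faulty core
# separates transient from deterministic causes; replay on a known-good
# core then separates software bugs from permanent hardware faults.
def diagnose(replay_on_faulty_core, replay_on_good_core):
    if not replay_on_faulty_core():
        # Symptom did not recur on the same core.
        return "transient or non-deterministic s/w bug"
    if replay_on_good_core():
        # Symptom follows the software to a good core.
        return "false positive or deterministic s/w bug (send to s/w layer)"
    # Symptom recurs only on the original core.
    return "permanent h/w fault (needs repair)"
```

This is the sense in which SWAT "synthesizes DMR": the redundant execution exists only during diagnosis, not during fault-free operation.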
16
Diagnosis Framework
[Diagram: Symptom detected → Diagnosis → software bug | transient fault | permanent fault; a permanent fault triggers microarchitecture-level diagnosis → “Unit X is faulty”]
17
Trace-Based Fault Diagnosis (TBFD)
[Diagram: permanent fault detected → invoke TBFD; the diagnosis algorithm compares (=?) faulty-core execution against fault-free-core execution]
18
Trace-Based Fault Diagnosis (TBFD)
[Diagram, continued: permanent fault detected → invoke TBFD → rollback faulty core to checkpoint → replay execution, collect info; compare (=?) against fault-free-core execution]
19
Trace-Based Fault Diagnosis (TBFD)
[Diagram, continued: permanent fault detected → invoke TBFD → rollback faulty core to checkpoint, replay execution and collect info; load checkpoint on fault-free core, execute instructions fault-free; compare (=?). Open questions: What info to collect? What info to compare? What to do on divergence?]
20
Can a Divergent Instruction Lead to Diagnosis?
• Simpler case: ALU fault
[Figure: faulty vs. fault-free execution of “add r1,r3,r5” and “sub r6,r1,r2”, annotated with destination physical register, decoded ALU used, and results; the faulty core’s results diverge on both instructions]
→ Both divergent instructions used the same ALU → ALU1 faulty
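The "simpler case" above can be sketched as trace intersection (the trace format and function name are illustrative assumptions, not TBFD's actual data structures):

```python
# Hypothetical sketch of attributing divergences to a unit: each retired
# instruction records its result and which functional unit executed it.
# Units common to all divergent instructions are diagnosis candidates.
def diagnose_unit(faulty_trace, good_trace):
    """Traces are lists of (result, unit); return candidate faulty units."""
    candidates = None
    for (res_f, unit), (res_g, _) in zip(faulty_trace, good_trace):
        if res_f != res_g:  # divergent instruction
            candidates = {unit} if candidates is None else candidates & {unit}
    return candidates or set()
```

As the RAT example on the next slide shows, this simple intersection is not enough when the divergent instruction never touches the faulty hardware; that is why real TBFD also looks backward and forward in the trace.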
21
Can a Divergent Instruction Lead to Diagnosis?
• Complex example: fault in register alias table (RAT) entry
• Divergent instructions do not directly lead to faulty unit
• Instead, look backward/forward in instruction stream
– Need to collect and analyze instruction trace
[Figure: a fault corrupts the RAT mapping for r2, so IA: r3 ← r2 + r2 reads the wrong physical register and writes a wrong value to the register file; a later IB: r1 ← r5 * r2 then diverges (fault-free r1 = 12) even though IB does not use the faulty hardware]
22
Diagnosing Permanent Fault to µarch Granularity
• Trace-based fault diagnosis (TBFD)
– Compare instruction trace of faulty vs. good execution
– Divergence → faulty hardware used → diagnosis clues
• Diagnose faults to µarch units of the processor
– Check µarch-level invariants in several parts of the processor
– Front-end, meta-datapath, and datapath faults
– Diagnosis in out-of-order logic (meta-datapath) is complex
• Results
– 98% of faults detected by SWAT successfully diagnosed
– TBFD flexible for other detectors/granularities of repair
23
SWAT
1. Detectors w/ Hardware support [ASPLOS ‘08]
2. Detectors w/ Software support [Sahoo et al., DSN ‘08]
3. Trace-Based Fault Diagnosis [Li et al., DSN ‘08]
4. Accurate Fault Modeling
[Diagram: Fault → Error → Symptom detected → Diagnosis → Repair, with checkpoint-based Recovery]
24
SWAT-Sim: Fast and Accurate Fault Models
• Need accurate µarch-level fault models
– Gate-level injections accurate but too slow
– µarch (latch) level injections fast but inaccurate
• Can we achieve µarch-level speed at gate-level accuracy?
• Mixed-mode (hierarchical) simulation
– µarch-level + gate-level simulation
– Simulate only the faulty component at gate level, on demand
– Invoke gate-level simulation online for permanent faults
→ Simulates fault effects with real-world vectors
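The mixed-mode dispatch can be sketched as follows (all names are hypothetical; SWAT-Sim actually bridges a µarch simulator to NCVerilog via VPI):

```python
# Hypothetical sketch of mixed-mode simulation: the fast uarch-level model
# executes every operation, but when an operation uses the faulty unit its
# inputs become stimuli for a slow gate-level fault simulator, and the
# (possibly corrupted) response is used as the result.
def mixed_mode_execute(op, a, b, uses_faulty_unit, gate_level_sim):
    uarch_ops = {"add": lambda x, y: x + y, "sub": lambda x, y: x - y}
    if uses_faulty_unit:
        # On-demand gate-level simulation of only the faulty component.
        return gate_level_sim(op, a, b)
    return uarch_ops[op](a, b)  # fast uarch-level result
```

Because only the faulty unit's activations pay the gate-level cost, the common case runs at µarch speed, which is where the reported ~2x (rather than ~100,000x) slowdown comes from.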
25
SWAT-Sim: Gate-level Accuracy at µarch Speeds
[Diagram: during µarch simulation of r3 ← r1 op r2, check whether the faulty unit is used. No → continue µarch-level simulation. Yes → pass the inputs as stimuli to gate-level fault simulation; the response, with the fault propagated to the output, becomes r3]
26
Results from SWAT-Sim
• SWAT-Sim implemented within full-system simulation
– NCVerilog + VPI for gate-level simulation of ALU/AGEN modules
• SWAT-Sim: high accuracy at low overheads
– 100,000x faster than gate-level simulation, same modeling fidelity
– 2x slowdown over µarch-level simulation, at higher accuracy
• Accuracy of µarch models evaluated using SWAT coverage/latency
– µarch stuck-at models generally inaccurate
– Differences in activation rate, multi-bit flips
• Complex manifestations → hard to derive better models
– Need SWAT-Sim, at least for now
27
SWAT Summary
• SWAT: SoftWare Anomaly Treatment
– Handle all and only faults that matter
– Low, amortized overheads
– Holistic systems view enables novel solutions
– Customizable and flexible
• Prior results:
– Low-cost h/w detectors gave high coverage, low SDC rate
• This talk:
– iSWAT: higher coverage w/ software-assisted detectors
– TBFD: µarch-level fault diagnosis by synthesizing DMR
– SWAT-Sim: gate-level fault accuracy at µarch-level speed
28
Future Work
• Recovery: hybrid, application-specific
• Aggressive use of software reliability techniques
– Leverage diagnosis mechanism
• Multithreaded software
• Off-core faults
• Post-silicon debug and test
– Use faulty trace as fault-model-oblivious test vector
• Validation on FPGA (w/ Michigan)
• Hardware assertions to complement software symptoms
BACKUP SLIDES
30
Breakup of Detections by SW Symptoms
[Bar chart: per-structure breakdown of injection outcomes (Arch-Mask, App-Mask, FatalTrap-App/OS, Hang-App/OS, High-OS, Symp>10M, SDC) for Decoder, Int ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, FP ALU, and the average w/o FP; detection is 95–100% for all structures except FP ALU at 27%]
• 98% of unmasked faults detected within 10M instr (w/o FPU)
– Need HW support or SW monitoring for FPU
31
SW Components Corrupted
• 66% of faults corrupt system state before detection
– Need to recover system state
[Bar chart: per-structure percentage of injections corrupting “OS and maybe app”, “App only”, or “None”, for Decoder, INT ALU, Reg Dbus, Int reg, ROB, RAT, AGEN, FP ALU]
32
Latency from Application Mismatch
[Bar chart: per-structure distribution of detection latency, in instructions, across log-scale buckets from 1 to 10M]
• 86% of faults detected under 100K instructions
– 42% detected under 10K
33
Latency from OS Mismatch
[Bar chart: per-structure distribution of detection latency, in instructions, across log-scale buckets from 1 to 10M]
• 99% of faults detected under 100K instructions
34
iSWAT Implementation
[Diagram: training phase and fault-detection phase, as on slide 12]
35
Trace-Based Fault Diagnosis (TBFD)
[Diagram: permanent fault detected → invoke diagnosis → rollback faulty core to checkpoint, replay execution and collect µarch info; load checkpoint on fault-free core and execute instructions fault-free; TBFD compares faulty trace vs. test trace (=?) to diagnose faults in the front-end, meta-datapath, and datapath]
36
Fault Diagnosability
[Bar chart: per-structure percentage of detected faults diagnosed to a unique unit (D-Unique), diagnosed to another unit (D-Other), with no mismatch, or diagnosed incorrectly, for Decoder, INT ALU, Reg Dbus, Int Reg, ROB, RAT, AGEN, and overall]
• 98% of detected faults are diagnosed
– 89% diagnosed to a unique unit/array entry
– Meta-datapath faults in out-of-order execution can mislead TBFD
37
Accuracy of Existing Fault Models
• SWAT-Sim implemented within full-system simulator
– NCVerilog + VPI to simulate gate-level ALU and AGEN
[Bar charts, Integer ALU and AGEN: percentage of injections per outcome (Uarch-Mask, Arch-Mask, App-Mask, Detected, Detected>10M, SDC) under µarch stuck-at-0/1 vs. gate-level stuck-at-0/1 and gate-level delay fault models; detection ranges from 89.4% to 100%]
• Existing µarch-level fault models inaccurate
– Differences in activation rate, multi-bit flips
• Accurate models hard to derive → need SWAT-Sim!
38
Summary: SWAT Advantages
• Handles all faults that matter
– Oblivious to low-level failure modes & masked faults
• Low, amortized overheads
– Optimize for common case, exploit s/w reliability solutions
• Holistic systems view enables novel solutions
– Invariant detectors use diagnosis mechanisms
– Diagnosis uses recovery mechanisms
• Customizable and flexible
– Firmware-based control affords hybrid, app-specific recovery (TBD)
• Beyond hardware reliability
– SWAT treats hardware faults as software bugs
– Long-term goal: unified system (h/w + s/w) reliability at lowest cost
– Potential applications to post-silicon test and debug
39
Transients Results
• 6,400 transient faults injected across 8 structures
• 83% of unmasked faults detected within 10M instr
• Only 0.4% of injected faults result in SDCs