SWAT: Designing Resilient Hardware by
Treating Software Anomalies
Lei Chen, Byn Choi, Xin Fu, Siva Hari, Man-lap (Alex) Li, Pradeep Ramachandran, Swarup Sahoo, Rob Smolinski, Sarita Adve,
Vikram Adve, Shobha Vasudevan, Yuanyuan Zhou
Department of Computer Science
University of Illinois at Urbana-Champaign
Motivation
• Hardware will fail in the field for several reasons
⇒ Need in-field detection, diagnosis, recovery, repair
• Reliability problem pervasive across many markets
– Traditional redundancy solutions (e.g., nMR) too expensive
⇒ Need low-cost solutions for multiple failure sources
* Must incur low area, performance, power overhead
• Failure sources: transient errors (high-energy particles), wear-out (devices grow weaker), design bugs, and so on
Observations
• Need to handle only hardware faults that propagate to software
• Fault-free case remains common, must be optimized
⇒ Watch for software anomalies (symptoms)
– Zero-to-low-overhead "always-on" monitors
⇒ Diagnose cause after symptom detected
– May incur high overhead, but rarely invoked
SWAT: SoftWare Anomaly Treatment
SWAT Framework Components
• Detection: Symptoms of software misbehavior
• Recovery: Checkpoint and rollback
• Diagnosis: Rollback/replay on multicore
• Repair/reconfiguration: Redundant, reconfigurable hardware
• Flexible control through firmware
[Figure: timeline – fault → error → symptom detected, bracketed by two checkpoints; the symptom triggers recovery, diagnosis, and repair]
Advantages of SWAT
• Handles all faults that matter
– Oblivious to low-level failure modes and masked faults
• Low, amortized overheads
– Optimize for common case, exploit SW reliability solutions
• Customizable and flexible
– Firmware control adapts to specific reliability needs
• Holistic systems view enables novel solutions
– Synergistic detection, diagnosis, recovery solutions
• Beyond hardware reliability
– Long-term goal: unified system (HW+SW) reliability
– Potential application to post-silicon test and debug
SWAT Contributions
[Figure: SWAT framework annotated with contributions – very low-cost detectors with low SDC rate and latency [ASPLOS'08, DSN'08]; in-situ diagnosis [DSN'08]; accurate fault modeling [HPCA'09]; multithreaded workloads [MICRO'09]; application-aware SWAT for even lower SDC and latency]
Outline
• Motivation
• Detection
• Recovery analysis
• Diagnosis
• Conclusions and future work
Simple Fault Detectors [ASPLOS ’08]
• Simple detectors that observe anomalous SW behavior
• Very low hardware area, performance overhead
[Detectors, all reporting to SWAT firmware:]
– Fatal traps: division by zero, RED state, etc.
– Kernel panic: OS enters panic state due to fault
– High OS: high contiguous OS activity
– Hangs: simple HW hang detector
– App abort: application abort due to fault
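The hang detector above can be approximated in software. This is a minimal illustrative sketch, not the hardware design from the talk: the window size, the repetition threshold, and the PC values are all invented for the example.

```python
from collections import deque

class HangDetector:
    """Flags a suspected hang when retired-instruction PCs keep
    revisiting a small recent set without forward progress
    (illustrative heuristic, parameters are assumptions)."""

    def __init__(self, window=64, threshold=1000):
        self.recent = deque(maxlen=window)  # last `window` PCs seen
        self.threshold = threshold          # repeats before flagging
        self.repeats = 0

    def observe(self, pc):
        """Feed one retired PC; returns True on suspected hang."""
        if pc in self.recent:
            self.repeats += 1
        else:
            self.repeats = 0                # new PC = forward progress
        self.recent.append(pc)
        return self.repeats >= self.threshold

# A tight two-instruction infinite loop trips the detector:
d = HangDetector(window=8, threshold=100)
hang = False
for i in range(1000):
    if d.observe(0x400 + (i % 2) * 4):
        hang = True
        break
```

Straight-line code with ever-new PCs never trips it, which is the property an always-on monitor needs to stay near zero cost in the fault-free case.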
Evaluating Fault Detectors
• Simulate OpenSolaris on out-of-order processor
– GEMS timing models + Simics full-system simulator
• I/O- and compute-intensive applications
– Client-server – apache, mysql, squid, sshd
– All SPEC 2K C/C++ – 12 Integer, 4 FP
• µarchitecture-level fault injections (single fault model)
– Stuck-at, transient faults in 8 µarch units
– ~18,000 total faults ⇒ statistically significant
[Figure: methodology – inject fault, run 10M instructions of timing simulation; if no symptom within 10M instructions, run to completion in functional simulation ⇒ masked or potential Silent Data Corruption (SDC)]
Metrics for Fault Detection
• Potential SDC rate
– Undetected fault that changes app output
– Output change may or may not be important
• Detection Latency
– Latency from architecture state corruption to detection
* Architecture state = registers + memory
* Will improve later
– High detection latency impedes recovery
SDC Rate of Simple Detectors: SPEC, permanents
• 0.6% potential SDC rate for permanents in SPEC, without FPU
• Faults in FPU need different detectors
– Mostly corrupt only data
[Figure: outcome breakdown of permanent-fault injections per µarch unit (decoder, INT ALU, reg Dbus, int reg, ROB, RAT, AGEN, FP ALU, total without FP) – stacked bars of masked / detected / SDC fractions of total injections]
Potential SDC Rate
• SWAT detectors highly effective for hardware faults
– Low potential SDC rates across workloads
[Figure: injected-fault outcomes (masked / detected / potential SDC) for server and SPEC workloads, permanent and transient faults; potential SDC rates: 0.4% (server) and 0.7% (SPEC) for permanents, 0.6% and 0.6% for transients]
Detection Latency
• 90% of the faults detected in under 10M instructions
• Existing work claims these are recoverable w/ HW chkpting
– More recovery analysis follows later
[Figure: cumulative distribution of detection latency (instructions), from <10K to >10M, for permanent and transient faults in server and SPEC workloads; y-axis 40%–100% of detected faults]
Exploiting Application Support for Detection
• Techniques inspired by software bug detection
– Likely program invariants: iSWAT
* Instrumented binary, no hardware changes
* <5% performance overhead on x86 processors
– Detecting out-of-bounds addresses
* Low hardware overhead, near-zero performance impact
• Exploiting application-level resiliency
Low-Cost Out-of-Bounds Detector
• Sophisticated detector for security, software bugs
– Track object accessed, validate pointer accesses
– Require full-program analysis, changes to binary
• Bad addresses from HW faults more obvious
– Invalid pages, unallocated memory, etc.
• Low-cost out-of-bounds detector
– Monitor boundaries of heap, stack, globals
– Address beyond these bounds ⇒ HW fault
* SW communicates boundaries to HW
* HW enforces checks on ld/st address
[Figure: application address space – app code, globals, heap, stack, and libraries, with empty and reserved regions between them]
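The division of labor above (SW registers legal regions, HW checks every load/store address) can be modeled in a few lines. A minimal sketch: the region base/limit addresses are made up for illustration, and the hardware check is shown as a plain function.

```python
# Minimal model of the low-cost out-of-bounds detector: software
# registers the legal heap/stack/globals boundaries; hardware-side
# logic then validates each ld/st address against them.

REGIONS = []                    # (base, limit) pairs registered by SW

def register_region(base, limit):
    """SW side: communicate one legal address range to 'HW'."""
    REGIONS.append((base, limit))

def access_ok(addr):
    """HW side: an address outside every legal region is a symptom."""
    return any(base <= addr < limit for base, limit in REGIONS)

# Hypothetical layout: globals, heap, stack.
register_region(0x0040_0000, 0x0050_0000)     # globals
register_region(0x1000_0000, 0x2000_0000)     # heap
register_region(0x7FF0_0000, 0x8000_0000)     # stack

legal = access_ok(0x1234_5678)        # lands inside the heap
symptom = not access_ok(0x6000_0000)  # empty region => likely HW fault
```

The check is deliberately coarse: unlike full software bounds checkers it needs no per-object metadata or binary changes, which is what keeps the hardware cost low.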
Impact of Out-of-Bounds Detector
• Lower potential SDC rate in server workloads
– 39% lower for permanents, 52% for transients
• For SPEC workloads, impact is on detection latency
[Figure: injected-fault outcomes (masked / detect-OoB / detect-other / potential SDC) with SWAT alone vs. SWAT+OoB; server potential SDC rates drop from 0.38% to 0.23% (permanents) and 0.58% to 0.28% (transients); SPEC rates stay near 0.67%/0.63% (permanents) and 0.65%/0.65% (transients)]
Application-Aware SDC Analysis
• Potential SDC = undetected fault that corrupts app output
• But many applications can tolerate faults
– Client may detect fault and retry request
– Application may perform fault-tolerant computations
* E.g., same-cost place & route, acceptable PSNR, etc.
⇒ Not all potential SDCs are true SDCs
– For each application, define notion of fault tolerance
• SWAT detectors cannot (should not?) detect such acceptable changes
Application-Aware SDCs for Server
• 46% of potential SDCs are tolerated by simple retry
• Only 21 remaining SDCs out of 17,880 injected faults
– Most detectable through application-level validity checks
[Figure: number of SDCs under progressively stronger criteria – permanents: SWAT 34 (0.38%), SWAT+OoB 21 (0.23%), with app tolerance 12 (0.13%); transients: SWAT 52 (0.58%), SWAT+OoB 25 (0.28%), with app tolerance 9 (0.10%)]
Application-Aware SDCs for SPEC
• Only 62 faults show >0% degradation from golden output
• Only 41 injected faults are SDCs at >1% degradation
– 38 from apps we conservatively classify as fault-intolerant
* Chess-playing apps, compilers, parsers, etc.
[Figure: SDC counts by output-degradation threshold – permanents: SWAT+OoB 56 (0.6%), >0% 16 (0.2%), >0.01% 11 (0.1%), >1% 8 (0.1%); transients: SWAT+OoB 58 (0.6%), >0% 46 (0.5%), >0.01% 37 (0.4%), >1% 33 (0.4%)]
Reducing Potential SDCs further (future work)
• Explore application-specific detectors
– Compiler-assisted invariants like iSWAT
– Application-level checks
• Need to fundamentally understand why, where SWAT works
– SWAT evaluation largely empirical
– Build models to predict effectiveness of SWAT
* Develop new low-cost symptom detectors
* Extract minimal set of detectors for given sets of faults
* Reliability vs overhead trade-offs analysis
Reducing Detection Latency: New Definition
• SWAT relies on checkpoint/rollback for recovery
• Detection latency dictates fault recovery
– Checkpoint fault-free ⇒ fault recoverable
• Traditional defn. = arch state corruption to detection
• But software may mask some corruptions!
• New defn. = unmasked arch state corruption to detection
[Figure: timeline – the fault first corrupts arch state (old latency starts), later corrupts SW state (new latency starts), then is detected; under the new definition, more checkpoints remain recoverable]
Measuring Detection Latency
• New detection latency = SW state corruption to detection
• But identifying SW state corruption is hard!
– Need to know how faulty value used by application
– If faulty value affects output, then SW state corrupted
• Measure latency by rolling back to older checkpoints
– Only for analysis, not required in real system
[Figure: rollback & replay from successive checkpoints – replaying from a checkpoint taken after SW state corruption reproduces the symptom; replaying from an earlier checkpoint masks the fault effect, bounding the new latency]
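The rollback-based measurement can be sketched as a simple search over checkpoints. This is an analysis-only toy: `replay_is_clean` stands in for a full fault-free re-simulation from a checkpoint, and the instruction counts are invented.

```python
def find_recoverable_checkpoint(checkpoints, replay_is_clean):
    """Walk checkpoints newest-first; the newest one whose fault-free
    replay shows no symptom is recoverable, and the SW-state corruption
    lies between it and the next checkpoint. `replay_is_clean(c)`
    abstracts the simulator replay (an assumption of this sketch)."""
    for i in range(len(checkpoints) - 1, -1, -1):
        if replay_is_clean(checkpoints[i]):
            return checkpoints[i]
    return None   # corruption predates every retained checkpoint

# Toy timeline: checkpoints every 10K instructions; SW state went bad
# at instruction 47,000, so only replays from <= 40,000 come out clean.
chkpts = [10_000, 20_000, 30_000, 40_000, 50_000]
recoverable = find_recoverable_checkpoint(chkpts, lambda c: c <= 40_000)
```

The gap between this checkpoint and the detection point is the "new" latency; as the slides note, this search is for offline analysis and is not required in a real system.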
Detection Latency - Server
• Measuring new latency important to study recovery
• New techniques significantly reduce detection latency
- >90% of faults detected in <100K instructions
• Reduced detection latency impacts recoverability
[Figure: cumulative detection latency (instructions), permanent and transient faults in server workloads – old-latency SWAT vs. new-latency SWAT vs. new latency with out-of-bounds detector; y-axis 40%–100%, x-axis <10K to >10M]
Implications for Fault Recovery
• Checkpointing
– Record pristine arch state for recovery
– Periodic registers snapshot, log memory writes
• I/O buffering
– Buffer external events until known to be fault-free
– HW buffer records device reads, buffers device writes
⇒ “Always-on” checkpointing and I/O buffering must incur minimal overhead
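The recovery state described above (periodic register snapshots plus a log of memory writes) can be modeled in miniature. A sketch only: the field names and the undo-log representation are invented, not SafetyNet's or ReVive's actual structures.

```python
class CheckpointedMachine:
    """Toy model of checkpoint/rollback recovery state: a register
    snapshot per checkpoint plus an undo log of memory writes."""

    def __init__(self):
        self.regs = {}
        self.mem = {}
        self.snapshot = None
        self.undo_log = []        # (addr, pre-image) per logged write

    def checkpoint(self):
        self.snapshot = dict(self.regs)   # snapshot registers
        self.undo_log = []                # start a fresh memory log

    def store(self, addr, value):
        # Log the old value so the write can be undone on rollback.
        self.undo_log.append((addr, self.mem.get(addr)))
        self.mem[addr] = value

    def rollback(self):
        # Undo memory writes newest-first, then restore registers.
        for addr, old in reversed(self.undo_log):
            if old is None:
                self.mem.pop(addr, None)
            else:
                self.mem[addr] = old
        self.regs = dict(self.snapshot)
        self.undo_log = []

m = CheckpointedMachine()
m.regs["r1"] = 5
m.store(0x100, 42)
m.checkpoint()
m.regs["r1"] = 999      # faulty execution corrupts state...
m.store(0x100, -1)
m.rollback()            # ...and is rolled back to the checkpoint
```

The undo log grows with the checkpoint interval, which is exactly the overhead the next slide quantifies: shorter intervals mean smaller logs.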
Overheads from Memory Logging
• New techniques reduce chkpt overheads by over 60%
– Chkpt interval reduced to 100K from millions of instrs.
[Figure: memory log size (KB) vs. checkpoint interval (10K–10M instructions) for apache, sshd, squid, mysql; log size grows with interval, up to ~2500KB at 10M]
Overheads from Output Buffering
• New techniques reduce output buffer size to near-zero
– <5KB buffer for 100K chkpt interval (buffer for 2 chkpts)
– Near-zero overheads at 10K interval
[Figure: output buffer size (KB) vs. checkpoint interval (10K–10M instructions) for apache, sshd, squid, mysql; y-axis up to 20KB]
Low Cost Fault Recovery (future work)
• New techniques significantly reduce recovery overheads
– 60% smaller memory logs, near-zero output buffer
• But still does not enable ultra-low-cost fault recovery
– ~400KB hardware overhead for memory logs (SafetyNet)
– High performance impact for in-memory logs (ReVive)
• Need ultra low-cost recovery scheme at short intervals
– Even shorter latencies
– Checkpoint only state that matters
– Application-aware insights – transactional apps, recovery domains for OS, …
Fault Diagnosis
• Symptom-based detection is cheap but
– May incur long latency from activation to detection
– Difficult to diagnose root cause of fault
• Goal: Diagnose the fault with minimal hardware overhead
– Rarely invoked ⇒ higher perf overhead acceptable
[Figure: a detected symptom may stem from a SW bug, a transient fault, or a permanent fault – which one?]
SWAT Single-threaded Fault Diagnosis [Li et al., DSN ‘08]
• First, diagnosis for single threaded workload on one core
– Multithreaded w/ multicore later – several new challenges
Key ideas
• Single-core fault model, fault-free core available in multicore
• Chkpt/replay for recovery ⇒ replay on good core, compare
• Synthesizes DMR, but only for diagnosis
[Figure: traditional DMR runs P1 and P2 with always-on comparison, which is expensive; synthesized DMR pairs the faulty core with a good core only after a symptom]
SW Bug vs. Transient vs. Permanent
• Rollback/replay on same/different core
• Watch if symptom reappears
[Decision flow: rollback/replay on the faulty core – no symptom ⇒ transient or non-deterministic s/w bug, continue execution; symptom ⇒ deterministic s/w bug or permanent h/w fault, so rollback/replay on a good core – symptom ⇒ deterministic s/w bug (send to s/w layer); no symptom ⇒ permanent h/w fault, needs repair]
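The decision flow above is small enough to state as code. A sketch: the two predicates abstract the deterministic replays (whether the symptom reappears on the faulty core, then on a known-good core); the real mechanism is the checkpoint/replay machinery, not booleans.

```python
def diagnose(symptom_on_faulty_replay, symptom_on_good_replay):
    """SWAT single-threaded diagnosis flow. Each argument is a
    callable returning True if the symptom reappears on that
    deterministic replay (replays abstracted for this sketch)."""
    if not symptom_on_faulty_replay():
        # Symptom gone on the same core: nothing deterministic to blame.
        return "transient or non-deterministic s/w bug"
    if symptom_on_good_replay():
        # Reproduces on known-good hardware: the software is at fault.
        return "deterministic s/w bug (send to s/w layer)"
    # Deterministic on the faulty core only: hardware needs repair.
    return "permanent h/w fault"

# Symptom recurs on the faulty core but not on the good one:
verdict = diagnose(lambda: True, lambda: False)
```

Note the good-core replay runs only when the faulty-core replay reproduces the symptom, matching the slide's claim that the expensive steps are rarely invoked.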
µarch-level Fault Diagnosis
[Figure: after a symptom is detected, diagnosis separates software bug, transient fault, and permanent fault; a permanent fault goes on to microarchitecture-level diagnosis, which identifies the faulty unit X]
Trace Based Fault Diagnosis (TBFD)
• µarch-level fault diagnosis using rollback/replay
• Key: execution that caused the symptom ⇒ trace activates the fault
– Deterministically replay trace on faulty and fault-free cores
– Divergence ⇒ faulty hardware used ⇒ diagnosis clues
• Diagnose faults to µarch units of processor
– Check µarch-level invariants in several parts of processor
– Diagnosis in out-of-order logic (meta-datapath) complex
Trace-Based Fault Diagnosis: Evaluation
• Goal: Diagnose faults at reasonable latency
• Faults diagnosed in 10 SPEC workloads
– ~8500 detected faults (98% of unmasked)
• Results
– 98% of the detections successfully diagnosed
– 91% diagnosed within 1M instr (~0.5ms on 2GHz proc)
SWAT Multithreaded Fault Diagnosis [Hari et al., MICRO ‘09]
• Challenge 1: Deterministic replay involves high overhead
• Challenge 2: Multithreaded apps share data among threads
– Symptom-causing core may not be faulty
– No known fault-free core in system
[Figure: a fault in core 1 corrupts a value stored to memory; core 2 loads it, so the symptom is detected on a fault-free core]
mSWAT Diagnosis - Key Ideas
• Challenge: full-system deterministic replay of multithreaded applications is too expensive ⇒ key idea: isolated deterministic replay of each thread
• Challenge: no known good core ⇒ key idea: emulated TMR – threads rotated across cores so each executes on multiple cores, outputs compared
[Figure: threads TA–TD on cores A–D; each thread's inputs are captured so it can be replayed in isolation on other cores for comparison]
mSWAT Diagnosis: Evaluation
• Diagnose detected perm faults in multithreaded apps
– Goal: Identify faulty core, TBFD for µarch-level diagnosis
– Challenges: Non-determinism, no fault-free core known
– ~4% of faults detected from fault-free core
• Results
– 95% of detected faults diagnosed
* All detections from fault-free core diagnosed
– 96% of diagnosed faults require <200KB buffers
* Can be stored in lower-level cache ⇒ low HW overhead
• SWAT diagnosis can work with other symptom detectors
Summary: SWAT works!
[Figure: SWAT framework recap – very low-cost detectors with low SDC rate and latency [ASPLOS'08, DSN'08]; in-situ diagnosis [DSN'08]; accurate fault modeling [HPCA'09]; multithreaded workloads [MICRO'09]; application-aware SWAT for even lower SDC and latency]
Future Work
• Formalization of when/why SWAT works
• Near zero cost recovery
• More server/distributed applications
• App-level, customizable resilience
• Other core and off-core parts in multicore
• Other fault models
• Prototyping SWAT on FPGA w/ Michigan
• Interaction with safe programming
• Unifying with s/w resilience