DESIGN AND EVALUATION OF HYBRID FAULT-DETECTION SYSTEMS
Qing Xu, Kevin Wang
BACKGROUND
Smaller and Faster Transistors
- Lower threshold voltage
- Tighter noise margins
- Less reliable
Results
- Incorrect program execution
- Recovery
Alpha Particle Transient Faults
Software Only / Hardware Only
REDUNDANCY
int main() { cout << "Hello\n"; }
int main() { cout << "Hello\n"; }
MOTIVATION AND GOAL
Software Only
- Inadequate coverage
- Slow
Hardware Only
- Large overhead/area
- High cost
Hybrid Solution
- Better reliability and performance
- Lower hardware area and cost
KEY IDEA: COMPILER ASSISTED FAULT TOLERANCE (CRAFT)
Characteristics:
- Based on a software technique
- Minimal hardware adaptations
- Takes advantage of both software and hardware solutions
Benefits:
- Nearly perfect reliability
- Low performance degradation
- Low hardware cost
Software
Hardware
CRAFT: HYBRID OF EXISTING METHODS
Hardware Methods
- Redundant Multithreading Technique (RMT)
- Error Correcting Codes (ECC)
Advantages: almost-perfect fault coverage, low performance cost
Software Methods
- Software Implemented Fault Tolerance (SWIFT)
- Error Detection by Duplicating Instructions (EDDI)
Advantages: high fault coverage, modest performance cost, zero hardware cost
EXISTING METHOD: HARDWARE (RMT)
- RMT uses simultaneous multithreading (SMT) resources to run loosely synchronized redundant threads
- Components not covered by redundant execution must employ alternative techniques, such as Error Correction Code (ECC)
Original Thread
Checker Thread
Redundant Multi-threading (RMT)
EXISTING METHOD: SOFTWARE (SWIFT)
- A compiler-based transformation
- The store instruction is the synchronization point
- Assumes that Error Correction Code (ECC) guards the correctness of the memory subsystem
ld r3 = [r4]
add r1 = r2, r3
st m[r1] = r2
(Original Code)
ld r3 = [r4]
mov r3' = r3
add r1 = r2, r3
add r1' = r2', r3'
br Fault, r1 != r1'
br Fault, r2 != r2'
br Fault, r3 != r3'
st m[r1] = r2
(SWIFT Code)
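The duplicate-and-compare pattern in the SWIFT code can be mimicked in a few lines of Python. This is a minimal sketch under stated assumptions, not the SWIFT compiler pass: registers are modeled as a dictionary, a fault is a single bit flip applied just before the checks, and the names `run_swift` and `FaultDetected` are hypothetical.

```python
class FaultDetected(Exception):
    """Raised when a register and its shadow copy disagree."""


def run_swift(memory, r2, r4, flip=None):
    """Run the three-instruction example with SWIFT-style duplication.

    flip is an optional (register_name, bit) pair that models a single
    bit flip occurring just before the validation branches.
    """
    # Original and shadow registers (shadow names end in "'").
    regs = {"r2": r2, "r4": r4, "r2'": r2, "r4'": r4}

    regs["r3"] = memory[regs["r4"]]            # ld  r3  = [r4]
    regs["r3'"] = regs["r3"]                   # mov r3' = r3
    regs["r1"] = regs["r2"] + regs["r3"]       # add r1  = r2, r3
    regs["r1'"] = regs["r2'"] + regs["r3'"]    # add r1' = r2', r3'

    if flip is not None:                       # hypothetical fault injection
        name, bit = flip
        regs[name] ^= 1 << bit

    for name in ("r1", "r2", "r3"):            # br Fault, rX != rX'
        if regs[name] != regs[name + "'"]:
            raise FaultDetected(name)

    memory[regs["r1"]] = regs["r2"]            # st m[r1] = r2
    return memory
```

Calling `run_swift({4: 10}, r2=5, r4=4)` performs the store, while adding `flip=("r1", 0)` raises `FaultDetected`. A flip that lands after the checks but before the store would go unnoticed in this model.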
CRAFT: SUITE OF THREE DETECTION SYSTEMS
List of the Suite:
1. Checking Store Buffer (CSB)
2. Load Value Queue (LVQ)
3. CSB + LVQ
Preliminaries:
- Assume Single Event Upset (SEU) fault model
- Architecturally Correct Execution (ACE)
- Detected Unrecoverable Error (DUE)
- Silent Data Corruption (SDC)
SUITE 1: CHECKING STORE BUFFER (CSB)
Problem to Improve:
• SWIFT: Vulnerable to faults in the time interval between the validation and use of a register value
Solution:
• Add a Store Buffer to perform checks
Timeline: Validated values, Vulnerable to Faults, Use of validated values
CSB #       0    1    2     3
Address     --   --   0xFF  0xEE
Value       --   --   0x8   0x1
Validated   --   --   N     N
Incoming stores in the diagram: (address 0xFF, value 0x8) and (address 0xEE, value 0x2)
Compiler duplicates stores:
st [r1] = r2 becomes
st1 [r1] = r2
st2 [r1'] = r2'
Values do not match: not OK to go to MEM
CSB: IMPLEMENTATION
Basic Idea: Commit a store when the two copies of the store data match
Method: Create a CSB to keep track of all original and duplicated store instructions
Limitation: The table can fill up, causing a structural hazard
Insn duplicate #1
Insn duplicate #2
Y N
Store Value Checks Out! Send to MEM.
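The commit rule above can be sketched as a small software model. This is only an illustration, not the hardware design: it assumes that original and duplicate stores arrive in the same order, so each duplicate validates the oldest pending entry, and the names `CheckingStoreBuffer`, `store`, `store_dup`, and `StoreMismatch` are hypothetical.

```python
from collections import deque


class StoreMismatch(Exception):
    """Raised when a store and its duplicate disagree."""


class CheckingStoreBuffer:
    def __init__(self, memory, size=4):
        self.memory = memory
        self.size = size
        # Unvalidated (address, value) pairs, oldest first.
        self.pending = deque()

    def store(self, address, value):
        """Original store (st1): buffer it; return False if the buffer is full."""
        if len(self.pending) == self.size:
            return False  # structural hazard: the pipeline would have to stall
        self.pending.append((address, value))
        return True

    def store_dup(self, address, value):
        """Duplicate store (st2): validate the oldest pending entry, then commit."""
        if not self.pending:
            raise StoreMismatch("duplicate store without an original")
        original = self.pending.popleft()
        if original != (address, value):
            raise StoreMismatch(f"{original} != {(address, value)}")
        self.memory[address] = value  # both copies agree: send to MEM
```

With the values from the diagram, a buffered store of 0x1 to 0xEE followed by a duplicate carrying 0x2 raises `StoreMismatch`, and memory is left untouched.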
CSB: ADVANTAGES / DISADVANTAGES
Advantages
- Checking is implemented at the hardware level
- Validation code is no longer needed, which reduces code size
- Store instructions are no longer synchronization points (as they are in SWIFT)
- More dynamic scheduling can be exploited
Disadvantages
- Additional compiler requirement: the distance between duplicated instructions should not exceed the size of the CSB
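One way a compiler could check the size requirement is to scan a schedule and count how many original stores are still waiting for their duplicates. The sketch below assumes that "distance" means the peak number of outstanding original stores and that a schedule is a list of opcode strings in which `"st1"` and `"st2"` mark the two copies. Both the interpretation and the function names are assumptions made for illustration.

```python
def max_outstanding_stores(schedule):
    """Return the peak number of original stores awaiting their duplicate."""
    outstanding = 0
    peak = 0
    for op in schedule:
        if op == "st1":        # original store enters the buffer
            outstanding += 1
            peak = max(peak, outstanding)
        elif op == "st2":      # duplicate store validates and frees an entry
            outstanding -= 1
    return peak


def fits_csb(schedule, csb_size):
    """True if the schedule never needs more entries than the CSB holds."""
    return max_outstanding_stores(schedule) <= csb_size
```

For example, the schedule `["st1", "add", "st1", "st2", "st2"]` needs two entries, so it fits a CSB of size 2 but not one of size 1.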
SUITE 2: LOAD VALUE QUEUE (LVQ)
Problem to Improve:
• SWIFT: Window of vulnerability between the load instruction and the value duplication
Solution:
• Add a load value queue
Timeline: Loading values, Vulnerable to Faults, Copying values
LVQ: IMPLEMENTATION PROCEDURE
Basic Idea: Duplicate loads to enable redundant computation
Method: The LVQ provides redundant load instruction execution
LVQ #     0    1    2    3
Address   --   --   --   --
Value     --   --   --   --
Compiler duplicates loads:
ld [r1] = r2 becomes
ld1 [r1] = r2
ld2 [r1'] = r2'
Diagram: the ld insn and the ld insn duplicate both use address 0xAA and both receive value 0x2
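The queue behaviour can be illustrated with a short software model. It is a sketch under assumptions, not the hardware design: original and duplicate loads are assumed to arrive in the same order, the address comparison in `load_dup` is an added assumption, and the names `LoadValueQueue`, `load`, `load_dup`, and `LoadMismatch` are hypothetical.

```python
from collections import deque


class LoadMismatch(Exception):
    """Raised when a duplicate load does not pair with its original."""


class LoadValueQueue:
    def __init__(self, memory):
        self.memory = memory
        # (address, value) pairs recorded by original loads, oldest first.
        self.queue = deque()

    def load(self, address):
        """Original load (ld1): read memory and record the loaded value."""
        value = self.memory[address]
        self.queue.append((address, value))
        return value

    def load_dup(self, address):
        """Duplicate load (ld2): bypass memory and take the queued value."""
        if not self.queue:
            raise LoadMismatch("duplicate load without an original")
        queued_address, value = self.queue.popleft()
        if queued_address != address:
            raise LoadMismatch(f"{queued_address:#x} != {address:#x}")
        return value
```

Because the duplicate load never touches memory, it sees the same value as the original even if the memory location changes in between, and memory traffic does not increase.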
LVQ: ADVANTAGES / DISADVANTAGES
Advantages
- Reduces the window of vulnerability by issuing a duplicated load instruction
- Keeps memory traffic low by bypassing the load value
Disadvantages
- Extra hardware is needed to ensure that loads and their duplicates access the same entry in the LVQ
EXPERIMENTAL EVALUATION
Evaluation Method – Performance vs. Reliability:
Inject randomly chosen faults into a detailed microarchitectural simulation
Track each chosen bit flip until the program completes
Analyze the final result to determine:
- How much SDC is converted to DUE
- How much work (# of applications) the program completed before encountering an SDC
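The classification step can be illustrated with a toy fault-injection campaign. This is a deliberately simplified sketch, not the microarchitectural simulator used in the evaluation: the workload is a stand-in function, a fault is one bit flip in one input, protection is modeled as an idealized redundant run on an unfaulted copy, and all names (`program`, `classify`, `campaign`) are hypothetical.

```python
import random


def program(values):
    """Toy workload; any deterministic function of the inputs would do."""
    return max(values)


def classify(values, index, bit, protected):
    """Flip one bit of one input and classify the outcome of a single run."""
    golden = program(values)
    faulty = list(values)
    faulty[index] ^= 1 << bit
    result = program(faulty)
    if protected and result != program(values):
        return "DUE"      # the redundant run disagrees, so the fault is detected
    if result != golden:
        return "SDC"      # wrong output that nobody noticed
    return "benign"       # the flip was masked by the computation


def campaign(values, trials, protected, seed=0):
    """Inject `trials` random single-bit flips and count each outcome."""
    rng = random.Random(seed)
    counts = {"benign": 0, "DUE": 0, "SDC": 0}
    for _ in range(trials):
        index = rng.randrange(len(values))
        bit = rng.randrange(8)
        counts[classify(values, index, bit, protected)] += 1
    return counts
```

Running `campaign` twice with the same seed, once with `protected=False` and once with `protected=True`, shows every SDC of the unprotected run turning into a DUE in the protected one. A real system converts only part of its SDCs, which is what the conversion rate measures.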
EXPERIMENTAL EVALUATION
Results: Measures the # of applications the program completed before encountering an SDC
Performance by implementation:
- CSB: Enables better performance, as it eliminates scheduling constraints
- LVQ: Impact varies by benchmark
SUMMARY AND CONCLUSION
Hybrid technique can provide better reliability at relatively low cost
CRAFT, as compared to:
Software-only Technique
- Execution time reduced by 5%
- SDC to DUE conversion rate increased by 75%
Hardware-only Technique
- Significantly reduced area overhead
- Comparable reliability maintained