3. Hardware Redundancy Reliable System Design 2010 by: Amir M. Rahmani.
3. Hardware Redundancy
Reliable System Design 2010 by: Amir M. Rahmani
matlab1.ir
Forms of Redundancy
Hardware redundancy
• add extra hardware for detecting or tolerating faults
Software redundancy
• add extra software for detecting and possibly tolerating faults
Information redundancy
• extra information, i.e. codes
Time redundancy
• extra time for performing tasks for fault tolerance
Types of Hardware Redundancy
Fault tolerance requires redundancy
1- Static Redundancy (passive)
• uses fault masking to hide the occurrence of a fault
• does not require reconfiguration
• Example: TMR, voting
2- Dynamic Redundancy (active)
• uses comparison for detection and/or diagnosis
• requires reconfiguration
  – remove faulty hardware from the system
• Example: stand-by system
3- Hybrid Redundancy
• combination of static and dynamic redundancy
1- Static Redundancy
A class of redundancy techniques that can tolerate faults without reconfiguration (failover).
Static redundancy can be divided into two major subclasses:
• Masking redundancy
• Active redundancy
Masking Redundancy
Uses majority voting to mask faults
Requires 2f + 1 modules to tolerate f faulty modules
N-Modular Redundant system (NMR)
N independent modules replicate the same function
• parallelism
• results are voted on
• requirements: N >= 3
TMR (Triple Modular Redundancy)
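The voting rule above can be sketched in a few lines of Python. This is only an illustration: the names `modules_needed` and `nmr_vote` are hypothetical, module outputs are assumed to be hashable values, and the voter itself is assumed to be fault-free.

```python
from collections import Counter


def modules_needed(f):
    """Modules required by masking redundancy to tolerate f faulty modules."""
    return 2 * f + 1


def nmr_vote(outputs):
    """Return the value produced by a strict majority of the modules.

    Raises ValueError when no value is backed by more than half of the
    modules, i.e. when too many modules are faulty for masking to work.
    """
    if len(outputs) < 3:
        raise ValueError("NMR needs N >= 3 modules")
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise ValueError("no majority: too many faulty modules")
    return value
```

With three modules, `nmr_vote([7, 7, 9])` masks the single faulty output and returns 7; with three different outputs the voter can no longer mask the fault and raises an error.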
Triple Modular Redundancy (TMR)
Majority voting, e.g. a 1-bit majority voter built from three 2-input AND gates whose outputs are ORed
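The gate-level voter described above corresponds to the Boolean expression AB + AC + BC. A minimal Python sketch follows; the function name `majority` is an assumption made for illustration. Because Python's `&` and `|` operate bitwise, the same expression also votes on whole words, one bit position at a time.

```python
def majority(a, b, c):
    """Majority of three inputs: three 2-input ANDs whose outputs are ORed."""
    return (a & b) | (a & c) | (b & c)
```

For single bits, `majority(1, 0, 1)` yields 1. For words, each output bit is the majority of the three corresponding input bits, so a corrupted bit in one module's output is masked.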
Masking Redundancy
TMR with triple voting
Masking Redundancy
Multi-stage TMR
N-Modular Redundant system (NMR)
Active Redundancy
Two or more units are active and produce replicated results simultaneously
Relies on fail-stop units
Fail-stop property: a unit produces correct results or no results at all
Requires f + 1 modules to tolerate f faulty modules
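Why f + 1 fail-stop units are enough can be seen in a small sketch. It assumes, purely for illustration, that a stopped unit is represented by `None`; the function name `first_available` is hypothetical.

```python
def first_available(results):
    """Pick a result from replicated fail-stop units.

    `results` holds one entry per unit; None means the unit stopped.
    Any non-None entry is correct by the fail-stop property, so no
    voting is needed and one surviving unit is sufficient.
    """
    for result in results:
        if result is not None:
            return result
    raise RuntimeError("all units have stopped")
```

With two units (f = 1), `first_available([None, 42])` still delivers 42. The scheme only works if the units really are fail-stop: a unit that returns a wrong value instead of stopping would be accepted unchecked.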
Fail-stop Nodes
Node 1 and 2 send their results individually to node 3 and 4
All nodes are fail-stop: They send correct results or no results at all
2- Dynamic Redundancy
Relies on error detection and reconfiguration
Requires f + 1 modules to tolerate f faulty modules
May require recovery of system or application state
May require outage time
Example: Duplicate and Compare
• can only detect, but NOT diagnose
  – i.e. fault detection, no fault tolerance
• may order shutdown
• comparator is a single point of failure
  – simple implementation: 2-input XOR for single-bit compare
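A minimal sketch of duplicate-and-compare, assuming the two modules are plain Python callables that return integers (the name `duplicate_and_compare` is hypothetical). The XOR of the two outputs is nonzero exactly when they differ in at least one bit.

```python
def duplicate_and_compare(module_a, module_b, x):
    """Run both modules on x and compare their outputs with XOR.

    A nonzero XOR means the outputs disagree. The comparator cannot
    tell which module is faulty, so the only safe reaction is to
    report the error (for example, to order a shutdown).
    """
    out_a = module_a(x)
    out_b = module_b(x)
    if out_a ^ out_b:
        raise RuntimeError("mismatch detected: outputs differ")
    return out_a
```

Note that a fault in the comparison itself goes unnoticed here, which is the single-point-of-failure concern listed above.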
Example: Stand-by System
• only one module is driving outputs
• other modules are:
  – idle => hot spares
  – shut down => cold spares
• error detection (e.g. communications checksums and memory parity bits) => switch to a new module (hot or cold spare)
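The switch-over behaviour can be sketched as follows. Everything named here is an assumption for illustration: modules are modelled as callables, and `check` stands in for whatever error-detection mechanism the system uses.

```python
class StandbySystem:
    """One active module drives the output; the rest are spares."""

    def __init__(self, modules, check):
        self.modules = list(modules)  # modules[0] is the primary
        self.active = 0               # index of the module driving outputs
        self.check = check            # error detector: True if output is acceptable

    def run(self, x):
        while self.active < len(self.modules):
            out = self.modules[self.active](x)
            if self.check(x, out):
                return out
            # Error detected: reconfigure by switching to the next spare.
            self.active += 1
        raise RuntimeError("no spares left")
```

Once the system has switched to a spare, the faulty module is never used again. A cold spare would additionally need to be powered up and initialised before taking over, which this sketch does not model.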
Types of Stand-by Systems
Hot standby
Warm standby
Cold standby
Hot Stand-by
Characteristics
• Spare updated simultaneously with primary module
+ Advantages
  + Very short or no outage time
  + Does not require recovery of application state
- Drawbacks
  - High failure rate (fault rate)
  - High power consumption
Warm Stand-by
Characteristics
• Spare up and running
• Needs to recover application status
+ Advantages
  + Does not require simultaneous updating of spare and primary module
- Drawbacks
  - Requires recovery of application state
  - High fault rate
  - High power consumption
Cold Stand-by
Characteristics
• Spare powered down
+ Advantages
  + Low failure rate (fault rate)
  + Low power consumption
    – e.g. satellite applications
- Drawbacks
  - Very long outage time
  - Needs to boot kernel/operating system and recover application status
3- Hybrid Redundancy
N-Modular Redundancy with spares
• N active + S spare modules (off-line)
• Voting and comparison
• Replaces erroneous module from spare pool
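The combination of voting, comparison and replacement can be sketched as below. This is a simplified illustration under stated assumptions: modules are callables with hashable outputs, the voter is fault-free, and the class name `NMRWithSpares` is hypothetical.

```python
from collections import Counter


class NMRWithSpares:
    """N active modules are voted on; disagreeing modules are replaced."""

    def __init__(self, active, spares):
        self.active = list(active)  # the N modules currently voted on
        self.spares = list(spares)  # the S off-line spare modules

    def run(self, x):
        outputs = [module(x) for module in self.active]
        # Voting: mask faults as long as a strict majority agrees.
        value, count = Counter(outputs).most_common(1)[0]
        if count <= len(outputs) // 2:
            raise RuntimeError("no majority among active modules")
        # Comparison: a module that disagrees with the voted value is
        # treated as erroneous and replaced while spares remain.
        for i, out in enumerate(outputs):
            if out != value and self.spares:
                self.active[i] = self.spares.pop(0)
        return value
```

The voted output stays correct during the faulty step (static masking), and the replacement restores the full set of N working modules for later steps (dynamic reconfiguration).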
Coding checks / Exception checks
Coding checks
• Error detection codes are formed by the addition of check bits to a data word.
• A cyclic redundancy code check was used in the disk store of ESS.
• A parity bit was used in the RAM.
Exception checks
• Hardware constraints: usually result from the inability of the hardware to provide the service needed by the software.
• Examples
  – Improper address alignment
  – Unequipped memory locations
  – Unused op-code
  – Stack overflow
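The parity check mentioned under coding checks is the simplest case of adding check bits to a data word. The sketch below assumes even parity and non-negative integer data words. The function names are hypothetical, and the placement of the check bit in the lowest position is an arbitrary choice for illustration.

```python
def parity_bit(word):
    """Even-parity check bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") % 2


def encode(word):
    """Append the check bit as the least significant bit of the codeword."""
    return (word << 1) | parity_bit(word)


def check(codeword):
    """True if the codeword has an even number of 1 bits."""
    return bin(codeword).count("1") % 2 == 0
```

A single flipped bit makes `check` fail, but two flipped bits cancel out and go undetected, which is one reason stronger codes such as CRCs are used where burst errors are likely.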
Watchdog Timers
So far, we’ve figured out how to detect when something is wrong … but how do we detect when we’re not doing anything at all?
Watchdog timer monitors a module and triggers a recovery if the module doesn’t do anything in a given amount of time
• E.g., put a watchdog timer on a microprocessor bus
Who watches the watchdog?
• If we assume a single-fault scenario, then this usually isn't a problem
• But what if the watchdog has a hard fault that causes it to never time out and trigger a recovery?
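A software analogue of the watchdog behaviour described above is sketched here. It is an illustration only: the class and method names are assumptions, the monitored module is expected to call `kick()` regularly, and some external loop is assumed to call `poll()`. The `clock` parameter exists so that the timing can be tested without waiting.

```python
import time


class Watchdog:
    """Triggers a recovery action if the module stays silent too long."""

    def __init__(self, timeout, on_timeout, clock=time.monotonic):
        self.timeout = timeout        # allowed silence, in seconds
        self.on_timeout = on_timeout  # recovery action to trigger
        self.clock = clock
        self.last_kick = clock()

    def kick(self):
        """Called by the monitored module to show it is still active."""
        self.last_kick = self.clock()

    def poll(self):
        """Check the deadline; run the recovery action if it was missed."""
        if self.clock() - self.last_kick > self.timeout:
            self.on_timeout()
            self.last_kick = self.clock()  # restart the interval after recovery
            return True
        return False
```

The sketch also makes the "who watches the watchdog" question concrete: if `poll()` is never called, or the clock source stops advancing, the timeout never fires and the silent module goes unnoticed.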