Chapter 03 Fault Tolerant slides 110407 - Elsevier · 2013-06-03 · – e.g., electromigration,...
EE141
1
System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1
Chapter 3
Fault-Tolerant Design

What is this chapter about?
• Gives overview of fault-tolerant design
• Focus on
  • Basic concepts in fault-tolerant design
  • Metrics used to specify and evaluate dependability
  • Review of coding theory
  • Fault-tolerant design schemes
    – Hardware redundancy
    – Information redundancy
    – Time redundancy
  • Examples of fault-tolerant applications in industry

Fault-Tolerant Design
• Introduction
• Fundamentals of Fault Tolerance
• Fundamentals of Coding Theory
• Fault Tolerant Schemes
• Industry Practices
• Concluding Remarks

Introduction
• Fault Tolerance
  • Ability of a system to continue error-free operation in the presence of an unexpected fault
• Important in mission-critical applications
  • E.g., medical, aviation, banking, etc.
  • Errors are very costly
• Becoming important in mainstream applications
  • Technology scaling is making circuit behavior less predictable and more prone to failures
  • Fault tolerance is needed to keep the failure rate within acceptable levels

Faults
• Permanent Faults
  • Due to manufacturing defects, early life failures, wearout failures
  • Wearout failures due to various mechanisms
    – e.g., electromigration, hot carrier degradation, dielectric breakdown, etc.
• Temporary Faults
  • Only present for short period of time
  • Caused by external disturbance or marginal design parameters

Temporary Faults
• Transient Errors (Non-recurring errors)
  • Caused by external disturbance
    – e.g., radiation, noise, power disturbance, etc.
• Intermittent Errors (Recurring errors)
  • Caused by marginal design parameters
  • Timing problems
    – e.g., races, hazards, skew
  • Signal integrity problems
    – e.g., crosstalk, ground bounce, etc.

Redundancy
• Fault Tolerance requires some form of redundancy
  • Time Redundancy
  • Hardware Redundancy
  • Information Redundancy

Time Redundancy
• Perform Same Operation Twice
  • Check whether both runs give the same result
  • If not, then a fault occurred
  • Can detect temporary faults
  • Cannot detect permanent faults
    – Would affect both computations
• Advantage
  • Little to no hardware overhead
• Disadvantage
  • Impacts system or circuit performance

Hardware Redundancy
• Replicate hardware and compare outputs
  • From two or more modules
  • Detects both permanent and temporary faults
• Advantage
  • Little or no performance impact
• Disadvantage
  • Area and power for redundant hardware

Information Redundancy
• Encode outputs with error detecting or correcting code
  • Code selected to minimize redundancy for class of faults
• Advantage
  • Less hardware to generate redundant information than replicating module
• Drawback
  • Added complexity in design

Failure Rate
• λ(t) = Component failure rate
  • Measured in FITs (failures per 10^9 hours)
[Figure: failure rate versus time, showing the early-failure, random-failure, and wearout-failure contributions and the overall curve across the infant mortality, working life, and wearout periods]

System Failure Rate
• System constructed from components
• No Fault Tolerance
  • Any component fails, whole system fails
$\lambda_{sys} = \sum_{i=1}^{k} \lambda_{c,i}$

Reliability
• If component working at time 0
  • R(t) = Probability still working at time t
• Exponential Failure Law
  • If failure rate assumed constant
    – Good approximation if past infant mortality period
$R(t) = e^{-\lambda t}$

Reliability for Series System
• Series System
  • All components need to work for system to work
[Figure: components A, B, and C connected in series]
$R_{sys} = R_A R_B R_C$

System Reliability with Redundancy
• System reliability with component B in Parallel
  • Can tolerate one component B failing
[Figure: A in series with two parallel copies of B, followed by C]
$R_{sys} = R_A \left[ 1 - (1 - R_B)^2 \right] R_C = R_A (2R_B - R_B^2) R_C$
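
As a rough numerical check of the series and parallel formulas above, the following Python sketch composes component reliabilities. The function names and the reliability values are illustrative assumptions, not taken from the slides.

```python
import math

def series(*reliabilities):
    """Series system: every component must work, so reliabilities multiply."""
    return math.prod(reliabilities)

def parallel(*reliabilities):
    """Parallel group: fails only if every redundant copy fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

# Hypothetical component reliabilities at some fixed mission time.
r_a, r_b, r_c = 0.99, 0.90, 0.99

r_series = series(r_a, r_b, r_c)                      # A, B, C in series
r_redundant = series(r_a, parallel(r_b, r_b), r_c)    # B duplicated in parallel

print(f"series: {r_series:.4f}  with redundant B: {r_redundant:.4f}")
```

Duplicating only the weakest component (B here) raises system reliability without replicating the whole chain.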

Mean-Time-to-Failure (MTTF)
• Average time before system fails
  • Equal to area under reliability curve
  $MTTF = \int_0^{\infty} R(t)\,dt$
• For Exponential Failure Law
  $MTTF = \int_0^{\infty} e^{-\lambda t}\,dt = \frac{1}{\lambda}$

Maintainability
• If system failed at time 0
  • M(t) = Probability repaired and operational at time t
• System repair time divided into
  • Passive repair time
    – Time for service engineer to travel to site
  • Active repair time
    – Time to locate failing component, repair/replace, and verify system operational
    – Can be improved by designing the system so that it is easy to locate the failed component and verify the repair

Repair Rate and MTTR
• µ = rate at which system repaired
  • Analogous to failure rate λ
• Maintainability often modeled as
  $M(t) = 1 - e^{-\mu t}$
• Mean-Time-to-Repair (MTTR) = 1/µ

Availability
• System Availability
  • Fraction of time system is operational
[Figure: system state S (1 or 0) versus time t, with failures marked at t0 through t4 and intervals of normal system operation]
$\text{system availability} = \frac{MTTF}{MTTF + MTTR}$

Availability
• Telephone Systems
  • Required to have system availability of 0.9999 (“four nines”)
• High-Reliability Systems
  • May require 7 or more nines
• Fault-Tolerant Design
  • Needed to achieve such high availability from less reliable components
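
To make the "nines" concrete, here is a small Python sketch that evaluates the availability formula MTTF / (MTTF + MTTR) and converts an availability figure into expected downtime per year. The MTTF and MTTR values are hypothetical, chosen only to land on four nines.

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is operational."""
    return mttf_hours / (mttf_hours + mttr_hours)

def downtime_minutes_per_year(avail):
    """Expected unavailable minutes in a 365-day year."""
    return (1.0 - avail) * 365 * 24 * 60

# Hypothetical numbers: one hour of repair for every 9999 hours of uptime.
four_nines = availability(mttf_hours=9999.0, mttr_hours=1.0)

print(f"availability = {four_nines:.4f}")
print(f"downtime     = {downtime_minutes_per_year(four_nines):.2f} min/year")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why very high availability targets usually require fault tolerance rather than just better components.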

Coding Theory
• Coding
  • Using more bits than necessary to represent data
  • Provides way to detect errors
    – Errors occur when bits get flipped
• Error Detecting Codes
  • Many types
  • Detect different classes of errors
  • Use different amounts of redundancy
  • Ease of encoding and decoding data varies

Block Code
• Message = Data Being Encoded
• Block code
  • Encodes m messages with n-bit codeword
• If no redundancy
  • m messages encoded with log2(m) bits
  • minimum possible
$redundancy = 1 - \frac{\log_2(m)}{n}$

Block Code
• To detect errors, some redundancy needed
  • Space of 2^n distinct blocks partitioned into codewords and non-codewords
• Can detect errors that cause codeword to become non-codeword
• Cannot detect errors that cause codeword to become another codeword

Separable Block Code
• Separable
  • n-bit blocks partitioned into
    – k information bits directly representing message
    – (n-k) check bits
  • Denoted (n,k) Block Code
• Advantage
  • k-bit message directly extracted without decoding
• Rate of Separable Block Code = k/n

Example of Separable Block Code
• (4,3) Parity Code
  • Check bit is XOR of 3 message bits
  • message 101 → codeword 1010
• Single Bit Parity
  $redundancy = 1 - \frac{\log_2(m)}{n} = 1 - \frac{\log_2(2^k)}{n} = 1 - \frac{k}{n} = \frac{n-k}{n} = \frac{1}{n}$
  $rate = \frac{k}{n} = \frac{n-1}{n}$
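
The (4,3) parity code above can be sketched in a few lines of Python. The function names are illustrative; the check bit is the XOR of the message bits, as on the slide.

```python
def parity_encode(message_bits):
    """Append one check bit equal to the XOR of the message bits."""
    return message_bits + [sum(message_bits) % 2]

def parity_ok(word):
    """A valid codeword has an even number of 1s."""
    return sum(word) % 2 == 0

codeword = parity_encode([1, 0, 1])      # message 101 from the slide
single_error = [1, 1, 1, 0]              # one bit flipped: detected
double_error = [0, 1, 1, 0]              # two bits flipped: slips through

print(codeword, parity_ok(codeword))
print(single_error, parity_ok(single_error))
print(double_error, parity_ok(double_error))
```

The double-error case shows the limit of a distance-2 code: an even number of flips turns one codeword into another codeword.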

Example of Non-Separable Block Code
• One-Hot Code
  • Each Codeword has single 1
  • Example of 8-bit one-hot
    – 10000000, 01000000, 00100000, 00010000, 00001000, 00000100, 00000010, 00000001
  $redundancy = 1 - \frac{\log_2(m)}{n} = 1 - \frac{\log_2(n)}{n}$
  • Redundancy = 1 - log2(8)/8 = 5/8

Linear Block Codes
• Special class
  • Modulo-2 sum of any 2 codewords also codeword
  • Null space of (n-k) × n Boolean matrix
    – Called Parity Check Matrix, H
• For any n-bit codeword c
  • $cH^T = 0$
  • All 0 codeword exists in any linear code

Linear Block Codes
• Generator Matrix, G
  • k × n Matrix
• Codeword c for message m
  • $c = mG$
• $GH^T = 0$

Systematic Block Code
• First k bits correspond to message
• Last n-k bits correspond to check bits
• For Systematic Code
  • $G = [\, I_{k \times k} : P_{k \times (n-k)} \,]$
  • $H = [\, P^T_{(n-k) \times k} : I_{(n-k) \times (n-k)} \,]$
• Example
  $G = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}$, $H = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}$

Distance of Code
• Distance between two codewords
  • Number of bits in which they differ
• Distance of Code
  • Minimum distance between any two codewords in code
  • If n=k (no redundancy), distance = 1
  • Single-bit parity, distance = 2
• Code with distance d
  • Detect d-1 errors
  • Correct up to ⌊(d-1)/2⌋ errors

Error Correcting Codes
• Code with distance 3
  • Called single error correcting (SEC) code
• Code with distance 4
  • Called single error correcting and double error detecting (SEC-DED) code
• Procedure for constructing SEC code
  • Described in [Hamming 1950]
  • Any H-matrix with all columns distinct and no all-0 column is SEC

Hamming Code
• For any value of n
  • SEC code constructed by
    – setting each column in H equal to binary representation of column number (starting from 1)
  • Number of rows in H equal to ⌈log2(n+1)⌉
• Example of SEC Hamming Code for n=7
  $H = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 \end{bmatrix}$

Error Correction in Hamming Code
• Syndrome, s
  • $s = Hv^T$ for received vector v
  • If v is codeword
    – Syndrome = 0
  • If v non-codeword and single-bit error
    – Syndrome will match one of columns of H
    – Will contain binary value of bit position in error

Example of Error Correction
• For (7,4) Hamming Code
  • Suppose codeword 0110011 has one-bit error changing it to 1110011
  $s^T = vH^T = \begin{bmatrix} 1 & 1 & 1 & 0 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 1 \end{bmatrix}$
  • The syndrome 001 is the binary value 1, so the first bit is in error
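
The syndrome calculation above can be reproduced with a short Python sketch. It assumes the H-matrix convention from the Hamming Code slide (column i holds the binary representation of i, most significant bit in the top row); the function names are illustrative.

```python
def hamming_h(n=7, rows=3):
    """H-matrix whose column i (1-based) is the binary representation of i."""
    return [[(col >> (rows - 1 - r)) & 1 for col in range(1, n + 1)]
            for r in range(rows)]

def syndrome(h, v):
    """s = H v^T over GF(2)."""
    return [sum(hb * vb for hb, vb in zip(row, v)) % 2 for row in h]

def correct_single_error(h, v):
    """Flip the bit whose position is encoded by the syndrome (0 means no error)."""
    s = syndrome(h, v)
    position = int("".join(str(bit) for bit in s), 2)
    fixed = list(v)
    if position:
        fixed[position - 1] ^= 1
    return fixed, s

H = hamming_h()
received = [1, 1, 1, 0, 0, 1, 1]          # codeword 0110011 with bit 1 flipped
fixed, s = correct_single_error(H, received)
print("syndrome:", s, "corrected:", fixed)
```

Because the syndrome directly encodes the failing position, no lookup table is needed for correction in this column ordering.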

SEC-DED Code
• Make SEC Hamming Code SEC-DED
  • By adding parity check over all bits
  • Extra parity bit
    – 1 for single-bit error
    – 0 for double-bit error
  • Makes possible to detect double bit error
    – Avoid assuming single-bit error and miscorrecting it
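
Example of Error Correction
• For (7,3) SEC-DED Hamming Code
  • Suppose codeword 0110011 has two-bit error changing it to 1010011
  $s^T = vH^T = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 1 & 0 \end{bmatrix}$
    – Doesn’t match any column in H
    – Overall parity bit of the syndrome is 0 while the remaining bits are nonzero, which indicates a double-bit error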
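
The decision rule for the SEC-DED example can be sketched as follows. This assumes the H-matrix is the 3-row Hamming matrix with one extra all-ones row for the overall parity check; that layout and the function names are assumptions for illustration.

```python
def secded_h(n=7, rows=3):
    """Hamming rows (column i = binary i, MSB first) plus an all-ones parity row."""
    hamming = [[(col >> (rows - 1 - r)) & 1 for col in range(1, n + 1)]
               for r in range(rows)]
    return hamming + [[1] * n]

def classify(v, n=7, rows=3):
    """Return (verdict, corrected_word_or_None) for received vector v."""
    h = secded_h(n, rows)
    s = [sum(a * b for a, b in zip(row, v)) % 2 for row in h]
    position = int("".join(str(bit) for bit in s[:rows]), 2)
    parity = s[rows]
    if position == 0 and parity == 0:
        return "no error", list(v)
    if parity == 1 and position != 0:
        fixed = list(v)
        fixed[position - 1] ^= 1          # single error: syndrome names the bit
        return "single error corrected", fixed
    return "multiple errors detected", None  # parity 0 with nonzero syndrome

print(classify([0, 1, 1, 0, 0, 1, 1]))    # valid codeword
print(classify([1, 1, 1, 0, 0, 1, 1]))    # one bit flipped
print(classify([1, 0, 1, 0, 0, 1, 1]))    # two bits flipped
```

The extra parity bit is what separates the last two cases; without it the double error would be miscorrected as a single error.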

Hsiao Code
• Weight of column
  • Number of 1’s in column
• Constructing n-bit SEC-DED Hsiao Code
  • First use all possible weight-1 columns
    – Then all possible weight-3 columns
    – Then weight-5 columns, etc.
  • Until n columns formed
  • Number of check bits is ⌈log2(n)⌉ + 1
  • Minimizes number of 1’s in H-matrix
    – Less hardware and delay for computing syndrome
    – Disadvantage: Correction logic more complex

Example of Hsiao Code
• (7,3) Hsiao Code
  • Uses weight-1 and weight-3 columns
  $H = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 \end{bmatrix}$

Unidirectional Errors
• Errors in block of data which only cause 0→1 or 1→0, but not both
  • Any number of bits in error in one direction
• Example
  • Correct codeword 111000
  • Unidirectional errors could cause
    – 001000, 000000, 101000 (only 1→0 errors)
  • Non-unidirectional errors
    – 101001, 011001, 011011 (both 1→0 and 0→1)

Unidirectional Error Detecting Codes
• All unidirectional error detecting (AUED) Codes
  • Detect all unidirectional errors in codeword
  • Single-bit parity is not AUED
    – Cannot detect even number of errors
  • No linear code is AUED
    – All linear codes must contain all-0 vector, so cannot detect all 1→0 errors

Two-Rail Code
• Two-Rail Code
  • One check bit for each information bit
    – Equal to complement of information bit
  • Two-Rail Code is AUED
  • 50% Redundancy
• Example of (6,3) Two-Rail Code
  • Message 101 has Codeword 101010
  • Set of all codewords
    – 000111, 001110, 010101, 011100, 100011, 101010, 110001, 111000

Berger Codes
• Lowest redundancy of separable AUED codes
  • For k information bits, ⌈log2(k+1)⌉ check bits
• Check bits equal to binary representation of number of 0’s in information bits
• Example
  • Information bits 1000101
    – ⌈log2(7+1)⌉ = 3 check bits
    – Check bits equal to 100 (4 zeros)

Berger Codes
• Codewords for (5,3) Berger Code
  • 00011, 00110, 01010, 01101, 10010, 10101, 11001, 11100
• If unidirectional errors
  • Contain 1→0 errors
    – increase 0’s in information bits
    – can only decrease binary number in check bits
  • Contain 0→1 errors
    – decrease 0’s in information bits
    – can only increase binary number in check bits
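
A minimal Python sketch of Berger encoding and checking, following the rule above (check bits are the binary count of 0s in the information bits). The function names are illustrative.

```python
def berger_encode(info_bits):
    """Append the binary count of 0s in info_bits, MSB first."""
    k = len(info_bits)
    width = k.bit_length()                # equals ceil(log2(k + 1))
    zeros = info_bits.count(0)
    check = [(zeros >> i) & 1 for i in reversed(range(width))]
    return info_bits + check

def berger_valid(word, k):
    """A word is valid when its check bits match the zero count of its first k bits."""
    return berger_encode(word[:k]) == word

word = berger_encode([1, 0, 0, 0, 1, 0, 1])   # slide example: 4 zeros -> 100
print(word)

# A unidirectional 1->0 error hitting both information and check bits is caught:
corrupted = [0, 0, 0, 0, 1, 0, 1, 0, 0, 0]
print(berger_valid(corrupted, 7))
```

Intuitively, 1→0 errors raise the zero count in the information bits while they can only lower the stored count, so the two can never agree again.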

Berger Codes
• If 8 information bits
  • Berger code requires ⌈log2(8+1)⌉ = 4 check bits
  $redundancy = 1 - \frac{\log_2(m)}{n} = 1 - \frac{\log_2(2^k)}{n} = 1 - \frac{8}{12} = \frac{1}{3} \approx 33.3\%$
• (16,8) Two-Rail Code
  • Requires 50% redundancy
• Redundancy advantage of Berger Code
  • Increases as k increases

Constant Weight Codes
• Constant Weight Codes
  • Non-separable, but lower redundancy than Berger
  • Each codeword has same number of 1’s
• Example 2-out-of-3 constant weight code
  • 110, 011, 101
• AUED code
  • Unidirectional errors always change number of 1’s

Constant Weight Codes
• Number of codewords in m-out-of-n code: $C_m^n = \binom{n}{m}$
• Number of codewords maximized when m is as close to n/2 as possible
  • n/2-out-of-n when n even
  • (n/2-0.5 or n/2+0.5)-out-of-n when n odd
  • Minimizes redundancy of code

Example
• 6-out-of-12 constant weight code
  $C_6^{12} = 924$ codewords
  $redundancy = 1 - \frac{\log_2(m)}{n} = 1 - \frac{\log_2(924)}{12} = 17.9\%$
• 12-bit Berger Code
  • Only 2^8 = 256 codewords
  $redundancy = 1 - \frac{\log_2(m)}{n} = 1 - \frac{\log_2(2^8)}{12} = 33.3\%$
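
The two redundancy figures above can be checked numerically with a short Python sketch. The helper name `redundancy` is an illustrative choice; it implements 1 - log2(m)/n from the Block Code slide.

```python
import math

def redundancy(num_codewords, n_bits):
    """redundancy = 1 - log2(m) / n for a code with m codewords of n bits."""
    return 1.0 - math.log2(num_codewords) / n_bits

six_of_twelve = math.comb(12, 6)          # codewords in a 6-out-of-12 code
berger_12 = 2 ** 8                        # 8 information bits, 4 check bits

print(six_of_twelve, f"{redundancy(six_of_twelve, 12):.1%}")
print(berger_12, f"{redundancy(berger_12, 12):.1%}")
```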

Constant Weight Codes
• Advantage
  • Less redundancy than Berger codes
• Disadvantage
  • Non-separable
  • Need decoding logic
    – to convert codeword back to binary message

Burst Error
• Burst Error
  • Common, since multi-bit errors tend to be clustered
    – Noise source affects contiguous set of bus lines
  • Length of burst error
    – number of bits from first to last error, inclusive
  • Can wrap around from last to first bit of codeword
• Example: Original codeword 00000000
  • 00111100 is burst error length 4
  • 00110100 is burst error length 4
    – Any number of errors between first and last error

Cyclic Codes
• Special class of linear code
  • Any codeword shifted cyclically is another codeword
• Used to detect burst errors
  • Less redundancy required to detect burst error than general multi-bit errors
    – Some distance 2 codes can detect all burst errors of length 4
    – detecting all possible 4-bit errors requires distance 5 code

Cyclic Redundancy Check (CRC) Code
• Most widely used cyclic code
  • Uses binary alphabet based on GF(2)
• CRC code is (n,k) block code
  • Formed using generator polynomial, g(x)
    – called code generator
    – degree n-k polynomial (same degree as number of check bits)
  $g(x) = g_{n-k}x^{n-k} + \dots + g_2x^2 + g_1x + g_0$
  $c(x) = m(x)\,g(x)$

Message   m(x)                 g(x)      c(x)                    Codeword
0000      0                    x^2 + 1   0                       000000
0001      1                    x^2 + 1   x^2 + 1                 000101
0010      x                    x^2 + 1   x^3 + x                 001010
0011      x + 1                x^2 + 1   x^3 + x^2 + x + 1       001111
0100      x^2                  x^2 + 1   x^4 + x^2               010100
0101      x^2 + 1              x^2 + 1   x^4 + 1                 010001
0110      x^2 + x              x^2 + 1   x^4 + x^3 + x^2 + x     011110
0111      x^2 + x + 1          x^2 + 1   x^4 + x^3 + x + 1       011011
1000      x^3                  x^2 + 1   x^5 + x^3               101000
1001      x^3 + 1              x^2 + 1   x^5 + x^3 + x^2 + 1     101101
1010      x^3 + x              x^2 + 1   x^5 + x                 100010
1011      x^3 + x + 1          x^2 + 1   x^5 + x^2 + x + 1       100111
1100      x^3 + x^2            x^2 + 1   x^5 + x^4 + x^3 + x^2   111100
1101      x^3 + x^2 + 1        x^2 + 1   x^5 + x^4 + x^3 + 1     111001
1110      x^3 + x^2 + x        x^2 + 1   x^5 + x^4 + x^2 + x     110110
1111      x^3 + x^2 + x + 1    x^2 + 1   x^5 + x^4 + x + 1       110011

CRC Code
• Linear block code
  • Has G-matrix and H-matrix
  • Rows of G-matrix are shifted versions of generator polynomial
  $G = \begin{bmatrix} g_{n-k} & \cdots & g_1 & g_0 & 0 & \cdots & 0 \\ 0 & g_{n-k} & \cdots & g_1 & g_0 & \cdots & 0 \\ \vdots & & \ddots & & & \ddots & \vdots \\ 0 & \cdots & 0 & g_{n-k} & \cdots & g_1 & g_0 \end{bmatrix}$

CRC Code Example
• (6,4) CRC code generated by g(x) = x^2 + 1
  $G = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \end{bmatrix}$

Systematic CRC Codes
• To obtain systematic CRC code
  • codewords formed using Galois division
    – nice because LFSR can be used for performing division
  $c(x) = m(x)\,x^{n-k} + r(x)$
  $r(x) = \text{remainder of } \frac{m(x)\,x^{n-k}}{g(x)}$

Galois Division Example
• Encode m(x) = x^2 + x with g(x) = x^2 + 1
  • Requires dividing m(x)x^(n-k) = x^4 + x^3 by g(x)
  • Remainder r(x) = x + 1
    – c(x) = m(x)x^(n-k) + r(x) = (x^2 + x)(x^2) + x + 1 = x^4 + x^3 + x + 1

            111
    101 ) 11000
          101
           110
           101
            110
            101
             11   remainder
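
The long division above can be mirrored in Python by treating polynomials over GF(2) as integers whose bits are the coefficients. This is a sketch under that representation; the function and parameter names are illustrative.

```python
def gf2_remainder(dividend, divisor):
    """Remainder of GF(2) polynomial division; bit i is the coefficient of x^i."""
    divisor_len = divisor.bit_length()
    while dividend.bit_length() >= divisor_len:
        # Align the divisor under the leading 1 of the dividend and XOR (subtract).
        dividend ^= divisor << (dividend.bit_length() - divisor_len)
    return dividend

def crc_encode_systematic(message, n_check, generator):
    """c(x) = m(x) * x^(n-k) + r(x), with r(x) the remainder modulo g(x)."""
    shifted = message << n_check
    return shifted | gf2_remainder(shifted, generator)

G_POLY = 0b101                                   # g(x) = x^2 + 1
remainder = gf2_remainder(0b11000, G_POLY)       # (x^4 + x^3) mod g(x)
codeword = crc_encode_systematic(0b0110, 2, G_POLY)

print(f"remainder = {remainder:02b}, codeword = {codeword:06b}")
```

Every systematic codeword produced this way is divisible by g(x), which is the property the receiver checks.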

Message   m(x)                 g(x)      r(x)     c(x)                    Codeword
0000      0                    x^2 + 1   0        0                       000000
0001      1                    x^2 + 1   1        x^2 + 1                 000101
0010      x                    x^2 + 1   x        x^3 + x                 001010
0011      x + 1                x^2 + 1   x + 1    x^3 + x^2 + x + 1       001111
0100      x^2                  x^2 + 1   1        x^4 + 1                 010001
0101      x^2 + 1              x^2 + 1   0        x^4 + x^2               010100
0110      x^2 + x              x^2 + 1   x + 1    x^4 + x^3 + x + 1       011011
0111      x^2 + x + 1          x^2 + 1   x        x^4 + x^3 + x^2 + x     011110
1000      x^3                  x^2 + 1   x        x^5 + x                 100010
1001      x^3 + 1              x^2 + 1   x + 1    x^5 + x^2 + x + 1       100111
1010      x^3 + x              x^2 + 1   0        x^5 + x^3               101000
1011      x^3 + x + 1          x^2 + 1   1        x^5 + x^3 + x^2 + 1     101101
1100      x^3 + x^2            x^2 + 1   x + 1    x^5 + x^4 + x + 1       110011
1101      x^3 + x^2 + 1        x^2 + 1   x        x^5 + x^4 + x^2 + x     110110
1110      x^3 + x^2 + x        x^2 + 1   1        x^5 + x^4 + x^3 + 1     111001
1111      x^3 + x^2 + x + 1    x^2 + 1   0        x^5 + x^4 + x^3 + x^2   111100

Generating Check Bits for CRC Code
• Use LFSR
  • With characteristic polynomial equal to g(x)
  • Append n-k 0’s to end of message
• Example: m(x) = x^2 + x + 1 and g(x) = x^3 + x + 1
[Figure: LFSR with initial state 0 0 0; the message 111 followed by the appended 0’s (111000) is shifted in, leaving final state 0 1 0]
• Final state after shifting equals remainder

Checking CRC Codeword
• Checking Received Codeword for Errors
  • Shift codeword into LFSR
    – with same characteristic polynomial as used to generate it
  • If final state of LFSR non-zero, then error
[Figure: LFSR with initial state 0 0 0 and the codeword to check, 111010, shifted in]
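
A bit-serial software model of what such an LFSR computes is sketched below. It is not a gate-level description of the slide's circuit, only an assumed equivalent: one bit is shifted in per step and g(x) is subtracted (XORed) whenever the x^degree term appears.

```python
def lfsr_remainder(bits, generator, degree):
    """Shift bits in MSB first; the final state is the remainder modulo g(x)."""
    state = 0
    for bit in bits:
        state = (state << 1) | bit
        if (state >> degree) & 1:         # x^degree term present: reduce by g(x)
            state ^= generator
    return state

G_POLY = 0b1011                           # g(x) = x^3 + x + 1
DEGREE = 3

# Generating check bits: message 111 followed by three appended 0s.
check_bits = lfsr_remainder([1, 1, 1, 0, 0, 0], G_POLY, DEGREE)

# Checking: a valid codeword leaves the register at zero, a corrupted one does not.
good_state = lfsr_remainder([1, 1, 1, 0, 1, 0], G_POLY, DEGREE)
bad_state = lfsr_remainder([1, 1, 1, 0, 1, 1], G_POLY, DEGREE)

print(f"check bits = {check_bits:03b}, good = {good_state:03b}, bad = {bad_state:03b}")
```

The same routine serves both encoder and checker, which mirrors how one LFSR structure is reused for generation and checking in hardware.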

Selecting Generator Polynomial
• Key issue for CRC Codes
  • If first and last bit of polynomial are 1
    – Will detect burst errors of length n-k or less
  • If generator polynomial is multiple of (x+1)
    – Will detect any odd number of errors
  • If g(x) = (x+1)p(x) where p(x) primitive of degree n-k-1 and n < 2^(n-k-1)
    – Will detect single, double, triple, and odd errors

Commonly Used CRC Generators
CRC code                         Generator Polynomial
CRC-5 (USB token packets)        x^5 + x^2 + 1
CRC-12 (Telecom systems)         x^12 + x^11 + x^3 + x^2 + x + 1
CRC-16-CCITT (X25, Bluetooth)    x^16 + x^12 + x^5 + 1
CRC-32 (Ethernet)                x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
CRC-64 (ISO)                     x^64 + x^4 + x^3 + x + 1

Fault Tolerance Schemes
• Adding Fault Tolerance to Design
  • Improves dependability of system
  • Requires redundancy
    – Hardware
    – Time
    – Information

Hardware Redundancy
• Involves replicating hardware units
  • At any level of design
    – gate-level, module-level, chip-level, board-level
• Three Basic Forms
  • Static (also called Passive)
    – Masks faults rather than detects them
  • Dynamic (also called Active)
    – Detects faults and reconfigures to spare hardware
  • Hybrid
    – Combines active and passive approaches

Static Redundancy
• Masks faults so no erroneous outputs
  • Provides uninterrupted operation
  • Important for real-time systems
    – No time to reconfigure or retry operation
  • Simple and self-contained
    – No need to update or rollback system state

Triple Module Redundancy (TMR)
• Well-known static redundancy scheme
  • Three copies of module
  • Use majority voter to determine final output
  • Error in one module out-voted by other two
[Figure: Module 1, Module 2, and Module 3 feeding a majority voter]

TMR Reliability and MTTF
• TMR works if any 2 modules work
  • Rm = reliability of each module
  • Rv = reliability of voter
  $R_{TMR} = R_v \left[ R_m^3 + C_2^3 R_m^2 (1 - R_m) \right] = R_v (3R_m^2 - 2R_m^3)$
• MTTF for TMR
  $MTTF_{TMR} = \int_0^{\infty} R_{TMR}\,dt = \int_0^{\infty} R_v (3R_m^2 - 2R_m^3)\,dt = \int_0^{\infty} e^{-\lambda_v t} (3e^{-2\lambda_m t} - 2e^{-3\lambda_m t})\,dt = \frac{3}{2\lambda_m + \lambda_v} - \frac{2}{3\lambda_m + \lambda_v}$

Comparison with Simplex
• Neglecting failure rate of voter
  $MTTF_{TMR} = \frac{3}{2\lambda_m} - \frac{2}{3\lambda_m} = \frac{5}{6\lambda_m} = \frac{5}{6} \left( \frac{1}{\lambda_m} \right) = \frac{5}{6}\,MTTF_{simplex}$
• TMR has lower MTTF, but
  • Can tolerate temporary faults
  • Higher reliability for short mission times

Comparison with Simplex
• Crossover point
  $R_{TMR} = R_{simplex}$
  $3e^{-2\lambda_m t} - 2e^{-3\lambda_m t} = e^{-\lambda_m t}$
  $\text{Solve} \Rightarrow t = \frac{\ln 2}{\lambda_m} \approx 0.7\,MTTF_{simplex}$
• RTMR > Rsimplex when
  • Mission time shorter than 70% of MTTF
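
The TMR behaviour above can be illustrated with a bitwise majority voter and a numerical check of the crossover point. The failure rate is a hypothetical value and the voter is assumed perfect, as in the comparison on the slide.

```python
import math

def majority(a, b, c):
    """Bitwise 2-out-of-3 vote over three module outputs."""
    return (a & b) | (a & c) | (b & c)

def r_simplex(lam, t):
    """Exponential failure law for a single module."""
    return math.exp(-lam * t)

def r_tmr(lam, t):
    """TMR reliability with a perfect voter: 3R^2 - 2R^3."""
    r = r_simplex(lam, t)
    return 3 * r ** 2 - 2 * r ** 3

lam = 1e-3                                # hypothetical failures per hour
crossover = math.log(2) / lam             # t = ln(2) / lambda, about 0.7 * MTTF

voted = majority(0b1010, 0b1010, 0b0110)  # third module is faulty and out-voted
print(f"voted output = {voted:04b}")
for t in (0.5 * crossover, crossover, 2 * crossover):
    print(f"t = {t:7.1f} h  simplex = {r_simplex(lam, t):.3f}  TMR = {r_tmr(lam, t):.3f}")
```

Before the crossover TMR is the more reliable option; after it, having three modules that can each fail works against the system.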

N-Modular Redundancy (NMR)
• NMR
  • N modules along with majority voter
    – TMR special case
  • Number of failed modules masked = (N-1)/2
  • As N increases, MTTF decreases
    – But, reliability for short missions increases
• If goal only to tolerate temporary faults
  • TMR sufficient

Interwoven Logic
• Replace each gate
  • with 4 gates using interconnection pattern that automatically corrects errors
• Traditionally not as attractive as TMR
  • Requires lots of area overhead
• Renewed interest by researchers investigating emerging nanoelectronic technologies

Interwoven Logic with 4 NOR Gates
[Figure: original network of NOR gates 1 to 4 with inputs X and Y, and its interwoven version in which each gate is replaced by four copies (1a-1d, 2a-2d, 3a-3d, 4a-4d)]

Example of Error on Third Y Input
[Figure: interwoven NOR network (gates 1a-4d, inputs X and Y) annotated with 0/1 signal values for an error on the third Y input, alongside the original network of gates 1 to 4]

Dynamic Redundancy
• Involves
  • Detecting fault
  • Locating faulty hardware unit
  • Reconfiguring system to use spare fault-free hardware unit

Unpowered (Cold) Spares
• Advantage
  • Extends lifetime of spares
• Equations
  • Assume spare not failing until powered
  • Perfect reconfiguration capability
  $R_{w/cold\_spare}(t) = (1 + \lambda t)\,e^{-\lambda t}$
  $MTTF_{w/cold\_spare} = \frac{2}{\lambda}$

Unpowered (Cold) Spares
• One cold spare doubles MTTF
  • Assuming faults always detected and reconfiguration circuitry never fails
• Drawback of cold spare
  • Extra time to power and initialize
  • Cannot be used to help in detecting faults
  • Fault detection requires either
    – periodic offline testing
    – online testing using time or information redundancy

Powered (Hot) Spares
• Can use spares for online fault detection
• One approach is duplicate-and-compare
  • If outputs mismatch then fault occurred
    – Run diagnostic procedure to determine which module is faulty and replace with spare
  • Any number of spares can be used
[Figure: Module A and Module B feeding a comparator that produces the output and an agree/disagree signal, with a spare module available]

Pair-and-a-Spare
• Avoids halting system to run diagnostic procedure when fault occurs
[Figure: two duplicate-and-compare pairs (Modules A and B, Modules C and D), each with a comparator producing an output and an agree/disagree signal, connected to a switch]

TMR/Simplex
• When one module in TMR fails
  • Disconnect one of remaining modules
  • Improves MTTF while retaining advantages of TMR when 3 good modules
• TMR/Simplex
  • Reliability always better than either TMR or Simplex alone

Comparison of Reliability vs Time
[Figure: reliability (0.3 to 1) versus normalized mission time T/MTTF (0 to 1) for simplex, TMR, and TMR/simplex]

Hybrid Redundancy
• Combines both static and dynamic redundancy
  • Masks faults like static
  • Detects and reconfigures like dynamic

TMR with Spares
• If TMR module fails
  • Replace with spare
    – can be either hot or cold spare
  • While system has three working modules
    – TMR will provide fault masking for uninterrupted operation

Self-Purging Redundancy
• Uses threshold voter instead of majority voter
  • Threshold voter outputs 1 if number of inputs that are 1 reaches the threshold
    – Otherwise outputs 0
  • Requires hot spares

Self-Purging Redundancy
[Figure: Modules 1 to 5, each connected through an elementary switch to a threshold voter (≥ 2). Inset: elementary switch built from a flip-flop with R/S inputs and an initialization signal, an AND gate, and an XOR gate fed by the module and voter outputs]

Self-Purging Redundancy
• Compared with 5MR
  • Self-purging with 5 modules
    – Tolerate up to 3 failing modules (5MR cannot)
    – Cannot tolerate two modules simultaneously failing (5MR can)
• Compared with TMR with 2 spares
  • Self-purging with 5 modules
    – simpler reconfiguration circuitry
    – requires hot spares (TMR with spares can use either hot or cold spares)

Time Redundancy
• Advantage
  • Less hardware
• Drawback
  • Cannot detect permanent faults
• If error detected
  • System needs to rollback to known good state before resuming operation

Repeated Execution
• Execute operation twice
  • Simplest time redundancy approach
  • Detects temporary faults occurring during one execution (but not both)
    – Causes mismatch in results
  • Can reuse same hardware for both executions
    – Only one copy of functional hardware needed

Repeated Execution
• Requires mechanism for storing and comparing results of both executions
  • In processor, can store in memory or on disk and use software to compare
• Main cost
  • Additional time for redundant execution and comparison

Multi-threaded Redundant Execution
• Can use in processor-based system that can run multiple threads
  • Two copies of thread executed concurrently
  • Results compared when both complete
  • Take advantage of processor’s built-in capability to exploit processing resources
    – Reduce execution time
    – Can significantly reduce performance penalty

Multiple Sampling of Outputs
• Done at circuit-level
  • Sample once at end of normal clock cycle
  • Sample again after delay of ∆t
  • Two samples compared to detect mismatch
    – Indicates error occurred
  • Detect fault whose duration is less than ∆t
  • Performance overhead depends on
    – Size of ∆t relative to normal clock period

Multiple Sampling of Outputs
• Simple approach using two latches
[Figure: main latch clocked by Clk and shadow latch clocked by Clk+∆t sample the same output; an XOR of the two latches produces the error signal]

Multiple Sampling of Outputs
• Approach using stability checker at output
[Figure: timing diagram with each normal clock period followed by a stability checking period of length ∆t, and a stability checker built from AND and OR gates that is gated by the checking-period signal to produce the error signal]
Diverse Recomputation
• Use same hardware, but perform computation differently second time
• Can detect permanent faults that affect only one computation
• For arithmetic or logical operations
  • Shift operands when performing second computation [Patel 1982]
  • Detects permanent fault affecting only one bit-slice
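A small Python sketch of recomputing with shifted operands, applied to addition. The fault model is an assumption made for illustration (one output bit of the adder permanently stuck at 0), and `make_faulty_adder` and `add_with_shifted_recompute` are hypothetical names.

```python
def make_faulty_adder(stuck_bit):
    """Adder whose output bit `stuck_bit` is permanently stuck at 0."""
    def adder(a, b):
        return (a + b) & ~(1 << stuck_bit)
    return adder


def add_with_shifted_recompute(a, b, adder):
    """Add twice on the same adder: once normally, once with shifted operands.

    Returns (first_result, results_agree). The second pass shifts both
    operands left by one bit, so each result bit travels through a
    different bit-slice, then shifts the sum back.
    """
    first = adder(a, b)
    second = adder(a << 1, b << 1) >> 1
    return first, first == second
```

For example, with bit 2 stuck at 0, `3 + 1` gives 0 on the first pass and 4 on the shifted pass, so the mismatch exposes the permanent fault. The adder is assumed to be one bit wider than the operands so the shifted sum does not overflow.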
Information Redundancy
• Based on Error Detecting and Correcting Codes
• Advantage
  • Detects both permanent and temporary faults
  • Implemented with less hardware overhead than using multiple copies of module
• Disadvantage
  • More complex design
Error Detection
• Error detecting codes used to detect errors
• If error detected
  – Rollback to previous known error-free state
  – Retry operation
Rollback
• Requires adding storage to save previous state
• Amount of rollback depends on latency of error detection mechanism
• Zero-latency error detection
  – rollback implemented by preventing system state from updating
• If errors detected after n cycles
  – need rollback restoring system to state at least n clock cycles earlier
Checkpoint
• Execution divided into set of operations
• Before each operation executed
  – checkpoint created where system state saved
• If any error detected during operation
  – rollback to last checkpoint and retry operation
• If multiple retries fail
  – operation halts and system flags that permanent fault has occurred
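The checkpoint, rollback, and retry sequence above can be sketched in Python. This is a software illustration only: system state is assumed to be a dictionary, an operation signals a detected error by raising `ErrorDetected`, and `max_attempts` is a hypothetical limit (the slides do not give a retry count).

```python
import copy


class ErrorDetected(Exception):
    """Raised by an operation when the error detection mechanism fires."""


class PermanentFault(Exception):
    """Raised when every attempt at an operation fails."""


def run_with_checkpoints(state, operations, max_attempts=3):
    """Run each operation on `state`, rolling back and retrying on error."""
    for operation in operations:
        checkpoint = copy.deepcopy(state)      # save state before the operation
        for _ in range(max_attempts):
            try:
                operation(state)
                break                          # operation completed error-free
            except ErrorDetected:
                state.clear()                  # roll back to the last checkpoint
                state.update(copy.deepcopy(checkpoint))
        else:
            raise PermanentFault("all attempts failed; permanent fault flagged")
    return state
```

An operation that fails once and then succeeds leaves the state as if it had run exactly once, because any partial update from the failed attempt is discarded by the rollback.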
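Error Detection (cont.)
• Encode outputs of circuit with error detecting code
• Non-codeword output indicates error
[Figure: Inputs (m bits) feed Functional Logic, which produces the Outputs (k bits), and a Check Bit Generator, which produces c check bits; a Checker takes the outputs and check bits and produces the Error Indication]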
Self-Checking Checker
• Has two outputs
• Normal error-free case (1,0) or (0,1)
• If equal to each other, then error (0,0) or (1,1)
• Cannot have single error indicator output
  – Stuck-at 0 fault on output could never be detected
Totally Self-Checking Checker
• Requires three properties
• Code Disjoint
  – all codeword inputs mapped to codeword outputs, and all non-codeword inputs mapped to non-codeword outputs
• Fault Secure
  – for all codeword inputs, checker in presence of fault will either produce correct codeword output or non-codeword output (not incorrect codeword)
• Self-Testing
  – for each fault, at least one codeword input gives error indication
Duplicate-and-Compare
• Equality checker indicates error
• Undetected error can occur only if common-mode fault affects both copies
• Only faults after stems detected
• Over 100% overhead (including checker)
[Figure: Primary Inputs fan out at Stems to two copies of Functional Logic; an Equality Checker compares their outputs and produces the Error Indication]
Single-Bit Parity Code
• Totally self-checking checker formed by removing final gate from XOR tree
[Figure: Functional Logic and Parity Prediction outputs feed a tree of XOR (⊕) gates with the final gate removed, leaving two error-indication outputs EI0 and EI1]
Single-Bit Parity Code (cont.)
• Cannot detect errors in an even number of bits
• Can ensure no even-bit errors by generating each output with independent cone of logic
  – Only single bit errors can occur due to single point fault
  – Typically requires a lot of overhead
Parity-Check Codes
• Each check bit is parity for some set of output bits
• Example: 6 outputs and 3 check bits

                 Z1 Z2 Z3 Z4 Z5 Z6   c1 c2 c3
Parity Group 1    1  0  0  1  1  0    1  0  0
Parity Group 2    0  1  1  0  0  0    0  1  0
Parity Group 3    0  0  0  0  0  1    0  0  1
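The example table can be exercised with a short Python sketch. The group membership is copied from the table; even parity per group is an assumption (the table does not state the parity sense), and the function names are illustrative.

```python
# Rows: parity groups 1..3. Columns: which of Z1..Z6 each check bit covers.
PARITY_GROUPS = [
    [1, 0, 0, 1, 1, 0],  # c1 covers Z1, Z4, Z5
    [0, 1, 1, 0, 0, 0],  # c2 covers Z2, Z3
    [0, 0, 0, 0, 0, 1],  # c3 covers Z6
]


def check_bits(outputs):
    """Compute c1..c3 so that each parity group has even parity."""
    return [
        sum(z & member for z, member in zip(outputs, group)) % 2
        for group in PARITY_GROUPS
    ]


def is_codeword(outputs, checks):
    """True when the received check bits match the recomputed ones."""
    return check_bits(outputs) == list(checks)
```

For outputs `[1, 0, 1, 1, 0, 1]` the check bits are `[0, 1, 1]`. Flipping Z1 alone is detected, but flipping Z2 and Z3 together is not, because both sit in parity group 2 and their effects cancel.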
Parity-Check Codes (cont.)
• For c check bits and k functional outputs
  • 2^(ck) possible parity check codes (each check bit can cover any subset of the k outputs)
• Can choose code based on structure of circuit to minimize undetected error combinations
• Fanouts in circuit determine possible error combinations due to single-point fault
Checker for Parity-Check Codes
• Constructed from single-bit parity checkers and two-rail checkers
[Figure: three Parity Checkers, one per parity group (inputs Z1, Z4, Z5, c1; Z2, Z3, c2; Z6, c3), combined by two Two-Rail Checkers into the error-indication outputs E0 and E1]
Two-Rail Checkers
• Totally self-checking two-rail checker
[Figure: gate-level two-rail checker with inputs A0, A1, B0, B1, four AND (&) gates, two OR (+) gates, and outputs C0, C1]
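A behavioral Python sketch of a two-rail checker. The exact wiring of the figure is not recoverable from this transcript, so the equations below are the commonly used textbook form with four AND gates and two OR gates, taken here as an assumption; the function names are illustrative.

```python
def two_rail_checker(a, b):
    """Combine two two-rail pairs (A0, A1) and (B0, B1) into one pair (C0, C1)."""
    a0, a1 = a
    b0, b1 = b
    c0 = (a0 & b0) | (a1 & b1)
    c1 = (a0 & b1) | (a1 & b0)
    return c0, c1


def is_two_rail(pair):
    """A pair is a valid two-rail codeword when its bits are complementary."""
    return pair[0] != pair[1]
```

With these equations the output pair is complementary exactly when both input pairs are complementary, so a (0,0) or (1,1) on either input propagates to the output as an error indication.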
Berger Codes
• Inverter-free circuit
  • Inverters only at primary inputs
  • Can be synthesized using only algebraic factoring [Jha 1993]
• Only unidirectional errors possible for single point faults
  – Can use unidirectional code
  – Berger code gives 100% coverage
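A short Python sketch of a Berger code check, assuming the usual formulation in which the check symbol is the binary count of 0s in the information bits. Function names are illustrative.

```python
def berger_check_bits(info):
    """Return the count of 0s in `info` as a big-endian list of bits."""
    info = list(info)
    width = len(info).bit_length()   # enough bits to hold counts 0..len(info)
    zeros = info.count(0)
    return [(zeros >> i) & 1 for i in reversed(range(width))]


def berger_is_codeword(info, check):
    """True when the stored check symbol matches the recomputed zero count."""
    return berger_check_bits(info) == list(check)
```

The intuition for unidirectional coverage: if every erroneous bit flips 1 to 0, the zero count of the information bits can only rise while the stored check value can only fall, so the two cannot agree; the 0 to 1 case is symmetric.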
Constant Weight Codes
• Non-separable with lower redundancy
• Drawback: need decoding logic to convert codeword back to its original binary value
• Can use for encoding states of FSM
  – No need for decoding logic
Error Correction
• Information redundancy can also be used to mask errors
• Not as attractive as TMR because logic for predicting check bits very complex
• However, very good for memories
  – Check bits stored with data
  – Errors do not propagate in memories as in logic circuits, so SEC-DED usually sufficient
Error Correction (cont.)
• Memories very dense and prone to errors
  • Especially due to single-event upsets (SEUs) from radiation
• SEC-DED check bits stored in memory
  • 32-bit word, SEC-DED requires 7 check bits
    – Increases size of memory by 7/32 = 21.9%
  • 64-bit word, SEC-DED requires 8 check bits
    – Increases size of memory by 8/64 = 12.5%
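The check-bit counts above follow from the Hamming bound for single-error correction plus one overall parity bit for double-error detection. A minimal Python sketch, with illustrative function names:

```python
def sec_ded_check_bits(data_bits):
    """Number of check bits for SEC-DED over `data_bits` of data.

    Single-error correction needs the smallest r with 2**r >= data_bits + r + 1,
    so the syndrome can name every bit position plus the no-error case.
    Double-error detection adds one overall parity bit.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


def memory_overhead(data_bits):
    """Fractional increase in memory size due to the check bits."""
    return sec_ded_check_bits(data_bits) / data_bits
```

This reproduces the figures on the slide: 7 check bits (about 21.9%) for 32-bit words and 8 check bits (12.5%) for 64-bit words.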
Memory ECC Architecture
[Figure: write path: Data Word In goes to Generate Check Bits, and the Write Data Word and Write Check Bits are stored in Memory; read path: the Read Data Word and Read Check Bits go to Generate Syndrome (with Calculated Check Bits) and Correct Data, producing Data Word Out]
Hamming Code for ECC RAM
[Figure: Generate side: Input Data (Z bits) feeds a Hamming Check Bit Generator (c bits) and a Parity Bit Generator; RAM Core holds N words of Z+c+1 bits/word. Detect/Correct side: a second Hamming Check Bit Generator and Parity Bit Generator feed the Hamming Check and Parity Check, which drive the Bit Error Correction Circuit producing Output Data]

Error Type                     Condition
No bit error                   Hamming check bits match, no parity error
Single-bit correctable error   Hamming check bits mismatch, parity error
Double-bit error detection     Hamming check bits mismatch, no parity error

                 Z1 Z2 Z3 Z4 Z5 Z6 Z7 Z8   c1 c2 c3 c4
Parity Group 1    1  1  0  1  1  0  1  0    1  0  0  0
Parity Group 2    1  0  1  1  0  1  1  0    0  1  0  0
Parity Group 3    0  1  1  1  0  0  0  1    0  0  1  0
Parity Group 4    0  0  0  0  1  1  1  1    0  0  0  1
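The parity groups and the error-type table can be combined into a small Python sketch of encode and decode for an 8-bit word. Assumptions are labeled in the comments: even parity throughout, a 13-bit stored word laid out as Z1..Z8, c1..c4, overall parity, and a fourth case (check bits match but parity error, meaning the parity bit itself flipped) that the table does not list. Function names and status strings are illustrative.

```python
# Data-bit membership of each parity group, copied from the table (Z1..Z8).
GROUPS = [
    [1, 1, 0, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
]


def hamming_check_bits(data):
    """Compute c1..c4 as even parity over each group's data bits."""
    return [sum(d & g for d, g in zip(data, group)) % 2 for group in GROUPS]


def encode(data):
    """Return the stored word: 8 data bits, 4 check bits, 1 overall parity bit."""
    checks = hamming_check_bits(data)
    parity = (sum(data) + sum(checks)) % 2
    return list(data) + checks + [parity]


def decode(word):
    """Return (data, status) following the error-type table."""
    data, checks, parity = list(word[:8]), list(word[8:12]), word[12]
    syndrome = [a ^ b for a, b in zip(hamming_check_bits(data), checks)]
    parity_error = (sum(data) + sum(checks) + parity) % 2 == 1
    if not any(syndrome):
        # Assumed extra case: only the overall parity bit was hit.
        status = "parity bit error" if parity_error else "no error"
        return data, status
    if not parity_error:
        return data, "double error detected"
    # Single-bit error: a syndrome equal to a data column names the bad data bit.
    for i in range(8):
        if [group[i] for group in GROUPS] == syndrome:
            data[i] ^= 1
            break
    # If no column matched, the flipped bit was a check bit and data is intact.
    return data, "corrected"
```

Correction works because each data column of the table is distinct and has at least two 1s, so it can be confused neither with another data bit nor with a single check bit.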
Memory ECC
• SEC-DED generally very effective
  • Memory bit-flips tend to be independent and uniformly distributed
  • If bit-flip occurs, gets corrected next time memory location accessed
  • Main risk is if memory word not accessed for long time
    – Multiple bit-flips could accumulate
Memory Scrubbing
• Every location in memory read on periodic basis
• Reduces chance of multiple errors accumulating in a memory word
• Can be implemented by having memory controller cycle through memory during idle periods
Multiple-Bit Upsets (MBU)
• Can occur due to single SEU
• Typically occur in adjacent memory cells
• Memory interleaving used
  • To prevent MBUs from resulting in multiple bit errors in same word

[Figure: Memory row with interleaved cells, left to right]
Word1 Word2 Word3 Word4 Word1 Word2 Word3 Word4 Word1 Word2 Word3 Word4
Bit1  Bit1  Bit1  Bit1  Bit2  Bit2  Bit2  Bit2  Bit3  Bit3  Bit3  Bit3
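The interleaved layout in the figure can be expressed as a mapping from physical column to (word, bit). A minimal Python sketch, assuming 4-way interleaving as drawn and zero-based indices; the function names are illustrative.

```python
def cell_owner(column, interleave=4):
    """Map a physical column to (word index, bit index), both zero-based."""
    return column % interleave, column // interleave


def bits_hit_per_word(first_column, width, interleave=4):
    """Count how many bits of each word fall inside a burst of adjacent cells."""
    hits = {}
    for column in range(first_column, first_column + width):
        word, _ = cell_owner(column, interleave)
        hits[word] = hits.get(word, 0) + 1
    return hits
```

An upset spanning up to 4 adjacent cells touches at most one bit of any word, which SEC-DED can correct; a burst of 5 cells would put two errors in one word.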
Type | Issues | Goal | Examples | Techniques
-----|--------|------|----------|-----------
Long-Life Systems | Difficult or Expensive to Repair | Maximize MTTF | Satellites, Spacecraft, Implanted Biomedical | Dynamic Redundancy
Reliable Real-Time Systems | Error or Delay Catastrophic | Fault Masking Capability | Aircraft, Nuclear Power Plant, Air Bag Electronics, Radar | TMR
High Availability Systems | Downtime Very Costly | High Availability | Reservation System, Stock Exchange, Telephone Systems | No Single Point of Failure; Self-Checking Pairs; Fault Isolation
High Integrity Systems | Data Corruption Very Costly | High Data Integrity | Banking, Transaction Processing, Database | Checkpointing, Time Redundancy; ECC; Redundant Disks
Mainstream Low-Cost Systems | Reasonable Level of Failures Acceptable | Meet Failure Rate Expectations at Low Cost | Consumer Electronics, Personal Computers | Often None; Memory ECC; Bus Parity; Changing as Technology Scales
Concluding Remarks
• Many different fault-tolerant schemes
• Choosing scheme depends on
  • Types of faults to be tolerated
    – Temporary or permanent
    – Single or multiple point failures
    – etc.
  • Design constraints
    – Area, performance, power, etc.
Concluding Remarks (cont.)
• As technology scales
  • Circuits increasingly prone to failure
  • Achieving sufficient fault tolerance will be major design issue