Chapter 03 Fault Tolerant slides 110407 - Elsevier · 2013-06-03 · – e.g., electromigration,...



EE141

1

System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1

Chapter 3
Fault-Tolerant Design

What is this chapter about?
• Gives Overview of Fault-Tolerant Design
• Focus on
  • Basic Concepts in Fault-Tolerant Design
  • Metrics Used to Specify and Evaluate Dependability
  • Review of Coding Theory
  • Fault-Tolerant Design Schemes
    – Hardware Redundancy
    – Information Redundancy
    – Time Redundancy
  • Examples of Fault-Tolerant Applications in Industry

Fault-Tolerant Design
• Introduction
• Fundamentals of Fault Tolerance
• Fundamentals of Coding Theory
• Fault Tolerant Schemes
• Industry Practices
• Concluding Remarks

Introduction
• Fault Tolerance
  • Ability of system to continue error-free operation in presence of unexpected fault
• Important in mission-critical applications
  • E.g., medical, aviation, banking, etc.
  • Errors very costly
• Becoming important in mainstream applications
  • Technology scaling causing circuit behavior to become less predictable and more prone to failures
  • Needing fault tolerance to keep failure rate within acceptable levels

Faults
• Permanent Faults
  • Due to manufacturing defects, early life failures, wearout failures
  • Wearout failures due to various mechanisms
    – e.g., electromigration, hot carrier degradation, dielectric breakdown, etc.
• Temporary Faults
  • Only present for short period of time
  • Caused by external disturbance or marginal design parameters

Temporary Faults
• Transient Errors (Non-recurring errors)
  • Caused by external disturbance
    – e.g., radiation, noise, power disturbance, etc.
• Intermittent Errors (Recurring errors)
  • Caused by marginal design parameters
  • Timing problems
    – e.g., races, hazards, skew
  • Signal integrity problems
    – e.g., crosstalk, ground bounce, etc.

Redundancy
• Fault Tolerance requires some form of redundancy
  • Time Redundancy
  • Hardware Redundancy
  • Information Redundancy

Time Redundancy
• Perform Same Operation Twice
  • Check whether both runs give same result
  • If not, then fault occurred
  • Can detect temporary faults
  • Cannot detect permanent faults
    – Would affect both computations
• Advantage
  • Little to no hardware overhead
• Disadvantage
  • Impacts system or circuit performance

Hardware Redundancy
• Replicate hardware and compare outputs
  • From two or more modules
  • Detects both permanent and temporary faults
• Advantage
  • Little or no performance impact
• Disadvantage
  • Area and power for redundant hardware

Information Redundancy
• Encode outputs with error detecting or correcting code
  • Code selected to minimize redundancy for class of faults
• Advantage
  • Less hardware to generate redundant information than replicating module
• Drawback
  • Added complexity in design

Failure Rate
• λ(t) = Component failure rate
  • Measured in FITs (failures per $10^9$ hours)
[Figure: failure rate versus time. Early failures dominate the infant mortality period, random failures the working life, and wearout failures the wearout period; the overall curve is also shown.]

System Failure Rate
• System constructed from components
• No Fault Tolerance
  • Any component fails, whole system fails
$\lambda_{sys} = \sum_{i=1}^{k} \lambda_{c,i}$

Reliability
• If component working at time 0
  • R(t) = Probability still working at time t
• Exponential Failure Law
  • If failure rate assumed constant
    – Good approximation if past infant mortality period
$R(t) = e^{-\lambda t}$

Reliability for Series System
• Series System
  • All components need to work for system to work
[Figure: blocks A, B, C connected in series]
$R_{sys} = R_A R_B R_C$

System Reliability with Redundancy
• System reliability with component B in Parallel
  • Can tolerate one component B failing
[Figure: block A in series with two parallel copies of block B, followed by block C]
$R_{sys} = R_A \left[ 1 - (1 - R_B)^2 \right] R_C = R_A (2R_B - R_B^2) R_C$
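Example (illustrative sketch, not from the slides): the series and parallel reliability formulas can be checked numerically. The helper names `series` and `parallel` and the component reliabilities are assumptions chosen only for illustration.

```python
import math


def reliability(failure_rate, t):
    """Exponential failure law: R(t) = exp(-lambda * t)."""
    return math.exp(-failure_rate * t)


def series(*rs):
    """Series system: every block must work, so reliabilities multiply."""
    out = 1.0
    for r in rs:
        out *= r
    return out


def parallel(*rs):
    """Parallel blocks: the group fails only if every copy fails."""
    fail = 1.0
    for r in rs:
        fail *= 1.0 - r
    return 1.0 - fail


# Hypothetical component reliabilities at some fixed time t.
r_a, r_b, r_c = 0.99, 0.90, 0.99

plain = series(r_a, r_b, r_c)                      # A - B - C
redundant = series(r_a, parallel(r_b, r_b), r_c)   # A - (B || B) - C

print(f"series: {plain:.4f}, with redundant B: {redundant:.4f}")
```

Duplicating only the weakest block gives most of the benefit here, which is why redundancy is usually applied selectively.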

Mean-Time-to-Failure (MTTF)
• Average time before system fails
  • Equal to area under reliability curve
  $MTTF = \int_0^\infty R(t)\,dt$
• For Exponential Failure Law
  $MTTF = \int_0^\infty e^{-\lambda t}\,dt = \frac{1}{\lambda}$

Maintainability
• If system failed at time 0
  • M(t) = Probability repaired and operational at time t
• System repair time divided into
  • Passive repair time
    – Time for service engineer to travel to site
  • Active repair time
    – Time to locate failing component, repair/replace, and verify system operational
    – Can be improved by designing system so failed component is easy to locate and repair is easy to verify

Repair Rate and MTTR
• µ = rate at which system repaired
  • Analogous to failure rate λ
• Maintainability often modeled as
  $M(t) = 1 - e^{-\mu t}$
• Mean-Time-to-Repair (MTTR) = 1/µ

Availability
• System Availability
  • Fraction of time system is operational
[Figure: system state S (1 = normal system operation, 0 = failed) versus time t, with failures marked along the time axis t0, t1, t2, t3, t4]
$\text{system availability} = \frac{MTTF}{MTTF + MTTR}$

Availability
• Telephone Systems
  • Required to have system availability of 0.9999 (“four nines”)
• High-Reliability Systems
  • May require 7 or more nines
• Fault-Tolerant Design
  • Needed to achieve such high availability from less reliable components
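Example (illustrative sketch, not from the slides): availability follows directly from MTTF and MTTR, and a "number of nines" target can be translated into allowed downtime. The function names and the 8760-hour year are assumptions made for this sketch.

```python
def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is operational: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail, hours_per_year=8760.0):
    """Expected downtime per year implied by a given availability."""
    return (1.0 - avail) * hours_per_year * 60.0


# A system that runs 9999 hours between failures and takes 1 hour to repair
# reaches "four nines".
four_nines = availability(9999.0, 1.0)
print(four_nines)
print(downtime_minutes_per_year(four_nines))  # roughly 52.6 minutes per year
```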

Coding Theory
• Coding
  • Using more bits than necessary to represent data
  • Provides way to detect errors
    – Errors occur when bits get flipped
• Error Detecting Codes
  • Many types
  • Detect different classes of errors
  • Use different amounts of redundancy
  • Ease of encoding and decoding data varies

Block Code
• Message = Data Being Encoded
• Block code
  • Encodes m messages with n-bit codeword
• If no redundancy
  • m messages encoded with $\log_2(m)$ bits
  • minimum possible
$\text{redundancy} = 1 - \frac{\log_2(m)}{n}$

Block Code
• To detect errors, some redundancy needed
  • Space of $2^n$ distinct blocks partitioned into codewords and non-codewords
• Can detect errors that cause codeword to become non-codeword
• Cannot detect errors that cause codeword to become another codeword

Separable Block Code
• Separable
  • n-bit blocks partitioned into
    – k information bits directly representing message
    – (n-k) check bits
  • Denoted (n,k) Block Code
• Advantage
  • k-bit message directly extracted without decoding
• Rate of Separable Block Code = k/n

Example of Separable Block Code
• (4,3) Parity Code
  • Check bit is XOR of 3 message bits
  • message 101 → codeword 1010
• Single Bit Parity
$\text{redundancy} = 1 - \frac{\log_2(m)}{n} = 1 - \frac{\log_2(2^k)}{n} = 1 - \frac{k}{n} = \frac{n-k}{n} = \frac{1}{n}$
$\text{rate} = \frac{k}{n} = \frac{n-1}{n}$
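Example (illustrative sketch, not from the slides): a single-bit parity encoder and checker for the (4,3) code. The function names are assumptions; bits are held in Python lists.

```python
def parity_encode(message_bits):
    """Append one check bit equal to the XOR of the message bits."""
    check = 0
    for bit in message_bits:
        check ^= bit
    return list(message_bits) + [check]


def parity_ok(word):
    """A word is a codeword when the XOR over all of its bits is 0."""
    total = 0
    for bit in word:
        total ^= bit
    return total == 0


codeword = parity_encode([1, 0, 1])
print(codeword)                 # [1, 0, 1, 0]
print(parity_ok([1, 1, 1, 0]))  # single-bit error is detected: False
print(parity_ok([0, 1, 1, 0]))  # double-bit error goes undetected: True
```

The last line shows the limit of a distance-2 code: two flipped bits turn one codeword into another codeword.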

Example of Non-Separable Block Code
• One-Hot Code
  • Each Codeword has single 1
  • Example of 8-bit one-hot
    – 10000000, 01000000, 00100000, 00010000, 00001000, 00000100, 00000010, 00000001
  • Redundancy = $1 - \log_2(8)/8 = 5/8$
$\text{redundancy} = 1 - \frac{\log_2(m)}{n} = 1 - \frac{\log_2(n)}{n}$

Linear Block Codes
• Special class
  • Modulo-2 sum of any 2 codewords also codeword
  • Null space of $(n-k) \times n$ Boolean matrix
    – Called Parity Check Matrix, H
• For any n-bit codeword c
  • $cH^T = 0$
  • All 0 codeword exists in any linear code

Linear Block Codes
• Generator Matrix, G
  • $k \times n$ Matrix
• Codeword c for message m
  • $c = mG$
• $GH^T = 0$

Systematic Block Code
• First k bits correspond to message
• Last n-k bits correspond to check bits
• For Systematic Code
  • $G = [I_{k \times k} : P_{k \times (n-k)}]$
  • $H = [P^T_{(n-k) \times k} : I_{(n-k) \times (n-k)}]$, so that $GH^T = P + P = 0$
• Example
$G = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix} \qquad H = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}$
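Example (illustrative sketch, not from the slides): encoding with a generator matrix and checking with a parity check matrix over GF(2), using the (4,3) parity code. Matrices are plain nested lists; the function names are assumptions.

```python
# Generator and parity check matrices of the (4,3) single-bit parity code.
G = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]
H = [
    [1, 1, 1, 1],
]


def encode(message, generator):
    """c = mG with all arithmetic modulo 2."""
    n = len(generator[0])
    return [
        sum(message[i] * generator[i][j] for i in range(len(generator))) % 2
        for j in range(n)
    ]


def syndrome(word, check):
    """One syndrome bit per row of H; all zeros means 'word' is a codeword."""
    return [sum(word[j] * row[j] for j in range(len(word))) % 2 for row in check]


c = encode([1, 0, 1], G)
print(c, syndrome(c, H))  # [1, 0, 1, 0] [0]
```

Because the code is systematic, the first three bits of `c` are the message itself.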

Distance of Code
• Distance between two codewords
  • Number of bits in which they differ
• Distance of Code
  • Minimum distance between any two codewords in code
  • If n=k (no redundancy), distance = 1
  • Single-bit parity, distance = 2
• Code with distance d
  • Detect d-1 errors
  • Correct up to $\lfloor (d-1)/2 \rfloor$ errors
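Example (illustrative sketch, not from the slides): the distance of a small code can be found by brute force over all codeword pairs. Codewords are strings of '0' and '1'; the function names are assumptions.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def code_distance(codewords):
    """Minimum distance over every pair of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


# All codewords of the (4,3) single-bit parity code.
parity_code = [
    format(m, "03b") + str(format(m, "03b").count("1") % 2) for m in range(8)
]

d = code_distance(parity_code)
print(d, "detects", d - 1, "corrects", (d - 1) // 2)
```

Brute force is quadratic in the number of codewords, so it is only practical for small codes.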

Error Correcting Codes
• Code with distance 3
  • Called single error correcting (SEC) code
• Code with distance 4
  • Called single error correcting and double error detecting (SEC-DED) code
• Procedure for constructing SEC code
  • Described in [Hamming 1950]
  • Any H-matrix with all columns distinct and no all-0 column is SEC

Hamming Code
• For any value of n
  • SEC code constructed by
    – setting each column in H equal to binary representation of column number (starting from 1)
  • Number of rows in H equal to $\lceil \log_2(n+1) \rceil$
• Example of SEC Hamming Code for n=7
$H = \begin{bmatrix} 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 \end{bmatrix}$

Error Correction in Hamming Code
• Syndrome, s
  • $s = Hv^T$ for received vector v
  • If v is codeword
    – Syndrome = 0
  • If v non-codeword and single-bit error
    – Syndrome will match one of columns of H
    – Will contain binary value of bit position in error

Example of Error Correction
• For (7,4) Hamming Code
  • Suppose codeword 0110011 has one-bit error changing it to 1110011
$s = vH^T = [1\;1\;1\;0\;0\;1\;1] \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} = [0\;0\;1]$
  • Syndrome 001 is binary 1, so bit position 1 is in error
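Example (illustrative sketch, not from the slides): when column i of H is the binary representation of i, the syndrome equals the XOR of the positions (counted from 1) that hold a 1. The function names are assumptions.

```python
def hamming_syndrome(word):
    """Syndrome as an integer: XOR of the 1-based positions that hold a 1."""
    s = 0
    for position, bit in enumerate(word, start=1):
        if bit:
            s ^= position
    return s


def correct_single_error(word):
    """Flip the bit named by a nonzero syndrome (assumes at most one error)."""
    s = hamming_syndrome(word)
    fixed = list(word)
    if s:
        fixed[s - 1] ^= 1
    return fixed, s


received = [1, 1, 1, 0, 0, 1, 1]          # codeword 0110011 with bit 1 flipped
fixed, s = correct_single_error(received)
print(s, fixed)                            # 1 [0, 1, 1, 0, 0, 1, 1]
```

If two bits are in error the syndrome still names some position, so this decoder would miscorrect; that is the motivation for SEC-DED codes.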

SEC-DED Code
• Make SEC Hamming Code SEC-DED
  • By adding parity check over all bits
  • Extra parity bit
    – 1 for single-bit error
    – 0 for double-bit error
  • Makes possible to detect double bit error
    – Avoid assuming single-bit error and miscorrecting it

Example of Error Correction
• For (7,3) SEC-DED Hamming Code
  • Suppose codeword 0110011 has two-bit error changing it to 1010011
    – Doesn’t match any column in H
$s = vH^T = [1\;0\;1\;0\;0\;1\;1] \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix} = [0\;1\;1\;0]$
  • Nonzero syndrome with extra parity bit 0 indicates double-bit error
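Example (illustrative sketch, not from the slides): a SEC-DED decoder for a 7-bit word whose H matrix is the Hamming matrix with an extra all-ones row. The status strings and function name are assumptions made for this sketch.

```python
def sec_ded_decode(word):
    """Classify a received word as error-free, single error, or double error."""
    position = 0  # Hamming part of the syndrome: XOR of 1-based positions of 1s
    parity = 0    # extra syndrome bit: XOR over all bits of the word
    for index, bit in enumerate(word, start=1):
        if bit:
            position ^= index
            parity ^= 1

    if position == 0 and parity == 0:
        return "no error", list(word)
    if parity == 1 and 1 <= position <= len(word):
        fixed = list(word)
        fixed[position - 1] ^= 1          # odd parity: treat as single error
        return "single error corrected", fixed
    if parity == 0:
        return "double error detected", None
    return "uncorrectable error detected", None


print(sec_ded_decode([1, 1, 1, 0, 0, 1, 1]))  # one bit flipped
print(sec_ded_decode([1, 0, 1, 0, 0, 1, 1]))  # two bits flipped
```

The second call is flagged rather than corrected, which avoids turning a double-bit error into a triple-bit error.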

Hsiao Code
• Weight of column
  • Number of 1’s in column
• Constructing n-bit SEC-DED Hsiao Code
  • First use all possible weight-1 columns
    – Then all possible weight-3 columns
    – Then weight-5 columns, etc.
  • Until n columns formed
  • Number of check bits is $\lceil \log_2 n \rceil + 1$ (only odd-weight columns are used; e.g., 4 check bits for n = 7)
  • Minimizes number of 1’s in H-matrix
    – Less hardware and delay for computing syndrome
    – Disadvantage: Correction logic more complex

Example of Hsiao Code
• (7,3) Hsiao Code
  • Uses weight-1 and weight-3 columns
$H = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix}$
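Example (illustrative sketch, not from the slides): generating odd-weight columns in order of increasing weight, as in the Hsiao construction. The function name is an assumption, and this simplified version does not try to balance the number of 1's per row, which practical Hsiao designs also consider.

```python
from itertools import combinations


def hsiao_columns(n, r):
    """Return n distinct odd-weight columns of length r, lowest weight first."""
    columns = []
    for weight in range(1, r + 1, 2):               # weights 1, 3, 5, ...
        for ones in combinations(range(r), weight):
            columns.append([1 if row in ones else 0 for row in range(r)])
            if len(columns) == n:
                return columns
    raise ValueError("not enough odd-weight columns for this n and r")


columns = hsiao_columns(7, 4)   # (7,3) code: 4 check bits, 7 columns
for column in columns:
    print(column)
```

Since every column has odd weight, the XOR of any two columns has even weight and can never equal a third column, which is what gives distance 4.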

Unidirectional Errors
• Errors in block of data which only cause 0→1 or 1→0, but not both
  • Any number of bits in error in one direction
• Example
  • Correct codeword 111000
  • Unidirectional errors could cause
    – 001000, 000000, 101000 (only 1→0 errors)
  • Non-unidirectional errors
    – 101001, 011001, 011011 (both 1→0 and 0→1)

Unidirectional Error Detecting Codes
• All unidirectional error detecting (AUED) Codes
  • Detect all unidirectional errors in codeword
  • Single-bit parity is not AUED
    – Cannot detect even number of errors
  • No linear code is AUED
    – All linear codes must contain all-0 vector, so cannot detect all 1→0 errors

Two-Rail Code
• Two-Rail Code
  • One check bit for each information bit
    – Equal to complement of information bit
  • Two-Rail Code is AUED
  • 50% Redundancy
• Example of (6,3) Two-Rail Code
  • Message 101 has Codeword 101010
  • Set of all codewords
    – 000111, 001110, 010101, 011100, 100011, 101010, 110001, 111000

Berger Codes
• Lowest redundancy of separable AUED codes
  • For k information bits, $\lceil \log_2(k+1) \rceil$ check bits
• Check bits equal to binary representation of number of 0’s in information bits
• Example
  • Information bits 1000101
    – $\log_2(7+1) = 3$ check bits
    – Check bits equal to 100 (4 zeros)

Berger Codes
• Codewords for (5,3) Berger Code
  • 00011, 00110, 01010, 01101, 10010, 10101, 11001, 11100
• If unidirectional errors
  • Contain 1→0 errors
    – increase 0’s in information bits
    – can only decrease binary number in check bits
  • Contain 0→1 errors
    – decrease 0’s in information bits
    – can only increase binary number in check bits
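Example (illustrative sketch, not from the slides): a Berger encoder and checker. The function names are assumptions; `k.bit_length()` is used as an integer way to get the number of bits needed to count up to k zeros.

```python
def berger_encode(info_bits):
    """Append check bits holding the binary count of 0s in the information bits."""
    k = len(info_bits)
    width = k.bit_length()                 # bits needed to represent 0..k
    zeros = info_bits.count(0)
    check = [int(ch) for ch in format(zeros, f"0{width}b")]
    return list(info_bits) + check


def berger_ok(word, k):
    """Recompute the check bits from the first k bits and compare."""
    return berger_encode(word[:k]) == list(word)


print(berger_encode([1, 0, 0, 0, 1, 0, 1]))   # check bits 1 0 0 (four zeros)
print(berger_ok([0, 0, 1, 0, 1], 3))          # 01101 with a 1→0 error: False
```

A 1→0 error in the information bits raises the zero count while the stored count can only stay the same or fall, so the two can never match again.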

Berger Codes
• If 8 information bits
  • Berger code requires $\lceil \log_2(8+1) \rceil = 4$ check bits
  $\text{redundancy} = 1 - \frac{\log_2(m)}{n} = 1 - \frac{\log_2(2^k)}{n} = 1 - \frac{8}{12} = \frac{1}{3} \approx 33\%$
• (16,8) Two-Rail Code
  • Requires 50% redundancy
• Redundancy advantage of Berger Code
  • Increases as k increased

Constant Weight Codes
• Constant Weight Codes
  • Non-separable, but lower redundancy than Berger
  • Each codeword has same number of 1’s
• Example 2-out-of-3 constant weight code
  • 110, 011, 101
• AUED code
  • Unidirectional errors always change number of 1’s

Constant Weight Codes
• Number of codewords in m-out-of-n code: $C^n_m = \binom{n}{m}$
• Codewords maximized when m as close to n/2 as possible
  • n/2-out-of-n when n even
  • (n/2-0.5 or n/2+0.5)-out-of-n when n odd
  • Minimizes redundancy of code

Example
• 6-out-of-12 constant weight code
  $C^{12}_6 = 924$ codewords
  $\text{redundancy} = 1 - \frac{\log_2(m)}{n} = 1 - \frac{\log_2(924)}{12} = 17.9\%$
• 12-bit Berger Code
  • Only $2^8 = 256$ codewords
  $\text{redundancy} = 1 - \frac{\log_2(m)}{n} = 1 - \frac{\log_2(2^8)}{12} = 33.3\%$
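Example (illustrative sketch, not from the slides): the codeword counts and redundancy figures for the 12-bit comparison can be reproduced with the standard library. The function name `redundancy` is an assumption.

```python
import math


def redundancy(num_codewords, n):
    """redundancy = 1 - log2(m) / n for m codewords of n bits."""
    return 1.0 - math.log2(num_codewords) / n


constant_weight = math.comb(12, 6)   # 6-out-of-12 code
berger = 2 ** 8                      # 12-bit Berger code: 8 information bits

print(constant_weight, f"{redundancy(constant_weight, 12):.1%}")
print(berger, f"{redundancy(berger, 12):.1%}")
```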

Constant Weight Codes
• Advantage
  • Less redundancy than Berger codes
• Disadvantage
  • Non-separable
  • Need decoding logic
    – to convert codeword back to binary message

Burst Error
• Burst Error
  • Common, multi-bit errors tend to be clustered
    – Noise source affects contiguous set of bus lines
  • Length of burst error
    – number of bits from first error to last error, inclusive
  • Wrap around from last to first bit of codeword
• Example: Original codeword 00000000
  • 00111100 is burst error length 4
  • 00110100 is burst error length 4
    – Any number of errors between first and last error
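Example (illustrative sketch, not from the slides): computing burst length with wrap-around. This sketch assumes the burst length is the shortest cyclic span that covers every flipped bit, which is one reading of the definition above; the function name is an assumption.

```python
def burst_length(original, received):
    """Shortest cyclic span (in bits) that contains every flipped bit."""
    n = len(original)
    flipped = [i for i, (a, b) in enumerate(zip(original, received)) if a != b]
    if not flipped:
        return 0
    # Runs of unflipped bits between consecutive errors, including the run
    # that wraps from the last error around to the first one.
    gaps = [flipped[i + 1] - flipped[i] - 1 for i in range(len(flipped) - 1)]
    gaps.append(flipped[0] + n - flipped[-1] - 1)
    return n - max(gaps)


print(burst_length("00000000", "00111100"))  # 4
print(burst_length("00000000", "00110100"))  # 4
print(burst_length("00000000", "10000001"))  # 2, because the burst wraps around
```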

Cyclic Codes
• Special class of linear code
  • Any codeword shifted cyclically is another codeword
• Used to detect burst errors
  • Less redundancy required to detect burst error than general multi-bit errors
    – Some distance 2 codes can detect all burst errors of length 4
    – detecting all possible 4-bit errors requires distance 5 code

Cyclic Redundancy Check (CRC) Code
• Most widely used cyclic code
  • Uses binary alphabet based on GF(2)
• CRC code is (n,k) block code
  • Formed using generator polynomial, g(x)
    – called code generator
    – degree n-k polynomial (same degree as number of check bits)
$g(x) = g_{n-k}x^{n-k} + \dots + g_2x^2 + g_1x + g_0$
$c(x) = m(x)g(x)$

(6,4) CRC code with g(x) = x^2 + 1, c(x) = m(x)g(x)

Message  m(x)               g(x)     c(x)                   Codeword
0000     0                  x^2 + 1  0                      000000
0001     1                  x^2 + 1  x^2 + 1                000101
0010     x                  x^2 + 1  x^3 + x                001010
0011     x + 1              x^2 + 1  x^3 + x^2 + x + 1      001111
0100     x^2                x^2 + 1  x^4 + x^2              010100
0101     x^2 + 1            x^2 + 1  x^4 + 1                010001
0110     x^2 + x            x^2 + 1  x^4 + x^3 + x^2 + x    011110
0111     x^2 + x + 1        x^2 + 1  x^4 + x^3 + x + 1      011011
1000     x^3                x^2 + 1  x^5 + x^3              101000
1001     x^3 + 1            x^2 + 1  x^5 + x^3 + x^2 + 1    101101
1010     x^3 + x            x^2 + 1  x^5 + x                100010
1011     x^3 + x + 1        x^2 + 1  x^5 + x^2 + x + 1      100111
1100     x^3 + x^2          x^2 + 1  x^5 + x^4 + x^3 + x^2  111100
1101     x^3 + x^2 + 1      x^2 + 1  x^5 + x^4 + x^3 + 1    111001
1110     x^3 + x^2 + x      x^2 + 1  x^5 + x^4 + x^2 + x    110110
1111     x^3 + x^2 + x + 1  x^2 + 1  x^5 + x^4 + x + 1      110011
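Example (illustrative sketch, not from the slides): non-systematic CRC encoding is polynomial multiplication over GF(2). Polynomials are stored as Python integers with bit i holding the coefficient of x^i; the names `gf2_mul` and `G_POLY` are assumptions.

```python
def gf2_mul(a, b):
    """Multiply two GF(2) polynomials (carry-less multiplication)."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # add (XOR) the current shifted copy of a
        a <<= 1
        b >>= 1
    return result


G_POLY = 0b101               # g(x) = x^2 + 1

for message in range(16):
    codeword = gf2_mul(message, G_POLY)
    print(format(message, "04b"), format(codeword, "06b"))
```

The message bits are generally not visible in the codeword (for example 0111 maps to 011011), which is why this form of the code is not systematic.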

CRC Code
• Linear block code
  • Has G-matrix and H-matrix
  • G-matrix shifted version of generator polynomial
$G = \begin{bmatrix} g_{n-k} & \cdots & g_1 & g_0 & 0 & \cdots & 0 \\ 0 & g_{n-k} & \cdots & g_1 & g_0 & \cdots & 0 \\ \vdots & & \ddots & & & \ddots & \vdots \\ 0 & \cdots & 0 & g_{n-k} & \cdots & g_1 & g_0 \end{bmatrix}$

CRC Code Example
• (6,4) CRC code generated by $g(x) = x^2 + 1$
$G = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \end{bmatrix}$

Systematic CRC Codes
• To obtain systematic CRC code
  • codewords formed using Galois division
    – nice because LFSR can be used for performing division
$c(x) = m(x)x^{n-k} + r(x)$
$r(x) = \text{remainder of } \frac{m(x)x^{n-k}}{g(x)}$

Galois Division Example
• Encode $m(x) = x^2 + x$ with $g(x) = x^2 + 1$
  • Requires dividing $m(x)x^{n-k} = x^4 + x^3$ by g(x)
  • Remainder $r(x) = x + 1$
    – $c(x) = m(x)x^{n-k} + r(x) = (x^2 + x)(x^2) + x + 1 = x^4 + x^3 + x + 1$

          111
    101 ) 11000
          101
           110
           101
            110
            101
             11   remainder

Systematic (6,4) CRC code with g(x) = x^2 + 1, c(x) = m(x)x^2 + r(x)

Message  m(x)               g(x)     r(x)   c(x)                   Codeword
0000     0                  x^2 + 1  0      0                      000000
0001     1                  x^2 + 1  1      x^2 + 1                000101
0010     x                  x^2 + 1  x      x^3 + x                001010
0011     x + 1              x^2 + 1  x + 1  x^3 + x^2 + x + 1      001111
0100     x^2                x^2 + 1  1      x^4 + 1                010001
0101     x^2 + 1            x^2 + 1  0      x^4 + x^2              010100
0110     x^2 + x            x^2 + 1  x + 1  x^4 + x^3 + x + 1      011011
0111     x^2 + x + 1        x^2 + 1  x      x^4 + x^3 + x^2 + x    011110
1000     x^3                x^2 + 1  x      x^5 + x                100010
1001     x^3 + 1            x^2 + 1  x + 1  x^5 + x^2 + x + 1      100111
1010     x^3 + x            x^2 + 1  0      x^5 + x^3              101000
1011     x^3 + x + 1        x^2 + 1  1      x^5 + x^3 + x^2 + 1    101101
1100     x^3 + x^2          x^2 + 1  x + 1  x^5 + x^4 + x + 1      110011
1101     x^3 + x^2 + 1      x^2 + 1  x      x^5 + x^4 + x^2 + x    110110
1110     x^3 + x^2 + x      x^2 + 1  1      x^5 + x^4 + x^3 + 1    111001
1111     x^3 + x^2 + x + 1  x^2 + 1  0      x^5 + x^4 + x^3 + x^2  111100
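Example (illustrative sketch, not from the slides): systematic CRC encoding by division over GF(2). Polynomials are stored as integers with bit i holding the coefficient of x^i; the names `gf2_mod` and `crc_encode` are assumptions.

```python
def gf2_mod(dividend, divisor):
    """Remainder of GF(2) polynomial division (repeated shift and XOR)."""
    divisor_bits = divisor.bit_length()
    while dividend.bit_length() >= divisor_bits:
        shift = dividend.bit_length() - divisor_bits
        dividend ^= divisor << shift      # cancel the leading term
    return dividend


def crc_encode(message, generator, check_bits):
    """c(x) = m(x) * x^(n-k) + r(x): message bits followed by the remainder."""
    shifted = message << check_bits
    return shifted | gf2_mod(shifted, generator)


# m(x) = x^2 + x with g(x) = x^2 + 1
print(format(gf2_mod(0b11000, 0b101), "b"))          # 11, i.e. r(x) = x + 1
print(format(crc_encode(0b0110, 0b101, 2), "06b"))   # 011011
```

Every codeword produced this way leaves a zero remainder when divided by g(x), which is the property the receiver checks.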

Generating Check Bits for CRC Code

� Use LFSR

� With characteristic polynomial equal to g(x)

� Append n-k 0’s to end of message

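• Example: m(x) = x^2 + x + 1 and g(x) = x^3 + x + 1

[Figure: LFSR with initial state 0 0 0. The message 111 followed by the appended 0's (input stream 111000) is shifted in, leaving the final state 0 1 0.]

Final state after shifting equals remainder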
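The LFSR behaviour can be mimicked in software. This is a minimal sketch, assuming an internal-feedback LFSR with one stage per check bit and bits shifted in most-significant first; `lfsr_remainder` and its string-based interface are illustrative choices, not taken from the slides.

```python
def lfsr_remainder(stream: str, generator: str) -> str:
    """Shift `stream` through an LFSR with characteristic polynomial g(x).

    generator lists the coefficients of g(x) from the highest power down,
    e.g. "1011" for x^3 + x + 1. Returns the final register state.
    """
    degree = len(generator) - 1
    low_taps = int(generator[1:], 2)   # g(x) without its leading term
    mask = (1 << degree) - 1
    state = 0                          # register starts at 0 0 0
    for bit in stream:
        feedback = (state >> (degree - 1)) & 1   # bit leaving the last stage
        state = ((state << 1) | int(bit)) & mask
        if feedback:
            state ^= low_taps          # feed the outgoing 1 back into the taps
    return format(state, f"0{degree}b")


# Message 111 followed by three appended 0's, g(x) = x^3 + x + 1
print(lfsr_remainder("111000", "1011"))  # prints 010
```

The returned state is the check-bit field, so the transmitted codeword for this example is 111 followed by 010.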

Checking CRC Codeword

� Checking Received Codeword for Errors

� Shift codeword into LFSR

– with same characteristic polynomial as used to

generate it

� If final state of LFSR non-zero, then error

[Figure: LFSR with initial state 0 0 0; the codeword to check, 111010, is shifted in.]
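A software version of this check is sketched below, assuming the same g(x) = x^3 + x + 1 used to generate the codeword 111010; the helper name `crc_syndrome` is hypothetical.

```python
def crc_syndrome(codeword: str, generator: str) -> int:
    """Final LFSR state after shifting in the whole codeword (0 means no error found)."""
    degree = len(generator) - 1
    taps = int(generator, 2)
    state = 0
    for bit in codeword:
        state = (state << 1) | int(bit)
        if state >> degree:        # a 1 left the last stage: apply feedback
            state ^= taps
    return state


print(crc_syndrome("111010", "1011"))       # prints 0 (valid codeword)
print(crc_syndrome("110010", "1011") != 0)  # prints True (third bit flipped)
```

A zero result only means the received word is some valid codeword; error patterns that are themselves multiples of g(x) go undetected.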

Selecting Generator Polynomial

� Key issue for CRC Codes

� If first and last bit of polynomial are 1

– Will detect burst errors of length n-k or less

• If generator polynomial is multiple of (x+1)

– Will detect any odd number of errors

• If g(x) = (x+1)p(x) where p(x) primitive of degree n-k-1 and n < 2^{n-k-1}

– Will detect single, double, triple, and odd errors

Commonly Used CRC Generators

CRC code | Generator Polynomial
CRC-5 (USB token packets) | x^5 + x^2 + 1
CRC-12 (Telecom systems) | x^12 + x^11 + x^3 + x^2 + x + 1
CRC-16-CCITT (X25, Bluetooth) | x^16 + x^12 + x^5 + 1
CRC-32 (Ethernet) | x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
CRC-64 (ISO) | x^64 + x^4 + x^3 + x + 1

Fault Tolerance Schemes

� Adding Fault Tolerance to Design

� Improves dependability of system

� Requires redundancy

– Hardware

– Time

– Information

Hardware Redundancy

� Involves replicating hardware units

� At any level of design

– gate-level, module-level, chip-level, board-level

� Three Basic Forms

� Static (also called Passive)

– Masks faults rather than detects them

� Dynamic (also called Active)

– Detects faults and reconfigures to spare hardware

� Hybrid

– Combines active and passive approaches

Static Redundancy

� Masks faults so no erroneous outputs

� Provides uninterrupted operation

� Important for real-time systems

– No time to reconfigure or retry operation

• Simple and self-contained

– No need to update or rollback system state

Triple Module Redundancy (TMR)

� Well-known static redundancy scheme

� Three copies of module

� Use majority voter to determine final output

� Error in one module out-voted by other two

[Figure: Module 1, Module 2, and Module 3 feed a majority voter that produces the final output.]
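The voter itself is a small function. A minimal sketch, assuming each module output is an integer treated as a bit vector and voting is done bit by bit; `majority_vote` is an illustrative name.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three module outputs."""
    return (a & b) | (a & c) | (b & c)


# Module 3 delivers a corrupted word; the other two out-vote it.
print(bin(majority_vote(0b1010, 0b1010, 0b0110)))  # prints 0b1010
```

Voting per bit means that errors in different modules are still masked as long as they do not hit the same bit position.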

TMR Reliability and MTTF

� TMR works if any 2 modules work

� Rm = reliability of each module

� Rv = reliability of voter

� MTTF for TMR

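$$R_{TMR} = R_v\left[R_m^3 + \binom{3}{2} R_m^2 (1 - R_m)\right] = R_v\left(3R_m^2 - 2R_m^3\right)$$

$$MTTF_{TMR} = \int_0^\infty R_{TMR}\,dt = \int_0^\infty R_v\left(3R_m^2 - 2R_m^3\right)dt = \int_0^\infty e^{-\lambda_v t}\left(3e^{-2\lambda_m t} - 2e^{-3\lambda_m t}\right)dt = \frac{3}{2\lambda_m + \lambda_v} - \frac{2}{3\lambda_m + \lambda_v}$$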
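The closed-form MTTF can be cross-checked numerically. The sketch below assumes constant failure rates, so that R_m = e^{-λ_m t} and R_v = e^{-λ_v t}; the function names and the trapezoid-rule integration settings are illustrative choices.

```python
import math


def tmr_reliability(t: float, lam_m: float, lam_v: float = 0.0) -> float:
    """R_TMR(t) = R_v (3 R_m^2 - 2 R_m^3) with exponential module and voter lifetimes."""
    r_m = math.exp(-lam_m * t)
    r_v = math.exp(-lam_v * t)
    return r_v * (3 * r_m**2 - 2 * r_m**3)


def tmr_mttf(lam_m: float, lam_v: float = 0.0) -> float:
    """Closed form: 3 / (2 lam_m + lam_v) - 2 / (3 lam_m + lam_v)."""
    return 3 / (2 * lam_m + lam_v) - 2 / (3 * lam_m + lam_v)


def numeric_mttf(lam_m: float, lam_v: float = 0.0,
                 dt: float = 1e-3, horizon: float = 60.0) -> float:
    """Trapezoid-rule integral of R_TMR from 0 to `horizon` (tail beyond it is negligible)."""
    steps = int(horizon / dt)
    total = 0.5 * (tmr_reliability(0.0, lam_m, lam_v)
                   + tmr_reliability(horizon, lam_m, lam_v))
    total += sum(tmr_reliability(i * dt, lam_m, lam_v) for i in range(1, steps))
    return total * dt


print(round(tmr_mttf(1.0, 0.1), 4), round(numeric_mttf(1.0, 0.1), 4))
```

With a perfect voter (λ_v = 0) and λ_m = 1 the closed form evaluates to 5/6.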

Comparison with Simplex

� Neglecting fault rate of voter

� TMR has lower MTTF, but

� Can tolerate temporary faults

� Higher reliability for short mission times

$$MTTF_{TMR} = \frac{3}{2\lambda_m} - \frac{2}{3\lambda_m} = \frac{5}{6\lambda_m} = \frac{5}{6}\,MTTF_{simplex}$$

Comparison with Simplex

� Crossover point

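• R_TMR > R_simplex when

• Mission time shorter than 70% of simplex MTTF

$$R_{TMR} = R_{simplex}$$

$$3e^{-2\lambda_m t} - 2e^{-3\lambda_m t} = e^{-\lambda_m t}$$

$$\text{Solve} \Rightarrow t = \frac{\ln 2}{\lambda_m} \approx 0.7\,MTTF_{simplex}$$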
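The crossover time follows from a substitution. A short worked derivation, neglecting the voter as the slide does and writing y = e^{-λ_m t} (the symbol y is introduced here only for the derivation):

$$
\begin{aligned}
3y^2 - 2y^3 &= y \\
2y^2 - 3y + 1 &= 0 \qquad (y \neq 0) \\
(2y - 1)(y - 1) &= 0 \\
y = 1 \;(t = 0) \quad &\text{or} \quad y = \tfrac{1}{2} \\
e^{-\lambda_m t} = \tfrac{1}{2} \;&\Rightarrow\; t = \frac{\ln 2}{\lambda_m} \approx \frac{0.693}{\lambda_m} \approx 0.7\,MTTF_{simplex}
\end{aligned}
$$

For 1/2 < y < 1, that is for mission times below the crossover, 3y^2 - 2y^3 > y, so TMR is the more reliable option there.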

N-Modular Redundancy (NMR)

� NMR

� N modules along with majority voter

– TMR special case

� Number of failed modules masked = (N-1)/2

� As N increases, MTTF decreases

– But, reliability for short missions increases

� If goal only to tolerate temporary faults

� TMR sufficient
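The trade-off between N and mission time can be explored numerically. A minimal sketch, assuming odd N, independent module failures, and a perfect voter; `nmr_reliability` is an illustrative name.

```python
from math import comb


def nmr_reliability(n: int, r_m: float) -> float:
    """Probability that a majority of n modules (each with reliability r_m) still works."""
    needed = n // 2 + 1                      # tolerates (n - 1) / 2 failed modules
    return sum(comb(n, i) * r_m**i * (1 - r_m)**(n - i)
               for i in range(needed, n + 1))


# Short mission (modules still very reliable): larger N helps.
print(nmr_reliability(3, 0.9), nmr_reliability(5, 0.9))
# Long mission (module reliability below 0.5): larger N hurts.
print(nmr_reliability(3, 0.4), nmr_reliability(5, 0.4))
```

The second line of output illustrates why MTTF falls as N grows: once individual modules are more likely failed than working, a majority of working modules becomes less likely with every module added.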

Interwoven Logic

� Replace each gate

• with 4 gates using interconnection pattern

that automatically corrects errors

� Traditionally not as attractive as TMR

� Requires lots of area overhead

� Renewed interest by researchers

investigating emerging nanoelectronic

technologies

Interwoven Logic with 4 NOR Gates

[Figure: original circuit of NOR gates 1-4 with inputs X and Y, and its interwoven version in which each gate is replaced by four NOR gates (1a-1d, 2a-2d, 3a-3d, 4a-4d).]

Example of Error on Third Y Input

[Figure: the interwoven NOR network (gates 1a-4d, inputs X and Y) annotated with 0/1 signal values for an error on the third Y input, shown alongside the original circuit of gates 1-4.]

Dynamic Redundancy

� Involves

� Detecting fault

� Locating faulty hardware unit

� Reconfiguring system to use spare fault-free

hardware unit

Unpowered (Cold) Spares

� Advantage

� Extends lifetime of spares

� Equations

� Assume spare not failing until powered

� Perfect reconfiguration capability

$$R_{w/cold\_spare} = (1 + \lambda t)\,e^{-\lambda t}$$

$$MTTF_{w/cold\_spare} = \frac{2}{\lambda}$$
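Both results follow from the stated assumptions (the spare cannot fail while unpowered, and reconfiguration is perfect). A short derivation sketch, where τ denotes the time at which the primary module fails (a symbol introduced here for illustration):

$$
\begin{aligned}
R_{w/cold\_spare}(t) &= e^{-\lambda t} + \int_0^t \lambda e^{-\lambda \tau}\, e^{-\lambda (t - \tau)}\, d\tau
 = e^{-\lambda t} + \lambda t\, e^{-\lambda t} = (1 + \lambda t)\, e^{-\lambda t} \\
MTTF_{w/cold\_spare} &= \int_0^\infty (1 + \lambda t)\, e^{-\lambda t}\, dt
 = \frac{1}{\lambda} + \lambda \cdot \frac{1}{\lambda^2} = \frac{2}{\lambda}
\end{aligned}
$$

The first term is the probability that the primary survives to time t; the integral covers the case where the primary fails at time τ and the spare then survives the remaining t - τ.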

Unpowered (Cold) Spares

� One cold spare doubles MTTF

� Assuming faults always detected and

reconfiguration circuitry never fails

� Drawback of cold spare

� Extra time to power and initialize

� Cannot be used to help in detecting faults

� Fault detection requires either

– periodic offline testing

– online testing using time or information redundancy

Powered (Hot) Spares

� Can use spares for online fault detection

� One approach is duplicate-and-compare

� If outputs mismatch then fault occurred

– Run diagnostic procedure to determine which

module is faulty and replace with spare

� Any number of spares can be used

[Figure: Module A and Module B feed a comparator that produces the output and an agree/disagree signal; a spare module stands by.]

Pair-and-a-Spare

� Avoids halting system to run diagnostic procedure when fault occurs

[Figure: two duplicate-and-compare pairs, Modules A/B and Modules C/D, each with its own comparator producing an output and an agree/disagree signal; a switch selects between the two pairs.]

TMR/Simplex

� When one module in TMR fails

� Disconnect one of remaining modules

� Improves MTTF while retaining advantages

of TMR when 3 good modules

� TMR/Simplex

� Reliability always better than either TMR or

Simplex alone

Comparison of Reliability vs Time

[Figure: reliability (0.3 to 1) versus normalized mission time T/MTTF (0 to 1) for simplex, TMR, and TMR/simplex.]

Hybrid Redundancy

� Combines both static and dynamic redundancy

� Masks faults like static

� Detects and reconfigures like dynamic

TMR with Spares

� If TMR module fails

� Replace with spare

– can be either hot or cold spare

� While system has three working modules

– TMR will provide fault masking for

uninterrupted operation

Self-Purging Redundancy

� Uses threshold voter instead of majority voter

• Threshold voter outputs 1 if number of inputs that are 1 is greater than threshold

– Otherwise outputs 0

� Requires hot spares

Self-Purging Redundancy

[Figure: Modules 1-5 each connect through an elementary switch to a threshold voter (≥2). Inset, elementary switch: built from an XOR (⊕) of the voter and module outputs, an RS flip-flop with an initialization input, and an AND gate (&).]

Self-Purging Redundancy

� Compared with 5MR

� Self-purging with 5 modules

– Tolerate up to 3 failing modules (5MR cannot)

– Cannot tolerate two modules simultaneously

failing (5MR can)

� Compared with TMR with 2 spares

� Self-purging with 5 modules

– simpler reconfiguration circuitry

– requires hot spares (3MR w/spares can use

either hot or cold spares)

Time Redundancy

� Advantage

� Less hardware

� Drawback

� Cannot detect permanent faults

� If error detected

• System needs to roll back to known good state before resuming operation

Repeated Execution

� Repeat operation twice

� Simplest time redundancy approach

� Detects temporary faults occurring during

one execution (but not both)

– Causes mismatch in results

� Can reuse same hardware for both

executions

– Only one copy of functional hardware needed

Repeated Execution

� Requires mechanism for storing and comparing results of both executions

� In processor, can store in memory or on

disk and use software to compare

� Main cost

� Additional time for redundant execution

and comparison
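In software, the store-and-compare step is only a few lines. A minimal sketch, assuming the operation is deterministic when fault-free; `run_twice_and_compare` is a hypothetical helper, and the use of `RuntimeError` as the error signal is an arbitrary choice.

```python
def run_twice_and_compare(operation, *args):
    """Execute `operation` twice on the same hardware and compare the stored results."""
    first = operation(*args)    # result of the first execution is kept in memory
    second = operation(*args)   # redundant execution
    if first != second:
        # A temporary fault disturbed one of the two executions.
        raise RuntimeError("results mismatch: temporary fault suspected")
    return first


print(run_twice_and_compare(lambda a, b: a + b, 2, 3))  # prints 5
```

A permanent fault that corrupts both executions identically produces matching results and passes this check, which is the drawback noted for time redundancy.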

Multi-threaded Redundant Execution

� Can use in processor-based system that can run multiple threads

� Two copies of thread executed concurrently

� Results compared when both complete

� Take advantage of processor’s built-in

capability to exploit processing resources

– Reduce execution time

– Can significantly reduce performance penalty

Multiple Sampling of Outputs

� Done at circuit-level

� Sample once at end of normal clock cycle

• Sample again after delay of ∆t

� Two samples compared to detect mismatch

– Indicates error occurred

� Detect fault whose duration is less than ∆t

� Performance overhead depends on

– Size of ∆t relative to normal clock period

Multiple Sampling of Outputs

� Simple approach using two latches

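[Figure: a main latch clocked by Clk and a shadow latch clocked by Clk+∆t sample the same output; an XOR (⊕) of the two latches produces the error signal.]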

Multiple Sampling of Outputs

� Approach using stability checker at output

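[Figure: timing diagram in which each normal clock period is followed by a stability checking period of length ∆t; a stability checker built from AND (&) and OR (+) gates, enabled by the checking-period signal, produces the error signal.]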

Diverse Recomputation

� Use same hardware, but perform computation differently second time

• Can detect permanent faults that affect only one computation

� For arithmetic or logical operations

� Shift operands when performing second

computation [Patel 1982]

� Detects permanent fault affecting only one

bit-slice
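Recomputing with shifted operands can be illustrated with a toy adder. This is a deliberately simplified sketch: the fault model (the sum output of one bit-slice stuck at 0, carries unaffected), the 8-bit width, and all names are assumptions made for illustration, not details from the slides or from [Patel 1982].

```python
WIDTH = 8  # operands are assumed to be smaller than 2**(WIDTH - 1)


def faulty_add(a: int, b: int, stuck_slice=None) -> int:
    """Adder whose sum output at bit position `stuck_slice` is stuck at 0."""
    total = (a + b) & ((1 << (WIDTH + 1)) - 1)
    if stuck_slice is not None:
        total &= ~(1 << stuck_slice)
    return total


def add_with_shifted_recompute(a: int, b: int, stuck_slice=None):
    """Return (first result, error flag) using a second pass on left-shifted operands."""
    first = faulty_add(a, b, stuck_slice)
    # Second pass: shift operands left by one, add, then shift the result back.
    second = faulty_add(a << 1, b << 1, stuck_slice) >> 1
    return first, first != second


print(add_with_shifted_recompute(5, 6))                  # prints (11, False)
print(add_with_shifted_recompute(5, 6, stuck_slice=1))   # prints (9, True)
```

Because the shift moves every result bit to a neighbouring slice, the faulty slice corrupts a different bit position in each pass, so the two results disagree whenever the fault affects either of them.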

Information Redundancy

� Based on Error Detecting and Correcting Codes

� Advantage

� Detects both permanent and temporary

faults

� Implemented with less hardware overhead

than using multiple copies of module

� Disadvantage

� More complex design

Error Detection

� Error detecting codes used to detect errors

� If error detected

– Rollback to previous known error-free state

– Retry operation

Rollback

� Requires adding storage to save previous state

� Amount of rollback depends on latency of

error detection mechanism

� Zero-latency error detection

– rollback implemented by preventing system state from updating

� If errors detected after n cycles

– need rollback restoring system to state at least

n clock cycles earlier

Checkpoint

� Execution divided into set of operations

� Before each operation executed

– checkpoint created where system state saved

� If any error detected during operation

– rollback to last checkpoint and retry operation

� If multiple retries fail

– operation halts and system flags that

permanent fault has occurred
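The checkpoint, rollback, and retry sequence can be sketched in software. The example assumes the system state fits in a dictionary and that a detected error surfaces as a `RuntimeError`; `run_with_checkpoints`, `PermanentFaultError`, and the retry limit of 3 are all hypothetical.

```python
import copy


class PermanentFaultError(Exception):
    """Raised when an operation keeps failing after all retries."""


def run_with_checkpoints(state: dict, operations, max_retries: int = 3) -> dict:
    for operation in operations:
        checkpoint = copy.deepcopy(state)          # save state before the operation
        for _ in range(max_retries):
            try:
                operation(state)                   # may modify state, may signal an error
                break
            except RuntimeError:                   # stands in for an error-detection signal
                state = copy.deepcopy(checkpoint)  # roll back to last checkpoint, then retry
        else:
            raise PermanentFaultError("operation failed on every retry")
    return state
```

A usage example: an operation that increments a counter and fails on its first attempt leaves the counter at 1 rather than 2, because the partial update from the failed attempt is discarded by the rollback.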

Error Detection

� Encode outputs of circuit with error detecting code

� Non-codeword output indicates error

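[Figure: the inputs drive the functional logic and a check bit generator; a checker examines the outputs together with the check bits and produces the error indication. Bus widths are labeled m, k, and c in the original figure.]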

Self-Checking Checker

� Has two outputs

� Normal error-free case (1,0) or (0,1)

� If equal to each other, then error (0,0) or (1,1)

� Cannot have single error indicator output

– Stuck-at 0 fault on output could never be detected

Totally Self-Checking Checker

� Requires three properties

� Code Disjoint

– all codeword inputs mapped to codeword outputs, and all non-codeword inputs mapped to non-codeword outputs

� Fault Secure

– for all codeword inputs, checker in presence of fault will either produce correct codeword output or non-codeword output (not incorrect codeword)

� Self-Testing

– For each fault, at least one codeword input gives error indication

Duplicate-and-Compare

� Equality checker indicates error

� Undetected error can occur only if

common-mode fault affecting both copies

� Only faults after stems detected

� Over 100% overhead (including checker)

[Figure: the primary inputs fan out at stems to two copies of the functional logic; an equality checker compares the two sets of outputs and produces the error indication.]

Single-Bit Parity Code

� Totally self-checking checker formed by removing final gate from XOR tree

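[Figure: functional logic and parity prediction logic feed an XOR (⊕) tree with the final gate removed, giving the two error indication outputs EI0 and EI1.]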

Single-Bit Parity Code

� Cannot detect even bit errors

� Can ensure no even bit errors by

generating each output with independent

cone of logic

– Only single bit errors can occur due to single point fault

– Typically requires a lot of overhead

Parity-Check Codes

� Each check bit is parity for some set of output bits

� Example: 6 outputs and 3 check bits

                Z1 Z2 Z3 Z4 Z5 Z6 c1 c2 c3
Parity Group 1   1  0  0  1  1  0  1  0  0
Parity Group 2   0  1  1  0  0  0  0  1  0
Parity Group 3   0  0  0  0  0  1  0  0  1
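The example groups can be turned into a check-bit generator and checker. A minimal sketch, assuming even parity within each group (the slide does not state the polarity); `PARITY_GROUPS`, `check_bits`, and `has_error` are illustrative names.

```python
# Output indices (1-based) covered by each check bit, read from the table rows.
PARITY_GROUPS = [
    (1, 4, 5),   # c1 covers Z1, Z4, Z5
    (2, 3),      # c2 covers Z2, Z3
    (6,),        # c3 covers Z6
]


def check_bits(z):
    """Check bits for outputs z = [Z1, ..., Z6], one even-parity bit per group."""
    return [sum(z[i - 1] for i in group) % 2 for group in PARITY_GROUPS]


def has_error(z, c):
    """True if the received check bits c disagree with those recomputed from z."""
    return check_bits(z) != list(c)


z = [1, 0, 1, 1, 0, 1]
c = check_bits(z)                            # [0, 1, 1]
print(has_error(z, c))                       # prints False
print(has_error([1, 0, 1, 0, 0, 1], c))      # Z4 flipped: prints True
print(has_error([0, 0, 1, 0, 0, 1], c))      # Z1 and Z4 flipped: prints False
```

The last case shows the limitation: two errors inside the same parity group cancel, which is why the grouping should be chosen so that outputs sharing logic fall into different groups.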

Parity-Check Codes

� For c check bits and k functional outputs

• 2^{ck} possible parity check codes

� Can choose code based on structure of

circuit to minimize undetected error

combinations

� Fanouts in circuit determine possible error

combinations due to single-point fault

Checker for Parity-Check Codes

� Constructed from single-bit parity checkers and two-rail checkers

[Figure: three parity checkers, on (Z1, Z4, Z5, c1), (Z2, Z3, c2), and (Z6, c3), whose outputs are combined by two two-rail checkers into the final outputs E0 and E1.]

Two-Rail Checkers

� Totally self-checking two-rail checker

[Figure: two-rail checker with inputs A0, B0, A1, B1. Each output, C0 and C1, is the OR (+) of two AND (&) gates.]
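A common gate-level realisation of a two-rail checker is sketched below. The pairing of inputs into the two-rail pairs (A0, A1) and (B0, B1) and the exact AND/OR wiring are assumptions, since the figure's connections are not recoverable from the text; the function name is illustrative.

```python
from itertools import product


def two_rail_checker(a0: int, a1: int, b0: int, b1: int):
    """Outputs (C0, C1): complementary if both input pairs are complementary."""
    c0 = (a0 & b0) | (a1 & b1)
    c1 = (a0 & b1) | (a1 & b0)
    return c0, c1


for a0, a1, b0, b1 in product((0, 1), repeat=4):
    c0, c1 = two_rail_checker(a0, a1, b0, b1)
    inputs_valid = (a0 != a1) and (b0 != b1)
    # Valid two-rail inputs give (0,1) or (1,0); any invalid pair gives (0,0) or (1,1).
    assert (c0 != c1) == inputs_valid
print("two-rail property holds for all 16 input combinations")
```

Because the output is itself a two-rail pair, checkers of this kind can be cascaded in a tree to reduce many pairs to a single error indication.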

Berger Codes

� Inverter-free circuit

� Inverters only at primary inputs

� Can be synthesized using only algebraic

factoring [Jha 1993]

� Only unidirectional errors possible for

single point faults

– Can use unidirectional code

– Berger code gives 100% coverage
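For reference, a Berger code appends the binary count of 0's in the information bits. The sketch below assumes that standard definition and a bit-string interface; the function names are illustrative.

```python
def berger_check_bits(info: str) -> str:
    """Binary count of 0's in `info`, using ceil(log2(k + 1)) check bits for k info bits."""
    width = len(info).bit_length()
    return format(info.count("0"), f"0{width}b")


def is_berger_codeword(info: str, check: str) -> bool:
    return berger_check_bits(info) == check


check = berger_check_bits("1011")
print(check)                                  # prints 001
print(is_berger_codeword("1011", check))      # prints True
print(is_berger_codeword("0010", check))      # two 1->0 errors: prints False
```

Unidirectional 1-to-0 errors can only increase the number of 0's in the information bits while they can only decrease the stored count, so the two can never agree again; the same argument holds in reverse for 0-to-1 errors.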

Constant Weight Codes

� Non-separable with lower redundancy

� Drawback: need decoding logic to convert

codeword back to its original binary value

� Can use for encoding states of FSM

– No need for decoding logic

Error Correction

� Information redundancy can also be used to mask errors

� Not as attractive as TMR because logic for

predicting check bits very complex

� However, very good for memories

– Check bits stored with data

– Errors do not propagate in memories as in logic circuits, so SEC-DED usually sufficient

Error Correction

� Memories very dense and prone to errors

� Especially due to single-event upsets (SEUs)

from radiation

� SEC-DED check bits stored in memory

� 32-bit word, SEC-DED requires 7 check bits

– Increases size of memory by 7/32=21.9%

� 64-bit word, SEC-DED requires 8 check bits

– Increases size of memory by 8/64=12.5%
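The check-bit counts follow from the Hamming bound plus one overall parity bit. A minimal sketch, assuming a Hamming SEC code extended with a single parity bit for double-error detection; `sec_ded_check_bits` is an illustrative name.

```python
def sec_ded_check_bits(data_bits: int) -> int:
    """Smallest c with 2**c >= data_bits + c + 1 (SEC), plus one parity bit (DED)."""
    c = 1
    while 2**c < data_bits + c + 1:
        c += 1
    return c + 1


for width in (32, 64):
    bits = sec_ded_check_bits(width)
    print(width, bits, f"{100 * bits / width:.1f}%")
# prints:
# 32 7 21.9%
# 64 8 12.5%
```

The relative overhead shrinks as the word gets wider, which is one reason wider memory words are attractive when ECC is used.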

Memory ECC Architecture

[Figure: on a write, check bits are generated from the input data word, and the write data word and write check bits are stored in memory together. On a read, check bits are recalculated from the read data word and combined with the read check bits to generate a syndrome, which is used to correct the data before the data word is output.]

Hamming Code for ECC RAM

[Figure: ECC RAM. Generate side: the Z-bit input data feeds a Hamming check bit generator (c bits) and a parity bit generator; the RAM core stores N words of Z+c+1 bits/word. Detect/correct side: the Hamming check bits and parity bit are regenerated from the word read out and compared with the stored ones (Hamming check, parity check); a bit error correction circuit produces the output data.]

Error Type | Condition
No bit error | Hamming check bits match, no parity error
Single-bit correctable error | Hamming check bits mismatch, parity error
Double-bit error detection | Hamming check bits mismatch, no parity error

                Z1 Z2 Z3 Z4 Z5 Z6 Z7 Z8 c1 c2 c3 c4
Parity Group 1   1  1  0  1  1  0  1  0  1  0  0  0
Parity Group 2   1  0  1  1  0  1  1  0  0  1  0  0
Parity Group 3   0  1  1  1  0  0  0  1  0  0  1  0
Parity Group 4   0  0  0  0  1  1  1  1  0  0  0  1
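The parity groups and the error-type conditions can be combined into a small encoder and decoder. This is a sketch under assumptions: even parity throughout, a 13-bit word laid out as 8 data bits, 4 Hamming check bits, then the overall parity bit, and illustrative function names.

```python
# Data-bit columns (Z1..Z8) of the four parity groups from the table.
PARITY_GROUPS = [
    [1, 1, 0, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
]


def hamming_bits(data):
    return [sum(d & g for d, g in zip(data, group)) % 2 for group in PARITY_GROUPS]


def encode(data):
    check = hamming_bits(data)
    parity = sum(data + check) % 2           # overall parity over data and check bits
    return data + check + [parity]


def decode(word):
    data, check = list(word[:8]), word[8:12]
    syndrome = [a ^ b for a, b in zip(hamming_bits(data), check)]
    parity_error = sum(word) % 2 != 0
    if not parity_error:
        status = "double-bit error detected" if any(syndrome) else "no error"
        return data, status
    # Parity error: a single bit flipped. If the syndrome equals a data column,
    # flip that data bit; otherwise the error hit a check or parity bit.
    for i in range(8):
        if [group[i] for group in PARITY_GROUPS] == syndrome:
            data[i] ^= 1
    return data, "single-bit error corrected"


word = encode([1, 0, 1, 1, 0, 0, 1, 0])
word[3] ^= 1                                  # inject a single-bit error in Z4
print(decode(word))
```

Every data column in the table is distinct and has at least two 1's, so a non-zero syndrome identifies a single failing data bit unambiguously and can never be confused with a failing check bit.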

Memory ECC

� SEC-DED generally very effective

� Memory bit-flips tend to be independent

and uniformly distributed

� If bit-flip occurs, gets corrected next time

memory location accessed

• Main risk is if memory word not accessed for long time

– Multiple bit-flips could accumulate

Memory Scrubbing

� Every location in memory read on periodic basis

� Reduces chance of multiple errors

accumulating in a memory word

� Can be implemented by having memory

controller cycle through memory during idle

periods

Multiple-Bit Upsets (MBU)

• Can occur due to single SEU

� Typically occur in adjacent memory cells

� Memory interleaving used

� To prevent MBUs from resulting in multiple

bit errors in same word

Memory (physical cell order within a row):

Word1 Word2 Word3 Word4 | Word1 Word2 Word3 Word4 | Word1 Word2 Word3 Word4
Bit1  Bit1  Bit1  Bit1  | Bit2  Bit2  Bit2  Bit2  | Bit3  Bit3  Bit3  Bit3
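The interleaved layout corresponds to a simple address mapping. A minimal sketch, assuming four words per row and 0-based word, bit, and column indices; the names are illustrative.

```python
WORDS_PER_ROW = 4   # interleaving factor, taken from the four-word layout


def column_of(word: int, bit: int) -> int:
    """Physical column holding bit `bit` of word `word`."""
    return bit * WORDS_PER_ROW + word


def cell_at(column: int) -> tuple[int, int]:
    """(word, bit) stored at a physical column."""
    return column % WORDS_PER_ROW, column // WORDS_PER_ROW


def words_hit(first_column: int, width: int) -> list[int]:
    """Words affected by an upset spanning `width` adjacent columns."""
    return [cell_at(c)[0] for c in range(first_column, first_column + width)]


print(words_hit(5, 3))   # prints [1, 2, 3]: one bit in each of three different words
```

With this mapping, an upset spanning up to four adjacent cells produces at most a single-bit error in any one word, which SEC-DED can correct.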

Type | Issues | Goal | Examples | Techniques
Long-Life Systems | Difficult or Expensive to Repair | Maximize MTTF | Satellites, Spacecraft, Implanted Biomedical | Dynamic Redundancy
Reliable Real-Time Systems | Error or Delay Catastrophic | Fault Masking Capability | Aircraft, Nuclear Power Plant, Air Bag Electronics, Radar | TMR
High Availability Systems | Downtime Very Costly | High Availability | Reservation System, Stock Exchange, Telephone Systems | No Single Point of Failure; Self-Checking Pairs; Fault Isolation
High Integrity Systems | Data Corruption Very Costly | High Data Integrity | Banking, Transaction Processing, Database | Checkpointing, Time Redundancy; ECC; Redundant Disks
Mainstream Low-Cost Systems | Reasonable Level of Failures Acceptable | Meet Failure Rate Expectations at Low Cost | Consumer Electronics, Personal Computers | Often None; Memory ECC; Bus Parity; Changing as Technology Scales

Concluding Remarks

� Many different fault-tolerant schemes

� Choosing scheme depends on

� Types of faults to be tolerated

– Temporary or permanent

– Single or multiple point failures

– etc.

� Design constraints

– Area, performance, power, etc.

Concluding Remarks

� As technology scales

� Circuits increasingly prone to failure

� Achieving sufficient fault tolerance will be

major design issue