Post on 29-Apr-2018
Impact of Intermittent Faults on Nanocomputing Devices
Cristian Constantinescu
June 28th, 2007
Dependable Systems and Networks
Impact of Intermittent Faults on Nanocomputing Devices2 June 28th, 2007
Outline
•Fault classes– Permanent faults
– Transient faults
– Intermittent faults
•Field fault/error data collection
•Intermittent faults– Impact of scaling
•Mitigation techniques– HW vs. SW solutions
•Summary
•Q&A
Impact of Intermittent Faults on Nanocomputing Devices3 June 28th, 2007
Fault Classes
•Permanent faults, e.g. stuck-at, bridges, opens– Reflect irreversible physical changes– Occur at the same location, are always active
•Transient faults, e.g. particle induced SEU, noise, ESD– Induced by temporary environmental conditions– Occur at different locations, at random time instances
•Intermittent faults, e.g. manufacturing residues, oxide breakdown
– Occur due to unstable, marginal hardware– Occur at the same location– May be activated and deactivated– Induce bursts of errors
Impact of Intermittent Faults on Nanocomputing Devices4 June 28th, 2007
Fault/Error Data Collection
Impact of Intermittent Faults on Nanocomputing Devices5 June 28th, 2007
Fault/Error Data Collection Study
•Servers from two manufacturers were instrumented to collect errors
– Manufacturer A: 193 servers, 16 months
– Manufacturer B: 64 servers, 10 months
•Examples of reported errors– Memory
– Front side bus
•Failure analysis performed when possible
Source: C. Constantinescu, SELSE 2006
Impact of Intermittent Faults on Nanocomputing Devices6 June 28th, 2007
Server Instrumentation
HAL – hardware abstraction layer
MCH – machine check handler
CI – component instrumentation
Instrumentation validated by fault injection
C P U C H IP S E T
C I D e v ic eD rive rM C H
C I S e rv ic e
E ve n t L o g
H A L
Impact of Intermittent Faults on Nanocomputing Devices7 June 28th, 2007
020406080
100120140
01 t
o 5
6 to 1
011
to 50
51 to
100
101 t
o 100
0>1
000
NUMBER OF SINGLE-BIT ERRORS
NU
MB
ER
OF
SY
STE
MS
Corrected Memory Errors
•310.7 server years
•Servers experiencing intermittent faults: 16 out of 257, i.e. 6.2 %
•Corrected single-bit errors (SBE) induced by intermittent faults: 12990 out of 16069, i.e. 80.8 %
Impact of Intermittent Faults on Nanocomputing Devices8 June 28th, 2007
Typical Signature of Memory Intermittent Faults
Daily number of corrected SBE
0
20
40
60
80
100
120
80 86 89 92 95 135
138
344
445
448
Time (days)
SBE
Failure analysis: SBE induced intermittently by poly residue, within memory chips
Source: Hynix Semiconductor
Impact of Intermittent Faults on Nanocomputing Devices9 June 28th, 2007
Processor Front Side Bus Errors
•Front side bus (FSB) errors– Bursts of single-bit errors (SBE) on data path – SBE detected and corrected (data path protected by ECC)
•Servers experiencing FSB intermittent faults: 2 out of 64 (3%)– Burst duration examples: 7104 errors in 3 sec; 3264 errors in 18 sec
•Failure analysis– Intermittent contacts at solder joints
Server 1 Server 2 P0 P1 P2 P3 P0 P1 P2 P3
3264 15 0 0 108 121 97 101 7104 20 0 0 - - - -
Impact of Intermittent Faults on Nanocomputing Devices10 June 28th, 2007
More on Intermittent Faults
Impact of Intermittent Faults on Nanocomputing Devices11 June 28th, 2007
Timing Violations
BLM delamination
•Timing violations due to increased resistance; slow raise and fall times
– Intermittent behavior occurs before the fault becomes permanent - specific for 90nm node and beyond
– Permanent failures for previous technology nodes
Source: C. Constantinescu, SELSE 2006
Impact of Intermittent Faults on Nanocomputing Devices12 June 28th, 2007
Crosstalk Induced Errors
•Pulse induced by the affecting line into a victim line
•Timing violations due to crosstalk– Signal speedup or delay
Signal speedup – two adjacent lines switch in the same direction
Signal delay – two adjacent lines switch in opposite directions
•Process, voltage and temperature (PVT) variations amplify crosstalk induced skew
•Crosstalk increases with interconnect scaling and higher clock frequencies
Impact of Intermittent Faults on Nanocomputing Devices13 June 28th, 2007
Ultra-thin Oxide Faults
•Ultrathin oxide reliability – Rate of defect generation decreases with supply voltage – Tunnel current increases exponentially with decreasing gate oxide
thickness
•Soft breakdown (SBD) – Intermittent fluctuating current, high leakage– SBD examples
Erratic erasure of flash memory cells
Erratic fluctuations of Vmin in SRAM
0.5
0.6
0.7
0.8
0 300 600 900 1200 1500
Time [s]
Vm
in [V
]
SRAM Vmin90 nm technology
Source: M. Agostinelli et al, IEDM 2005
Impact of Intermittent Faults on Nanocomputing Devices14 June 28th, 2007
Scaling Trend of the Vmin Sensitivity
Vmin sensitivity to gate leakage
0
4
8
12
16
1.00E+051.00E+061.00E+07
Rg [Ohms]
Vmin
[a.u
.]
45nm65nm90nm
Incresed cell sensitivity
Source: M. Agostinelli et al, IEDM 2005
Impact of Intermittent Faults on Nanocomputing Devices15 June 28th, 2007
Impact of Process Variations
•Increasingly difficult to accurately control device parameters
– Channel length and width
– Oxide thickness
– Doping profile
•Intra-die variations, e.g., different transistor voltage threshold within the same SRAM cell
– Intermittent failure of read/write operations
•Impact of process variations is increasing with scaling
Impact of Intermittent Faults on Nanocomputing Devices16 June 28th, 2007
Activation of Intermittent Faults
Voltage and frequency shmoo– Voltage
– Frequency
– Temperature
– Workload
1.70V |***************************************** ||***************************************** ||***************************************** ||***************************************** |
1.45V |***************************************** ||****D************************************ ||HVMWV**ZYZ******************************||LH*NDNPQRFST ****************************|
1.20V |ABCDEADFGHIJC *************************** |40ns 50ns 60ns 70ns 80ns
Impact of Intermittent Faults on Nanocomputing Devices17 June 28th, 2007
Mitigation Techniques
Impact of Intermittent Faults on Nanocomputing Devices18 June 28th, 2007
HW Solutions: IBM G5/G6 CPU
•Mirrored Instruction and Execution units
•Comparator and register unit
•Compare outputs in n-1 instruction pipeline stage
– No error: update checkpoint array (register content and instruction address into R-unit) in last pipeline stage and continue normal execution
– Error detected: Reset CPU (except R-unit), purge cache and its directory, reload last correct state from checkpoint array, retry
Source: L. Spainhower, T. A. Greg, IBM JR&D,1999
•Transient faults are recovered from
•Error threshold can be used for intermittent faults
•Permanent faults require activation of a spare CPU under OS control
R- UNIT
COMPARATOR
I &
E-UN
ITS
I &
E-UN
ITS
CACHE
Impact of Intermittent Faults on Nanocomputing Devices19 June 28th, 2007
HW Solutions: IBM G5/G6 CPU
•Pros– Lower design complexity
– Shorter development and validation time
– No performance penalty (compare and detect cycles are overlapped)
•Cons– Total circuit overhead about 40%
– It may not scale well with frequency
Impact of Intermittent Faults on Nanocomputing Devices20 June 28th, 2007
SW Solutions: AR-SMT
•Active-stream/Redundant-stream Simultaneous Multithreading (AR-SMT)
– Two copies of the same program run concurrently, using the SMT micro architecture
– Results of the two threads are compared
– A-STREAM errors are detected with a delay
– R-STREAM errors are detected before commit
– Recovery from transient faults (e.g. particle induced soft error) is possible
Use committed state of R-STREAMA- STREAM R- STREAM
R- STREAM A- STREAM
COMMITFERCH
DELAY BUFFER
Source: E. Rotenberg, FTCS, 1999
Impact of Intermittent Faults on Nanocomputing Devices21 June 28th, 2007
SW Solutions: AR-SMT
•Pros– AR-SMT relies on existing micro-architectural features, e.g. SMT
– No HW overhead
•Cons– Increased execution time, 10% - 30%
– Increased performance penalty or even failure in the case of bursts of high frequency errors
Impact of Intermittent Faults on Nanocomputing Devices22 June 28th, 2007
Comparing Fault/Error Handling Techniques
•HW implementations are fast (e.g. ECC) - can handle bursts of errors induced by intermittent faults
•SW detection and recovery is slower – Performance penalty in the case of large bursts of errors
– Near coincident fault scenario, in the case of high rate bursts of errors => SW fault/error handling may fail before recovery is completed
•SW solutions are better suited for failure prediction and resource reconfiguration
Impact of Intermittent Faults on Nanocomputing Devices23 June 28th, 2007
Summary
•Semiconductor technology is a two edge sword– Lower dimensions and voltages and higher frequencies have led to
tremendous performance gains– Intermittent and transient faults have become a serious challenge to
developers and manufacturers
•Designing for particle induced soft errors is too narrowly focused
•Software only techniques cannot effectively handle bursts of errors occurring at a high rate
FAULT TOLERANT CHIPS ARE THE FUTURE
Impact of Intermittent Faults on Nanocomputing Devices24 June 28th, 2007
Q & A
Performance
Dependability