Impact of Intermittent Faults on Nanocomputing Devices

Cristian Constantinescu

June 28th, 2007

Dependable Systems and Networks

Outline

•Fault classes

– Permanent faults

– Transient faults

– Intermittent faults

•Field fault/error data collection

•Intermittent faults

– Impact of scaling

•Mitigation techniques

– HW vs. SW solutions

•Summary

•Q&A

Fault Classes

•Permanent faults, e.g. stuck-at, bridges, opens

– Reflect irreversible physical changes

– Occur at the same location, are always active

•Transient faults, e.g. particle induced SEU, noise, ESD

– Induced by temporary environmental conditions

– Occur at different locations, at random time instances

•Intermittent faults, e.g. manufacturing residues, oxide breakdown

– Occur due to unstable, marginal hardware

– Occur at the same location

– May be activated and deactivated

– Induce bursts of errors
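The three fault classes above can be told apart, roughly, from where and how often errors show up in a log. The following is a minimal sketch of that idea, not a method from the talk: the record format `(day, location)`, the function name `classify_locations`, and the decision rules (active every day means permanent, repeated at one location means intermittent, a single isolated error means transient) are all illustrative assumptions.

```python
from collections import defaultdict


def classify_locations(events, n_days):
    """Label each error location using simple, assumed heuristics.

    events: iterable of (day, location) error records (hypothetical format).
    n_days: length of the observation window in days.
    """
    days_by_loc = defaultdict(set)
    count_by_loc = defaultdict(int)
    for day, loc in events:
        days_by_loc[loc].add(day)
        count_by_loc[loc] += 1

    labels = {}
    for loc, days in days_by_loc.items():
        if len(days) == n_days:
            labels[loc] = "permanent"      # same location, always active
        elif count_by_loc[loc] > 1:
            labels[loc] = "intermittent"   # same location, comes and goes
        else:
            labels[loc] = "transient"      # isolated, random occurrence
    return labels


events = [(d, "dimm0") for d in range(5)]             # errors every day
events += [(1, "dimm1"), (1, "dimm1"), (3, "dimm1")]  # bursts, then quiet
events += [(2, "dimm2")]                              # one isolated error
labels = classify_locations(events, n_days=5)
print(labels)
```

A real classifier would need failure analysis to confirm the label; the log signature alone is only a hint.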

Fault/Error Data Collection

Fault/Error Data Collection Study

•Servers from two manufacturers were instrumented to collect errors

– Manufacturer A: 193 servers, 16 months

– Manufacturer B: 64 servers, 10 months

•Examples of reported errors

– Memory

– Front side bus

•Failure analysis performed when possible

Source: C. Constantinescu, SELSE 2006

Server Instrumentation

HAL – hardware abstraction layer

MCH – machine check handler

CI – component instrumentation

Instrumentation validated by fault injection

[Figure: server instrumentation block diagram with CPU, chipset, MCH, CI device driver, CI service, and event log]

Corrected Memory Errors

[Figure: corrected memory errors per server; axis: number of single-bit errors, 0 to 140]

•310.7 server years

•Servers experiencing intermittent faults: 16 out of 257, i.e. 6.2 %

•Corrected single-bit errors (SBE) induced by intermittent faults: 12990 out of 16069, i.e. 80.8 %
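The figures on this slide follow directly from the fleet sizes and observation periods given earlier (193 servers for 16 months, 64 servers for 10 months). A short check of the arithmetic, using only numbers from the slides; the variable names are illustrative:

```python
# (servers, months observed) per manufacturer, as stated in the study slide
fleet = {"A": (193, 16), "B": (64, 10)}

server_years = sum(n * months / 12 for n, months in fleet.values())
total_servers = sum(n for n, _ in fleet.values())

pct_servers_intermittent = 100 * 16 / total_servers  # 16 affected servers
pct_sbe_intermittent = 100 * 12990 / 16069           # SBE due to intermittent faults

print(round(server_years, 1))              # 310.7
print(total_servers)                       # 257
print(round(pct_servers_intermittent, 1))  # 6.2
print(round(pct_sbe_intermittent, 1))      # 80.8
```

In other words, about 6 % of the servers account for about 81 % of the corrected single-bit errors.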

Typical Signature of Memory Intermittent Faults

[Figure: daily number of corrected SBE versus time (days)]

Failure analysis: SBE induced intermittently by poly residue, within memory chips

Source: Hynix Semiconductor

Processor Front Side Bus Errors

•Front side bus (FSB) errors

– Bursts of single-bit errors (SBE) on data path

– SBE detected and corrected (data path protected by ECC)

•Servers experiencing FSB intermittent faults: 2 out of 64 (3%)

– Burst duration examples: 7104 errors in 3 sec; 3264 errors in 18 sec

•Failure analysis

– Intermittent contacts at solder joints

Number of FSB errors per processor (P0 to P3)

      Server 1                 Server 2
P0     P1    P2    P3    P0     P1    P2    P3
3264   15    0     0     108    121   97    101
7104   20    0     0     -      -     -     -
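Bursts like the ones reported above can be pulled out of a timestamped error log by grouping errors that arrive close together. The sketch below is one simple way to do that and is not taken from the study: the function name `find_bursts`, the `max_gap` threshold, and the sample timestamps are hypothetical.

```python
def find_bursts(timestamps, max_gap):
    """Group error timestamps (seconds) into bursts.

    A gap larger than max_gap between consecutive errors starts a new burst.
    Returns a list of (start, end, error_count) tuples.
    """
    bursts = []
    for t in sorted(timestamps):
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)   # still inside the current burst
        else:
            bursts.append([t])     # quiet period ended: open a new burst
    return [(b[0], b[-1], len(b)) for b in bursts]


# Made-up timestamps: two bursts and one isolated error
ts = [0.0, 0.1, 0.2, 0.3, 50.0, 50.1, 200.0]
bursts = find_bursts(ts, max_gap=1.0)
print(bursts)
```

The choice of `max_gap` decides what counts as one burst, so it would have to be tuned to the error source being monitored.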

More on Intermittent Faults

Timing Violations

BLM delamination

•Timing violations due to increased resistance; slow rise and fall times

– Intermittent behavior occurs before the fault becomes permanent; specific to the 90 nm node and beyond

– Permanent failures for previous technology nodes

Source: C. Constantinescu, SELSE 2006

Crosstalk Induced Errors

•Pulse induced by the affecting line into a victim line

•Timing violations due to crosstalk

– Signal speedup or delay

Signal speedup – two adjacent lines switch in the same direction

Signal delay – two adjacent lines switch in opposite directions

•Process, voltage and temperature (PVT) variations amplify crosstalk induced skew

•Crosstalk increases with interconnect scaling and higher clock frequencies
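The speedup and delay cases above are often explained with a first-order Miller coupling model, in which the coupling capacitance to a neighbor counts zero times, once, or twice depending on how the neighbor switches. The sketch below uses that textbook simplification; it is not from the talk, and the resistance and capacitance values are hypothetical.

```python
import math

# Miller coupling factor for one neighboring line (first-order model)
MILLER = {
    "same": 0.0,      # neighbor switches in the same direction: speedup
    "quiet": 1.0,     # neighbor does not switch: nominal delay
    "opposite": 2.0,  # neighbor switches in the opposite direction: delay
}


def victim_delay(r_drive, c_ground, c_couple, neighbor):
    """50 % delay of a lumped RC victim line, in seconds."""
    c_eff = c_ground + MILLER[neighbor] * c_couple
    return math.log(2) * r_drive * c_eff


# Hypothetical values: 1 kOhm driver, 50 fF to ground, 30 fF coupling
R, CG, CC = 1e3, 50e-15, 30e-15
delays = {case: victim_delay(R, CG, CC, case) for case in MILLER}
for case, d in delays.items():
    print(f"{case:9s} {d * 1e12:.1f} ps")
```

The spread between the "same" and "opposite" cases is the crosstalk-induced skew; it widens as the coupling capacitance grows relative to the ground capacitance, which is the trend with interconnect scaling.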

Ultra-thin Oxide Faults

•Ultrathin oxide reliability

– Rate of defect generation decreases with supply voltage

– Tunnel current increases exponentially with decreasing gate oxide thickness

•Soft breakdown (SBD)

– Intermittent fluctuating current, high leakage

– SBD examples

Erratic erasure of flash memory cells

Erratic fluctuations of Vmin in SRAM

[Figure: SRAM Vmin versus time [s], 90 nm technology]

Source: M. Agostinelli et al, IEDM 2005

Scaling Trend of the Vmin Sensitivity

[Figure: Vmin sensitivity to gate leakage versus Rg [Ohms] for the 45 nm, 65 nm, and 90 nm nodes; annotation: increased cell sensitivity]

Source: M. Agostinelli et al, IEDM 2005

Impact of Process Variations

•Increasingly difficult to accurately control device parameters

– Channel length and width

– Oxide thickness

– Doping profile

•Intra-die variations, e.g., different transistor threshold voltages within the same SRAM cell

– Intermittent failure of read/write operations

•Impact of process variations is increasing with scaling

Activation of Intermittent Faults

Voltage and frequency shmoo

– Voltage

– Frequency

– Temperature

– Workload

1.70V |***************************************** |
      |***************************************** |
      |***************************************** |
      |***************************************** |
1.45V |***************************************** |
      |****D************************************ |
      |HVMWV**ZYZ******************************|
      |LH*NDNPQRFST ****************************|
1.20V |ABCDEADFGHIJC *************************** |
       40ns 50ns 60ns 70ns 80ns
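A shmoo plot of this kind is produced by sweeping two operating parameters and recording pass or fail at each point. The sketch below shows the sweep itself; the device model in `passes` is a made-up stand-in for a real test run, and the marking convention here (`*` for pass, `.` for fail) is chosen for the example only.

```python
def passes(voltage, period_ns):
    """Hypothetical device model: the critical path slows as voltage drops.

    The constants are illustrative only; a real shmoo would run a test
    pattern on the device at each (voltage, period) point instead.
    """
    critical_delay_ns = 60.0 / (voltage - 0.4)
    return period_ns >= critical_delay_ns


def shmoo(voltages, periods_ns, test=passes):
    """Return one text row per voltage: '*' = pass, '.' = fail."""
    rows = []
    for v in voltages:
        cells = "".join("*" if test(v, p) else "." for p in periods_ns)
        rows.append(f"{v:.2f}V |{cells}|")
    return rows


voltages = [1.70, 1.45, 1.20]
periods = [40, 50, 60, 70, 80]  # clock period in ns
plot = shmoo(voltages, periods)
print("\n".join(plot))
```

Repeating the sweep at several temperatures and under different workloads would cover the other two activation factors listed on the slide.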

Mitigation Techniques

HW Solutions: IBM G5/G6 CPU

•Mirrored Instruction and Execution units

•Comparator and register unit

•Compare outputs in n-1 instruction pipeline stage

– No error: update checkpoint array (register content and instruction address into R-unit) in last pipeline stage and continue normal execution

– Error detected: Reset CPU (except R-unit), purge cache and its directory, reload last correct state from checkpoint array, retry

Source: L. Spainhower, T. A. Gregg, IBM JR&D, 1999

•Transient faults are recovered from

•Error threshold can be used for intermittent faults

•Permanent faults require activation of a spare CPU under OS control
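The compare, checkpoint, and retry flow described above can be sketched in software terms as follows. This is a loose behavioral analogy, not IBM's implementation: the function names, the list standing in for the R-unit checkpoint array, and the threshold value are assumptions made for illustration.

```python
def run_with_retry(instructions, unit_a, unit_b, error_threshold=3):
    """Run each instruction on two mirrored units; commit only on agreement.

    Returns (checkpoint, status). The checkpoint list stands in for the
    R-unit state; status is "ok" or "escalate" (threshold reached, which in
    the real system would mean handing over to a spare CPU).
    """
    checkpoint = []
    errors = 0
    for instr in instructions:
        while True:
            a, b = unit_a(instr), unit_b(instr)
            if a == b:
                checkpoint.append(a)  # no error: update checkpoint, continue
                break
            errors += 1               # mismatch: discard results and retry
            if errors >= error_threshold:
                return checkpoint, "escalate"
    return checkpoint, "ok"


def good_unit(x):
    return x * 2


def make_flaky_unit(bad_calls):
    """Unit that returns a wrong result on the given call numbers (1-based)."""
    calls = {"n": 0}

    def unit(x):
        calls["n"] += 1
        return x * 2 + (1 if calls["n"] in bad_calls else 0)

    return unit


# One transient error: recovered by a single retry
print(run_with_retry([1, 2, 3], good_unit, make_flaky_unit({2})))
# A burst of errors: the threshold is reached and the fault is escalated
print(run_with_retry([1, 2, 3], good_unit, make_flaky_unit({2, 3, 4})))
```

The threshold is what separates a transient fault, which disappears on retry, from an intermittent or permanent one that keeps coming back.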

[Figure: block diagram showing the comparator and the R-unit]

HW Solutions: IBM G5/G6 CPU

•Pros

– Lower design complexity

– Shorter development and validation time

– No performance penalty (compare and detect cycles are overlapped)

•Cons

– Total circuit overhead about 40%

– It may not scale well with frequency

SW Solutions: AR-SMT

•Active-stream/Redundant-stream Simultaneous Multithreading (AR-SMT)

– Two copies of the same program run concurrently, using the SMT microarchitecture

– Results of the two threads are compared

– A-STREAM errors are detected with a delay

– R-STREAM errors are detected before commit

– Recovery from transient faults (e.g. particle induced soft error) is possible

Use committed state of R-STREAM

[Figure: AR-SMT pipeline with A-STREAM and R-STREAM, fetch, commit, and delay buffer]

Source: E. Rotenberg, FTCS, 1999

SW Solutions: AR-SMT

•Pros

– AR-SMT relies on existing micro-architectural features, e.g. SMT

– No HW overhead

•Cons

– Increased execution time, 10% - 30%

– Increased performance penalty or even failure in the case of bursts of high frequency errors

Comparing Fault/Error Handling Techniques

•HW implementations are fast (e.g. ECC) and can handle bursts of errors induced by intermittent faults

•SW detection and recovery are slower

– Performance penalty in the case of large bursts of errors

– Near-coincident fault scenario in the case of high-rate bursts of errors: SW fault/error handling may fail before recovery is completed

•SW solutions are better suited for failure prediction and resource reconfiguration
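The near-coincident fault scenario comes down to comparing the time between errors with the time one recovery takes. The sketch below applies that comparison to the two FSB burst examples from the field data (7104 errors in 3 sec, 3264 errors in 18 sec). The software recovery latency of 1 ms is an assumed value for illustration, and errors are treated as evenly spaced, which real bursts are not.

```python
def mean_interarrival_s(n_errors, duration_s):
    """Average time between errors in a burst, assuming even spacing."""
    return duration_s / n_errors


def recovery_keeps_up(n_errors, duration_s, recovery_latency_s):
    """True if one recovery finishes, on average, before the next error."""
    return mean_interarrival_s(n_errors, duration_s) > recovery_latency_s


SW_RECOVERY_S = 1e-3  # assumed software recovery latency (hypothetical)

bursts = {
    "7104 errors in 3 s": (7104, 3.0),
    "3264 errors in 18 s": (3264, 18.0),
}
verdict = {
    name: recovery_keeps_up(n, d, SW_RECOVERY_S)
    for name, (n, d) in bursts.items()
}
for name, ok in verdict.items():
    n, d = bursts[name]
    print(f"{name}: {mean_interarrival_s(n, d) * 1e3:.2f} ms between errors, "
          f"{'keeps up' if ok else 'near-coincident faults'}")
```

Under these assumptions the denser burst delivers a new error roughly every 0.4 ms, before a 1 ms recovery can finish, while the sparser one leaves about 5.5 ms between errors.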

Summary

•Semiconductor technology is a double-edged sword

– Smaller dimensions, lower voltages, and higher frequencies have led to tremendous performance gains

– Intermittent and transient faults have become a serious challenge to developers and manufacturers

•Designing for particle induced soft errors is too narrowly focused

•Software-only techniques cannot effectively handle bursts of errors occurring at a high rate

FAULT TOLERANT CHIPS ARE THE FUTURE

[Figure: performance versus dependability]
