16. Circuit Pitfalls - cerc.utexas.edujaa/lectures/16-1.pdf · Series resistance of D driver, ... R...

44
16. Circuit Pitfalls Jacob Abraham Department of Electrical and Computer Engineering The University of Texas at Austin VLSI Design Fall 2017 October 30, 2017 ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 1 / 43

Transcript of 16. Circuit Pitfalls - cerc.utexas.edujaa/lectures/16-1.pdf · Series resistance of D driver, ... R...

16. Circuit Pitfalls

Jacob Abraham

Department of Electrical and Computer EngineeringThe University of Texas at Austin

VLSI DesignFall 2017

October 30, 2017

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 1 / 43

Bad Circuit 1

Circuit

2:1 multiplexer

Symptom

Mux works whenselected D is 0 but not 1Or fails at low VDDOr fails in SFSF corner

Principle: Threshold drop

X never rises above VDD − VtVt is raised by the body effectThe threshold drop is most serious as Vt becomes a greaterfraction of VDD

Solution: Use transmission gates, not pass transistors

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 1 / 43

Bad Circuit 2Circuit

Latch

Symptom

Load a 0 into QSet φ = 0Eventually Qspontaneously flips to 1

Principle: Leakage

X is a dynamic node holding a value as charge on the nodeEventually, subthreshold leakage may disturb charge

Solution: Staticize node withfeedback

Or, periodically refresh node (thisrequires a fast clock, and is notpractical for processes with bigleakage)

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 2 / 43

Bad Circuit 3Circuit

Domino AND gate

Symptom

Precharge gate (Y = 0)Then evaluateEventually Yspontaneously flips to 1

Principle: Leakage

X is a dynamic node holdingvalue as charge on the nodeEventually subthreshold leakagemay disturb charge

Solution: Keeper

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 3 / 43

Bad Circuit 4

Circuit

Pseudo-nMOS OR

Symptom

When only one input istrue, Y = 0Perhaps only happensin SF corner

Principle: Ratio Failure

nMOS and pMOS fight each otherIf the pMOS is too strong, nMOS cannot pull X low enough

Solution: Check that ratio is satisfied in all corners

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 4 / 43

Bad Circuit 5Circuit

LatchSymptom

Q stuck at 1May only happen forcertain latches whereinput is driven by asmall gate located faraway

Principle: Ratio failure (again)

Series resistance of D driver, wire resistance, and transmissiongate gate must be much less than weak feedback inverter

Solution: Check relative strengths

Avoid unbuffered diffusion inputswhere driver is unknown

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 5 / 43

Bad Circuit 6

Circuit

Domino AND gate

Symptom

Precharge gate while A= B = 0, so Z = 0Set φ= 1A risesZ is observed tosometimes rise

Principle: Charge sharing

If X was low, it shares chargewith Y

Solution: Limit charge sharing

Safe if CY >> CXOr, precharge node X too VX = VY =

CY

CX + CYVDD

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 6 / 43

Bad Circuit 7

Circuit

Dynamic gate + latch

Symptom

Precharge gate whiletransmission gate latchis opaqueEvaluateWhen latch becomestransparent, X falls

Principle: Charge sharing

If Y was low, it shares charge with X

Solution: Buffer dynamic nodes before driving transmissiongate

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 7 / 43

Bad Circuit 8

Circuit

Latch

Symptom

Q changes while latchis opaqueEspecially if D comesfrom a far-away driver

Principle: Diffusion Input Noise Sensitivity

If VD < −Vt, transmission gate turns onMost likely because of power supply noise or coupling on D

Solution: Buffer D locally

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 8 / 43

Bad Circuit 9

Circuit

Anything

Symptom

Some gates are slowerthan expected

Principle: Hot Spots and Power Supply Noise

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 9 / 43

NoiseSources

Power supply noise/ground bounceCapacitive couplingSubstrate couplingCharge sharingLeakageNoise feedthrough

ConsequencesIncreased delay (for noise to settle out)Or incorrect computations

Source: electronicproducts.comLine-to-substrate coupling

Visualization of substrate noise

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 10 / 43

Electromigration

“Electron wind” causes movement of metal atoms along wires

Excessive electromigration leads to open circuitsMost significant for unidirectional currents (DC)

Depends on current density Jdc (current/area)Exponential dependence on temperatureBlack’s Equation:

MTTF ∝ eEakT

Jndc,

where Ea is the activation energy (empirically determined bystress testing at high temperatures), and n is typically 2Typical limits: Jdc < 1− 2 mA/µm2

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 11 / 43

Source: Cheung and Tao, UC Berkeley

Self Heating

Current through wire resistance generates heatOxide surrounding wires is a thermal insulatorHeat tends to build up in wiresHotter wires are more resistive, slower

Self-heating limits AC current densities for reliability

Irms =

√∫ T0 I(t)2dt

T

Typical limits: Jrms < 15 mA/µm2

Self heating a problem for SOI circuits and 3-D systems

Modeling self heating, Silvaco

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 12 / 43

Latchup

Latchup: positive feedback leading to VDD – GND short

Major problem for 1970s CMOS processes before it was wellunderstood

Avoid by minimizing resistance of body to GND/VDD

Use plenty of substrate and well taps

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 13 / 43

Guard Rings

Latchup risk greatest when diffusion-to-substrate diodes couldbecome forward-biased

Surround sensitive region with guard ring to collect injectedcharge

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 14 / 43

Overvoltage

High voltages can damage transistorsElectrostatic discharge (ESD)Oxide arcingPunchthroughTime-dependent dielectric breakdown (TDDB)

Accumulated wear from tunneling currents

Requires low VDD for thin oxides and short channelsUse ESD protection structures where chip meets real world

Transient suppression device specifications

for automotive applications

Source: dev.emcelettronica.com

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 15 / 43

Hot Carriers

Electric fields across channel impart high energies to somecarriers

These “hot” carriers may be blasted into the gate oxide wherethey become trappedAccumulation of charge in oxide causes shift in Vt over timeEventually Vt shifts too far for devices to operate correctly

Choose VDD to achieve reasonable product lifetimeWorst problems for inverters and NORs with slow input risetime and long propagation delays

Source: Kiethley Application Note 2535

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 16 / 43

Bias Temperature Instability

Mechanism

Even when no carriers moving from source to drain, the gatevoltage can cause charges to migrate into the insulating gateoxide

Phenomenon is partly reversible

Charges leave the oxide after the gate voltage is removed

Source: Keane and Kim, IEEE Spectrum, May 2011

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 17 / 43

NBTI Mechanism and Degradation

Hole interaction with oxide causes Si-Hbonds to break

Change is positive charge density in trapsincreases Vt

When stress is removed, H atoms diffuseback to interface and anneal the brokenbond

DC stress causes much shorter lifetime

Source: Peters, Semiconductor International, March 1, 2004ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 18 / 43

Transistor Aging and Failure Prediction

Guardband violation due to transistor aging

Example of an aging sensor

Agarwal et al., VTS 2007ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 19 / 43

Oxide Breakdown

Mechanism

Voltage across the gate can also cause electrically activedefects within the oxide layer

The defects can trap charges

If enough of the charges accumulate, they can create a short,causing a catastrophic failure of the transistor

Source: Keane and Kim, IEEE Spectrum, May 2011

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 20 / 43

Summary

Static CMOS gates are very robust

Will settle to correct value if designed with sufficient marginsand if you wait long enough

Other circuits suffer from a variety of pitfalls

Tradeoff between performance and robustness

Very important to check circuits for pitfalls

For large chips, you need an automatic checkerDesign rules aren’t worth the paper they are printed on unlessyou back them up with a tool

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 21 / 43

Classes of Dependable Systems

Systems Designed for Very Long LifeSpacecraft with multiyear missions, inaccessible systemsTechniques: Replication (spares), error coding, monitoring,shielding

Safety-Critical SystemsFlight control computers, nuclear-plant shutdown, medicalmonitoring, automobile braking controlTechniques: Replication with voting, time redundancy, designdiversity

High-Availability SystemsTelephone switching centers, server farms, banking systems,e-commerceTechniques: Hardware and Information redundancy, backupschemes, hot-swap, recovery

Consumer Products?PCs, PDAs, smart phonesTechniques: parity checks for memories, intrusion tolerance,virus detection, low cost of replacement

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 22 / 43

Historical Perspective

Dionysius Lardner

“The most certain and effectual check upon errors which arisein the process of computation, is to cause the samecomputations to be made by separate and independentcomputers; and this check is rendered still more decisive ifthey make their computations by different methods,”Edinburgh Review, No. CXX, July 1834

Key Papers in 1956

Moore and Shannon, “Reliable circuits using less reliablerelays,” Bell System Technical Journal

von Neumann, “Probabilistic logic and synthesis of reliableorganism from unreliable components,” Annals ofmathematical studies, Princeton University Press

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 23 / 43

Reliable Relay Networks

Shannon, 1956

Can be applied to MOS transistors

A single open or short of a transistor (relay) will be masked bythe network

Many multiple faults will also be tolerated

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 24 / 43

NAND Multiplexing – Massive Redundancy

Inspired by 1956 von Neumann paper for logic implemented innanotechnologies

Approach

Similar to NMR, but voting carried out in a bundle

Executive stage – performs operations

Restorative stage – reduces degradation caused by errors fromthe executive stage, acting as output “amplifier”

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 25 / 43

Error-Detecting and Correcting Codes

One way of detecting (and correcting) errors in data transmissionand storage, is to encode data, with a subset of the words beingcode words

Reasonable errors will change a code word to a non-code word, andthe errors will be detectable

Errors which transform one code word into another will not bedetectable

“Error Models” relate likely physical faults to the errors that theycould cause

Distance between two code words is the number of distinctchanges needed to change one code word into the other

Example: Parity codes

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 26 / 43

Distance Properties

The Hamming Weight of a vector, X, W (X) is the number ofnon-zero components of X

The Hamming Distance between two vectors, X and Y , d(X,Y ),is the number of components in which they differ

The minimum distance of a code is the minimum of Hammingdistances between all pairs of code words

To detect d-bit errors, need a code with distance d+ 1, to correctd-bit errors, need a code with distance 2d+ 1.Example:

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 27 / 43

Self-Checking Circuits

Self-Checking Circuits – encoded inputs and outputs, outputchecker

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 28 / 43

Boeing 777 Primary Flight Computer

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 29 / 43

Reliability, Availability, Safety

Reliability (R(t))Conditional probability that a system provides continuousproper service in the interval [0,t] given that it provided desiredservice at time 0Simple Reliability function (exponential): R(t) = e−λt,Constant Failure Rate λ

Mean Time to Failure, MTTFMTTF =

∫∞0R(t)dt

For an exponential reliability function, MTTF = 1/λ

Availability A(t)Fraction of time that system is in the operational state(providing service) during the interval [0,t]Function of both failure rate (λ) and repair rate (µ)Steady-State Availability, A = MTTF

MTTF+MTTR = λλ+µ

“Markov Chain” for a simplesystem with repair

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 30 / 43

Reliability

View a system as providing a service

Faults =⇒ Errors =⇒ Failures

Fault: an anomalous physical condition

Error: an incorrect logic value as a consequence of the fault

Failure: the condition where the system does not provide theexpected service

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 31 / 43

Are “Fault Tolerance” and “Resilience” the Same?

Fault Tolerance

Errors (due to faults)detected and corrected, faultlocated, reconfigurationaround faulty unit

System designed to tolerateclasses of faults

User does not see anythingwrong (except perhaps anadditional delay)

Service does not suffer anydown time

Resilience

User may see errors duringthe service, but the finalresults are correct

System requires on-line errordetection, but may usecheckpoints, retry, etc., toachieve resilience

Ability to deal with“unknown” faults

Service may be downintermittently

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 32 / 43

Errors Not Repeatable – “Heisenbugs”

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 33 / 43

Example of Resilience – Single Engine Airplane

From Malibu Jetprop pilot’s operating handbook

If loss of power occurs at altitude, trim the aircraft for bestgliding angle (90 KIAS) and look for a suitable field.

At best glide angle, no wind, with the engine stopped and thepropeller feathered, the aircraft will travel approximately 2miles for each thousand feet of altitude.

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 34 / 43

Achieving Resilience

Start with fault-free hardware

Testing after manufacturing

On-line tests to detect wearout and degradation

Detection is key

Detect errors in results of computations

Application-level results are, ultimately, what are important

Ensure correct results at the application level

Appropriate checks at different levels of the design

High-level checks tend to have lower overheads

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 35 / 43

Application-Level Fault Tolerance

Reduce the cost of fault tolerance by looking at computations at ahigher level

Algorithm-Based Fault Tolerance (ABFT), (Huang and Abraham,1984)

Encode data at a high level (application level)

Design algorithm to operate on encoded input data andproduce encoded output data

Distribute computation tasks among multiple computationunits, so that failure of a unit affects only a portion of theoutput data, enabling the correct data to be recovered

Very general fault model: A computation unit can produce anyarbitrary logical output under failure

Communication paths checked using coding techniques

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 36 / 43

Illustration of Application to Matrix Operations

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 37 / 43

Checksum Calculations in High Performance Computing

ABFT applied to DGEMM

Source: Bosilca, Delmas, Dongarra and Langou, 2008.

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 38 / 43

Performance Under Failure

Performance (GFLOPS/sec/proc) of PBLAS PDGEMM, ABFTBLAS PDGEMM (0 failure), and ABFT BLAS PDGEMM (1failure)

4 25 100 225 400 625 10240

0.5

1

1.5

2

2.5

3

3.5

4

4.4nloc=3000

GF

LOP

S/p

roc

# procs

PBLAS PDGEMMABFTBLAS PDGEMM (0 failure)ABFTBLAS PDGEMM (1 failure)

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 39 / 43

Error Resilience in Non-Linear Control Systems

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 40 / 43

Brake by Wire

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 41 / 43

Error Detection and Correction in Non-Linear ControlSystem

Brake by Wire algorithm executing on embedded processor

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 42 / 43

Error Detection and Correction in Brake by Wire System

Approach to dealing with soft errors

When transient error is detected, results of the control loop(output to actuator) ignored for a few cycles till no error isseen.

ECE Department, University of Texas at Austin Lecture 16. Circuit Pitfalls Jacob Abraham, October 30, 2017 43 / 43