1. Introduction 1.Faults and their manifestation (4) 2.Analysis of faults (12) 3.Classification of...
1. Introduction
1. Faults and their manifestation (4)
2. Analysis of faults (12)
3. Classification of tests (5)
4. Fault coverage requirements (3)
5. Test economics (4)
1.1 Faults and their manifestation
Definition of the terms: Failure, Error and Fault
Failure: A system failure is present when the service of the system
differs from the expected service
A failure is caused by an error
Error: There is an error in the system when its state differs from the
state required to deliver the expected service
An error is caused by a fault
Fault: A fault is present when there is a physical difference between
the correct system and the current system
1.2 Faults and their manifestation: Example
Example: A car cannot be used due to a flat tire
Failure: The car cannot be driven due to a flat tire
I.e., the service differs from the expected service
The failure is caused by an error
Error: The air pressure has an erroneous state
An error is caused by a fault
Fault: A puncture, causing an erroneous air-pressure-state
I.e., the puncture is the difference between the correct system and the current system
Note: A fault may not immediately result in a failure, e.g., in the case of a slowly leaking tire
1.3 Fault manifestation
According to the way faults manifest themselves in time, they
can be divided into permanent and non-permanent faults
Permanent fault: Affects the system’s functional behavior permanently
Permanent faults are also referred to as solid or hard faults
Examples: Broken wires, functional design errors, etc.
Non-permanent fault: Affects the system’s functional behavior only part
of the time
1.4 Non-permanent faults
Non-permanent faults are only present part of the time
• They occur at random moments and affect the system behavior for finite
periods of time
• Therefore, their detection and localization are difficult
These faults consist of the groups
• Transient faults
– Caused by environmental conditions
– They are also referred to as soft errors
Examples: cosmic rays, α-particles, temperature, pressure, vibration
• Intermittent faults
– Caused by non-environmental conditions
Examples: Loose connections, deteriorating or aging components
2.1 Analysis of faults
The following topics explain this subject
• Analyze the frequency of occurrence of faults
• Analyze system failure rate over its life time
• Show failure rates of series and parallel systems
• Explain physical and electrical causes of faults
These are referred to as failure mechanisms
2.2 Frequency of occurrence of faults (1)
Can be explained using reliability theory
The point in time t at which a fault occurs can be considered a random variable u
The probability of a failure before time t, F(t), is the unreliability of the system:
$$F(t) = P(u \le t)$$
The reliability of a system, R(t), is the probability of a correctly functioning system at time t:
$$R(t) = 1 - F(t)$$
or alternatively:
$$R(t) = \frac{\#\text{ of components surviving at time } t}{\#\text{ of components at time } 0}$$
It is assumed that:
F(0) = 0: Initially the system will be operable
F(∞) = 1: Ultimately the system will fail
F(t) + R(t) = 1: System is either operable or failing
2.3 Frequency of occurrence of faults (2)
The derivative of F(t), f(t), is called the failure probability density function:
$$f(t) = \frac{dF(t)}{dt} = -\frac{dR(t)}{dt}$$
Hence:
$$F(t) = \int_0^t f(\tau)\,d\tau$$
and
$$R(t) = \int_t^\infty f(\tau)\,d\tau$$
The failure rate, z(t), is defined as the conditional probability that the system fails during the period (t, t+Δt), given that the system was operational at time t
$$z(t) = \lim_{\Delta t \to 0} \frac{F(t+\Delta t) - F(t)}{\Delta t} \cdot \frac{1}{R(t)} = \frac{dF(t)}{dt} \cdot \frac{1}{R(t)} = \frac{f(t)}{R(t)}$$
Alternatively, z(t) can be expressed as follows:
$$z(t) = \frac{\#\text{ of components failing per unit time at time } t}{\#\text{ of components surviving at time } t}$$
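The counting definitions of R(t), F(t) and z(t) can be tried out on a small table of survivor counts. The sketch below is only an illustration: the survivor counts are invented, the function names are hypothetical, and z(t) is approximated over one time unit rather than taken as a limit.

```python
# Hypothetical data: survivors[i] = number of components still working at time i.
survivors = [1000, 900, 850, 820, 800]


def reliability(counts):
    """R(t) = (# of components surviving at time t) / (# of components at time 0)."""
    n0 = counts[0]
    return [c / n0 for c in counts]


def unreliability(counts):
    """F(t) = 1 - R(t)."""
    return [1.0 - r for r in reliability(counts)]


def failure_rate(counts):
    """Approximate z(t): failures during [t, t+1) divided by survivors at time t."""
    return [(counts[i] - counts[i + 1]) / counts[i]
            for i in range(len(counts) - 1)]


R = reliability(survivors)
F = unreliability(survivors)
z = failure_rate(survivors)
print("R(t):", R)
print("F(t):", F)
print("z(t):", z)
```

Note that z(t) is normalized by the survivors at time t, not by the initial population, which is what distinguishes it from f(t).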
2.4 Frequency of occurrence of faults (3)
R(t) can be expressed in terms of z(t) as follows
$$\int_0^t z(\tau)\,d\tau = \int_0^t \frac{f(\tau)}{R(\tau)}\,d\tau = -\int_{R(0)}^{R(t)} \frac{dR}{R} = -\ln\frac{R(t)}{R(0)}$$
or,
$$R(t) = R(0)\,e^{-\int_0^t z(\tau)\,d\tau}$$
The average lifetime of a system, θ, can be expressed as the mathematical expectation of t to be
$$\theta = \int_0^\infty t\,f(t)\,dt$$
For a non-maintained system, θ is called the Mean Time To Failure, MTTF. Using partial integration, and assuming $\lim_{T\to\infty} T\,R(T) = 0$:
$$MTTF = \theta = \lim_{T\to\infty}\left[-t\,R(t)\Big|_0^T + \int_0^T R(t)\,dt\right] = \int_0^\infty R(t)\,dt$$
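The two results of this slide, R(t) as an exponential of the integrated failure rate and MTTF as the area under R(t), can be checked numerically. The sketch below assumes R(0) = 1 and a hypothetical, linearly increasing failure rate z(t) = a·t, chosen only because its MTTF has the known closed form sqrt(π/(2a)); the function names and step counts are likewise illustrative.

```python
import math


def reliability_from_rate(z, t_end, steps):
    """Return (times, R) with R(t) = exp(-integral of z from 0 to t), R(0) = 1.

    The integral is accumulated with the trapezoidal rule.
    """
    dt = t_end / steps
    times = [i * dt for i in range(steps + 1)]
    R = [1.0]
    cumulative = 0.0
    for i in range(1, steps + 1):
        cumulative += 0.5 * (z(times[i - 1]) + z(times[i])) * dt
        R.append(math.exp(-cumulative))
    return times, R


def mttf(times, R):
    """MTTF = integral of R(t) dt, approximated with the trapezoidal rule."""
    total = 0.0
    for i in range(1, len(times)):
        total += 0.5 * (R[i - 1] + R[i]) * (times[i] - times[i - 1])
    return total


a = 2.0  # hypothetical slope of the failure rate
times, R = reliability_from_rate(lambda t: a * t, t_end=10.0, steps=20000)
estimate = mttf(times, R)
exact = math.sqrt(math.pi / (2 * a))
print(estimate, exact)
```

The integration is cut off at t_end = 10, where R(t) is already negligibly small for this choice of z(t); a slowly decaying R(t) would need a larger cut-off.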
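2.5 Frequency of occurrence of faults (4)
Given a system with the following reliability
$$R(t) = e^{-\lambda t}$$
The failure rate, z(t), of that system is computed below, and has a constant value
$$z(t) = \frac{f(t)}{R(t)} = \frac{dF(t)}{dt} \Big/ R(t) = \frac{d(1 - e^{-\lambda t})}{dt} \Big/ e^{-\lambda t} = \lambda e^{-\lambda t} / e^{-\lambda t} = \lambda$$
Assuming failures occur randomly with a constant rate λ, the MTTF can be expressed as
$$MTTF = \int_0^\infty e^{-\lambda t}\,dt = \frac{1}{\lambda}$$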
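A short sketch of the constant-failure-rate case, assuming a hypothetical rate of 0.002 failures per hour (the value and the names are invented for illustration). It evaluates z(t) = f(t)/R(t) at several times and the reliability at t = MTTF.

```python
import math

lam = 0.002  # hypothetical constant failure rate, in failures per hour


def R(t):
    """Reliability R(t) = exp(-lambda * t)."""
    return math.exp(-lam * t)


def f(t):
    """Failure probability density f(t) = dF/dt = lambda * exp(-lambda * t)."""
    return lam * math.exp(-lam * t)


def z(t):
    """Failure rate z(t) = f(t) / R(t)."""
    return f(t) / R(t)


mttf_hours = 1.0 / lam
for t in (0.0, 100.0, 1000.0):
    print(t, z(t))
print("MTTF:", mttf_hours, "R(MTTF):", R(mttf_hours))
```

With a constant failure rate, the reliability at t = MTTF is e^-1, so only about 37% of the systems are expected to survive up to the mean time to failure.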
Example: R(t) & F(t) of Dutch male population (over years: 1976–1980)
Note: # of people > 100 yrs old too small
2.6 Frequency of occurrence of faults (5)
[Figures: z(t) and f(t) of the Dutch male population]
Note: Increase of z(t) & f(t) between ages 18 and 20 due to driving accidents
Note: Infant mortality rate
2.7 Failure rate over product lifetime (1)
A well-known graphical representation of the failure rate, z(t), is the bathtub curve. It consists of three regions:
1. Infant mortality
Failures in this region are termed infant mortalities. They are attributed to poor quality due to variations in the production process
2. Working life; Constant failure rate: z(t) = λ
Failures are considered to occur randomly in time
3. Wear out; Increasing failure rate
This represents the end-of-life period of a system
A system should be shipped after it has passed the infant mortality period, in order to reduce the number of field returns.
[Figure: z(t) of Dutch males]
2.8 Failure rate over product lifetime (2)
Shipping a system after the infant mortality period can be done by:
1. Aging the system for that period (this can be several months)
2. Aging the system under stress
– This accelerates the aging process
An important stress condition is increased temperature: Burn-In
The accelerating effect of temperature follows Arrhenius’ equation
$$\lambda_{T_2} = \lambda_{T_1} \cdot e^{E_a (1/T_1 - 1/T_2)/k}$$
• T1 and T2 are absolute temperatures (in kelvin, K)
• λ_T1 and λ_T2 are the failure rates at T1 and T2, respectively
• Ea is the activation energy; a constant expressed in electron-volts, eV
• k is Boltzmann’s constant, k = 8.617*10^-5 eV/K
The equation shows that the failure rate is exponentially dependent on the temperature
2.9 Failure rate over product lifetime (3)
Example of use of Arrhenius equation
Assume Burn-In takes place at 150 °C = 423 K; i.e., T2 = 423
Note: Room temperature is 30 °C = 303 K; i.e., T1 = 303
Given that the Ea for the targeted failure mechanism is: Ea = 0.6 eV
Then the acceleration factor is:
$$\lambda_{T_2} / \lambda_{T_1} = e^{0.6\,(1/303 - 1/423)/(8.617 \cdot 10^{-5})} = 678$$
This means that the 150 °C temperature stress reduces the aging time by a factor of 678.
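The burn-in example can be reproduced with a few lines of Python. This is a sketch under the slide's own numbers (Ea = 0.6 eV, T1 = 303 K, T2 = 423 K, k = 8.617*10^-5 eV/K); the function and constant names are made up for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann's constant in eV/K, as given on the slide


def acceleration_factor(ea_ev, t1_kelvin, t2_kelvin):
    """Ratio of failure rates lambda_T2 / lambda_T1 according to Arrhenius' equation."""
    exponent = ea_ev * (1.0 / t1_kelvin - 1.0 / t2_kelvin) / BOLTZMANN_EV_PER_K
    return math.exp(exponent)


af = acceleration_factor(ea_ev=0.6, t1_kelvin=303.0, t2_kelvin=423.0)
print(round(af))
```

Because Ea sits in the exponent, the result is very sensitive to the activation energy: the same temperature step gives a much smaller factor for a mechanism with Ea = 0.3 eV and a much larger one for Ea = 1.0 eV.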
Failure mechanism               Ea: Activation energy
Corrosion of metallization      0.3 – 0.6 eV
Electrolytic corrosion          0.8 – 1.0 eV
Electromigration                0.4 – 0.8 eV
Bonding (purple plague)         1.0 – 2.2 eV
Ionic contamination             0.5 – 1.0 eV
Alloying (contact migration)    1.7 – 1.8 eV
Note: Every failure mechanism has its typical Ea value
2.10 Failure rates of series and parallel systems
A series system is a system of which all components have to be operational in order for the system to be operational
Consider that the system consists of n components with reliability Ri(t), then the reliability of the system, Rs(t), is:
$$R_s(t) = \prod_{i=1}^{n} R_i(t)$$
It can be shown that
$$z_s(t) = \sum_{i=1}^{n} z_i(t)$$
A parallel system is a system which is operational as long as at least one of its n components is operational. The unreliability is:
$$F_p(t) = \prod_{i=1}^{n} F_i(t)$$
The reliability is:
$$R_p(t) = 1 - \prod_{i=1}^{n} F_i(t)$$
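The series and parallel formulas are easy to compare numerically. The sketch below assumes independent components and uses invented reliabilities of 0.9 at some fixed time t; the function names are hypothetical. It relies on math.prod, which requires Python 3.8 or later.

```python
import math


def series_reliability(component_reliabilities):
    """Rs = product of Ri: every component must work."""
    return math.prod(component_reliabilities)


def parallel_reliability(component_reliabilities):
    """Rp = 1 - product of Fi, with Fi = 1 - Ri: one working component suffices."""
    return 1.0 - math.prod(1.0 - r for r in component_reliabilities)


components = [0.9, 0.9, 0.9]  # hypothetical component reliabilities at time t
rs = series_reliability(components)
rp = parallel_reliability(components)
print("series:", rs)
print("parallel:", rp)
```

Adding components lowers the reliability of a series system and raises that of a parallel system, which is why redundancy is built from parallel structures.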
2.11 Failure mechanisms
Failure mechanisms describe the physical and electrical causes for faults. They can be divided into 3 classes:
1. Electrical stress
Poor design leading to electrical overstress, or careless handling causing static damage
2. Intrinsic failure mechanisms
Inherent to the semiconductor material itself.
Examples: Crystal defects, dislocations and processing defects
3. Extrinsic failure mechanisms
Originate in the packaging and interconnection process
Examples: Poor bonding, corrosion, etc.
2.12 Failure mechanisms
Failure mechanism class: Electrical stress
• Electrical overstress
• Electrostatic discharge
Failure mechanism class: Intrinsic failure mechanisms
• Gate oxide breakdown
• Ionic contamination
• Surface charge spreading
• Charge effects
– Slow trapping
– Hot electrons
– Secondary slow trapping
• Piping
• Dislocations
Failure mechanism class: Extrinsic failure mechanisms
• Packaging
• Metallization
– Corrosion
– Electromigration
– Contact migration
– Microcracks
• Bonding (purple plague)
• Die attachment failure
• Particle contamination
• Radiation
– External
– Intrinsic
3.1 Classification of tests
A test is a procedure which allows one to distinguish
between good and bad parts
Tests can be classified according to:
1. The technology they are designed for
2. The parameters they measure
3. The purpose for which the test results are used
4. The test application method
3.2 Technology aspects
The type of test depends heavily on the technology of the circuit to be tested:
1. Analog tests
The domain of input and output signal values is analog; i.e., they can take on any value within a given range (Ex.: a range of 0 – 5 V)
Analog tests aim at determining the values of analog parameters such as voltage and current levels, frequency response, bandwidth, etc. The generation of the input stimuli and the measurement of the responses are inherently imprecise. Therefore, a range of values is used to determine the operational correctness
2. Digital tests
The input and output signals are digital (0 or 1); hence, precise. These tests are called logical or digital tests.
3. Mixed signal tests
The domain of either the input or the output values is analog, while the other is digital. Typically used for testing digital-to-analog and analog-to-digital converters
3.3 Measured parameter aspects
The nature of the measured parameter can be:
1. Logical: Logical tests aim at detecting faults causing a change in the logical behavior of the system (a 0 is expected, while a 1 is measured)
2. Electrical: Electrical tests measure the values of electrical parameters (voltage and current levels) as well as their behavior over time; they can be divided into Parametric and Dynamic tests
Parametric tests are concerned with the external behavior of the circuit
Ex.: Voltage & current levels & delays on the input & output pins
– DC parametric tests are concerned with the time-independent properties of the input and output values
IDDQ tests are a special class of DC parametric tests; they are concerned with the leakage currents during the quiescent state of the circuit
– AC parametric tests are concerned with the time-dependent properties of the input and output values
Dynamic tests aim at faults which are time-dependent and internal to the chip
3.4 Purpose of test results
The most obvious use of the test results is to distinguish between good and bad parts. This can be done with a test which detects faults. In case of repair, a test capable of locating faults is required.
Testing can be done during normal use of the system; this is referred to as concurrent testing. For example, parity checking is a simple form of concurrent testing. Alternatively, non-concurrent tests cannot be performed during normal use of the system, because they do not preserve the application data. However, they usually have a higher fault detection capability.
Design-for-Testability (DFT) includes extra circuitry on the to-be-tested chip; it allows non-concurrent tests to be performed faster and/or with a higher fault coverage.
Built-in-Self Test (BIST) includes extra circuitry on the to-be-tested chip, to the extent that the complete test function can be performed on chip, without external tester support.
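Parity checking, mentioned above as a simple form of concurrent testing, can be sketched in a few lines. The example assumes even parity over a small data word; the helper names and the bit pattern are invented for illustration and do not come from the slides.

```python
def add_parity(bits):
    """Append an even-parity bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Check a word (data bits plus parity bit) without modifying it."""
    return sum(word) % 2 == 0


def flip(word, index):
    """Return a copy of the word with one bit inverted (models a bit error)."""
    copy = list(word)
    copy[index] ^= 1
    return copy


stored = add_parity([1, 0, 1, 1])
single_error = flip(stored, 2)
double_error = flip(single_error, 0)
print(parity_ok(stored), parity_ok(single_error), parity_ok(double_error))
```

The check only reads the stored word, so it preserves the application data and can run during normal operation. The double-error case passes the check, which illustrates the limited fault detection capability of such a concurrent test.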
3.5 Test application methods
Tests can also be classified according to the way the test stimuli are applied and the test responses are evaluated
• External test: Automatic Test Equipment ‘ATE’ is used to apply the test stimuli and evaluate the test responses
At the board level the stimuli can be applied:
– Via the regular board connectors
Allows for a simple interface with the ATE and for at-speed testing. However, not all circuits are easy to reach. Manual test program design is required; the resulting tests are called functional tests
– Via a special fixture (set of connectors)
That way each component's pins become accessible. Structural tests, which can be generated automatically, can now be used.
• Internal test (BIST)
The ATE function is completely integrated on the to-be-tested chip. This requires extra silicon area; however, no ATE is required and the chip can be tested at speed.
4.1 Fault coverage requirements (1)
Given a chip with potential defects, the question can be raised of how extensive the tests have to be.
This question can be answered in terms of the chip's defect level and the yield of the fabrication process.
• Defect Level ‘DL’ is the fraction of the parts passing all tests that are nevertheless bad
– Values for DL are usually expressed in Parts Per Million ‘PPM’
• Process Yield ‘Y’ is the fraction of the manufactured parts that is fault free. The exact value is hard to establish. Therefore, Y is approximated as follows:
$$Y = \frac{\#\text{ of not-defective parts}}{\text{total } \#\text{ of parts}}$$
• Fault Coverage ‘FC’ is a measure of the quality of a test. It is defined as:
$$FC = \frac{\#\text{ of actual detected faults}}{\text{total } \#\text{ of faults}}$$
In practice it is impossible to have a complete test (FC=1), because of:
1. Imperfect fault modeling: An actual fault may not correspond with a modeled fault
2. Data dependency of faults (e.g., the carry function in an ALU)
3. Testability limitations (e.g., ATE pin and/or speed limitations)
4.2 Fault coverage requirements (2)
Because tests may not be complete, a defective chip may pass the tests.
Assume that a chip has exactly n Stuck-At Faults ‘SAFs’
– A SA0 fault causes a 0 value on a line; a SA1 fault causes a 1 value
Let m be the number of detected faults (m ≤ n)
Assume that the probability of a fault is independent of the occurrence of another fault (i.e., there is no fault clustering) and that all faults are equally likely with probability p
Assume that: A is the event that a part is free of defects, and B that a part has been tested for m defects while none were found. Then:
• The Fault Coverage of a test is defined as:
$$FC = m/n$$
• The Process Yield is defined as:
$$Y = (1-p)^n = P(A)$$
• The probability of B, and of A given B, is:
$$P(B) = (1-p)^m, \qquad P(A \cap B) = P(A) = (1-p)^n$$
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{(1-p)^n}{(1-p)^m} = (1-p)^{n(1-m/n)} = Y^{(1-FC)}$$
• DL can now be expressed as:
$$DL = 1 - P(A \mid B) = 1 - Y^{(1-FC)}$$
4.3 Fault coverage requirements (3)
DL is expressed as (see figure):
$$DL = 1 - P(A \mid B) = 1 - Y^{(1-FC)}$$
For large values of Y (i.e., a manufacturing process with a high yield), it approaches a straight line
Example: Assume a manufacturing process with Y = 0.5 and a FC = 0.8, then:
$$DL = 1 - 0.5^{(1-0.8)} = 0.1295$$
This means that 12.95% of the shipped parts are defective!
If a DL = 200 PPM (i.e., DL = 0.0002) is required, given Y = 0.5, then:
$$FC = 1 - \frac{\log(1-DL)}{\log Y} = 0.99971$$
This is a FC of 99.971%
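Both directions of the defect-level relation, DL from (Y, FC) and the required FC from (Y, DL), can be evaluated directly. The sketch below uses the example values from the slide (Y = 0.5, FC = 0.8, DL = 200 PPM); the function names are made up for illustration.

```python
import math


def defect_level(process_yield, fault_coverage):
    """DL = 1 - Y^(1 - FC): fraction of shipped parts that are defective."""
    return 1.0 - process_yield ** (1.0 - fault_coverage)


def required_fault_coverage(process_yield, target_defect_level):
    """FC = 1 - log(1 - DL) / log(Y), obtained by solving DL = 1 - Y^(1 - FC) for FC."""
    return 1.0 - math.log(1.0 - target_defect_level) / math.log(process_yield)


dl = defect_level(0.5, 0.8)
fc = required_fault_coverage(0.5, 200e-6)
print("DL in PPM:", dl * 1e6)
print("required FC:", fc)
```

The ratio of logarithms does not depend on the base, so natural logarithms are used here. The function is undefined for Y = 1, where every part is fault free and no fault coverage is needed.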
5.1 Test economics
Repair cost during the product phases
A move from one product phase to the next causes the volume of parts and the test & repair cost to increase by a factor of 10
This is the rule-of-ten
Economics and liability of testing. Good tests
• reduce test & repair cost (see above rule-of-ten)
• can reduce development time & time-to-market
• can reduce field maintenance costs
• reduce personal injury and lawsuits
There is an optimum in test development cost and its contribution to profit: Too many tests require a long test development time and a high test cost
5.2 Total profit
The life time of a product has several economic phases
• The development phase
– Product design takes place
– No income; only expenses
– Area under zero-line is development cost
• The market growth phase
– Market acceptance increases with time
• The market decline phase
– Product becomes less attractive
– Market share decreases
– Price may have to be reduced
The total profit over the life time of a product is the area above the zero-line (revenue) minus the area below the zero-line (development cost)
In case of a delay ‘D’ in product development, the development cost is higher, while the revenue is reduced, because the obsolescence point will not change
5.3 Product development delay cost
Assuming M is the maximum market growth, which is reached after time W, the revenue lost due to a delay D (hatched area) can be computed as follows:
• The Expected Revenue ‘ER’ is:
$$ER = \frac{1}{2} \cdot 2W \cdot M = W \cdot M$$
• The Revenue of the Delayed Product ‘RDP’ is:
$$RDP = \frac{1}{2} \cdot (2W - D) \cdot \left(\frac{W - D}{W} \cdot M\right)$$
• The Lost Revenue ‘LR’ is:
$$LR = ER - RDP = W \cdot M - \frac{2W^2 - 3DW + D^2}{2W} \cdot M = ER \cdot \frac{D\,(3W - D)}{2W^2}$$
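The lost-revenue formula can be checked against the difference ER - RDP. The sketch below assumes the triangular market model described above; the values M = 100 (revenue units per month), W = 12 months and D = 3 months are invented for illustration, as are the function names.

```python
def expected_revenue(m, w):
    """ER = 1/2 * 2W * M: area of the on-time revenue triangle."""
    return 0.5 * (2 * w) * m


def delayed_revenue(m, w, d):
    """RDP = 1/2 * (2W - D) * ((W - D) / W * M): area of the delayed triangle."""
    return 0.5 * (2 * w - d) * ((w - d) / w * m)


def lost_revenue(m, w, d):
    """LR = ER * D * (3W - D) / (2 * W^2)."""
    return expected_revenue(m, w) * d * (3 * w - d) / (2 * w ** 2)


M, W, D = 100.0, 12.0, 3.0  # hypothetical market peak, half market window, delay
er = expected_revenue(M, W)
rdp = delayed_revenue(M, W, D)
lr = lost_revenue(M, W, D)
print(er, rdp, lr, lr / er)
```

In this example a delay of a quarter of W costs roughly a third of the expected revenue, because the delayed product both starts later and peaks lower while the obsolescence point stays fixed.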
5.4 Life-cycle cost
The cost of a product over its life time consists of:
1. The design cost
This typically is on the order of 5% of the product cost
2. The manufacturing cost
This is the cost associated with the production and sales of the product
3. The maintenance cost
The cost associated with repair, calibration, etc.
This may be the largest cost factor
Note: Product life can be 30 years, e.g., for a telephone exchange