Engineering the right accelerated life tests for reliability qualification: customer use conditions vs. industry
standards based approaches
Presenter: Sudarshan Rangaraj, Hardware Reliability Manager – Amazon Lab126, Sunnyvale CA
Based largely on papers authored at IRPS, IITC, ECTC and review of literature
Acknowledgements: current and former colleagues at Intel and Amazon Lab126
Motivation and Relevance
• Industry standards, e.g. JEDEC, AEC, MIL, provide qualification criteria, e.g.
  – HAST: 130C/85% RH/96 hours, 0 fails / 45 tested
  – Bake: 150C, 1000 hours
• Blanket qualification criteria without knowledge of product use conditions (UC) can be undesirable:
  – Over-design: extra cost for reliability margin most customers will not use
  – Field failures: negative for user experience and company brand
• Goal of reliability engineering:
  – Start with the customer
  – Use field intelligence to develop UC models, compare them to standards
  – Strive to meet the higher bar; reliability can be a marketing advantage!
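As a worked illustration of what a "0 fails / 45 tested" criterion demonstrates statistically, the sketch below computes the one-sided upper confidence bound on the failing fraction for a zero-failure sample, using the standard binomial relation (1 - p)^n = 1 - CL. The 90% confidence level and the function name are assumptions for illustration; the slides do not state them.

```python
def zero_fail_upper_bound(n_tested: int, confidence: float = 0.90) -> float:
    """Upper bound on the failing fraction when 0 of n_tested units fail.

    Solves (1 - p)**n_tested = 1 - confidence for p (binomial, zero failures).
    """
    if n_tested <= 0 or not 0.0 < confidence < 1.0:
        raise ValueError("need n_tested > 0 and 0 < confidence < 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_tested)


# 45 units, zero fails; the 90% confidence level is an assumed value.
bound_45 = zero_fail_upper_bound(45)
print(f"0/45 fails bounds the failing fraction below {bound_45:.1%}")
```

Under that assumption, a passing result only bounds the failing fraction at the stress condition; translating it into field life still requires an acceleration model and a use condition.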
Advantages of standards based testing
• Allows suppliers and their customers to speak a common language
• Helps overcome differences in reliability certification methodology, helps clarify expectations
• Guarantees a consistent reliability bar
• Valuable in well established industries
Importance of understanding usage conditions
• A robust reliability qualification process protects the customer, i.e. ensures sufficient reliability, while optimizing cost for the manufacturer
• Three elements of robust reliability engineering:
  1. Quantified understanding of customer usage patterns and use conditions
  2. Well designed accelerated life tests
  3. Acceleration models (of sufficiently high confidence) that link the two
• Pitfalls of not making an accurate link between stress and use conditions:
  – Over-design: added cost and impact to the bottom line
  – Under-design: high customer returns; poor experience erodes brand
Talk outline
• Overview of common failure mechanisms in IC components
• Analysis of field use condition data: review of one example
• Contrast use condition knowledge based qualification to standards based qualification using 2 case studies:
  1. Moisture and voltage bias induced failures in IC components
  2. Temperature cycling failures in IC components
IC component – package stack-up
[Figure: cross-section of the chip and package stack-up. Labels: Silicon substrate; Devices: front-end; Metals/via: back-end with ultra low-k ILD; Metals/via: far-back-end with polymer ILD; Bumps: C4 with Cu – Pb-free solder; Package: metals/via. Images from proceedings of IITC 2013]
Some common failure modes in IC components and associated extreme use conditions
# | Reliability failure mechanism | Extreme use condition
1 | Front end: transistor gate dielectric reliability | High power states at high voltage, frequency, temperature and current
2 | Backend: dielectric breakdown | (shared with row 1)
3 | Backend & bumps: electromigration | (shared with row 1)
4 | Backend: stress voiding | Sustained operation at high temperatures
5 | Moisture ingress: de-lamination, electro-chemical corrosion, metal migration, pop-corning etc. | Low power modes like OFF/STAND-BY; high humidity and temperature ambient conditions, e.g. 25C 80% RH
6 | Temperature cycling: cracking and de-lamination | Repeated cold temperature exposures when part may be OFF; power cycles when part is ON
• Dominant failure modes for an IC used in a server, cell-phone and a wearable device will be very different because usage is different!
Chip operating states
[Figure: effective RH vs. temperature at the part surface. Labeled states: OFF state (low T, high RH); STAND-BY (higher T, lower RH); ON state (high T, low RH)]
OFF and STAND-BY modes are critical states for moisture absorption into chip/package: highest RH at part surface
• OFF mode: chip and package at ambient T, ambient RH at part surface
• STAND-BY mode: ambient T + self-heating (~10C) from a few “always ON” IO pins
• ON state: chip at high T, low RH at the part surface
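The trend across the three states can be sketched numerically. The example below assumes the water vapor pressure at the part surface equals that of the ambient air, so the local RH scales with the ratio of saturation vapor pressures; it uses the Magnus approximation for saturation pressure. Both the approximation and the 70C ON-state temperature are assumptions for illustration, not values from the slides.

```python
import math


def saturation_vapor_pressure_hpa(temp_c: float) -> float:
    """Magnus approximation for saturation vapor pressure over water (hPa)."""
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))


def effective_rh(ambient_rh_pct: float, ambient_c: float, part_c: float) -> float:
    """RH at a surface at part_c, assuming ambient vapor pressure is unchanged."""
    scale = saturation_vapor_pressure_hpa(ambient_c) / saturation_vapor_pressure_hpa(part_c)
    return ambient_rh_pct * scale


# OFF: part at ambient. STAND-BY: ~10C self-heating (from the slide).
# ON: 70C is a hypothetical die temperature chosen only for illustration.
for state, part_c in [("OFF", 25.0), ("STAND-BY", 35.0), ("ON", 70.0)]:
    rh = effective_rh(80.0, 25.0, part_c)
    print(f"{state}: part at {part_c:.0f}C, effective RH ~ {rh:.0f}%")
```

Even modest self-heating lowers the RH seen by the part noticeably, which is consistent with OFF being the most severe state for moisture uptake.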
Use conditions by product segment: risk from moisture
Market segment | ON time as fraction of product lifetime | OFF/STAND-BY events, durations | Ambient environments
Servers, high performance computing & high end desktop | Very large | Very few events of short duration | Controlled T, RH in data centers and server farms
Desktop enterprise | Lower | Sizeable | Indoor T, RH
Mobile - laptop | Lower | Sizeable number of longer duration events | Some outdoor T, RH exposure; worse in hot humid GEOs
Ultra-mobile: tablet, smartphone | Lower | Sizeable number of longer duration events | Often outdoor T, RH; worse in hot humid GEOs
Wearables/IoT | A new set of applications, still being understood?
Arrow alongside the table: increasing moisture risk
Events leading to moisture exposure
• Packaging/assembly operations on the factory floor
• Customer warehouses during storage
• Customer factories during surface mount
• Usage by end customer especially in hot + humid locations
Failure modes due to moisture and temperature cycling
[Figure, labeled "blister": package blistering and cracking between copper traces after surface mount onto the system motherboard, a.k.a. “pop-corning”] [Literature]
[Figure: edge de-lamination after temp-cycle B (125 to -55C) on very early 22nm silicon process] Proceedings of ECTC 2013
Moisture diffusion under a 25C 80% RH ambient exposure
[Figure: finite element modeling of moisture diffusion into chip and package at 25C 80% RH. Contour plots after 7 days and 50 days; plot of C/Csat vs. time (days) for diffusion through underfill and through PKG]
• Under sustained exposure, moisture is confined to the outer 1mm at the edge of the chip/package
• Consistent with empirical failure observations
Mining use conditions: data collection and analysis
• Customer profile data from ~2000 worldwide laptop users for one year
• OFF (shutdown), STAND-BY and HIBERNATE times recorded; data used to generate distributions
• Distributions combining all data from all users
• Distribution of Max{OFF time} and Max{STAND-BY time} per user
Format of user data:
User ID | OFF time | STAND-BY time
1 | {-, -, -, …} | {-, -, -, …}
2 | {-, -, -, …} | {-, -, -, …}
3 | {-, -, -, …} | {-, -, -, …}
…
2123 | {-, -, -, …} | {-, -, -, …}
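A minimal sketch of how the two kinds of distribution could be built from records in this format. The per-user data here is synthetic (exponentially distributed, seeded) and stands in for the real logs; the variable names and the nearest-rank percentile definition are assumptions, not the study's method.

```python
import math
import random


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Synthetic stand-in for the per-user OFF-time records (hours); NOT the study's data.
random.seed(0)
off_times_by_user = {
    user_id: [random.expovariate(1 / 12.0) for _ in range(200)]
    for user_id in range(1, 101)
}

# Distribution 1: pool every OFF event from every user.
pooled = [t for times in off_times_by_user.values() for t in times]

# Distribution 2: one value per user, that user's longest OFF event.
max_per_user = [max(times) for times in off_times_by_user.values()]

print("pooled 99th percentile (h):", round(percentile(pooled, 99), 1))
print("max-per-user 95th percentile (h):", round(percentile(max_per_user, 95), 1))
```

The pooled distribution describes a typical event, while the max-per-user distribution describes the worst exposure any given unit sees, which is why the latter suits a conservative worst case.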
Moisture exposure in use condition: user data
[Figure: cumulative probability vs. time (hours), non-S0 duration distribution, all data from 2000 users]
[Figure: cumulative probability vs. time (days), Max{OFF time}, i.e. 100th percentile per user]
Annotated percentiles: 99th percentile 4 days; 99.5th percentile 7 days; 95th percentile 50 days
Standby/Off times: Nominal = 7 days, Worst case = 50 days
Conservative ambient condition: 25C 80% RH; 20% of cities in the world experience this for 5% of the year, i.e. a 95th percentile condition from surveys
Phenomenological Acceleration Model for dominant moisture induced chip – package failure modes
Variable | Range used in study | Acceleration factor
Temperature | 85 – 130C | Ea = 0.71 eV (90% CL lower bound)
RH | 65 – 85% | n = 4 (best estimate)
Voltage (V) | 1.2 – 3.3V | m = 0.5 (best estimate), Vt = 1.4V
Peck’s law fits empirically observed HAST fails
• Temperature – strongest variable
• Relative humidity and voltage – relatively weaker effects
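For orientation, the temperature and humidity parts of a Peck-type model can be sketched as below, using the common form AF = (RH_stress/RH_use)^n · exp[(Ea/k)(1/T_use − 1/T_stress)] with the slide's Ea and n as defaults. The voltage term is omitted because the slides give m and Vt but not the functional form, and a single Ea is assumed across the whole temperature range, so this sketch is not expected to reproduce the stress durations quoted elsewhere in the talk.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5


def peck_acceleration_factor(stress_c, stress_rh, use_c, use_rh, ea_ev=0.71, n=4.0):
    """Temperature-humidity acceleration factor (voltage term omitted).

    AF = (RH_stress / RH_use)**n * exp(Ea/k * (1/T_use - 1/T_stress)), T in kelvin.
    """
    t_use_k = use_c + 273.15
    t_stress_k = stress_c + 273.15
    humidity_term = (stress_rh / use_rh) ** n
    thermal_term = math.exp(
        ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use_k - 1.0 / t_stress_k)
    )
    return humidity_term * thermal_term


# Stress conditions at 85% RH against a 25C / 80% RH use condition.
for stress_c in (85.0, 110.0, 130.0):
    af = peck_acceleration_factor(stress_c, 85.0, use_c=25.0, use_rh=80.0)
    print(f"{stress_c:.0f}C/85%RH vs 25C/80%RH: AF ~ {af:.0f}")
```

Note that applying the model at 25C extrapolates below the 85 – 130C range used in the study, so the result is only as good as the assumption that the same mechanism and Ea hold down to use temperature.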
Accelerated life testing: failure rate data for a “typical” failure mode
[Figure: Probability Plot (Fitted Arrhenius, Fitted Ln) for start readout; Lognormal, Arbitrary Censoring, ML Estimates; percent vs. time to failure]
Table of Statistics:
temp (C) | RH (%) | Loc | Scale | AD*
85 | 85 | 7.11990 | 0.658992 | 25.184
110 | 85 | 5.50666 | 0.658992 | 14.916
130 | 65 | 5.92739 | 0.658992 | 55.509
130 | 85 | 4.36013 | 0.658992 | 10.373
[Figure: Relation plot (Temp vs MTTF); MTTF (Hr) on a log scale vs. Temp (C), with lines for Ea=0.44, Ea=0.71 and Ea=1.1 (series EA-1, EA-2, EA-AVG) extended to the use condition, UC: 25C]
• Thermal acceleration differs between the 130 – 110C and 110 – 85C ranges
• Epoxy glass transition ~120C; moisture diffusion is over-accelerated above 120C
• Stressing recommended below the glass transition of packaging polymers; T < TG is what is relevant for the use condition anyway
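To show why the choice of activation energy matters so much when extrapolating to the 25C use condition, the sketch below evaluates a plain Arrhenius acceleration factor from 130C to 25C for the three Ea values shown on the relation plot. The function name is hypothetical, and the sketch assumes a single Ea holds over the whole range, which the bullets above caution is not the case.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5


def arrhenius_af(stress_c: float, use_c: float, ea_ev: float) -> float:
    """Arrhenius acceleration factor between a stress and a use temperature (Celsius)."""
    t_use_k = use_c + 273.15
    t_stress_k = stress_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use_k - 1.0 / t_stress_k))


# Ea values taken from the plot legend; 130C stress extrapolated to the 25C use condition.
for ea in (0.44, 0.71, 1.1):
    print(f"Ea = {ea} eV: AF(130C -> 25C) ~ {arrhenius_af(130.0, 25.0, ea):.3g}")
```

The spread covers several orders of magnitude, so an Ea fitted partly above the glass transition can badly overstate how much field time a given stress duration represents.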
HAST stress durations: use conditions vs. JEDEC JESD22-A110 standard requirements
Stress condition | Nominal: stress time equivalent to 7 days at 25C 80% RH (hrs) | Worst case: stress time equivalent to 50 days at 25C 80% RH (hrs) | JEDEC JESD22-A110 equivalent readout (hrs)
130C 85% RH | <1 | 5.7 | 96
110C 85% RH | 2.5 | 18 | 264
85C 85% RH | 17 | 121 | 1000
• Conservative worst case (50 days @ 25C 80% RH): JEDEC readout durations are more than 8 times longer than the use condition based requirements
• Intel uses a “test to fail” approach during process development. These gating readouts go beyond use condition based requirements
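The margin between the JEDEC readouts and the worst case use condition equivalents follows directly from the table values. A small sketch of that arithmetic (the dictionary layout and names are illustrative only):

```python
# Stress condition -> (hours equivalent to 50 days at 25C/80% RH, JEDEC readout hours),
# values copied from the table on this slide.
hast_table = {
    "130C 85% RH": (5.7, 96),
    "110C 85% RH": (18, 264),
    "85C 85% RH": (121, 1000),
}

# Ratio of the JEDEC readout duration to the worst-case use-condition equivalent.
ratios = {cond: jedec / worst for cond, (worst, jedec) in hast_table.items()}

for cond, ratio in ratios.items():
    print(f"{cond}: JEDEC readout is {ratio:.1f}x the worst-case use equivalent")
```

The smallest margin is at 85C 85% RH, which is where the "more than 8 times" figure comes from; the margin is larger at the hotter stress conditions.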
Some thoughts about temperature cycling
[Figure: JEDEC standard for temp-cycle]
Most common: TCB 125 to -55C, 700 cycles
• Having to demonstrate reliability down to -55 or -65C may require a trade-off between reliability and performance/yield:
  – Dielectric constant (electrical performance) vs. fracture toughness
  – Epoxy flow characteristics vs. fracture toughness
Some examples of cold-side effects: material response
[Figure: crack driving energy (normalized) vs. cold side temperature (C)] Crack driving energy (F.E. modeling) rises sharply below -20C
[Figure: strain to fail vs. temperature (C), at -55, -30 and 25C] Measured strain-to-fail drops 2X from 25C to -55C for passivation polymer
Solder fracture toughness drops precipitously below -25C [Literature]
If T < -25C is not relevant to the use condition of the component, then by using TCB for qualification we might be solving problems that are not relevant to customer usage
Risk of over or under-assessing field reliability
Number of cycles at various operating ΔT equivalent to TCB 700 cycles (JEDEC standard)
A simple temp-cycle model (Coffin-Manson):
$N_{f1} / N_{f2} = (\Delta T_2 / \Delta T_1)^n$
where $N_f$ is the number of cycles to failure at temperature swing $\Delta T$ = [Tmax-Tmin] and $n$ is the Coffin-Manson exponent
• For an always-ON server in a controlled environment, TCB 700 cycles may be overkill
  – No cold exposures; -55C is not relevant
  – At a ΔT of 50C, TCB 700 represents 10 – 50 cycles/day for 5 years
• For a part that may be used in a COMMS application with outdoor exposure in Alaska and a 10 year life requirement, TCB 700 under-assesses field reliability
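The "10 – 50 cycles/day" figure can be checked with the Coffin-Manson relation, taking the TCB swing as 125 − (−55) = 180C and a 50C field swing (the slide's DT, i.e. Tmax − Tmin). The slides do not state the exponent; the values 2.5 and 3.8 below are assumptions chosen because they roughly bracket the quoted range, and the function names are hypothetical.

```python
def equivalent_cycles(ref_cycles, ref_delta_t, use_delta_t, exponent):
    """Cycles at use_delta_t doing the same damage as ref_cycles at ref_delta_t.

    Coffin-Manson: N_use / N_ref = (dT_ref / dT_use) ** exponent.
    """
    return ref_cycles * (ref_delta_t / use_delta_t) ** exponent


TCB_CYCLES = 700
TCB_DELTA_T = 125.0 - (-55.0)  # 180C swing for temp-cycle B
DAYS_IN_5_YEARS = 5 * 365


def cycles_per_day(exponent, use_delta_t=50.0):
    """Field cycles per day over 5 years that TCB 700 is equivalent to."""
    total = equivalent_cycles(TCB_CYCLES, TCB_DELTA_T, use_delta_t, exponent)
    return total / DAYS_IN_5_YEARS


# Exponents are assumed values, not taken from the slides.
for n in (2.5, 3.8):
    print(f"n = {n}: ~{cycles_per_day(n):.1f} cycles/day for 5 years")
```

Because the result scales as a power of the swing ratio, the same TCB 700 test that is generous for a 50C swing becomes much less so as the field swing approaches the test swing, as in the cold outdoor case.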
[Figure: equivalent number of cycles vs. [Tmax-Tmin], with regions labeled Desktop & Servers and Highly mobile devices, and an example use condition requirement marked]
Key messages
It is important to pick stress conditions relevant to worst case usage, to avoid artifacts (e.g. embrittlement) that are not relevant to worst case use
Standards offer a guideline or starting point. Qualification plans should be based on knowledge of use conditions
Limiting failure modes in the components that comprise a system will likely be very different for various applications; standards don’t directly address that