Reliability as a challenge & opportunity to technology scaling
Jose Maiz
Fellow, Technology and Manufacturing Group
Director of Logic Technology Quality & Reliability
2007 Salishan HPC Conference
April 26, 2007
Salishan HPC 2007, J. Maiz 2
Key Messages
Technology scaling continues according to Moore’s Law
– 2X increase in functionality every 2 years
– In the form of cores, integrated functionality or both
– 65nm in 2005, 45nm in 2007, 32nm in 2009
Technology & Reliability Challenges are many, but so are the opportunities
– Many new device types and materials
– A challenge as well as an opportunity
High RAS will require global fault management strategies along with robust circuit design
– Better understanding needed on RAS requirements
– Research and cost effectiveness of proposed options
Technology scaling on a 2 Year Cycle
180 nm (1999): 200mm, 100nm LG, CoSi2, 6 Al, SiOF
130 nm (2001): 300mm, 70nm LG, CoSi2, 6 Cu, SiOF
90 nm (2003): 300mm, 50nm LG, NiSi, Strain Si, 7 Cu, Low-k
65 nm (2005): 300mm, 35nm LG, NiSi, 2nd Strain, 8 Cu, Low-k
45 nm (2007): Details Coming!
[Images: Transistor and Interconnect cross-sections for each node; transistor image labels: SiGe]
Courtesy: Intel
Moore’s Law Delivers Value to the End User
With permission from IC Knowledge LLC
Twice the functionality at the same cost every 2 years
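As a rough illustration of the doubling claim, the sketch below computes the functionality multiple at constant cost after a given number of years. The function name and the choice of 65nm (2005) as the baseline are assumptions made for this example; only the 2-year doubling period and the node years come from the slides.

```python
def functionality_multiple(years: float, doubling_period_years: float = 2.0) -> float:
    """Relative functionality at constant cost after `years` of scaling."""
    return 2.0 ** (years / doubling_period_years)


# Node introduction years as listed on the Key Messages slide.
BASE_YEAR = 2005  # assumed baseline (65nm) for this illustration
for node, year in [("65nm", 2005), ("45nm", 2007), ("32nm", 2009)]:
    multiple = functionality_multiple(year - BASE_YEAR)
    print(f"{node} ({year}): {multiple:.0f}x the functionality of 65nm at the same cost")
```

Read the other way, the same relation says the cost per function halves every doubling period.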
Performance/ Watt
Performance/watt improvement for both integer and floating point
[Figure: Trend in Performance/ Watt relative to i386. MIPS/Watt (relative), log scale from 1 to 1000, vs. year, 1985 to 2006. Series: Int MIPS/W, FP MIPS/W. Processors marked: i386, i486, P5, P6, PII, PIII, P4, Core, Core® 2.]
Lead in 45nm Technology and Products
153 Mbit SRAM
– 0.346 μm2 cell
– 119 mm2 chip size
– >1B transistors
– Functional in Jan 2006
Intel® Penryn Core™2 family processor
– 410/820 M transistors (2C/4C)
– World’s first working 45 nm CPU
45 nm Yield Improvement Trend
Excellent yield learning and good reliability too
On track for production ramp in 2H ‘07
[Figure: Defect density (log scale) vs. year, 2000 to 2008, with one curve each for 0.13 um, 90 nm, 65 nm and 45 nm; annotation: 2 years.]
SRAM Cell Size Trend
The trend continues
M. Bohr, Intel
45nm 6-transistor SRAM cell area of 0.346 μm2
High-k + Metal Gate Transistors
[Figure: Gate Leakage scaling. Gate leakage (A/cm2), log scale from 0.01 to 1000, vs. electrical Tox (nm), 1 to 2.2. Series: SiO2, HK.]
[Diagram: transistor gate stack with Low Resistance Layer, Work Function Metal (different for NMOS and PMOS), High-k Dielectric (hafnium based) and Silicon Substrate.]
Integrated 45 nm CMOS process
– High performance
– Low leakage
– Meets reliability requirements
– Manufacturable in high volume
                High-k vs. SiO2    Benefit
Capacitance     60% greater        Faster transistors
Gate Leakage    >100x reduction    Cooler chips
Very High Innovation Rate
Platform Integration
– Power & form factor optimization
Chip Architecture
– Efficient Performance/Power with Core™2
– MultiCore
– Monolithic integration of Graphics, Mem. Controller etc.
Transistor Architecture
– Novel transistor architectures for HighK-MG
– TriGate Xtors and III-V integration in the future
Materials
– HighK-MG Xtors for performance & Low Power
– LowK ILDs for interconnect
– Novel materials for strain and electrical performance
Many Reliability Challenges
• Increased Electric Fields
• The shrinking Vmax-Vmin window
• The development of robust High K/ Metal Gate transistors
• Dimensional scaling of interconnects and their liners
• Thermo-Mechanical limitations of very LowK ILDs
• Soft Errors
• Defectivity with scaled technology
• Transient and intermittent errors
• Fault tolerance
Innovation needed more than ever
Gate Dielectric Field Trend
[Figure: Gate Dielectric Electric Field. Efield (MV/cm), 0 to 10, vs. technology node (nm): 1000, 700, 500, 350, 250, 180, 130, 90, 65, 45. Series: SiON, SiON w HK, HK.]
Substantial increases in Efield enabled by HK/MG
[Diagram: HK+MG transistor with Low R Layer, Metal Gate and High-K dielectric over source (S) and drain (D) on a silicon substrate.]
Vmax-Vmin window challenge
Vmax: Scaling for density, performance & power
Vmin: Transistor variability & increased bit count
[Figure: Vmax-Vmin trend. Voltage (V) vs. technology node (nm): 180, 130, 90, 65, 45, 32, 25. Series: Vmax, SRAM Cell type 1, SRAM Cell type 2.]
Shrinking Operating Window
Transistor variability impacts Vmin
[Figure: Vt standard deviation, Sig(time0 VT) in mV, 0 to 40, vs. 1/sqrt(Gox Area) in um-1, 0 to 20. Series: 65nm, 45nm NMOS, 45nm PMOS. Annotation: Technology Scaling. Chart dated Dec-19-2006, S. Pae.]
• Due to random dopant fluctuation and other process parameters
• Develop design techniques that can handle variability
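The plot's x axis, 1/sqrt(gate oxide area), reflects the area-scaling rule often attributed to Pelgrom: the Vt standard deviation grows in proportion to 1/sqrt(W x L). A minimal sketch of that rule follows; the coefficient `A_VT` and the device sizes are purely hypothetical, since the slides give no numbers.

```python
import math


def sigma_vt_mv(width_um: float, length_um: float, a_vt_mv_um: float) -> float:
    """Vt standard deviation (mV) for a gate of the given width and length (um).

    Uses sigma = A_vt / sqrt(W * L). a_vt_mv_um is an assumed fitting
    coefficient in mV*um, not a value taken from the talk.
    """
    return a_vt_mv_um / math.sqrt(width_um * length_um)


A_VT = 2.0  # hypothetical coefficient, mV*um
for width, length in [(1.0, 1.0), (0.5, 0.5), (0.2, 0.05)]:
    x = 1.0 / math.sqrt(width * length)  # the plot's x axis, in um^-1
    sigma = sigma_vt_mv(width, length, A_VT)
    print(f"W={width} um, L={length} um: 1/sqrt(area)={x:.1f} um^-1, sigma Vt={sigma:.1f} mV")
```

Shrinking the gate area by 4x doubles the Vt spread under this rule, which is why small SRAM cells are the most exposed.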
Vmin impacted by bit count
Impacted by variability and defectivity
Improved process and cell upsizing helps
Need robust manufacturing process now and Fault Tolerance techniques in the future
[Figure: Defective bits vs Voltage. Number of defective bits in a 6 MByte cache, log scale from 1 to 1000, vs. Vcc; scale marker: 100 mV. Annotations: "Increased Bit count"; "Upsizing & improved technology".]
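One way to see why bit count raises Vmin is to treat each cell's minimum operating voltage as a random variable: the more cells there are, the further into the tail the worst cell sits. The sketch below assumes a Gaussian per-cell distribution, which is an assumption of this example and not something the slides state. The mean and sigma values are invented, and `n_bits` must be at least 2.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_bad_bits(vcc: float, n_bits: int, mean_v: float, sigma_v: float) -> float:
    """Expected number of cells whose minimum operating voltage exceeds vcc."""
    # By symmetry P(X > vcc) = P(X < 2*mean - vcc); this keeps precision in the tail.
    tail = NormalDist(mean_v, sigma_v).cdf(2.0 * mean_v - vcc)
    return n_bits * tail


def array_vmin(n_bits: int, mean_v: float, sigma_v: float) -> float:
    """Voltage at which the expected number of failing cells drops to one."""
    z = -STD_NORMAL.inv_cdf(1.0 / n_bits)
    return mean_v + z * sigma_v


MEAN_V, SIGMA_V = 0.60, 0.03  # hypothetical per-cell Vmin statistics, in volts
CACHE_BITS = 6 * 2**20 * 8    # a 6 MByte cache, as in the figure
for bits in (CACHE_BITS, 2 * CACHE_BITS, 4 * CACHE_BITS):
    print(f"{bits} bits: array Vmin ~ {array_vmin(bits, MEAN_V, SIGMA_V):.3f} V")
```

In this model a tighter sigma (improved process, upsized cells) lowers the array Vmin, while each doubling of the bit count pushes it up by a few millivolts.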
Transistor degrades during use
Slow but continuous process
Addressed by variation tolerant design and Frequency Guardbands at test
[Figure: PMOS NBTI ΔVT [mV], log scale from 10 to 100, vs. Log (time); slope n ~ 0.20.]
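A slope of n ~ 0.20 on a log-log plot corresponds to a power law, ΔVT proportional to t^n. The sketch below extrapolates from a single reference point under that assumption. The reference values (10 mV at 1 hour) and the function names are hypothetical; only the exponent comes from the slide.

```python
def delta_vt_mv(t: float, t_ref: float, delta_vt_ref_mv: float, n: float = 0.20) -> float:
    """Threshold shift at time t, scaled from a reference point by a power law."""
    return delta_vt_ref_mv * (t / t_ref) ** n


def time_to_shift(target_mv: float, t_ref: float, delta_vt_ref_mv: float, n: float = 0.20) -> float:
    """Time at which the shift reaches target_mv under the same power law."""
    return t_ref * (target_mv / delta_vt_ref_mv) ** (1.0 / n)


T_REF_HR, SHIFT_REF_MV = 1.0, 10.0  # hypothetical reference measurement
for hours in (1.0, 10.0, 100.0, 1000.0):
    print(f"{hours:7.0f} h: shift ~ {delta_vt_mv(hours, T_REF_HR, SHIFT_REF_MV):.1f} mV")
print(f"time to double the shift: {time_to_shift(20.0, T_REF_HR, SHIFT_REF_MV):.0f} h")
```

With n = 0.20, each 10x increase in time raises the shift by a factor of 10^0.2, about 1.58x, and doubling the shift takes 2^5 = 32 times longer. That is the sense in which the degradation is slow but continuous.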
… and so does Product operating frequency
Test Guardbands used to eliminate customer impact
[Figure: Fmax degradation in Burn-in. % Fmax shift vs. burn-in time (hr): 18, 48, 168, 336, 500; "End of life" marked.]
Transistor degradation
Process improvements are a must to counter the effect of increased E fields
[Figure: PMOS NBTI degradation. Vt-shift [-mV], 0 to 80, vs. E-field. Annotations: "Process improvements"; "Efield increases with scaling".]
Transistor architecture & materials are changing
Many new materials
– HighK/Metal Gates for gate leakage control and performance scaling: integration and reliability challenges
– Low K ILDs: thermomechanical risk may slow their introduction
– Lead free Bumps
Clever changes in planar transistors
– Strain, epitaxial Source/ Drain layers
Novel Transistor architectures like tri-gate
Exotic options explored: from Carbon Nanotubes and semiconductor nanowires to III-V compounds
Tri-Gate Transistors
[Diagram labels: Gate, Source, Drain, Gate Oxide, Channel]
• Transistor gate wraps around 3 sides of Si channel (Tri-Gate)
• Transistor channel is “fully depleted”, unlike normal bulk CMOS
• Fully depleted operation reduces leakage current by up to 10x
Increasing Electron Mobility
Increased electron mobility leads to higher performance and less energy consumption
n-Mobility of Si and compound semiconductors:
Si      1
GaAs    8
InAs    33
InSb    50
The challenge is integrating them with Silicon and improving Hole mobility
Scaling of the interconnect
[Figure: Cu Resistivity vs Line Width. Resistivity (ohm-μm), 0.015 to 0.040, vs. Copper CD (um), 0.05 to 0.65; 65nm marked. Source: Intel]
Effective resistivity increase due to:
– Cross section reduction due to barriers
– Increased scattering from grain boundaries and surfaces
Tough but manageable challenges
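The first cause, cross-section reduction due to barriers, can be estimated with simple geometry. The sketch below assumes a rectangular trench whose barrier liner covers both sidewalls and the bottom and carries no current. The barrier thickness, the aspect ratio and the copper resistivity used are illustrative values, not data from the talk, and the scattering contribution is not modeled.

```python
def effective_resistivity(rho_cu: float, width_um: float, height_um: float, barrier_um: float) -> float:
    """Effective line resistivity, in the same units as rho_cu.

    Assumes a barrier liner of thickness barrier_um on both sidewalls and the
    bottom of the trench, and that the liner does not conduct.
    """
    cu_width = width_um - 2.0 * barrier_um
    cu_height = height_um - barrier_um
    if cu_width <= 0 or cu_height <= 0:
        raise ValueError("barrier leaves no copper in the trench")
    return rho_cu * (width_um * height_um) / (cu_width * cu_height)


RHO_CU = 0.017     # ohm-um, roughly bulk copper; assumed for illustration
BARRIER_UM = 0.01  # hypothetical liner thickness
for width in (0.60, 0.30, 0.15, 0.08):
    height = 2.0 * width  # assumed aspect ratio of 2
    rho_eff = effective_resistivity(RHO_CU, width, height, BARRIER_UM)
    print(f"CD={width:.2f} um: effective resistivity ~ {rho_eff:.4f} ohm-um")
```

Because the liner thickness does not shrink as fast as the line, the lost fraction of the cross section grows as the copper CD decreases.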
Single Event Upsets
Transient errors that corrupt data but do not produce permanent damage (on limited doses)
– Charge burst that overwhelms a storage node
α particles in materials and atmospheric neutrons in terrestrial systems
Cosmic rays and heavy nuclei in space
– Orders of magnitude higher fluxes than at sea level
This is just one class of transient errors
– Others are noise related fails in the interconnect fabric etc.
Single Event Upsets: Cache cell
SEU errors due to neutrons from cosmic rays and α particles from residual impurities
Reduction in charge collection dominates over reduction in critical charge
[Figure: Cache Cell SEU Trend @900m. SEU/Bit (a.u.), 0 to 10, vs. technology (nm): 180, 130, 90, 65, 45. Series: α FIT, n FIT Terrestrial, Total FIT.]
Single event upsets: Multi-bit fails
Multi-bit errors are increasing as a proportion of fails
Expected consequence of increased charge sharing
[Figure: Multi-bit upset probability, log scale from 1E-5 to 1E-1, vs. cell to cell distance (μm), 0.1 to 10, for cache cells at 130 nm, 90 nm and 65 nm.]
Single Event Upsets: Logic latches
Similar trend starting for latches
45 nm results still preliminary
[Figure: Latch SEU Trend @900m. SEU/Latch (a.u.), 0 to 2, vs. technology (nm): 180, 130, 90, 65, 45, 32. Series: α FIT, n FIT terrestrial, Total FIT.]
Single Event Upset: Chip impact
Saturation for cache arrays
Getting there for logic. Perhaps in 32nm
[Figure: SEU Trend. SEU normalized to 130nm, log scale from 1 to 10, vs. technology (nm): 180, 130, 90, 65, 45, 32. Series: cache arrays, logic. Annotations: "Add features to reduce error rates"; "2X increase in RAM cell & latch count per generation".]
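The chip-level trend is the product of two factors: the per-bit (or per-latch) SEU rate and the number of cells, which doubles each generation. The sketch below combines them. The per-bit rates passed in are hypothetical and are not read off the charts.

```python
from typing import List


def chip_seu_relative(per_unit_rate: List[float], count_growth: float = 2.0) -> List[float]:
    """Chip-level SEU rate per generation, normalized to the first generation.

    per_unit_rate: per-bit (or per-latch) SEU rate for successive generations.
    count_growth: factor by which the cell or latch count grows each generation.
    """
    chip = [rate * count_growth ** gen for gen, rate in enumerate(per_unit_rate)]
    return [value / chip[0] for value in chip]


# Hypothetical per-bit trends over three generations.
print(chip_seu_relative([1.0, 0.5, 0.25]))  # per-bit rate halves each generation
print(chip_seu_relative([1.0, 0.8, 0.64]))  # per-bit rate falls more slowly
```

Chip-level saturation therefore requires the per-bit rate to fall about as fast as the count grows; when it falls more slowly, the chip-level rate keeps rising unless features are added to reduce error rates.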
Circuit contributions to SEU in a typical microprocessor
[Pie chart segments: Static combinational logic; Flip-flops; Residual unprotected memory]
Many protection options proposed
• Adding Parity/ ECC to logic arrays and Register Files
• Replication of functional units or cores
• Lock-step for cores or complete chips
• Residue Checking
• Redundant multithreading
• Fingerprinting
• Modified Scan latches
• Hardening of worst contributing latches
What is the goal that we are trying to meet?
What is the value proposition for HPC?
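To make the ECC option concrete, here is a minimal sketch of a textbook Hamming(7,4) code, which corrects any single flipped bit in a 7-bit codeword. It is shown only as an illustration of the principle; the slides do not say which codes are used in any product, and the function names are invented for this example.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7, parity at 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(code: List[int]) -> Tuple[List[int], int]:
    """Return (data bits, error position); position is 0 when no error is seen."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # syndrome is the 1-based index of the bad bit
    if position:
        c[position - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
upset = list(stored)
upset[4] ^= 1  # simulate a single event upset on one stored bit
recovered, where = hamming74_correct(upset)
print(f"stored={stored} upset={upset} recovered={recovered} error position={where}")
```

A code like this corrects one flipped bit per word; two flipped bits in the same word would be miscorrected.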
Recap of fail types & trends
Fail type                  Trend                Solution space
Hard Fails                 Flat                 Continued process improvements; architectural fault tolerance (2)
Parametric degradation     Increasing           Continued process improvements; guard-bands; architectural fault tolerance
Intermittent               ??                   Continued process improvement; architectural fault tolerance (2)
Transient (noise)          Increasing           Continued process improvements; guard-bands; architectural fault tolerance (1)
Transient (Ionizing Rad)   Increasing to flat   Improved estimation methodology; hardening of critical elements; architectural fault tolerance (1)
(1) Such as EDAC (local circuit, component or system level)
(2) Requires self-diagnostics and redundancy
Fault Tolerance trends/ needs
Conservatism in technology to eliminate errors to Many Sigma will cut into performance
Fault Tolerant schemes will allow few errors to occur by providing the means to detect and correct
– Minimal to no impact to the customer
The continuation of Moore’s Law makes transistor availability plentiful and enables a much broader thinking in Fault Tolerance
– Local Circuit and functional circuit block level
– Multi/ Many core availability
– Complement hardware/chip strategies with Platform system strategies
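The "detect and correct" idea can be sketched in software as duplicate execution with comparison and retry, a loose analogue of running the same work on two cores. This is an illustration under assumptions, not a scheme described in the talk: the function name, the retry policy and the fault-injection test are all invented for the example.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_duplicate_check(task: Callable[[], T], max_retries: int = 3) -> T:
    """Run task twice and accept the result only if both runs agree.

    A mismatch is treated as a detected transient error and the pair of runs
    is repeated, up to max_retries additional times.
    """
    for _ in range(max_retries + 1):
        first, second = task(), task()
        if first == second:
            return first
    raise RuntimeError("results kept disagreeing; the fault may not be transient")


# Fault injection: the first call returns a corrupted value, later calls are clean.
calls = {"n": 0}


def flaky_sum() -> int:
    calls["n"] += 1
    return 43 if calls["n"] == 1 else 42


print(run_with_duplicate_check(flaky_sum), "after", calls["n"], "executions")
```

The cost is at least twice the work, which is the kind of trade-off that plentiful transistors and many-core availability make easier to consider. An error that corrupts both runs identically would go undetected.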
Help needed from HPC experts
What are the RAS Requirements for various categories and uses of HPC?
– Are there agreed targets that can guide us?
– Can a $ value be assigned to them?
How can System architecture and Software help and complement the effort at the component level?
Key Messages
Technology scaling continues according to Moore’s Law
– 2X increase in functionality every 2 years
– In the form of cores, integrated functionality or both
– 65nm in 2005, 45nm in 2007, 32nm in 2009
Technology & Reliability Challenges are many, but so are the opportunities
– Many new device types and materials
– A challenge as well as an opportunity
High RAS will require global fault management strategies along with robust circuit design
– Better understanding needed on RAS requirements
– Research and cost effectiveness of proposed options