The Myth of the Optimal F04

James A. Kahle, IBM Fellow, Austin, TX
TAU, Feb. 2 2004

Page 1: The Myth of the Optimal F04

James A. Kahle, IBM Fellow, Austin, TX

TAU, Feb. 2 2004

Page 2: Performance

Technology driven definition:

Perf ~= (1/#i) (#i * Pa) (1/Pa * Pp) (Pp/FO4) (FO4/ps)

Terms, left to right:
- (1/#i): Path length
- (#i * Pa): Application parallelism
- (1/Pa * Pp): Impedance match between software domain & hardware domain
- (Pp/FO4): Processor efficiency
- (FO4/ps): Technology

Where:
- #i = number of instructions
- Pa = Application parallelism
- Pp = Processor parallelism
- FO4 = Fan Out 4 delay
- ps = picoseconds

Diagram label: Application Domain
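One way to read the definition on this page is as a product of factors, so a gain in any one domain multiplies through to overall performance. The sketch below assumes each of the five terms can be expressed as a relative gain over some baseline design. The keys follow the slide labels; the function name and all numeric gains are illustrative assumptions, not values from the talk.

```python
from math import prod

# Relative gain of each term versus a baseline design (1.0 = unchanged).
# Keys follow the slide labels; the values are made-up examples.
EXAMPLE_GAINS = {
    "path_length": 1.10,
    "application_parallelism": 1.00,
    "impedance_match": 1.05,
    "processor_efficiency": 1.20,
    "technology": 1.30,
}


def relative_performance(gains):
    """Multiply the per-term gains into one overall relative figure."""
    return prod(gains.values())


if __name__ == "__main__":
    print(f"overall gain: {relative_performance(EXAMPLE_GAINS):.3f}x")
```

Because the terms multiply, a term left at 1.0 contributes nothing, and a term below 1.0 (for example, lost processor efficiency from a deeper pipeline) cancels gains made elsewhere.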

Page 3: Path Length

(1/#i) (#i * Pa) (1/Pa * Pp) (Pp/FO4) (FO4/ps)

- Path length Improvements
  - Compiler Maturity
- Instruction Compaction
  - VLIW
  - SIMD

Page 4: Application Parallelism

(1/#i) (#i * Pa) (1/Pa * Pp) (Pp/FO4) (FO4/ps)

- Major Classification
  - Instruction Parallelism
  - Multiprocessor Parallelism (single image)
  - Cluster / Multi Computer
- Multi-core Era
  - Processor tradeoffs need to be made at a higher level
  - Single Thread vs. Multi-thread
  - Symmetric vs. Asymmetric Multi-core

Page 5: Processor Efficiency

(1/#i) (#i * Pa) (1/Pa * Pp) (Pp/FO4) (FO4/ps)

- Total logic levels for pipeline
  - Branch Redirect (Ld / Cmp / Br)
  - Load Latency
  - Computation
  - Floating point latency
- Processor Parallelism
  - Super Scalar Width
  - Super Pipeline Length

Page 6: Processor Efficiency

(1/#i) (#i * Pa) (1/Pa * Pp) (Pp/FO4) (FO4/ps)

- Limits of current studies
  - Ignore Power
  - Ignore Design Improvements
    - Power Management
    - New Latch structure
  - Fix critical parameters
- Optimal FO4 per pipe stage
  - It Depends!!
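Why the answer is "it depends" can be illustrated with a toy pipeline model. This is a sketch under stated assumptions, not a model from the talk: the total logic depth, the per-stage latch overhead, and the per-stage hazard penalty are hypothetical parameters, and power is ignored (which is exactly one of the limits the slide criticizes).

```python
from math import sqrt


def optimal_fo4_per_stage(logic_fo4, overhead_fo4, hazard_per_stage):
    """Return (stages, fo4_per_stage) that minimize time per instruction.

    Toy model (an assumption, not taken from the slides):
        cycle = logic_fo4 / n + overhead_fo4     # FO4 per pipe stage
        cpi   = 1 + hazard_per_stage * n         # stalls grow with depth
        time_per_instruction = cycle * cpi
    Setting the derivative with respect to n to zero gives
        n_opt = sqrt(logic_fo4 / (overhead_fo4 * hazard_per_stage)).
    """
    n_opt = sqrt(logic_fo4 / (overhead_fo4 * hazard_per_stage))
    return n_opt, logic_fo4 / n_opt + overhead_fo4


if __name__ == "__main__":
    # Same hypothetical design, two hypothetical workloads.
    for name, hazard in [("few hazards", 0.05), ("many hazards", 0.20)]:
        stages, fo4 = optimal_fo4_per_stage(180.0, 3.0, hazard)
        print(f"{name}: {stages:.1f} stages, {fo4:.1f} FO4 per stage")
```

Even in this simplified model the "optimal" FO4 per stage moves with the workload and with the latch overhead, so fixing those parameters fixes the answer.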

Page 7: Technology

(1/#i) (#i * Pa) (1/Pa * Pp) (Pp/FO4) (FO4/ps)

- CMOS limitations
- Power limitations
- Wire limitations

Page 8: CMOS Device Performance

[Figure: relative device performance vs. year for Conventional Bulk CMOS, SOI (silicon-on-insulator), High mobility, and Double-Gate devices]

New Device Structures are Needed to Maintain Performance

Page 9: Leakage Current Trends

[Figure: I_OFF @ 25°C (nA/µm) vs. year]

Page 10: Technology Power Limitations

[Figure: module heat flux (W/cm²), 0 to 14, vs. year of announcement, 1950 to 2010. Technology labels: Vacuum, Bipolar, CMOS. Source: Ghoshal and Schmidt]

Systems plotted: IBM 360, IBM 370, IBM 3033, IBM ES9000, Fujitsu VP2000, IBM 3090S, NTT, Fujitsu M-780, IBM 3090, CDC Cyber 205, IBM 4381, IBM 3081, Fujitsu M380, IBM RY5, IBM GP, IBM RY6, Apache, Pulsar, IBM RY7, IBM RY4

Page 11: Power Limits Performance, but perhaps not quite the way you think

- Limiters for consumer systems
  - Box cross section
  - Air in vs out temperature
  - Noise / Max Airflow
  - Box temperature
  - Tj-max
  - ...

Only first few pose hard limit

[Images: Intel BTX; Apple Dual G5]

Many platforms do not tolerate substantially increased power. (Form-factor/noise/heat constrained.)

Page 12: AC vs. DC Power Trend (End of Conventional CMOS Scaling)

[Figure: DC power and AC power in W/cm² (log scale, 0.0001 to 1000) vs. Lpoly (1, 0.1, 0.01). Annotation: "Not realistic for most applications". Based on Intel and IBM data]

Page 13: Buy-All vs. High-Frequency Timing

[Figure: Target Performance and Process Distribution plotted against frequency, 3 to 8 GHz, on a 0 to 1 vertical scale]

- 232 ps ±12%: 4 GHz, 97%
- 172 ps ±18%: 6 GHz, 30%
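The two data points on this page can be approximated with a simple yield calculation. The sketch below rests on two assumptions that the slide does not state: path delay is normally distributed across the process, and the quoted ± percentage is a 3-sigma spread around the mean delay. Under that reading the computed yields land near the 97% and 30% shown. The function and parameter names are illustrative.

```python
from math import erf, sqrt


def yield_at_frequency(mean_delay_ps, three_sigma_fraction, target_ghz):
    """Fraction of parts whose path delay fits within the target cycle time.

    Assumes a normal delay distribution whose 3-sigma spread equals
    three_sigma_fraction * mean_delay_ps (an interpretation, not a given).
    """
    cycle_ps = 1000.0 / target_ghz                      # 4 GHz -> 250 ps
    sigma_ps = mean_delay_ps * three_sigma_fraction / 3.0
    z = (cycle_ps - mean_delay_ps) / sigma_ps
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))             # normal CDF at z


if __name__ == "__main__":
    print(f"4 GHz design: {yield_at_frequency(232.0, 0.12, 4.0):.1%}")
    print(f"6 GHz design: {yield_at_frequency(172.0, 0.18, 6.0):.1%}")
```

The point of the comparison is that a design timed so that nearly every part meets the target ("buy-all") and a design that targets a frequency only part of the process distribution can reach are very different business propositions.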

Page 14: Scaling Between Design Points

[Figure, two panels: delay (140 to 260) vs. frequency. Legend: 1x freq target, 1.5x freq target, Circuit, 30% RC, 50% RC]

Panel 1: scaling the late mode timing point

Panel 2: slack adjustments
- slack on demand
- timing at two corners
- timing at one corner, overachieve

Process Variation

Dieter Wendel

Page 15: Wires

- Fundamentally, wires do not scale
- Have had a steady stream of "tricks" to compensate
  - Aspect ratio changes (taller wires)
  - Resistance improvements (copper)
  - Capacitance improvements (low-K dielectric)
- Still things to improve, but wires won't keep up with transistors

Page 16: On-Chip Wires

[Figure: relative delay (log scale, 0.1 to 100) vs. process technology node (nm), from 250 down through 180, 130, 90, 65, and 45. Curves: Gate delay, Scaled wire delay, Global w. buffers, Global w/o buffer. Source: ITRS Data / M. Horowitz]

Page 17: Timing Metrics for the Future

- FO4 does not tell it all
  - Does not really scale with Technology
  - Does not measure Wires
  - Gives a broad view of design
- Future Metrics
  - Measure of wire design
  - Circuit / Wire Balance
  - How well design will scale

Page 18: Pushing Towards Lower FO4

- Complexity Boundaries
  - Exponentially increasing number of latches
  - Logic complexity increase
  - Area Increase
- How do we know the edge?
  - Design teams too large
  - Tools can no longer handle
  - Performance from other techniques
  - But maybe not the whole design is at same FO4

Page 19: > 16 FO4 Design Methodology

[Diagram labels: Dataflow (e.g. adder); LATCH; Synthesized Control; Cycle Boundary]

Re-buffering solution not pre-planned

Page 20: <16 FO4 (Template Based Design)

[Diagram labels: Dataflow (e.g. adder); Synthesized or Array-based Control; Cycle Boundary; L; LATCH; LATCH; Dynamic Mux-latch]

- Pre-planned re-buffering solution
- Semi-automated re-buffering solution

(Posluszny et al, DAC 2000)

Page 21: <12 FO4 (Input & Output Latch Bound Macros)

[Diagram labels: Dataflow (e.g. adder); Synthesized or Array-based Control; Cycle Boundary; L; LATCH; LATCH; LATCH; LATCH]

Pre-planned wire level & re-buffering solution

Page 22: Sub 9-FO4 Designs

- Latches are All That is Left
  - Logic Fully Integrated in the Latch (LSDL, IBM)
  - Every Edge The Same (GASP, Sun)
- SRAM Design in Doubt
- Simultaneous setup and hold time constraints are a major challenge

Page 23: <9 FO4 (Latches are All That's Left)

[Diagram labels: Highly Structured Control (precharacterized gates AND wires); Half Cycle Boundary; L1 and L2 latches; LATCH]

- Wires pre-designed
- Wires are a MAJOR challenge

Page 24: High-Frequency Design Summary

- Wire and buffer planning increasingly important as design frequency increases
- Structure increases with lower FO4
- Latch types proliferate as frequency increases
- Ultimately latches and wires are all that's left

Page 25: Needed Tools

- Much more refined timing
  - Variability (ACLV not uniform)
    - Statistical approach
  - Clock distribution skew
  - Local temperature
  - Local supply droop
  - More sophisticated wire models (e.g. liner)
  - Continue to include
    - Noise
- Optimization vs. analysis for new problems
  - E.g. leakage-driven synthesis

Page 26: Timed Circuits

- Timed circuits use aggressive timing assumptions to increase speed.
  - Delayed Reset Domino Logic
  - Self Resetting Dynamic Logic
  - Pulse Clocking
  - Many others ...
- Timing assumptions can replace transistors → Increased speed, decreased area
  - Very simple example: If A and B always fall before Clk we can make this transformation:

[Figure: circuit transformation]

- Timing Challenges:
  - Verify that the assumption holds in the design
  - Verify that the assumption is sufficient to guarantee correctness.
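The first timing challenge, verifying that an assumption such as "A and B always fall before Clk" holds in the design, amounts to a relative-timing check on arrival windows. The sketch below is a minimal illustration: the window representation, the function name, and the margin parameter are hypothetical, and it does not address the second challenge of showing that the assumption is sufficient for correctness.

```python
def assumption_holds(fall_windows_ps, clk_edge_window_ps, margin_ps=0.0):
    """Check that every input's latest fall precedes the earliest Clk edge.

    fall_windows_ps:    dict of signal name -> (earliest, latest) fall time
    clk_edge_window_ps: (earliest, latest) arrival time of the Clk edge
    margin_ps:          extra separation required between the two events
    """
    earliest_clk = clk_edge_window_ps[0]
    return all(
        latest_fall + margin_ps <= earliest_clk
        for _earliest_fall, latest_fall in fall_windows_ps.values()
    )


if __name__ == "__main__":
    # Made-up arrival windows in picoseconds.
    windows = {"A": (10.0, 40.0), "B": (15.0, 55.0)}
    print(assumption_holds(windows, (60.0, 70.0), margin_ps=5.0))
```

Comparing the latest input fall against the earliest clock edge is the pessimistic pairing. With process variation, those windows widen, which is where this kind of check becomes hard to close.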

Page 27: Merging Logic and Latches

[Diagram labels: nfet logic; Logic Devices; Latching Devices; Logic Gates; Latch]

- Merging logic and latches allows each device to perform multiple functions.
  - Logic and data capture overlapped.
  - Latching and gain overlapped in the latch.
- Demonstrated to provide up to 2x increase in density for arithmetic functions - ISSCC 2003
- Higher density → shorter wires → smaller drivers
  - Higher frequency from shorter wires.
  - Lower active power.
  - Lower leakage power (90% of leakage in drivers)
- Timing Challenge: Everything is a latch.
  - Need to characterize them all
  - This is labor intensive and needs to be more automated.

Page 28: Statistical Timing

- Process variation continues to increase.
  - Designing for the worst case does not always make sense.
  - Continually adding margin consumes die area and power.
  - Recent work has shown worst case timing to be up to 20% overly pessimistic - ICCAD 2003
  - This will only get worse as process variation increases.
- Designers and managers need to be able to quantify the impact of circuit decisions on frequency and yield.
  - Choice X: 70% yield at 3GHz, 75% yield at 2GHz
  - Choice Y: 50% yield at 3GHz, 99% yield at 2GHz
  - The choice depends on the target market and manufacturing cost.
  - Currently the choice is made without this level of information → more conservative design
- Timing challenge:
  - Continue to improve the capacity and speed of statistical timing engines.
  - Bring statistical timing analysis into mainstream timing flows.
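The Choice X / Choice Y comparison on this page can be made concrete with a two-bin revenue estimate. This is a sketch under stated assumptions: the 2 GHz yield is read as cumulative (it includes parts that also reach 3 GHz), every good part sells in its fastest bin, and the prices are invented for illustration. Only the four yield figures come from the slide.

```python
# Yields from the slide: (yield at 3 GHz, yield at 2 GHz).
CHOICES = {"X": (0.70, 0.75), "Y": (0.50, 0.99)}


def revenue_per_die(yield_3ghz, yield_2ghz, price_3ghz, price_2ghz):
    """Expected revenue per manufactured die with two speed bins.

    Assumes yield_2ghz is cumulative, so parts sold in the slower bin
    are those that reach 2 GHz but not 3 GHz.
    """
    only_2ghz = yield_2ghz - yield_3ghz
    return yield_3ghz * price_3ghz + only_2ghz * price_2ghz


def best_choice(price_3ghz, price_2ghz):
    """Name of the choice with the higher expected revenue per die."""
    return max(
        CHOICES,
        key=lambda name: revenue_per_die(*CHOICES[name], price_3ghz, price_2ghz),
    )


if __name__ == "__main__":
    # Hypothetical prices: a market that pays a large premium for speed,
    # and one that pays almost none.
    print("large premium:", best_choice(300.0, 100.0))
    print("small premium:", best_choice(120.0, 100.0))
```

With these made-up prices the preferred design flips between the two markets, which is the slide's point: without yield-versus-frequency information of this kind, the decision defaults to the more conservative design.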

Page 29: Hope for the Future

- New Technologies
  - Faster Transistors
  - New Memory Structures
- New Applications
  - New drivers for technology

Page 30: New Device Structures

[Images: SiGe transistor; Strained silicon; Carbon nanotube; FinFET]

Page 31: MRAM Operating Principles and Prototype

[Figure: MRAM Architecture]

Page 32: "Millipede" Storage

Page 33: Impedance Match

(1/#i) (#i * Pa) (1/Pa * Pp) (Pp/FO4) (FO4/ps)

- New design optimizations to match applications
  - Maturing Server / Desktop
    - Super Scalar design limits
    - Super Pipeline design limits
    - Maximize performance efficiency
  - Power limited design space
    - Maximize power efficiency
- New Application Spaces
  - Games driving new architecture organizations
  - New levels of Architecture efficiency
  - Will drive the lowest FO4 design spaces

Page 34: Summary

- Timing constraints are growing
  - Architectural
  - Logical
  - Physical
  - Power
- Multitude of Optimization points
  - e-business
  - High Performance Computing
  - Low Power
  - Gaming / Media
- Technology limits
  - Maturity in CMOS
  - New optimizations for new structures