Scalability - IBM
Transcript of Scalability - IBM
Everything you always wanted to know about SCALABILITY and were afraid to ask
Ronny Ronen
Senior Principal Engineer
Director of Architecture Research
Intel Labs - Haifa
Intel Corporation
Compiler & Architecture Seminar 2004
IBM Haifa
December 19, 2004
Contributor: Haggai Yedidya
Intel Development Center (IDC), Haifa 9/2002
*Third party marks and brands are the property of their respective owner
Tuning Expectations…
Some education
High level
No product discussion
Little data
Lots of food for thought
Hopefully some fun
“Forgive me for writing a long presentation but I did not have time to write a short one”
http://www.classy.dk/log/archive/001074.html
Agenda
What is Scalability
Elements of Scalability
Composition of Elements
Implications
The Environment
We will soon be able to put a billion transistors on die…
But…
We are area limited
We are power limited
We are thermally limited
We are complexity limited
We are … limited
What is Scalability?
Captured using Babylon-Pro V4.0
How to Measure Scalability?
What does it take to gain additional performance?
Investment?
Return?
Investment and Return
What does it take to gain additional performance?
Investment
– Area (Transistors, Cost)
– Power
– Effort (Complexity, Risk)
Return
– “Performance”
Scalability
Relative Return (∆R/R) over Relative Investment (∆I/I):
$S = \frac{\Delta R / R}{\Delta I / I}$
Will work for food
Simple?
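As a minimal sketch of the scalability metric defined on this slide, the function below computes S from a baseline point and a new point. The function name and the sample numbers are hypothetical and only illustrate the arithmetic.

```python
def scalability(i0, r0, i1, r1):
    """Relative return over relative investment: S = (dR/R) / (dI/I)."""
    if i0 <= 0 or r0 <= 0:
        raise ValueError("baseline investment and return must be positive")
    relative_investment = (i1 - i0) / i0
    if relative_investment == 0:
        raise ValueError("investment did not change")
    relative_return = (r1 - r0) / r0
    return relative_return / relative_investment


# Hypothetical numbers: 50% more area buys 25% more performance.
s = scalability(i0=100.0, r0=1.0, i1=150.0, r1=1.25)
print(f"S = {s:.2f}")
```

With these made-up inputs the relative return is half the relative investment, so S comes out below 1.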
Not that simple…
Area
– Random Logic
– Arrays
Power
– Dynamic vs. Static (Active vs. Leakage)
– Peak vs. Average
Performance
– Single thread vs. Multi-thread
– Specific (e.g., MM) vs. General purpose
– CPU intensive vs. Memory intensive
A lot of dimensions!
Attributes of Scalability
The Return on Investment Function
The slope of f(I) near (I0, R0):
=1: Linear scalability
>1: Super-linear
– e.g., 2X or X^2 gain
<1: Sub-linear
– e.g., ½X or X^½
0: Saturated
The Return Range:
0-5% or 1X-3X?
[Figure: Relative Return vs. Relative Investment through the point (I0, R0); curves: X^3, 2X, X, ½X, X^½, Sat.]
Performance, Power, Area (1)
Physics: Voltage/Frequency scaling
Freq = k * Voltage (F = kV)
– Within a limited voltage range V_MIN to V_MAX
Power = Activity * Capacitance * Voltage² * Freq (P = αCV²f)
Power is proportional to f³ (> V_MIN)
Power is proportional to f (<= V_MIN)
[Figure: Voltage and Power vs. Frequency (GHz), with the Linear Zone and the Cubic Zone marked]
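The two zones follow from P = αCV²f with F = kV. A rough sketch is shown below, assuming a hypothetical parameter `f_vmin` (the frequency reached at V_MIN) and power expressed relative to the power at that point. Leakage is ignored.

```python
def relative_power(f, f_vmin=1.0):
    """Power relative to the power at f_vmin, the frequency reached at V_MIN.

    f_vmin is an illustrative parameter, not a value from the slides.
    """
    if f <= 0 or f_vmin <= 0:
        raise ValueError("frequencies must be positive")
    if f <= f_vmin:
        # Linear zone: voltage is pinned at V_MIN, so P ~ f.
        return f / f_vmin
    # Cubic zone: V scales with f, so P ~ V^2 * f ~ f^3.
    return (f / f_vmin) ** 3


for f in (0.5, 1.0, 1.5, 2.0):
    print(f"f = {f:.1f} -> relative power {relative_power(f):.3f}")
```

The function is continuous at `f_vmin`, which is where the slide's chart switches from the linear to the cubic zone.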
Performance, Power, Area (2)
Empirics – the power growth
Power = f(Perf, Area)
Capacitance = k * Area (C = kA)
Power Growth Guesstimate
– Area growth → power growth
– Performance growth → power growth
Empirically, for the same voltage, power growth is ~ the average of both (combined effect of αC)
[Figure: Power = f(area, perf); x-axis: Area; y-axis: Area, power, perf; series: Area, Perf, Power]
Area & Performance are free variables in this chart
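The guesstimate above can be written as a one-line rule. The sketch below assumes that "average" means the arithmetic mean of the two growth factors, which the slide does not state explicitly. The function name and sample inputs are illustrative.

```python
def power_growth(area_growth, perf_growth):
    """Guesstimate at the same voltage: power grows ~ the mean of area and performance growth.

    Assumption: arithmetic mean of the two growth factors.
    """
    if area_growth <= 0 or perf_growth <= 0:
        raise ValueError("growth factors must be positive")
    return (area_growth + perf_growth) / 2.0


# Hypothetical design: 2X the area for 1.4X the performance.
print(f"power growth ~ {power_growth(2.0, 1.4):.2f}X")
```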
The enablers: 6 elements of Scalability
1. Process Technology
2. Architecture
3. Micro-architecture
4. Multithreading (SMT)
5. Multi Processors (CMP)
6. Dynamic Voltage Scaling (DVS)
There are others…
Disclaimer
Numbers mainly represent trends – not concrete data
For a given design assume:
Performance scales linearly with frequency
Leakage power is ignored
Process Technology – In Theory vs. In Practice
Every new process technology (~2.5 years cycle) changes the physical attributes of the transistor (practice values in parentheses):
Size reduced to 0.5X
Delay time reduced to 0.7X (practice: 0.75X)
Switch energy reduced by 3X (practice: 1.75X); relative to CV²:
– Capacitance per transistor reduced to 0.7X
– Operation voltage reduced to 0.7X (practice: 0.9X)
Two extreme scenarios:
Ideal “Shrink”
– Same µarch
– 1X #Xistors (practice: 1X)
– 0.5X size (0.7X per dimension) (practice: 0.5X)
– 1.5X frequency (practice: 1.35X)
– 0.5X power (practice: 0.75X)
– 1.5X performance (practice: 1.25X [1])
– 1X power density (practice: 1.5X)
Ideal New design
– Same die size
– 2X #Xistors (practice: 2X)
– 1X size (practice: 1X)
– 1.5X frequency (practice: 1.35X)
– 1X power (practice: 1.5X)
– 3X performance (practice: 2.2X [1])
– 1X power density (practice: 1.5X)
[1] Performance scales behind frequency and # of transistors
In theory: 3X performance for “nothing” every 2.5 years!
In practice: still good, but less than we hoped for
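The frequency, power and power-density figures on this slide can be approximated from the per-transistor ratios. The sketch below assumes switch energy ~ C·V², frequency ~ 1/delay, and power ~ transistor count × switch energy × frequency with unchanged activity. The function and variable names are illustrative, and the pairing of the 0.75X delay and 0.9X voltage values with "practice" is a reading of the slide overlay.

```python
def scale_generation(cap, volt, delay, transistors, area):
    """Relative frequency, power and power density after one process generation.

    All arguments are ratios versus the previous generation.
    """
    switch_energy = cap * volt ** 2          # energy per switch ~ C * V^2
    frequency = 1.0 / delay                  # shorter gate delay -> higher clock
    power = transistors * switch_energy * frequency
    power_density = power / area
    return frequency, power, power_density


theory = dict(cap=0.7, volt=0.7, delay=0.7)
practice = dict(cap=0.7, volt=0.9, delay=0.75)

for name, params in (("theory", theory), ("practice", practice)):
    shrink = scale_generation(transistors=1.0, area=0.5, **params)
    new_design = scale_generation(transistors=2.0, area=1.0, **params)
    print(name, "shrink (freq, power, density):", [round(x, 2) for x in shrink])
    print(name, "new design (freq, power, density):", [round(x, 2) for x in new_design])
```

Under these assumptions the results land close to the slide's rounded figures, for example about 0.49X power for the theoretical shrink and about 0.76X in practice.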
In Pictures… Process Technology
[Figure: processors shown by process generation (1.5µ, 1.0µ, 0.8µ, 0.6µ, 0.35µ, 0.25µ, 0.18µ, 0.13µ, 0.09µ): Intel386™ DX Processor, Intel486™ DX Processor, Pentium® Processor, Pentium® Pro Processor, Pentium® II Processor, Pentium® III Processor, Pentium® 4 Processor, Pentium® M Processor]
Trends
– Smaller every new process
– Larger every new µarch
– Converge to about 100 mm²
– Extra area used now for on-die cache
Architecture
Description
User visible additions:
– New instructions: e.g., MMX, SSE
– Paradigm change – EPIC*, Vectors
Scaling
– Varying Investment
– Varying Return
+
Generally exhibits super-linear scalability
–
Complexity
Long SW enabling
Recompile
Architectural baggage
* EPIC – Explicitly Parallel Instruction Computing
Micro-Architecture
Description
User transparent structures and algorithms to gain performance and reduce power
– Pipelining, caches, branch prediction, out-of-order…
Scaling
Traditionally: investment^½
– 2-3X investment → 1.4X-1.7X return
Range is smaller w/ time
Some highly scalable mechanisms:
– Branch prediction, instruction fusion
+
User transparent
Impact fast, impact many
Enables different segments for same architecture
–
Complexity
Low scalability
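The square-root relationship quoted for micro-architecture scaling can be checked with a short sketch. The function name is hypothetical; the rule itself is the slide's "investment^½".

```python
import math


def uarch_return(investment):
    """Square-root rule of thumb: return grows as investment ** 0.5."""
    if investment <= 0:
        raise ValueError("investment must be positive")
    return math.sqrt(investment)


for investment in (2.0, 3.0):
    print(f"{investment:.0f}X investment -> {uarch_return(investment):.2f}X return")
```

For 2X and 3X investment this gives values that round to the 1.4X-1.7X range on the slide.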
SMT – Simultaneous Multi-Threading
Shared Resources Multi-Processing
Description
2 or more threads to run on a single processor core
Sharing/Splitting/Duplicating resources
e.g. Compaq* Alpha-21464, Intel® Hyper Threading technology
Scaling
Super linear in area
Linear+ in power
e.g., 10% area and 15% power → 20% performance
Very application dependent
– Severe scaling glass-jaws
+
Efficient performance/area
Efficient performance/power
Can trade ST (Single Thread) and MT performance
–
Complexity
Power Density
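Plugging the SMT example into the scalability ratio (relative return over relative investment) gives a separate figure for area and for power. The sketch below uses the slide's percentages; the function name and the label strings are illustrative.

```python
def classify(relative_return, relative_investment):
    """Return (S, label), where S = relative return / relative investment."""
    if relative_investment <= 0:
        raise ValueError("relative investment must be positive")
    s = relative_return / relative_investment
    if s > 1.0:
        label = "super-linear"
    elif s == 1.0:
        label = "linear"
    elif s > 0.0:
        label = "sub-linear"
    else:
        label = "saturated (no gain)"
    return s, label


# Slide example: +10% area and +15% power buy +20% performance.
for resource, investment_pct in (("area", 10), ("power", 15)):
    s, label = classify(20, investment_pct)
    print(f"{resource}: S = {s:.2f} ({label})")
```

Both ratios come out above 1, with the power ratio closer to 1, which fits the slide's "super linear in area, linear+ in power".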
CMP – Chip Multi-Processing
Description
Placing 2 or more cores on a single die
Potentially sharing higher level caches
Examples – IBM* Power 4 Processor
Scaling
Close to linear in area
Linear in power
Can range from 1 to several cores
Somewhat application dependent
+
Lower complexity
Higher throughput
Addresses wire-delay
–
Bounded single thread performance
Medium area efficiency
DVS – Dynamic Voltage Scaling
Description
Dynamic change of voltage & frequency, allowing trading power & performance
freq = k1 * V (limited between V_MIN & V_MAX)
Power = αCV²f = k2 * V³
V < V_MIN: can reduce freq. only
Examples – Intel SpeedStep® Technology, Transmeta* LongRun, …
Scaling
Sub linear: cubic root in power
Range: 2X power reduction → 20% performance loss (assuming V_MIN/V_MAX = 0.8)
+
Dynamic – can vary voltage/frequency at run time
Benefits all apps types, ST, MT
–
Mainly a downward scalability
[Figure: Frequency (GHz) and Voltage (V) vs. Power, with the Linear Zone, the Cubic Zone, V_MIN and V_MAX marked]
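The DVS range quoted above follows from the cubic relationship. A minimal sketch, assuming performance tracks frequency and ignoring leakage (as the disclaimer slide does); the function name is hypothetical.

```python
def dvs_range(vmin_over_vmax):
    """Performance and power at V_MIN relative to V_MAX, within the cubic zone."""
    if not 0.0 < vmin_over_vmax <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    perf = vmin_over_vmax           # freq = k1 * V, performance assumed ~ frequency
    power = vmin_over_vmax ** 3     # P = k2 * V^3
    return perf, power


perf, power = dvs_range(0.8)
print(f"performance: {perf:.2f}X, power: {power:.2f}X")
```

At a 0.8 voltage ratio, power drops to roughly half while performance drops by 20%, which is the trade the slide describes.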
Summing it all
[Figure: “Return on Investment”: Performance range vs. Power range for Process, µarch, SMT, CMP and DVS]
Composition
Can we apply several scaling elements? – Sure
The obvious
– Process Technology is a given
To be assessed a priori
– Cost of architecture changes
How to choose among other options?
– Under given area budget
– Under given power budget
– We need to show some ST performance gain
– We want high MT performance
– How much to dedicate for shared resources (Cache?)
Composition – Example (1)
We assume process technology is a given
We ignore architecture changes
We look at several CMP/SMT options
– 1/2/4 cores
– 1/2 way SMT per core
Rest of area used for µarch changes (improvements/de-features)
We examine power/performance at full range of DVS
We assume:
– Budget: 100 mm² area, V_MIN/V_MAX 1.0V/1.25V = 80% → 2X DVS power range
– Basic building block – 25 mm² area, 20W power
Composition – Example (2)
Unlimited Power
Configurations
– 1/2/4 cores
– 1/2 Way SMT
– Fixed area: 100 mm²
– Unlimited power
With CMP
– MT Perf goes up
– ST Perf goes down
– Power efficiency is up
With SMT
– MT Perf goes up
– ST perf stays the same
– Power efficiency is up
[Figure: Relative Power & Performance by Configuration (1P, 1P/SMT, 2P, 2P/SMT, 4P, 4P/SMT); series: ST Perf, MT Perf, ST Power, MT Power]
Composition – Example (3)
With DVS
[Figure, Single-Threaded panel: Relative ST Performance vs. Power for 1P, 1P/SMT, 2P, 2P/SMT, 4P, 4P/SMT]
[Figure, Multi-Threaded panel: Relative MT Performance vs. Power for 1P, 1P/SMT, 2P, 2P/SMT, 4P, 4P/SMT]
Note:
– 4P/SMT is the MT leader in all ranges
– 1P is an ST leader in the 25W-60W range
Is that all?
No! We should consider:
Multiple clock domain (DVS+CMP)
“focused” MIPS
– Application oriented ISA and µarch
– Fixed functions
Asymmetric Cores
– e.g., big for higher performance, small for lower power*
Target Segment
– Servers prefer throughput over ST performance
– Mobile has lower power budget, cares about average power and may have different usage model.
* Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance, Kumar, Farkas, Jouppi, Ranganathan, Tullsen, ISCA 2004
What next? Asymmetric Cores?
Mix big and small cores
– Big core(s) – for single thread performance
– Small cores – for efficient multithreaded performance
Configuration   MT perf   ST perf
1 core          1X        1X
4 cores         2X        ½X
16 cores        4X        ¼X
13 cores        3.5X      ½X
Best of all worlds?
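The table's numbers are consistent with applying the square-root rule to each core's share of a fixed area. The sketch below assumes a budget of 16 area units, MT throughput equal to the sum over cores, ST performance equal to the largest core, and that the 13-core row means one 4-unit core plus twelve 1-unit cores. None of these assumptions is stated on the slide.

```python
import math


def mix_performance(core_areas, total_area=16.0):
    """Return (MT perf, ST perf) for a mix of cores, relative to one core using the whole area.

    Assumes per-core performance = sqrt(core area / total area).
    """
    if not core_areas:
        raise ValueError("need at least one core")
    if sum(core_areas) > total_area:
        raise ValueError("cores do not fit in the area budget")
    perfs = [math.sqrt(area / total_area) for area in core_areas]
    return sum(perfs), max(perfs)


configs = {
    "1 core": [16],
    "4 cores": [4] * 4,
    "16 cores": [1] * 16,
    "13 cores (1 big + 12 small, assumed split)": [4] + [1] * 12,
}
for name, areas in configs.items():
    mt, st = mix_performance(areas)
    print(f"{name}: MT {mt:g}X, ST {st:g}X")
```

Under these assumptions the asymmetric mix keeps the ST performance of the 4-core design while getting most of the MT throughput of the 16-core design.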
Summary
Many dimensions to scaling
Tradeoffs are becoming more complex every day
– Features, performance, power, area
– Multi-threaded vs. Single Threaded
– General purpose vs. “focused MIPS”
Expect
– Less complex micro-architecture
– More CMP and SMT
They are just more efficient, less complex
– But how small can we split?
The BIG challenge:
– Finding the next BIG thing
The End
No animals were injured during the preparation of this presentation