Scalability - IBM

Everything you always wanted to know about SCALABILITY and were afraid to ask. Ronny Ronen, Senior Principal Engineer, Director of Architecture Research, Intel Labs - Haifa, Intel Corporation. Compiler & Architecture Seminar 2004, IBM Haifa, December 19, 2004. Contributor: Haggai Yedidya, Intel Development Center (IDC) Haifa 9/2002.

Transcript of Scalability - IBM

Page 1: Scalability - IBM

Everything you always wanted to know about SCALABILITY and were afraid to ask

Ronny Ronen
Senior Principal Engineer
Director of Architecture Research
Intel Labs - Haifa
Intel Corporation

Compiler & Architecture Seminar 2004
IBM Haifa
December 19, 2004

Contributor: Haggai Yedidya
Intel Development Center (IDC) Haifa 9/2002

Page 2: Scalability - IBM

*Third party marks and brands are the property of their respective owner

Tuning Expectations…

Some education
High level
No product discussion
Little data
Lots of food for thought
Hopefully some fun

“Forgive me for writing a long presentation but I did not have time to write a short one”

http://www.classy.dk/log/archive/001074.html

Page 3: Scalability - IBM

Agenda

What is Scalability
Elements of Scalability
Composition of Elements
Implications

Page 4: Scalability - IBM

The Environment

We will soon be able to put a billion transistors on a die…

But…
We are area limited
We are power limited
We are thermally limited
We are complexity limited
We are … limited

Page 5: Scalability - IBM

What is Scalability?

Captured using Babylon-Pro V4.0

Page 6: Scalability - IBM

What is Scalability?

Page 7: Scalability - IBM

How to Measure Scalability?

What does it take to gain additional performance?

Investment?

Return?

Page 8: Scalability - IBM

Investment and Return
What does it take to gain additional performance?

Investment
– Area (Transistors, Cost)
– Power
– Effort (Complexity, Risk)

Return
– “Performance”

Scalability
Relative Return (∆R/R) over Relative Investment (∆I/I)

$$S = \frac{\Delta R / R}{\Delta I / I}$$

Will work for food

Simple?
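A minimal sketch of the scalability measure defined on this slide, S = (∆R/R) / (∆I/I). The function name, argument names and the sample numbers are illustrative assumptions, not taken from the talk.

```python
def scalability(r_old, r_new, i_old, i_new):
    """Relative return over relative investment: S = (dR/R) / (dI/I)."""
    if i_new == i_old:
        raise ValueError("investment must change to measure scalability")
    relative_return = (r_new - r_old) / r_old
    relative_investment = (i_new - i_old) / i_old
    return relative_return / relative_investment


# Hypothetical example: 50% more area (100 -> 150) buys 20% more performance.
s = scalability(r_old=1.0, r_new=1.2, i_old=100.0, i_new=150.0)
print(round(s, 2))
```

A value of S below 1 means the return grows more slowly than the investment; a value above 1 means it grows faster.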

Page 9: Scalability - IBM

Not that simple…

Area
– Random Logic
– Arrays

Power
– Dynamic vs. Static (Active vs. Leakage)
– Peak vs. Average

Performance
– Single thread vs. Multi-thread
– Specific (e.g., MM) vs. General purpose
– CPU intensive vs. Memory intensive

A lot of dimensions!

Page 10: Scalability - IBM

Attributes of Scalability
The Return on Investment Function

The slope of f(I) near (I0, R0):
=1: Linear Scalability
>1: Super-linear
– e.g., 2X or X² gain
<1: Sub-linear
– e.g., ½X or X^½
0: Saturated

The Return Range:
0-5% or 1X-3X?

[Figure: Relative Return vs. Relative Investment, with curves X^3, 2X, X, ½X, X^½ and Sat., and the reference point (I0, R0) marked]
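The classification by slope can be sketched numerically. The sketch below assumes the curves are normalized so that (I0, R0) = (1, 1), estimates the slope with a central difference, and covers only the curves whose form is unambiguous from the chart legend (X^3, X, X^½ and a flat saturated curve). The helper names and the tolerance are illustrative.

```python
def local_slope(f, i0=1.0, h=1e-6):
    """Central-difference estimate of df/dI at I0."""
    return (f(i0 + h) - f(i0 - h)) / (2 * h)


def classify(slope, tol=1e-3):
    """Map a slope near (I0, R0) to the categories used on the slide."""
    if abs(slope) <= tol:
        return "saturated"
    if abs(slope - 1.0) <= tol:
        return "linear"
    return "super-linear" if slope > 1.0 else "sub-linear"


# Return-on-investment curves, normalized so that f(1) = 1 (assumption).
curves = {
    "X^3": lambda x: x ** 3,
    "X": lambda x: x,
    "X^1/2": lambda x: x ** 0.5,
    "Sat.": lambda x: 1.0,
}

labels = {name: classify(local_slope(f)) for name, f in curves.items()}
for name, label in labels.items():
    print(name, label)
```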

Page 11: Scalability - IBM

Performance, Power, Area (1)
Physics: Voltage/Frequency scaling

Freq = k * Voltage (F = kV)
– Within a limited voltage range VMIN/VMAX
Power = Activity * Capacitance * Voltage² * Freq (P = αCV²f)
Power is proportional to f³ (>VMIN)
Power is proportional to f (<=VMIN)

[Figure: Voltage and Power plotted against Frequency (GHz), with a Cubic Zone and a Linear Zone marked]
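The two zones follow directly from P = αCV²f with F = kV and a voltage floor at VMIN. A minimal sketch, assuming made-up values for k, VMIN and VMAX and folding α and C into one constant; none of these numbers come from the slide.

```python
K_GHZ_PER_VOLT = 1.0   # hypothetical k in F = k*V
V_MIN = 0.8            # hypothetical voltage floor (V)
V_MAX = 1.6            # hypothetical voltage ceiling (V)


def voltage_for(freq_ghz):
    """Lowest voltage that sustains freq_ghz; it cannot drop below V_MIN."""
    v = max(freq_ghz / K_GHZ_PER_VOLT, V_MIN)
    if v > V_MAX:
        raise ValueError("frequency not reachable within V_MAX")
    return v


def power(freq_ghz, alpha_c=1.0):
    """P = alpha*C*V^2*f, with alpha*C folded into one constant."""
    v = voltage_for(freq_ghz)
    return alpha_c * v ** 2 * freq_ghz


# Linear zone: voltage is pinned at V_MIN, so doubling f doubles power.
linear_ratio = power(0.8) / power(0.4)
# Cubic zone: voltage rises with f, so doubling f costs 8x power.
cubic_ratio = power(1.6) / power(0.8)
print(linear_ratio, cubic_ratio)
```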

Page 12: Scalability - IBM

Performance, Power, Area (2)
Empirics – the power growth

Power = f(Perf, Area)
Capacitance = k * Area (C = kA)
Power Growth Guesstimate
– Area growth → power growth
– Performance growth → power growth
Empirically, for the same voltage, power growth is ~ average of both (Combined effect of αC)

[Figure: "Power = f(area, perf)": Area, Perf and Power plotted against Area]

Area & Performance are free variables in this chart
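A worked reading of the guesstimate, assuming "average" means the arithmetic mean of the relative area growth and the relative performance growth (the slide does not say which mean is intended). The symbols and the sample numbers are illustrative.

```latex
P_{\text{rel}} \approx \frac{A_{\text{rel}} + \mathit{Perf}_{\text{rel}}}{2},
\qquad
A_{\text{rel}} = 2,\ \mathit{Perf}_{\text{rel}} = 1.4
\;\Rightarrow\;
P_{\text{rel}} \approx \frac{2 + 1.4}{2} = 1.7
```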

Page 13: Scalability - IBM

The enablers:
6 elements of Scalability

1. Process Technology
2. Architecture
3. Micro-architecture
4. Multithreading (SMT)
5. Multi Processors (CMP)
6. Dynamic Voltage Scaling (DVS)

There are others…

Page 14: Scalability - IBM

Disclaimer

Numbers mainly represent trends – Not concrete data

For a given design assume:
Performance scales linearly with frequency
Leakage power is ignored

Page 15: Scalability - IBM

Process Technology – In Theory / Practice
Every new process technology (~2.5 years cycle) changes the physical attributes of the transistor (in-practice values in parentheses):

Size reduced to 0.5X
Delay time reduced to 0.7X (0.75X)
Switch energy reduced by 3X (1.75X); relative to CV²:
– Capacitance per transistor reduced to 0.7X
– Operation voltage reduced to 0.7X (0.9X)

Two extreme scenarios:

Ideal “Shrink”
– Same µarch
– 1X #Xistors (1X)
– 0.5X size, 0.7X per dimension (0.5X)
– 1.5X frequency (1.35X)
– 0.5X power (0.75X)
– 1.5X performance (1.25X¹)
– 1X power density (1.5X)

Ideal New design
– Same die size
– 2X #Xistors (2X)
– 1X size (1X)
– 1.5X frequency (1.35X)
– 1X power (1.5X)
– 3X performance (2.2X¹)
– 1X power density (1.5X)

¹ Performance scales behind frequency and # of transistors

Ideally - 3X performance for “nothing” every 2.5 years!
Still Good – but less than we hoped for
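The "shrink" figures can be cross-checked from the per-transistor factors. The sketch below assumes switch energy scales as C·V², frequency as 1/delay, and power as energy × frequency for an unchanged transistor count and activity. It also assumes the in-practice factors are 0.75X delay and 0.9X voltage with capacitance still at 0.7X. The function and key names are illustrative.

```python
def shrink(delay, capacitance, voltage, area=0.5):
    """Derive shrink-scenario factors from per-transistor scaling factors."""
    energy = capacitance * voltage ** 2   # switch energy relative to C*V^2
    frequency = 1.0 / delay               # faster transistors, higher clock
    power = energy * frequency            # same transistor count and activity
    return {
        "energy_reduction": 1.0 / energy,
        "frequency": frequency,
        "power": power,
        "power_density": power / area,
    }


theory = shrink(delay=0.7, capacitance=0.7, voltage=0.7)
practice = shrink(delay=0.75, capacitance=0.7, voltage=0.9)

for name, result in (("theory", theory), ("practice", practice)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```

Under these assumptions the theory case lands near 3X energy reduction, 0.5X power and 1X power density, and the practice case near 1.75X, 0.75X and 1.5X.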

Page 16: Scalability - IBM

In Pictures…
Process Technology

[Figure: processors shown by process generation (1.5µ, 1.0µ, 0.8µ, 0.6µ, 0.35µ, 0.25µ, 0.18µ, 0.13µ, 0.09µ): Intel386™ DX, Intel486™ DX, Pentium®, Pentium® Pro, Pentium® II, Pentium® III, Pentium® 4 and Pentium® M Processors]

Trends
– Smaller every new process
– Larger every new µarch
– Converge to about 100mm²
– Extra area used now for on-die cache

Page 17: Scalability - IBM

Architecture

Description
User visible additions:
– New instructions: e.g., MMX, SSE
– Paradigm change – EPIC*, Vectors

Scaling
– Varying Investment
– Varying Return

+
Generally exhibits super linear scalability

–
Complexity
Long SW enabling
Recompile
Architectural baggage

* EPIC – Explicitly Parallel Instruction Computing

Page 18: Scalability - IBM

Micro-Architecture

Description
User transparent structures and algorithms to gain performance and reduce power
– Pipelining, caches, branch prediction, out-of-order…

Scaling
Traditionally: investment^½
– 2-3X investment → 1.4X-1.7X Return
Range is smaller w/ time
Some highly scalable mechanisms:
– Branch prediction, instruction fusion

+
User Transparent
Impact fast, impact many
Enables different segments for same architecture

–
Complexity
Low scalability
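The 1.4X-1.7X range quoted under Scaling is what a square-root return gives for a 2X-3X investment. Written out, with R for return and I for investment (symbols chosen here for illustration):

```latex
R \propto I^{1/2}: \qquad \sqrt{2} \approx 1.41, \qquad \sqrt{3} \approx 1.73
```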

Page 19: Scalability - IBM

SMT – Simultaneous Multi-Threading
Shared Resources Multi-Processing

Description
2 or more threads to run on a single processor core
Sharing/ Splitting/ Duplicating resources
e.g. Compaq* Alpha-21464, Intel® Hyper Threading technology

Scaling
Super linear in area
linear+ in power
e.g., 10% area and 15% power → 20% performance
Very application dependent
– Severe scaling glass-jaws

+
Efficient performance/area
Efficient performance/power
Can trade ST (Single Thread) and MT performance

–
Complexity
Power Density
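Plugging the example figures into the scalability measure used in this talk, S = (∆R/R) / (∆I/I), gives one value per kind of investment. The subscripts are added here for illustration.

```latex
S_{\text{area}} = \frac{20\%}{10\%} = 2.0,
\qquad
S_{\text{power}} = \frac{20\%}{15\%} \approx 1.33
```

Both values exceed 1, which is consistent with "super linear in area" and "linear+ in power".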

Page 20: Scalability - IBM

CMP – Chip Multi-Processing

Description
Placing 2 or more cores on a single die
Potentially sharing higher level caches
Examples – IBM* Power 4 Processor

Scaling
Close to linear in area
linear in power
Can range from 1 to several cores
Somewhat application dependent

+
Lower complexity
Higher throughput
Addresses wire-delay

–
Bounded single Thread performance
Medium area efficiency

Page 21: Scalability - IBM

DVS – Dynamic Voltage Scaling

Description
Dynamic change of Voltage & frequency allowing trading power & performance
freq = k1*V (limited between VMIN & VMAX)
Power = αCV²f = k2V³
V<VMIN can reduce freq. only
Examples – Intel SpeedStep® Technology, Transmeta* LongRun, …

Scaling
Sub linear: cubic root - in power
Range: 2X power reduction → 20% performance loss
(Assuming VMIN/VMAX = 0.8)

+
Dynamic – can vary voltage/frequency at run time
Benefits all apps types, ST, MT

–
Mainly a downward scalability

[Figure: Freq (GHz) and Voltage (V) plotted against Power, with VMIN, VMAX, a Linear Zone and a Cubic Zone marked]
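The quoted range follows from the cubic relation between power and voltage. A short derivation, assuming performance scales linearly with frequency, as the talk's disclaimer states:

```latex
\frac{P(V_{MIN})}{P(V_{MAX})} = \left(\frac{V_{MIN}}{V_{MAX}}\right)^{3} = 0.8^{3} \approx 0.51,
\qquad
\frac{f(V_{MIN})}{f(V_{MAX})} = \frac{V_{MIN}}{V_{MAX}} = 0.8
```

So dropping from VMAX to VMIN roughly halves power while giving up about 20% of frequency.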

Page 22: Scalability - IBM

Summing it all

[Figure: "Return on Investment": Performance range plotted against Power range, with series Process, uarch, SMT, CMP and DVS]

Page 23: Scalability - IBM

Composition
Can we apply several scaling elements? – Sure

The obvious
– Process Technology is a given

To be assessed a priori
– Cost of architecture changes

How to choose among other options?
– Under given area budget
– Under given power budget
– We need to show some ST performance gain
– We want high MT performance
– How much to dedicate for shared resources (Cache?)

Page 24: Scalability - IBM

Composition – Example (1)

We assume process technology is a given
We ignore architecture changes
We look at several CMP/SMT options
– 1/2/4 cores
– 1/2 way SMT per core
Rest of area used for µarch changes improvements/de-features
We examine power/performance at full range of DVS
We assume:
– Budget: 100 mm² area, VMIN/VMAX 1.0V/1.25V = 80% → 2X DVS power range
– Basic building block – 25 mm² area, 20W power

Page 25: Scalability - IBM

Composition – Example (2)
Unlimited Power

Configurations
– 1/2/4 cores
– 1/2 Way SMT
– Fixed area: 100 mm²
– Unlimited power

With CMP
– MT Perf goes up
– ST Perf goes down
– Power efficiency is up

With SMT
– MT Perf goes up
– ST perf stay same
– Power efficiency is up

[Figure: Relative Power & Performance by Configuration (1P, 1P/SMT, 2P, 2P/SMT, 4P, 4P/SMT), with series ST Perf, MT Perf, ST Power and MT Power. Data labels, in extraction order: 2.0, 1.8, 1.4, 1.3, 1.0, 0.9; 2.0, 2.6, 2.7, 3.4, 3.7, 4.7; 0.96, 1.00, 1.65, 1.71, 2.91, 3.84, 4.00, 4.54, 3.41, 3.35, 3.00]
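A rough sketch of how the CMP-only configurations could be estimated from the stated budget (100 mm² die, 25 mm² / 20W building block). It is not the author's model. It assumes core performance grows as the square root of core area, core power is the arithmetic mean of relative area and relative performance, CMP scaling is perfect, idle cores draw no power, and SMT is ignored. All names are illustrative.

```python
import math

DIE_AREA_MM2 = 100.0    # area budget from the slide
BLOCK_AREA_MM2 = 25.0   # basic building block area from the slide
BLOCK_POWER_W = 20.0    # basic building block power from the slide


def cmp_config(n_cores):
    """Estimate relative performance and power for n equal cores."""
    units = DIE_AREA_MM2 / BLOCK_AREA_MM2 / n_cores  # core area in blocks
    st_perf = math.sqrt(units)                # assumed sqrt(area) return
    core_power = (units + st_perf) / 2.0      # assumed mean of area and perf
    return {
        "st_perf": st_perf,
        "mt_perf": n_cores * st_perf,         # assumes perfect CMP scaling
        "st_power_w": core_power * BLOCK_POWER_W,
        "mt_power_w": n_cores * core_power * BLOCK_POWER_W,
    }


for n in (1, 2, 4):
    print(n, {key: round(value, 2) for key, value in cmp_config(n).items()})
```

Because the sketch assumes perfect multi-core scaling and leaves SMT out, its multi-threaded numbers are upper bounds rather than a reproduction of the chart.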

Page 26: Scalability - IBM

Composition – Example (3)
With DVS

[Figure, Single-Threaded: Relative ST Performance plotted against Power, with series 1P, 1P/SMT, 2P, 2P/SMT, 4P, 4P/SMT]

[Figure, Multi-Threaded: Relative MT Performance plotted against Power, with series 1P, 1P/SMT, 2P, 2P/SMT, 4P, 4P/SMT]

Note:
– 4P/SMT is the MT leader in all ranges
– 1P is an ST leader in the 25W-60W range

Page 27: Scalability - IBM

Is that all?

No! We should consider:
Multiple clock domain (DVS+CMP)
“focused” MIPS
– Application oriented ISA and µarch
– Fixed functions
Asymmetric Cores
– e.g., big for higher performance, small for lower power*
Target Segment
– Servers prefer throughput over ST performance
– Mobile has lower power budget, care about average power and may have different usage model.

* Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance, Kumar, Farkas, Jouppi, Ranganathan, Tullsen, ISCA’2004

Page 28: Scalability - IBM

What next? Asymmetric Cores?

Mix big and small cores
– Big core(s) – for single thread performance
– Small cores – for efficient multithreaded performance

Configuration   MT perf   ST perf
1 core          1X        1X
4 cores         2X        ½X
16 cores        4X        ¼X
13 cores        3.5X      ½X

Best of all worlds?
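The trade-off between big and small cores can be explored with a small model. This sketch assumes a fixed die budget of 16 small-core area units, core performance proportional to the square root of core area, and perfect multithreaded scaling. It also assumes the 13-core mix is one big core of 4 units plus 12 small cores. None of these assumptions is stated on the slide, and the names are illustrative.

```python
import math

TOTAL_AREA = 16  # die budget in small-core area units (assumption)


def core_perf(area_units):
    """Assumed square-root return of performance on core area."""
    return math.sqrt(area_units)


def evaluate(core_areas):
    """Return (MT perf, ST perf) relative to one core using the whole die."""
    if sum(core_areas) != TOTAL_AREA:
        raise ValueError("configuration must use the whole area budget")
    baseline = core_perf(TOTAL_AREA)
    st = max(core_perf(a) for a in core_areas) / baseline
    mt = sum(core_perf(a) for a in core_areas) / baseline
    return mt, st


configs = {
    "1 core": [16],
    "4 cores": [4] * 4,
    "16 cores": [1] * 16,
    "13 cores": [4] + [1] * 12,  # one big core plus twelve small (assumption)
}

results = {name: evaluate(areas) for name, areas in configs.items()}
for name, (mt, st) in results.items():
    print(f"{name}: MT {mt}X, ST {st}X")
```

Under these assumptions the asymmetric mix keeps the single-thread performance of the 4-core design while recovering most of the multithreaded throughput of the 16-core design.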

Page 29: Scalability - IBM

Summary

Many dimensions to scaling
Tradeoffs are becoming more complex every day
– Features, performance, power, area
– Multi-threaded vs. Single Threaded
– General purpose vs. “focused MIPS”

Expect
– Less complex micro-architecture
– More CMP and SMT: they are just more efficient, less complex
– But how small can we split?

The BIG challenge:
– Finding the next BIG thing

Page 30: Scalability - IBM

The End

No animals were injured during the preparation of this presentation