DARK SILICON AND THE END OF MULTICORE SCALING
A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. To provide a quantitative answer to this question, a comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.
Moore's law (the doubling of transistors on chip every 18 months) has been a fundamental driver of computing.1
For the past three decades, through device, circuit, microarchitecture, architecture, and compiler advances, Moore's law, coupled with Dennard scaling, has resulted in commensurate exponential performance increases.2 The recent shift to multicore designs aims to increase the number of cores using the increasing transistor count to continue the proportional scaling of performance.
With the end of Dennard scaling, future technology generations can sustain the doubling of devices every generation, but with significantly less improvement in energy efficiency at the device level. This device scaling trend presages a divergence between energy-efficiency gains and transistor-density increases. For the architecture community, it is crucial to understand how effectively multicore scaling will use increased device integration capacity to deliver performance speedups in the long term. While everyone understands that power and energy are critical problems, no detailed, quantitative study has addressed how severe (or not) the power problem will be for multicore scaling, especially given the large multicore design space (CPU-like, GPU-like, symmetric, asymmetric, dynamic, composed/fused, and so forth).
To explore the speedup potential of future multicores, we conducted a decade-long performance scaling projection for multicore designs assuming fixed power and area budgets. It considers devices, core microarchitectures, chip organizations, and benchmark characteristics, applying area and power constraints at future technology nodes. Through our models, we also estimate the effects of nonideal device scaling on integration capacity utilization and estimate the percentage of dark silicon (transistor integration capacity underutilization) on future multicore chips. For more information on related research, see the "Related Work in Modeling Multicore Speedup and Dark Silicon" sidebar.
Modeling multicore scaling

To project the upper bound performance achievable through multicore scaling (under current scaling assumptions), we considered technology scaling projections, single-core design scaling, multicore design choices,
Hadi Esmaeilzadeh, University of Washington
Emily Blem, University of Wisconsin-Madison
Renee St. Amant, University of Texas at Austin
Karthikeyan Sankaralingam, University of Wisconsin-Madison
Doug Burger, Microsoft Research
Published by the IEEE Computer Society. 0272-1732/12/$31.00 © 2012 IEEE
actual application behavior, and microarchitectural features. We considered fixed-size and fixed-power-budget chips. We built and combined three models to project performance, as Figure 1 shows. The three models are the device scaling model (DevM), the core scaling model (CorM), and the multicore scaling model (CmpM). The models predict performance speedup and show a gap between our projected speedup and the speedup we have come to expect with each technology generation. This gap is referred to as the dark silicon gap. The models also project the percentage of dark silicon as the process technology scales.
We built a device scaling model that provides the area, power, and frequency scaling factors at technology nodes from 45 nm through 8 nm. We consider aggressive International Technology Roadmap for Semiconductors (ITRS; http://www.itrs.net) projections and conservative projections from Borkar's recent study.3
We modeled the power/performance and area/performance of single-core designs using Pareto frontiers derived from real measurements. Through Pareto-optimal curves, the core-level model provides the maximum performance that a single core can sustain for any given area. Further, it provides the minimum power that must be consumed to sustain this level of performance.
We developed an analytical model that provides per-benchmark speedup of a
Related Work in Modeling Multicore Speedup and Dark Silicon
Hill and Marty extend Amdahl's law to model multicore speedup with symmetric, asymmetric, and dynamic topologies and conclude that dynamic multicores are superior.1 Their model uses area as the primary constraint and models the single-core area/performance tradeoff using Pollack's rule (Performance ∝ √Area) without considering technology trends.2 Azizi et al. derive the single-core energy/performance tradeoff of Pareto frontiers using architecture-level statistical models combined with circuit-level energy/performance tradeoff functions.3 For modeling single-core power/performance and area/performance tradeoffs, our core model derives two separate Pareto frontiers from real measurements. Furthermore, we project these tradeoff functions to future technology nodes using our device model.

Chakraborty considers device scaling and estimates a simultaneous activity factor for technology nodes down to 32 nm.4 Hempstead et al. introduce a variant of Amdahl's law to estimate the amount of specialization required to maintain 1.5× performance growth per year, assuming completely parallelizable code.5 Chung et al. study unconventional cores including custom logic, field-programmable gate arrays (FPGAs), or GPUs in heterogeneous single-chip design.6 They rely on Pollack's rule for the area/performance and power/performance tradeoffs. Using International Technology Roadmap for Semiconductors (ITRS) projections, they report on the potential for unconventional cores considering parallel kernels. Hardavellas et al. forecast the limits of multicore scaling and the emergence of dark silicon in servers with workloads that have an inherent abundance of parallelism.7 Using ITRS projections, Venkatesh et al. estimate technology-imposed utilization limits and motivate energy-efficient and application-specific core designs.8

Previous work largely abstracts away processor organization and application details. Our study provides a comprehensive model that considers the implications of process technology scaling; decouples power/area constraints; uses real measurements to model single-core design tradeoffs; and exhaustively considers multicore organizations, microarchitectural features, and the behavior of real applications.
References

1. M.D. Hill and M.R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, 2008, pp. 33-38.
2. F. Pollack, "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies," Proc. 32nd Ann. ACM/IEEE Int'l Symp. Microarchitecture (Micro 99), IEEE CS, 1999, p. 2.
3. O. Azizi et al., "Energy-Performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), ACM, 2010, pp. 26-36.
4. K. Chakraborty, "Over-Provisioned Multicore Systems," doctoral thesis, Dept. of Computer Sciences, Univ. of Wisconsin-Madison, 2008.
5. M. Hempstead, G.-Y. Wei, and D. Brooks, "Navigo: An Early-Stage Model to Study Power-Constrained Architectures and Specialization," Workshop on Modeling, Benchmarking, and Simulations (MoBS), 2009.
6. E.S. Chung et al., "Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPUs?" Proc. 43rd Ann. IEEE/ACM Int'l Symp. Microarchitecture (Micro 43), IEEE CS, 2010, pp. 225-236.
7. N. Hardavellas et al., "Toward Dark Silicon in Servers," IEEE Micro, vol. 31, no. 4, 2011, pp. 6-15.
8. G. Venkatesh et al., "Conservation Cores: Reducing the Energy of Mature Computations," Proc. 15th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 10), ACM, 2010, pp. 205-218.
MAY/JUNE 2012 123
multicore design compared to a baseline design. The model projects performance for each hybrid configuration based on high-level application properties and microarchitectural features. We modeled the two mainstream classes of multicore organizations, multicore CPUs and many-thread GPUs, which represent two extreme points in the threads-per-core spectrum. The CPU multicore organization represents Intel Nehalem-like, heavyweight multicore designs with fast caches and high single-thread performance. The GPU multicore organization represents Nvidia Tesla-like lightweight cores with heavy multithreading support and poor single-thread performance. For each multicore organization, we considered four topologies: symmetric, asymmetric, dynamic, and composed (fused).
Table 1 outlines the four topologies in the design space and the cores' roles during serial and parallel portions of applications. Single-thread (ST) cores are uniprocessor-style cores with large caches, and many-thread (MT) cores are GPU-style cores with smaller caches.
Combining the device model with the core model provided power/performance and area/performance Pareto frontiers at future technology nodes. Any performance improvements for future cores will come only at the cost of area or power as defined by these curves. Finally, combining all three models and performing an exhaustive design-space search produced the optimal multicore configuration and the maximum multicore speedups for each benchmark at future technology nodes while enforcing area, power, and benchmark constraints.
Future directions

As the rest of the article will elaborate, we model an upper bound on parallel application performance available from multicore
[Figure 1 appears here: a flow diagram combining ITRS and conservative device projections, empirical data for 152 processors, and a search over 800 configurations for 12 benchmarks across 2 chip organizations and 4 topologies.]

Figure 1. Overview of the methodology and models. By combining the device scaling model (DevM), core scaling model (CorM), and multicore scaling model (CmpM), we project performance speedup and reveal a gap between the projected speedup and the speedup expected with each technology generation, indicated as the dark silicon gap. The three-tier model also projects the percentage of dark silicon as technology scales.
TOP PICKS
and CMOS scaling, assuming no major disruptions in process scaling or core efficiency. Using a constant area and power budget, this study shows that the space of known multicore designs (CPUs, GPUs, and their hybrids) or novel heterogeneous topologies (for example, dynamic or composable) falls far short of the historical performance gains our industry is accustomed to. Even with aggressive ITRS scaling projections, scaling cores achieves a geometric mean 7.9× speedup through 2024 at 8 nm. With conservative scaling, only 3.7× geometric mean speedup is achievable at 8 nm. Furthermore, with ITRS projections, at 22 nm, 21 percent of the chip will be dark, and at 8 nm, more than 50 percent of the chip cannot be utilized.
The article's findings and methodology are both significant and indicate that without process breakthroughs, directions beyond multicore are needed to provide performance scaling. For decades, Dennard scaling permitted more transistors, faster transistors, and more energy-efficient transistors with each new process node, which justified the enormous costs required to develop each new process node. Dennard scaling's failure led industry to race down the multicore path, which for some time permitted performance scaling for parallel and multitasked workloads, permitting the economics of process scaling to hold. A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. Are we in a long-term multicore "era," or will industry need to move in different, perhaps radical, directions to justify the cost of scaling?
The glass is half-empty

A pessimistic interpretation of this study is that the performance improvements to which we have grown accustomed over the past 30 years are unlikely to continue with multicore scaling as the primary driver. The transition from multicore to a new approach is likely to be more disruptive than the transition to multicore and, to sustain the current cadence of Moore's law, must occur in only a few years. This period is much shorter than the traditional academic time frame required for research and technology transfer. Major architecture breakthroughs in "alternative" directions such as neuromorphic computing, quantum computing, or biointegration will require even more time to enter the industry product cycle. Furthermore, while a slowing of Moore's law will obviously not be fatal, it has significant economic implications for the semiconductor industry.
The glass is half-full

If energy-efficiency breakthroughs are made on supply voltage and process scaling, the performance improvement potential is high for applications with very high degrees of parallelism.
Rethinking multicore's long-term potential

We hope that our quantitative findings trigger some analyses in both academia and industry on the long-term potential of the multicore strategy. Academia is now
Table 1. The four multicore topologies for CPU-like and GPU-like organizations. (ST core: single-thread core; MT core: many-thread core.)

Organization    Portion of code   Symmetric topology              Asymmetric topology                                               Dynamic topology                     Composed topology
CPU multicore   Serial            1 ST core                       1 large ST core                                                   1 large ST core                      1 large ST core
                Parallel          N ST cores                      1 large ST core + N small ST cores                                N small ST cores                     N small ST cores
GPU multicore   Serial            1 MT core (1 thread)            1 large ST core (1 thread)                                        1 large ST core (1 thread)           1 large ST core (1 thread)
                Parallel          N MT cores (multiple threads)   1 large ST core (1 thread) + N small MT cores (multiple threads)  N small MT cores (multiple threads)  N small MT cores (multiple threads)
making a major investment in research focusing on multicore and its related problems of expressing and managing parallelism. Research projects assuming hundreds or thousands of capable cores should consider this model and the power requirements under various scaling projections before assuming that the cores will inevitably arrive. The paradigm shift toward multicores that started in the high-performance, general-purpose market has already percolated to mobile and embedded markets. The qualitative trends we predict and our modeling methodology hold true for all markets even though our study considers the high-end desktop market. This study's results could help break industry's current widespread consensus that multicore scaling is the viable forward path.
Model points to opportunities

Our study is based on a model that takes into account properties of devices, processor cores, multicore organization, and topology. Thus, the model inherently provides the places to focus on for innovation. To surpass the dark silicon performance barrier highlighted by our work, designers must develop systems that use significantly more energy-efficient techniques. Some examples include device abstractions beyond digital logic (error-prone devices); processing paradigms beyond superscalar, single instruction, multiple data (SIMD), and single instruction, multiple threads (SIMT); and program semantic abstractions allowing probabilistic and approximate computation. The results show that radical departures are needed, and the model shows quantitative ways to measure the impact of such techniques.
A case for microarchitecture innovation

Our study also shows that fundamental processing limitations emanate from the processor core. Clearly, architectures that move well past the power/performance Pareto-optimal frontier of today's designs are necessary to bridge the dark silicon gap and use transistor integration capacity. Thus, improvements to the core's efficiency will impact performance improvement and will enable technology scaling even though the core consumes only 20 percent of the power budget for an entire laptop, smartphone, or tablet. We believe this study will revitalize and trigger microarchitecture innovations, making the case for their urgency and potential impact.
A case for specialization

There is emerging consensus that specialization is a promising alternative to efficiently use transistors to improve performance. Our study serves as a quantitative motivation on such work's urgency and potential impact. Furthermore, our study shows quantitatively the levels of energy improvement that specialization techniques must deliver.
A case for complementing the core

Our study also shows that when performance becomes limited, techniques that occasionally use parts of the chip to deliver outcomes orthogonal to performance are ways to sustain the industry's economics. However, techniques that focus on using the device integration capacity for improving security, programmer productivity, software maintainability, and so forth must consider energy efficiency as a primary factor.
Device scaling model (DevM)

The device model (DevM) provides transistor-area, power, and frequency-scaling factors from a base technology node (for example, 45 nm) to future technologies. The area-scaling factor corresponds to the shrinkage in transistor dimensions. The DevM model calculates the frequency-scaling factor based on the fanout-of-four (FO4) delay reduction. The model computes the power-scaling factor using the predicted frequency, voltage, and gate capacitance scaling factors in accordance with the P = αC V_DD^2 f equation.
We generated two device scaling models: ITRS scaling and conservative scaling. The ITRS model uses projections from the 2010 ITRS. The conservative model is based on predictions presented by Borkar3 and represents a less optimistic view. Table 2 summarizes the parameters used for calculating the power- and performance-scaling factors. We allocated 20 percent of the chip power budget to leakage power and assumed chip designers can maintain this ratio.
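As a concrete check, the per-node power factors in Table 2 compose from the frequency, voltage, and capacitance factors exactly as the P = αC V_DD^2 f relation prescribes. A minimal sketch (the function name is ours, not the authors' tooling):

```python
def power_scaling_factor(freq_scale, vdd_scale, cap_scale):
    """Dynamic-power scaling under P = alpha * C * V_DD^2 * f: relative to
    the 45 nm baseline, the power factor is the product of the capacitance
    factor, the square of the V_DD factor, and the frequency factor
    (the activity factor alpha is assumed constant across nodes)."""
    return cap_scale * vdd_scale ** 2 * freq_scale

# ITRS projection for 32 nm (Table 2): freq 1.09, VDD 0.93, capacitance 0.70
print(round(power_scaling_factor(1.09, 0.93, 0.70), 2))  # -> 0.66, as in Table 2
```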
Core scaling model (CorM)

We built the technology-scalable core model (CorM) by populating the area/performance and power/performance design spaces with the data collected for a set of processors, all fabricated in the same technology node. The core model is the combination of the area/performance Pareto frontier, A(q), and the power/performance Pareto frontier, P(q), for these two design spaces, where q is a core's single-threaded performance. These frontiers capture the optimal area/performance and power/performance tradeoffs for a core while abstracting away specific details of the core.

As Figure 2 shows, we populated the two design spaces at 45 nm using 20 representative Intel and Advanced Micro Devices (AMD) processors and derived the Pareto frontiers. The Pareto frontier is the curve that bounds all power/performance (area/performance) points in the design space and indicates the minimum amount of power (area) required for a given performance level. The P(q) and A(q) pair, which are polynomial equations, constitute the core model. The core performance (q) is the processor's SPECmark and is collected from the SPEC website (http://www.spec.org). We estimated the core power budget using the thermal design power (TDP) reported in processor datasheets. The TDP is the chip power budget, or the amount of power the chip can dissipate without exceeding the transistor junction temperature. After excluding the share of uncore components from the power budget, we divided the power budget allocated to the cores by the number of cores to estimate the core power budget. We used die photos of the four microarchitectures (Intel Atom, Intel Core, AMD Shanghai, and Intel Nehalem) to estimate the core areas, excluding Level-2 (L2) and Level-3 (L3) caches. Because this work's focus is to study the impact of technology constraints on logic scaling rather than cache scaling, we derived the Pareto frontiers using only the portion of the power budget and area allocated to the cores in each processor, excluding the uncore components' share.

As Figure 2 illustrates, we fit a cubic polynomial, P(q), to the points along the edge of the power/performance design space, and a quadratic polynomial (Pollack's rule4), A(q), to the points along the edge of the area/performance design space. The Intel Atom Z520 with an estimated 1.89 W core TDP
Table 2. Scaling factors with International Technology Roadmap for Semiconductors (ITRS) and conservative projections. ITRS projections show an average 31 percent frequency increase and 35 percent power reduction per node, compared to an average 6 percent frequency increase and 23 percent power reduction per node for conservative projections.

Device scaling model   Year   Node (nm)   Frequency scaling factor (/45 nm)   VDD scaling factor (/45 nm)   Capacitance scaling factor (/45 nm)   Power scaling factor (/45 nm)
ITRS scaling           2010   45*         1.00                                1.00                          1.00                                  1.00
                       2012   32*         1.09                                0.93                          0.70                                  0.66
                       2015   22†         2.38                                0.84                          0.33                                  0.54
                       2018   16†         3.21                                0.75                          0.21                                  0.38
                       2021   11†         4.17                                0.68                          0.13                                  0.25
                       2024   8†          3.85                                0.62                          0.08                                  0.12
Conservative scaling   2008   45          1.00                                1.00                          1.00                                  1.00
                       2010   32          1.10                                0.93                          0.75                                  0.71
                       2012   22          1.19                                0.88                          0.56                                  0.52
                       2014   16          1.25                                0.86                          0.42                                  0.39
                       2016   11          1.30                                0.84                          0.32                                  0.29
                       2018   8           1.34                                0.84                          0.24                                  0.22

* Extended planar bulk transistors. † Multigate transistors.
represents the lowest-power design (lower-left frontier point), and the Nehalem-based Intel Core i7-965 Extreme Edition with an estimated 31.25 W core TDP represents the highest-performing design (upper-right frontier point). We used the points along the scaled Pareto frontier as the search space for determining the best core configuration by the multicore scaling model.
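The frontier-extraction step described above can be sketched as follows. The design points below are illustrative values, not the study's measured processor data, and the helper name is ours:

```python
def pareto_frontier(points):
    """Given (performance, power) design points, keep only designs for which
    no other design offers at least the same performance at lower power;
    the kept points trace the lower-bounding curve described in the text."""
    kept, min_power = [], float("inf")
    # Sweep from highest to lowest performance (ties: lowest power first),
    # keeping each point that sets a new power minimum.
    for perf, power in sorted(points, key=lambda p: (-p[0], p[1])):
        if power < min_power:
            kept.append((perf, power))
            min_power = power
    return sorted(kept)  # ascending performance

# Illustrative design points: (SPECmark, core power in W)
designs = [(10, 5.0), (20, 8.0), (20, 12.0), (30, 20.0), (15, 9.0)]
print(pareto_frontier(designs))  # -> [(10, 5.0), (20, 8.0), (30, 20.0)]
```

A cubic P(q) and a quadratic A(q) can then be least-squares fit to the kept points, as the article does for Figure 2.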
Multicore scaling model (CmpM)

We developed a detailed chip-level model (CmpM) that integrates the area and power frontiers, microarchitectural features, and application behavior, while accounting for the chip organization (CPU-like or GPU-like) and its topology (symmetric, asymmetric, dynamic, or composed). Guz et al. proposed a model for studying the first-order impacts of microarchitectural features (cache organization, memory bandwidth, threads per core, and so forth) and workload behavior (memory access patterns).5 Their model considers stalls due to memory dependences and resource constraints (bandwidth or functional units). We extended their approach to build our multicore model. Our extensions incorporate additional application behaviors, microarchitectural features, and physical constraints, and cover both homogeneous and heterogeneous multicore topologies.
Using this model, we consider single-threaded cores with large caches to cover the CPU multicore design space and massively threaded cores with minimal caches to cover the GPU multicore design space across all four topologies, as described in Table 1. Table 3 lists the input parameters to the model and how the multicore design choices impact them, if at all.
Microarchitectural features

Equation 1 calculates the multithreaded performance (Perf) of either a CPU-like or GPU-like multicore organization running a fully parallel (f = 1) and multithreaded application in terms of instructions per second by multiplying the number of cores (N) by the core utilization (η) and scaling by the ratio of the processor frequency to CPIexe:

Perf = min( N × (freq / CPIexe) × η , BWmax / (rm × mL1 × mL2 × b) )   (1)
The CPIexe parameter does not include stalls due to cache accesses, which are considered separately in the core utilization (η). The core utilization (η) is the fraction of time that a thread running on the core can keep it busy. It is modeled as a function of the
[Figure 2 appears here: scatter plots of the 45-nm Intel Nehalem, Intel Core, AMD Shanghai, and Intel Atom design points with the fitted frontiers P(q) = 0.0002q^3 + 0.0009q^2 + 0.3859q − 0.0301 and A(q) = 0.0152q^2 + 0.0265q + 7.4393.]

Figure 2. Design space and the derived Pareto frontiers. Power/performance frontier, 45 nm (a); area/performance frontier, 45 nm (b).
average time spent waiting for each memory access (t), the fraction of instructions that access memory (rm), and the CPIexe:

η = min( 1 , T / (1 + t × rm / CPIexe) )   (2)

The average time spent waiting for memory accesses (t) is a function of the time to access the caches (tL1 and tL2), the time to visit memory (tmem), and the predicted cache miss rates (mL1 and mL2):

t = (1 − mL1) × tL1 + mL1 × (1 − mL2) × tL2 + mL1 × mL2 × tmem   (3)

mL1 = ( CL1 / (T × βL1) )^(1 − αL1)  and  mL2 = ( CL2 / (T × βL2) )^(1 − αL2)   (4)
Multicore topologies

The multicore model is an extended Amdahl's law6 equation that incorporates the multicore performance (Perf) calculated from Equations 1 through 4:

Speedup = 1 / ( f / SParallel + (1 − f) / SSerial )   (5)

The CmpM model (Equation 5) measures the multicore speedup with respect to a baseline multicore (PerfB). That is, the parallel portion of code (f) is sped up by SParallel = PerfP / PerfB, and the serial portion of code (1 − f) is sped up by SSerial = PerfS / PerfB.
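Equation 5 is a one-liner in code; a sketch:

```python
def cmpm_speedup(f, s_parallel, s_serial):
    """Eq. 5: extended Amdahl's law in which the serial portion is also
    sped up (s_parallel = PerfP/PerfB, s_serial = PerfS/PerfB)."""
    return 1.0 / (f / s_parallel + (1 - f) / s_serial)

print(cmpm_speedup(1.0, 4.0, 1.0))  # fully parallel code -> 4.0
```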
We calculated the number of cores that can fit on the chip based on the multicore's topology, area budget (AREA), power budget (TDP), and each core's area, A(q), and power, P(q):

NSymm(q) = min( AREA / A(q) , TDP / P(q) )

NAsym(qL, qS) = min( (AREA − A(qL)) / A(qS) , (TDP − P(qL)) / P(qS) )

NDynm(qL, qS) = min( (AREA − A(qL)) / A(qS) , TDP / P(qS) )

NComp(qL, qS) = min( AREA / ((1 + τ) × A(qS)) , TDP / P(qS) )
For heterogeneous multicores, qS is the single-threaded performance of the small cores and qL is the large core's single-threaded
Table 3. CmpM parameters with default values from 45-nm Nehalem.

Parameter    Description                                            Default   Impacted by
N            Number of cores                                        4         Multicore topology
T            Number of threads per core                             1         Core style
freq         Core frequency (MHz)                                   3,200     Core performance
CPIexe       Cycles per instruction (zero-latency cache accesses)   1         Core performance, application
CL1          Level-1 (L1) cache size per core (Kbytes)              64        Core style
CL2          Level-2 (L2) cache size per chip (Mbytes)              2         Core style, multicore topology
tL1          L1 access time (cycles)                                3         N/A
tL2          L2 access time (cycles)                                20        N/A
tmem         Memory access time (cycles)                            426       Core performance
BWmax        Maximum memory bandwidth (Gbytes/s)                    200       Technology node
b            Bytes per memory access (bytes)                        64        N/A
f            Fraction of code that can be parallel                  Varies    Application
rm           Fraction of instructions that are memory accesses      Varies    Application
αL1, βL1     L1 cache miss rate function constants                  Varies    Application
αL2, βL2     L2 cache miss rate function constants                  Varies    Application
performance. The area overhead of supporting composability is τ, while no power overhead is assumed for composability support.
Model implementation

One of the contributions of this work is the incorporation of Pareto frontiers, physical constraints, real application behavior, and realistic microarchitectural features into the multicore speedup projections.

The input parameters that characterize an application are its cache behavior, fraction of instructions that are loads or stores, and fraction of parallel code. For the PARSEC benchmarks, we obtained this data from two previous studies.7,8 To obtain the fraction of parallel code (f) for each benchmark, we fit an Amdahl's law-based curve to the reported speedups across different numbers of cores from both studies. This fit shows values of f between 0.75 and 0.9999 for individual benchmarks.
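The f-fitting step can be reproduced with a simple least-squares search. This is an assumed reconstruction (the article does not publish its fitting code), using classic Amdahl speedups S(n) = 1/((1 − f) + f/n):

```python
def fit_parallel_fraction(ncores, speedups):
    """Grid-search the parallel fraction f that minimizes the squared error
    between Amdahl's-law speedups and the reported measurements."""
    def sq_error(f):
        return sum((1.0 / ((1.0 - f) + f / n) - s) ** 2
                   for n, s in zip(ncores, speedups))
    return min((i / 10000.0 for i in range(10001)), key=sq_error)

# Synthetic example: speedups generated with f = 0.9 are recovered.
measured = [1.0 / (0.1 + 0.9 / n) for n in (2, 4, 8)]
print(fit_parallel_fraction((2, 4, 8), measured))  # -> 0.9
```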
To incorporate the Pareto-optimal curves into the CmpM model, we converted the SPECmark scores (q) into an estimated CPIexe and core frequency. We assumed that core frequency scales linearly with performance, from 1.5 GHz for an Atom core to 3.2 GHz for a Nehalem core. Each application's CPIexe depends on its instruction mix and use of hardware optimizations such as functional units and out-of-order processing. Since the measured CPIexe for each benchmark at each technology node is not available, we used the CmpM model to generate per-benchmark CPIexe estimates for each design point along the Pareto frontier. With all other model inputs kept constant, we iteratively searched for the CPIexe at each processor design point. We started by assuming that the Nehalem core has a CPIexe of 1. Then, the smallest core, an Atom processor, should have a CPIexe such that the ratio of its CmpM performance to the Nehalem core's CmpM performance is the same as the ratio of their SPECmark scores (q). We assumed that the CPIexe does not change with technology node, while frequency scales.
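The iterative CPIexe search can be done by bisection, since modeled performance falls monotonically as CPIexe grows. A hypothetical sketch, where perf_ratio stands in for the CmpM performance ratio evaluated at a candidate CPIexe:

```python
def calibrate_cpi(target_ratio, perf_ratio, lo=0.1, hi=100.0):
    """Find the CPIexe in [lo, hi] at which perf_ratio(CPIexe) matches the
    SPECmark ratio; perf_ratio must be decreasing in CPIexe."""
    for _ in range(60):  # 60 halvings shrink the interval below 1e-16
        mid = (lo + hi) / 2.0
        if perf_ratio(mid) > target_ratio:
            lo = mid  # modeled core still too fast: raise CPIexe
        else:
            hi = mid
    return (lo + hi) / 2.0

# Toy model where relative performance is inversely proportional to CPIexe:
cpi = calibrate_cpi(2.5, lambda c: 10.0 / c)
print(round(cpi, 6))  # -> 4.0
```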
A key component of the detailed model is the set of input parameters modeling the cores' microarchitecture. For single-thread cores, we assumed that each core has a 64-Kbyte L1 cache, and chips with only single-thread cores have an L2 cache that is 30 percent of the chip area. MT cores have small L1 caches (32 Kbytes for every eight cores), support multiple hardware contexts (1,024 threads per eight cores), a thread register file, and no L2 cache. From Atom and Tesla die photos, we estimated that eight small many-thread cores, their shared L1 cache, and their thread register file can fit in the same area as one Atom processor. We assumed that off-chip bandwidth (BWmax) increases linearly as process technology scales down, while the memory access time is constant.
We assumed that t increases from 10 percent up to 400 percent, depending on the composed core's total area. The composed core's performance cannot exceed the performance of a single Nehalem core at 45 nm.
We derived the area and power budgets from the same quad-core Nehalem multicore at 45 nm, excluding the L2 and L3 caches. They are 111 mm2 and 125 W, respectively. The reported dark silicon projections are for the area budget that is solely allocated to the cores, not caches and other uncore components. The CmpM's speedup baseline is a quad-Nehalem multicore.
Combining models
Our three-tier modeling approach allows us to exhaustively explore the design space of future multicores, project their upper-bound performance, and estimate the amount of integration capacity underutilization, dark silicon.
Device × core model
To study core scaling in future technology nodes, we scaled the 45 nm Pareto frontiers down to 8 nm by scaling each processor data point's power and performance using the DevM model and then refitting the Pareto-optimal curves at each technology node. We assumed that performance, which we measured in SPECmark, would scale linearly with frequency. By making this assumption, we ignored the effects of memory latency and bandwidth on the core performance. Thus, actual performance gains through scaling could be lower. Based on the optimistic ITRS model, scaling a microarchitecture
(core) from 45 nm to 8 nm will result in a 3.9× performance improvement and an 88 percent reduction in power consumption. Conservative scaling, however, suggests that performance will increase only by 34 percent and that power will decrease by 74 percent.
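Applying these endpoint factors to a single Pareto point makes the contrast concrete. The 45 nm starting point below is hypothetical; only the 45 nm-to-8 nm scaling factors come from the text above:

```python
# A hypothetical 45 nm Pareto point (SPECmark score, power in watts).
point_45nm = {"perf_specmark": 25.0, "power_w": 20.0}

def scale(point, perf_gain, power_reduction):
    """Apply a performance multiplier and a fractional power reduction."""
    return {
        "perf_specmark": point["perf_specmark"] * perf_gain,
        "power_w": point["power_w"] * (1.0 - power_reduction),
    }

# ITRS: 3.9x performance, 88 percent less power at 8 nm.
itrs_8nm = scale(point_45nm, perf_gain=3.9, power_reduction=0.88)
# Conservative: 34 percent more performance, 74 percent less power.
conservative_8nm = scale(point_45nm, perf_gain=1.34, power_reduction=0.74)
print(itrs_8nm, conservative_8nm)
```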
Device × core × multicore model
We combined all three models to produce final projections for optimal multicore speedup, number of cores, and amount of dark silicon. To determine the best multicore configuration at each technology node, we swept the design points along the scaled area/performance and power/performance Pareto frontiers (DevM × CorM), because these points represent the most efficient designs. For each core design, we constructed a multicore consisting of one such core at each technology node. For a symmetric multicore, we iteratively added identical cores one by one until we hit the area or power budget or until the performance improvement leveled off. We swept the frontier and constructed a symmetric multicore for each processor design point. From this set of symmetric multicores, we picked the multicore with the best speedup as the optimal symmetric multicore for that technology node. The procedure is similar for other topologies. We performed this procedure separately for CPU-like and GPU-like organizations. The amount of dark silicon is the difference between the area occupied by cores for the optimal multicore and the area budget that is allocated only to the cores.
Scaling and future multicores
We used the combined models to study the future of multicore designs and their performance-limiting factors. The results from this study provide a detailed analysis of multicore behavior for 12 real applications from the PARSEC suite.
Speedup projections
Figure 3 summarizes all of the speedup projections in a single scatter plot. For every benchmark at each technology node, we plot the speedup of eight possible multicore configurations: (CPU-like or GPU-like) × (symmetric, asymmetric, dynamic, or composed). The exponential performance curve matches transistor count growth as process technology scales.
Finding: With optimal multicore configurations for each individual application, at 8 nm, only a 3.7× (conservative scaling) or 7.9× (ITRS scaling) geometric mean speedup is possible, as shown by the dashed line in Figure 3.
Finding: Highly parallel workloads with a degree of parallelism higher than 99 percent will continue to benefit from multicore scaling.
Finding: At 8 nm, the geometric mean speedup for dynamic and composed topologies is only 10 percent higher than the geometric mean speedup for symmetric topologies.
Dark silicon projections
To understand whether parallelism or the power budget is the primary source of the dark silicon speedup gap, we varied each of these factors in two experiments at 8 nm. First, we kept the power budget constant (our default budget is 125 W) and varied the level of parallelism in the PARSEC applications from 0.75 to 0.99, assuming that programmer effort can realize this improvement. Performance improved slowly as the parallelism level increased, with most benchmarks reaching a speedup of only about 15× at 99 percent parallelism. Provided that the power budget is the only limiting factor, typical upper-bound ITRS-scaling speedups will still be limited to 15×. With conservative scaling, this best-case speedup is limited to 6.3×.
For the second experiment, we kept each application's parallelism at its real level and varied the power budget from 50 W to 500 W. Eight of the 12 benchmarks showed no more than a 10× speedup even with a practically unlimited power budget. In other words, increasing core counts beyond a certain point did not improve performance because of the limited parallelism in the applications and Amdahl's law. Only four benchmarks have sufficient parallelism to even hypothetically sustain speedup levels that match the exponential transistor-count growth of Moore's law.
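The reason an unlimited power budget does not rescue most benchmarks is Amdahl's cap of 1/(1 − f); a quick calculation with illustrative parallel fractions shows the saturation:

```python
def amdahl(n, f):
    """Amdahl's-law speedup on n cores with parallel fraction f."""
    return 1.0 / ((1.0 - f) + f / n)

# Even a 64x increase in core count barely moves the needle once the
# speedup nears the 1 / (1 - f) ceiling. Fractions here are illustrative.
for f in (0.90, 0.99):
    print(f"f={f}: 64 cores -> {amdahl(64, f):.1f}x, "
          f"4096 cores -> {amdahl(4096, f):.1f}x, "
          f"ceiling {1.0 / (1.0 - f):.0f}x")
```

At f = 0.90, going from 64 cores to 4,096 cores gains barely one extra unit of speedup, which is the "no more than 10× even with unlimited power" behavior observed above.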
Finding: With ITRS projections, at 22 nm, 21 percent of the chip will be dark, and at 8 nm, more than 50 percent of the chip cannot be utilized.
Finding: The level of parallelism in PARSEC applications is the primary contributor to the dark silicon speedup gap. However, in realistic settings, the dark silicon resulting from power constraints limits the achievable speedup.
Core count projections
Different applications saturate performance improvements at different core counts. We considered the chip configuration that provided the best speedups for all applications to be the ideal configuration. Figure 4 shows the number of cores (solid line) for the ideal CPU-like dynamic multicore configuration across technology generations, because dynamic configurations performed best. The dashed line illustrates the number of cores required to achieve 90 percent of the ideal configuration's geometric mean speedup across the PARSEC benchmarks. As depicted, with ITRS scaling, the ideal configuration integrates 442 cores at 8 nm; however, 35 cores reach 90 percent of the speedup achievable by 442 cores. With conservative scaling, the 90 percent speedup core count is 20 at 8 nm.
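The "90 percent configuration" search can be sketched under a plain Amdahl model. The parallel fraction below is hypothetical rather than a PARSEC value, and a pure Amdahl model yields a larger count than the paper's 35 cores because it ignores the power and area constraints the full model applies:

```python
def amdahl(n, f):
    """Amdahl's-law speedup on n cores with parallel fraction f."""
    return 1.0 / ((1.0 - f) + f / n)

def cores_for_fraction(f, ideal_cores, frac=0.9):
    """Smallest core count reaching frac of the ideal configuration's speedup."""
    target = frac * amdahl(ideal_cores, f)
    n = 1
    while amdahl(n, f) < target:
        n += 1
    return n

# Hypothetical parallel fraction; 442 cores is the ideal count quoted above.
print(cores_for_fraction(0.97, ideal_cores=442))
```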
Finding: Due to limited parallelism in the PARSEC benchmark suite, even with novel heterogeneous topologies and optimistic ITRS scaling, integrating more than 35 cores improves performance only slightly for CPU-like topologies.
Sensitivity studies
We performed sensitivity studies on the impact of various features, including L2 cache sizes, memory bandwidth, simultaneous multithreading (SMT) support, and the percentage of total power allocated to leakage. Quantitatively, these studies show that these features have limited impact on multicore performance.
Limitations
Our device and core models do not explicitly consider dynamic voltage and frequency scaling (DVFS). Instead, we take an optimistic approach to account for its best-case impact. When deriving the Pareto frontiers, we assume that each processor data point operates at its optimal voltage and
Figure 3. Speedup across process technology nodes across all organizations and topologies with PARSEC benchmarks. The exponential performance curve matches transistor count growth. Conservative scaling (a); ITRS scaling (b).
frequency setting (VDDmin, Freqmax). At a fixed VDD setting, scaling down the frequency from Freqmax results in a power/performance point inside the optimal Pareto curve, which is a suboptimal design point. However, scaling voltage up and operating at a new (V′DDmin, Freq′max) setting results in a different power/performance point that is still on the optimal frontier. Because we investigate all of the points along the frontier to find the optimal multicore configuration, our study covers multicore designs that introduce heterogeneity to symmetric topologies through DVFS. The multicore model considers the first-order impact of caching, parallelism, and threading under assumptions that result only in optimistic projections. Comparing the CmpM model's output against published empirical results confirms that our model always overpredicts multicore performance. The model optimistically assumes that the workload is homogeneous; that work is infinitely parallel during parallel sections of code; that memory accesses never stall due to a previous access; and that no thread synchronization, operating system serialization, or swapping occurs.
This work makes two key contributions: projecting multicore speedup limits and quantifying the dark silicon effect, and providing a novel and extensible model that integrates device scaling trends, core design tradeoffs, and multicore configurations. While abstracting away many details, the model can find optimal configurations and project performance for CPU- and GPU-style multicores while considering microarchitectural features and high-level application properties. We made our model publicly available at http://research.cs.wisc.edu/vertical/DarkSilicon. We believe this study makes the case for innovation's urgency and its potential for high impact, while providing a model that researchers and engineers can adopt as a tool to study the limits of their solutions.
Acknowledgments
We thank Shekhar Borkar for sharing his personal views on how CMOS devices are likely to scale. Support for this research was provided by the NSF under grants CCF-0845751, CCF-0917238, and CNS-0917213.
Figure 4. Number of cores for the ideal CPU-like dynamic multicore configurations and the number of cores delivering 90 percent of the speedup achievable by the ideal configurations across the PARSEC benchmarks. Conservative scaling (a); ITRS scaling (b).
References
1. G.E. Moore, "Cramming More Components onto Integrated Circuits," Electronics, vol. 38, no. 8, 1965, pp. 56-59.
2. R.H. Dennard et al., "Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions," IEEE J. Solid-State Circuits, vol. 9, no. 5, 1974, pp. 256-268.
3. S. Borkar, "The Exascale Challenge," Proc. Int'l Symp. VLSI Design, Automation and Test (VLSI-DAT 10), IEEE CS, 2010, pp. 2-3.
4. F. Pollack, "New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies," Proc. 32nd Ann. ACM/IEEE Int'l Symp. Microarchitecture (Micro 99), IEEE CS, 1999, p. 2.
5. Z. Guz et al., "Many-Core vs. Many-Thread Machines: Stay Away From the Valley," IEEE Computer Architecture Letters, vol. 8, no. 1, 2009, pp. 25-28.
6. G.M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," Proc. Joint Computer Conf. American Federation of Information Processing Societies (AFIPS 67), ACM, 1967, doi:10.1145/1465482.1465560.
7. M. Bhadauria, V. Weaver, and S. McKee, "Understanding PARSEC Performance on Contemporary CMPs," Proc. IEEE Int'l Symp. Workload Characterization (IISWC 09), IEEE CS, 2009, pp. 98-107.
8. C. Bienia et al., "The PARSEC Benchmark Suite: Characterization and Architectural Implications," Proc. 17th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 08), ACM, 2008, pp. 72-81.
Hadi Esmaeilzadeh is a PhD student in the Department of Computer Science and Engineering at the University of Washington. His research interests include power-efficient architectures, approximate general-purpose computing, mixed-signal architectures, machine learning, and compilers. Esmaeilzadeh has an MS in computer science from the University of Texas at Austin and an MS in electrical and computer engineering from the University of Tehran.
Emily Blem is a PhD student in the Department of Computer Sciences at the University of Wisconsin-Madison. Her research interests include energy and performance tradeoffs in computer architecture and quantifying them using analytic performance modeling. Blem has an MS in computer science from the University of Wisconsin-Madison.
Renee St. Amant is a PhD student in the Department of Computer Science at the University of Texas at Austin. Her research interests include computer architecture, low-power microarchitectures, mixed-signal approximate computation, new computing technologies, and storage design for approximate computing. St. Amant has an MS in computer science from the University of Texas at Austin.
Karthikeyan Sankaralingam is an assistant professor in the Department of Computer Sciences at the University of Wisconsin-Madison, where he also leads the Vertical Research Group. His research interests include microarchitecture, architecture, and very-large-scale integration (VLSI). Sankaralingam has a PhD in computer science from the University of Texas at Austin.
Doug Burger is the director of client and cloud applications at Microsoft Research, where he manages multiple strategic research projects covering new user interfaces, datacenter specialization, cloud architectures, and platforms that support personalized online services. Burger has a PhD in computer science from the University of Wisconsin. He is a fellow of IEEE and the ACM.
Direct questions and comments about this article to Hadi Esmaeilzadeh, University of Washington, Computer Science & Engineering, Box 352350, AC 101, 185 Stevens Way, Seattle, WA 98195; [email protected].