HaDeS: Architectural Synthesis for Heterogeneous Dark...
Transcript of HaDeS: Architectural Synthesis for Heterogeneous Dark...
HaDeS: Architectural Synthesis for Heterogeneous Dark Silicon Chip
Multi-processorsYatish Turakhia, Bharathwaj Raghunathan, Siddharth Garg,
Diana Marculescu
1
Dark Silicon Challenge
0
0.2
0.4
0.6
0.8
1
1.2
0
5
10
15
20
25
30
35
45 32 22 16 11 8
# Tr
ansi
stor
s
Pow
er/D
ark
Silic
on
Technology Node
Power
Dark Silicon
# Transistors80% dark silicon at the 8 nm technology node[Esmaeilzadeh et al., ISCA’11]
2
The Dark Silicon Era: Demise of Multi-core Scaling?} Rapid rise in the fraction of dark silicon due to power
constraints.
} What's the point of increasing core count if they aren’t active??
} Need to think out of the !!!
Dark
3
} How to best utilize dark silicon for performance enhancement?} Heterogeneity
Dark Silicon Architectures
Accelerators
Heterogeneous Cores
How do we decide?ØWhat types of cores and how many cores of each type should one provision on a CMP of given area so as to maximize the overall performance benefit?ØWhich core/s to turn on without exceeding power budget / how to map threads?ØHow much parallelism to exploit?
4
Library of Cores
Area Budget Peak PowerConstraint
Multi-threaded Applications
seq
parpar par
seq
parpar par
Architectural Synthesis+
DOP Optimization and Thread Mapping
HaDeS
Synthesized Heterogeneous
Dark Silicon CMPLast Level Cache
HaDeS: Overview
5
Application Performance Model
seq
LastLevelCache
SynthesizedHeterogeneousDarkSiliconCMP
par par par
Application type i
6
Application Performance Model
seq
LastLevelCache
par
par par Dark
7
Application Performance Model
seq
LastLevelCache
par
par par
iE
Total execution time of application i
8
Application Performance Model
seq
LastLevelCache
par
par par
smisi
E t=
Sequential time of application ion core type ms
Core type ms
9
Application Performance Model
seq
LastLevelCache
par
par par
[1, ]max {s
i
si im v DOPE t
Î= +
Slowest parallel thread limits performance
10
Application Performance Model
seq
LastLevelCache
par
par par
[1, ]max {s
i
si vim DOPE t
Î= +
Ranges over all parallel threads
11
Application Performance Model
seq
LastLevelCache
par
par par
[1, ]max {
is
si v Pi Om DE t
Î= +
Degree of parallelism of application i
12
Application Performance Model
seq
LastLevelCache
par
par par
[ ]
( )
1,max {s
p
i
si im v DO
pim v
Pi
Et
DOPt
Î= + +
Total execution time of a parallel thread v of application i depends on the core mapping mp(v) and is inversely proportional to# parallel threads.
mp(1)
mp(2) mp(3)
13
Application Performance Model
seq
LastLevelCache
par
par par
)
, ] ( )(
[1m .ax { }
p
s pi
pim vs
i im v DOPi
iim v
tE t
DOPK DOP
Î= + +
Penalty term for increased resource contention with increasing # parallel threads
14
Application Performance Model
seq
LastLevelCache
par
par par
)
, ] ( )[1
(max { . }p
s pi
pi
i iv DOPi
m vsim im v
ttE DOP
DOPK
Î= + +
Modelparameters
15
Application Performance Model: Validation
seq
LastLevelCache
par
par par
( )( )[1, ]
max { . }p
s pi
pim vs
i iim im vv DOPi
tE t K DOP
DOPÎ= + +
Simple,butsurprisinglyaccurate!!!
16
HaDeS Framework
min .( l l )N
s pi i i
iræ ö
+ç ÷è øå
Goal: Minimize weighted sum of execution time for each application.
User defined weights for workload i
16
HaDeS Framework
min .( l l )N
s pi i i
iræ ö
+ç ÷è øå
Goal: Minimize weighted sum of execution time for each application.
Sequential execution time of workload i
18
HaDeS Framework
min .( l l )N
s pi i i
iræ ö
+ç ÷è øå
Goal: Minimize weighted sum of execution time for each application.
Parallel execution time of workload i
19
HaDeS Framework: ILP-OPT
min .( l l )N
s pi i i
iræ ö
+ç ÷è øå
. [1,N]pijp
i ij ij ii
tl b K DOP i
DOPæ ö
³ + " Îç ÷ç ÷è ø
OBJECTIVE
0-1 Indicator variable that is 1 if at least one parallel thread of application iexecutes on a core of type j
Constraint: Parallel Execution Time
20
HaDeS Framework: ILP-OPT
min .( l l )N
s pi i i
iræ ö
+ç ÷è øå
. [1,N]pijp
i ij ij ii
tl b K DOP i
DOPæ ö
³ + " Îç ÷ç ÷è ø
OBJECTIVE
Variable DOP introduces non-linearity
Constraint: Parallel Execution Time
21
HaDeS Framework: ILP-OPT
min .( l l )N
s pi i i
iræ ö
+ç ÷è øå OBJECTIVE
Constraint: Parallel Execution Time
max max
1 1
max
max
max
[1,N], [1,N], [1,DOP ]
[1,N], [1,N], [1,DOP ]
1 [1,N], [1,N], [1,DOP ]
p pDOP DOPij ijp
i ij iw ij ijw ijww w
ijw ij
ijw iw
ijw ij iw
t tl b y K w m K
w w
m b i j wm y i j wm b y i j w
= =
æ ö æ ö³ + = +ç ÷ ç ÷ç ÷ ç ÷
è ø è ø£ " Î Î Î
£ " Î Î Î
³ + - " Î Î Î
å å
Many variables O(MxNxDOPmax) to solve for in an ILP!Only O(MxN) variables in fixed DOP case
22
HaDeS Framework: ITER-OPT} Separates the architectural
optimization from DOP optimization.
} Start with initial guess of design vector, compute optimal DOP for this heterogeneous design, re-synthesize heterogeneous design for this DOP and keep iterating till no further improvement in performance.
} Represents local minima. Only 2.5% loss of optimality in experiments.
} Orders of magnitude faster (430X speedup)than ILP-OPT.
DOP Optimizer
D1 DN
App 1 App N
Architecture Optimizer
Optimized Heterogeneous Architecture
Opt
imal
DO
PFo
r E
ach
App
licat
ion
(D*)
Optim
al Heterogeneous
Architecture (Q
*)
Initial Guess For Q*
23
Experimental Results} Used Snipersim, interval core
model based simulator for performance estimates and McPAT for power/area numbers.
} Library of 108 cores with 5 multi-threaded application benchmarks (from Parsec and Splash2 suites). Only 15 cores pareto optimal in area, power or performance.
} Performance model fairly accurate. Only 5.2% average error over 180 experiments with homogeneous and heterogeneous dark silicon CMPs.
24
Experimental Results} Between 19% to 60% performance improvements in
heterogeneous dark silicon CMPs over variable and fixed (16 threads) DOP homogeneous design respectively for ~50% dark silicon.
25
Experimental Results} Benefits of heterogeneous design begin to saturate
beyond certain portion of dark silicon.
26
Thank You!
Please visit the poster for questions and discussions!
27