Configurational Workload Characterization Hashem H. Najaf-abadi Eric Rotenberg.
Heterogeneity
[Figures: Program 1 and Program 2 mapped to processors in three settings: a single core, multiple cores, and heterogeneous cores.]
Heterogeneous CMP Design
Must determine:
1) Best processor configuration for a group of workloads.
2) Best way to group workloads together.
The Challenge:
[Figure: Communal Customization. Workloads A to N in the workload space are grouped, and each group is mapped to a best core configuration (Core 1, Core 2).]
Existing Approaches
• Regression models: Enable speedy exploration.
• Subsetting: Reduce workloads to a representative subset based on characteristics.
The Argument
• Subsetting isn’t a valid substitute or facilitator for communal customization.
• Reason: complex interdependencies between different architectural units.
Ties that bind
1) The global clock intertwines the sizing of different architectural units.
2) The burden of compromise in one unit can be passed on to another.
Example: The Global Clock
[Figure: pipeline timing of the issue queue and the cache at clock periods of 1 ns and 0.66 ns. Solid line: delay of the issue queue; dashed line: access delay of the cache. Annotations: slack, less slack, pipeline too deep, small issue queue, needlessly large cache.]
Example: The Global Clock
The clock period, issue-queue size, and cache size cannot be optimized independently of each other.
[Figure: issue queue and cache delays at clock periods of 1 ns and 0.66 ns.]
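The coupling through the clock can be illustrated with a toy calculation. The sketch below is not from the slides: the unit delays are made-up numbers and `stages_and_slack` is a hypothetical helper. For a given clock period it reports how many pipeline stages a unit needs and how much of its time budget is left unused (slack).

```python
import math

def stages_and_slack(delay_ns, clock_ns):
    """Return (pipeline stages needed, unused time in ns) for one unit."""
    # The small epsilon guards against float noise when delay is an exact
    # multiple of the clock period.
    stages = max(1, math.ceil(delay_ns / clock_ns - 1e-9))
    slack = stages * clock_ns - delay_ns
    return stages, round(slack, 3)

# Hypothetical unit delays (ns), chosen only for illustration.
units = {"issue queue": 0.60, "cache": 0.95}

for clock_ns in (1.0, 0.66):
    for name, delay in units.items():
        stages, slack = stages_and_slack(delay, clock_ns)
        print(f"clock={clock_ns}ns {name}: {stages} stage(s), slack={slack}ns")
```

With these assumed delays, shortening the clock removes most of the issue queue's slack but pushes the cache into a second stage with a large amount of slack, so a sensible cache size depends on the clock period chosen for the issue queue.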
Example: Passing on the Burden
[Figure: plots of three workloads, α, β and γ, along five characteristics.]
A) Working-set size
B) Branch predictability
C) Density of dependence chains
D) Frequency of loads
E) Frequency of conditional branches
* All normalized to a scale of 0 to 10
Example: Passing on the Burden
[Figure: the plots of workloads α, β and γ over characteristics A to E (normalized to a scale of 0 to 10), shown together with the customized architecture for each workload. The speed of the core and of the cache is marked on an L to H scale.]
A More Accurate Solution
• Represent workloads by their customized architectural configurations.
• Allows direct and accurate evaluation of how well different workloads perform on customized configurations.
• We call this Configurational Workload Characterization.
Design Process Overview
How not to do it:
Important workloads -> [select representative workloads based on workload behavior] -> Representative workloads -> [search for optimal core combination] -> Optimal core combination
How to do it:
Important workloads -> [customize a core for each workload (configurational characterization)] -> Customized architectures -> [search for optimal core combination] -> Optimal core combination
Pros & Cons
- more costly to determine
+ provides a more optimal design solution
+ provides a systematic approach
+ can be performed prior to the design phase that is critical for time-to-market
XP-SCALAR
• A superscalar design-space exploration framework
• www4.ncsu.edu/~hhashem/xpscalar.htm
• Uses SimpleScalar to perform cycle-accurate simulations
• Uses the CACTI model to approximate the access latency of the different units
XP-SCALAR
What parameters are varied:
• Clock period
• Processor width
• Size of the issue queue
• Size of the register file
• Size of the load-store queue
• Size of the L1 and L2 caches
XP-SCALAR
How they are varied:
a) The clock period is varied, and architecture parameters are adjusted to make latencies fit within pipeline stages.
b) The number of pipeline stages of a unit is varied, and its configuration is adjusted accordingly.
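The two kinds of variation can be sketched for a single unit. This is a minimal illustration, not XP-SCALAR's code: the latency table stands in for what a model such as CACTI would report, and its values, the candidate sizes, and the helper `largest_fitting_size` are all hypothetical.

```python
# Hypothetical access latencies (ns) per cache size (KB). In practice a
# latency model such as CACTI would supply these; values are illustrative.
CACHE_LATENCY_NS = {8: 0.45, 16: 0.60, 32: 0.80, 64: 1.10, 128: 1.50}

def largest_fitting_size(latency_by_size, clock_ns, stages):
    """Largest size whose latency fits within `stages` clock cycles, or None."""
    budget = clock_ns * stages
    fitting = [size for size, lat in latency_by_size.items() if lat <= budget]
    return max(fitting) if fitting else None

# (a) vary the clock period; (b) vary the number of pipeline stages.
for clock_ns in (1.0, 0.66):
    for stages in (1, 2):
        size = largest_fitting_size(CACHE_LATENCY_NS, clock_ns, stages)
        print(f"clock={clock_ns}ns, {stages} stage(s): up to {size} KB")
```

Picking the largest size that fits is only one possible adjustment policy, assumed here for brevity.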
Determining the Best cores
• Execute all benchmarks on each other's customized configurations.
• From that, determine best grouping through a complete search.
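A minimal sketch of this search follows. The cross-performance numbers and the workload names `w1` to `w3` are invented for illustration and are not measured results. The sketch also assumes that each benchmark runs on whichever chosen core serves it best, and it scores a core set by either the arithmetic or the harmonic mean.

```python
from itertools import combinations
from statistics import mean, harmonic_mean

# ipt[b][c]: performance of benchmark b on the core customized for c.
# All numbers are hypothetical.
ipt = {
    "w1": {"w1": 4.0, "w2": 1.0, "w3": 2.0},
    "w2": {"w1": 0.5, "w2": 2.0, "w3": 1.5},
    "w3": {"w1": 1.0, "w2": 1.2, "w3": 1.8},
}

def score(cores, aggregate):
    """Aggregate performance when each benchmark uses its best chosen core."""
    return aggregate([max(ipt[b][c] for c in cores) for b in ipt])

def best_cores(n, aggregate):
    """Complete search over all n-core subsets of the customized cores."""
    candidates = combinations(sorted(ipt), n)
    return max(candidates, key=lambda cores: score(cores, aggregate))

for n in (1, 2):
    print(n, "avg:", best_cores(n, mean), "har:", best_cores(n, harmonic_mean))
```

The two means can prefer different core sets, because the harmonic mean penalizes a set that leaves any one benchmark performing poorly.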
Best Core Results
Selection                                           Customized core(s)            Avg. IPT   Har. IPT
Best config for avg. & har. IPT                     gcc                           2.06       1.57
2 best configs for avg. IPT                         parser, twolf                 2.27       1.76
2 best configs for har. IPT                         gcc, mcf                      2.12       1.88
3 best configs for avg. IPT                         crafty, parser, twolf         2.35       1.82
3 best configs for har. IPT                         crafty, mcf, twolf            2.27       2.05
4 best configs for avg. & har. IPT                  crafty, mcf, parser, twolf    2.32       2.08
Each benchmark on its own customized architecture   -                             2.38       2.12
The effect of subsetting
• Subsetting of a single pair of benchmarks results in the extraction of a totally different set of best cores.
Representation
• Dendrograms are
Conclusions
• There are interdependencies between architectural units in how they are customized.
• In the design of a heterogeneous CMP, subsetting can lead to performance degradation.