Leveraging Simultaneous Multithreading for Adaptive ...dbrooks/tacs05/donald_slides.pdf · Process...
Transcript of Leveraging Simultaneous Multithreading for Adaptive ...dbrooks/tacs05/donald_slides.pdf · Process...
Leveraging Simultaneous Multithreading for Adaptive Thermal Control
James Donald & Margaret MartonosiPrinceton University
Outline
• Background & Problem Description• Related Work• Experimental Setup & Benchmarks• Adaptive Fetch Algorithm• Adaptive Register Renaming• Future Work & Conclusions
Background
• Temperature-Aware Design– Increasingly relevant w/ greater transistor densities– Tools e.g. HotSpot
• Applied to multithreaded architectures (SMT, CMP)– Applicable to prevalent emerging architectures– Provides opportunity for smarter thermal
management
Localized Thermal Hotspots• A single hot spot
limits performance.• Global DTM (e.g.
DVFS) handles this at expense of entire core’s performance.
Hottest Spots for fma3d-twolf with Basic Fetch Throttling
84.9084.9184.9284.9384.9484.9584.9684.9784.9884.9985.0085.01
0 20 40 60 80 100 120 140 160Time (ms)
Tem
pera
ture
(°C
)
FXU registers
FPU registers
Proposed Solution
• Selectively fetch instructions to mitigate single hotspot’s severity.
• Our experiments: two hotspots monitored.
Hottest Spots for fma3d-twolf with Adaptive Fetch
84.9084.9184.9284.9384.9484.9584.9684.9784.9884.9985.0085.01
0 20 40 60 80 100 120 140Time (ms)
Tem
pera
ture
(°C
)
FXU registers
FPU registers
fetch policythermal sensors
profiling data
thread#1thread#2
IFU decision
Outline
• Background & Problem Description• Related Work• Experimental Setup & Benchmarks• Adaptive Fetch Algorithm• Adaptive Register Renaming• Future Work & Conclusions
Related Work
• M. Powell, M. Gomaa and T. N. Vijaykumar. “Heat and Run: Leveraging SMT and CMP to Manage Power Density Through the Operating System”. ASPLOS 2004.
• Y. Li, Z. Hu, D. Brooks and K. Skadron. “Performance, Energy, and Thermal Considerations for SMT and CMP Architectures”. HPCA 2005.
Outline
• Background & Problem Description• Related Work• Experimental Setup & Benchmarks• Adaptive Fetch Algorithm• Adaptive Register Renaming• Future Work & Conclusions
Simulation Setup
• Turandot PowerPC simulator w/ SMT• PowerTimer + leakage power estimation• HotSpot 2.0
Machine ParametersProcessor Configuration
Process Technology 0.18 µ
Supply Voltage 1.2 V
Clock Rate 1.4 GHz
Organization single-core, 2-context SMT
Functional Units 2 FXU, 2 FPU, 2 LSU, 1 BRU
PhysicalRegisters
120 GPR, 90 FPR
Thermal Control fetch throttling
Processor Floorplan
SPEC Benchmark Heat Behaviorsbenchmark type function FXUreg FPUregammp FP computational chemistryapplu FP CFD/physicsfma3d FP mechanical responsegalgel FP CFD
gcc int compilergzip int compressionmcf int transportation schedmesa FP 3-D graphicsparser int word processingtwolf int lithography routing
Test WorkloadsSMT workload Mix typeammp-gzip FP-integerammp-mcf FP-integerapplu-parser FP-integerapplu-twolf FP-integerfma3d-galgel FP-FPfma3d-twolf FP-integergalgel-mesa FP-integergcc-mesa integer-FPgcc-parser integer-integergzip-mcf integer-integer
• Tests a wide range of integer/floating-point and hot/hotter mixes.
Outline
• Background & Problem Description• Related Work• Experimental Setup & Benchmarks• Adaptive Fetch Algorithm• Adaptive Register Renaming• Future Work & Conclusions
Adaptive Fetch Algorithm (a)• Decide which hotspot to mitigate:
Total Int regfile accesses
Total FP regfile accesses
Total Instructions fetched
/
/
Thermal threshold °
Int regfile temp °
FP regfile temp °
-
-
/
/
comparison
Critical regfile
gapetemperaturrateaccessregfileoverheattorate =
Adaptive Fetch Algorithm (b)• Favor best thread for the identified hotspot.
– Compare which thread has lower:
Thread1 Int(or FP) regfile accesses
Thread2 Int(or FP) regfile accesses
Thread2 Instructions fetched
/
/
Thread1 Instructions fetched
comparison
Priority thread
Moderate policy: 50% of cycles default to round-robinAggressive policy: 12.5%
instraccesses
Metrics
• Performance – IPC and weighted speedup
• Power-performance – ED2 product
• Thread fairness (percentages of retired instructions)
Performance Results57
.8%
/ 42
.2%
66.7
% /
33.3
%
63.5
% /
36.5
%
60.6
% /
39.4
%
43.4
% /
56.6
%
60.4
% /
39.6
%
60.8
% /
39.2
%
53.0
% /
47.0
%
62.0
% /
38.0
%
57.1
% /
42.9
%58.9
% /
41.1
%
68.1
% /
31.9
%
61.8
% /
38.2
%
61.0
% /
39.0
%
41.4
% /
58.6
%
62.6
% /
37.4
%
60.7
% /
39.3
%
50.5
% /
49.5
%
60.4
% /
39.6
%
59.0
% /
41.0
%
70.3
% /
29.7
%
78.4
% /
21.6
%
47.5
% /
52.5
%
51.1
% /
48.9
%
40.8
% /
59.2
%
62.1
% /
37.9
%
35.6
% /
64.4
%
22.5
% /
77.5
% 32.8
% /
67.2
%
72.3
% /
27.7
%
00.10.20.30.40.50.60.70.80.9
1
ammp-gzip
ammp-mcf
applu-parser
applu-twolf
fma3d-galgel
fma3d-twolf
galgel-mesa
gcc-mesa
gcc-parser
gzip-mcf
workload mix
Wei
ghte
d sp
eedu
p
baseline (fetch throttle)moderate adaptive fetchaggressive adaptive fetch
Significant performance increases at the cost of thread fairness.
Results – ED2 Product57
.8%
/ 42
.2%
66.7
% /
33.3
%
63.5
% /
36.5
%
60.6
% /
39.4
%
43.4
% /
56.6
%
60.4
% /
39.6
%
60.8
% /
39.2
%
53.0
% /
47.0
%
62.0
% /
38.0
%
57.1
% /
42.9
%
58.9
% /
41.1
%
68.1
% /
31.9
%
61.8
% /
38.2
%
61.0
% /
39.0
%
41.4
% /
58.6
%
62.6
% /
37.4
%
60.7
% /
39.3
%
50.5
% /
49.5
%
60.4
% /
39.6
%
59.0
% /
41.0
%
70.3
% /
29.7
%
78.4
% /
21.6
%
47.5
% /
52.5
%
51.1
% /
48.9
%
40.8
% /
59.2
%
62.1
% /
37.9
%
35.6
% /
64.4
%
22.5
% /
77.5
%
32.8
% /
67.2
%
72.3
% /
27.7
%
00.10.20.30.40.50.60.70.80.9
11.11.21.31.41.5
ammp-gzip
ammp-mcf
applu-parser
applu-twolf
fma3d-galgel
fma3d-twolf
galgel-mesa
gcc-mesa
gcc-parser
gzip-mcf
workload mix
ED2 p
rodu
ct (n
orm
aliz
ed)
baseline (fetch throttle)moderate adaptive fetchaggressive adaptive fetch
Quadraticallycorrelated ED2
reduction.
Outline
• Background & Problem Description• Related Work• Experimental Setup & Benchmarks• Adaptive Fetch Algorithm• Adaptive Register Renaming• Future Work & Conclusions
Adaptive Register Renaming
• Same counter-based thread priority decision algorithm.• Difference: Decision affects thread priority for register
mapping instead of instruction fetch.• Disadvantage: Response at later pipeline stage already
allows instructions in to waste resources.• Advantage: Closer to point of interest (registers),
allows more isolated, simpler implementation.
Speedup – Adaptive Renaming57
.7%
/ 42
.3%
66.6
% /
33.4
%
63.2
% /
36.8
%
60.3
% /
39.7
%
43.1
% /
56.9
%
60.3
% /
39.7
%
60.8
% /
39.2
%
53.0
% /
47.0
%
61.8
% /
38.2
%
57.1
% /
42.9
%
58.4
% /
41.6
%
67.0
% /
33.0
%
62.7
% /
37.3
%
61.3
% /
38.7
%
40.5
% /
59.5
%
61.3
% /
38.7
%
61.8
% /
38.2
%
47.7
% /
52.3
%
56.8
% /
43.2
%
57.5
% /
42.5
%
62.5
% /
37.5
%
70.9
% /
29.1
%
46.1
% /
53.9
%
61.7
% /
38.3
%
31.9
% /
68.1
%
64.9
% /
35.1
%
45.9
% /
54.1
%
23.4
% /
76.6
%
32.7
% /
67.3
%
62.1
% /
37.9
%
00.10.20.30.40.50.60.70.80.9
1
ammp-gzip
ammp-mcf
applu-parser
applu-twolf
fma3d-galgel
fma3d-twolf
galgel-mesa
gcc-mesa
gcc-parser
gzip-mcf
workload mix
Wei
ghte
d sp
eedu
p
baseline (rename throttle)moderate adaptive renamingaggressive adaptive renaming
ED2 – Adaptive Renaming57
.8%
/ 42
.2%
66.7
% /
33.3
%
63.5
% /
36.5
%
60.6
% /
39.4
%
43.4
% /
56.6
%
60.4
% /
39.6
%
60.8
% /
39.2
%
53.0
% /
47.0
%
62.0
% /
38.0
%
57.1
% /
42.9
%
58.9
% /
41.1
%
68.1
% /
31.9
%
61.8
% /
38.2
%
61.0
% /
39.0
%
41.4
% /
58.6
%
62.6
% /
37.4
%
60.7
% /
39.3
%
50.5
% /
49.5
%
60.4
% /
39.6
%
59.0
% /
41.0
%
70.3
% /
29.7
%
78.4
% /
21.6
%
47.5
% /
52.5
%
51.1
% /
48.9
%
40.8
% /
59.2
%
62.1
% /
37.9
%
35.6
% /
64.4
%
22.5
% /
77.5
%
32.8
% /
67.2
%
72.3
% /
27.7
%
00.10.20.30.40.50.60.70.80.9
11.1
ammp-gzip
ammp-mcf
applu-parser
applu-twolf
fma3d-galgel
fma3d-twolf
galgel-mesa
gcc-mesa
gcc-parser
gzip-mcf
workload mix
ED2 p
rodu
ct (n
orm
aliz
ed)
baseline (rename throttle)moderate adaptive renamingaggressive adaptive renaming
Outline
• Background & Problem Description• Related Work• Experimental Setup & Benchmarks• Adaptive Fetch Algorithm• Adaptive Register Renaming• Future Work & Conclusions
Future Work
• Implement in context of multicore (SMT combined CMP) processors.
• Comparison to real-world fetch policies (e.g. ICOUNT).
• Tradeoffs when in conjunction with global DTM strategies.
Conclusions
• Adaptive fetch: 30% performance improvement (44% ED2 reduction)
• Adaptive rename: 23% performance improvement (35% ED2 reduction)
• Benefits even in integer-only or FP-only mixes.
• Tradeoff between power-performance benefits and thread fairness.
Acknowledgements
• Special thanks– IBM– Yingmin Li– The anonymous reviewers