CPU POWER ESTIMATION USING PMCs, AND ITS APPLICATION …
Transcript of CPU POWER ESTIMATION USING PMCs, AND ITS APPLICATION …
1
CPU POWER ESTIMATIONUSING PMCs, AND ITSAPPLICATION IN gem5
Dr Geoff MerrettArm Research Summit, 11 September 2017
2
OVERVIEW
Introduction and Background
Power Estimation on Hardware
• Our Accurate and Robust Approach
• Open Source Tools
Power Estimation in gem5
• PMCs vs gem5 Statistics
• Power Estimation
Conclusions
3
WHY POWER ESTIMATION?
Run-Time Management Approaches
• Make energy-savings by controlling operation.
– DVFS (dynamic-voltage frequency scaling) and DPM
– Task scheduling and mapping
• Make decisions based on real-time power ‘measurements’
System Research
• Design-space exploration
• Evaluating new power management strategies
• Research power-optimized software (microcode to applications)
• SOC architecture & design balancing for power and performance
4
POWER MODELLING APPROACHES
“Bottom-Up” Power Models
• Take a design specification (e.g. pipeline stages, ROB size etc.)
• Simulate gates and toggle rates
• Uses statistics from an architectural simulator (e.g. gem5)
• Advantages: flexibility to specify any design; cache size, etc.
• Disadvantages: large errors, slow, limited validation
“Top-Down” Power Models
• Characterise a specific device
• Estimate relationship between measured power and stats, e.g. PMCs
• Advantages: accurate and lightweight
• Disadvantages: specific to the device they were built on
5
POWMON METHODOLOGY1. Runworkloads
@differentDVFSlevels
39workloadsused:MiBench,LMBench,RoyLongbottom,ParMiBench andALPBench
ODROID-XU3Exynos-54224xCortex-A74xCortex-A5
2.Record• PMCs• Power,Voltage,
Temperature,etc.
3.ChoosePMCsHierarchicalclusteranalysis,Correlationmatrixanalysis,Exhaustivesearch,etc.
5.Validate• K-foldcrossvalidation• R2 :>0.99• Error:2.8– 3.7%
6.Uses• OSRun-time
management• Referenceforresearch• gem5add-on
4.BuildModel• OLSmultipleregression• Considerscollinearityand
heteroscedasticity• “sensible”equation
M. J. Walker et al., "Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106-119, Jan. 2017.
6
THE POWMON APPROACH
A power model’s stability is more important than its average error
Unstable model
• Appears accurate, but performs poorly with diverse workloads
Stable model
• Remains accurate across a diverse range of workloads and scenarios
• Requires careful choice of inputs (PMCs) & observations (workloads)
Eg: choose 3 sensors and appropriate training data to estimate colour:
Training Dataset A: Training Dataset B:
Input Colour Channel A: Input Colour Channel B:
Unstable StableM. J. Walker et al., "Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106-119, Jan. 2017.
7
PERFORMANCE MONITORING COUNTERS
CPU Registers that count architectural and microarchitectural events
– E.g. L2 cache miss, TLB access, integer instruction, etc.
Positives
• Available on several platforms (e.g. ARM, Intel, AMD); low overhead
• Many different available events (>70)…
Negatives
• …but a small number (e.g. 4-6) can be monitored simultaneously
PMCs are often selected using intuition – e.g. try to split PMCs into different sub-architectural units. However can be problematic as:
• They may not gather enough information
• Different PMCs are correlated (can make a model unstable)
8
HIERARCHICAL CLUSTER ANALYSIS
• HCA groups similar events together
• Output is a dendrogram
• This allows PMCs to be groupedinto clusters
M. J. Walker et al., "Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106-119, Jan. 2017.
9
HIERARCHICAL CLUSTER ANALYSIS
• We combine clusters with correlation of each event with CPU power
• Aim: Choose PMCs with a high correlation with power, avoiding ones from the same cluster
02/12/2015, 13:52Graph Test:Run-Time Power Modelling
Page 1 of 1http://127.0.0.1/~Matthew/Micro_Server/results-plots/results-pmc-drag-drop.html
PMC Exploration Results
Cluster and CorrelationSelect Clustering: None Clustering A Clustering B
Cluster and CorrelationL1
D_T
LB_R
EFIL
L:0x
05L1
D_T
LB_R
EFIL
L_LD
:0x4
CM
EM_A
CCES
S:0x
13L1
D_C
ACHE
_ACC
ESS:
0x04
CYCL
E_CO
UNT:
0x11
LDST
_SPE
C:0x
72LD
_SPE
C:0x
70M
EM_A
CCES
S_LD
:0x6
6BU
S_CY
CLES
:0x1
DL1
D_C
ACHE
_LD
:0x4
0BR
_RET
RUN
_SPE
C:0x
79BR
_IN
DIR
ECT_
SPEC
:0x7
AIN
ST_S
PEC:
0x1B
L1I_
CACH
E_AC
CESS
:0x1
4BR
_IM
MED
_SPE
C:0x
78D
P_SP
EC:0
x73
BR_P
RED
:0x1
2BR
_MIS
_PRE
D:0
x10
PC_W
RITE
_SPE
C:0x
76IN
STR_
RETI
RED
:0x0
8VP
F_SP
EC:0
x75
STRE
X_PA
SS_S
PEC:
0x6D
DM
B_SP
EC:0
x7E
LDRE
X_SP
EC:0
x6C
L2D
_CAC
HE_W
B:0x
18L1
D_C
ACHE
_WB:
0x15
L1D
_CAC
HE_R
EFIL
L_LD
:0x4
2
L1D
_CAC
HE_R
EFIL
L_ST
:0x4
3
L1D
_CAC
HE_W
B_VI
CTIM
:0x4
6
L2D
_CAC
HE_W
B_VI
CTIM
:0x5
6BU
S_AC
CESS
_LD
:0x6
0BU
S_AC
CESS
:0x1
9L2
D_C
ACHE
_REF
ILL:
0x17
BUS_
ACCE
SS_S
HARE
D:0
x62
BUS_
ACCE
SS_S
T:0x
61
L2D
_CAC
HE_R
EFIL
L_LD
:0x5
2
L2D
_CAC
HE_R
EFIL
L_ST
:0x5
3L2
D_C
ACHE
_ACC
ESS:
0x16
BUS_
ACCE
SS_N
ORM
AL:0
x64
L2D
_CAC
HE_S
T:0x
51L2
D_C
ACHE
_LD
:0x5
0L1
D_C
ACHE
_REF
ILL:
0x03
L1D
_TLB
_REF
ILL_
ST:0
x4D
UNAL
IGN
ED_L
D_S
PEC:
0x68
UNAL
IGN
ED_S
T_SP
EC:0
x69
UNAL
IGN
ED_L
DST
_SPE
C:0x
6AM
EM_A
CCES
S_ST
:0x6
7ST
_SPE
C:0x
71L1
D_C
ACHE
_ST:
0x41
L2D
_CAC
HE_I
NVA
L:0x
58ST
REX_
FAIL
_SPE
C:0x
6E
L2D
_CAC
HE_W
B_CL
EAN
:0x5
7L1
I_CA
CHE_
REFI
LL:0
x01
L1I_
TLB_
REFI
LL:0
x02
DSB
_SPE
C:0x
7DEX
C_RE
TURN
:0x0
A
BUS_
ACCE
SS_N
OT_
SHAR
ED:0
x63
BUS_
ACCE
SS_P
ERIP
H:0x
65EX
C_TA
KEN
:0x0
9
L1D
_CAC
HE_W
B_CL
EAN
:0x4
7L1
D_C
ACHE
_IN
VAL:
0x48
CID
_WT_
RETI
RED
:0x0
B
TTBR
_WRI
TE_R
ETIR
ED:0
x1C
ISB_
SPEC
:0x7
CAS
E_SP
EC:0
x74
-0.1
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Corr
elat
ion
with
Pow
er
Events
M. J. Walker et al., "Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106-119, Jan. 2017.
10
STABLE vs UNSTABLE MODELS
1) Training and validating the model with a ‘typical’ set of workloads
Both unstable and stable model seem good (<2.5%)
Training: Small set of 20 typical workloads (S.T), e.g. MiBench
Testing: Small set of 20 typical workloads (S.T), e.g. MiBench
M. J. Walker et al., "Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106-119, Jan. 2017.
11
STABLE vs UNSTABLE MODELS
2) Validating the same model with a ‘full’ set of workloads
Both models perform poorly, errors > 7%; not enough information from training workloads.
Training with a small set of workloads results an optimistic reported error
Training: Small set of 20 typical workloads (S.T), e.g. MiBench
Testing: Full set of 60 diverse workloads (F)
M. J. Walker et al., "Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106-119, Jan. 2017.
12
STABLE vs UNSTABLE MODELS
3) Training and validating the model with a ‘random’ set of workloads
Stable model copes better with workload diversity
Training: Small set of 20 random workloads (S.R)
Testing: Small set of 20random workloads (S.R)
M. J. Walker et al., "Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106-119, Jan. 2017.
13
STABLE vs UNSTABLE MODELS
4) Validating the same model with a ‘full’ set of workloads
Accuracy of stable model close to full training set (E); unstable model poor
Diverse (random) training allows a stable model to gain prediction power
Stable models perform well even with limited training workloads (saves time)
Training: Small set of 20 random workloads (S.R)
Testing: Full set of 60 diverse workloads (F)
M. J. Walker et al., "Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106-119, Jan. 2017.
14
STABLE vs UNSTABLE MODELS
Our stable approach achieves a low average error and narrow error distribution compared to existing techniques. Models trained with 20 workloads, validated with 60.
09/03/2016, 01:50Graph Test:Run-Time Power Modelling
Page 1 of 1file:///Users/user/Dropbox/ARM/Power_Modelling_Website/paper/comparison-graph.html
Filename: comparison-graph-data.csv
Comparison Graph
a b c d e P0
2
4
6
8
10
12
14
16
18
20
22
Per
cent
age
Err
or (
%)
Power Model
Mean
Last updated:
Training: Small set of 20 workloads
Testing: Full set of 60 workloads
[a] M. Pricopi, T. S. Muthukaruppan, V. Venkataramani, T. Mitra, and S. Vishin, “Power-performance modeling on asymmetric multi-cores,” CASES ’13.[b] M. Walker et al., “Run-time power estimation for mobile and embedded asymmetric multi-core cpus,” HIPEAC Workshop Energy Efficiency with Hetero. Comp. 2015[c] S. K. Rethinagiri et al., “System-level power estimation tool for embedded processor based platforms,” RAPIDO ’14. New York, 2014.[d], [e] R. Rodrigues et al, “A study on the use of performance counters to estimate power in microprocessors,” IEEE TCAS II, vol. 60, no. 12, pp. 882–886, Dec 2013.
M. J. Walker et al., "Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106-119, Jan. 2017.
15
ROBUST MODEL FORMULATION
Typical regression-based power model formulation
• Relationships between power and other variables is not captured
• Too many independent variables -> instability
Our robust model formulation
P = const+ β0Frequency + β1V oltage+ β2E0 + β3E1 + β4E2 + ...
M. J. Walker et al., "Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106-119, Jan. 2017.
16
ROBUST MODEL FORMULATION – WHY?
Reduces the experiment time
• frequencies * core utilisations * workloads * average workload time
• By splitting model into static and dynamic, all workloads can be run at a single frequency, with just one (i.e. sleep) at all frequencies
Allows combination with ‘bottom-up’ approaches
• Once power has been divided into components, can apply theory to different parts.
Indicates where power may be being consumed
M. J. Walker et al., "Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106-119, Jan. 2017.
17
ROBUST MODEL FORMULATION – WHY?
01/10/2015 16:21Graph Test:Run-Time Power Modelling
Page 2 of 2file:///Users/Matthew/Dropbox/ARM/Power_Modelling_Website/paper/test-coefficients-results-model.html
/Users/Matthew/Dropbox/ARM/ARM_XU3_Experiment/Experiments/Power_Experiment_Model_Building_128_A15_extended_WLs_arm_desk/Model_Building_Analyse_Experiment_Build/Run_Model/run_model_breakdown.csv
Select Frequency: Average 200 MHz 400 MHz 600 MHz 800 MHz 1000 MHz 1200 MHz 1400 MHz 1600 MHz1800 MHzSelect Number of Cores: Average 1 2 3 4Fixed Axis: Off On
Coefficient Breakdownid
lebi
tcou
nt
susa
njp
eg_
dec
dijk
stra
ispe
llad
pcm
_c fft
h264
_lq
mpe
g4_
lqdh
ryst
one
neon
_m
ulcs
tm_
int
lat_
ctx_
0_18
lat_
fsla
t_m
em_
rd_
1_8
lat_
mem
_rd
_20
0_8
lat_
mem
_rd
_p4
lat_
proc
_fo
rkbw
_m
em_
wr
bw_
mem
_cp
_70
0m tlb
cach
e
gcc
mpe
g4_
lq_
mp
mp_
bus_
spd
mp_
dhry
mp_
lp_
neon
mp_
rand
mem
open
mp_
mem
_sp
d
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
1.6
1.8
2.0
Pow
er (
W)
Workloads
Actual Static (V,f) EPH_0x11 EPH_0x1b_minus_EPH_0x73 EPH_0x50 EPH_0x6a EPH_0x73 EPH_0x14 EPH_0x19
Last updated:
Predicted power and modelled power for 30 different workloads
M. J. Walker et al., "Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106-119, Jan. 2017.
18
AVAILABLE TOOLS www.powmon.ecs.soton.ac.uk
19
AVAILABLE TOOLS www.powmon.ecs.soton.ac.uk
20
gem5 POWER ESTIMATION
Karunakar Basireddy, Matthew Walker, Domenico Balsamo, Stephan Diestelhorst, Bashir Al-Hashimi, Geoff Merrett, “Empirical CPU power modelling and estimation in the gem5 simulator” Int’l Symp. Power and Timing Modeling, Optimization and Simulation (PATMOS) 2017
RunWorkloads(benchmarks)
ExecutedonHardwareODROID-XU3(#60)
Executedongem5modelofthesamehardware(#15)
Record• PMCs• Power,Voltage
Record• Activitystatistics
ChoosePMCs,modelbuildingandvalidation
SelectionofactivitystatisticssimilartoPMCs
EmpiricalPowerModel
Gem5modelofthehardware
Estimatedpowerongem5model
ModellingMethodology gem5ArchitecturalModel
21Karunakar Basireddy, Matthew Walker, Domenico Balsamo, Stephan Diestelhorst, Bashir Al-Hashimi, Geoff Merrett, “Empirical CPU power modelling and estimation in the gem5 simulator” Int’l Symp. Power and Timing Modeling, Optimization and Simulation (PATMOS) 2017
PMC SELECTION
• Our Cortex-A15 power model uses the following seven PMCs:
– 0x11 CYCLE COUNT: active CPU cycles
– 0x1B INST SPEC: instructions speculatively executed
– 0x50 L2D CACHE LD: level 2 data cache accesses - read
– 0x6A UNALIGNED LDST SPEC: unaligned accesses
– 0x73 DP SPEC: instructions speculatively executed, int data processing
– 0x14 L1I CACHE ACCESS: level 1 instruction cache accesses
– 0x19 BUS ACCESS: bus accesses
• Suitable gem5 event counts for PMC events 0x6A and 0x73 were not available; the model was rebuilt without these
22
MODEL VALIDATION (vs HARDWARE)
Karunakar Basireddy, Matthew Walker, Domenico Balsamo, Stephan Diestelhorst, Bashir Al-Hashimi, Geoff Merrett, “Empirical CPU power modelling and estimation in the gem5 simulator” Int’l Symp. Power and Timing Modeling, Optimization and Simulation (PATMOS) 2017
23
MODEL VALIDATION (vs HARDWARE)
• Would expect greater error, as only using 4 PMCs, and gem5 doesn’t model temperature or voltage variation.
Model fitting
K-fold cross-validation
MAPE across all DVFS points and core mappings
Karunakar Basireddy, Matthew Walker, Domenico Balsamo, Stephan Diestelhorst, Bashir Al-Hashimi, Geoff Merrett, “Empirical CPU power modelling and estimation in the gem5 simulator” Int’l Symp. Power and Timing Modeling, Optimization and Simulation (PATMOS) 2017
24
ARCHITECTURAL MODEL
• A detailed OoO model of the 4-core Cortex-A15 in FS mode
• Instruction timing in execution stage configured as per (Endo et al., 2015).
• Integer instructions have latencies of 1 (ALU), 2 (x) and 12 (÷), and default latencies for FP instructions.
• Integer and floating point stages are pipelined.
• Cortex-A15 has two levels of TLBrather than one. To compensate, the ITLB and DTLB are over-dimensioned.
Karunakar Basireddy, Matthew Walker, Domenico Balsamo, Stephan Diestelhorst, Bashir Al-Hashimi, Geoff Merrett, “Empirical CPU power modelling and estimation in the gem5 simulator” Int’l Symp. Power and Timing Modeling, Optimization and Simulation (PATMOS) 2017
25
gem5 EVENTS VS HARDWARE PMCs
• 15 MiBench workloads
• 4 frequencies:– 200 MHz
– 600 MHz
– 1000 MHz
– 1600 MHz
Hardware vs gem5: execution time and PMCs/activity statistics (f=1 GHz)
Karunakar Basireddy, Matthew Walker, Domenico Balsamo, Stephan Diestelhorst, Bashir Al-Hashimi, Geoff Merrett, “Empirical CPU power modelling and estimation in the gem5 simulator” Int’l Symp. Power and Timing Modeling, Optimization and Simulation (PATMOS) 2017
26
gem5 EVENTS VS HARDWARE PMCs
This difference is likely due to factors including:
• Specification error in the simulator:
– in the fetch stage contributes to the I-cache miss error.
– in the TLB models contributes to the reported error in execution time and activity statistics.
• LPDDR3 DRAM in gem5 corresponds to 800 MHz, vs 933 MHz in the hardware.
Karunakar Basireddy, Matthew Walker, Domenico Balsamo, Stephan Diestelhorst, Bashir Al-Hashimi, Geoff Merrett, “Empirical CPU power modelling and estimation in the gem5 simulator” Int’l Symp. Power and Timing Modeling, Optimization and Simulation (PATMOS) 2017
27
MODEL VALIDATION (gem5 vs HARDWARE)
Modelled power for each workload @1000 MHz.
Karunakar Basireddy, Matthew Walker, Domenico Balsamo, Stephan Diestelhorst, Bashir Al-Hashimi, Geoff Merrett, “Empirical CPU power modelling and estimation in the gem5 simulator” Int’l Symp. Power and Timing Modeling, Optimization and Simulation (PATMOS) 2017
28
MODEL VALIDATION (gem5 vs HARDWARE)
Difference in power between gem5 modeland hardware model at different frequencies (MHz).
Karunakar Basireddy, Matthew Walker, Domenico Balsamo, Stephan Diestelhorst, Bashir Al-Hashimi, Geoff Merrett, “Empirical CPU power modelling and estimation in the gem5 simulator” Int’l Symp. Power and Timing Modeling, Optimization and Simulation (PATMOS) 2017
29
CONCLUSIONS
Robust and Stable Power Modelling
• Appropriate workload selection
• Stable PMC selection
• Robust model formulation
Applying Models to gem5
• Real hardware vs modelled architecture
• PMCs vs gem5 event stats/exec. time
• 10% error in gem5 vs hardware model
Tools Available!
• www.powmon.ecs.soton.ac.uk
30
ACKNOWLEDGEMENTS
Matthew WalkerUni. Southampton (PhD)
Dr Domenico BalsamoUni. Southampton (Postdoc)
Karunakar BasireddyUni. Southampton (PhD)
Prof Bashir Al-HashimiUni. Southampton
Stephan DiestelhorstArm Research
Andreas Hansson(previously) Arm Research
Any Questions?
Dr Geoff V MerrettAssociate Professor
Electronics and Computer ScienceTel: +44 (0)23 8059 2775Email: [email protected] | www.geoffmerrett.co.ukHighfield Campus, Southampton, SO17 1BJ UK