CPU POWER ESTIMATION USING PMCs, AND ITS APPLICATION …

1

CPU POWER ESTIMATIONUSING PMCs, AND ITSAPPLICATION IN gem5

Dr Geoff MerrettArm Research Summit, 11 September 2017

2

OVERVIEW

Introduction and Background

Power Estimation on Hardware

• Our Accurate and Robust Approach

• Open Source Tools

Power Estimation in gem5

• PMCs vs gem5 Statistics

• Power Estimation

Conclusions

3

WHY POWER ESTIMATION?

Run-Time Management Approaches

• Make energy-savings by controlling operation.

– DVFS (dynamic-voltage frequency scaling) and DPM

– Task scheduling and mapping

• Make decisions based on real-time power ‘measurements’

System Research

• Design-space exploration

• Evaluating new power management strategies

• Research power-optimized software (microcode to applications)

• SOC architecture & design balancing for power and performance

4

POWER MODELLING APPROACHES

“Bottom-Up” Power Models

• Take a design specification (e.g. pipeline stages, ROB size etc.)

• Simulate gates and toggle rates

• Uses statistics from an architectural simulator (e.g. gem5)

• Advantages: flexibility to specify any design; cache size, etc.

• Disadvantages: large errors, slow, limited validation

“Top-Down” Power Models

• Characterise a specific device

• Estimate relationship between measured power and stats, e.g. PMCs

• Advantages: accurate and lightweight

• Disadvantages: specific to the device they were built on

5

POWMON METHODOLOGY1. Runworkloads

@differentDVFSlevels

39workloadsused:MiBench,LMBench,RoyLongbottom,ParMiBench andALPBench

ODROID-XU3Exynos-54224xCortex-A74xCortex-A5

2.Record• PMCs• Power,Voltage,

Temperature,etc.

3.ChoosePMCsHierarchicalclusteranalysis,Correlationmatrixanalysis,Exhaustivesearch,etc.

5.Validate• K-foldcrossvalidation• R2 :>0.99• Error:2.8– 3.7%

6.Uses• OSRun-time

management• Referenceforresearch• gem5add-on

4.BuildModel• OLSmultipleregression• Considerscollinearityand

heteroscedasticity• “sensible”equation

M. J. Walker et al., "Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106-119, Jan. 2017.

6

THE POWMON APPROACH

A power model’s stability is more important than its average error

Unstable model

• Appears accurate, but performs poorly with diverse workloads

Stable model

• Remains accurate across a diverse range of workloads and scenarios

• Requires careful choice of inputs (PMCs) & observations (workloads)

Eg: choose 3 sensors and appropriate training data to estimate colour:

Training Dataset A: Training Dataset B:

Input Colour Channel A: Input Colour Channel B:

Unstable StableM. J. Walker et al., "Accurate and Stable Run-Time Power Modeling for Mobile and Embedded CPUs," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 36, no. 1, pp. 106-119, Jan. 2017.

7

PERFORMANCE MONITORING COUNTERS

CPU Registers that count architectural and microarchitectural events

– E.g. L2 cache miss, TLB access, integer instruction, etc.

Positives

• Available on several platforms (e.g. ARM, Intel, AMD); low overhead

• Many different available events (>70)…

Negatives

• …but a small number (e.g. 4-6) can be monitored simultaneously

PMCs are often selected using intuition – e.g. try to split PMCs into different sub-architectural units. However can be problematic as:

• They may not gather enough information

• Different PMCs are correlated (can make a model unstable)

8

HIERARCHICAL CLUSTER ANALYSIS

• HCA groups similar events together

• Output is a dendrogram

• This allows PMCs to be groupedinto clusters


9

HIERARCHICAL CLUSTER ANALYSIS

• We combine clusters with correlation of each event with CPU power

• Aim: Choose PMCs with a high correlation with power, avoiding ones from the same cluster

02/12/2015, 13:52Graph Test:Run-Time Power Modelling

Page 1 of 1http://127.0.0.1/~Matthew/Micro_Server/results-plots/results-pmc-drag-drop.html

PMC Exploration Results

Cluster and CorrelationSelect Clustering: None Clustering A Clustering B

Cluster and CorrelationL1

D_T

LB_R

EFIL

L:0x

05L1

D_T

LB_R

EFIL

L_LD

:0x4

CM

EM_A

CCES

S:0x

13L1

D_C

ACHE

_ACC

ESS:

0x04

CYCL

E_CO

UNT:

0x11

LDST

_SPE

C:0x

72LD

_SPE

C:0x

70M

EM_A

CCES

S_LD

:0x6

6BU

S_CY

CLES

:0x1

DL1

D_C

ACHE

_LD

:0x4

0BR

_RET

RUN

_SPE

C:0x

79BR

_IN

DIR

ECT_

SPEC

:0x7

AIN

ST_S

PEC:

0x1B

L1I_

CACH

E_AC

CESS

:0x1

4BR

_IM

MED

_SPE

C:0x

78D

P_SP

EC:0

x73

BR_P

RED

:0x1

2BR

_MIS

_PRE

D:0

x10

PC_W

RITE

_SPE

C:0x

76IN

STR_

RETI

RED

:0x0

8VP

F_SP

EC:0

x75

STRE

X_PA

SS_S

PEC:

0x6D

DM

B_SP

EC:0

x7E

LDRE

X_SP

EC:0

x6C

L2D

_CAC

HE_W

B:0x

18L1

D_C

ACHE

_WB:

0x15

L1D

_CAC

HE_R

EFIL

L_LD

:0x4

2

L1D

_CAC

HE_R

EFIL

L_ST

:0x4

3

L1D

_CAC

HE_W

B_VI

CTIM

:0x4

6

L2D

_CAC

HE_W

B_VI

CTIM

:0x5

6BU

S_AC

CESS

_LD

:0x6

0BU

S_AC

CESS

:0x1

9L2

D_C

ACHE

_REF

ILL:

0x17

BUS_

ACCE

SS_S

HARE

D:0

x62

BUS_

ACCE

SS_S

T:0x

61

L2D

_CAC

HE_R

EFIL

L_LD

:0x5

2

L2D

_CAC

HE_R

EFIL

L_ST

:0x5

3L2

D_C

ACHE

_ACC

ESS:

0x16

BUS_

ACCE

SS_N

ORM

AL:0

x64

L2D

_CAC

HE_S

T:0x

51L2

D_C

ACHE

_LD

:0x5

0L1

D_C

ACHE

_REF

ILL:

0x03

L1D

_TLB

_REF

ILL_

ST:0

x4D

UNAL

IGN

ED_L

D_S

PEC:

0x68

UNAL

IGN

ED_S

T_SP

EC:0

x69

UNAL

IGN

ED_L

DST

_SPE

C:0x

6AM

EM_A

CCES

S_ST

:0x6

7ST

_SPE

C:0x

71L1

D_C

ACHE

_ST:

0x41

L2D

_CAC

HE_I

NVA

L:0x

58ST

REX_

FAIL

_SPE

C:0x

6E

L2D

_CAC

HE_W

B_CL

EAN

:0x5

7L1

I_CA

CHE_

REFI

LL:0

x01

L1I_

TLB_

REFI

LL:0

x02

DSB

_SPE

C:0x

7DEX

C_RE

TURN

:0x0

A

BUS_

ACCE

SS_N

OT_

SHAR

ED:0

x63

BUS_

ACCE

SS_P

ERIP

H:0x

65EX

C_TA

KEN

:0x0

9

L1D

_CAC

HE_W

B_CL

EAN

:0x4

7L1

D_C

ACHE

_IN

VAL:

0x48

CID

_WT_

RETI

RED

:0x0

B

TTBR

_WRI

TE_R

ETIR

ED:0

x1C

ISB_

SPEC

:0x7

CAS

E_SP

EC:0

x74

-0.1

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Corr

elat

ion

with

Pow

er

Events


10

STABLE vs UNSTABLE MODELS

1) Training and validating the model with a ‘typical’ set of workloads

Both unstable and stable model seem good (<2.5%)

Training: Small set of 20 typical workloads (S.T), e.g. MiBench

Testing: Small set of 20 typical workloads (S.T), e.g. MiBench


11


2) Validating the same model with a ‘full’ set of workloads

Both models perform poorly, errors > 7%; not enough information from training workloads.

Training with a small set of workloads results an optimistic reported error

Training: Small set of 20 typical workloads (S.T), e.g. MiBench

Testing: Full set of 60 diverse workloads (F)


12


3) Training and validating the model with a ‘random’ set of workloads

Stable model copes better with workload diversity

Training: Small set of 20 random workloads (S.R)

Testing: Small set of 20random workloads (S.R)


13


4) Validating the same model with a ‘full’ set of workloads

Accuracy of stable model close to full training set (E); unstable model poor

Diverse (random) training allows a stable model to gain prediction power

Stable models perform well even with limited training workloads (saves time)

Training: Small set of 20 random workloads (S.R)

Testing: Full set of 60 diverse workloads (F)


14


Our stable approach achieves a low average error and narrow error distribution compared to existing techniques. Models trained with 20 workloads, validated with 60.

09/03/2016, 01:50Graph Test:Run-Time Power Modelling

Page 1 of 1file:///Users/user/Dropbox/ARM/Power_Modelling_Website/paper/comparison-graph.html

Filename: comparison-graph-data.csv

Comparison Graph

a b c d e P0

2

4

6

8

10

12

14

16

18

20

22

Per

cent

age

Err

or (

%)

Power Model

Mean

Last updated:

Training: Small set of 20 workloads

Testing: Full set of 60 workloads

[a] M. Pricopi, T. S. Muthukaruppan, V. Venkataramani, T. Mitra, and S. Vishin, “Power-performance modeling on asymmetric multi-cores,” CASES ’13.[b] M. Walker et al., “Run-time power estimation for mobile and embedded asymmetric multi-core cpus,” HIPEAC Workshop Energy Efficiency with Hetero. Comp. 2015[c] S. K. Rethinagiri et al., “System-level power estimation tool for embedded processor based platforms,” RAPIDO ’14. New York, 2014.[d], [e] R. Rodrigues et al, “A study on the use of performance counters to estimate power in microprocessors,” IEEE TCAS II, vol. 60, no. 12, pp. 882–886, Dec 2013.


15

ROBUST MODEL FORMULATION

Typical regression-based power model formulation

• Relationships between power and other variables is not captured

• Too many independent variables -> instability

Our robust model formulation

P = const+ β0Frequency + β1V oltage+ β2E0 + β3E1 + β4E2 + ...


16

ROBUST MODEL FORMULATION – WHY?

Reduces the experiment time

• frequencies * core utilisations * workloads * average workload time

• By splitting model into static and dynamic, all workloads can be run at a single frequency, with just one (i.e. sleep) at all frequencies

Allows combination with ‘bottom-up’ approaches

• Once power has been divided into components, can apply theory to different parts.

Indicates where power may be being consumed


17

ROBUST MODEL FORMULATION – WHY?

01/10/2015 16:21Graph Test:Run-Time Power Modelling

Page 2 of 2file:///Users/Matthew/Dropbox/ARM/Power_Modelling_Website/paper/test-coefficients-results-model.html

/Users/Matthew/Dropbox/ARM/ARM_XU3_Experiment/Experiments/Power_Experiment_Model_Building_128_A15_extended_WLs_arm_desk/Model_Building_Analyse_Experiment_Build/Run_Model/run_model_breakdown.csv

Select Frequency: Average 200 MHz 400 MHz 600 MHz 800 MHz 1000 MHz 1200 MHz 1400 MHz 1600 MHz1800 MHzSelect Number of Cores: Average 1 2 3 4Fixed Axis: Off On

Coefficient Breakdownid

lebi

tcou

nt

susa

njp

eg_

dec

dijk

stra

ispe

llad

pcm

_c fft

h264

_lq

mpe

g4_

lqdh

ryst

one

neon

_m

ulcs

tm_

int

lat_

ctx_

0_18

lat_

fsla

t_m

em_

rd_

1_8

lat_

mem

_rd

_20

0_8

lat_

mem

_rd

_p4

lat_

proc

_fo

rkbw

_m

em_

wr

bw_

mem

_cp

_70

0m tlb

cach

e

gcc

mpe

g4_

lq_

mp

mp_

bus_

spd

mp_

dhry

mp_

lp_

neon

mp_

rand

mem

open

mp_

mem

_sp

d

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

2.0

Pow

er (

W)

Workloads

Actual Static (V,f) EPH_0x11 EPH_0x1b_minus_EPH_0x73 EPH_0x50 EPH_0x6a EPH_0x73 EPH_0x14 EPH_0x19

Last updated:

Predicted power and modelled power for 30 different workloads


18

AVAILABLE TOOLS www.powmon.ecs.soton.ac.uk

19

AVAILABLE TOOLS www.powmon.ecs.soton.ac.uk

20

gem5 POWER ESTIMATION

Karunakar Basireddy, Matthew Walker, Domenico Balsamo, Stephan Diestelhorst, Bashir Al-Hashimi, Geoff Merrett, “Empirical CPU power modelling and estimation in the gem5 simulator” Int’l Symp. Power and Timing Modeling, Optimization and Simulation (PATMOS) 2017

RunWorkloads(benchmarks)

ExecutedonHardwareODROID-XU3(#60)

Executedongem5modelofthesamehardware(#15)

Record• PMCs• Power,Voltage

Record• Activitystatistics

ChoosePMCs,modelbuildingandvalidation

SelectionofactivitystatisticssimilartoPMCs

EmpiricalPowerModel

Gem5modelofthehardware

Estimatedpowerongem5model

ModellingMethodology gem5ArchitecturalModel

21Karunakar Basireddy, Matthew Walker, Domenico Balsamo, Stephan Diestelhorst, Bashir Al-Hashimi, Geoff Merrett, “Empirical CPU power modelling and estimation in the gem5 simulator” Int’l Symp. Power and Timing Modeling, Optimization and Simulation (PATMOS) 2017

PMC SELECTION

• Our Cortex-A15 power model uses the following seven PMCs:

– 0x11 CYCLE COUNT: active CPU cycles

– 0x1B INST SPEC: instructions speculatively executed

– 0x50 L2D CACHE LD: level 2 data cache accesses - read

– 0x6A UNALIGNED LDST SPEC: unaligned accesses

– 0x73 DP SPEC: instructions speculatively executed, int data processing

– 0x14 L1I CACHE ACCESS: level 1 instruction cache accesses

– 0x19 BUS ACCESS: bus accesses

• Suitable gem5 event counts for PMC events 0x6A and 0x73 were not available; the model was rebuilt without these

22

MODEL VALIDATION (vs HARDWARE)


23

MODEL VALIDATION (vs HARDWARE)

• Would expect greater error, as only using 4 PMCs, and gem5 doesn’t model temperature or voltage variation.

Model fitting

K-fold cross-validation

MAPE across all DVFS points and core mappings


24

ARCHITECTURAL MODEL

• A detailed OoO model of the 4-core Cortex-A15 in FS mode

• Instruction timing in execution stage configured as per (Endo et al., 2015).

• Integer instructions have latencies of 1 (ALU), 2 (x) and 12 (÷), and default latencies for FP instructions.

• Integer and floating point stages are pipelined.

• Cortex-A15 has two levels of TLBrather than one. To compensate, the ITLB and DTLB are over-dimensioned.


25

gem5 EVENTS VS HARDWARE PMCs

• 15 MiBench workloads

• 4 frequencies:– 200 MHz

– 600 MHz

– 1000 MHz

– 1600 MHz

Hardware vs gem5: execution time and PMCs/activity statistics (f=1 GHz)


26

gem5 EVENTS VS HARDWARE PMCs

This difference is likely due to factors including:

• Specification error in the simulator:

– in the fetch stage contributes to the I-cache miss error.

– in the TLB models contributes to the reported error in execution time and activity statistics.

• LPDDR3 DRAM in gem5 corresponds to 800 MHz, vs 933 MHz in the hardware.


27

MODEL VALIDATION (gem5 vs HARDWARE)

Modelled power for each workload @1000 MHz.


28

MODEL VALIDATION (gem5 vs HARDWARE)

Difference in power between gem5 modeland hardware model at different frequencies (MHz).


29

CONCLUSIONS

Robust and Stable Power Modelling

• Appropriate workload selection

• Stable PMC selection

• Robust model formulation

Applying Models to gem5

• Real hardware vs modelled architecture

• PMCs vs gem5 event stats/exec. time

• 10% error in gem5 vs hardware model

Tools Available!

• www.powmon.ecs.soton.ac.uk

30

ACKNOWLEDGEMENTS

Matthew WalkerUni. Southampton (PhD)

Dr Domenico BalsamoUni. Southampton (Postdoc)

Karunakar BasireddyUni. Southampton (PhD)

Prof Bashir Al-HashimiUni. Southampton

Stephan DiestelhorstArm Research

Andreas Hansson(previously) Arm Research

Any Questions?

Dr Geoff V MerrettAssociate Professor

Electronics and Computer ScienceTel: +44 (0)23 8059 2775Email: [email protected] | www.geoffmerrett.co.ukHighfield Campus, Southampton, SO17 1BJ UK

CPU POWER ESTIMATION USING PMCs, AND ITS APPLICATION …

Documents

Transcript of CPU POWER ESTIMATION USING PMCs, AND ITS APPLICATION …