© 2012 IBM Corporation Barcelona Supercomputing Center MICRO 2012 Tuesday, December 4, 2012...


Systematic Energy Characterization of CMP/SMT Processor Systems via Automated Micro-Benchmarks

R. Bertran*+, A. Buyuktosunoglu*, M. Gupta*, M. Gonzalez+, P. Bose*

*IBM T.J. Watson Research Center
+Barcelona Supercomputing Center



Why do we need micro-benchmarks?

• What is the maximum power consumption?
• Any performance bug?
• Any reliability issues?
• …

Micro-benchmarks!

• Time consuming and tedious
– Error prone task
– Trial and error process
– Several micro-benchmarks are required
• Deep expertise limited to few designers
– Detailed knowledge of the underlying architecture is required

AUTOMATED SOLUTION NEEDED!



MicroProbe: a micro-benchmark generation framework



MicroProbe Workflow

[Figure: workflow diagram. The user gives the MicroProbe framework two inputs: a micro-benchmark generation policy and architecture definition files. Example policies shown: "Endless loop, 50% INT / 50% FP", "Endless loop for each instruction of the ISA", "Max power stressmark". The outputs are micro-benchmarks, which feed external tools: real platforms, simulators, models.]



MicroProbe: Distinguishing Features

Features offered by MicroProbe (support in previous works in parentheses, where the slide states it):

• ISA queries: instruction type, operand length, binary codification, etc. (manual)
• Micro-architecture queries: functional unit, latency, throughput, energy per instruction, average instruction power, etc. (manual)
• Micro-architecture models: set-associative cache model (no)
• Code generation
– Skeleton and instruction definition passes, memory modeling pass, branch modeling pass, ILP definition pass
– Configurable passes (no)
• Design space exploration
– Integrated (no)
– GA-based search
– Exhaustive search (manual)
– Customizable search (manual)



MicroProbe Usage and Design Overview

[Figure: design overview diagram. A research idea is expressed as a micro-benchmark generation policy (a user-defined script) written against the MicroProbe framework (Python API), which produces micro-benchmarks for external tools.]

Example micro-benchmark generation policies:
• Loop stressing the floating point unit
• Sequence of loads hitting 50% L1 and 50% L2
• Generate a stress-mark for each functional unit of the architecture
• Search for the sequence of 2 loads and 2 integer operations with maximum IPC

MicroProbe Framework (Python API):
• Architecture module: ISA definitions, micro-architecture definitions, micro-architecture analytical models
• Code generation module: micro-benchmark synthesizer, passes
• Design space exploration module: search drivers

Other labels in the diagram: properties, automatic bootstrap process, external tools.
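To make the policy "sequence of loads hitting 50% L1 and 50% L2" concrete, here is a minimal sketch of how a set-associative cache model can be used to build and check such an address stream. It is not MicroProbe's cache model or API. The cache capacities come from the methodology slide; the associativity (8 ways), line size (128 B), LRU replacement and every class and function name are assumptions made for illustration.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy LRU set-associative cache (geometry values are assumptions)."""

    def __init__(self, size_bytes, ways, line_bytes):
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, address):
        """Return True on a hit; on a miss, fill the line and evict the LRU one."""
        line = address // self.line_bytes
        lru = self.sets[line % self.num_sets]
        if line in lru:
            lru.move_to_end(line)      # mark as most recently used
            return True
        if len(lru) >= self.ways:
            lru.popitem(last=False)    # evict the least recently used line
        lru[line] = True
        return False


def load_stream(l1, conflict_lines):
    """Alternate one L1-resident load with loads that conflict in a single L1 set."""
    set_stride = l1.num_sets * l1.line_bytes   # addresses this far apart share an L1 set
    resident = l1.line_bytes                   # a single line that lives in L1 set 1
    stream = []
    for i in range(conflict_lines):
        stream += [resident, i * set_stride]   # conflicting lines all map to L1 set 0
    return stream


def simulate(stream, l1, l2, warmup_passes=1, measured_passes=10):
    """Count which level serves each load after the warm-up passes."""
    counts = {"L1": 0, "L2": 0, "memory": 0}
    for p in range(warmup_passes + measured_passes):
        for address in stream:
            if l1.access(address):
                level = "L1"
            elif l2.access(address):
                level = "L2"
            else:
                level = "memory"
            if p >= warmup_passes:
                counts[level] += 1
    return counts


l1 = SetAssociativeCache(32 * 1024, ways=8, line_bytes=128)    # 32KB L1 (ways/line assumed)
l2 = SetAssociativeCache(256 * 1024, ways=8, line_bytes=128)   # 256KB L2 (ways/line assumed)
# ways + 1 lines in one set always miss under LRU, yet still fit in the larger L2.
stream = load_stream(l1, conflict_lines=l1.ways + 1)
counts = simulate(stream, l1, l2)
print(counts)
```

In this model half of the loads are served by L1 and half by L2. On real hardware, prefetching, address translation and the actual replacement policy would all have to be taken into account.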



Max-power Stressmark Generation

Use MicroProbe to generate max-power stressmark:

• Characterize energy per instruction (EPI) and IPC (Architecture Module)
• Select N instructions with max (IPC * EPI)
• Form a basic endless loop (e.g. 4K) using selected instructions (Code Generation Module)
• Generate micro-benchmarks with different orders of the selected N instructions
• Evaluate using Design Space Exploration Module
• Pick the highest power microbenchmark

Selected instructions: mulldo, xvnmsubmdp, lxvw4x

Example orderings:
Loop: … mulldo mulldo lxvw4x lxvw4x xvnmsubmdp xvnmsubmdp …
Loop: … mulldo lxvw4x mulldo xvnmsubmdp lxvw4x xvnmsubmdp …
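The selection and ordering steps above can be sketched in a few lines of plain Python. This is only an illustration of the heuristic, not the MicroProbe API: the IPC and EPI values are invented, `addi` and `nop` are added as filler candidates, and `toy_power` is a hypothetical stand-in for a real power measurement on the platform.

```python
from itertools import permutations

# Hypothetical characterization: mnemonic -> (IPC, EPI in nJ). Values are invented.
CHARACTERIZATION = {
    "mulldo": (1.9, 0.55),
    "xvnmsubmdp": (1.8, 0.60),
    "lxvw4x": (1.7, 0.50),
    "addi": (3.5, 0.10),
    "nop": (4.0, 0.05),
}


def select_instructions(characterization, n):
    """Return the n mnemonics with the largest IPC * EPI product."""
    def score(mnemonic):
        ipc, epi = characterization[mnemonic]
        return ipc * epi
    return sorted(characterization, key=score, reverse=True)[:n]


def candidate_loops(selected, copies=2):
    """Return every distinct ordering of `copies` instances of each instruction."""
    pool = [m for m in selected for _ in range(copies)]
    return sorted(set(permutations(pool)))


def toy_power(loop):
    """Stand-in for a measurement: count adjacent (wrapping) pairs that differ."""
    return sum(a != b for a, b in zip(loop, loop[1:] + loop[:1]))


selected = select_instructions(CHARACTERIZATION, n=3)
loops = candidate_loops(selected)
best = max(loops, key=toy_power)   # exhaustive search over all orderings

print(selected)
print(len(loops), "candidate orderings")
print("Loop:", " ".join(best))
```

With three instructions repeated twice there are 90 distinct orderings, so exhaustive evaluation is feasible here. Larger loops grow factorially, which is where a GA-based or otherwise customized search driver becomes useful.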



CASE STUDIES

MicroProbe: A Micro-benchmark Generation Framework



Experimental Methodology

Platform:
– Processor: POWER7 @ 3 GHz
• 8-core, 4-way SMT
• 32 KB L1, 256 KB L2 and 4 MB L3 per core
– Memory: 32 GB DDR3 SDRAM @ 800 MHz
– OS: RHEL 5.7 + Linux 3.0.1
– EnergyScale architecture
• Power measurements in milliwatts
• Sampling interval as short as 1 ms

In-house software collects power and performance counter traces [C. Lefurgy et al., IBM]



Case Study 1: EPI Characterization

• Large differences in EPI across instructions stressing different micro-architecture components
• Large differences in EPI across instructions stressing the same micro-architecture components at the same rate (IPC)
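As a worked example of what an EPI figure means, the sketch below derives energy per instruction from a power reading, an IPC value and the clock frequency, using EPI = P_dynamic / (IPC × f). Only the 3 GHz frequency comes from the methodology slide. The power and IPC values are invented, and subtracting a baseline (idle) power to isolate the dynamic part is an assumption of this sketch, not a description of the authors' procedure.

```python
def energy_per_instruction_nj(measured_watts, baseline_watts, ipc, frequency_hz):
    """Return dynamic energy per instruction in nanojoules.

    Dynamic power (W = J/s) divided by instructions per second gives J/instruction.
    """
    dynamic_watts = measured_watts - baseline_watts
    instructions_per_second = ipc * frequency_hz
    return dynamic_watts / instructions_per_second * 1e9


FREQUENCY_HZ = 3e9  # POWER7 @ 3 GHz, from the methodology slide

# Invented readings for two endless loops that sustain the same IPC.
epi_a = energy_per_instruction_nj(100.0, 70.0, ipc=2.0, frequency_hz=FREQUENCY_HZ)
epi_b = energy_per_instruction_nj(82.0, 70.0, ipc=2.0, frequency_hz=FREQUENCY_HZ)

print(f"loop A: {epi_a:.2f} nJ/instruction")
print(f"loop B: {epi_b:.2f} nJ/instruction")
```

Two loops running at the same IPC can still differ in EPI, which is why IPC alone is a poor proxy for power when choosing stressmark instructions.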



Case Study 2: Max-power Stressmark Generation

Approaches compared:

• DAXPY: use a computationally intensive kernel
• Expert manual: use complex instructions accessing different functional units with high IPC
• Expert DSE: generate all possible combinations of complex instructions stressing different units
• MicroProbe: use MicroProbe

Expert approaches
– Selected instructions: mullw, xvmaddadp, lxvd2x
– Example loops:
Loop: … mullw mullw xvmaddadp xvmaddadp lxvd2x lxvd2x …
Loop: … mullw lxvd2x mullw xvmaddadp xvmaddadp lxvd2x …
Loop: … mullw lxvd2x mullw xvmaddadp lxvd2x xvmaddadp …

MicroProbe
– Heuristic: Max(EPI * IPC)
– Selected instructions: mulldo, xvnmsubmdp, lxvw4x



Max-power Stressmark Generation

[Figure: Max-power results. Normalized power (min, mean and max; axis from 0.6 to 1.2) for each method: DAXPY, Expert Manual, Expert DSE, MicroProbe.]



Case Study 3: Counter-based Processor Power Model

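Bottom-up power modeling method

[Figure: modeling flow diagram. Labels recovered from the slide are listed below.]

Training sets:
• Func. unit micro-benchmarks, CMP1–SMT1
• Random micro-benchmarks, CMP1–SMT1
• Random micro-benchmarks, CMP1–SMT2/4
• Random micro-benchmarks, CMP1/8–SMT2/4

Model components:
• Dynamic power: f(PMCs)
• Intercept SMT1
• Intercept SMT2-4
• SMT effect
• CMP effect: f(CMP)
• Uncore power

Fitting method: linear regression

Model: [Equation garbled in extraction. Recoverable terms: a sum over cores (k = 1 … #cores), a term over threads, dynamic power f(PMCs), SMT effect applied when SMT is enabled, CMP effect with #cores, and uncore power. The components are labeled 1, 2 and 3 on the slide.]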
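The regression step can be illustrated with a small, dependency-free sketch. It loosely follows the bottom-up idea (fit one weight per functional unit from micro-benchmarks that stress only that unit, then combine the weights), but it is not the authors' model. The activity rates and power readings are synthetic and noise-free, the unit names FXU and VSU are borrowed from the validation slide, and all function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic training sets: unit -> (activity rates, measured power in W).
# Each set stands for micro-benchmarks that stress a single functional unit.
TRAINING = {
    "FXU": ([0.5, 1.0, 1.5, 2.0], [43.0, 46.0, 49.0, 52.0]),
    "VSU": ([0.5, 1.0, 1.5, 2.0], [44.5, 49.0, 53.5, 58.0]),
}

fits = {unit: fit_line(xs, ys) for unit, (xs, ys) in TRAINING.items()}
weights = {unit: slope for unit, (slope, _) in fits.items()}
# Every per-unit fit also estimates the constant term; average the estimates.
intercept = sum(icpt for _, icpt in fits.values()) / len(fits)


def predict_power(activity):
    """Predict power as intercept + sum of weight * activity over units."""
    return intercept + sum(weights[unit] * rate for unit, rate in activity.items())


mixed = predict_power({"FXU": 1.0, "VSU": 0.5})
print(weights, intercept, mixed)
```

A real counter-based model would derive the activity rates from performance monitoring counters, use many more counters, and add the SMT, CMP and uncore terms listed above. Comparing predictions against measured power on a held-out workload set gives the percentage error reported in the validation slides.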



Counter-based Processor Power Model: Validation

Within acceptable error margins: < 4% on average

[Figure: Model accuracy results on SPEC CPU2006. % error (axis from 0 to 10) per CMP-SMT configuration (1-1, 1-2, 1-4, 2-1, 2-2, 2-4, 4-1, 4-2, 4-4, 6-1, 6-2, 6-4, 8-1, 8-2, 8-4, Mean) for four models: Micro trained, Random trained, SPEC trained, Proposed.]



Counter-based Processor Power Model: Validation on Corner Cases

• Models trained using non-micro-architecture aware training sets show high errors and variability
• Models trained using the micro-architecture aware training set show acceptable error margins: < 5% on average

[Figure: Model accuracy results. % error (axis from 0 to 20) per validation set (FXU High, FXU Low, L1 Loads, Main Memory, VSU High, VSU Low, Mean) for four models: Micro trained, Random trained, SPEC trained, Proposed. One off-scale bar is annotated 62%.]



Conclusions

MicroProbe is a productive micro-benchmark generation framework
– Adaptive and flexible
– Includes micro-architecture semantics
– Integrates design space exploration

Presented three case studies:
– Instruction-based EPI characterization
– Automated max-power stressmark generation
– CMP/SMT-aware bottom-up counter-based processor power model



QUESTIONS?

MicroProbe: A Micro-benchmark Generation Framework
