
Performance and Energy-Efficiency Modelling for

Multi-Core Processors

Diogo Augusto Pereira Marques

Thesis to obtain the Master of Science Degree in

Electrical and Computer Engineering

Supervisors: Doctor Aleksandar Ilic

Doctor Leonel Augusto Pires Seabra de Sousa

Examination Committee

Chairperson: Doctor Antonio Manuel Raminhos Cordeiro Grilo

Supervisor: Doctor Aleksandar Ilic

Member of the Committee: Doctor Joao Pedro Faria Mendonca Barreto

November 2017


Acknowledgments

I would like to thank my supervisors, Dr. Leonel Sousa and Dr. Aleksandar Ilic, for their support and guidance throughout this Thesis; their helpful insights were invaluable to the development of the work presented here. Furthermore, I would like to thank INESC-ID for providing the tools and infrastructure that allowed me to conclude this work.

I would also like to thank all the co-authors of my publications and the people from Intel Corporation who helped me in developing this Thesis, namely Roman Belenov, Philippe Thierry, Zakhar A. Matveev, Ahmad Yasin and Jawad Haj-Yahya.

I would also like to thank all my friends for helping me through the entire course at IST, and especially my girlfriend, Raynara Silva, for all the support given during this Thesis, pushing me to always do my best.

Finally, special thanks to my family for all their support throughout my academic path, especially my mother and father, whose sacrifices and hard work allowed me to attend IST and carry out this Thesis.


Abstract

In recent years, the increasing computational needs of modern applications have brought an increase in the complexity of multi-core processor architectures. Hence, deeply understanding the factors with a major impact on the performance, power consumption and efficiency of those platforms has become a difficult task, and it is not trivial to guarantee the best execution efficiency for applications on multi-core processors. Given this challenge, insightful tools capable of relating the application requirements with the processor capabilities, such as the Cache-Aware Roofline Model and the Original Roofline Model, are very valuable for programmers, mostly during the prototyping and design stages of applications. However, the simplicity of these tools brings certain limitations when characterizing the behavior of real-world applications and determining their execution bottlenecks. To address these limitations, this Thesis proposes a set of Cache-Aware Roofline Model extensions that increase the model's insightfulness and usability, in order to provide more accurate hints regarding application optimization. To validate the proposed extensions and methodologies, a set of applications from standard benchmark suites is characterized on an Intel Skylake 6700K processor, correlating their behavior with the different computational capabilities of the processor and providing primary hints about their main bottlenecks. Moreover, the insights derived from the models proposed in this Thesis made it possible to increase the performance of an application kernel by up to 6.43× over its unoptimized version, demonstrating the model's usability when optimizing the execution of real applications.

Keywords: Performance, Power consumption, Efficiency, Processor capabilities, Insightful tools, Cache-Aware Roofline Model.


Resumo

Over the last years, the increasing computational needs of modern applications have caused an increase in the complexity of multi-core processor architectures. Hence, understanding which factors most influence the performance, power consumption and efficiency of these platforms has become a great challenge, and it is not trivial to guarantee the best execution efficiency for applications on multi-core processors. Given this challenge, insightful tools capable of quickly and simply relating the requirements of an application with the capabilities of the processor, such as the Cache-Aware Roofline Model and the Original Roofline Model, are a great asset for programmers, mainly during the prototyping and design of applications. However, the simplicity of these tools entails some limitations when characterizing real applications and determining their bottlenecks. To address these limitations, this Thesis proposes a set of extensions that increase the capabilities of the Cache-Aware Roofline Model, in order to obtain more precise suggestions for application optimization. To validate the proposed extensions and the adopted methodologies, a set of applications from standard benchmark suites is characterized on an Intel Skylake 6700K processor, correlating the behavior of the applications with the different computational capabilities of the processor and identifying their main bottlenecks. Moreover, the suggestions derived from the models proposed in this Thesis allowed increasing the performance of an application kernel by up to 6.43× over the original version, demonstrating the model's usability when optimizing the execution of real applications.

Keywords: Performance, Power consumption, Efficiency, Processor capabilities, Insightful tools, Cache-Aware Roofline Model.


Contents

Abstract iii

Resumo v

List of Figures ix

List of Tables xi

List of Algorithms xiii

List of Acronyms xv

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Main contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Background: Insightful modeling of modern multi-core processors 6

2.1 Modern multi-core processors and performance analysis . . . . . . . . . . . . . . . . . . . . . . 6

2.1.1 Intel Ivy Bridge and Skylake micro-architectures . . . . . . . . . . . . . . . . . . . . . . 7

2.1.2 Top-Down method for application analysis and detection of execution bottlenecks . . . . . 10

2.2 State-of-the-art approaches for insightful modeling of multi-cores . . . . . . . . . . . . . . . . . 12

2.2.1 Performance Roofline Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.2 Power, Energy and Energy-Efficiency Roofline Modeling . . . . . . . . . . . . . . . . . . 17

2.2.3 Remarks on Original and Cache-aware Roofline principles . . . . . . . . . . . . . . . . . 21

2.2.4 State-of-the-art approaches on extending the usability of insightful models . . . . . . . . 22

2.3 Open challenges in insightful modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3 Reaching the architecture upper-bounds with micro-benchmarking 26

3.1 Tool for fine-grain micro-architecture benchmarking . . . . . . . . . . . . . . . . . . . . . . . . 26

3.2 Micro-architecture benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.2.1 Exploring the maximum compute performance . . . . . . . . . . . . . . . . . . . . . . . 29

3.2.2 Memory subsystem benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40


4 Proposed insightful models: Construction and experimental validation 41

4.1 Proposed Cache-Aware Roofline Model (CARM) extensions: Model construction . . . . . . . . . 42

4.1.1 State-of-the-art CARM construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.2 Experimental validation of proposed CARM extensions . . . . . . . . . . . . . . . . . . . . . . . 46

4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5 Application characterization and optimization in the proposed insightful models 51

5.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.2 Evaluation methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5.3 Case Study: Toypush mini-application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

5.3.1 CARM-guided application optimization example . . . . . . . . . . . . . . . . . . . . . . 55

5.4 Characterization of real-world applications in the proposed models . . . . . . . . . . . . . . . . . 57

5.4.1 Application characterization in the Single Precision (SP) Scalar LD CARM extension . . 58

5.4.2 Application characterization in the DP Scalar 2LD/ST CARM extension . . . . . . . . . . 60

5.4.3 Application characterization in the Double Precision (DP) Scalar LD CARM extension . . 64

5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

6 Conclusions and Future Works 70

6.1 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

References 72


List of Figures

2.1 Central Processing Unit (CPU) pipeline for a Skylake micro-architecture [1]. . . . . . . . . . . . 8

2.2 Memory subsystem for Intel micro-architectures. . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 Connection between cache levels, GPU and system agent [1] . . . . . . . . . . . . . . . . . . . . 10

2.4 Top-Down Analysis hierarchy [2]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.5 Original Roofline Model (ORM) and CARM memory traffic [3]. . . . . . . . . . . . . . . . . . . 13

2.6 ORM and CARM [3, 4]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.7 Performance Cache-Aware Roofline Model for an Intel 6700K quad-core processor (Skylake). . . 15

2.8 Intel Advisor Roofline: Performance characterization of Minighost loops . . . . . . . . . . . . . . 16

2.9 Original Roofline Models for energy-efficiency and power consumption. . . . . . . . . . . . . . . 18

2.10 Power consumed by the processor Intel 3770K Ivy Bridge [5] . . . . . . . . . . . . . . . . . . . . 19

2.11 Power CARM Models for Intel 3770K Ivy Bridge [5] . . . . . . . . . . . . . . . . . . . . . . . . 20

2.12 Energy and Energy-Efficiency CARM for Intel 3770K Ivy Bridge . . . . . . . . . . . . . . . . . 20

2.13 Application with different problem sizes in Intel 3770K Ivy Bridge. . . . . . . . . . . . . . . . . 21

3.1 Benchmarking tool general layout. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.2 Floating Point (FP) Units maximum performance using Advanced Vector Extensions (AVX) Single

Instruction Multiple Data (SIMD) DP instructions. . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.3 FP Units maximum power consumption using AVX SIMD DP instructions. . . . . . . . . . . . . 31

3.4 FP units performance using AVX SIMD DP instructions. . . . . . . . . . . . . . . . . . . . . . . 32

3.5 FP units power consumption using AVX SIMD DP instructions. . . . . . . . . . . . . . . . . . . 32

3.6 FP units performance and power consumption for different instruction set extensions in Intel Sky-

lake 6700K (4 cores). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.7 Top Down Method for Fused Multiply-Add (FMA) AVX SIMD DP at nominal frequency. . . . . 34

3.8 Memory subsystem bandwidth for LD AVX SIMD DP at nominal frequency. . . . . . . . . . . . 34

3.9 Memory subsystem power consumption for LD AVX SIMD DP at nominal frequency. . . . . . . 35

3.10 Memory subsystem power consumption for LD AVX SIMD DP at nominal frequency. . . . . . . 36

3.11 Memory ratios bandwidth for AVX SIMD DP at nominal frequency. . . . . . . . . . . . . . . . . 37

3.12 Memory ratios power consumption for AVX SIMD DP at nominal frequency. . . . . . . . . . . . 37

3.13 FP units performance and power consumption for different instruction set extensions in Intel Sky-

lake 6700K (4 cores). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.14 Top Down Method for 2LD/ST AVX SIMD DP at nominal frequency. . . . . . . . . . . . . . . . 40

4.1 Proposed CARM extensions for AVX DP FP instructions for Intel Skylake 6700K (4 Cores, 2LD/ST). 42

4.2 Proposed CARM extensions for AVX LD and ST operations for Intel Skylake 6700K (4 Cores). . 43

4.3 Proposed CARM extensions for 2LD/ST ratio with Streaming SIMD Extensions (SSE) and Scalar

DP instructions for Intel Skylake 6700K (4 Cores). . . . . . . . . . . . . . . . . . . . . . . . . . 44


4.4 AVX DP LD and 2LD/ST memory bandwidth evaluation and State-of-the-art CARM for Intel

Skylake 6700K (4 Cores). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.5 Performance and power consumption LD AVX SIMD DP CARM validations for Intel Ivy Bridge

3770K (1 core). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.6 Performance and power consumption LD AVX SIMD DP CARM validations for Intel Skylake

6700K (1 core). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.7 Performance and power consumption 2LD/ST AVX SIMD DP CARM validations for Intel Ivy

Bridge 3770K (1 core). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.8 CARM for AVX SIMD DP at nominal frequency. . . . . . . . . . . . . . . . . . . . . . . . . . . 49

5.1 Toypush instruction mix 5.1a and Top-Down metrics 5.1b. . . . . . . . . . . . . . . . . . . . . . 54

5.2 CARM characterization of main Toypush kernels in Intel Skylake 6700K. . . . . . . . . . . . . . 55

5.3 CARM model: Toypush optimization characterization in Intel Skylake 6700K. . . . . . . . . . . . 56

5.4 Instruction distribution and Top-Down analysis for SP Scalar LD applications. . . . . . . . . . . . 59

5.5 Application characterization within state-of-the-art CARM and proposed SP Scalar LD extension. 60

5.6 Application characterization with SP Scalar LD COMPS CARM. . . . . . . . . . . . . . . . . . . 60

5.7 Instruction distribution and Top-Down analysis for DP Scalar 2LD/ST applications. . . . . . . . . 61

5.8 Application characterization within state-of-the-art CARM and proposed DP Scalar 2LD/ST ex-

tension. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.9 Application characterization with SP Scalar LD COMPS CARM. . . . . . . . . . . . . . . . . . . 62

5.10 Power consumption characterization methodology. . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.11 Batch 1: Instruction distribution and Top-Down analysis for DP Scalar LD applications. . . . . . . 65

5.12 Batch 1: Application characterization within state-of-the-art CARM and proposed DP Scalar LD

CARM extension. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.13 Application efficiency characterization with proposed DP Scalar LD CARM extension. . . . . . . 66

5.14 Batch 2: Instruction distribution and Top-Down analysis for DP Scalar LD applications. . . . . . . 67

5.15 Batch 2: Application characterization within state-of-the-art CARM and proposed DP Scalar LD

CARM extension. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

5.16 Application characterization with DP Scalar LD INST CARM. . . . . . . . . . . . . . . . . . . . 68


List of Tables

2.1 State-of-the-art works. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

5.1 Performance and arithmetic intensity of Toypush kernels before and after optimization. . . . . . . 57


List of Algorithms

1 Generic memory benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2 Generic FP benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3 Multiply and Add (MAD) DP AVX Benchmark for Intel Ivy Bridge . . . . . . . . . . . . . . . . 30

4 FMA DP AVX Benchmark for Intel Skylake . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

5 Generic CARM benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47


List of Acronyms

AI Arithmetic Intensity

AMT Asynchronous Many-Task

AVX Advanced Vector Extensions

BPU Branch Prediction Unit

CARM Cache-Aware Roofline Model

CPU Central Processing Unit

DMI Direct Media Interface

DP Double Precision

DRAM Dynamic Random Access Memory

DSB Decoded ICache

ECM Execution-Cache-Memory Model

FMA Fused Multiply-Add

FP Floating Point

FPGA Field Programmable Gate Array

GPU Graphics Processing Unit

HLS High-Level Synthesis

HPC High Performance Computing

IDQ Instruction Decode Queue

LLC Last Level Cache

LSD Loop Stream Detector

MAD Multiply and Add

MSR Model Specific Register

MSROM Micro-Code Store Read-Only Memory

NUMA Non-Uniform Memory Access

OI Operational Intensity


ORM Original Roofline Model

PAPI Performance Application Programming Interface

PCIe Peripheral Component Interconnect express

PMU Performance Monitoring Unit

RAPL Running Average Power Limit

SIMD Single Instruction Multiple Data

SP Single Precision

SPEC Standard Performance Evaluation Corporation

SSE Streaming SIMD Extensions

TSC Time Stamp Counter


1. Introduction

In order to keep up with the growing computational needs of parallel applications, almost every feature of modern multi-core processors has undergone continuous improvement over the last years. However, this has increased the complexity of multi-cores, which may diversely impact their performance, power, and energy-efficiency [3]. Consequently, it is difficult to guarantee the best execution efficiency for applications, since evaluating all possible application execution bottlenecks is far from a trivial task, especially when different execution domains are involved (e.g., performance versus power consumption).

In this process, it is very important to relate the application requirements with the capabilities of the system where they are executed. Although cycle-accurate simulators and/or methods that rely on hardware counters to perform an extensive experimental evaluation (e.g., the Top-Down Method [2]) provide an in-depth characterization of the architecture/application capabilities, those environments are usually too complex and hard to develop. The alternative is to use simple, insightful and more intuitive approaches for modeling the micro-architecture upper-bounds for performance, power consumption and energy-efficiency [3, 5]. These models typically focus only on some micro-architecture features when describing the multi-core upper-bounds, hence they can ease the work of programmers during the design and prototyping stages.

One of the most widely used insightful approaches is roofline modeling [6], which considers the maximum capabilities of specific functional units (usually, double precision Floating Point (FP) units) and of the memory hierarchy (in terms of bandwidth). Due to their simplicity, roofline models are typically used to detect the main bottlenecks in the application execution and to provide useful optimization guidelines. As a result, these models can quantify the potential of applications to reach the micro-architecture maximums (rooflines).

1.1 Motivation

In past decades, technological improvements and micro-architectural innovation led to an exponential performance increase of processor architectures. This performance growth is typically coupled with an increase in the overall system complexity due to the introduced micro-architecture and system-wide enhancements, e.g., a higher number of cores with advanced pipeline functionalities and a memory hierarchy with deeper and more diversified levels [7, 8]. Besides the constant improvements in architectural features, the number of transistors on chip also kept increasing with each new processor micro-architecture, according to Moore's Law (a doubling of transistors on chip every 18 months) [9]. However, current technology is reaching its limits, and the occurrence of dark silicon, i.e., the impossibility of having all transistors operating at the same time, has become a major obstacle to performance, power and energy-efficiency scaling in modern multi-core processors [9]. Due to this phenomenon, different approaches to building more energy-efficient architectures are needed, and designs are even moving towards highly heterogeneous solutions [8].

The increased complexity and diversity of contemporary architectures impose significant challenges to ensuring efficient execution and performance portability of real-world and commonly used applications. These challenges mainly arise from the difficulty of fine-tuning the application execution on a given computing platform with respect to the capabilities of the hardware resources and their ability to satisfy the application-specific characteristics and demands. Therefore, relating the application run-time behavior with the capabilities of the underlying hardware resources is crucial for identifying potential execution bottlenecks (arising from the inefficient use of system resources), as well as for assessing the application's potential to fully exploit the hardware capabilities.

Besides these challenges at the computer architecture level, the inherent complexity of modern parallel applications imposes an additional burden when optimizing their execution and improving performance on general-purpose hardware. In particular, a real-world application may involve a high degree of execution heterogeneity via the inclusion of several execution phases, each exercising a specific set of hardware resources. Hence, different application phases may experience different execution bottlenecks from the hardware perspective, thus requiring a set of different optimization techniques to be applied. For example, the performance of application phases bound by the inefficient use of the memory hierarchy can be improved by optimizing the memory access pattern and cache utilization, while code vectorization can be applied to boost the performance of phases that do not fully exploit the capabilities of the functional units. In real-world scenarios, even for a single application phase, it might be necessary to simultaneously or iteratively apply several different techniques until reaching the desired performance.

In such a broad optimization space, evaluating the benefits and trade-offs across a set of different solutions and optimization goals is far from a trivial task. To identify the most appropriate implementations, optimization techniques and main hardware execution bottlenecks, a range of complex execution scenarios must be considered (even at the level of a single core or a functional unit). Although architecture-specific testing and simulation environments can precisely model the functionality of architectures and applications, these environments are rather complex, hard to use and difficult to develop [10, 11]. For fast prototyping, however, simple and insightful performance models are essential and particularly useful for computer architects and application designers, since they provide the means to quickly assess and relate the main characteristics of architectures and applications.

In this respect, approaches based on roofline modeling are particularly useful, since they provide simple and intuitive ways to combine application needs with micro-architecture upper-bounds. In roofline modeling, there are two main approaches: the Original Roofline Model (ORM) [4] and the Cache-Aware Roofline Model (CARM) [3, 5]. The main difference between these two models is the way in which memory traffic is perceived. The ORM only considers the data traffic between consecutive memory levels (usually, to/from Dynamic Random Access Memory (DRAM)), while the CARM considers the data traffic as seen by the core, i.e., through the complete memory hierarchy. On the other hand, both models consider the upper-bounds for memory and computation performance, and they allow evaluating how far an application is from exploiting the maximum processor capabilities. Moreover, these models are useful to determine whether an application is mostly memory- or compute-bound.

Due to its usefulness, roofline modeling is the starting point of several works. The ORM has been used in the analysis of Non-Uniform Memory Access (NUMA) systems [12–14], in the analysis of application execution bottlenecks [15] and in Asynchronous Many-Task (AMT) runtimes [16]. The CARM, in turn, has been used to characterize real applications [17], and several tools were developed to facilitate its analysis, such as [18]. Moreover, it resulted in a collaboration with Intel Corporation, and the CARM is integrated as a fully supported feature in Intel Advisor 2017, contained in Intel® Parallel Studio XE 2017 [19].

However, due to their simplicity, these models do not take into account all important micro-architecture features, which might be relevant for the characterization of certain workloads. For example, real applications have different instruction mixes and, since these models only consider FP operations, some application characteristics cannot be analyzed. Moreover, since the models assume that computations and data transfers completely overlap in time, the results might not be the most accurate, mainly for power consumption, for applications where data-transfer time and compute time are mainly sequential. Finally, despite providing a set of high-level information, in certain scenarios it is not possible to distinguish the actual execution bottleneck for some applications, e.g., to quantify the impact of the bandwidth between memory levels in the performance CARM.

In order to tackle these issues, the main objective of this Thesis is to extend the insightfulness of existing roofline models by proposing a set of novel methods and extensions that overcome the CARM limitations while maintaining the model's simplicity, e.g., by including additional information regarding potential execution bottlenecks, such as the bandwidth between memory levels and the impact of different instruction mixes. To accomplish this objective, it is fundamental to find a trade-off between the level of detail of the modeling and how easily it can be understood by the user. In fact, the outcomes of this Thesis are expected to serve as a starting point for further improvements of the CARM implementation in Intel Advisor.

1.2 Objectives

In order to overcome the previously referred limitations, this Thesis has the following objectives:

• Benchmarking real applications by using hardware counters, in order to characterize their behavior on real platforms, and analyzing the application source codes to correlate them with their instruction mixes, followed by tests on state-of-the-art platforms/models, with the purpose of acquiring additional information regarding application execution bottlenecks.

• Proposing a set of novel insightful methodologies and extensions to more precisely characterize the appli-

cation behavior and execution bottlenecks in the performance, power consumption and energy-efficiency

CARMs.

• Benchmarking the Intel Skylake 6700K and Intel Ivy Bridge 3770K micro-architectures, through the development of benchmarks and by using hardware instrumentation tools, to characterize the upper-bounds of the architectures for different scenarios, e.g., different instruction mixes, different load/store ratios, etc.

• Validating the proposed CARM extensions for performance and power consumption on Intel Ivy Bridge 3770K and Intel Skylake 6700K by focusing on different micro-architecture capabilities, e.g., different load/store ratios and the utilization of different arithmetic units.

• Providing a set of suggestions and improvements for the current implementation of the Intel Advisor CARM,

in order to further improve its insightfulness.

1.3 Main contributions

In this Thesis, a set of insightful application-centric micro-architecture models for performance, power consumption and energy-efficiency is proposed, aiming at improving the insightfulness of the state-of-the-art models


in order to cover a wide range of execution scenarios from both micro-architecture and application perspectives.

The proposed models explicitly consider the impact of different instruction types, ratio of memory operations

and instruction set extensions on the realistically attainable upper-bounds of the modern multi-core processors.

In addition, a set of novel and redefined general roofline models is also investigated in order to provide a more precise characterization of the potential execution bottlenecks for applications whose execution is not necessarily

dominated by the FP operations. As such, these models provide a foundation to derive more general insightful

micro-architecture models based on the fundamental roofline modeling principles.

In order to fully demonstrate the usefulness and insightfulness of the proposed models, an extensive experi-

mental validation, evaluation and analysis are performed on real hardware platforms (equipped with the quad-core

Intel Skylake 6700K and Ivy Bridge 3770K processors) and real-world applications, including a set of FP benchmarks from the Standard Performance Evaluation Corporation (SPEC) suite. For these applications, the information

extracted from the proposed CARMs is also compared with the state-of-the-art CARM implementation, demonstrating their ability to provide a more accurate application characterization. In particular, by following the optimization guidelines given by the proposed models, the performance of several application kernels was improved by up to 6.43 times when compared to their unoptimized versions.

The initial outcomes and research achievements from this Thesis were communicated at the HPCS 2017 inter-

national conference with the following contributions:

• Diogo Marques, Helder Duarte, Aleksandar Ilic, Leonel Sousa, Roman Belenov, Philippe Thierry and Zakhar A. Matveev. "Performance Analysis with Cache-Aware Roofline Model in Intel Advisor", In Proceedings of the International Conference on High Performance Computing & Simulation (HPCS'17), Genoa, Italy, July 2017. (paper in collaboration with Intel Corporation);

• Diogo Marques, Helder Duarte, Leonel Sousa, and Aleksandar Ilic. "Analyzing Performance of Multi-cores and Applications with Cache-aware Roofline Model", In Special Session on High Performance Computing for Application Benchmarking and Optimization (HPBench'17), co-located with the International Conference on High Performance Computing & Simulation (HPCS'17), Genoa, Italy, July 2017. (extended abstract)

In addition, the outcomes and experimental evaluations from this Thesis were also presented in several tutorials

and invited talks at international conferences, such as SC’17, HiPEAC’17, HPCS’17, HPBench’17 and NESUS’17.

Furthermore, the presented methodology for experimental assessment of the micro-architecture upper-bound ca-

pabilities (in particular, for the attainable bandwidth for different levels of memory hierarchy) was indirectly used

to improve the Intel Advisor CARM implementation (in alpha and beta development stages of the tool).

1.4 Outline

This document is structured as follows:

• Chapter 2 - Background: Insightful modeling of multi-core processors: This chapter presents a summary of the state-of-the-art. It starts by briefly explaining the Skylake micro-architecture, which is the multi-core processor used for the developments in this Thesis. In addition, the Top-Down method [2, 20] is presented, to better relate its metrics with the Intel micro-architecture capabilities. Next, the roofline modeling


is introduced, and its features, approaches and open challenges are explained. Finally, a brief overview is provided of the state-of-the-art works, mainly based on roofline modeling and related to application behavior in several systems (mostly in the Central Processing Unit (CPU)), by referring to their importance to this Thesis.

• Chapter 3 - Reaching the architecture upper-bounds with micro-benchmarking: This chapter presents all the steps performed to obtain the different micro-architecture upper-bounds. It starts by presenting the proposed benchmarking tool and the structure of the designed benchmarks. In addition, the capabilities of the Intel Ivy Bridge 3770K and Intel Skylake 6700K are compared from the performance, power consumption and energy-efficiency points of view. Finally, a Top-Down analysis is performed on the proposed benchmarks in order to assess their quality and accuracy.

• Chapter 4 - Proposed insightful models: Construction and experimental validation: In this chapter,

several CARM instances reflecting different micro-architectural capabilities are proposed. In particular, a

CARM model reflecting the maximum upper-bounds in memory subsystem and FP units of Intel Skylake

6700K is constructed (state-of-the-art CARM). The insights provided by this model are compared with

the characterization obtained with the proposed extensions in Chapter 5. Moreover, CARM validation on Intel Skylake 6700K and Intel Ivy Bridge 3770K is performed, using the tool presented in Chapter 3.

• Chapter 5 - Application characterization and optimization in the proposed insightful models: In this chapter, the insightfulness and usability of the proposed CARM extensions are demonstrated. To accomplish this task, a deep characterization of a set of real-world FP applications from the SPEC suite is performed in the proposed CARM instances, by taking into account the application instruction distribution and Top-Down analysis. In addition, the Toypush miniapp [21] is optimized based on the insights provided by the proposed CARM extensions. Finally, the state-of-the-art CARM insights are compared with the characterization of the proposed CARM extensions.

• Chapter 6 - Conclusions and Future Works: In this chapter, the conclusions obtained from the work performed in this Thesis are presented, as well as possible future work that can further increase CARM insightfulness.


2. Background: Insightful modeling of modern multi-core processors

The research work in this Thesis is mainly focused on proposing novel methods and extensions to existing

approaches for insightful modeling of multi-core processors, aiming to increase their accuracy and usability. With this aim, this chapter provides an in-depth overview of the main concepts and the required background information, which are necessary to facilitate the understanding of the proposed solutions and their scientific contributions.

For this purpose, a thorough overview regarding the architecture of modern multi-core processors is presented, as well as the state-of-the-art approaches for insightful modeling of their performance, power consumption and energy-efficiency. A special emphasis is also given to exposing the main challenges and open problems in this research

area, which are specifically tackled in this Thesis.

In this respect, two different micro-architectures from the Intel Core processor family are first introduced (i.e., Intel Ivy Bridge [1] and the most recent Intel Skylake [7]), by providing an overview of their overall structure,

pipeline functionality and memory hierarchy. Besides, one of the most relevant methods for performance analysis

in multi-core processors, i.e., the Top-Down method [2], is introduced, correlating its metrics with the presented

Intel micro-architectures. Furthermore, the most relevant approaches for insightful modeling for performance,

power consumption and energy efficiency of multi-core processors are deeply examined, with a specific emphasis

given to ORM [4] and CARM [3, 5]. This Chapter showcases the usability of both ORM and CARM and it also

states a set of open challenges, in order to further improve their insightfulness. In addition, other state-of-the-art

models and extensions to roofline modeling are analyzed, as well as the breakthrough methods for application

characterization and analysis [1, 2].

2.1 Modern multi-core processors and performance analysis

The first multi-core processors appeared around 2001, with the IBM Power4 processor [22], but they only outnumbered single-core processors in 2005, with the release of the AMD Dual-Core Opteron and the Intel Pentium D. Since then, with the improvement of silicon technology, multi-core processors have increased their performance and computational capabilities, mostly following Moore's Law [9]. In particular, until 2016, Intel micro-architectures followed a tick-tock manufacturing model, where the "tock" represents the introduction of a new micro-architecture (or a substantial improvement over the previous one), while the next "tick" processor maintains the same micro-architecture with a reduced manufacturing technology. This strategy had been followed since 2006, as illustrated by the Sandy Bridge "tock" (32 nm) and the Ivy Bridge "tick" (22 nm).

For the most recent Skylake micro-architecture (14 nm "tock", introduced in 2015), the "tick-tock" manufacturing and design model is officially discontinued, thus suggesting the potentially diminishing applicability of Moore's Law in the most recent micro-architectures. In turn, Intel now adopts the "process-architecture-optimization" model, where the first Skylake successor, i.e., the Kaby Lake micro-architecture (launched in 2017


for the desktop market), is still produced in 14 nm technology, and thus does not reduce the transistor size. Intel has also announced that one more Skylake/Kaby Lake optimization step (Coffee Lake, expected in 2017) will be introduced before releasing the 10 nm "tick", i.e., Cannonlake in 2018. As a result, Intel Skylake-based micro-architectures are expected to dominate the CPU market for at least the next several years.

For these reasons, two Intel micro-architectures are considered in the scope of this Thesis, namely the Ivy Bridge and Skylake micro-architectures, whose fundamental aspects are subsequently analyzed. It is also worth noting that the choice to focus primarily on Intel CPUs (versus other manufacturers, e.g., AMD) is motivated by their clear dominance in High Performance Computing (HPC) environments, e.g., more than 90% of the Top500 supercomputers rely on Intel devices and architectures.

Moreover, the Top-Down method [2] is a performance analysis tool developed by Intel Corporation. It correlates application characteristics with processor capabilities, by identifying the main bottlenecks that limit performance. Since it is a very complete model, which evaluates a large number of possible bottlenecks that can limit application performance, the Top-Down method is used in this Thesis to confirm the correctness of the insights provided by the proposed CARM extensions.

2.1.1 Intel Ivy Bridge and Skylake micro-architectures

Despite the introduced enhancements, different micro-architecture generations from the Intel Core processor

family share a very similar basic pipeline structure. For Intel Skylake micro-architecture, Figure 2.1 presents a

high level overview of the CPU core pipeline. The core pipeline can be divided in two main parts: frontend

(in-order execution) and backend (out-of-order execution). The frontier between them is the Instruction Decode

Queue (IDQ), which can hold up to 64 micro-operations (µops) and contains a Loop Stream Detector (LSD), able

to detect loops of up to 64 µops [1].

In the frontend, the µops are delivered to the IDQ by three components: Micro-Code Store Read-Only Mem-

ory (MSROM), Decoded Icache (DSB) and Legacy Decode Pipeline. The Legacy Decode Pipeline obtains the

instructions from the L1 Instruction Cache and delivers up to 5 µops per cycle to the IDQ. The DSB is fed by the Legacy Decode Pipeline and stores the most recently fetched and decoded µops. As such, the DSB allows the Legacy Decode Pipeline to be bypassed for a set of recently decoded instructions, which is very useful for loop execution (i.e., when the same instructions are repeated in each loop iteration). The DSB can deliver up to 6 µops per cycle to the

IDQ. Finally, the MSROM can issue a maximum of 4 µops per cycle and it is only used for instructions longer

than 4 µops [1]. The instruction flow is controlled by the Branch Prediction Unit (BPU), which designates the

next instruction to be forwarded to the IDQ either from the DSB or from the traditional decoding pipeline (i.e., by

fetching the instruction from the L1 instruction cache and by decoding it in the Legacy Decode Pipeline).

From the IDQ, the µops enter the renamer block, where several execution steps can be performed, such as binding the dispatch ports with execution resources, zero-idiom operations (clearing register contents to zero using common operations, e.g., XOR), one-idiom operations (setting all the register bits to 1 using common operations, e.g., CMPEQ) and zero-latency register move operations (exchanging the contents between registers). As can be observed, these operations are performed before the instruction scheduling stage, which reduces the scheduler's workload and complexity, thus resulting in overall performance improvements [1].



Figure 2.1: CPU pipeline for a Skylake micro-architecture [1].

In the scheduler, µops are forwarded to the respective dispatch ports. In both micro-architectures (i.e., Ivy

Bridge and Skylake), ports 0, 1 and 5 are mainly used for FP operations, while ports 2, 3 and 4 are dedicated to the

memory operations. In Skylake micro-architecture, the additional ports 6 and 7 are introduced to provide further

enhancements, mainly for integer arithmetic and memory operations, respectively. As a result, Skylake processor

can dispatch a ready µop to one of eight different ports for execution (versus six ports in Ivy Bridge).

Both micro-architectures support Single Instruction Multiple Data (SIMD) instructions, e.g., Advanced Vector

Extensions (AVX) and Streaming SIMD Extensions (SSE), as well as scalar instructions (e.g., ADD and MUL).

In contrast to Ivy Bridge, Skylake micro-architecture provides the full support for AVX Double Precision (DP)

FP Fused Multiply-Add (FMA) operations in two different execution ports (see Figure 2.1) [1]. However, in

Ivy Bridge, AVX FMA instructions can only be replicated by simultaneously performing a multiply instruction

followed by an addition in two different ports, which is referred herein as the AVX FP Multiply and Add (MAD)

operation. As a result, at the same operating frequency and for the same instruction set, the AVX FP throughput

can be effectively doubled in Skylake when compared to Ivy Bridge. In detail, in Intel Ivy Bridge, FP MUL and

FP ADD are served by two different ports (ports 0 and 1, respectively), and these two instructions can be executed

in the same clock cycle [1], i.e., one MAD per clock. Hence, when using AVX SIMD DP instructions, the AVX vector length allows 4 flops to be performed per instruction and, correspondingly, the Ivy Bridge micro-architecture can


(a) Intel Skylake micro-architecture. (b) Intel Ivy Bridge micro-architecture.

Figure 2.2: Memory subsystem for Intel micro-architectures.

deliver up to 8 flops per cycle for AVX DP MAD operations. For example, at the nominal frequency (3.5 GHz),

Intel Ivy Bridge 3770K processor can deliver a maximum performance of 8×3.5=28 GFLOPS/s per core, i.e., 112

GFLOPS/s when all four cores are fully utilized (4×28). Regarding Intel Skylake, AVX DP FP FMA instructions

are served by ports 0 and 1 and the respective functional unit can deliver 8 flops per cycle (per port). Thus, Intel

Skylake has a maximum throughput of 16 flops per cycle, i.e., double the maximum throughput of Intel Ivy Bridge.

Hence, at the nominal frequency (4 GHz) Intel Skylake 6700K processor can deliver the maximum performance

of 16×4=64 GFLOPS/s per core, i.e., 256 GFLOPS/s for 4 cores, which corresponds to an increase of about 2.3

times when compared to Intel Ivy Bridge 3770K processor.
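As a cross-check, the peak performance figures above follow directly from flops per cycle, operating frequency and core count. A minimal sketch (the function name is illustrative, not from the text):

```python
# Sketch: peak DP FP throughput from the figures stated in the text.
# flops/cycle per core x frequency (GHz) x cores = GFLOPS/s.

def peak_gflops(flops_per_cycle, freq_ghz, cores=1):
    """Peak performance in GFLOPS/s for a given core count."""
    return flops_per_cycle * freq_ghz * cores

# Ivy Bridge 3770K: AVX DP MAD -> 8 flops/cycle per core at 3.5 GHz.
ivb_1c = peak_gflops(8, 3.5)        # 28 GFLOPS/s
ivb_4c = peak_gflops(8, 3.5, 4)     # 112 GFLOPS/s

# Skylake 6700K: two AVX DP FMA ports -> 16 flops/cycle per core at 4 GHz.
skl_1c = peak_gflops(16, 4.0)       # 64 GFLOPS/s
skl_4c = peak_gflops(16, 4.0, 4)    # 256 GFLOPS/s

print(ivb_1c, ivb_4c, skl_1c, skl_4c, skl_4c / ivb_4c)
```

The last printed value reproduces the roughly 2.3x quad-core advantage of Skylake stated above.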

Figures 2.2a and 2.2b present the memory subsystem organization for Intel Skylake and Ivy Bridge micro-

architectures, respectively. In both architectures, ports 2 and 3 are reserved for load operations, while port 4 serves store instructions from the core to the L1 data cache. Moreover, in the Skylake micro-architecture, there

is an additional port 7, which is reserved for store address calculation, in order to provide the full support for two

loads and one store instruction (2LD+ST) per cycle per core. Furthermore, from Ivy Bridge to Skylake, the bus

width of the connection lane between the ports and the L1 data cache was increased from 128 bits to 256 bits, i.e.,

from 16 bytes to 32 bytes per port (see Figure 2.2). As a result, Skylake supports a maximum theoretical throughput

of 32bytes×3ports=96 bytes per cycle (per core), while Ivy Bridge can only deliver 16bytes×3ports=48 bytes per

cycle (per core).
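The same arithmetic applies to the L1 port widths stated above. The following sketch derives the per-core byte throughput; the GB/s conversion assumes the nominal frequencies and decimal gigabytes (an illustrative derivation, not a measured value):

```python
# Sketch: theoretical L1 load/store throughput per core, from the port
# widths stated above (3 memory ports in both micro-architectures).

def l1_bytes_per_cycle(bytes_per_port, ports=3):
    """Maximum theoretical L1 traffic per cycle, per core."""
    return bytes_per_port * ports

skylake = l1_bytes_per_cycle(32)    # 96 bytes/cycle per core
ivybridge = l1_bytes_per_cycle(16)  # 48 bytes/cycle per core

# bytes/cycle x frequency (GHz) = GB/s per core (1 GB = 1e9 bytes).
skylake_gbs = skylake * 4.0         # 384 GB/s at 4 GHz
ivybridge_gbs = ivybridge * 3.5     # 168 GB/s at 3.5 GHz

print(skylake, ivybridge, skylake_gbs, ivybridge_gbs)
```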

The memory subsystem of Ivy Bridge and Skylake micro-architectures contains three cache levels (L1, L2

and L3) and DRAM. L1 and L2 caches are private to each core, and their sizes are 32 KB and 256 KB per core,

respectively. L3 and DRAM are shared between cores and their size varies according to processor model and

system configuration. For example, in the scope of this Thesis, two different computing platforms were evaluated:

i) an Ivy Bridge-based system with a quad-core Intel 3770K processor, 8 MB of L3 cache and 8 GB of DRAM;

and ii) a Skylake-based platform with a quad-core Intel 6700K processor, 8 MB of L3 cache and 32 GB of DRAM.

It is also worth noting that the L1 instruction and data caches are separate, while the L2 and L3 caches include both

instructions and data [1].

The connection between the cores and the L3 cache is made through a ring interconnection to multiple slices

of this memory level, as shown in Figure 2.3. This connection is a coherent bi-directional ring bus that delivers 32

bytes per cycle at each stop and connects three different parts of the chip: the cores and L3, the on-chip Graphics


Figure 2.3: Connection between cache levels, GPU and system agent [1].

Processing Unit (GPU) and the system agent that includes the DRAM controller, Direct Media Interface (DMI)

controller and Peripheral Component Interconnect express (PCIe) controller. Finally, this micro-architecture also

supports speculative data loads to one of the cache levels, using several hardware pre-fetching mechanisms, which

can improve performance for codes dominated by sequentially ordered memory accesses.

2.1.2 Top-Down method for application analysis and detection of execution bottlenecks

Recently, a Top-Down method for counter-based application analysis was proposed in [1, 2], which represents a breakthrough approach to identify the different execution bottlenecks that limit application performance in modern out-of-order CPUs. It aims at solving the limitations of traditional methods, which do not take into account several characteristics of modern CPUs, e.g., CPU stalls that overlap among different functional units, speculative execution and the effects of branch misprediction.

The Top-Down concept is based on a structured drill-down method that guides the user to the critical areas within the processor pipeline by relying on the CPU Performance Monitoring Unit (PMU) (in particular, on hardware performance counters). In detail, the Top-Down method decouples the processor pipeline into a tree-like structure, where

each node represents a potential execution bottleneck (in a different part of the CPU pipeline) and is assigned a specific weight that emphasizes its relevance, as presented in Figure 2.4.

When applying the Top-Down methodology, it is possible to identify the predominant execution bottlenecks,

since the Top-Down method reports the contribution of each of these pipeline parts to the overall application execution, i.e., how the different parts of the CPU pipeline are used by the application. As such, the component

with the highest utilization in the Top-Down hierarchy can be considered as the predominant limiting factor for

the application execution. Typically, this analysis should be performed between the nodes in the same level of the

hierarchy (since they refer to the same pipeline stage), starting from the top level. Afterwards, the nodes in the

inner hierarchy should be examined only for the top nodes marked as the predominant sources of bottlenecks.

As previously referred, a modern out-of-order CPU engine is divided into two main parts: frontend and backend (see Figure 2.1). The former is responsible for instruction fetches and their transformation into micro-operations

(decoding), while the latter is responsible for scheduling, executing and retiring the micro-operations. Therefore,

in the Top-Down method, the pipeline analysis is divided in four categories: retiring, bad speculation, frontend

bound and backend bound (see Figure 2.4).

Since the backend receives micro-operations from the frontend, an application is frontend bound when the backend is under-supplied. In this case, the application can be bound by the bandwidth or latency of the frontend,


Figure 2.4: Top-Down Analysis hierarchy [2].

where the former signals inefficiency in the fetch units, while the latter is directly connected with fetch starvation. Bad speculation includes all the stalls originating from branch mispredictions and machine clears, i.e., when the entire CPU pipeline is cleared due to memory ordering violations, self-modifying code or loads from illegal addresses. Thus, it includes stalls from two main situations: 1) pipeline slots used to issue micro-operations that do not retire; and 2) slots where the issue pipeline is blocked due to misspeculation.

The retiring category takes into account the issued micro-operations that eventually get retired. The best-

case scenario corresponds to a retiring of 100%, i.e., when the processor retires the maximum amount of micro-

operations per cycle. However, this does not imply that the application is fully optimized. For example, a high

retiring for a non-vectorized application can suggest possible improvements by introducing SIMD instructions in

the code. Lastly, the backend bound node is divided into core bound and memory bound parts. In core bound, a

stall can occur due to execution starvation or sub-optimal ports utilization. On the other hand, the memory bound

includes execution stalls that occur while serving the data requests from the memory hierarchy (which can be

further decoupled on a per memory level basis).
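The drill-down procedure described above can be sketched as a walk over a tree of slot fractions: at each level, follow the node with the largest share. The category names follow the text; the sample fractions below are hypothetical, not measured:

```python
# Sketch of the Top-Down drill-down: at each level of the hierarchy,
# descend into the node holding the largest fraction of pipeline slots.

def drill_down(tree):
    """Return the path of dominant bottlenecks through a Top-Down tree."""
    path = []
    node = tree
    while isinstance(node, dict):
        name = max(node, key=lambda k: node[k][0])  # largest slot fraction
        path.append(name)
        node = node[name][1]  # descend into that node's children (or None)
    return path

# Each entry: category -> (fraction of pipeline slots, children or None).
# Hypothetical profile of a memory-bound application.
sample = {
    "retiring":        (0.30, None),
    "bad_speculation": (0.05, None),
    "frontend_bound":  (0.10, None),
    "backend_bound":   (0.55, {
        "core_bound":   (0.15, None),
        "memory_bound": (0.40, None),
    }),
}

print(drill_down(sample))  # ['backend_bound', 'memory_bound']
```

This mirrors the recommendation above: only the dominant top-level node (here, backend bound) is examined further, instead of comparing nodes across different branches.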

The Top-Down method was recently extended to provide the power consumption breakdown for different components in the CPU pipeline [20], by following an approach similar to the one used in the performance Top-Down method [2]. As a consequence, this power breakdown method decouples the contributions of the frontend, backend and core to the overall power consumption. For this purpose, a set of hardware counters is used to correlate the

performance metrics with power consumption. Each counter is associated with a weight, obtained through a set

of experimental tests performed on a specific micro-architecture and micro-architecture simulator. However, the

proposed power breakdown method currently covers only single-core execution and a set of private caches in

the memory subsystem (i.e., L1 and L2) [20].

Despite the model complexity (due to the high number of hardware counters), the Top-Down method allows application behavior to be deeply correlated with the micro-architecture capabilities and covers a wide range of


potential sources of application execution bottlenecks from the micro-architecture perspective. For these reasons,

the Top-Down method will be relied upon in this Thesis to complement the analysis and validation of the herein

proposed roofline modeling approaches and extensions, as well as to confirm the efficiency and accuracy of the

developed micro-benchmarks for in-depth experimental evaluation of the micro-architecture capabilities.

2.2 State-of-the-art approaches for insightful modeling of multi-cores

The characterization and optimization of applications can be a difficult task, due to micro-architecture complexity and application heterogeneity. For a given algorithm and application implementation, determining the current execution bottlenecks is far from trivial, since it requires relating the application characteristics/demands with the capabilities of the different subsystems in the processor pipeline. This process is especially challenging for applications with a large diversity of instruction types in their instruction mixes,

which can simultaneously exercise several different components of the micro-architecture, e.g., different levels of

memory hierarchy and/or functional units.

In these scenarios, approaches for insightful modeling of multi-core processors are valuable resources for computer architects and application developers, easing the characterization and optimization of applications through fast analysis and an intuitive visual representation of the most relevant micro-architecture capabilities. In order to be insightful and general, the model cannot include too many micro-architectural details, as these would lead to a model that is too complex and/or architecture-specific. As such, a general insightful model needs to incorporate only the minimum set of architecture-related information required to provide important guidelines about the primary application execution bottlenecks. As a result, insightful modeling represents a trade-off between the level of detail (modeling accuracy) and model simplicity.

Roofline modeling is an insightful modeling method widely used in both academia and industry, which has

already provided several contributions in micro-architecture and application analysis [6]. It represents an intuitive

and insightful tool, which allows characterizing application behavior in multi-core, many-core or accelerator

processor architectures. It combines, in a single plot, the inherent hardware limitations and application optimization

potential, by modeling the architecture attainable upper-bounds for performance, power consumption, energy or

efficiency [6]. Roofline modeling relies on the observation that memory operations and computations can be

executed concurrently in modern out-of-order processors, thus the overall execution can be limited either by

the time to perform the computations or by the memory accesses. Hence, roofline modeling methods contain two

distinct regions: memory bound and compute bound regions, which are useful to pinpoint the potential application

execution bottlenecks [3, 4, 6]. In the existing literature, there are two main approaches for roofline modeling: the

ORM [4] (also referred to as the Classic Roofline Model) and the recently proposed CARM [3, 5]. Both models relate

intensity, i.e., the ratio between computations and memory traffic, with different metrics (performance, power,

energy-efficiency) in order to facilitate application characterization and provide important optimization guidelines.

2.2.1 Performance Roofline Modeling

In general, the performance roofline models relate intensity to FP performance and memory bandwidth (mem-

ory traffic). However, CARM [3, 5] and ORM [4] analyze memory traffic differently. While ORM only observes


the traffic between two specific memory levels (usually between the Last Level Cache (LLC) and DRAM), CARM

considers the complete memory hierarchy by observing the memory traffic from the core's point of view, as shown in Figure 2.5. Hence, CARM can represent in a single plot the realistically attainable bandwidth (By) of each memory level y, where y ∈ {L1, L2, ..., LLC, DRAM}. In addition, the throughput of the FP units is seen equally by both models and it is used to represent the peak compute performance of a given processor (Fp, in flops/s).

Since ORM and CARM observe memory traffic differently, the intensity used in each of these models is

also different. ORM introduces the term Operational Intensity (OI) to denote compute operations per byte of

data traffic to/from a specific level of the memory hierarchy [4]. For example, the DRAM variant of ORM only

observes the data traffic between the LLC and DRAM, i.e., bytes transferred to/from the DRAM, herein referred

as DRAMbytes. As such, the OI in DRAM ORM is expressed in f lops per DRAMbyte. On the other hand, CARM

uses Arithmetic Intensity (AI), i.e., the ratio between compute operations and the total number of bytes originating

from the instructions in the application code (regardless of the memory level where those requests are served in

the memory hierarchy) [3]. As a consequence, the AI in CARM is expressed in flops per byte.
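The difference between the two intensity definitions can be made concrete with a small sketch; the operation and traffic counts below are invented for illustration, not measured values:

```python
# Toy comparison of CARM's AI and ORM's OI for a hypothetical kernel;
# all counts below are assumed for illustration only.
flops = 2.0e9        # FP operations performed by the kernel
total_bytes = 8.0e9  # bytes requested by load/store instructions (core view)
dram_bytes = 1.0e9   # bytes actually exchanged between the LLC and DRAM

AI = flops / total_bytes  # CARM arithmetic intensity (flops/byte)
OI = flops / dram_bytes   # DRAM ORM operational intensity (flops/DRAMbyte)

# Cache hits shrink the DRAM traffic, so OI >= AI for the same kernel.
assert AI == 0.25 and OI == 2.0
```

Any cache hit reduces the DRAM traffic without changing the core-side traffic, which is why the same kernel can appear at different x-axis positions in the two models.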

Figure 2.5: ORM and CARM memory traffic [3].

In order to construct each of these models, it is necessary to take into account their respective approaches. Since

both models see the FP unit throughput equally, the time to perform a given amount of flops (φ ) is expressed as

φ/Fp, corresponding to the time involved in computations (Tc). Regarding the time to perform memory transfers

(Tm), it differs between CARM and ORM. In ORM, the time to transfer an amount of bytes served by DRAM

(βD, i.e., DRAMbytes), with DRAM bandwidth BD, is given by βD/BD. Thus, ORM application execution time is

calculated by:

T(OI) = T(φ/βD) = max{ Tc, Tm } = φ × max{ 1/(BD × OI), 1/Fp } .   (2.1)

Hence, the maximum attainable performance of the architecture in the DRAM ORM, i.e., Fa(OI), is defined as:

Fa(OI) = φ/T(OI) = min{ BD × OI, Fp } .   (2.2)
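Equations (2.1) and (2.2) translate directly into code; the peak performance (Fp) and DRAM bandwidth (BD) used below are illustrative assumptions, not parameters of any specific processor:

```python
# Sketch of the DRAM ORM, Eqs. (2.1)-(2.2); platform numbers are assumed.
F_p = 224.0  # peak FP performance (Gflops/s), hypothetical
B_D = 25.6   # peak DRAM bandwidth (GB/s), hypothetical

def orm_time(flops, oi):
    """T(OI) = phi * max{1/(B_D*OI), 1/F_p} (Eq. 2.1)."""
    return flops * max(1.0 / (B_D * oi), 1.0 / F_p)

def orm_attainable(oi):
    """F_a(OI) = min{B_D*OI, F_p} (Eq. 2.2)."""
    return min(B_D * oi, F_p)

# Low OI is limited by DRAM bandwidth, high OI by the FP units.
assert orm_attainable(1.0) == 25.6
assert orm_attainable(100.0) == 224.0
```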

On the other hand, from the CARM point of view, the time to transfer an amount of bytes (β), served by the

memory level y with bandwidth By, is given by β/By. Consequently, CARM application execution time for level

y is expressed as Ty(AI) = φ × max{ 1/(By × AI), 1/Fp }. Hence, the CARM maximum attainable performance, Fa,y(AI), is

expressed as:

Fa,y(AI) = min{ By × AI, Fp } .   (2.3)
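Because Equation (2.3) holds for every memory level y, the whole set of CARM roofs can be evaluated at once; the per-level bandwidths below are assumed placeholders for benchmarked values:

```python
# Multi-level CARM roofs, Eq. (2.3); bandwidths (GB/s) are assumed values.
F_p = 512.0
B = {"L1": 730.0, "L2": 360.0, "L3": 180.0, "DRAM": 30.0}

def carm_attainable(ai, level):
    """F_{a,y}(AI) = min{B_y * AI, F_p}."""
    return min(B[level] * ai, F_p)

# At the same AI, the upper-bound depends on where the data is served from.
roofs = {lvl: carm_attainable(1.0, lvl) for lvl in B}
assert roofs["DRAM"] < roofs["L3"] < roofs["L2"] < roofs["L1"]
```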


(a) ORM in Intel 3770K Ivy Bridge (b) CARM in Intel 3770K Ivy Bridge

Figure 2.6: ORM and CARM [3, 4].

Figures 2.6a and 2.6b present ORM and CARM models, respectively, for Intel Ivy Bridge 3770K, with three

cache levels and DRAM. Both models are plotted for DP FP AVX instructions, with the intensity on the x-axis and performance on the y-axis (both axes in log scale).

As presented in Figure 2.6b, CARM includes in a single plot all the memory levels, represented by four

slanted roofs (one for each memory level). Each slanted roof delimits the memory bound region of the respective

memory level (L1, L2, L3 and DRAM). The maximum attainable performance in this region is limited by L1 cache

bandwidth, while the remaining levels offer a lower attainable performance, due to the bandwidth reduction when

data is fetched further away from the core. ORM only contains one slanted roof (Figure 2.6a), representing the

bandwidth between the LLC and DRAM.

In the right part of the models, a set of horizontal roofs forms the compute bound region, which describes

the processor computational capabilities. Since Intel Ivy Bridge supports vectorized instructions, e.g., AVX, SSE,

and scalar instructions (such as ADD and MUL), the compute region can include one horizontal roof for each

instruction type. In particular, Intel Ivy Bridge achieves maximum FP throughput when using DP AVX MAD,

corresponding to the FP peak performance, as shown in Figures 2.6a and 2.6b. Furthermore, the intersection

between the horizontal and slanted roofs (i.e., the ridge point) represents the minimum intensity that allows reaching Fp, and it also marks the point where the computation time is equal to the memory transfer time [3, 6]. As a

consequence, the ridge point defines the boundary between the two regions of the model, i.e., the compute (on the

right side of the ridge point) and memory bound (on the left side of the ridge point) regions. Since the application

is usually plotted with a single point within the roofline chart, if the application intensity is on the right side of the

ridge point, it is compute bound; if it is on the left side, the application is characterized as memory bound [3, 6].
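The ridge point follows from equating the slanted and horizontal roofs, By × AI = Fp; a minimal sketch of this rule with assumed L1-level parameters:

```python
# Ridge point and bound classification for the L1 roof; values are assumed.
F_p = 512.0   # peak FP performance (Gflops/s), hypothetical
B_L1 = 730.0  # L1 bandwidth (GB/s), hypothetical

ridge_ai = F_p / B_L1  # minimum AI at which F_p becomes attainable

def bound_region(ai):
    return "compute bound" if ai >= ridge_ai else "memory bound"

assert bound_region(0.1) == "memory bound"
assert bound_region(4.0) == "compute bound"
```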

It is worth noting that the ORM can also be applied to other memory levels but, instead of using DRAM

bandwidth, it is constructed with the peak bandwidth of the desired memory level. Hence, to analyze the gains when

applying different application optimization strategies (e.g., improving the memory access pattern), it is necessary

to construct and simultaneously use several different representations of the model, one for each memory level [4].

Furthermore, application characterization greatly differs in CARM and ORM. In ORM, since only DRAM

traffic is analyzed, the implementation of certain optimization techniques (e.g., cache blocking) can cause a

reduction in the DRAM traffic, thus increasing OI. As a result, the application point can move from the memory

bound region towards the compute bound region. In CARM, since the memory traffic is seen from the core

perspective, the optimization techniques do not modify the AI (the AI is a property of the application), unless the applied optimizations change the algorithm itself. Hence, CARM allows visualizing the optimization potential of


[Figure: CARM plot for an Intel 6700K (Skylake), 4 cores, DP FP AVX, with Arithmetic Intensity (FLOP/Byte) on the x-axis and Performance (GFLOP/s) on the y-axis; it shows the L1→Core, L2→Core, L3→Core and DRAM→Core bandwidth roofs, the FMA and ADD/MUL DP FP AVX performance roofs with their ridge points, and two application points marked “M” and “C”.]

Figure 2.7: Performance Cache-Aware Roofline Model for an Intel 6700K quad-core processor (Skylake).

the applications, by plotting a vertical line with constant AI.

A given application (kernel) is typically plotted as a single point in the CARM, with respect to its AI and the

obtained performance when executed on a given platform. Since the AI of application point is not expected to

significantly vary when applying optimization techniques, a simple rule of thumb can be followed in the CARM

when determining the potential execution bottlenecks and deriving the optimization guidelines. Given the position of the application point in the CARM plot, an imaginary vertical line should be drawn at the application AI; all rooflines intersected by this line then represent potential sources of execution bottlenecks that limit the application performance.
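This vertical-line rule can be sketched as a small helper that lists every roof still standing above a measured application point; all platform numbers below are assumed:

```python
# Roofs above an application point at its AI ("vertical line" rule);
# bandwidths and peak performance are assumed values.
F_p = 512.0
B = {"L1": 730.0, "L2": 360.0, "L3": 180.0, "DRAM": 30.0}

def roofs_above(ai, measured_gflops):
    roofs = {lvl: min(bw * ai, F_p) for lvl, bw in B.items()}
    roofs["FP peak"] = F_p
    return sorted(lvl for lvl, perf in roofs.items() if perf > measured_gflops)

# At AI = 0.5, a kernel at 20 Gflops/s already exceeds the DRAM roof
# (15 Gflops/s there); the remaining roofs are the candidate bottlenecks.
assert "DRAM" not in roofs_above(0.5, 20.0)
assert "L3" in roofs_above(0.5, 20.0)
```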

Figure 2.7 presents an example of a Cache-aware Roofline plot where two kernels are reported. The first kernel (marked with “M”) has an AI marked with “A1”; its point lies underneath the L3 bandwidth ceiling, thus it is memory-bound, and its performance can potentially be improved by applying memory-related optimizations (e.g., by improving cache utilization and the memory access pattern). The second kernel (marked with “C”) has an AI denoted with “A2”; its point lies underneath the peak performance ceilings, thus it is compute-bound, and its performance can potentially be improved by code vectorization and the use of advanced ISA extensions.

When deciding on which optimization techniques to apply, special attention should be given to the memory and/or compute bottlenecks signaled by the rooflines positioned directly above the application point. Hence, when

applying different optimization techniques, it is expected that the application point moves along the y axis to-

wards the uppermost roofline (i.e., to improve application performance by breaking the above-positioned rooflines

without significant changes in the AI on the x-axis). This observation does not necessarily hold for the application optimization based on the ORM and OI, due to its strong dependency on hardware properties [3–5].

Intel Advisor Roofline

In 2017, CARM was integrated as an official feature of Intel Advisor (also referred to as Intel Advisor Roofline), where the process of building the roofline plots and the in-depth application characterization are fully automated

with respect to the hardware platform where the applications are executed [19]. Intel Advisor is a software tool

for analyzing application behavior on a wide range of Intel processors, which covers all contemporary Intel CPU



Figure 2.8: Intel Advisor Roofline: Performance characterization of Minighost loops

micro-architectures (from Nehalem to Skylake) up to massively parallel devices (e.g., Intel Xeon Phi x200 family,

codenamed Knights Landing). By tightly coupling a set of tools from Vectorization Advisor and Roofline Analysis,

the Intel Advisor can now provide insightful performance and design hints to help in application optimization.

To exploit the capabilities of Intel Advisor Roofline, it is required to run the Survey and Trip Counts / FLOPS

analysis when profiling an application. During this phase, Intel Advisor also performs a set of quick benchmarks to

assess the CARM-related performance parameters of a given execution platform, such as the realistically attainable

bandwidth from different memory levels to the core and the peak performance of different arithmetic units. These

parameters are subsequently used to automatically construct all necessary rooflines in the performance CARM

for a given micro-architecture. Performance data of the target application is also extracted during the Survey and

Trip Counts analysis, e.g., the total amount of floating point operations (flops), the total amount of requested data

(bytes), execution time and vectorization efficiency. By combining this analysis with the performance CARM, the

final outcome of the Intel Advisor Roofline is produced where all loops and functions of the target applications are

characterized in the CARM plot.

Figure 2.8 presents an example of Intel Advisor CARM characterization (hierarchical mode) for several loops

in the Minighost application [23] on a single core of the Intel 6700K processor (Skylake). As can be seen, the automatically constructed performance CARM in the Intel Advisor encapsulates the previously elaborated features

of the performance CARM, when representing the attainable micro-architecture performance upper-bounds for

different levels of the memory hierarchy (from L1 to DRAM). The loops of Minighost application are represented

as dots in the CARM plot, whose size and color are selected according to their execution time, ranging from green to yellow and finally red. Green points are usually unworthy of attention, since their contribution to the overall

application execution time is very small. However, the contribution of red and yellow points to the overall execution

time is more significant, thus they represent the potential candidates for optimization.

In the most recent update of Intel Advisor, the Hierarchical Roofline feature is introduced, which allows visualization of the aggregate performance of several kernels with respect to the parent kernel that invokes their execution. As shown in Figure 2.8, this functionality is attained by connecting several application kernels into a

single parent dot, thus evaluating the FLOPS and bytes contribution of each loop/function in the main kernel (see


the connection of kernel 1 in Figure 2.8). This feature increases the insightfulness of the Intel Advisor Roofline,

since it eases the source code analysis with the hierarchical application characterization.

As can be observed, the Intel Advisor CARM provides a set of powerful tools for in-depth application performance analysis on a given architecture, and it eases the selection of optimization techniques that can be applied

to increase application performance. This, in turn, avoids wasting time on micro-optimizations that do not contribute greatly to the overall performance of the application. In the scope of this Thesis, Intel Advisor is mainly used to facilitate application analysis. In particular, this tool allows identifying the hotspots with the biggest impact on the overall execution time of the application, which are the main focus of this Thesis when characterizing the

applications. Besides, Intel Advisor provides the assembly code for all the measured kernels, allowing an extensive analysis of the instructions and instruction set extensions utilized by each hotspot. Since the Intel Advisor CARM and its hierarchical version represent the first steps towards insightful application characterization,

these charts are used as baselines for comparison with the CARM extensions proposed in this Thesis. Finally, the

instruction mix analysis in Intel Advisor might provide additional insights into application design and code quality,

as well as the additional hints on possible execution bottlenecks.

However, the contributions of this Thesis greatly surpass the pure utilization of a set of Advisor features. In

particular, a special focus is given to uncovering the CARM construction methodology used to provide the visual representation of the model. This analysis allows pinpointing the shortcomings of the state-of-the-art CARM implementation, which may result in inconclusive (or even misleading) characterization and optimization hints for a set of

real-world applications. In this respect, this Thesis also proposes a set of strategies and recommendations on how

to further improve the roofline insightfulness not only for the Intel Advisor CARM implementation, but also for

the roofline modeling in general.

2.2.2 Power, Energy and Energy-Efficiency Roofline Modeling

In the past, the main concern in application optimization was performance maximization. However, due to technology and architectural constraints, recent trends are more focused on energy-efficient

execution. In order to address this problem, ORM and CARM were extended to provide power consumption,

energy consumption and energy-efficiency modeling of CPU micro-architectures [5, 24, 25]. These new models

use approaches similar to those adopted in the respective performance models, thus they inherit all the previously

mentioned differences that occur between CARM and ORM in the performance domain.

Original Roofline Model

The authors in [24–26] applied ORM performance principles to power, energy and energy-efficiency. While in

the performance model, computation time and memory transfer time can overlap, the energy consumption when performing computations (φ × εflop) and memory transfers (βD × εmem) cannot follow this principle. In [24, 25], by considering a constant power (π0), which does not depend on any executed operation, the total energy consumption of an application is expressed as:

E = φ × εflop + βD × εmem + π0 × T = φ × εflop × ( 1 + Bε/OI + (π0/εflop) × (T/φ) ) ,   (2.4)


where φ represents the amount of flops performed and βD is the amount of DRAMbytes transferred. Equation

(2.4) depends on three parameters: the constant energy per flop (εflop), the constant energy per byte (εmem), and the energy provided by the constant power (π0 × T), which is linear in time [24–26]. From this equation, the

models for energy-efficiency (φ/E) and power (E/T) can be derived, which are represented in Figures 2.9a and 2.9b, respectively, for the Intel Ivy Bridge 3770K. Both figures are plotted with the operational intensity on the x-axis and the respective metric (power or efficiency) on the y-axis.

(a) ORM Energy-Efficiency model [26]. (b) ORM Power Model [26].

Figure 2.9: Original Roofline Models for energy-efficiency and power consumption.

In the energy-efficiency ORM extension, the energy balance point (Bε = εmem/εflop) is introduced, which corresponds to the operational intensity at which bytes and flops consume the same amount of energy. This parameter

defines the memory bound and compute bound regions, from the efficiency point of view as shown in Figure 2.9a

[24–26]. Regarding the ORM power model (Figure 2.9b), it is worth mentioning that the maximum power corresponds to the ridge point of the performance model. Furthermore, when the OI increases, the application moves deep inside the compute bound region and, as expected, its average power tends to the computation power. On the other hand, by reducing the OI, the workload becomes memory bound and its power becomes limited by DRAM [24–26].
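Equation (2.4) and the balance point can be sketched numerically; the per-operation energies and the constant power used below are illustrative assumptions:

```python
# ORM energy model, Eq. (2.4), with assumed (illustrative) constants.
eps_flop = 0.5e-9  # energy per flop (J), assumed
eps_mem = 2.0e-9   # energy per DRAM byte (J), assumed
pi0 = 10.0         # constant power (W), assumed

def orm_energy(flops, dram_bytes, time_s):
    """E = phi*eps_flop + beta_D*eps_mem + pi0*T."""
    return flops * eps_flop + dram_bytes * eps_mem + pi0 * time_s

B_eps = eps_mem / eps_flop  # energy balance point (flops/DRAMbyte)

# At OI == B_eps, the flops and the DRAM bytes consume equal energy.
phi, beta_D = 4.0e9, 1.0e9  # OI = phi/beta_D = 4 == B_eps
assert abs(phi * eps_flop - beta_D * eps_mem) < 1e-12
```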

Cache-Aware Roofline Model

Recently, CARM principles were applied to model the power consumption and energy-efficiency upper-bounds

of modern multi-cores with multiple levels of memory hierarchy [5]. For this purpose, the multi-core system is

modeled in three internal power domains: the core domain (Pc), which corresponds to the power consumed by the units related to instruction execution and the memory subsystem; the uncore domain (Pu), related to the power consumption of the remaining on-chip components; and the package domain (Pp), i.e., the overall power consumed by the chip. The

relation between these three domains is given by

Pp = Pc + Pu .   (2.5)

Through a set of experimental benchmarks, performed on the Intel Ivy Bridge 3770K, with three cache levels and DRAM, the power consumed by the FP units and the memory subsystem is obtained, as presented in Figure 2.10. Correspondingly, the core domain can be divided into two parts: the memory subsystem and the FP units. In the memory subsystem (presented in Figure 2.10a), due to the increase in activity of the different cache levels, the core domain power, i.e., Pβc,y, where y ∈ {L1, L2, L3, DRAM}, increases from L1 to L3 caches, since more cache levels

are used when data is fetched further away from the core. However, when accessing DRAM, the bandwidth seen

from the core is reduced and the activity in cache diminishes (stalls while data is not fetched from DRAM), which


causes a reduction in the power of the core domain. As for the uncore domain power (Pβu), it is constant for cache accesses and increases when the DRAM is used, since the memory controller and the interconnect are more intensely used.

(a) Memory subsystem (b) FP units

Figure 2.10: Power consumed by the processor Intel 3370K Ivy Bridge [5]

Regarding the FP units power (presented in Figure 2.10b), the core domain power (Pφc ) initially increases with

the number of performed FP operations, stabilizing when the maximum performance is achieved. The uncore

power (Pφu ) is constant, since only arithmetic units are utilized, and it is equal to the uncore power of caches.

When FP operations and memory operations are simultaneously performed, they share certain components in the processor's pipeline. Thus, Pβc,y and Pφc include two main components: 1) a variable power contribution (Pv,βc,y and Pv,φc); and 2) the constant power of the chip (Pqc), due to the shared components. Hence, the power consumption in the core domain when only FP operations are performed is expressed as Pφc = Pqc + Pv,φc, while the power consumption in the core domain that corresponds to a given memory level y is given by Pβc,y = Pqc + Pv,βc,y. Pv,φc and Pv,βc,y correspond to the variable power of the FP units and of memory level y, respectively. Based on these parameters, the power of the core domain, for a given memory level y, is calculated by:

Pc,y(AI) = Pqc + Pv,βc,y × min{ 1, Fp/(By × AI) } + Pv,φc × min{ 1, (By × AI)/Fp } ,   (2.6)

where By is the bandwidth of the memory level y [5].

Furthermore, the uncore power only varies when DRAM accesses are performed. Consequently, Pφu = Pβu,y = Pqu when y ≠ D→C. Therefore, the uncore domain power is given by

Pu,y(AI) = Pqu + Pvu × TD/T(AI) = Pqu + Pvu × min{ 1, Fp/(BD→C × AI) } ,   (2.7)

where Pvu is the variable power consumed by the uncore components, BD→C is the DRAM bandwidth seen from

the core, and TD is the time spent serving the DRAM requests.
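Equations (2.6) and (2.7) can be evaluated with a short sketch; every power and bandwidth figure below is an assumed value, not a measurement of the modeled processor:

```python
# Core (Eq. 2.6) and uncore (Eq. 2.7) power CARM; all constants are assumed.
F_p = 512.0                                    # Gflops/s
B = {"L1": 730.0, "L3": 180.0, "DRAM": 30.0}   # GB/s, seen from the core
Pq_c, Pv_phi_c = 15.0, 20.0                    # W: constant / FP variable
Pv_beta_c = {"L1": 10.0, "L3": 14.0, "DRAM": 9.0}  # W: per-level variable
Pq_u, Pv_u = 8.0, 5.0                          # W: uncore constant / variable

def core_power(ai, y):
    return (Pq_c
            + Pv_beta_c[y] * min(1.0, F_p / (B[y] * ai))
            + Pv_phi_c * min(1.0, B[y] * ai / F_p))

def uncore_power(ai, y):
    if y != "DRAM":          # uncore power only varies for DRAM accesses
        return Pq_u
    return Pq_u + Pv_u * min(1.0, F_p / (B["DRAM"] * ai))

# At the L1 ridge point both contributions are fully active (maximum power).
ridge = F_p / B["L1"]
assert abs(core_power(ridge, "L1") - (Pq_c + Pv_beta_c["L1"] + Pv_phi_c)) < 1e-9
```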

Finally, the package domain power results from the sum of the equations (2.6) and (2.7), producing the analytic

power CARM, presented in Figure 2.11a. In particular, at the ridge point, since Tc = Tm = T, both the FP units and the memory subsystem contribute the most to the overall power. Thus, while in the performance CARM the ridge

point indicates the best operating point (lowest AI to achieve maximum performance), the ridge point in power

CARM corresponds to the maximum power consumption. As the AI increases towards the compute bound region,


the power consumption asymptotically decreases towards the power consumed when only FP computations are

performed. On the other hand, as the AI reduces in the memory bound region, the power consumption decreases

to the one consumed when only performing the memory transfers to/from the specific memory level y.

However, this model does not take into account the transitions between memory levels, which occur gradually, as shown in Figure 2.10a. Hence, the total power CARM [5] is developed (presented in Figure 2.11b),

including all the possible transitional power consumption states and defining an upper-bound for the power con-

sumption of the micro-architecture.

(a) Analytic power CARM (b) Total power CARM

Figure 2.11: Power CARM Models for Intel 3370K Ivy Bridge [5]

Based on the power CARM equations, the total energy and energy-efficiency models can be derived. In the

core domain, the energy CARM for a given memory level y is defined as:

Ec,y(AI) = Pc,y(AI) × T(AI) = φ × [ Pqc / min{ By × AI, Fp } + Pv,βc,y/(By × AI) + Pv,φc/Fp ] ,   (2.8)

while the energy-efficiency CARM in the core domain is given by:

εc,y(AI) = Fa,y(AI)/Pc,y(AI) = φ/Ec,y(AI) = (By × AI × Fp) / ( Pqc × max{ Fp, By × AI } + Pv,βc,y × Fp + Pv,φc × By × AI ) .   (2.9)

In the memory bound region, the total energy CARM (Figure 2.12a) is almost constant, since the execution

time is completely dominated by memory operations. On the other hand, in the compute bound region, the amount

of flops increases, dominating the execution time and, consequently, increasing the energy consumption.

(a) Total Energy CARM (b) Total Energy-Efficiency CARM

Figure 2.12: Energy and Energy-Efficiency CARM for Intel 3370K Ivy Bridge


Regarding the total energy-efficiency CARM (presented in Figure 2.12b), in the memory bound region, the

lowest efficiency is obtained for the DRAM, while the highest efficiency corresponds to the L1 cache. In the

compute bound region, the energy-efficiency increases with AI and converges to the maximum efficiency of the

architecture, which can only be achieved for AI → ∞. Furthermore, the ridge point (i.e., intersection between

memory and compute roofs) in the performance CARM does not correspond to the point where the maximum

efficiency is achieved, since it refers to the maximum power consumption, which does not guarantee the most energy-efficient execution. The energy-efficiency CARM is interpreted by relying on the regions of high energy-

efficiency, which are defined starting from the minimum AI required to achieve 99% of the maximum energy-

efficiency for each level of the memory hierarchy (see Figure 2.12b).
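These 99% regions can be located numerically with a simple sweep; the parameters are again assumed, and the maximum efficiency is taken as the AI → ∞ limit of Equation (2.9):

```python
# Minimum AI reaching 99% of the maximum efficiency; constants are assumed.
F_p, B_y = 512.0, 180.0
Pq_c, Pv_beta, Pv_phi = 15.0, 14.0, 20.0

def efficiency(ai):  # Eq. (2.9), core domain
    return (B_y * ai * F_p) / (Pq_c * max(F_p, B_y * ai)
                               + Pv_beta * F_p + Pv_phi * B_y * ai)

eff_max = F_p / (Pq_c + Pv_phi)  # limit of Eq. (2.9) for AI -> infinity

ai = 0.001
while efficiency(ai) < 0.99 * eff_max:  # coarse sweep; bisection also works
    ai *= 1.01

# The 99%-efficiency region for this level starts at a fairly high AI.
assert efficiency(ai) >= 0.99 * eff_max
```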

2.2.3 Remarks on Original and Cache-aware Roofline principles

In order to better showcase the differences between ORM and CARM, Figures 2.13a and 2.13b present the

characterization of three different iterative applications in both models, namely: APP-D (limited by DRAM),

APP-L3 (limited by L3 cache) and APP-L1 (limited by L1 cache). These applications were designed to reach the

model upper-bounds corresponding to the accessed memory level.

(a) ORM [3]. (b) CARM [3].

Figure 2.13: Application with different problem sizes in Intel 3770K Ivy Bridge.

In the first iteration, the applications APP-L1 and APP-L3 are characterized equally in both models (see points

marked with “1st”), since all memory operations fetch data from DRAM. However, in the remaining iterations, the

accesses are served by the respective memories (L1 or L3), thus DRAM traffic is reduced and OI in ORM increases.

Hence, the applications move from the memory bound region to the compute bound region and, according to ORM,

it is possible to achieve FP peak performance with these applications. On the other hand, CARM AI does not

change with the number of performed iterations, since the memory traffic that is seen from the core remains the

same. In contrast with ORM, CARM shows that APP-L1 cannot have its performance further improved, while

APP-L3 performance can be boosted to achieve FP peak performance. Despite the difference in characterization,

the two workloads reach the same performance in both models, since the performance in both ORM and CARM

is reflected from the core perspective (i.e., it corresponds to the throughput of the FP units). Furthermore, APP-D is the only application that is characterized equally by both models, since all its accesses are served by DRAM.
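The divergence between the two metrics over iterations can be sketched for an APP-L3-like scenario, assuming (for illustration) that only the first sweep over the data misses to DRAM:

```python
# OI vs. AI across iterations for an iterative kernel whose working set
# fits in cache after the first sweep; all counts are invented.
flops_per_iter = 1.0e9
bytes_per_iter = 4.0e9          # bytes requested by the core each iteration
first_sweep_dram_bytes = 4.0e9  # only iteration 1 misses to DRAM (assumed)

def oi_after(iters):  # ORM: DRAM traffic stays fixed, so OI grows
    return (flops_per_iter * iters) / first_sweep_dram_bytes

def ai_after(iters):  # CARM: core-side traffic scales with iterations
    return (flops_per_iter * iters) / (bytes_per_iter * iters)

assert oi_after(10) == 10 * oi_after(1)     # OI grows with the iterations
assert ai_after(10) == ai_after(1) == 0.25  # AI stays constant
```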

By comparing CARM and ORM, it is possible to state some advantages of the former over the latter. For

example, to correctly characterize applications with ORM, it is necessary to construct several model instances, in order to take the whole memory hierarchy into account. In particular, for the previously presented application


examples, since ORM's OI is related to the data traffic between two memory levels, three different plots would be necessary to characterize these three applications correctly. On the other hand, CARM allows characterizing all the applications in a single plot, increasing the model simplicity and insightfulness. Besides, while

CARM construction depends mainly on experimental measurements [3, 5], ORM is constructed based on manufacturer datasheets (performance ORM) and/or mathematical interpolations and approximations (power, energy and energy-efficiency ORMs) [4, 24–26]. Hence, ORM does not take into account possible architectural limitations, while CARM reflects the system upper-bounds more accurately, allowing a more accurate characterization of applications and the selection of the best optimization techniques.

2.2.4 State-of-the-art approaches on extending the usability of insightful models

Micro-architecture modeling and application characterization are tackled in several state-of-the-art works, with the objective of easing application characterization and optimization, since this task can become quite challenging when considering the micro-architecture complexity and application heterogeneity. The most representative

scientific works in this research area are presented in Table 2.1.

Table 2.1: State-of-the-art works.

Paper Year Architecture Model Objective

[1, 2] 2014 CPU Other The Top-Down Method for performance analysis

[12] 2011 CPU ORM Study of NUMA systems using the roofline model

[13, 14] 2014 CPU ORM Introducing application life cycle and memory latency in the roofline model

[15] 2014 CPU ORM Extending the identification of bottlenecks in roofline model

[16] 2016 CPU ORM Application of roofline model to AMT runtimes

[20] 2016 CPU Other The Top-Down Method for power analysis

[27] 2010 CPU ORM Introducing a memory concurrency modeling approach

[28] 2012 GPU ORM Extending the roofline model to predict performance prior to implementation

[29] 2013 FPGA ORM Extending the roofline model to target FPGA performance

[30, 31] 2015 CPU Other Execution-Cache-Memory Model (ECM)

[32] 2017 GPU CARM Extending CARM to GPU architectures

[33] 2017 CPU CARM Applications analysis with Intel Advisor CARM

[34] 2017 CPU CARM PIC code performance analysis

[35] 2017 CPU CARM Monte Carlo simulations optimization

[36] 2017 CPU CARM Optimization and parallelization of B-spline based orbital evaluations

In order to improve the insightfulness and portability of different roofline modeling principles, several state-

of-the-art works propose to extend this methodology to characterize new bottlenecks and platforms. The works

presented in [12–16, 27] propose several extensions to ORM by mainly targeting the CPU micro-architectures,

while the studies presented in [28] and [29] focus on ORM applicability to GPUs and Field Programmable Gate Arrays (FPGAs), respectively.

Regarding the ORM FPGA extension proposed in [29], the roofline model is constructed based on High-Level


Synthesis (HLS) tools, in order to relate algorithm performance and I/O bandwidth. In this work, the architecture

design is fully driven by the characteristics of the specific algorithms, thus it is necessary to reconstruct the model

for each algorithm. Besides, the relation between computation capabilities and resource consumption (area) is in-

troduced by combining the ORM principles and FPGA main characteristics. The GPU extension proposed in [28],

i.e., the Boat Hull model, also represents an algorithm-based model, which does not strictly rely on any architec-

tural characteristics. This work mainly aims at predicting the algorithm performance before its implementation, by

including into ORM the information about algorithm classes [37], to characterize the algorithm as memory bound

or compute bound. Both these works, i.e., [29] and [28], can be seen as the first steps towards the ORM portability

across different platforms. Since CPU+GPU and CPU+FPGA architectures are becoming increasingly popular,

these works can help developers to define the upper-bounds of these novel heterogeneous systems.

Furthermore, ORM is also extended to NUMA systems in [12–14]. The work presented in [12] adds roofs for different performance peaks under different core utilizations (e.g., when only 1/4 of the cores are used) and different memory bandwidths due to memory imbalance (e.g., when 50% of the memory traffic is handled by a single

memory controller). The works [13, 14] introduce application characterization for different phases during execu-

tion and investigate the relation between performance, memory bandwidth, operational intensity and latency of

memory accesses. These extensions provide additional insights about the application behavior during its runtime,

thus allowing to identify the application kernels that represent the main performance bottlenecks.

The works presented in [15, 16, 27] specifically focus on multi-core processors. In [27], the main objective

is to evaluate the effects of concurrent cache misses with ORM principles, in order to identify possible memory

bottlenecks in multi-threaded applications, e.g., race conditions. In [15], the existence of additional bottlenecks is explored, by characterizing different components of the processor pipeline. In detail, by using a cycle-by-cycle analysis in an Intel Xeon processor simulator, the throughput, latency, issue and stall behavior of several components (e.g., reorder buffer, reservation station and load/store buffer) are obtained, in order to extend the ORM memory and

compute bound regions. Finally, the work presented in [16] extends the ORM to AMT runtimes in multi-cores.

The analysis of these applications is quite challenging, since the asynchronous nature hides the runtime overheads.

To address this problem, a model based on ORM is created for sequential units.

Although this Thesis is focused on CARM and multi-core CPUs, the extensions proposed in the above-referred

works, including those targeting FPGA, GPU and NUMA systems, represent orthogonal research approaches that

can be adapted for CARM in multi-core processors, in order to further increase the insightfulness and usability

of the model in a wide range of possible scenarios. To this respect, it is also worth noting that the importance

of roofline modeling can be evidenced in works that adopt similar modeling principles to provide more elaborate

micro-architecture models, such as the ECM [30, 31]. The objective of this model is to estimate the application execution time and characterize its memory bottlenecks, by estimating the access time of each memory level. Although the ECM inherits a memory bandwidth modeling approach similar to the one proposed in CARM, its construction requires a large number of hardware performance counters, which increases its complexity and limits its usability to a specific set of platforms that support the required set of counters and expose them to the end-user.

Although recent, the performance CARM [3] has already been used to aid architecture design and to optimize and characterize applications from different scientific domains [18, 38–42], while several tools were also proposed to ease CARM-based analysis [17, 43, 44]. The works presented in [33–36] use Intel Advisor


CARM to characterize the performance of different applications, as well as to investigate the impact of applied optimizations. In particular, [33] compares ORM and CARM application characterization and the insights provided by both models. Furthermore, CARM principles are extended to NVIDIA GPU architectures in [32], while the works proposed in [33, 34] use the Intel Advisor CARM on the Intel Xeon Phi architecture. These works demonstrate the CARM portability across platforms and architectures for all modeling domains, i.e., performance, power and energy-efficiency.

To the best of our knowledge, there are no scientific studies tackling CARM extensions and the characterization of real-world applications for power consumption and energy-efficiency. Thus, in order to address this problem, the main objective of this Thesis is to correlate application performance, power and energy-efficiency, and to propose additional extensions to this model, in order to further improve its insightfulness.

2.3 Open challenges in insightful modeling

Due to its simplicity, roofline modeling contains certain limitations that affect all modeled domains, i.e., performance, power consumption and energy-efficiency. For example, modern applications contain a huge diversity of instructions, such as integer and floating-point; however, existing roofline modeling methodologies only consider one type of arithmetic operations at a time (in particular, floating-point arithmetic operations). To address this issue, it is necessary to introduce awareness of a wide range of operations into roofline modeling, in order to allow a more accurate characterization of a wider range of applications from different scientific areas. This observation is directly related to the complexity of contemporary micro-architectures and the ability of existing roofline modeling approaches to fully expose their upper-bounds. As referred to in Section 2.1, modern multi-cores provide support for a vast range of instructions and different ISA extensions, which may affect the architecture performance, power and efficiency in many different ways. Hence, by focusing only on a subset of micro-architecture features and pipeline components, the existing roofline modeling approaches may hide some important bottlenecks and insights from computer architects and developers for certain types of applications.

For example, different applications may require different amounts of load and store operations to be performed, which does not necessarily result in the maximum utilization of the memory ports in the micro-architecture backend. In fact, depending on the ratio of load and store operations, certain components in the memory subsystem may be underutilized, thus provoking a significant change in the attainable bandwidth upper-bounds (for each level of the memory hierarchy). This effect necessarily implies modifications in the modeling of the memory bound regions. By introducing the load/store ratio into roofline modeling, applications can be characterized more accurately, providing important insights about their behavior. Furthermore, when focusing on CARM and its Intel Advisor implementation, it should be noted that it may provide limited characterization information for applications whose performance lies between two different roofs (slopes) in the memory bound region. Since each memory roof corresponds to a hit rate of 100% in the respective memory level, the introduction of complementary methods is required to extend the insightfulness of the memory bound region of the model.
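The impact of the load/store ratio on the attainable bandwidth can be illustrated with a toy model. The sketch below is purely illustrative and is not part of the thesis methodology: it assumes a hypothetical core with two load ports and one store port, each transferring 32 bytes per cycle at 4 GHz (all numbers are assumptions chosen for the example).

```python
# Toy model (assumption-laden, for illustration only): a core with two load
# ports and one store port, each moving 32 bytes/cycle at 4 GHz.
def effective_bandwidth(load_ratio, load_ports=2, store_ports=1,
                        bytes_per_access=32, freq_ghz=4.0):
    """Attainable bandwidth (GB/s) for a given fraction of load accesses."""
    store_ratio = 1.0 - load_ratio
    # Cycles consumed per access by each port group; the busier group
    # limits the sustained access rate.
    limit = max(load_ratio / load_ports, store_ratio / store_ports)
    accesses_per_cycle = 1.0 / limit
    return accesses_per_cycle * bytes_per_access * freq_ghz

print(effective_bandwidth(1.0))      # pure loads: both load ports busy
print(effective_bandwidth(0.0))      # pure stores: single store port limits
print(effective_bandwidth(2.0 / 3))  # 2:1 load/store mix: all three ports busy
```

Under these assumptions, a pure-load stream reaches 256 GB/s, a pure-store stream only 128 GB/s, and a 2:1 load/store mix peaks at 384 GB/s, which shows why a single memory roof cannot capture all instruction mixes.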

These open challenges represent some of the main research topics tackled in this Thesis via extensive micro-

architecture benchmarking in order to uncover the upper-bounds for different instruction types and mixes. Based

on this evaluation on real hardware, different flavors and types of CARMs for all modeling domains (performance,


power consumption and energy-efficiency) are proposed and experimentally validated. The proposed models aim

at extending the insightfulness of the existing models when determining the application execution bottlenecks and

selecting the best optimization strategy. Furthermore, novel and alternative roofline modeling approaches are also

investigated in this Thesis, in order to provide support for a wider range of applications, which are not necessarily dominated by FP operations.

2.4 Summary

This chapter starts by presenting an overview of the main aspects of a standard Intel core pipeline. In particular, the Intel Skylake 6700K and Intel Ivy Bridge 3770K processors are analyzed in detail, stating some of the enhancements implemented between processor generations. When comparing both processors, there is a clear improvement in the throughput of the Intel Skylake 6700K FP units and memory subsystem, in order to fulfill increasing computational demands.

Moreover, two micro-architecture modeling methods used in the scope of this Thesis are introduced, i.e., the Top-Down Method and roofline modeling. The Top-Down analysis allows the main application bottlenecks to be decoupled according to the processor capabilities, organizing several performance limiters into a hierarchical tree. Since these metrics allow a better understanding of the main factors that limit application performance, the Top-Down method plays an important role when validating the insights provided by the proposed CARM extensions, which are one of the main objectives of this Thesis. Regarding roofline modeling, the two main existing approaches are introduced,

i.e., CARM and ORM, for performance, power consumption and energy-efficiency. Besides, CARM and ORM

approaches are compared, stating their main differences in application characterization. Moreover, Intel Advisor

CARM implementation is also analyzed, providing a first look at its main features and their usability and insight-

fulness.

Furthermore, several state-of-the-art works are analyzed, verifying the usability of insightful modeling and, in

particular, of roofline modeling in application characterization and micro-architecture benchmarking for different

architectures and accelerators, such as GPUs, FPGAs and many-core systems.

Finally, the chapter finishes with a discussion of some current CARM limitations. The state-of-the-art CARM implementations do not take into account different processor capabilities, such as different instructions, instruction set extensions and load/store ratios. Since an application can exercise different processor capabilities, extending the CARM analysis to include this information is extremely important to provide a more accurate workload characterization, which might ease the design and optimization process of applications.


3. Reaching the architecture upper-bounds with micro-benchmarking

Contemporary multi-core CPUs support a wide variety of instruction types and extended instruction sets, in order to fulfill the computational demands of modern applications. Thus, to identify the main bottlenecks in the application execution that prevent it from exploiting the maximum potential of a given processor architecture, it is

essential to firstly characterize and experimentally assess the sustainable upper-bound capabilities of the micro-

architecture (e.g., the realistically achievable bandwidth of different memory levels and the throughput of different

computational units). Since multi-core processors employ highly complex out-of-order engines, these parameters

depend on a variety of pipeline components, their internal structure and features, which directly impact the micro-

architecture capabilities.

To address this issue, accurate micro-architecture benchmarking is extremely important, since it allows characterizing the system throughput upper-bounds for the memory subsystem and arithmetic units. Moreover, by relying on

experimental benchmarking of the micro-architecture, it is possible to assess the realistic architectural limitations

and upper-bounds, which do not necessarily correspond to the nominal (theoretical) specifications provided by

the vendors in data-sheets. In fact, the experimental evaluation may also reveal the properties that are not even

disclosed in vendor specifications. However, designing a set of micro-architecture benchmarks to fully exercise

different components in the processor pipeline is not a trivial task, as it is shown in this chapter.

In this chapter, an extensive set of benchmarks is constructed and performed for Intel Ivy Bridge and Intel Sky-

lake micro-architectures to deeply evaluate the capabilities of different subsystems in their pipeline. In particular,

the benchmarks were created to evaluate the throughput upper-bounds for complete memory hierarchy (caches and

DRAM), as well as for different types of FP units, by considering a diverse set of instructions and/or extended

instruction sets. In particular, memory subsystem benchmarking considers different load/store ratios, which affect

the sustainable memory bandwidth of the system. Hence, assessing the impact of the load/store ratios can provide additional insights about micro-architecture capabilities and allow a more accurate application characterization (tailored according to the application characteristics/demands). Moreover, in this chapter, a specifically developed tool is introduced, describing its workflow and hardware counter access. This tool is designed to support the fine-grained experimental evaluation on real hardware using a set of architecture-specific benchmarks, also developed

in the scope of this Thesis. In addition, the structure of each benchmark is presented, explaining its construction

and methods selected to improve its quality and reliability.

3.1 Tool for fine-grain micro-architecture benchmarking

In order to measure the number of elapsed cycles, the number of performed memory/arithmetic instructions, and the energy consumption in different parts of the processor chip, it is necessary to access a set of hardware counter registers, i.e., the Model Specific Registers (MSRs) built into the processor [45]. Each MSR is identified by its


[Figure: workflow diagram of the tool. After the interface initialization, N threads are created with pthread create(); each thread configures its counters and then executes two measurement stages (overhead code and computations), each repeated 1024 times and bracketed by TSC reads and counter start/stop operations. After pthread join(), the median values are reported.]

Figure 3.1: Benchmarking tool general layout.

unique address, which is used to read and modify the register content, e.g., to obtain measurements or to configure

the counters. These operations can be executed with the assembly instructions rdmsr (to read the counter value)

and wrmsr (to configure the counter) [45]. However, the access to MSRs can only be performed from kernel

space. Hence, to access the counters from user space (e.g., during the application run-time), it is necessary to

incorporate a separate kernel module to connect both kernel and user sides. In order to achieve this functionality

and to obtain the processor throughput upper-bounds for different memory subsystem levels and arithmetic units,

a benchmarking tool was developed, whose layout is presented in Figure 3.1.

The tool relies on the kernel module from [46], which provides the communication interface between the user space and kernel space through a set of system calls. In the scope of this Thesis, the kernel module was modified to improve its execution efficiency, reduce the overheads and incorporate additional functionalities. For example, the tool was upgraded to allow access to the Running Average Power Limit (RAPL) interface, in order to measure the energy consumption in different parts of the processor chip (core, uncore and the overall processor chip, i.e., package). Besides, it was also enriched to support the entire set of uncore events, allowing measurements to be performed in a wide range of platform components.

As shown in Figure 3.1, after the tool initialization, the threads are created using the pthreads interface. In each thread, besides the Time Stamp Counter (TSC) monitoring that guarantees an accurate measurement of elapsed clock cycles, the counters to obtain the desired performance measurements are configured. In this part, the MSR configuration is created on the user side and forwarded to the kernel space, together with the desired counter address and command (read or write), by using the system calls and the assembly instructions rdmsr and wrmsr.

To configure the counters, three main steps are performed. First, it is necessary to enable the counters by configuring the IA32_PERF_GLOBAL_CTRL MSR [45]. In this MSR, the first 8 bits enable the general purpose counters, while the bits from 32 to 34 enable the fixed counters. Thus, to enable all the counters, all these bits must be set to 1. Next, each general purpose counter must be configured by using the respective IA32_PERFEVTSEL MSR [45]. In this register, the event select and unit mask of the desired hardware performance counter must be written in bits 0 to 7 and 8 to 15, respectively. Moreover, it is also possible to define counter masks (e.g., to count only the number of cycles when more than 4 instructions are delivered to the core) and to select counting in user mode, kernel mode or both. Finally, the measurement is read from the respective IA32_PMC MSR [45]. It is important to notice that the Intel Skylake 6700K and Intel Ivy Bridge 3770K have a limited number of hardware counters that can be accessed at any given time. In particular, both micro-architectures only support 4 counters per core in hyper-threading mode and 8 when hyper-threading is disabled [45].
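The bit-field layouts described above can be sketched as follows. This is an illustrative encoding helper based on the field positions given in the text, complemented with the standard USR/OS/EN bit positions (16, 17 and 22) of the event-select registers; it is not the tool's actual code.

```python
# Illustrative encoding of the MSR fields described in the text.
def global_ctrl_value(n_gp=8, n_fixed=3):
    """IA32_PERF_GLOBAL_CTRL: bits 0..7 enable the general purpose
    counters, bits 32..34 enable the fixed counters."""
    return ((1 << n_gp) - 1) | (((1 << n_fixed) - 1) << 32)

def perfevtsel_value(event, umask, user=True, kernel=True, cmask=0):
    """IA32_PERFEVTSELx: event select in bits 0-7, unit mask in bits 8-15,
    USR/OS mode bits in 16-17, enable bit 22, counter mask in bits 24-31."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8)
    value |= (int(user) << 16) | (int(kernel) << 17)
    value |= 1 << 22                      # enable the counter
    value |= (cmask & 0xFF) << 24         # e.g., threshold-based counting
    return value

# Enabling all 8 general purpose and 3 fixed counters:
# global_ctrl_value() == 0x7000000FF
```

The resulting 64-bit values would then be written with wrmsr through the kernel module, and the measurements read back from the corresponding IA32_PMC registers.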

Since the RAPL MSRs are read-only and do not operate at the per-core level, their access is performed


through batches containing all the necessary addresses, in order to obtain all the readings in a single communication between user and kernel spaces, thus minimizing the overheads imposed by the tool when performing the micro-architecture experimental evaluation. Independently of the executed benchmarks, the energy consumption in Intel Ivy Bridge and Intel Skylake is reported in several registers that refer to different domains of the processor chip, i.e., MSR_PP0_ENERGY_STATUS (core energy usage) and MSR_PKG_ENERGY_STATUS (socket energy usage). The difference between these two counters is referred to herein as the uncore (off-core) energy usage. In addition, although not officially supported in the data-sheets, the tested Intel Skylake processor (6700K) also allows measuring the DRAM energy usage with the MSR_DRAM_ENERGY_STATUS counter, which provides an estimation of the DRAM energy consumption [45].
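Converting the raw RAPL readings into Joules can be sketched as below. The sketch assumes the commonly documented layout in which MSR_RAPL_POWER_UNIT carries the energy status units in bits 8-12 (energy per count = 1/2^ESU J) and the energy status counters are 32-bit values that wrap around during long runs; it is an illustration, not the tool's source.

```python
# Hedged sketch: raw RAPL counters -> Joules (layout assumptions above).
def energy_unit_joules(rapl_power_unit_msr):
    """Extract the Energy Status Units (ESU) field from bits 8-12."""
    esu = (rapl_power_unit_msr >> 8) & 0x1F
    return 1.0 / (1 << esu)            # Joules per counter increment

def energy_delta_joules(start_raw, stop_raw, unit):
    """Energy consumed between two 32-bit counter readings, tolerating a
    single counter wraparound during the measurement."""
    return ((stop_raw - start_raw) & 0xFFFFFFFF) * unit

# With ESU = 16, one increment corresponds to 1/65536 J (~15.3 uJ).
```

The same conversion applies to the core, package and DRAM energy status registers, since they share the counter format.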

After the MSR interface configuration, the tool provides two separate execution modes when performing the

micro-architecture benchmarking: 1) counter training to minimize the overheads; and 2) benchmark execution.

The first stage aims at reducing the impact of micro-benchmarking overheads, since some hardware counters may not provide the most accurate measurements (counting overheads). Besides, there are also certain portions of the benchmark code that contain instructions from the benchmark skeleton (overhead code), e.g., loop control instructions, which are not the main subject of the experimental evaluation. Thus, it is possible to “train” the counters by subtracting the overhead measurements from the measurements obtained in stage two. The overhead code is placed before the benchmarked code, in a distinct inline function. To implement this correction, the TSC, performance counters and energy consumption registers are read in both stages. Once the parallel execution finishes, the tool reports the median values obtained from 1024 runs.
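The correction performed across the two stages can be summarized with a short sketch (a hypothetical helper, not the tool's source): the median of the overhead-only runs is subtracted from each benchmark run before the final median is reported.

```python
from statistics import median

def corrected_measurement(overhead_runs, benchmark_runs):
    """Subtract the median overhead reading (stage one) from each raw
    benchmark reading (stage two) and report the median over all runs."""
    overhead = median(overhead_runs)
    return median(value - overhead for value in benchmark_runs)
```

For example, with overhead readings [10, 12, 11] and benchmark readings [110, 112, 111], the reported value is 100.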

Algorithm 1 Generic memory benchmark
for i < time do
    for j < repeat do
        MEM INST
        MEM INST
        (...)
    end for
    MEM INST
    MEM INST
    (...)
end for

Algorithm 2 Generic FP benchmark
for i < time do
    for j < repeat do
        FP INST
        FP INST
        (...)
    end for
    FP INST
    FP INST
    (...)
end for

The general structure of the developed micro-benchmarks for the evaluation of the upper-bound capabilities of the memory subsystem and FP units is shown in Algorithms 1 and 2, respectively. As can be observed, both types of test codes share a similar structure and consist of two loops. The outer loop ensures that the performed test attains a certain predefined time duration, in order to increase the evaluation accuracy for benchmarks with small amounts of flops and bytes. In addition, since the core and socket RAPL counters take 50 ms to update their value, the outer loop is essential to guarantee the stability of the energy consumption measurements. By executing a set of tests with different time durations, it was experimentally assessed that accurate and stable readings were achieved when each test iteration took approximately 100 ms.

When designing the micro-benchmarks, special attention must be paid to the number of instructions that can be placed in the micro-benchmark body (see MEM INST and FP INST in Algorithms 1 and


2). For example, having too many instructions may provoke evictions from the L1 instruction cache. In this

scenario, additional memory transfers occur from the unified L2 cache, whose increased utilization for instructions

may impact the L2 bandwidth evaluation for pure data transfers. Furthermore, this scenario causes an increase

in the measured power, which will severely degrade the evaluation accuracy regardless of the type of instructions

being tested. For example, for the FP benchmark, the obtained power consumption will not reflect the power

consumed exclusively by the FP units, but it will represent the superposition of power consumption of FP units

and the respective instruction cache. By performing tests with different amounts of instructions, this phenomenon was observed after approximately 980 instructions. To overcome this issue, an inner loop with a fixed size is

introduced within the benchmark structure.

On the other hand, for the inner loop, it is also necessary to take into account the opposite effect, i.e., having too few instructions inside the loop. If the inner loop contains a small number of instructions, they will fit inside the LSD. In this case, the processor may shut down (or clock gate) some components in the decoding pipeline, thus reducing the power consumption in the frontend. Hence, in order to avoid unexpected power measurements, the inner loop needs to have a size of at least 64 instructions, which is sufficient to eliminate this effect.

Based on this set of restrictions, the inner loop size and the number of necessary repetitions are calculated for each benchmark, according to the total amount of instructions to be performed, in order to maximize the number of instructions executed in the inner loop. Since the amount of instructions may not be a multiple of the loop size, the remaining instructions are placed after the inner loop, as shown in Algorithms 1 and 2. For example, by considering a total of 255 instructions and an inner loop size of 64 instructions, the inner loop would have three repetitions and 63 instructions would be placed after it. It is also worth emphasizing that the tests were designed to minimize register dependencies, which can only occur when all the available registers are used. However, due to the high amount of repetitions, these effects are effectively hidden by the architecture's out-of-order engine.
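The placement rule from the example above can be written down directly (an illustrative helper; the 64-instruction body size comes from the LSD constraint discussed earlier):

```python
def split_instructions(total, body_size=64):
    """Return (inner-loop repetitions, instructions placed after the loop)
    so that as many instructions as possible execute inside the loop."""
    repeats = total // body_size
    return repeats, total - repeats * body_size

# The example from the text: 255 instructions with a 64-instruction body
# give 3 inner-loop repetitions and 63 trailing instructions.
print(split_instructions(255))
```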

3.2 Micro-architecture benchmarking

In this section, the previously described tool (see Figure 3.1) and a set of specifically designed micro-benchmarks are used to perform a fine-grain experimental evaluation of different Intel micro-architectures, in order to uncover their maximum capabilities for different parts of the CPU engine. To fully characterize each core component, a set of benchmarks is performed, with different amounts of executed instructions. These experiments are performed for two different Intel Core client processors at their nominal frequencies, namely the Intel Ivy Bridge 3770K (3.5 GHz) and the Intel Skylake 6700K (4 GHz). Both processors have three cache levels (L1 with 32 KB, L2 with 256 KB and L3 with 8 MB) and DRAM, with 32 GB in Intel Skylake and 8 GB in Intel Ivy Bridge. The micro-benchmarks are run on CentOS 7.2.1511 and compiled with Intel Compiler 17.0.4.196.

3.2.1 Exploring the maximum compute performance

As referred to in Section 2.1, Intel Ivy Bridge and Intel Skylake micro-architectures greatly differ in the compute

capability, especially for DP FP arithmetics. In particular, Ivy Bridge only provides separate FP units for AVX

ADD and MUL operations (MAD) in two different ports, while Intel Skylake includes the full AVX FMA support

in each of the two available ports for FP arithmetics. For this reason, two different micro-benchmarks are developed


for each micro-architecture, i.e., the benchmark codes for MAD (Intel Ivy Bridge) and FMA (Intel Skylake), as

presented in Algorithms 3 and 4, respectively (both using AVX SIMD DP instructions).

To measure the amount of performed AVX SIMD DP instructions, the counters FP_ARITH:256B_PACKED_DOUBLE (Intel Skylake) and SIMD_FP_256:PACKED_DOUBLE (Intel Ivy Bridge) are configured [45]. These tests follow the structure previously presented in Algorithm 2, by substituting the macro FP INST with the following instructions: mulpd and addpd for Intel Ivy Bridge, and vfmadd132pd for Intel Skylake.

Algorithm 3 MAD DP AVX Benchmark for Intel Ivy Bridge
for i < time do
    for j < repeat do
        mulpd %ymm0,%ymm0,%ymm0
        addpd %ymm1,%ymm1,%ymm1
        (...)
        mulpd %ymm14,%ymm14,%ymm14
        addpd %ymm15,%ymm15,%ymm15
    end for
    mulpd %ymm0,%ymm0,%ymm0
    addpd %ymm1,%ymm1,%ymm1
    (...)
end for

Algorithm 4 FMA DP AVX Benchmark for Intel Skylake
for i < time do
    for j < repeat do
        vfmadd132pd %ymm0,%ymm0,%ymm0
        vfmadd132pd %ymm1,%ymm1,%ymm1
        (...)
        vfmadd132pd %ymm12,%ymm12,%ymm12
        vfmadd132pd %ymm13,%ymm13,%ymm13
    end for
    vfmadd132pd %ymm0,%ymm0,%ymm0
    vfmadd132pd %ymm1,%ymm1,%ymm1
    (...)
end for

By using the previously elaborated micro-benchmarking methodology within the developed tool, an extensive

set of benchmarks was performed on each tested micro-architecture by varying the amount of performed FP op-

erations. In particular, in each one of the processors, more than 3000 tests were performed to obtain the desired

results. Each test was repeated 1024 times and the median value for counter measures and TSC was reported, in

order to obtain stable results and more accurate characterization of processor capabilities. The obtained experi-

mental results for single-core and multi-core MAD and FMA performance are presented in Figures 3.2a and 3.2b,

for Intel Ivy Bridge and Intel Skylake, respectively.

[Figure: performance (GFLOPS/s, log scale) versus number of FLOPS (log scale), for single-core (1C) and four-core (4C) runs, with the “filling pipeline” region marked: (a) MAD performance for Intel Ivy Bridge 3770K; (b) FMA performance for Intel Skylake 6700K.]

Figure 3.2: FP Units maximum performance using AVX SIMD DP instructions.

As it can be observed in Figures 3.2a and 3.2b, while filling the pipeline (slanted region), the performance increases with the amount of flops performed, until approximately 40 flops per core in Intel Ivy Bridge and 48 flops per core in Intel Skylake. As such, it is necessary to perform at least 10 AVX ADD/MUL instructions on both Ivy Bridge ports (5 instructions per port) to reach the maximum AVX DP FP MAD performance. In contrast, Intel


[Figure: power consumption (W) versus number of FLOPS (log scale), showing core power for 1C and 4C runs and the constant uncore (and, for Skylake, DRAM) power: (a) MAD power consumption for Intel Ivy Bridge 3770K; (b) FMA power consumption for Intel Skylake 6700K.]

Figure 3.3: FP Units maximum power consumption using AVX SIMD DP instructions.

Skylake FMA units require 6 AVX FMAs on both ports (i.e., 3 instructions per port). These results may suggest a

much higher efficiency of FMA units implemented in the Intel Skylake architecture.

Once the pipeline is completely filled with instructions (constant region), the processor achieves its maximum throughput. It is worth emphasizing that the developed micro-benchmarks attained the theoretical maximum performance on both architectures. In particular, the single thread benchmarking achieved 28 GFLOPS/s in Intel Ivy Bridge (2 ports × 4 flops (1 MAD) × 3.5 GHz) and 64 GFLOPS/s in Intel Skylake (2 ports × 8 flops (1 FMA) × 4 GHz). For the multi-thread test (4 cores), Intel Ivy Bridge reached 112 GFLOPS/s (4×28) and Intel Skylake achieved 256 GFLOPS/s (4×64), since FP performance scales linearly with the number of cores. As it can be observed, Intel Skylake offers about 2.3× higher performance than Intel Ivy Bridge (for both single-core and multi-core performance), mainly due to the inclusion of powerful AVX FMA units operating at a higher frequency.
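The peak values above follow directly from the port count, the FLOPs carried by each AVX instruction and the clock frequency; a quick cross-check:

```python
def peak_gflops(ports, flops_per_inst, freq_ghz, cores=1):
    """Theoretical peak: ports x FLOPs-per-instruction x GHz x cores."""
    return ports * flops_per_inst * freq_ghz * cores

print(peak_gflops(2, 4, 3.5))     # Ivy Bridge, 1 core: 28 GFLOPS/s (MAD)
print(peak_gflops(2, 8, 4.0))     # Skylake, 1 core: 64 GFLOPS/s (FMA)
print(peak_gflops(2, 4, 3.5, 4))  # Ivy Bridge, 4 cores: 112 GFLOPS/s
print(peak_gflops(2, 8, 4.0, 4))  # Skylake, 4 cores: 256 GFLOPS/s
```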

The corresponding power consumption results are presented in Figures 3.3a (Intel Ivy Bridge) and 3.3b (Intel

Skylake), for three different domains of the processor chip, i.e., core, uncore and package, including both single-

core and multi-core execution scenarios. Although power consumption takes more time to stabilize at its maximum

value, a similar behavior to the one observed in the performance domain can be noticed. For single thread tests,

Intel Ivy Bridge maximum power consumption in core domain is around 13.5 W, while Intel Skylake consumes

approximately 19 W, confirming that a higher performance comes at the cost of increased power consumption

(although these two micro-architectures rely on different manufacturing technology).

When all four cores are used, Intel Ivy Bridge consumes about 40 W and Intel Skylake power consumption is

approximately 60 W. Hence, in contrast to the performance, the power consumption does not scale linearly with the number of cores. This observation may suggest the existence of shared components in the cores domain that are always active, regardless of whether a single thread or multiple threads are used. As only AVX DP FP arithmetic instructions are executed in the units inside the processing core, the uncore and DRAM power consumptions (Intel Skylake) are constant (throughout the entire test) and do not depend on the number of cores utilized. Being the

superposition of core and uncore power domains, package power follows the trend observed in the cores domain.

For this reason, as well as to improve the readability and understanding of the presented power consumption results

in this Chapter, the package power is omitted, since it typically does not provide additional insights.

By combining the experimental results obtained when evaluating the maximum computational performance

(see Figure 3.2) and power consumption (see Figure 3.3), it is possible to provide a cross-comparison of different

micro-architectures in terms of their energy-efficiency (GFlops/J). For single-core AVX DP FP arithmetics, Intel


[Figure: performance (GFLOPS/s, log scale) versus number of FLOPS (log scale) for 4-core runs, comparing MAD/FMA with ADD/MUL AVX SIMD DP: (a) FP units performance for Intel Ivy Bridge 3770K; (b) FP units performance for Intel Skylake 6700K.]

Figure 3.4: FP units performance using AVX SIMD DP instructions.

[Figure: two panels, Power Consumption [W] vs. FLOPS for the FP units test on 4 cores; (a) Intel Ivy Bridge 3770K with MAD and ADD/MUL core power and uncore power curves; (b) Intel Skylake 6700K with FMA and ADD/MUL core power, uncore power and DRAM power curves.]

Figure 3.5: FP units power consumption using AVX SIMD DP instructions.

Ivy Bridge 3770K can deliver about 2 GFlops/J (28 GFlops/s at 13 W), while Intel Skylake 6700K provides 3.4 GFlops/J (64 GFlops/s at 19 W). For multi-core execution, Skylake also outperforms Ivy Bridge in terms of energy-efficiency by delivering about 4.3 GFlops/J (versus 2.8 GFlops/J in Ivy Bridge). As can be concluded, Intel

Skylake 6700K offers significant improvements in energy-efficiency when compared to the Intel Ivy Bridge 3770K,

namely about 70% for single-core and around 54% for multi-core computations.
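The energy-efficiency figures quoted in this cross-comparison follow directly from dividing the measured performance by the measured power; a minimal sketch of this arithmetic, using the single-core values reported above:

```python
def energy_efficiency(gflops_per_s: float, watts: float) -> float:
    """Energy-efficiency in GFlops/J: (GFlops/s) / (J/s) = GFlops/J."""
    return gflops_per_s / watts

# Single-core AVX DP FP measurements reported in this section
ivy_single = energy_efficiency(28.0, 13.0)  # Intel Ivy Bridge 3770K, about 2.15 GFlops/J
sky_single = energy_efficiency(64.0, 19.0)  # Intel Skylake 6700K, about 3.37 GFlops/J
```

The same division, applied to the 4-core measurements, yields the multi-core efficiencies quoted above.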

As previously referred, modern multi-core processors support a variety of compute instructions/units, e.g.

ADD, MUL and FMA, which influence their performance upper-bounds. Hence, in order to provide a full charac-

terization of compute capabilities of modern processors, it is necessary to benchmark arithmetic units for different

FP instructions. The performance results obtained for 4 cores and different FP instructions are presented in Figures

3.4a and 3.4b, for Intel Ivy Bridge and Intel Skylake, respectively.

As expected, Intel Ivy Bridge and Intel Skylake achieve the maximum performance when MAD and FMA

instructions are performed, respectively. In both systems, ADD and MUL operations achieve the same throughput

and a maximum performance equal to half of that achievable with MAD instructions in the Intel 3770K Ivy Bridge or FMA instructions in the Intel 6700K Skylake processor, respectively. As can be observed, the benchmark tests are able to attain the maximum theoretical performance of each instruction, i.e., ADD and MUL achieve 56 GFLOPS/s in Intel Ivy Bridge, while Intel Skylake, due to the enhancements in its architecture, is able to attain 128 GFLOPS/s for each of these instructions.
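These peak values can be cross-checked against the micro-architectural parameters: number of cores, DP elements per AVX vector (4), flops per instruction, number of FP ports supporting the instruction, and the nominal frequency. A minimal sketch, assuming nominal frequencies of 3.5 GHz (3770K) and 4.0 GHz (6700K), and counting MAD on Ivy Bridge as one ADD plus one MUL issued per cycle on the two FP ports:

```python
def peak_gflops(cores: int, lanes: int, flops_per_op: int, ports: int, ghz: float) -> float:
    """Theoretical peak: cores x SIMD lanes x flops/op x ports x frequency (GHz)."""
    return cores * lanes * flops_per_op * ports * ghz

# Intel Ivy Bridge 3770K, AVX DP (4 lanes), 3.5 GHz nominal
ivb_mad    = peak_gflops(4, 4, 1, 2, 3.5)  # ADD and MUL ports used together -> 112 GFLOPS/s
ivb_addmul = peak_gflops(4, 4, 1, 1, 3.5)  # a single ADD (or MUL) port -> 56 GFLOPS/s

# Intel Skylake 6700K, AVX DP (4 lanes), two FMA-capable ports, 4.0 GHz nominal
sky_fma    = peak_gflops(4, 4, 2, 2, 4.0)  # FMA: 2 flops/op on 2 ports -> 256 GFLOPS/s
sky_addmul = peak_gflops(4, 4, 1, 2, 4.0)  # ADD/MUL: 1 flop/op on 2 ports -> 128 GFLOPS/s
```

With one core instead of four, the same formula reproduces the 28 GFLOPS/s (Ivy Bridge) and 64 GFLOPS/s (Skylake) single-core peaks reported earlier.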

As observed in power consumption results presented in Figures 3.5a (Intel Ivy Bridge) and 3.5b (Intel Skylake),

for multi-core computations, the highest power consumption corresponds to the instruction type that guarantees

the maximum throughput in each platform. Regarding ADD and MUL instructions, their power consumption is


[Figure: two panels for Intel Skylake 6700K (4 cores); (a) Performance [GFLOPS/s] vs. FLOPS and (b) Power Consumption [W] vs. FLOPS, for FMA AVX SIMD DP, FMA SSE SIMD DP and FMA Scalar DP; the pipeline-filling region is marked in (a).]

Figure 3.6: FP units performance and power consumption for different instruction set extensions in Intel Skylake 6700K (4 cores).

equal within the same processor, i.e., 33 W in Intel Ivy Bridge and 56 W in Intel Skylake, which represents an

increase of 23 W between these two generations. Besides, it can be verified that for the different instruction types

(units) and the same data precision, higher performance implies higher power consumption for both architectures.

In what concerns the energy-efficiency for multi-core FP ADD/MUL computations, a decrease of about 35%

for Ivy Bridge and 55% for Skylake can be inferred when compared to their maximum achievable energy-efficiency

for MAD or FMA instructions, respectively. However, Intel Skylake 6700K still offers a better energy-efficiency of

about 2.3 GFlops/J versus 1.7 GFlops/J in Intel Ivy Bridge 3770K, i.e., about 35% energy-efficiency improvement.

The throughput of a multi-core processor also depends on the used instruction set extension, e.g., AVX, SSE or

scalar. The performance results obtained for FP units, when using different extensions, are presented in Figure 3.6a,

for Intel Skylake 6700K (all 4 cores). In these tests, Intel Ivy Bridge results are not presented, since the analysis

is similar to the ones already performed in this chapter and it would not provide additional insights. As it can be

observed in Figure 3.6a, the maximum attainable performance is achieved for all three tests, i.e., 256 GFLOPS/s

for FMA AVX DP, 128 GFLOPS/s for FMA SSE DP and 64 GFLOPS/s for FMA Scalar DP. As expected, SSE

performance is half of AVX, since SSE vector length only handles two flops per instruction. Moreover, Scalar DP

performance is half of SSE, since each scalar instruction only computes one flop at a time.

Regarding the power consumption, presented in Figure 3.6b, the highest power consumption in FP units is

achieved when using AVX instructions (approximately 60 W). SSE and scalar instructions attain lower power

consumption, i.e., 45 W and 29.9 W, respectively. Hence, from the energy-efficiency point of view, FMA AVX DP allows achieving 4.27 GFlops/J (256 GFLOPS/s at 60 W), while FMA SSE DP and FMA Scalar DP only achieve 2.84 GFlops/J (128 GFLOPS/s at 45 W) and 2.14 GFlops/J (64 GFLOPS/s at 29.9 W), respectively. Thus, in order to use the full potential of Intel Skylake 6700K from the energy-efficiency point of view, AVX instructions must be utilized.

In order to assess the quality of developed micro-benchmarks and their ability to fully exercise the FP units in

the processor architecture, the Top-Down Method [2] (see Section 2.1) was applied to the constructed benchmarks

when evaluating the maximum performance of the Intel Skylake 6700K for AVX DP FP FMA operations in all

four cores. The results obtained with the Top-Down analysis are presented in Figure 3.7, which gives a breakdown

of the predominant sources of performance bottlenecks in different parts of the processor pipeline.

Since only arithmetic operations are performed, the memory subsystem does not limit performance, thus the memory


[Figure: Top-Down breakdown (fraction of issue slots, 0 to 1) vs. FLOPS for FMA AVX SIMD DP on Intel Skylake 6700K (4 cores), showing the Frontend Bound (FE), Retiring (RET) and Core Bound (Core) contributions.]

Figure 3.7: Top Down Method for FMA AVX SIMD DP at nominal frequency.

bound contribution is zero. Besides, the frontend does not stall the backend execution, thus the frontend bound and bad speculation categories also do not diminish performance. As can be observed, before hitting the maximum performance, i.e., while filling the pipeline, the main bottleneck is core bound, since the number of in-flight instructions is insufficient for the processor to achieve its maximum retirement rate. Thus, the utilization of dispatch ports is

the main performance limiter. In contrast, when maximum performance is achieved, the processor can only retire

two FP instructions per cycle, i.e., only half of the retirement slots are used. As a result, the retiring contribution

is around 50%. Finally, since only half of the dispatch ports reserved for computations are used (the remaining

two ports do not support AVX DP FP arithmetics), the core bound contribution is around 50%. As it can be

concluded, the developed micro-benchmarks were capable of fully exploiting the processor capabilities for AVX

DP FP computations, by achieving the maximum possible retirement rate and core utilization, while exhibiting

negligible (or zero) execution overheads in the other pipeline domains.
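The 50% retiring fraction discussed above comes from the Top-Down accounting over issue slots; a simplified level-1 sketch (not the exact formulas of [2]), assuming a 4-wide issue pipeline, with `top_down_level1` as a hypothetical helper:

```python
def top_down_level1(cycles: int, slots_retired: int, slots_bad_spec: int,
                    slots_fe_stall: int, width: int = 4):
    """Level-1 Top-Down breakdown: each category as a fraction of total issue slots."""
    total = width * cycles
    retiring = slots_retired / total
    bad_speculation = slots_bad_spec / total
    frontend_bound = slots_fe_stall / total
    # Whatever is left over is attributed to the backend (core/memory bound)
    backend_bound = 1.0 - retiring - bad_speculation - frontend_bound
    return retiring, bad_speculation, frontend_bound, backend_bound

# Steady-state FMA AVX DP on Skylake: only 2 of the 4 slots retire FP
# instructions per cycle, so retiring is 50% and the rest is backend bound.
r, bs, fe, be = top_down_level1(cycles=1000, slots_retired=2000,
                                slots_bad_spec=0, slots_fe_stall=0)
```

This reproduces the roughly 50%/50% split between retiring and core bound observed in Figure 3.7 at peak performance.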

3.2.2 Memory subsystem benchmarking

Regarding the memory subsystem, its capabilities differ between Intel Ivy Bridge and Intel Skylake micro-

architectures, as referred in Section 2.1. While Intel Ivy Bridge can deliver a maximum of 48 bytes per cycle (two

loads and one store of 16 bytes each), the Intel Skylake bus width between the core and the L1 data cache was increased

to support a theoretical throughput of 96 bytes per cycle (two loads and one store of 32 bytes each). These

enhancements across micro-architectures have a great impact on the memory subsystem capabilities, as will be demonstrated throughout this section.
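These per-cycle bus widths translate directly into the peak load bandwidths reported later in this section; a small sketch of that conversion, assuming nominal frequencies of 3.5 GHz (3770K) and 4.0 GHz (6700K):

```python
def peak_bandwidth_gb_s(bytes_per_cycle: int, ghz: float, cores: int = 1) -> float:
    """Peak bandwidth in GB/s: (bytes/cycle) x (cycles/ns) x cores."""
    return bytes_per_cycle * ghz * cores

# Single-core L1 load bandwidth upper-bounds
ivb_l1_loads = peak_bandwidth_gb_s(32, 3.5)  # two 16-byte loads/cycle -> 112 GB/s
sky_l1_loads = peak_bandwidth_gb_s(64, 4.0)  # two 32-byte loads/cycle -> 256 GB/s
```

Adding the store port (48 and 96 bytes per cycle in total) yields the corresponding 2LD/ST upper-bounds.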

The benchmarks utilized to characterize the memory subsystem are created by substituting the macro MEM_INST,

[Figure: two panels, Bandwidth [GB/s] vs. Data Traffic [KB] for the LD test with AVX SIMD DP; (a) Intel Ivy Bridge 3770K and (b) Intel Skylake 6700K, each with 1-core and 4-core curves over the L1→C, L2→C, L3→C and DRAM→C regions.]

Figure 3.8: Memory subsystem bandwidth for LD AVX SIMD DP at nominal frequency.


[Figure: two panels, Power Consumption [W] vs. Data Traffic [KB] for the LD test with AVX SIMD DP on 4 cores; (a) Intel Ivy Bridge 3770K with core, uncore and package power; (b) Intel Skylake 6700K with core, uncore, DRAM and package power; the L1→C, L2→C, L3→C and DRAM→C regions are marked.]

Figure 3.9: Memory subsystem power consumption for LD AVX SIMD DP at nominal frequency.

in Algorithm 1, by the respective memory instructions, i.e., vmovapd addr, reg for loads and vmovapd reg, addr for stores. In order to measure the number of performed load instructions, the events MEM_INST_RETIRED.ALL_LOADS (Intel Skylake) and MEM_UOP_RETIRED.ALL_LOADS (Intel Ivy Bridge) are configured in the tool. For store instructions, the counters MEM_INST_RETIRED.ALL_STORES (Intel Skylake) and MEM_UOP_RETIRED.ALL_STORES (Intel Ivy Bridge) are utilized in the measurements.

In order to obtain the desired results, an extensive set of benchmarks following the described methodology is performed on both processors. To obtain accurate and stable memory bandwidth measurements, each presented curve involves more than 500 tests and, exactly as in the FP benchmarking, each test is repeated 1024 times, reporting the median value of the counter and clock measurements. The bandwidth results obtained for load instructions are

presented in Figures 3.8a and 3.8b for Intel Ivy Bridge and Intel Skylake, respectively.

As can be observed in both figures, the highest bandwidth is obtained in both systems when all the accesses are

served by L1 cache. Besides, the performed tests achieved the maximum theoretical bandwidth, i.e., 112 GB/s in

Intel Ivy Bridge (two loads of 16 bytes per cycle) and 256 GB/s in Intel Skylake (two loads of 32 bytes per cycle)

for single thread benchmarks. The bandwidth of the remaining memory levels decreases as the data is fetched

further away from the core. In the single-thread test, L2 achieved 128 GB/s in Intel Skylake and approximately

38 GB/s in Intel Ivy Bridge, while L3 cache attained 60 GB/s in Intel Skylake and 30 GB/s in Intel Ivy Bridge.

Finally, DRAM achieves around 7.5 GB/s in Intel Ivy Bridge and 14 GB/s in Intel Skylake. Since L1 and L2

caches are private to each core, their bandwidth scales linearly with the number of cores, hence L1 maximum

attainable bandwidth is 448 GB/s (4×112) in Intel Ivy Bridge and 1024 GB/s (4×256) in Intel Skylake. Although the L3 cache is shared between cores, since each core has its own slice in the ring interconnect, the same effect occurs. In contrast, the DRAM bandwidth does not scale linearly with the number of cores, since it is shared by all cores and all of them use the same connection to access it. For multi-thread execution, the DRAM bandwidth in Intel Ivy Bridge is approximately 24.5 GB/s, while Intel Skylake attains 30.7 GB/s. Hence, in both multi-thread and single-thread tests, Intel Skylake always delivers higher throughput than Intel Ivy Bridge for all memory levels. In particular, Intel Skylake offers about 3.37 times higher L2 bandwidth than Intel Ivy Bridge in both tests, corresponding to the biggest difference when comparing both architectures. This may suggest that the L2 cache received a considerable number of improvements across micro-architecture generations.

The corresponding power consumption results are presented in Figures 3.9a and 3.10a (Intel Ivy Bridge) and 3.9b and 3.10b (Intel Skylake). In order to better visualize the single-core power consumption, the curves were placed


[Figure: two panels, Power Consumption [W] vs. Data Traffic [KB] for the single-core LD test with AVX SIMD DP; (a) Intel Ivy Bridge 3770K and (b) Intel Skylake 6700K, with the L1→C, L2→C, L3→C and DRAM→C regions marked.]

Figure 3.10: Memory subsystem power consumption for LD AVX SIMD DP at nominal frequency (1 core).

in different figures. As can be observed in all figures, core power consumption increases as the data is served by

higher levels of the processor memory hierarchy, due to the increasing utilization of the cache levels, achieving its

maximum when all caches are being used, i.e., when data is fetched from L3. In L1 cache, the power consumption

is about 11.2 W in Intel Ivy Bridge and 17.5 W in Intel Skylake. Moreover, L2 and L3 caches in Intel Ivy Bridge

consume around 11.7 W and 11.9 W, respectively, while in Intel Skylake their power consumption is approximately

18.2 W (L2 cache) and 18.88 W (L3 cache). Thus, similar to FP performance, higher bandwidth is followed by

an increase in power consumption (when comparing the same cache levels). In DRAM, due to the reduction in bandwidth when data is served by this memory level, the cores stall while waiting for data, reducing the power consumption to about 16.2 W in Intel Skylake and 11.3 W in Intel Ivy Bridge. When comparing both architectures, there is a clear increase in power consumption from Intel Ivy Bridge to Intel Skylake, with a maximum increase of 59% in L3 cache.

Similar to the FP units, the memory subsystem power consumption also does not scale linearly with the number of cores. For multi-thread tests, the Intel Ivy Bridge power consumption in L1 cache is around 30.8 W and approximately 51 W in Intel Skylake, representing an increase of 65.6% between processors. Moreover, the

power consumption of L2 and L3 caches is about 33.2 W and 33.6 W in Intel Ivy Bridge, while in Intel Skylake

their power consumption measures are approximately 53 W and 54.3 W, respectively. In DRAM, Intel Ivy Bridge

3770K consumes about 29.6 W, while the Intel Skylake 6700K DRAM power consumption is approximately 41.6 W.

By combining bandwidth and power consumption results, it is possible to compare the efficiency of the memory

subsystem in both processors. For single-thread benchmarking, L1 cache in Intel Ivy Bridge 3770K can deliver

about 10 GB/J (112 GB/s at 11.2 W), while Intel Skylake 6700K is able to provide approximately 14.63 GB/J (256

GB/s at 17.5 W). Across all memory levels, Intel Skylake 6700K is always more efficient than Intel Ivy Bridge

3770K. In particular, L3 cache in Intel Skylake 6700K can provide a maximum of 3.178 GB/J (60 GB/s at 18.88

W), while Intel Ivy Bridge 3770K only delivers 2.52 GB/J (30 GB/s at 11.9 W).

For multi-thread execution, since the increase in bandwidth is much higher than the increase in power consumption, each processor is able to deliver even higher efficiency than when working with only one core. Moreover, Intel Skylake 6700K

continues to provide a higher efficiency than Intel Ivy Bridge 3770K. In particular for L1 cache, Intel Skylake

6700K is able to deliver up to 20 GB/J (1024 GB/s at 51 W), while Intel Ivy Bridge 3770K only delivers 14.5 GB/J

(448 GB/s at 30.8 W). This allows concluding that the memory subsystem of Intel Skylake 6700K is also more


[Figure: two panels, Bandwidth [GB/s] vs. Data Traffic [KB] for AVX SIMD DP on 4 cores; (a) Intel Ivy Bridge 3770K and (b) Intel Skylake 6700K, with 2LD/ST, LD, ST and LD/ST curves over the L1→C, L2→C, L3→C and DRAM→C regions.]

Figure 3.11: Memory ratios bandwidth for AVX SIMD DP at nominal frequency.

[Figure: two panels, core Power Consumption [W] vs. Data Traffic [KB] for AVX SIMD DP on 4 cores; (a) Intel Ivy Bridge 3770K and (b) Intel Skylake 6700K, with 2LD/ST, LD, ST and LD/ST curves over the L1→C, L2→C, L3→C and DRAM→C regions.]

Figure 3.12: Memory ratios power consumption for AVX SIMD DP at nominal frequency.

efficient than that of Intel Ivy Bridge 3770K.

Regarding the uncore power (see Figures 3.9a and 3.9b), it is constant and equal to the uncore power in the FP tests while the caches are utilized, increasing when DRAM is accessed. The DRAM power domain in Intel Skylake

follows the same tendency. The package power corresponds to the sum of uncore power and core power, having a

similar behavior to the core power consumption. Due to this, uncore, DRAM and package curves are not presented

in the following tests, since their behavior is similar to the one exposed here, which would not provide additional

insights.

Furthermore, as referred in Section 2.1, Intel Ivy Bridge and Intel Skylake contain two ports to dispatch loads

and one to dispatch stores. Thus, these micro-architectures support a wide range of load/store ratios, e.g., LD, ST,

LD/ST and 2LD/ST, each of which results in a different memory subsystem throughput. The obtained bandwidth results with four

cores for different load/store ratios are presented in Figures 3.11a and 3.11b, for Intel Ivy Bridge and Intel Skylake,

respectively.

In both architectures, the highest bandwidth in L1 cache is achieved when two loads and one store are performed together (i.e., the 2LD/ST ratio). In Intel Ivy Bridge, the tests achieved the maximum attainable L1 bandwidth for all ratios, i.e., 448 GB/s for LD and LD/ST, 224 GB/s for ST and 672 GB/s for 2LD/ST. In Intel

Skylake, the L1 maximum attainable bandwidth is obtained for LD and LD/ST (1024 GB/s) and for ST (512 GB/s).

However, the 2LD/ST ratio only achieved around 1355 GB/s, approximately 86.6% of the maximum theoretical bandwidth (1536 GB/s). Since the maximum attainable bandwidths were achieved for all the remaining memory ratios and in two completely different micro-architectures, this may suggest the existence of an undisclosed bottleneck


in the memory dispatch ports of Intel Skylake 6700K.
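The per-ratio L1 upper-bounds above can be reconstructed from the dispatch widths: two load ports and one store port, with 16-byte accesses in Ivy Bridge and 32-byte accesses in Skylake. A sketch, assuming nominal frequencies of 3.5 GHz (3770K) and 4.0 GHz (6700K):

```python
def l1_peak_gb_s(loads: int, stores: int, access_bytes: int,
                 ghz: float, cores: int = 4) -> float:
    """Peak L1 bandwidth for a load/store ratio issued every cycle, in GB/s."""
    return (loads + stores) * access_bytes * ghz * cores

# (loads, stores) issued per cycle for each memory ratio
ratios = {"LD": (2, 0), "ST": (0, 1), "LD/ST": (1, 1), "2LD/ST": (2, 1)}

ivb = {name: l1_peak_gb_s(l, s, 16, 3.5) for name, (l, s) in ratios.items()}
sky = {name: l1_peak_gb_s(l, s, 32, 4.0) for name, (l, s) in ratios.items()}
```

This reproduces the 448/224/448/672 GB/s bounds for Ivy Bridge and the 1024/512/1024/1536 GB/s bounds for Skylake discussed above.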

Furthermore, different memory ratios utilize the data bus connecting memories at different utilization rates.

Thus, the bandwidth of the remaining memory levels is also affected by the load/store ratio. In Intel Skylake,

all the remaining levels are limited by LD bandwidth, achieving around 509 GB/s in L2, 241 GB/s in L3 and 30

GB/s in DRAM. On the other hand, Intel Ivy Bridge L2 (161 GB/s) and L3 (120 GB/s) maximum throughputs

correspond to 2LD/ST bandwidth, while DRAM upper-bound matches LD bandwidth, achieving about 24.5 GB/s.

The respective power consumption results are presented in Figures 3.12a and 3.12b, for Intel Ivy Bridge and

Intel Skylake, respectively. Here, only the power consumption corresponding to the bandwidth upper-bounds of

each memory level is evaluated, in order to simplify the memory subsystem analysis. In L1 cache, when using

2LD/ST ratio, Intel Ivy Bridge power consumption is about 34 W, while Intel Skylake 6700K consumes around

63 W, i.e., 85% more than Intel Ivy Bridge. The same analysis, from the energy-efficiency point of view, reveals that

Intel Skylake 6700K delivers 21.5 GB/J (1355 GB/s at 63 W), while Intel Ivy Bridge is able to provide 19.8 GB/J

(672 GB/s at 34 W). Thus, for L1 cache, Intel Skylake 6700K is 8.6% more efficient than Intel Ivy Bridge 3770K.

Moreover, in Intel Skylake, when using LD ratio, L2, L3 and DRAM power consumptions are approximately 53

W, 54.3 W and 41.6 W, respectively. On Intel Ivy Bridge, L2 and L3 power consumptions, when using 2LD/ST

ratio are about 36.3 W and 39 W, respectively, while DRAM consumes nearly 29 W (LD ratio). Thus, from energy-

efficiency point of view, Intel Skylake 6700K delivers 9.6 GB/J (509 GB/s at 53 W), 4.44 GB/J (241 GB/s at 54.3

W) and 0.72 GB/J (30 GB/s at 41.6 W), for L2, L3 and DRAM, respectively, when using LD ratio. On the other

hand, Intel Ivy Bridge 3770K only delivers 4.43 GB/J (161 GB/s at 36.3 W), 3.08 GB/J (120 GB/s at 39 W) and 0.84 GB/J (24.5 GB/s at 29 W), for L2, L3 (2LD/ST ratio) and DRAM (LD ratio), respectively. Thus, when both

processors are using the maximum bandwidth in each memory level, Intel Skylake 6700K is more efficient in L1,

L2, L3 caches than Intel Ivy Bridge 3770K. However, in DRAM, Intel Ivy Bridge 3770K is 17% more efficient

than Intel Skylake 6700K.

It is important to notice that in Intel Skylake, the ratios mixing loads and stores, i.e., the LD/ST and 2LD/ST ratios, achieve lower power consumption in the L2 cache than in the L1 cache. Since no information about store behavior across the different memory levels is disclosed in the Intel manuals, it is difficult to explain this behavior. However, in the Intel Skylake micro-architecture, the L2 cache associativity was reduced from 8 ways in previous generations to 4 ways, which may have changed the way loads and stores are handled when executed together.

Similar to FP units, memory subsystem throughput also depends on the utilized instruction set extension. The

bandwidth results obtained for 2LD/ST ratio using different instruction set extensions, namely AVX DP, SSE DP

and Scalar DP are presented in Figure 3.13a.

As can be observed, 2LD/ST AVX corresponds to the maximum upper-bound across all memory levels. In

particular, the L1 cache bandwidth is equal to 1355 GB/s. The SSE and Scalar curves attain much lower bandwidth, since their vector length is reduced. While AVX supports memory transfers of 32 bytes each in Intel

Skylake 6700K, SSE and scalar can only handle 16 bytes and 8 bytes, respectively, achieving 553 GB/s (SSE) and

298 GB/s (Scalar). In the remaining cache levels, 2LD/ST AVX attains 417.7 GB/s in L2, 214.3 GB/s in L3 and

18.6 GB/s in DRAM. Moreover, 2LD/ST SSE achieves 320 GB/s in L2, 197 GB/s in L3 and 15.3 GB/s in DRAM.

Finally, 2LD/ST Scalar test only achieves 187.3 GB/s in L2, 119.8 GB/s in L3 and 14.9 GB/s in DRAM.

Moreover, in the corresponding power consumption results, presented in Figure 3.13b, the power consumption


[Figure: two panels for the 2LD/ST test on Intel Skylake 6700K (4 cores); (a) Bandwidth [GB/s] and (b) core Power Consumption [W] vs. Data Traffic [KB], for AVX, SSE and Scalar DP, over the L1→C, L2→C, L3→C and DRAM→C regions.]

Figure 3.13: Memory subsystem performance and power consumption for different instruction set extensions in Intel Skylake 6700K (4 cores).

is completely dominated by 2LD/ST AVX instructions, achieving about 63 W in L1. Regarding SSE and scalar

curves, in L1 their power consumption is equal and approximately 47 W. However, in the remaining memory

levels, the SSE power consumption easily surpasses the scalar power. The AVX test power consumption is approximately

58.7 W in L2, 61.4 W in L3 and 41.25 W in DRAM. For the SSE test, the power consumption in remaining

memory levels is about 51.1 W in L2, 55.1 W in L3 and 34.9 W in DRAM. Finally, for the scalar benchmark, the

values obtained are 49.1 W in L2, 52.3 W in L3 and 32.9 W in DRAM. Thus, for all memory levels, AVX power

consumption is always superior to SSE and Scalar.

From energy-efficiency point of view, when accessing L1 cache, 2LD/ST AVX delivers up to 21.5 GB/J (1355

GB/s at 63 W). Since scalar and SSE power consumptions are equal in L1 but SSE instructions attain higher

performance, scalar instructions are less energy-efficient than SSE. In fact, SSE instructions can provide, in L1

cache, 11.77 GB/J (553 GB/s at 47 W), while scalar DP only delivers 6.34 GB/J (298 GB/s at 47 W). In the

remaining cache levels, 2LD/ST AVX delivers 7.12 GB/J (417.7 GB/s at 58.7 W) in L2, 3.5 GB/J (214.3 GB/s

at 61.4 W) in L3 and 0.45 GB/J (18.6 GB/s at 41.25 W) in DRAM. Furthermore, SSE instructions provide 6.26

GB/J (320 GB/s at 51.1 W) in L2, 3.57 GB/J (197 GB/s at 55.1 W) in L3 and 0.44 GB/J (15.3 GB/s at 34.9 W)

in DRAM. Finally, Scalar instructions deliver 3.81 GB/J (187.3 GB/s at 49.1 W) in L2, 2.29 GB/J (119.8 GB/s

at 52.3 W) in L3 and 0.45 GB/J (14.9 GB/s at 32.9 W) in DRAM. Hence, to fully exploit the memory subsystem

of Intel Skylake 6700K from the energy-efficiency point of view, it is necessary to use AVX instructions, since they allow achieving the maximum possible efficiency.

To also evaluate the memory benchmarks, the Top-Down method is applied to the 2LD/ST ratio test for 4 cores in Intel Skylake. The obtained results are presented in Figure 3.14. The further the data is fetched from the core, the more cycles are spent performing memory operations, increasing the memory bound contribution, which reaches around 96% in DRAM. Besides, this also reduces retiring, due to the increasing amount of time it takes to perform the operations. Finally, the core bound metric increases up to the L2 cache, since the number of cycles to fetch data is balanced with the port utilization. However, in L3 and DRAM, the port utilization diminishes (due to the number of cycles required to serve the data), reducing the core bound contribution.


[Figure: Top-Down breakdown (fraction of issue slots, 0 to 1) vs. Data Traffic [KB] for the 2LD/ST AVX SIMD DP test on Intel Skylake 6700K (4 cores), showing Retiring, MEM Bound, Core Bound and FrontEnd Bound over the L1→C, L2→C, L3→C and DRAM→C regions.]

Figure 3.14: Top Down Method for 2LD/ST AVX SIMD DP at nominal frequency.

3.3 Summary

In this chapter, an extensive set of benchmarks is constructed in order to fully characterize FP units and memory

subsystem in Intel Skylake 6700K and Intel Ivy Bridge 3770K. This kind of analysis is extremely important in order to identify the main upper-bounds of multi-core processors and possible micro-architectural

limitations which are not reflected in theoretical datasheets. Besides, modern multi-core CPUs support an extensive

set of instructions, load/store ratios and instruction set extensions, e.g., AVX and SSE, which influence the processor throughput.

To accomplish this task, a tool designed to perform an accurate experimental evaluation on real hardware is

proposed in the scope of this Thesis. This tool utilizes hardware performance counters built into the processors, in

order to obtain the necessary measurements to evaluate the system capabilities. Besides, the benchmark structure is

explained, revealing the options taken during its construction, with the objective of obtaining maximum precision,

accuracy and stability in the performed benchmarks.

Next, the results obtained with the tool were evaluated. In this part, the Intel Ivy Bridge 3770K and Intel Skylake 6700K processors are compared for a diversity of instructions and load/store ratios. In general, it was demonstrated that the enhancements in the Intel Skylake micro-architecture allow it to achieve higher levels of performance, power consumption and energy-efficiency than Intel Ivy Bridge.

Finally, the quality of the benchmarks is assessed with Top-Down analysis. From this, it was possible to con-

clude that the benchmarks fully exploit the micro-architecture capabilities in both memory and FP tests, revealing

an accurate characterization of the micro-architecture upper-bounds.


4. Proposed insightful models: Construction and experimental validation

As it was previously shown in Chapter 3, the micro-architecture upper-bounds depend on several factors, such

as the utilization of different instruction set extensions and instruction types, the ratio of different memory op-

erations (load and store instructions) etc. However, in its integral version, CARM mainly considers the absolute

performance, power-consumption and energy-efficiency upper-bounds of a given micro-architecture, e.g., by focusing on the AVX ISA extensions for FP DP arithmetic and the maximum attainable bandwidth for different memory levels by considering the 2LD+ST ratio of memory instructions [3, 5]. As a consequence, depending on

the characteristics and demands of a specific application, this model might not provide the most accurate char-

acterization for the applications that are intrinsically unable to exploit those micro-architecture maximums, e.g.,

in cases when the applications do not employ the AVX extensions, use a specific subset of FP units and/or have a different ratio of load/store operations in their instruction mix. This is a specific gap that the work proposed in this Thesis intends to close.

With this aim, a set of application-centric insightful micro-architecture models for performance, power consumption and energy-efficiency is proposed in this Chapter, herein referred to as CARM extensions. These models

aim at improving the insightfulness of the state-of-the-art models by covering a wide range of execution scenarios

from both micro-architecture and application perspectives. The proposed set of CARM extensions are presented

for Intel Skylake 6700K processor, evaluating the impact of different processor capabilities in the construction of

different models and characterization of potential execution bottlenecks. To perform this analysis, several CARM

instances for different instruction types, ratios of memory operations and instruction set extensions are proposed

and constructed for performance, power consumption and energy-efficiency.

Besides, the state-of-the-art CARM, which models the maximum throughput upper-bounds for memory sub-

system and FP units, is constructed for Intel Skylake 6700K. The insights provided by this model will be compared

in Chapter 5 with the characterization provided by proposed extensions, in order to assess the usability of the work

performed in this Thesis.

Furthermore, an extensive experimental validation for a set of proposed CARM extensions is performed on

real hardware platforms by considering two different generations of Intel client micro-architectures from Intel

Core processor family, i.e., quad-core Intel Ivy Bridge 3770K and quad-core Intel Skylake 6700K. This evaluation

was conducted by considering a range of different instruction types and mixes for both compute and memory

operations. To obtain a highly accurate experimental validation of the proposed models, the testing methodology

and tools presented in Chapter 3 were used together with a set of micro-benchmarks, specifically designed in

the scope of this Thesis. These micro-benchmarks make it possible to attain the maximum upper-bounds of the system and to experimentally reach the modeled maximums in the memory and compute regions for all considered modeling domains, i.e., performance, power consumption and energy-efficiency.


4.1 Proposed CARM extensions: Model construction

As discussed in Chapter 3, several different factors can greatly affect the micro-architecture maximum computation capabilities, e.g., the utilization of a specific subset of arithmetic units. As such, in the compute bound region of the proposed CARM extensions, several horizontal roofs can be included, in order to define the upper-bounds for different instructions, such as ADD, MUL and FMA.

The performance CARM extension for AVX SIMD DP FP instructions and 2LD/ST ratio is presented in Figure

4.1a. As it can be observed, the extended performance CARM contains two horizontal roofs, one for ADD/MUL

operations and the other for FMA instructions. In accordance with the micro-architecture benchmarking performed

in Chapter 3, the compute bound region upper-bound corresponds to the FMA instructions, with a maximum attainable performance of 256 GFLOPS/s when using all 4 cores of the Intel Skylake 6700K processor. The ADD and MUL instructions form the same horizontal roof, since both arithmetic units attain the same throughput, resulting in a performance of 128 GFLOPS/s.
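These compute roofs follow directly from the per-cycle instruction throughputs, and the full CARM curve is the minimum of the memory and compute bounds. A minimal sketch, assuming the standard cache-aware roofline bound min(AI·B, F_peak) and using the peak values stated above (the L1 bandwidth value is the 2LD/ST result reported in Figure 4.4a):

```python
# Sketch of the CARM performance roofs for Intel Skylake 6700K
# (4 cores, AVX SIMD DP); peaks and bandwidth come from the text,
# the min() formula is the cache-aware roofline bound.

CORES, FREQ_GHZ = 4, 4.0
DP_LANES, FMA_UNITS = 4, 2           # 256-bit AVX: 4 DP lanes; 2 FMA ports
PEAK_FMA = CORES * FREQ_GHZ * FMA_UNITS * DP_LANES * 2   # 2 flops/FMA -> 256 GFLOPS/s
PEAK_ADD_MUL = PEAK_FMA / 2                              # 1 flop/op   -> 128 GFLOPS/s
L1_BW = 1355.0                       # GB/s, 2LD/ST ratio (Figure 4.4a)

def attainable(ai, bw=L1_BW, peak=PEAK_FMA):
    """Attainable performance [GFLOPS/s] at arithmetic intensity ai [flops/byte]."""
    return min(ai * bw, peak)

print(PEAK_FMA, PEAK_ADD_MUL)        # 256.0 128.0
print(attainable(2**4))              # compute-bound region: 256.0
print(attainable(2**-4))             # memory-bound region: 1355/16
```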

[Figure 4.1 plots: Performance [GFLOPS/s], Power Consumption [W] and Energy-Efficiency [GFLOPS/J] versus Arithmetic Intensity [flops/byte], for Intel Skylake 6700K (4 Cores | AVX SIMD DP | 2LD/ST), with L1→C, L2→C, L3→C and DRAM→C memory roofs and FMA and ADD/MUL compute roofs.]

(a) Performance CARM extension: ADD/MUL and FMA.
(b) Power CARM extension: ADD/MUL and FMA.
(c) Energy-efficiency CARM extension: ADD/MUL and FMA.

Figure 4.1: Proposed CARM extensions for AVX DP FP instructions for Intel Skylake 6700K (4 Cores, 2LD/ST).

A similar trend can be observed in the respective power consumption CARM extension, presented in Figure 4.1b. As it can be observed, in the deep memory bound region of the model there is no difference between the L1/ADD and L1/FMA curves, since the same cache level is utilized. However, in the compute bound region, a different power consumption is attained when using different subsets of arithmetic units. More precisely, the power consumption in the proposed CARM extension asymptotically decreases towards the power consumption of the respective FP units being used, i.e., to the power of 56 W for ADD/MUL and 60 W for FMA, which corresponds to the power consumption measurements obtained in Chapter 3 when performing the fine-grain micro-architecture evaluation. In order to facilitate the analysis, only the L1 roof for the ADD instruction is presented. However, the same conclusions are also valid for the L2 and L3 caches, and DRAM.


[Figure 4.2 plots: Performance [GFLOPS/s], Power Consumption [W] and Energy-Efficiency [GFLOPS/J] versus Arithmetic Intensity [flops/byte], for Intel Skylake 6700K (4 Cores | AVX SIMD DP), with per-level LD and ST memory roofs, the FMA compute roof and the L1→C (2LD/ST) reference curve.]

(a) Performance CARM extension: Load operations (LD).
(b) Performance CARM extension: Store operations (ST).
(c) Power CARM extension: Load operations (LD).
(d) Power CARM extension: Store operations (ST).
(e) Energy-efficiency CARM extension: Load (LD).
(f) Energy-efficiency CARM extension: Store (ST).

Figure 4.2: Proposed CARM extensions for AVX LD and ST operations for Intel Skylake 6700K (4 Cores).

Finally, in the energy-efficiency CARM extension, presented in Figure 4.1c, a substantial amount of similarities

can be observed when compared to the performance model. The energy-efficiency CARM also contains one

horizontal roof for each instruction type, delimiting the maximum energy-efficiency that is possible to achieve by

using a specific arithmetic unit. As in the power consumption and performance CARM extensions, the energy-efficiency upper-bound in the compute bound region is limited by the FMA instructions, with an efficiency of 4.3 GFlops/J, while the ADD and MUL instructions achieve an energy-efficiency of around 2.29 GFlops/J.
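These compute-bound efficiency roofs are consistent with dividing the performance roofs by the corresponding power roofs: a quick cross-check with the values given in the text (the rounding is mine):

```python
# Cross-check of the energy-efficiency compute roofs: efficiency
# [GFlops/J] equals performance [GFLOPS/s] divided by power [W].
def efficiency(perf_gflops, power_w):
    return perf_gflops / power_w

print(round(efficiency(256, 60), 2))   # FMA roof: 4.27 (~4.3 in the text)
print(round(efficiency(128, 56), 2))   # ADD/MUL roof: 2.29
```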

As evidenced in Chapter 3, the realistically attainable bandwidth for different memory levels significantly

varies depending on the amount of memory ports being utilized and the type of memory operations. As a result, the

memory bound region in the proposed CARM extensions will be affected depending on the used load/store ratio.

In order to show how different memory ratios affect CARM, several CARM extensions are proposed and depicted

in Figure 4.2 for load and store operations and for all modeled domains, i.e., performance, power consumption and energy efficiency. In all figures, the L1/FMA curve for the 2LD/ST ratio is also plotted, in order to provide a better visual assessment of the differences between the integral version of CARM and the herein proposed CARM extensions when characterizing the micro-architecture upper-bounds in terms of the memory bandwidth.


[Figure 4.3 plots: Performance [GFLOPS/s] versus Arithmetic Intensity [flops/byte] for Intel Skylake 6700K (4 Cores, 2LD/ST), with SSE and Scalar DP memory roofs, the FMA AVX, FMA SSE and FMA Scalar compute roofs and the L1→C (AVX, 2LD/ST) reference curve.]

(a) Performance CARM extension: SSE instructions.
(b) Performance CARM extension: Scalar DP instructions.

Figure 4.3: Proposed CARM extensions for 2LD/ST ratio with SSE and Scalar DP instructions for Intel Skylake 6700K (4 Cores).

Regarding performance CARM extensions presented in Figures 4.2a (load model) and 4.2b (store model), it can

be observed that the L1(LD) bandwidth, i.e., the L1 bandwidth when only load operations are performed, is much closer to the L1(2LD/ST) than to the L1(ST). This corroborates the results obtained with the micro-architecture benchmarking, where the store tests achieved the lowest attainable bandwidth. In addition, when comparing the same memory level,

load CARM memory roofs correspond to higher performance than store memory roofs.

In the power consumption CARM extensions, presented in Figures 4.2c (load model) and 4.2d (store model), the same behavior does not fully occur. In the memory region, while L1(LD) attains a higher power consumption than L1(ST), correlating with the performance model, the L3(ST) curve attains a higher power consumption than L3(LD). This observation can possibly be attributed to the write-back nature of the LLC: upon a cache miss to fetch the data for both load and store operations, there is additional activity required to serve the write-backs to the DRAM for the store operations. Moreover, as it can be observed in the proposed CARM extension for only load operations, the power consumption of the cache levels when serving only load operations is lower than the one obtained for 2LD/ST, while L3(ST) almost matches the power consumption of 2LD/ST. This behavior may suggest a larger impact of the store instructions on the overall power consumption, even for different memory operation mixes.

As expected, the proposed energy-efficiency CARM extensions are similar to the performance extensions, as it can be observed in Figures 4.2e (load model) and 4.2f (store model). However, it can be noticed that, in terms of energy-efficiency, there are no significant differences between L1(2LD/ST) and L1(LD). This result confirms the observations made in the performance and power domains, where for different memory levels the load operations can provide lower bandwidth coupled with reduced power consumption, and conversely, higher bandwidth at the cost of increased power consumption. As such, from the energy-efficiency point of view, both ratios allow extracting the maximum potential of the micro-architecture. In contrast, the store model upper-bounds in the memory region achieve a much lower efficiency than the 2LD/ST ratio.

As previously mentioned, the utilized instruction set extension also influences both the CARM memory and compute regions. As it can be observed in Figure 4.3a for the proposed SSE CARM extension, it is not possible to attain the maximum processor upper-bounds when using SSE instructions, when compared to the AVX upper-bounds. This effect is even more visible when using scalar DP instructions. As presented in the respective CARM extension (see Figure 4.3b), the upper-bounds for scalar DP instructions are even lower than for SSE. Hence, in the SSE and scalar DP CARM extensions, the useful area for application optimization is greatly reduced, preventing the application from reaching


the maximum peak performance of the processor. However, by employing wider instruction set extensions, the application can move across different CARM extensions. Ultimately, to fully maximize the performance, it is necessary to use AVX instructions, as it will be experimentally demonstrated in Chapter 5 when applying a different set of techniques to optimize the application execution. In this analysis, the power consumption and energy-efficiency CARM extensions are not presented, since the insights are similar to those previously obtained when analyzing the models presented in Figures 4.1 and 4.2.

4.1.1 State-of-the-art CARM construction

As shown in Chapter 3, the results obtained for memory subsystem throughput demonstrate that the maximum

bandwidth of each memory level does not occur for the same load/store ratio. In particular, for Intel Skylake

6700K, the maximum bandwidth between the L1 cache and the core is achieved for the 2LD/ST ratio, while the bandwidth upper-bounds for L2, L3 and DRAM are only attainable when the LD ratio is utilized. This can be seen in Figure 4.4a, where the bandwidth results for the LD and 2LD/ST ratios are presented for Intel Skylake 6700K.

Based on these results, it is possible to construct a CARM extension containing the uppermost limits of the

micro-architecture, presented in Figure 4.4b. Since it models the maximum limits of the micro-architecture, the

roofs correspond to the processor throughput when using AVX SIMD DP instructions. Besides, the memory

region of the model mixes the bandwidth of two distinct ratios, i.e., LD (L2, L3 and DRAM) and 2LD/ST (L1)

bandwidths. In the compute bound region, the horizontal roofs match the maximum peak performance of the FP units, when using AVX SIMD DP instructions.
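This construction amounts to taking, for each memory level, the highest bandwidth across the measured load/store ratios. A small sketch, with bandwidth values in GB/s read from Figure 4.4a; the per-ratio pairing for L2, L3 and DRAM is inferred from the statement above that their maxima occur for the LD ratio:

```python
# Hedged sketch: building the state-of-the-art CARM memory roofs by
# selecting, for each level, the best bandwidth over the tested ratios.
BW = {  # GB/s, from Figure 4.4a (per-ratio pairing partly inferred)
    "L1":   {"LD": 1024.0, "2LD/ST": 1355.0},
    "L2":   {"LD": 509.5,  "2LD/ST": 417.7},
    "L3":   {"LD": 241.2,  "2LD/ST": 214.3},
    "DRAM": {"LD": 30.8,   "2LD/ST": 18.6},
}

roofs = {level: max(ratios, key=ratios.get) for level, ratios in BW.items()}
print(roofs)  # {'L1': '2LD/ST', 'L2': 'LD', 'L3': 'LD', 'DRAM': 'LD'}
```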

[Figure 4.4a plot: Bandwidth [GB/s] versus Data Traffic [KB] for Intel Skylake 6700K (4 Cores | AVX SIMD DP), with LD and 2LD/ST curves; L1 plateaus at 1024 GB/s (LD) and 1355 GB/s (2LD/ST), followed by per-level plateaus of 509.5, 417.7, 241.2, 214.3, 30.8 and 18.6 GB/s. Figure 4.4b plot: Performance [GFLOPS/s] versus Arithmetic Intensity [flops/byte], with L1→C (2LD/ST), L2→C (LD), L3→C (LD) and DRAM→C (LD) memory roofs and FMA and ADD/MUL compute roofs.]

(a) LD and 2LD/ST AVX SIMD DP bandwidth test.
(b) State-of-the-art AVX DP CARM for Intel Skylake 6700K.

Figure 4.4: AVX DP LD and 2LD/ST memory bandwidth evaluation and state-of-the-art CARM for Intel Skylake 6700K (4 Cores).

As it can be concluded, this modeling approach uses the absolute maximums when considering the system upper-bounds in all FP and memory components, thus it can be considered a state-of-the-art CARM model. In particular, it represents a combination between the integral CARM and the herein proposed LD CARM extension. As such, this extension does not allow modeling the entire range of processor capabilities, thus it may provide a misleading characterization for certain types of applications, preventing it from fully uncovering the main bottlenecks that limit the application performance, as it will be shown in Chapter 5.


4.2 Experimental validation of proposed CARM extensions

In order to experimentally validate a set of proposed CARM extensions, two different generations of Intel client

micro-architectures from Intel Core processor family were considered, namely: quad-core Intel Ivy Bridge 3770K

and quad-core Intel Skylake 6700K. To attain accurate experimental validation of the proposed models, the testing

methodology and tools presented in Chapter 3 were coupled with a set of specifically designed micro-benchmarks.

To measure the amount of performed AVX SIMD DP instructions, the counters already introduced in the FP unit and memory subsystem benchmarking are configured (see Chapter 3), while the RAPL facilities are relied upon to obtain the energy consumption in different parts of the processor chip.
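RAPL exposes cumulative energy counters rather than instantaneous power, so the average power of a benchmark run is derived from two counter samples and the elapsed time. A minimal sketch of that bookkeeping (hedged: the counter width below is an illustrative assumption, and the actual tool reads the RAPL registers through its own framework):

```python
# Hedged sketch: average power from two cumulative RAPL energy samples
# (microjoules). The counter wraps around, so negative deltas must be
# corrected; the 32-bit range below is an illustrative assumption.
COUNTER_RANGE_UJ = 2**32

def energy_delta_j(e_start_uj, e_end_uj, wrap=COUNTER_RANGE_UJ):
    delta = e_end_uj - e_start_uj
    if delta < 0:            # counter overflowed during the measurement
        delta += wrap
    return delta / 1e6       # microjoules -> joules

def avg_power_w(e_start_uj, e_end_uj, seconds):
    return energy_delta_j(e_start_uj, e_end_uj) / seconds

print(avg_power_w(1_000_000, 31_000_000, 0.5))  # 30 J over 0.5 s -> 60.0 W
```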

In contrast to the memory subsystem and FP unit benchmark tests from Chapter 3, where the amount of performed memory transfers or FP instructions was separately increased, for the CARM validation it is necessary to construct benchmarks that combine these operations in order to recreate different AIs. However, since the AI corresponds to the amount of flops over the total amount of bytes transferred, sweeping the AI cannot be efficiently achieved by simultaneously increasing the amount of bytes (memory transfers) and flops (FP instructions) at each test iteration. Moreover, by changing the number of memory instructions retired, the validation could switch memory levels at some point of the test, which would produce inaccurate results. Hence, to perform a correct validation, the benchmarks must fulfill two conditions: 1) the memory operations must be served by the same memory level through the entire test; and 2) the AI has to be increased through the test. This is accomplished by maintaining the number of memory transfers at a constant level (through the entire set of benchmarks), while increasing the number of performed flops. The structure of the benchmarks developed for validation of the proposed CARM extensions is presented in Algorithm 5.

In order to avoid the LSD and instruction cache use, the presented algorithm follows optimization approaches similar to those applied when designing the computation and memory benchmarks in Chapter 3. However, while Algorithms 1 and 2 (see Chapter 3) contain only a single inner loop, the CARM validation benchmark structure contains two inner loops. The first loop includes the mix between memory and FP instructions, thus allowing to recreate the desired AI by overlapping the execution of the required amount of FP instructions over the fixed amount of memory instructions. On the other hand, the second loop only holds instructions of one type, i.e., memory instructions or FP instructions, which represent the remaining instructions required to construct the desired AI. In particular, when this benchmark is used for the validation of the memory bound regions, the number of FP instructions is lower than the number of memory instructions, thus this loop only contains memory instructions. In contrast, when the amount of FP instructions exceeds the amount of memory instructions, the second loop is only formed by FP instructions. Hence, in the first loop, the memory and FP instructions are always overlapped, in order to balance the computation and memory transfer time, thus maximizing the probability of reaching the ridge point in the experimental tests.
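The resulting per-AI instruction counts can be sketched as follows (hedged illustration: 32-byte transfers per AVX memory instruction and 8 flops per AVX DP FMA are assumed, matching the instruction widths used throughout the text; the helper name is hypothetical):

```python
# Hedged sketch: with the number of memory instructions fixed, the AI is
# swept only by growing the flop count. Assumes AVX SIMD DP widths:
# 32 bytes per memory instruction, 8 flops per FMA instruction.
BYTES_PER_MEM = 32
FLOPS_PER_FMA = 8

def fma_count_for_ai(target_ai, n_mem_insts):
    """Number of FMA instructions recreating target_ai [flops/byte]."""
    total_bytes = n_mem_insts * BYTES_PER_MEM
    return round(target_ai * total_bytes / FLOPS_PER_FMA)

for ai in (2**-4, 2**0, 2**4):
    print(ai, fma_count_for_ai(ai, 1024))  # 0.0625->256, 1->4096, 16->65536
```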

To perform the experimental validation of proposed CARM extensions, an extensive set of benchmarks fol-

lowing the structure described in Algorithm 5 is created. In detail, approximately 1050 tests were executed to obtain the measurements for each presented validation graph. The experimental results obtained when validating

the performance and power consumption CARM extensions in a single-core of Intel Ivy Bridge 3770K processor,

by using the LD operations and AVX SIMD DP instruction set extension, are presented in Figures 4.5a and 4.5b,


Algorithm 5 Generic CARM benchmark

for i < time do
    for j < repeat1 do
        MEM INST
        FP INST
        MEM INST
        (...)
        MEM INST
        FP INST
        MEM INST
    end for
    MEM INST
    FP INST
    MEM INST
    (...)
    for k < repeat2 do
        MEM INST or FP INST
        MEM INST or FP INST
        (...)
        MEM INST or FP INST
        MEM INST or FP INST
    end for
    MEM INST or FP INST
    MEM INST or FP INST
    (...)
end for

respectively. As it can be observed, the performed tests were able to hit the ridge point for the L1 and L2 caches in both the performance and power consumption models. In fact, an average fitness of 72.2% was obtained for the overall performance validation and, in particular, a fitness of 99.95% was obtained for the L1 cache validation. Since, at the ridge point, the computation and memory transfer times must be exactly equal, achieving this point requires a precise balance between these two types of operations. However, since their throughputs are usually not a multiple of each other, for some memory levels it is quite hard to reach this point experimentally. This is the case of the L3 cache, where the experimental tests did not achieve this point, although the experimental points are very close to the theoretical curve.

Furthermore, the Intel Skylake performance and power consumption CARM validations, with one core, using the LD ratio and AVX SIMD DP instructions, are presented in Figures 4.6a and 4.6b, respectively. This experimental evaluation also attained the ridge point in the L1 and L2 caches in both models, demonstrating the accuracy of the utilized benchmarks. The L3 ridge point was not hit due to the difficulty in balancing computations and memory transfers. These tests achieved a fitness of 72.77% for the L1 cache validation and 79.7% for the L2 cache.

In order to evaluate the CARM adaptation to different micro-architectural capabilities, the validation for a ratio of two loads and one store is performed on both systems. The Intel Ivy Bridge 3770K performance and power consumption tests (Figure 4.7) are able, once more, to hit the L1 and L2 cache ridge points, coming very close to the L3 ridge point. Despite not reaching the L3 roof ridge point, an average fitness of 91.81% is obtained, with a fitness of 99.65% for the L1 cache.


[Figure 4.5 plots: Performance [GFLOPS/s] and Power Consumption [W] versus Arithmetic Intensity [flops/byte], LD CARM validation for Intel Ivy Bridge 3770K (1 Core | AVX SIMD DP), with L1→C (LD), L2→C (LD) and L3→C (LD) memory roofs and the MAD compute roof.]

(a) Performance LD AVX SIMD DP CARM validation.
(b) Power consumption LD AVX SIMD DP CARM validation.

Figure 4.5: Performance and power consumption LD AVX SIMD DP CARM validations for Intel Ivy Bridge 3770K (1 core).

[Figure 4.6 plots: Performance [GFLOPS/s] and Power Consumption [W] versus Arithmetic Intensity [flops/byte], LD CARM validation for Intel Skylake 6700K (1 Core | AVX SIMD DP), with L1→C (LD), L2→C (LD) and L3→C (LD) memory roofs and the FMA compute roof.]

(a) Performance LD AVX SIMD DP CARM validation.
(b) Power consumption LD AVX SIMD DP CARM validation.

Figure 4.6: Performance and power consumption LD AVX SIMD DP CARM validations for Intel Skylake 6700K (1 core).

By performing the same validation test on the Intel Skylake 6700K, the CARM validations presented in Figures 4.8a (performance) and 4.8b (power consumption) are obtained. In this system, for the 2LD/ST ratio, the tests did not achieve the L1 cache ridge point. However, this is mainly due to the possible micro-architectural limitations previously elaborated in Chapter 3. In fact, this demonstrates that CARM is able to characterize existing micro-architectural limitations, which are not taken into account by datasheets or other theoretical tools. Thus, by taking all these phenomena into account, CARM increases its reliability when characterizing real applications. A similar behavior occurs in the power consumption validation, where none of the ridge points is achieved. However, as observed in the memory benchmarks, the Intel Skylake power management works with more aggressive mechanisms, which makes it difficult to obtain a completely accurate relation between theory and experiments. Besides, while the performance model measures a known amount of memory transfers and flops (it is known how many instructions each test executes), the power consumption overlaps contributions from all pipeline components, which can reduce accuracy.

The results obtained demonstrate a good tool and benchmarking accuracy when performing the CARM validation, since high fitness values were obtained for Intel Ivy Bridge 3770K, where the L1 fitness surpassed 99%. On the other hand, there is also room for improvement, since the L3 cache validation did not reach the ridge point,


[Figure 4.7 plots: Performance [GFLOPS/s] and Power Consumption [W] versus Arithmetic Intensity [flops/byte], 2LD/ST CARM validation for Intel Ivy Bridge 3770K (1 Core | AVX SIMD DP), with L1→C, L2→C and L3→C (2LD/ST) memory roofs and the MAD compute roof.]

(a) Performance 2LD/ST AVX SIMD DP CARM validation for Intel Ivy Bridge 3770K (1 core).
(b) Power consumption 2LD/ST AVX SIMD DP CARM validation for Intel Ivy Bridge 3770K (1 core).

Figure 4.7: Performance and power consumption 2LD/ST AVX SIMD DP CARM validations for Intel Ivy Bridge 3770K (1 core).

[Figure 4.8 plots: Performance [GFLOPS/s] and Power Consumption [W] versus Arithmetic Intensity [flops/byte], 2LD/ST CARM validation for Intel Skylake 6700K (1 Core | AVX SIMD DP), with L1→C, L2→C and L3→C (2LD/ST) memory roofs and the FMA compute roof.]

(a) Performance CARM for Intel Skylake 6700K with 1 core.
(b) Power consumption CARM for Intel Skylake 6700K with 1 core.

Figure 4.8: CARM for AVX SIMD DP at nominal frequency.

although the results were always very close to it. Finally, the Intel Skylake 6700K validation for 2LD/ST demonstrated that CARM reflects possible micro-architectural bottlenecks, differently from theoretical models such as the ORM. Besides, the validation results for Intel Skylake 6700K demonstrate the necessity of tuning the tool specifically to this micro-architecture, since the enhancements in this processor, compared to Intel Ivy Bridge, seem to hinder obtaining better results.

4.3 Summary

In this chapter, several CARM extensions are proposed in order to take into account different computational and memory capabilities. Besides, these extensions can provide a more accurate characterization of real-world applications, by adapting the CARM extensions and processor capabilities to the application specifics. Finally, the state-of-the-art CARM is constructed by using the uppermost limits of the micro-architecture for the memory and computational roofs, based on the results obtained in the benchmarking chapter. It is demonstrated that this extension does not adapt to different memory and arithmetic throughputs, thus analyzing real-world applications in this model might be a challenging task, since these applications can use the most varied pipeline components during their execution. Next, the CARM performance and power consumption validations are presented for the Intel Skylake 6700K and Intel Ivy Bridge 3770K, for different processor capabilities. The obtained results show a close fit between the experimental


results and the theoretical model, achieving a fitness superior to 99% in some cases.


5. Application characterization and optimization in the proposed insightful models

In order to fully demonstrate the usefulness and insightfulness of the proposed models and CARM extensions,

in this chapter an in-depth experimental evaluation and analysis is performed on a real hardware platform (equipped

with the quad-core Intel Skylake 6700K processor) and on a set of different real-world applications by considering

all modeling domains, i.e., performance, power consumption and energy-efficiency. Initially, a case study for a

mini-application (Toypush) is presented, whose major execution hotspots are deeply analyzed and characterized

in the proposed models, in order to uncover the main sources of the execution bottlenecks. By following the

guidelines derived from the proposed models, a set of different optimization techniques was applied to the original code of each Toypush hotspot, in order to further improve its performance.

In addition to the Toypush mini-application, a set of real-world applications from the SPEC benchmark suite

[47] is also analyzed in the proposed extended CARMs and herein derived roofline models. This set of novel

and redefined general roofline models is investigated with the aim of addressing the shortcomings of existing approaches for insightful micro-architecture modeling and of allowing the characterization of a wide range of applications

that encapsulate different types of instructions and instruction mixes. In particular, these general roofline models

are based on the total number of instructions (not only FP instructions), in order to provide a deeper analysis

for applications whose execution is not necessarily dominated only by FP operations. As such, these models

provide a foundation to derive more general insightful micro-architecture models based on the fundamental roofline

modeling principles.
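The fundamental roofline relation underlying all of these models can be sketched as follows: the attainable performance at a given arithmetic intensity (AI) is the minimum of a compute roof and a memory bandwidth slope. The roof and slope values in the example are illustrative placeholders, not measured bounds of any platform used in this Thesis.

```python
def attainable_gflops(ai, peak_gflops, bandwidth_gb_s):
    """Attainable performance [GFLOP/s] at arithmetic intensity ai [flops/byte]."""
    # Below the ridge point the memory slope dominates; above it, the compute roof.
    return min(peak_gflops, bandwidth_gb_s * ai)

# Hypothetical 100 GFLOP/s compute roof and 50 GB/s memory slope:
ridge = 100 / 50                                  # ridge point at AI = 2 flops/byte
assert attainable_gflops(0.5, 100, 50) == 25.0    # memory-bound region
assert attainable_gflops(4.0, 100, 50) == 100     # compute-bound region
```

Each roof/slope pair (per memory level and per instruction type) defines one such curve; an application point is then compared against the curves of the model variant that matches its instruction mix.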

To verify the correctness of the application characterization provided by the proposed models, the Top-Down

analysis [1, 2] (see Section 2.1) was also performed and used to correlate the application position in the proposed

CARM extensions with the main execution bottlenecks pinpointed by the Top-Down analysis. In addition, to

better assess the impact of the proposed CARM extensions and their ability to provide more accurate application

characterization, the information provided in the proposed CARMs is compared with the state-of-the-art CARM

implementation, presented in Chapter 4. The results of this analysis corroborate the need for application-centric

micro-architecture modeling (the research topic specifically investigated in this Thesis) in order to further boost

the model insightfulness for a certain set of applications.

5.1 Experimental setup

The experimental evaluation was performed on a computing platform with a Linux CentOS 7.2.15.11, quad-

core Intel Skylake i7-6700K processor (operating at the fixed nominal frequency of 4.0 GHz) and 32 GB of DDR4

DRAM. All applications were compiled with Intel Compiler 17.0.4.196 and during the performed tests Intel

Turbo Boost, hyperthreading and hardware prefetching were disabled. For the analysis of real-world applications,

a set of loops/functions (kernels) for each application was carefully selected according to their impact on the


overall application performance, i.e., only the kernels with the highest impact on the total execution time are

analyzed, since they correspond to application hotspots with the biggest potential to increase the overall application

performance. Moreover, special attention is paid to guaranteeing that each application thread is bound to a single

core, in order to avoid context switching and diminish its impact on the accuracy of presented results. Furthermore,

to assess the application characteristics and to obtain the required measures to facilitate their representation in the

proposed CARM extensions (as well as to perform the Top-Down analysis), each application hotspot is manually

instrumented with Performance Application Programming Interface (PAPI) to obtain the measurements of the

hardware counters, i.e., the total number of retired FP, load and store operations, RAPL energy measurements, etc.
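As an illustration of how such counter totals translate into the coordinates used in the roofline charts, the following sketch derives performance and arithmetic intensity from retired-operation counts. The 8-byte access size assumes DP data, and the argument names are simplified stand-ins, not actual PAPI event names.

```python
def carm_point(fp_ops, loads, stores, elapsed_s, bytes_per_access=8):
    """Derive the (AI, GFLOP/s) coordinates of a hotspot from counter totals."""
    total_bytes = (loads + stores) * bytes_per_access
    ai = fp_ops / total_bytes            # arithmetic intensity [flops/byte]
    gflops = fp_ops / elapsed_s / 1e9    # performance [GFLOP/s]
    return ai, gflops

# E.g., 4e9 retired flops over 1e9 loads and 1e9 stores in one second:
ai, perf = carm_point(fp_ops=4e9, loads=1e9, stores=1e9, elapsed_s=1.0)
assert (ai, perf) == (0.25, 4.0)
```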

5.2 Evaluation methodology

For each considered application hotspot, the instruction distribution on a per-type basis is first assessed through binary instrumentation (i.e., by inspecting the assembly code). Special attention is paid to

decoupling the contribution of load and store operations (i.e., memory operations), FP instructions and all other

instruction types in the total number of retired instructions. This information is necessary to provide the preliminary

characterization of application behavior. For example, kernels mainly constituted by FP instructions are expected

to be limited by computations. On the other hand, kernels with a high percentage of memory operations in the total

number of retired instructions are expected to be more memory-bound, i.e., it suggests a very high probability that

the kernel will be positioned in the CARM memory-bound region.

By relying on the information provided by the binary instrumentation and the distribution of different instruc-

tion types, further analysis is performed in order to derive the necessary information to guide the selection of one

of the proposed CARM extensions, which better suits the characteristics of the considered application hotspot, i.e.,

to better correlate application behavior and micro-architecture capabilities. In this respect, the instruction mix of

each hotspot is analyzed in order to determine: i) the predominant instruction type used in the application hotspot,

i.e., scalar operations or SIMD ISA extensions (AVX or SSE); ii) the exact ratio of load/store operations; and iii)

the precision considered for arithmetic operations, i.e., Single Precision (SP) or DP operations. By combining this information, one of the proposed CARM extensions is selected. For example, for an application hotspot that mainly

relies on SSE operations, has very low load/store ratio and uses SP arithmetics, the CARM variant that corresponds

to SSE SP FP computations and store bandwidth will be selected, i.e., SSE ST SP FP CARM. It is worth emphasizing that the integral version of CARM, as proposed in [3], mainly considers the performance upper-bounds of

the micro-architecture for AVX 2LD+ST FP DP operations, while state-of-the-art CARM seems to focus on mod-

eling the absolute maximums of the architecture for AVX SIMD DP instructions (see Section 4.1.1). Furthermore,

different hotspots within a single application may expose different requirements, thus imposing the use of different

CARM extensions when characterizing their behavior.
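As a sketch, the selection logic described above can be expressed as follows; the 0.5 threshold on the load/store ratio and the exact naming are assumptions of this illustration, not fixed rules of the model construction.

```python
def select_carm_variant(isa, precision, loads, stores):
    """Pick a CARM extension from a hotspot's instruction mix (illustrative)."""
    if stores == 0:
        mem = "LD"                # load-only hotspot
    elif loads / stores < 0.5:
        mem = "ST"                # store-dominated hotspot
    else:
        mem = "2LD/ST"            # mixed accesses, close to the L1 port ratio
    return f"{isa} {mem} {precision} FP CARM"

# The example above: SSE SP arithmetic with a very low load/store ratio (0.27):
assert select_carm_variant("SSE", "SP", loads=27, stores=100) == "SSE ST SP FP CARM"
```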

In order to provide an in-depth analysis, the Top-Down method [1, 2] is applied to each of the kernels, determining the main bottlenecks that limit their performance. As described in Section 2.1, the Top-Down method decouples

different micro-architecture bottlenecks in a hierarchical way, dividing them into four main categories: frontend,

bad speculation, retiring and backend (sum of core and memory contributions). Frontend and bad speculation

bottlenecks are mainly connected with the in-order part of the CPU, reflecting backend starvation and branch misprediction effects, respectively. In contrast, retiring and backend bound are related to the out-of-order part of the

pipeline. While retiring evaluates the total number of instructions per clock retired by the processor, the backend

metric defines whether the main execution bottleneck belongs to the core (the utilization of execution ports) or to the

memory hierarchy components.
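A minimal sketch of this top-level breakdown, assuming the standard slot-based formulation of the Top-Down method [1, 2] with four pipeline slots per cycle; the counter names are simplified stand-ins for the actual hardware events.

```python
SLOTS_PER_CYCLE = 4  # issue/retire slots per cycle on modern Intel cores

def top_down_level1(clocks, slots_not_delivered, uops_issued,
                    uops_retired, recovery_cycles):
    """Top-level Top-Down metrics as fractions of the total pipeline slots."""
    slots = SLOTS_PER_CYCLE * clocks
    frontend = slots_not_delivered / slots
    bad_spec = (uops_issued - uops_retired
                + SLOTS_PER_CYCLE * recovery_cycles) / slots
    retiring = uops_retired / slots
    backend = 1.0 - frontend - bad_spec - retiring
    return {"frontend": frontend, "bad_speculation": bad_spec,
            "retiring": retiring, "backend": backend}

m = top_down_level1(clocks=1000, slots_not_delivered=200, uops_issued=3000,
                    uops_retired=2800, recovery_cycles=25)
assert abs(sum(m.values()) - 1.0) < 1e-9 and m["retiring"] == 0.7
```

The four categories always sum to one; the backend component can then be split into its core and memory bound contributions at the next level of the hierarchy.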

By correlating the Top-Down analysis with the application characterization provided by CARM, one can ex-

pect that a kernel with high retiring and/or core bound component should be positioned closer to the roofs corre-

sponding to the utilization of components inside the core engine (i.e., computation roofs and/or bandwidth slopes

corresponding to a set of private caches). On the other hand, a kernel with a high memory bound component (especially in the DRAM), together with lower retiring and core bound metrics, is expected to be positioned closer

to the CARM roofs corresponding to the DRAM bandwidth. This methodology is also confirmed by the Top-Down

analysis provided for memory and FP benchmarks (see Figures 3.7 and 3.14 in Chapter 3), since core and retirement are predominant factors for the performed FP benchmarks and when evaluating the bandwidth upper-bounds

for cache levels, while memory bound contribution increases when the data is fetched further away from the core.

This evaluation methodology is applied and verified when fully characterizing the behavior of an FP mini-application (Toypush), as well as a set of real applications, in particular applications from the SPEC benchmark suite [47]. Since some of the considered SPEC benchmarks contain a huge diversity of instructions

in their instruction mix, as well as a very low amount of FP instructions, the insights derived from CARM may

be insufficient to fully characterize them. Hence, in these scenarios, a complementary and novel method for

roofline modeling is derived and applied herein, where the applications are plotted in a more generic roofline model

oriented to the total processor throughput, which relates the ratio of compute instructions (or the total amount of

instructions) and memory instructions with the upper-bound capabilities of the out-of-order CPU engine in terms

of the amount of instructions that can be retired in a single clock cycle.

5.3 Case Study: Toypush mini-application

Toypush is a single-threaded Fortran application that performs a particle-in-cell push computation [21]. This application consists of three main hotspots (kernels), namely: b_interpol (kernel 1), e_interpol (kernel 2) and eom_eval (kernel 3). For each of these kernels, the instruction distribution on a per-type basis is presented in Figure

5.1a, where the contribution of different instruction types is decoupled according to their share in the total amount

of retired instructions. As presented in Figure 5.1a, Toypush kernels involve a substantial diversity of instruction

types, such as FP DP Scalar and SSE instructions, load and store operations, as well as integer, control (branches)

and other type of instructions. In particular, the “others” category includes all the instructions that do not fit in any

of the remaining categories, such as move and conversion operations.

As it can be observed in Figure 5.1a, kernel 1 mostly uses SSE DP instructions and it is very balanced between

the number of FP instructions (34%) and memory accesses (36%). On the other hand, kernels 2 and 3 are completely dominated by FP DP scalar and FP DP SSE instructions, respectively. Furthermore, load/store ratios vary

among different kernels. While kernels 2 and 3 have a ratio of 3.7 and 3.6, respectively, kernel 1 ratio is only 0.27.

As such, kernel 1 should be plotted in SSE DP FP ST CARM, kernel 2 in scalar DP FP 2LD/ST model (it uses

scalar instructions) and kernel 3 in SSE DP FP 2LD/ST model. As a result, it is expected for kernels 2 and 3 to


(a) Toypush instruction distribution. (b) Toypush Top-Down metrics.

Figure 5.1: Toypush instruction mix (5.1a) and Top-Down metrics (5.1b).

be characterized as more compute-bound in their CARM extensions, provided that the AI of these kernels allows

hitting the compute bound roofs. On the other hand, since kernel 1 is very balanced between memory instructions

and computations, its position within the CARM highly depends on the accessed memory level, as well as on its

AI. For example, if the accesses are mostly performed in the L1 cache, then the kernel 1 might be compute bound

(or dominated by L1 accesses). However, if the memory transfers are mainly served by DRAM, the performance

of kernel 1 will be significantly lower, thus its overall performance can be limited by this memory level.

Further characteristics of Toypush kernels can be assessed from their Top-Down analysis, which is presented

in Figure 5.1b. As it can be observed, kernel 1 is mainly limited by memory (73.3%), in particular by the stores,

which corroborates the conclusions derived from its instruction distribution, i.e., it must be plotted in ST CARM

extension. Since the high store-bound nature is typically coupled with a low port utilization, the performance of

this kernel is expected to be quite low, and probably limited by DRAM. On the other hand, kernels 2 and 3 are

mainly limited by retiring (69.5% and 63.6%, respectively) and, since the remaining bottlenecks are marked as

core-bound, it is expected that the performance of those kernels is mainly limited by computations or cache levels

closer to the core (e.g., L1 and L2). This Top-Down analysis also corresponds to the observations made about their

instruction mixes, which are dominated by FP instructions, thus it is expected that this type of instructions has a

bigger impact in the performance of kernels 2 and 3.

To confirm this analysis, Toypush is characterized with the state-of-the-art CARM, whose output is presented in

Figure 5.2a. To simplify visualization of the kernels, the less important hotspots are hidden and only the three main

kernels are presented. Each of the kernels is represented by a single point within roofline chart and identified by

the respective number.

As shown in Figure 5.2a, kernel 1 is completely limited by DRAM, due to the huge amount of stores that

are inefficiently performed. However, since it lies below the DRAM bandwidth roof, this may also suggest memory

latency as the main bottleneck. In addition, kernels 2 and 3 do not match the previous analysis and observations

derived from their instruction mix and Top-Down analysis. State-of-the-art CARM places these kernels between

the DRAM and L3 roofs, meaning that their performance could be pushed down due to the inefficient accesses to

the L3 and DRAM levels. However, their instruction mix is mainly dominated by computation FP instructions,

thus the memory accesses should not have a major impact on the performance of the kernels. Besides, the Top-Down

method characterizes both kernels 2 and 3 as completely bound by retiring, which typically does not correspond

to any inefficient access to the off-core memory levels. In fact, it is expected that the kernels 2 and 3 are either

compute-bound or bound by memory levels closer to the core.


(a) State-of-the-art CARM: Toypush main kernels. (b) SSE DP ST CARM: Toypush Kernel 1 (proposed). (c) Scalar DP 2LD/ST CARM: Toypush Kernel 2 (proposed). (d) SSE DP 2LD/ST CARM: Toypush Kernel 3 (proposed).

Figure 5.2: CARM characterization of main Toypush kernels in Intel Skylake 6700K.

To better understand some of these inconsistencies in state-of-the-art CARM, each of these Toypush kernels

is plotted within the respective extended CARM proposed in this Thesis, i.e., the SSE DP ST model for kernel

1, scalar 2LD/ST model for kernel 2 and SSE DP 2LD/ST model for kernel 3. For kernel 1 (see Figure 5.2b),

the ST model characterizes it as completely DRAM bound, which partially matches the characterization provided

in the state-of-the-art CARM. However, in this case, the distance between the kernel 1 point and the DRAM

roof is much smaller, suggesting that DRAM latency is not a bottleneck. Regarding kernel 2, whose corresponding

CARM characterization is presented in Figure 5.2c, it can be noticed that its characterization changes drastically.

In the herein proposed model, the kernel 2 performance is now much closer to the scalar add roof, which clearly

reveals its expected compute-bound nature, correlating with its instruction mix and Top-Down insights. Finally,

kernel 3 characterization provided by the model proposed in this Thesis drastically differs from the one observed

in the state-of-the-art CARM, as depicted in Figure 5.2d. In the proposed model, kernel 3 is bound by

computations, as expected from the previously elaborated Top-Down and instruction distribution analysis. As a

result, the characterization of kernels 2 and 3 showcases the importance of the proposed CARM extensions in this

Thesis and paves the way towards application-centric insightful micro-architecture roofline modeling, in order

to provide more precise application characterization and derivation of more accurate optimization hints.

5.3.1 CARM-guided application optimization example

As previously mentioned, the Top-Down analysis indicates a high retiring contribution for the scalar and SSE Toypush

kernels (kernels 2 and 3), which are characterized in the proposed models as bound by scalar and SSE computations, respectively. From the presented kernel characterizations in the proposed models (see Figures 5.2c and

5.2d), one can derive valuable optimization hints on how to further improve their performance. For example,

for this set of compute bound kernels, the performance can be increased by relying on the more advanced ISA


(a) Toypush optimization Top-Down metrics. (b) CARM model: Kernel 1 of Toypush optimization characterization in Intel Skylake 6700K. (c) CARM model: Kernel 2 of Toypush optimization characterization in Intel Skylake 6700K. (d) CARM model: Kernel 3 of Toypush optimization characterization in Intel Skylake 6700K.

Figure 5.3: CARM model: Toypush optimization characterization in Intel Skylake 6700K.

extensions, i.e., AVX in both kernels. In order to optimize kernel performance, the AVX compiler flag ("-mavx") is used, forcing the utilization of AVX instructions. However, while kernels 1 and 3 improved their performance, kernel 2 did not change due to data dependencies, which were preventing vectorization. Thus, to solve this problem, loop unrolling was applied to kernel 2.

Top-Down metrics are also computed for the optimized kernels (see Figure 5.3a), in order to assess possible changes in their behavior. Regarding kernel 1, although it continues to be bound by memory (69.1%), there was a reduction in the contribution of the stores (53.7%). Thus, its performance is expected to be slightly above the DRAM roof. On the other hand, kernels 2 and 3 maintain the same behavior, i.e., bound by retiring and core,

thus even after applying the set of optimizations, these kernels should still continue to be limited by computations.

For each optimized kernel, the respective characterization in the proposed extended CARMs is provided in

Figures 5.3b (kernel 1), 5.3c (kernel 2) and 5.3d (kernel 3). Note that kernel 1 is now plotted in the AVX ST model

and kernels 2 and 3 in AVX 2LD/ST CARM. Since no changes to the algorithm were introduced, all kernels

maintain the same load/store ratio, even when applying more advanced ISA extensions (AVX). In these figures, the performance of the original kernels is also plotted, in order to better assess the achieved performance improvement.

By applying the above-mentioned optimization techniques, all three kernels of the Toypush application achieved a

substantial improvement in their performance. As it can be observed, the applied optimizations barely affect the AI

of different kernels due to the fact that CARM observes the data traffic from the core perspective and considers the

true arithmetic intensity, which is a property of the algorithm itself. However, by improving the performance of

different kernels, their characterization points moved along the y-axis towards the regions of higher performance.

As a consequence, the potential limiting factors for attaining better performance have also changed for the


Table 5.1: Performance and arithmetic intensity of Toypush kernels before and after optimization.

                        Before Optimization     After Optimization
Application             GFLOP/s     AI          GFLOP/s     AI
Toypush   Kernel 1      0.89        0.12        2.64        0.11
          Kernel 2      6.27        0.27        40.33       0.29
          Kernel 3      11.64       0.33        27.74       0.38

optimized kernels, when compared to their original versions. For example, kernel 1 (Figure 5.3b) is now bound by the L3 cache, while its unoptimized version was DRAM-bound. Kernel 2 (Figure 5.3c) almost reaches the maximum sustainable performance of the architecture by approaching the DP Vector FMA roof, while it was bound by

the Scalar ADD Peak performance before applying any optimizations. The optimized version of kernel 3 (Figure

5.3d) resides very near the DP Vector ADD Roof. It is worth noting that even for the optimized Toypush kernels,

the performance characterization in the proposed set of CARMs is in accordance with their previously elaborated

Top-Down analysis.

Table 5.1 summarizes the experimentally obtained performance improvements on a per kernel basis, before

and after applying the kernel optimizations. As previously mentioned, no significant changes can be observed in the

AI of different kernels when comparing their optimized and non-optimized versions. However, depending on the

kernel type, different performance gains were achieved. In particular, through code vectorization, the performance of kernels 1 and 3 is improved by 3.15 and 2.38 times, respectively. However, the highest performance gain was achieved for kernel 2, whose performance was improved by 6.43 times when compared to the unoptimized version.

For kernels 2 and 3, the performance increase is explained by the use of AVX instructions, which allow retiring a higher number of flops per cycle, while in kernel 1 the main factor is the reduced impact of the store instructions on the overall performance.

5.4 Characterization of real-world applications in the proposed models

In order to confirm the usability of the proposed CARM extensions when characterizing the behavior of real-world

applications, the previously elaborated evaluation methodology (also used for Toypush analysis) is applied herein

to characterize FP benchmarks from the SPEC benchmark suite [47]. In these applications, the main hotspots were

identified by relying on the analysis provided by the Intel Advisor. These hotspots were subsequently instrumented

with PAPI calls to obtain the relevant measures from the hardware performance counters, as well as the energy

consumption by relying on the RAPL facility (see Section 5.2). For each considered application hotspot, the

respective instruction distributions and load/store ratios were obtained, which allowed the kernels of these applications to be classified into three categories according to the used CARM extension, namely: SP scalar LD CARM, DP

scalar 2LD/ST CARM and DP scalar LD CARM. The kernels from bwaves, zeusmp, cactusADM, gemsFDTD,

tonto, leslie3D and lbm are plotted in the DP scalar LD model. The main hotspots from milc, gromacs, soplex,

gamess and calculix are plotted in DP scalar 2LD/ST CARM extension. Finally, in the SP scalar LD CARM only

wrf kernels are plotted.


Similarly to the Toypush evaluation, Top-Down metrics and instruction distribution are analyzed on a per

application kernel basis. However, the instruction mix of the considered FP SPEC benchmarks (as in the case of any real application with substantial complexity) contains a huge diversity of instructions, which makes the analysis based on instruction types and mixes quite challenging. For this reason, the instruction distribution for each

kernel is divided into only two main components: memory instructions (i.e., load and store instructions) and non-memory instructions (typically referring to instructions performed inside the core execution engine). This decouples whether the applications mainly use memory ports or other execution ports in the CPU pipeline, thus allowing their behavior within the roofline chart to be predicted. As before, the Top-Down analysis is used herein

as a complementary characterization strategy when verifying the insightfulness of the proposed CARM extensions,

which are also compared with state-of-the-art CARM characterization, in order to identify possible inconsistencies

(as in the case of kernels 2 and 3 in Toypush).

Moreover, some of the most representative kernels are also evaluated in the proposed CARM extensions for

power consumption and energy-efficiency domains, thus allowing a full characterization of their execution on modern multi-core CPUs. Finally, due to the high diversity of instructions in their instruction mix,

some application kernels are plotted in a throughput-oriented roofline model, which is specifically developed in

the scope of this Thesis and relates the ratio between the amount of computations (or the total number of retired

instructions) and memory transfers instructions with the maximum capability of the CPU engine in terms of the

amount of instructions that it can retire per clock cycle.

5.4.1 Application characterization in the SP Scalar LD CARM extension

As previously mentioned, the SP Scalar LD CARM extension is only applicable to the wrf application kernels,

according to their characteristics. By performing the analysis in the Intel Advisor, three main kernels are detected.

From the instruction distribution of each wrf kernel, presented in Figure 5.4a, it can be observed that the memory part of the mix is dominated by load instructions, with around 35% of overall contribution for all kernels. However, as it can be seen, the kernels are mainly constituted by instructions

that are not memory related, i.e., that use the dispatch ports 0, 1, 5 and 6 (see Figure 2.1 in Chapter 2). Furthermore,

from the corresponding Top-Down analysis (see Figure 5.4b), the main application bottlenecks can be derived. For

all kernels, the main bottleneck is retiring, which almost completely dominates the Top-Down analysis. In kernels 1 and 2, the retiring contribution surpasses 70%, while it is around 60% for kernel 3. Besides, there is

a balance between core bound and memory bound metrics, which might influence the performance of kernels,

depending on the accessed memory level. However, due to the high retiring, all the kernels should be positioned closer to the private cache or computation roofs in the CARM chart, since a high retiring rate is only achievable in

these conditions. In particular, since kernel 3 has slightly increased utilization of the L1 cache, its performance is

expected to be a bit higher than the one achieved for kernels 1 and 2.

As presented in Figure 5.5a, the characterization provided by the state-of-the-art CARM completely differs

from what is expected from the Top-Down analysis. As it can be observed, all the points are characterized as

strictly DRAM bound, although kernel 3 performance is slightly above kernels 1 and 2, due to the use of L1 cache.

Hence, there is no clear correlation between Top-Down analysis and state-of-the-art CARM characterization. In

contrast, by plotting the kernels in the herein proposed CARM extension (see Figure 5.5b), all the kernels are


limited by private caches, i.e., kernels 1 and 2 by the L2 cache and kernel 3 by the L1 cache, which corresponds to the

predicted behavior when analyzing the obtained Top-Down results.

Furthermore, since these kernels have a big diversity of instructions, a novel throughput-oriented approach to

roofline modeling is also investigated in the scope of this Thesis, which provides the means for re-defining the

roofline model in general by focusing on fundamental execution principles in the CPU pipeline that are not strictly tied to its upper-bound capabilities for FP arithmetic. As presented in Figure 5.6, in the x-axis, instead of

relating the amount of flops and transferred bytes, the amount of non-memory instructions (herein referred to as COMPS) over the number of memory instructions (MOPS) is now considered. In the y-axis, the performance

is now given by the amount of non-memory instructions executed per clock (COMPS/CLK), i.e., the retirement

rate of the instructions that originate from the dispatch ports 0, 1, 5 and 6. Hence, the horizontal lines in the

herein proposed model correspond to different amounts of COMPS that can be retired per clock. Since there

is a total of four ports to dispatch these instructions, the processor can deliver a maximum of 4 COMPS/CLK,

representing the peak performance of the processor. To be precise, in the Intel Skylake micro-architecture, a

maximum of 4 instructions can be retired at any given clock cycle regardless of the originating port [1, 45].

The remaining horizontal roofs represent lower retirement rates, i.e., 3, 2 and 1 COMPS/CLK, respectively.

Moreover, the sloped lines in the new model represent the throughput of each memory level, i.e., the number

of memory instructions performed per cycle (MOPS/CLK). Similarly to the original CARM, the throughput varies across different memory levels, and it decreases as the data is fetched further away from the core. In particular,

the maximum throughput between the core and the L1 cache is attained when the 2LD/ST mix is used and is equal to

3 MOPS/CLK, while the minimum L1 throughput is obtained when using only ST instructions (1 MOP/CLK).

Thus, the memory region of the proposed model clearly corresponds to the CARM memory bandwidth. Besides, since

performance is now measured as the amount of COMPS executed per cycle, the model is denominated herein as

COMPS CARM.
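Under the definitions above, a COMPS CARM roof can be sketched as follows; memory throughput values are in MOPS/CLK, and the numbers in the example reuse the L1 2LD/ST figure quoted above.

```python
RETIRE_CAP = 4  # maximum instructions retired per clock on Skylake [1, 45]

def comps_roof(comps_per_mops, mem_mops_per_clk):
    """Attainable COMPS/CLK at a given COMPS/MOPS ratio."""
    # The memory slope sustains mem_mops_per_clk MOPS/CLK, hence
    # mem_mops_per_clk * (COMPS/MOPS) COMPS/CLK, capped by the retire width.
    return min(RETIRE_CAP, mem_mops_per_clk * comps_per_mops)

# L1 with the 2LD/ST mix sustains 3 MOPS/CLK:
assert comps_roof(0.5, 3) == 1.5   # memory-bound region of the chart
assert comps_roof(10, 3) == 4      # limited by the 4-wide retirement
```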

It is worth emphasizing that the COMPS CARM presented in Figure 5.6 follows a similar approach to the one used

to select CARM extensions, i.e., it is constructed for a specific load/store ratio and instruction type as required

by the wrf kernels (which are represented with “cross” symbols in Figure 5.6). When compared with the CARM extension presented in Figure 5.5b, there are slight differences in application characterization. While in the CARM extension the kernels are bound by L2 (kernels 1 and 2) and L1 (kernel 3), in the COMPS CARM all the kernels are

(a) Instruction distribution (Loads, Stores, Others). (b) Top-Down metrics.

Figure 5.4: Instruction distribution and Top-Down analysis for SP Scalar LD applications.

59

Page 78: Performance and Energy-E ciency Modelling for …...Ao longo dos ultimos anos, o aumento das necessidades computacionais das aplicac¸´ oes actuais, provocou um˜ aumento da complexidade

10-1

100

101

10-2 10-1 100 101

Per

form

anc

e [G

FL

OP

S/s

]

Arithmetic Intensity [flops/byte]

wrf_1wrf_2wrf_3

L1→C

L2→C

L3→C

DRAM→C

FMA

ADD/MUL

(a) State-of-the-art CARM characterization.

10-1

100

101

10-2 10-1 100 101

Per

form

anc

e [G

FL

OP

S/s

]

Arithmetic Intensity [flops/byte]

wrf_1wrf_2wrf_3

L1→C

L2→C L3→

C

DRAM→C

FMA

ADD/MUL

(b) SP Scalar LD CARM characterization.

Figure 5.5: Application characterization within state-of-the-art CARM and proposed SP Scalar LD extension.

0

1

2

3

4

5

0 5 10 15 20

Per

form

anc

e (C

OM

PS

/CL

K)

COMPS/MOPS

wrf_1 wrf_2 wrf_3

L1→C

L2→

C

L3→C

DRAM→C

RET = 4

RET = 3

RET = 2

RET = 1

Figure 5.6: Application characterization with SP Scalar LD COMPS CARM.

limited by the compute roof that corresponds to the retirement rate of 2 COMPS per cycle. However, in contrast to the CARM, which analyzes only FP instructions, the COMPS CARM takes into account all retired instructions when identifying the primary bottlenecks. Hence, the COMPS model is a good complement to the FP CARM, since it allows the application bottlenecks to be visualized globally (similarly to the Top-Down method), while the FP CARM can more precisely describe the execution bottlenecks and provide further optimization hints. In particular, the COMPS CARM can be used to distinguish which roofs in the FP CARM should be considered as the primary source of execution bottlenecks, especially in situations when the application point is positioned between several bandwidth slopes and horizontal roofs.

As in the FP CARM, it is also possible to directly correlate the Top-Down analysis with the COMPS CARM characterization. As previously stated, all three kernels are bounded by retiring (kernels 1 and 2 around 70% and kernel 3 with approximately 60%). As can be observed in Figure 5.6, the COMPS CARM characterizes all kernels as bound by retiring. However, the performance of kernel 3 is slightly lower than that of kernels 1 and 2, due to its lower retirement contribution when compared to the other kernels.

5.4.2 Application characterization in the DP Scalar 2LD/ST CARM extension

Regarding the application kernels plotted in the proposed DP Scalar 2LD/ST CARM extension, their instruction

distribution (see Figure 5.7a) is mainly dominated by memory instructions, which typically surpass 60%. The only exception is kernel 2 from the gamess application, i.e., gamess2, where 60% of the instructions are not memory related.

Hence, it is expected that the gamess2 kernel is limited by retiring in the Top-Down characterization, while the remaining kernels may be limited by memory, depending on the accessed memory level.

[Figure 5.7: Instruction distribution and Top-Down analysis for DP Scalar 2LD/ST applications (milc, gromacs, soplex, gamess, calculix). (a) Instruction distribution; (b) Top-Down metrics.]

[Figure 5.8: Application characterization within state-of-the-art CARM and proposed DP Scalar 2LD/ST extension: (a) state-of-the-art CARM characterization; (b) DP Scalar 2LD/ST CARM characterization.]

However, by analyzing the results from the corresponding Top-Down analysis, presented in Figure 5.7b, not all kernels obey the expected behavior. While the Top-Down characterization of the milc, soplex and gamess2 kernels agrees with the instruction distribution (milc and soplex are mainly memory bound, while gamess2 is bounded by retiring), gromacs and calculix are limited by retiring and core, despite being dominated by memory instructions. However, the gromacs and calculix kernels have a considerable amount of memory accesses served by the L1 cache, which may diminish the memory bound impact and increase the retiring and core bound contributions. Hence, in the CARM characterization, the milc, soplex and gamess2 kernels should be limited by DRAM or between DRAM and the shared cache (L3 cache), while the remaining hotspots should be characterized as bounded by computational roofs or cache levels closer to the core.

As can be observed in the state-of-the-art CARM chart presented in Figure 5.8a, the milc and soplex kernels are completely bound by DRAM. However, both milc and soplex kernels also contain some contribution from retiring and core bound, which means they should not be completely bound by DRAM, but should instead attain a performance slightly higher than the one delimited by the DRAM slope (i.e., they should be positioned between L3 and DRAM). This effect can also be observed in the Top-Down analysis provided for the micro-benchmarks that were performed to evaluate the bandwidth upper-bounds of the micro-architecture for different memory levels (see Figure 3.14 in Chapter 3). In detail, when memory accesses are served by DRAM, the core bound and retiring contributions are basically zero. Furthermore, since these kernels contain a significant amount of memory transfers, achieving the compute bound roof should be almost impossible. However, the state-of-the-art CARM hints that it might be possible

61

Page 80: Performance and Energy-E ciency Modelling for …...Ao longo dos ultimos anos, o aumento das necessidades computacionais das aplicac¸´ oes actuais, provocou um˜ aumento da complexidade

11121314151617181920

10-3 10-2 10-1 100 101 102

Po

wer

Co

nsu

mp

tio

n [

W]

Arithmetic Intensity [flops/bytes]

milc_1milc_2

gromacs_1soplex_1soplex_2

L1→C L3→

C

DRAM→C

FMA

Figure 5.9: Application characterization with SP Scalar LD COMPS CARM.

to improve the performance of these kernels until hitting the scalar ADD roof.

In contrast, in the CARM extension proposed herein, presented in Figure 5.8b, the milc kernels are placed between the L3 cache and DRAM, as expected according to the previously provided analyses. The soplex kernels are also closer to DRAM, despite not being on top of the DRAM roof, which may suggest memory latency as a potential bottleneck. However, as expected from the instruction distribution of each kernel, the characterization provided by the proposed DP Scalar 2LD/ST CARM extension hints that the performance of the milc and soplex kernels can only be boosted up to the L1 cache roof, and never to the computational roof. Hence, when compared to the state-of-the-art CARM, the characterization in the proposed CARM extension provides more accurate hints, according to the predominant type of instructions in the kernel instruction mix (instruction distribution) and the Top-Down analysis.

Regarding the gromacs and gamess2 kernels, the state-of-the-art CARM evaluation places their performance between the L3 cache and DRAM. However, according to the Top-Down analysis, these kernels should be limited by private caches or computations, which does not strictly correspond to the state-of-the-art CARM characterization. By plotting these kernels in the proposed CARM extension presented in Figure 5.8b, it can be observed that they are positioned closer to the L3 cache roof. Although this is an improvement when compared to the state-of-the-art CARM evaluation, this characterization still does not fully match the previously elaborated expectations from the Top-Down analysis, i.e., that the kernels should be bounded by private caches or computational roofs. However, in contrast to the other kernels, for the gromacs and gamess2 kernels the Top-Down analysis detects a significant contribution from bad speculation and frontend bound, respectively. Since these metrics are connected to branch misprediction and backend starvation, respectively, their presence signals that the application might achieve a lower performance (than the one previously expected), which also influences their characterization in the proposed CARM extension.

Finally, the calculix kernel is characterized as L3 cache bound in the state-of-the-art CARM. However, according to the Top-Down analysis, its main bottlenecks are related to core ports and retiring, thus the performance of this kernel should be limited by private caches or computations. This is confirmed by the kernel characterization in the proposed DP Scalar 2LD/ST CARM, which places this hotspot on top of the L2 cache roof, thus matching the insights provided by the Top-Down analysis. In contrast, the guidelines derived from the state-of-the-art CARM seem to be less accurate, since it characterizes the calculix kernel as L3 bound.

In order to obtain more insights regarding the application behavior, the milc, gromacs and soplex kernels are also plotted in the proposed CARM extension for power consumption, as presented in Figure 5.9. This analysis aims at uncovering the bottlenecks of these kernels from the perspective of the power consumed by the components exercised during their execution, i.e., at pinpointing the power consumption bottlenecks.

[Figure 5.10: Power consumption characterization methodology: (a) between the L1 and L2 caches; (b) between the L2 and L3 caches; (c) between the L3 cache and DRAM; (d) below DRAM. Axes: Power Consumption [W] vs Arithmetic Intensity [flops/bytes].]

Regarding the gromacs characterization in the proposed CARM extension for power consumption, it completely correlates with the performance model characterization, i.e., the kernel is placed between the DRAM and L3 cache roofs. It is important to notice that the extended power consumption CARM cannot be interpreted in exactly the same way as the performance model, i.e., by looking directly at the roofs above the plotted dot. In order to better depict the interpretation methodology for the power consumption CARM and how it can be correlated with the performance characterization, several different examples of power CARM analysis are provided in Figures 5.10a, 5.10b, 5.10c and 5.10d. These examples include the power consumption CARM where specific areas are emphasized to cover a range of possible positions of the kernel in the performance CARM, i.e., the kernel performance characterization. As can be observed, when the kernel performance lies between two cache levels, all the area between the corresponding power consumption curves should be taken into account (see arrows in Figure 5.10). While between the L1 and L2 caches (see Figure 5.10a) and between the L3 and L2 caches (see Figure 5.10b) there is very little room for inconclusive characterization between the performance and power consumption domains, special attention should be paid when analyzing hotspots that are placed between the L3 and DRAM lines. As can be observed in Figure 5.10c, the highlighted area also includes the L2 and L1 curves. However, if the kernel performance is placed between L3 and DRAM, its power consumption bottleneck should not be strictly attributed to the L1 or L2 caches, even if the characterization point is positioned on top of those lines. This evaluation scenario is clearly depicted for gromacs, whose kernel is characterized between DRAM and the L3 cache; thus, its power consumption can even be lower than the one for the L1 and L2 caches.
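The interpretation rule above — take the whole band between the power curves of the two levels that bound the kernel's performance, and never attribute an L3-DRAM kernel to the L1/L2 curves even if its point lands on them — can be captured in a small helper. This is an illustrative sketch of the methodology, not tooling from this Thesis; the only input is the ordering of the memory hierarchy.

```python
# Memory levels ordered from closest to farthest from the core.
MEMORY_LEVELS = ["L1", "L2", "L3", "DRAM"]

def bounding_power_curves(upper_level, lower_level):
    """Given the two memory levels that bound a kernel in the performance
    CARM, return the power curves that delimit the valid region in the
    power consumption CARM. Intermediate curves (e.g., L1/L2 inside the
    L3-DRAM band of Figure 5.10c) must not be read as the power
    bottleneck, even when the kernel point touches them.
    """
    hi = MEMORY_LEVELS.index(upper_level)
    lo = MEMORY_LEVELS.index(lower_level)
    assert hi < lo, "upper_level must be closer to the core"
    return (upper_level, lower_level)

# gromacs: performance between L3 and DRAM, hence only those two power
# curves bound its power consumption characterization.
print(bounding_power_curves("L3", "DRAM"))  # ('L3', 'DRAM')
```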

For kernels with performance close or below the DRAM line, the main area to analyze in power consumption


model should also be below or near the DRAM power curve, as shown in Figure 5.10d. This can be observed for the milc and soplex kernels, whose power consumption characterization is positioned below the DRAM roof. Although their power consumption characterization does not completely match the performance one, it should be noted that, in contrast to the performance model that focuses only on FP operations, the proposed power consumption CARM extension encapsulates many different effects that occur in the CPU pipeline during the application execution, as well as the power contributions from different components and instruction types (not only FP operations). Nevertheless, the performance and power characterizations in the proposed models are largely consistent with each other, e.g., there are no situations where a kernel bounded by L1 in performance is bounded by DRAM in power consumption.

5.4.3 Application characterization in the DP Scalar LD CARM extension

The remaining set of kernels from the FP SPEC benchmarks, i.e., the main hotspots from bwaves, zeusmp, cactusADM, gemsFDTD, tonto, leslie3D, gamess and lbm, is plotted in the DP Scalar LD CARM extension, according to the kernels' initially assessed characteristics based on the predominant type of FP operations and load/store ratios. Due to the large number of kernels in these SPEC benchmarks, their analysis is separated into two different groups (batches), namely: i) Group 1, with kernels from bwaves, zeusmp, cactusADM and gemsFDTD; and ii) Group 2, with kernels from the leslie3D, tonto and lbm benchmarks.

For the kernels belonging to the first group (batch), i.e., the kernels from bwaves, zeusmp, cactusADM and gemsFDTD, the previously referred methodology was applied in order to assess the instruction distribution on a per-kernel basis, as presented in Figure 5.11a. As can be observed, all kernels are mainly dominated by non-memory instructions, which surpass 60% for all hotspots. Due to this observation, the Top-Down characterization is expected to identify the retiring component as the main bottleneck for all kernels, although the retirement contribution may be reduced depending on the accessed memory level. In fact, the respective Top-Down analysis presented in Figure 5.11b confirms this assumption, since retiring dominates the Top-Down analysis across all kernels. Furthermore, the memory accesses should also impact the performance of these kernels, since their contribution exceeds 20% in almost every kernel. The only exceptions are the cactusADM and gamess1 kernels, where the memory contribution is approximately 4% and 1%, respectively. Due to the dominance of the retiring component, together with the core bound contribution, all the kernels are expected to be limited by private caches or computational roofs in the CARM plot. For example, gamess1 memory accesses are mainly served by the L1 cache, thus its performance can even be characterized between the L1 and L2 roofs. However, if memory accesses are mainly served by the L3 cache or DRAM, the performance of these kernels can be lower than expected.

By plotting the kernels in the state-of-the-art CARM (see Figure 5.12a), the provided kernel characterization generally does not match the Top-Down analysis, i.e., no kernel is limited by private caches or by the computation roofs. Instead, the state-of-the-art CARM analysis places all kernels around the DRAM slope or between the L3 and DRAM roofs. However, when the performance of these kernels is plotted in the proposed CARM extension, a shift towards the private cache levels can be observed for all kernels, which closely matches the expectations from the Top-Down analysis.

[Figure 5.11: Batch 1: Instruction distribution and Top-Down analysis for DP Scalar LD applications (bwaves, zeusmp, cactusADM, gemsFDTD, gamess). (a) Instruction distribution; (b) Top-Down metrics.]

[Figure 5.12: Batch 1: Application characterization within state-of-the-art CARM and proposed DP Scalar LD CARM extension: (a) state-of-the-art CARM characterization; (b) DP Scalar LD CARM characterization.]

In detail, the gamess1 kernel characterization in the proposed CARM extension completely matches the expected behavior, since the proposed CARM places it slightly above the L2 cache slope. Furthermore, zeusmp kernels 1 and 3 and the cactusADM kernel also completely correlate with the Top-Down insights, since their

performance is now limited by private cache levels, in particular the L2 cache. The performance of the cactusADM kernel is also affected by the frontend bound contribution, indicating backend starvation and, consequently, resulting in a slightly lower attainable performance, as captured in the proposed CARM extension. Although the performance of the remaining kernels is closer to the core components in the proposed CARM chart, which demonstrates a more accurate analysis, these kernels are still characterized as bounded by the shared cache, i.e., the L3 cache. For these kernels, the majority of the memory accesses is served by DRAM, which naturally results in a significantly reduced performance due to the lower DRAM bandwidth. However, since memory bound is not the main bottleneck in the Top-Down analysis, the kernels cannot be completely limited by the DRAM roof; hence, their performance is closer to the L3 cache in the proposed model, while never reaching the private caches or computational roofs.

In order to provide a different perspective when analyzing the bwaves, zeusmp, gamess1, cactusADM and gemsFDTD kernels, their behavior, characteristics and energy-efficiency are also assessed in the corresponding energy-efficiency CARM extension, as presented in Figure 5.13. This model is constructed by following the same principles used when selecting the performance CARM extension, i.e., the DP Scalar LD CARM, thus providing a coherent comparison between the performance and energy-efficiency evaluations.

As can be observed in Figure 5.13, the energy-efficiency characterization of the kernels belonging to the first evaluation group does not drastically differ from the one observed in the performance domain. Kernels 1 and 3 from zeusmp maintain the same relative position to the L2 roof in both models, indicating a good balance between performance and power consumption. The position of the remaining kernels is slightly shifted further away

from the limiting roof, when compared to the performance characterization, although their main execution bottlenecks remain the same.

[Figure 5.13: Application energy-efficiency characterization with the proposed DP Scalar LD CARM extension. Axes: Energy-Efficiency [GFLOPS/J] vs Arithmetic Intensity [flops/byte].]

This suggests that the power consumption of these kernels is higher, which reduces their energy-efficiency; thus, the execution of these kernels should be optimized, especially in what concerns

the power consumed during the execution. For example, by applying different optimization techniques to the respective kernels that allow achieving performance closer to the L1 roof, the energy-efficiency of these kernels is also expected to increase significantly (i.e., the characterization point in the energy-efficiency plot will move nearer to the L1 roof), since the power consumption for L1 accesses is also lower. Hence, the main objective when optimizing applications from the energy-efficiency point of view is to maximize the efficiency, i.e., to improve the application performance while reducing or maintaining a similar level of power consumption. As can be observed in Figure 5.13, all analyzed kernels have very low AI, which prevents them from entering the regions of high energy-efficiency (i.e., the regions where it is possible to achieve 99% of the maximum processor energy-efficiency). In order to achieve this goal, the structure of the algorithms has to be completely redesigned (if possible at all) to provide significant changes in the kernels' AI, i.e., to shift the kernels towards the right side of the energy-efficiency CARM ridge point, where it would be theoretically possible to attain the maximum efficiency sustained by the architecture.
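The role of the ridge point can be illustrated with a simplified numeric sketch. Assuming, purely for illustration (these are not the calibrated Skylake values used in this Thesis), a roofline-shaped performance curve and a roughly constant package power at high load, the energy-efficiency curve is the performance roofline divided by power, so the 99%-efficiency region only begins at an AI close to the performance ridge point F_peak/B.

```python
def efficiency(ai, f_peak, bandwidth, power):
    """Energy-efficiency roofline (GFLOPS/J) under a constant-power
    assumption: the performance roofline divided by the power draw."""
    return min(f_peak, bandwidth * ai) / power

def ai_for_99pct(f_peak, bandwidth):
    """Smallest arithmetic intensity reaching 99% of peak efficiency:
    min(f_peak, B*ai) >= 0.99*f_peak  <=>  ai >= 0.99*f_peak/B."""
    return 0.99 * f_peak / bandwidth

# Illustrative (not measured) numbers: 100 GFLOPS/s peak, 30 GB/s, 50 W.
print(ai_for_99pct(100, 30))          # ~3.3 flops/byte
print(efficiency(0.05, 100, 30, 50))  # a low-AI kernel: ~0.03 GFLOPS/J
```

With these numbers, a kernel at an AI of 0.05 flops/byte sits two orders of magnitude below the 99% region, which is why only an algorithmic redesign that raises AI can move it there.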

The second group of characterized FP SPEC benchmarks contains the kernels from the leslie3D, tonto and lbm applications. As can be observed in the corresponding instruction distribution, presented in Figure 5.14a, the leslie3D kernels are predominantly constituted by non-memory instructions, thus they are expected to be limited by retiring in the Top-Down analysis. In contrast, the tonto and lbm kernels are quite balanced between memory and non-memory instructions. Hence, predicting their behavior is far from a trivial task, since it depends on the memory level accessed by the memory instructions. If the requests are all served by private caches, their retiring contribution should be high in the Top-Down analysis. On the other hand, if the majority of the memory accesses refer to DRAM, the kernels should be memory bound.

These effects are clearly visible in the Top-Down evaluation presented in Figure 5.14b. As expected from the leslie3D instruction mix, these kernels are mainly limited by retiring (approximately 60% in all kernels), with a significant contribution from the memory bound component. In fact, according to the Top-Down characterization, the leslie3D memory accesses are mainly served by DRAM. Hence, despite being predominantly limited by retiring,

the performance of these kernels should be placed between the L3 cache and DRAM roofs.

[Figure 5.14: Batch 2: Instruction distribution and Top-Down analysis for DP Scalar LD applications (leslie3d, tonto, lbm). (a) Instruction distribution; (b) Top-Down metrics.]

[Figure 5.15: Batch 2: Application characterization within state-of-the-art CARM and proposed DP Scalar LD CARM extension: (a) state-of-the-art CARM characterization; (b) DP Scalar LD CARM characterization.]

In contrast to the leslie3D

kernels, the memory bound component of the tonto kernels is much lower, and they are completely bound by retiring and core bound, indicating that their performance might be limited by private caches or computations. Finally, the lbm kernel is predominantly bounded by memory. In particular, the majority of its memory accesses are served by DRAM, thus it should be characterized as DRAM bound in the CARM plot. However, since it is not purely memory bound, the existence of some retiring contribution might suggest that the lbm kernel performance can be positioned slightly above the DRAM roof.

By plotting these kernels in the state-of-the-art CARM (Figure 5.15a), the insights provided by the Top-Down analysis are, once again, not completely verified. In the state-of-the-art CARM, all kernels are positioned below the L3 cache roof, although the Top-Down metrics indicate the opposite. The only kernel that somewhat matches the Top-Down analysis is lbm, which is completely bounded by DRAM (although the Top-Down retiring component suggests that its performance should surpass DRAM). However, when the same set of kernels is characterized in the proposed CARM extension, as presented in Figure 5.15b, a clear relation between the Top-Down analysis and the DP Scalar LD CARM can be observed. In detail, in the proposed roofline chart, all the kernels behave according to the Top-Down and instruction mix analyses. In particular, the tonto kernels are bounded by private caches, which reflects their high retiring and core bound nature in the Top-Down evaluation. Moreover, the lbm and leslie3D kernels, due to the retirement contribution and DRAM accesses, have their performance placed between the L3 and DRAM roofs. In fact, the leslie3D kernels are now mainly limited by the L3 cache accesses, which reflects the previously referred contribution of DRAM accesses on applications with high or moderate retirement rates.

However, the proposed CARM does not fully explain the behavior of the tonto kernels, despite providing a correct and accurate characterization. In particular, kernel 3 achieves lower performance than kernels 1 and 2, although kernel 3 has a retiring component (74%) much higher than that of kernels 1 and 2 (58% in both). Since this application has a high diversity in its instruction mix, the FP-based CARM may not allow for the full characterization of all execution bottlenecks that may originate from the other instruction types present in the application instruction mix. With the objective of explaining this small inconsistency, a similar extension to the COMPS CARM is also investigated in this Thesis. In contrast to the COMPS CARM (see Section 5.4.1), which relates the ratio of COMPS and MOPS with the amount of COMPS retired per cycle, this redefined roofline model considers on the x-axis the total amount of instructions (INST) over the amount of memory instructions (MOPS), as shown in Figure 5.16. Hence, on the y-axis, the performance is given by the total amount of instructions retired per clock cycle (IPC). Since the processor can only retire 4 instructions per cycle, the horizontal roofs are equal to the ones presented in the COMPS CARM. A similar methodology is applied to the slanted roofs, which consider the throughput of each memory level in terms of MOPS per cycle. Furthermore, the proposed general roofline model inherits all main CARM characteristics, including the construction for different load/store ratios and instruction types. In addition, since this novel roofline modeling approach considers the performance via the amount of retired INST per cycle, it is denominated herein as the INST CARM.

[Figure 5.16: Application characterization with DP Scalar LD INST CARM. Axes: Performance (INST/CLK) vs INST/MOPS, with retirement roofs RET = 1-4 and memory slopes L1→C, L2→C, L3→C and DRAM→C.]
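Since the INST CARM shares the COMPS CARM construction, its roofs can be sketched in the same way; the 2 MOPS/CLK slope below is an illustrative value, not a measured Skylake parameter, as the real slopes depend on the memory level and the load/store ratio.

```python
def inst_carm_roof(x, ret_rate=4.0, mem_throughput=2.0):
    """Attainable IPC in the INST CARM for x = INST/MOPS.

    ret_rate: horizontal retirement roof (up to 4 INST/CLK on Skylake).
    mem_throughput: MOPS/CLK of the serving memory level (illustrative).
    """
    return min(ret_rate, x * mem_throughput)

def classify(x, ret_rate=4.0, mem_throughput=2.0):
    """Name the roof that limits a kernel at abscissa x: if the
    retirement roof lies below the memory slope at x, the kernel is
    retirement (compute) bound; otherwise it is memory bound."""
    return "retirement bound" if ret_rate <= x * mem_throughput else "memory bound"

# A kernel retiring 8 instructions per memory operation hits the
# retirement roof first, as the tonto kernels do in Figure 5.16:
print(inst_carm_roof(8.0), classify(8.0))   # 4.0 retirement bound
```

This disambiguation is exactly how the INST CARM complements the FP CARM: the classification depends on the full instruction mix, not only on the FP operations.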

As can be observed in Figure 5.16, the tonto kernels are characterized as compute bound in this model. In fact, the highest performance is now achieved by kernel 3, as expected from the Top-Down analysis. Kernels 1 and 2 also corroborate the Top-Down evaluation, since their lower performance is directly connected to their lower retiring contribution. This demonstrates that the proposed INST CARM can also complement the FP CARM analysis, similarly to what was presented for the COMPS CARM.

As can be concluded, all the different flavors of roofline modeling proposed in this Thesis provide a significant enrichment to insightful micro-architecture modeling in general. When tightly coupled, these models constitute a powerful architecture and application analysis framework, mainly due to their ability to provide more accurate architecture and application characterization. By relying on the proposed set of easy-to-understand and intuitive models to visually represent the current execution bottlenecks, application developers and computer architects can now conduct very fast first-order analyses, apply different optimization techniques by following the guidelines given by the proposed models, or even more easily evaluate and decide between different design choices.


5.5 Summary

In this chapter, the usefulness and insightfulness of the proposed CARM extensions are demonstrated by performing an in-depth experimental evaluation and analysis of a set of different real-world applications on a real hardware system with a quad-core Intel Skylake 6700K processor. This analysis is performed across different modeling domains, namely performance, power consumption and energy-efficiency.

Initially, the main hotspots of the Toypush mini-application are deeply evaluated and characterized in the proposed CARM models, which adapt to the kernel specifics (e.g., load/store ratio and instruction type). This evaluation involves instruction distribution and Top-Down analyses, which are correlated with the CARM behavior. In addition, a set of optimizations is applied to the Toypush kernels, in order to maximize their performance and assess the impact of the optimizations on the proposed CARM characterization. The optimizations yielded performance improvements of up to 6.43 times when compared to the non-optimized Toypush version.

Next, a set of real-world applications from the SPEC benchmark suite is also analyzed in the corresponding CARM extensions, from the performance, power consumption and energy-efficiency perspectives. Due to the specifics of each application, their kernels are divided into three of the main proposed model categories: SP scalar LD CARM, DP scalar 2LD/ST CARM and DP scalar LD CARM. The performance analysis of the kernels follows the same methodology as in the Toypush case study. Moreover, since the SPEC applications are also evaluated from the power consumption and energy-efficiency points of view, the methodology to correctly interpret these CARM charts is presented throughout the evaluation of the application kernels. Finally, the COMPS and INST CARM extensions are also introduced to complement the insightfulness of the FP CARM, by allowing to distinguish which roofs in the FP CARM should be considered first as the primary execution bottlenecks, in particular when the kernel performance lies between two roofs. Furthermore, the state-of-the-art CARM performance characterization was compared with the proposed one. The obtained results show a clear improvement in kernel characterization when using the proposed CARM extensions, since the insights provided by these novel extensions are consistent with the expected behavior indicated by the Top-Down analysis.
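For context, the Top-Down analysis referenced above classifies pipeline issue slots into four level-1 categories. A minimal sketch of that breakdown is shown below; the 4-wide issue width and the direct slot counts are simplifying assumptions following Yasin's general method, not the exact counter mapping used by this Thesis's tooling:

```python
def topdown_l1(slots, retiring, bad_spec, frontend):
    """Level-1 Top-Down breakdown: fractions of pipeline issue slots spent
    retiring useful work, wasted on bad speculation, starved by the frontend,
    or stalled in the backend (computed here as the remainder)."""
    backend = slots - retiring - bad_spec - frontend
    return {
        "retiring": retiring / slots,
        "bad_speculation": bad_spec / slots,
        "frontend_bound": frontend / slots,
        "backend_bound": backend / slots,
    }

# A 4-wide core over 4e9 cycles exposes 16e9 issue slots (assumed width).
breakdown = topdown_l1(slots=16e9, retiring=8e9, bad_spec=1e9, frontend=3e9)
assert breakdown["retiring"] == 0.5
assert breakdown["backend_bound"] == 0.25
```

A kernel with a dominant backend-bound fraction typically maps to the memory-bound region of the CARM, which is what the correlation between the two analyses exploits.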

The results presented throughout this chapter support the conclusion that the proposed CARM extensions, together with the performance, power consumption and energy-efficiency analysis methodologies, represent a clear improvement over the current state-of-the-art CARM in terms of insightfulness and usability. Thus, their inclusion in Intel Advisor could easily boost the capabilities of this powerful tool, further easing the design and optimization of real applications.


6. Conclusions and Future Work

Due to the increasing micro-architecture and application complexity, optimizing applications from the performance, power consumption and energy-efficiency points of view is not an easy task for software developers. Hence, a characterization methodology capable of providing useful insights into application behavior according to the micro-architecture capabilities is extremely important when addressing these challenges. To tackle this issue, this Thesis proposed a set of CARM extensions aimed at increasing the insightfulness and usability of the model when characterizing real-world applications. In addition, the work performed in the scope of this Thesis also aimed to investigate the portability of CARM across processors from different and more recent Intel micro-architectures, in particular from the Intel Ivy Bridge 3770K to the Intel Skylake 6700K.

To achieve these objectives, a tool specifically designed for Intel micro-architectures was developed, allowing the system upper-bounds to be characterized for several different computational capabilities, such as different instructions, instruction set extensions and load/store ratios. The obtained results allowed several CARM instances to be created, each evaluating different processor capabilities. These instances were successfully validated on two different computing platforms, equipped with a quad-core Intel Skylake 6700K and an Intel Ivy Bridge 3770K, which also demonstrated the accuracy of the created benchmarks. Furthermore, modern real-world applications can contain a diverse instruction mix, which increases their complexity and, consequently, makes it difficult to characterize their behavior and to select the best optimization techniques for improving their execution efficiency. Hence, the proposed models aim at bridging this gap by correlating the application specifics with the different micro-architectural capabilities.
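As an illustration of how such measured upper-bounds translate into a Roofline-style chart, a minimal sketch of the attainable-performance curve is given below. The peak and bandwidth values are purely hypothetical, chosen for illustration, and are not the roofs measured on the 6700K:

```python
def carm_attainable(ai, peak_flops, mem_bw):
    """Attainable performance (GFLOP/s) at arithmetic intensity `ai`
    (flops per byte of core memory traffic, as CARM defines it):
    bounded by the compute roof or by the memory roof (bandwidth * AI),
    whichever is lower."""
    return min(peak_flops, mem_bw * ai)

# Hypothetical roofs for illustration: 512 GFLOP/s peak, 34 GB/s bandwidth.
assert carm_attainable(1.0, 512.0, 34.0) == 34.0     # memory-bound region
assert carm_attainable(100.0, 512.0, 34.0) == 512.0  # compute-bound region
```

Each CARM instance produced by the tool corresponds to one such pair of roofs, with the memory roof measured per cache level and the compute roof per instruction type and ISA extension.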

In order to demonstrate the usability of the proposed extensions, a case study with the Toypush mini-application and the characterization of FP benchmarks from the SPEC suite were performed, following the proposed performance, power consumption and energy-efficiency analysis methodologies. In the case of Toypush, its main hotspots were analyzed in depth, taking into account their instruction mix and Top-Down bottlenecks, and, based on this analysis, the appropriate CARM instance was applied to each kernel. By relying on the insights provided by the proposed CARM extensions, the Toypush kernels were optimized, achieving performance improvements of up to 6.43 times when compared to the non-optimized codes.

Moreover, the proposed methodology was compared with the insights derived from the state-of-the-art CARM implementation. This analysis shows the capability of the proposed models to provide a more accurate characterization of the application behavior. Furthermore, it was also possible to fully correlate the results of in-depth profiling of the application execution bottlenecks (using the Top-Down analysis) with the proposed CARM characterization. In contrast, the state-of-the-art CARM typically provides a more limited set of information when explaining the main execution bottlenecks for the majority of the application kernels. In addition, two extra CARM extensions were proposed, namely the COMPS CARM and the INST CARM. These extensions redefine the CARM throughput analysis by expressing application performance as the relation between the number of instructions retired per cycle and the total amount of executed instructions.
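As an illustration, one possible way to place an application point in such an instruction-throughput chart is sketched below. The normalization to giga-instructions per second and the "instructions per byte" intensity axis are assumptions made for this sketch, not the exact definitions used by the proposed extensions:

```python
def inst_carm_point(insts, cycles, freq_ghz, mem_bytes):
    """One application point for an instruction-throughput Roofline chart:
    performance as retired giga-instructions per second (IPC * frequency),
    intensity as retired instructions per byte of memory traffic."""
    perf = (insts / cycles) * freq_ghz  # Ginst/s, since freq is in GHz
    intensity = insts / mem_bytes       # instructions per byte
    return intensity, perf

# A kernel retiring 8e9 instructions in 4e9 cycles at 4 GHz (IPC = 2),
# while moving 2e9 bytes, lands at 4 inst/byte and 8 Ginst/s.
ii, perf = inst_carm_point(insts=8e9, cycles=4e9, freq_ghz=4.0, mem_bytes=2e9)
assert ii == 4.0
assert perf == 8.0
```

Because every retired instruction contributes to this point, not only the FP ones, the corresponding roofs expose bottlenecks that a purely FP-centric chart hides.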

The experimental results presented in this Thesis clearly show that the proposed extensions increase the CARM insightfulness and usability when analyzing real-world applications. Moreover, the proposed COMPS CARM and INST CARM can be considered a first step towards the derivation of more general Roofline modeling approaches, since these models consider the entire set of instructions and not only the FP instructions. Finally, since CARM is a model based on experimental measurements, it can be applied to any architecture, as demonstrated by the validation results, where CARM was successfully applied to model the performance, power consumption and energy-efficiency upper-bounds of a quad-core Intel Skylake 6700K processor.

6.1 Future Work

As can be observed from the characterization performed in Chapter 5, there are some aspects that could further improve the CARM insightfulness when analyzing real-world applications. Based on the proposed COMPS and INST CARM extensions, the CARM horizontal roofs could be extended to include mixes of integer and FP operations. Hence, the majority of applications, with the most diverse mixes of integer and FP instructions, would be characterized more accurately. In addition, since integer instructions allow 1-byte and 2-byte memory accesses to be performed, the corresponding roofs should also be included in the CARM chart.

Moreover, some application kernels mix different data precision types, i.e., they contain memory accesses and computations on both SP and DP data. Hence, by including in CARM the roofs that correspond to these mixes, different processor capabilities can be taken into account, thus improving the model usability.

Finally, further research is needed to fully uncover the impact of the frontend and bad speculation bottlenecks on the performance observed in CARM. Despite the fact that they may provoke a reduction in kernel performance, CARM does not seem to fully capture these bottlenecks. Hence, CARM could be extended to include memory and computational roofs that represent the memory and computational throughput attainable when frontend and bad speculation problems occur.
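One conceivable formulation of such degraded roofs, given here purely as a hypothetical sketch rather than a design proposed in this Thesis, would scale the measured roofs by the fraction of issue slots not lost to frontend stalls and bad speculation:

```python
def effective_roof(measured_roof, frontend_bound, bad_speculation):
    """Hypothetical 'effective' roof: a measured compute or memory roof scaled
    by the fraction of pipeline issue slots not lost to frontend stalls or
    bad speculation (both given as Top-Down fractions in [0, 1])."""
    usable = 1.0 - frontend_bound - bad_speculation
    return measured_roof * max(usable, 0.0)

# A kernel losing 20% of slots to the frontend and 5% to bad speculation
# could at best reach 75% of the nominal roof under this formulation.
assert effective_roof(512.0, 0.20, 0.05) == 384.0
```

Whether such a linear scaling matches the actual throughput degradation would itself need experimental validation, which is precisely the open question raised above.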

