“Simple” performance modeling: The Roofline Model
Transcript of “Simple” performance modeling: The Roofline Model
Loop-based performance modeling: Execution vs. data transfer
Prelude: Modeling customer dispatch in a bank
03/2017 Roofline Model

(Diagram: a bank with a revolving door at the entrance and service desks inside)
- Revolving door throughput: bS [customers/sec]
- Intensity: I [tasks/customer]
- Processing capability: Pmax [tasks/sec]
Prelude: Modeling customer dispatch in a bank
How fast can tasks be processed? P [tasks/sec]

The bottleneck is either
- the service desks (max. tasks/sec): Pmax
- the revolving door (max. customers/sec): I ∙ bS

P = min(Pmax, I ∙ bS)

This is the "Roofline Model":
- High intensity: P limited by "execution"
- Low intensity: P limited by the "bottleneck"
- "Knee" at Pmax = I ∙ bS: best use of resources
- Roofline is an "optimistic" model ("light speed")

(Plot: performance P vs. intensity I, with the roof at Pmax)
Apply this example to the performance of compute devices:
- Maximum processing capability → peak performance: Ppeak [F/s]
- Rate of revolving door → memory bandwidth: bS [B/s]
- Workload per customer → computational intensity: I [F/B]
The Roofline Model – Basics

Machine model:
- Ppeak = 4 GF/s
- bS = 10 GB/s

Application model (computational intensity I):

  double s = 0.0, a[N];
  for (int i = 0; i < N; ++i) {
    s = s + a[i] * a[i];
  }

I = 2 F / 8 B = 0.25 F/B
(2 flops per iteration; one 8-byte double loaded per iteration)
The Roofline Model – Basics

- Compare capabilities of different machines
- The RLM always provides an upper bound – but is it realistic?
- If the code cannot reach this limit (e.g., it contains add operations only), the machine parameters need to be redefined (e.g., Ppeak → Ppeak/2)
- Assuming double precision – for single precision: Ppeak → 2 ∙ Ppeak
The Roofline Model – refined

D. Callahan et al.: Estimating interlock and improving balance for pipelined architectures. Journal of Parallel and Distributed Computing 5(4), 334 (1988). DOI: 10.1016/0743-7315(88)90002-0
W. Schönauer: Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers. Self-edition (2000)
S. Williams: Auto-tuning Performance on Multicore Computers. UCB Technical Report No. UCB/EECS-2008-164. PhD thesis (2008)

1. Pmax = applicable peak performance of a loop, assuming that data comes from the level 1 cache (this is not necessarily Ppeak)
   e.g., Pmax = 176 GFlop/s
2. I = computational intensity ("work" per byte transferred) over the slowest data path utilized (code balance BC = I⁻¹)
   e.g., I = 0.167 Flop/Byte → BC = 6 Byte/Flop
3. bS = applicable (saturated) peak bandwidth of the slowest data path utilized
   e.g., bS = 56 GByte/s

Expected performance:
P = min(Pmax, I ∙ bS) = min(Pmax, bS/BC), with bS in [Byte/s] and BC in [Byte/Flop]
Preliminary: Estimating Pmax
Remember the Haswell port scheduler model?

(Figure: Haswell core with ports 0–7; ALU, FMA, FMUL, FSHUF and JUMP units on the arithmetic ports; LOAD + AGU on ports 2 and 3, STORE on port 4, AGU on port 7; retire 4 uops/cy; three 32-byte data paths to the L1 cache)
Example: Estimate Pmax of the vector triad on Haswell

  double *A, *B, *C, *D;
  for (int i = 0; i < N; i++) {
    A[i] = B[i] + C[i] * D[i];
  }

Minimum number of cycles to process one AVX-vectorized iteration (one core)?
One AVX iteration is equivalent to 4 scalar iterations.

Cycle 1: LOAD + LOAD + STORE
Cycle 2: LOAD + LOAD + FMA + FMA
Cycle 3: LOAD + LOAD + STORE

Three cycles cover two AVX iterations (6 LOADs, 2 STOREs, 2 FMAs).
Answer: 1.5 cycles
Example: Estimate Pmax of the vector triad on Haswell (2.3 GHz)

  double *A, *B, *C, *D;
  for (int i = 0; i < N; i++) {
    A[i] = B[i] + C[i] * D[i];
  }

What is the performance in GFlop/s per core and the bandwidth in GByte/s?

One AVX iteration (1.5 cycles) does 4 x 2 = 8 flops:

(2.3 ∙ 10⁹ cy/s / 1.5 cy) ∙ 4 updates ∙ 2 flops/update = 12.27 GFlop/s

6.13 ∙ 10⁹ updates/s ∙ 32 bytes/update = 196 GByte/s
Pmax + bandwidth limitations: The vector triad

Example: vector triad A(:) = B(:) + C(:)*D(:) on a 2.3 GHz 14-core Haswell chip (AVX vectorized)
- Peak memory bandwidth: bS = 50 GB/s
- Memory code balance (including write allocate): BC = (4+1) words / 2 flops = 20 B/F → I = 0.05 F/B
- I ∙ bS = 2.5 GF/s (0.5% of peak performance)
- Ppeak = 515.2 GFlop/s (14 cores x (8+8) flops/cy x 2.3 GHz)
- Pmax = 14 x 12.27 GFlop/s = 172 GFlop/s (33% of peak)

P = min(Pmax, I ∙ bS) = min(172, 2.5) GFlop/s = 2.5 GFlop/s
Code balance: more examples

  double a[N], b[N];
  for (int i = 0; i < N; ++i) {
    a[i] = a[i] + b[i];
  }
BC = 24 B / 1 F = 24 B/F → I = 0.042 F/B

  double a[N], b[N];
  for (int i = 0; i < N; ++i) {
    a[i] = a[i] + s * b[i];
  }
BC = 24 B / 2 F = 12 B/F → I = 0.083 F/B
(scalar s can be kept in a register)

  float s = 0, a[N];
  for (int i = 0; i < N; ++i) {
    s = s + a[i] * a[i];
  }
BC = 4 B / 2 F = 2 B/F → I = 0.5 F/B
(scalar s can be kept in a register)

  float s = 0, a[N], b[N];
  for (int i = 0; i < N; ++i) {
    s = s + a[i] * b[i];
  }
BC = 8 B / 2 F = 4 B/F → I = 0.25 F/B
(scalar s can be kept in a register)
Machine balance for hardware characterization

For quick comparisons the concept of machine balance is useful:

Bm = bS / Ppeak

Machine balance = how much input data can be delivered for each FP operation? ("memory gap" characterization), assuming balanced MULT/ADD.
Rough estimate: Bm ≪ BC → strongly memory-bound code

Typical values (main memory):
- Intel Haswell 14-core 2.3 GHz: Bm = 60 GB/s / (14 x 2.3 x 16) GF/s ≈ 0.12 B/F
- Intel Sandy Bridge 8-core 2.7 GHz: ≈ 0.23 B/F
- Nvidia K20x: ≈ 0.14 B/F
- Intel Xeon Phi Knights Corner: ≈ 0.16 B/F
A not so simple Roofline example

Example: do i=1,N; s=s+a(i); enddo
in single precision on a 2.2 GHz Sandy Bridge socket @ "large" N

I = 1 flop / 4 byte (SP!)
P = min(Pmax, I ∙ bS)

(Roofline diagram; roofs from top to bottom – see the first part on how to get these numbers:)
- Machine peak (ADD+MULT): 282 GF/s – out of reach for this code
- ADD peak (best possible code): 141 GF/s
- No SIMD: 17.6 GF/s
- 3-cycle latency per ADD if not unrolled: 5.9 GF/s
The diagram also marks P (worst loop code) and P (better loop code).
Input to the roofline model
… on the example of do i=1,N; s=s+a(i); enddo in single precision

- Code analysis: 1 ADD + 1 LOAD per iteration
- Architecture: throughput 1 ADD + 1 LD/cy; pipeline depth 3 cy (ADD); 8-way SIMD, 8 cores
  → Pmax = 5.9 … 141 GF/s, depending on code quality
- Measurement: maximum memory bandwidth 40 GB/s → I ∙ bS = 10 GF/s

Worst code: P = 5.9 GF/s (core bound)
Better code: P = 10 GF/s (memory bound)
Prerequisites for the Roofline Model

The roofline formalism is based on some (crucial) assumptions:
- There is a clear concept of "work" vs. "traffic":
  "work" = flops, updates, iterations, …
  "traffic" = required data to do the "work"
- Attainable bandwidth of code = input parameter! Determine the effective bandwidth via simple streaming benchmarks to model more complex kernels and applications
- Data transfer and core execution overlap perfectly! Either the limit is core execution or it is data transfer; the slowest limiting factor "wins", all others are assumed to have no impact
- Latency effects are ignored, i.e., perfect streaming mode
- Bandwidth must be saturable → full socket/node scenarios work best
Factors to consider in the roofline model

Bandwidth-bound (simple case):
- Accurate traffic calculation (write-allocate, strided access, …)
- Practical ≠ theoretical BW limits
- Erratic access patterns

Core-bound (may be complex):
- Multiple bottlenecks: LD/ST, arithmetic, pipelines, SIMD, execution ports
- Limit is linear in # of cores
Complexities of in-core execution

Multiple bottlenecks:
- Decode/retirement throughput
- Port contention (direct or indirect)
- Arithmetic pipeline stalls (dependencies)
- Overall pipeline stalls (branching)
- L1 Dcache bandwidth (LD/ST throughput)
- Scalar vs. SIMD execution
- L1 Icache bandwidth
- Alignment issues
- …
Typical code optimizations in the Roofline Model

1. Hit the BW bottleneck by good serial code (e.g., Ninja C++ → Fortran)
2. Increase intensity to make better use of the BW bottleneck (e.g., loop blocking [see later])
3. Increase intensity and go from memory-bound to core-bound (e.g., temporal blocking)
4. Hit the core bottleneck by good serial code (e.g., -fno-alias [see later])
5. Shift Pmax by accessing additional hardware features or using a different algorithm/implementation (e.g., scalar → SIMD)
Shortcomings of the roofline model

- Saturation effects in multicore chips are not explained
  Reason: the "saturation assumption"
- Cache line transfers and core execution do sometimes not overlap perfectly
- It is not sufficient to measure single-core STREAM bandwidth to make it work; only increased "pressure" on the memory interface can saturate the bus → need more cores!
- In-cache performance is not correctly predicted

The ECM performance model gives more insight (example: A(:)=B(:)+C(:)*D(:)):
G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience (2013). DOI: 10.1002/cpe.3180. Preprint: arXiv:1208.2908
Interlude: Hardware Performance Counters
Monitoring “live” jobs on a cluster in the Roofline diagram
Where are the "good" and the "bad" jobs in this diagram?
Roofline Case Studies
- Dense matrix-vector multiplication (dMVM)
- Sparse matrix-vector multiplication (spMVM)
- Jacobi smoother

Case Study: dMVM
Dense matrix-vector multiplication in DP (AVX)

Assume C = R ≈ 10,000.
- Applicable peak performance?
- Relevant data path?
- Computational intensity?

  do c = 1 , C
    do r = 1 , R
      y(r) = y(r) + A(r,c) * x(c)
    enddo
  enddo

With the scalar x(c) hoisted out of the inner loop:

  do c = 1 , C
    tmp = x(c)
    do r = 1 , R
      y(r) = y(r) + A(r,c) * tmp
    enddo
  enddo
DMVM (DP) – AVX vectorization

Vectorization strategy: 4-way inner loop unrolling
- One register holds tmp in each of its 4 entries ("broadcast")
- The loop kernel requires/consumes 3 AVX registers

  do c = 1,C
    tmp = x(c)
    do r = 1,R,4 ! R is a multiple of 4
      y(r)   = y(r)   + A(r,c)  * tmp
      y(r+1) = y(r+1) + A(r+1,c)* tmp
      y(r+2) = y(r+2) + A(r+2,c)* tmp
      y(r+3) = y(r+3) + A(r+3,c)* tmp
    enddo
  enddo
DMVM (DP) – Single core performance vs. column height

Intel Xeon E5 2695 v3 (Haswell-EP), 2.3 GHz, CoD mode, core Ppeak = 18.4 GF/s;
caches: 32 KB / 256 KB / 35 MB, page size: 2 MB; ifort V15.0.1.133; bS = 32 GByte/s

Performance drops as the number of rows (inner loop length) increases.
Does the computational intensity change?
DMVM data traffic analysis

  do c = 1 , C
    tmp = x(c)
    do r = 1 , R
      y(r) = y(r) + A(r,c) * tmp
    enddo
  enddo

- y(:) is loaded and stored in each outer iteration → for c > 1, update y(:) in the cache
- A(:,:) is loaded from memory – no data reuse
- y(:) may not fit into the innermost cache → more traffic from lower-level caches for larger R
- tmp stays in a register during the inner loop

Roofline analysis: distinguish the code balance in memory (BC_mem) from the code balance in the relevant cache level(s) (BC_L3, BC_L2, …)!
DMVM (DP) – Single core data traffic analysis

likwid-perfctr measurements (data volume per cache level vs. R):
- size(y(1:R)) = 16 kB: y exceeds the inner (L1) cache → extra (8+8) Byte for LD + ST on y: BC_L2 = 24B/2F (below this point: BC_L3 = 8B/2F)
- size(y(1:R)) = 160 kB: y exceeds the L2 cache → BC_L3 = 24B/2F
- For even larger R, y streams from memory: BC_mem = 24B/2F
Reducing traffic by blocking

Original loop:

  do c = 1 , C
    tmp = x(c)
    do r = 1 , R
      y(r) = y(r) + A(r,c) * tmp
    enddo
  enddo

y(:) may not fit into some cache → more traffic from the lower level.

Blocked loop:

  do rb = 1 , R , Rb
    rbS = rb
    rbE = min((rb+Rb-1), R)
    do c = 1 , C
      do r = rbS , rbE
        y(r) = y(r) + A(r,c) * x(c)
      enddo
    enddo
  enddo

y(rbS:rbE) may fit into some cache if Rb is small enough → traffic reduction.
Reducing traffic by blocking

- The LHS is only updated once in some cache level if blocking is applied
- Price: the RHS is loaded multiple times instead of once! How often? R / Rb times
- Consequence: traffic reduction of LHS & RHS by a factor of R / (Rb x 2C)
- Still a large reduction if the block size can be made larger than about 10
DMVM (DP) – Reducing traffic by inner loop blocking

"1D blocking" for the inner loop: the blocking factor Rb selects the cache level; fully reuse the subset y(rbS:rbE) from the L1/L2 cache.

  do rb = 1 , R , Rb
    rbS = rb
    rbE = min((rb+Rb-1), R)
    do c = 1 , C
      do r = rbS , rbE
        y(r) = y(r) + A(r,c) * x(c)
      enddo
    enddo
  enddo

(Plot: performance vs. R with L2 cache blocking and L1 cache blocking)
DMVM (DP) – Validation of blocking optimization

(Plot: measured performance vs. R with blocking factor Rb = 2000)
DMVM (DP) – OpenMP parallelization

Plain code:

  !$omp parallel do reduction(+:y)
  do c = 1 , C
    do r = 1 , R
      y(r) = y(r) + A(r,c) * x(c)
  enddo ; enddo
  !$omp end parallel do

Blocked code:

  !$omp parallel do private(rbS,rbE)
  do rb = 1 , R , Rb
    rbS = rb
    rbE = min((rb+Rb-1), R)
    do c = 1 , C
      do r = rbS , rbE
        y(r) = y(r) + A(r,c) * x(c)
  enddo ; enddo ; enddo
  !$omp end parallel do
DMVM (DP) – OpenMP parallelization & saturation

Intel Xeon E5 2695 v3 (Haswell-EP), 2.3 GHz base clock speed, bS = 32 GB/s

- Blocking is good for single-thread performance (reduced in-cache traffic)
- Memory traffic is unchanged → saturation unchanged!
- Saturation is influenced by serial performance
- Roofline limit: BC = 4 Byte/Flop, bS = 32 GB/s

So, is blocking useless? NO (see later)
Can we do anything to improve BC_mem? NO, not here
DMVM (DP) – saturation & clock speed

Intel Xeon E5 2695 v3 (Haswell-EP), 2.3 GHz base clock speed, bS = 32 GB/s

- Clock speed has a high impact on single-thread performance
- Even the worst code almost saturates the bandwidth @ 1.3 GHz
- Saturation is influenced by clock speed and serial performance

Clock speed or parallelism can "heal" slow code!
Conclusions from the dMVM example

- We have found the reason for the breakdown of single-core performance with a growing number of matrix rows: the LHS vector fitting into different levels of the cache hierarchy (validated by performance counter measurements)
- Inner loop blocking was employed to improve the code balance in L3 and/or L2 (validated by performance counter measurements)
- Blocking led to better single-threaded performance
- Saturated performance is unchanged (as predicted by Roofline), because the problem is still small enough to fit the LHS at least into the L3 cache
Case study: A Jacobi smoother
- The basics in two dimensions
- Layer conditions
- Optimization by spatial blocking
Stencil schemes

- Stencil schemes frequently occur in PDE solvers on regular lattice structures
- Basically it is a sparse matrix-vector multiply (spMVM) embedded in an iterative scheme (outer loop), but the regular access structure allows for matrix-free coding
- Complexity of implementation and performance depends on the stencil operator (e.g., Jacobi-type, Gauss-Seidel-type, …) and the spatial extent (e.g., 7-pt or 25-pt in 3D, …)

  do iter = 1, max_iterations
    ! Perform sweep over regular grid: y(:) ← x(:)
    ! Swap y ↔ x
  enddo
Jacobi-type 5-pt stencil in 2D

  do k=1,kmax
    do j=1,jmax
      y(j,k) = const * ( x(j-1,k) + x(j+1,k) &
                       + x(j,k-1) + x(j,k+1) )
    enddo
  enddo

Arrays: y(0:jmax+1,0:kmax+1), x(0:jmax+1,0:kmax+1); the sweep runs over j (inner) and k (outer).

Appropriate performance metric: "lattice updates per second" [LUP/s]
(here: multiply by 4 FLOP/LUP to get the FLOP/s rate)
Jacobi 5-pt stencil in 2D: data transfer analysis

  do k=1,kmax
    do j=1,jmax
      y(j,k) = const * ( x(j-1,k) + x(j+1,k) &
                       + x(j,k-1) + x(j,k+1) )
    enddo
  enddo

Naive balance per LUP (incl. write allocate):
- y(j,k): 1 ST + 1 LD (write allocate)
- x(j+1,k), x(j,k-1), x(j,k+1): 3 LD
- x(j-1,k): available in cache (used 2 updates before)

BC = 5 words / LUP = 40 B / LUP (assuming double precision)
Jacobi 5-pt stencil in 2D: Single core performance

Intel Xeon E5-2690 v2 ("IvyBridge" @ 3 GHz), Intel compiler ifort V13.1

(Plot: performance vs. jmax for jmax = kmax and jmax*kmax = const; the code balance BC measured with likwid-perfctr changes from ~24 B/LUP to ~40 B/LUP around the L3 cache limit)

Questions:
1. How to achieve 24 B/LUP also for large jmax?
2. How to sustain > 600 MLUP/s for jmax > 10⁴?
Case study: A Jacobi smoother
- The basics in two dimensions
- Layer conditions
- Optimization by spatial blocking
Analyzing the data flow

Worst case: the cache is not large enough to hold 3 layers (rows) of the grid
(assume a "least recently used" replacement strategy).

(Figure: grid x(0:jmax+1,0:kmax+1) with halo cells; j is the inner, k the outer direction; the cached rows are marked)
Analyzing the data flow

Reduce the inner (j-) loop dimension successively.
Best case: 3 "layers" of the grid fit into the cache!

(Figure: grids x(0:jmax1+1,0:kmax+1) and x(0:jmax2+1,0:kmax+1) with shrinking inner dimension)
Analyzing the data flow: Layer condition

2D 5-pt Jacobi-type stencil:

  do k=1,kmax
    do j=1,jmax
      y(j,k) = const * ( x(j-1,k) + x(j+1,k) &
                       + x(j,k-1) + x(j,k+1) )
    enddo
  enddo

"Layer condition" (double precision): 3 * jmax * 8B < CacheSize/2
(3 rows of length jmax; the factor 1/2 is a safety margin, rule of thumb)

The layer condition
- does not depend on the outer loop length (kmax)
- is no strict guideline (cache associativity; data traffic for y not included)
- needs to be adapted for other stencils (e.g., 3D 7-pt stencil)
Analyzing the data flow: Layer condition (2D 5-pt Jacobi)

Layer condition (3 * jmax * 8B < CacheSize/2) fulfilled?

YES:
- y: (1 LD + 1 ST) / LUP
- x: 1 LD / LUP
→ BC = 24 B / LUP

NO:
- y: (1 LD + 1 ST) / LUP
- x: 3 LD / LUP
→ BC = 40 B / LUP
Analyzing the data flow: Layer condition (2D 5-pt Jacobi)

Establish the layer condition for all domain sizes. Idea: spatial blocking
- Reuse elements of x() as long as they stay in the cache
- The sweep can be executed in any order, e.g. compute blocks in j-direction

"Spatial blocking" of the j-loop: determine for a given CacheSize an appropriate jblock value:

  do jb=1,jmax,jblock ! Assume jmax is a multiple of jblock
    do k=1,kmax
      do j= jb, (jb+jblock-1) ! Length of inner loop: jblock
        y(j,k) = const * ( x(j-1,k) + x(j+1,k) &
                         + x(j,k-1) + x(j,k+1) )
      enddo
    enddo
  enddo

New layer condition (blocking): 3 * jblock * 8B < CacheSize/2
→ jblock < CacheSize / 48 B
Establish the layer condition by blocking

Split up the domain into subblocks, e.g. block size = 5.
Price: additional data transfers (overhead) at the block boundaries!
Establish the layer condition by spatial blocking

Which cache to block for? Intel Xeon E5-2690 v2 ("IvyBridge" @ 3 GHz), Intel compiler ifort V13.1; caches: L1: 32 KB, L2: 256 KB, L3: 25 MB.

jblock < CacheSize / 48 B:
- L2 (CS = 256 KB): jblock = min(jmax, 5333)
- L3 (CS = 25 MB): jblock = min(jmax, 533333)

(Plot: performance vs. jmax for jmax = kmax and jmax*kmax = const)
Validating the model: Memory code balance

Measured main memory code balance (BC) vs. jmax: 24 B/LUP with blocking, 40 B/LUP without.
Intel Xeon E5-2690 v2 ("IvyBridge" @ 3 GHz), Intel compiler ifort V13.1

- The blocking factor for CS = 25 MB is still a little too large
- Main memory access is not the reason for the different performance (but L3 access is!)
Validating the model: L3 cache balance

Measured L3 cache code balance (BC) vs. jmax.
Intel Xeon E5-2690 v2 ("IvyBridge" @ 3 GHz), Intel compiler ifort V13.1

(Plot: L3 code balance vs. jmax; curves for data served from main memory (via L3), from the L1 cache, and from the L2 or L3 cache with blocking)

Impact of the total L3 traffic: 24 B/LUP vs. 40 B/LUP
From 2D to 3D

Towards a 3D understanding: the 2D picture can be considered as a 2D cut of a 3D domain at a (new) fixed i-coordinate: x(0:jmax+1,0:kmax+1) → x(i, 0:jmax+1,0:kmax+1)

x(0:imax+1, 0:jmax+1, 0:kmax+1) – assume the i-direction is contiguous in main memory (Fortran notation).
Stay in the 2D picture and consider one cell of the j-k plane as a contiguous slab of elements in i-direction: x(0:imax,j,k)
Layer condition: From 2D 5-pt to 3D 7-pt Jacobi-type stencil

2D:

  do k=1,kmax
    do j=1,jmax
      y(j,k) = const * ( x(j-1,k) + x(j+1,k) &
                       + x(j,k-1) + x(j,k+1) )
    enddo
  enddo

Layer condition: 3 * jmax * 8B < CacheSize/2 → BC = 24 B / LUP

3D:

  do k=1,kmax
    do j=1,jmax
      do i=1,imax
        y(i,j,k) = const * ( x(i-1,j,k) + x(i+1,j,k) &
                           + x(i,j-1,k) + x(i,j+1,k) &
                           + x(i,j,k-1) + x(i,j,k+1) )
      enddo
    enddo
  enddo

Layer condition: 3 * jmax * imax * 8B < CacheSize/2 → BC = 24 B / LUP
3D 7-pt Jacobi-type stencil (sequential)

  do k=1,kmax
    do j=1,jmax
      do i=1,imax
        y(i,j,k) = const * ( x(i-1,j,k) + x(i+1,j,k) &
                           + x(i,j-1,k) + x(i,j+1,k) &
                           + x(i,j,k-1) + x(i,j,k+1) )
      enddo
    enddo
  enddo

"Layer condition": 3 * jmax * imax * 8B < CS/2
Layer condition OK → 5 accesses to x() are served by the cache.

Question: does parallelization/multi-threading change the layer condition?
Jacobi Stencil – OpenMP parallelization (3D)

  !$OMP PARALLEL DO SCHEDULE(STATIC)
  do k=1,kmax
    do j=1,jmax
      do i=1,imax
        y(i,j,k) = 1/6. * ( x(i-1,j,k) + x(i+1,j,k) &
                          + x(i,j-1,k) + x(i,j+1,k) &
                          + x(i,j,k-1) + x(i,j,k+1) )
      enddo
    enddo
  enddo

Basic guideline: parallelize the outermost loop → equally large chunks in k-direction → a "layer condition" per thread.

Layer condition for a shared cache: nthreads * 3 * imax * jmax * 8B < CS/2
Layer condition for private caches: 3 * imax * jmax * 8B < CS/2

Layer condition (cubic domain; shared CacheSize = 25 MB):
1 thread: imax = jmax < 720; 10 threads: imax = jmax < 230

(c) RRZE 2017 Roofline Model case studies
Jacobi Stencil – OpenMP parallelization (II)

(Same parallel kernel as on the previous slide.)

Intel Xeon E5-2690 v2, 10 cores @ 3 GHz, CS = 25 MB (L3), bS = 48 GB/s

"Layer condition (shared cache)": nthreads * 3 * jmax * imax * 8B < CS/2 → BC = 24 B / LUP

Roofline model: P = min(Pmax, bS / (24 B/LUP))
"Light speed" performance: P = 2000 MLUP/s
What is Pmax here? → homework!
Socket scaling – validate the Roofline model

Intel Xeon E5-2690 v2 ("IvyBridge" @ 3 GHz), Intel compiler ifort V13.1, OpenMP-parallel; bS = 48 GB/s

P = min(Pmax, bS / BC), with BC = 24 B/LUP or 40 B/LUP
What is Pmax here? → homework!
Case study: A Jacobi smoother
- The basics in two dimensions
- Layer conditions
- Validating the model in 3D
- Optimization by spatial blocking in 3D
Jacobi Stencil – OpenMP parallelization (I)

Validation: measured data traffic from main memory [Bytes/LUP]
- 1 thread: layer condition OK – but cannot saturate the bandwidth
- 10 threads: performance drops around imax = 230 (layer condition violated → 40 B/LUP)
Jacobi Stencil – OpenMP parallelization (I)

(Same parallel kernel as before.)

If the original "layer condition" does not hold (nthreads * 3 * jmax * imax * 8B > CS/2), but assuming nthreads * jmax * imax * 8B < CS/2 (one layer per thread still fits):
- (8+8) B/LUP for y() (ST + write allocate)
- 8 B/LUP for x(i,j,k+1)
- 8 B/LUP for x(i,j+1,k)
- 8 B/LUP for x(i,j,k-1)
→ 40 B/LUP

maxMLUPs = MemBW / (40 B/LUP)
"Light speed" performance: P = 1200 MLUP/s
Case study: A Jacobi smoother
- The basics in two dimensions
- Layer conditions
- Validating the model in 3D
- Spatial blocking in 3D
Jacobi Stencil – simple spatial blocking

  do jb=1,jmax,jblock ! Assume jmax is a multiple of jblock
  !$OMP PARALLEL DO SCHEDULE(STATIC)
    do k=1,kmax
      do j=jb, (jb+jblock-1) ! Loop length jblock
        do i=1,imax
          y(i,j,k) = 1/6. * ( x(i-1,j,k) + x(i+1,j,k) &
                            + x(i,j-1,k) + x(i,j+1,k) &
                            + x(i,j,k-1) + x(i,j,k+1) )
        enddo
      enddo
    enddo
  enddo

"Layer condition" (j-blocking): nthreads * 3 * jblock * imax * 8B < CS/2
Ensure the layer condition by choosing jblock appropriately (cubic domain):
jblock < CS / (imax * nthreads * 48B)

Test system: Intel Xeon E5-2690 v2 (10 cores / 3 GHz), as before
Jacobi Stencil – simple spatial blocking

Determine jblock < CS / (2 * nthreads * 3 * imax * 8B), with imax = jmax = kmax.

Validation: measured data traffic from main memory [Bytes/LUP]
- The number of blocks changes with the domain size
- CS = 10 MB: ~90+ % of the roofline limit
Why not block the inner loop level?

(Plot: inner (i) loop blocking vs. middle (j) loop blocking, against the 24 B/update performance model; the optimum j block size is marked)
Conclusions from the Jacobi example

- We have made sense of the memory-bound performance vs. problem size:
  "Layer conditions" lead to predictions of the code balance
  "What part of the data comes from where" is a crucial question
  The model works only if the bandwidth is "saturated"; in-cache modeling is more involved
- Avoiding slow data paths == re-establishing the most favorable layer condition
- The improved code showed the speedup predicted by the model
- The optimal blocking factor can be estimated:
  Be guided by the cache size → the layer condition
  No need for an exhaustive scan of the "optimization space"

Food for thought:
- Multi-dimensional loop blocking – would it make sense?
- Can we choose a "better" OpenMP loop schedule?
- What would change if we parallelized inner loops?
Roofline diagram – exercise

Draw the roofline diagram, assuming double precision, for one node of the emmy cluster – the single-processor chip data can be found on slide 79 of the first presentation ("Intel Xeon E5 2690 v2").

Add these roofs:
- Code that only performs single-precision computations
- Code that only performs scalar instructions – no vectorization or SIMD instructions
- OpenMP code that ignores the golden rule of ccNUMA