“Simple” performance modeling: The Roofline Model
Transcript of “Simple” performance modeling: The Roofline Model
Loop-based performance modeling: Execution vs. data transfer
Prelude: Modeling customer dispatch in a bank
03/2017 Roofline Model

(Diagram: a bank with a revolving door at the entrance and service desks inside)
- Revolving door throughput: bS [customers/sec]
- Intensity: I [tasks/customer]
- Processing capability: Pmax [tasks/sec]
Prelude: Modeling customer dispatch in a bank
How fast can tasks be processed? P [tasks/sec]

The bottleneck is either
- the service desks (max. tasks/sec): Pmax
- the revolving door (max. customers/sec): I ∙ bS

P = min(Pmax, I ∙ bS)

This is the "Roofline Model":
- High intensity: P limited by "execution"
- Low intensity: P limited by the "bottleneck"
- "Knee" at Pmax = I ∙ bS: best use of resources
- Roofline is an "optimistic" model ("light speed")

(Plot: performance P vs. intensity I, with the roof at Pmax)
Apply this example to the performance of compute devices:
- Maximum processing capability → peak performance: Ppeak [F/s]
- Rate of revolving door → memory bandwidth: bS [B/s]
- Workload per customer → computational intensity: I [F/B]
The Roofline Model – Basics

Machine model:
- Ppeak = 4 GF/s
- bS = 10 GB/s

Application model (computational intensity I):

  double s = 0.0, a[N];
  for (int i = 0; i < N; ++i) {
    s = s + a[i] * a[i];
  }

I = 2 F / 8 B = 0.25 F/B
(2 flops per iteration; one 8-byte double loaded per iteration)
The Roofline Model – Basics

- Compare capabilities of different machines
- The RLM always provides an upper bound – but is it realistic?
- If the code cannot reach this limit (e.g., it contains add operations only), the machine parameters need to be redefined (e.g., Ppeak → Ppeak/2)
- Assuming double precision – for single precision: Ppeak → 2 ∙ Ppeak
The Roofline Model – refined

D. Callahan et al.: Estimating interlock and improving balance for pipelined architectures. Journal of Parallel and Distributed Computing 5(4), 334 (1988). DOI: 10.1016/0743-7315(88)90002-0
W. Schönauer: Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers. Self-edition (2000)
S. Williams: Auto-tuning Performance on Multicore Computers. UCB Technical Report No. UCB/EECS-2008-164. PhD thesis (2008)

1. Pmax = applicable peak performance of a loop, assuming that data comes from the level 1 cache (this is not necessarily Ppeak)
   e.g., Pmax = 176 GFlop/s
2. I = computational intensity ("work" per byte transferred) over the slowest data path utilized (code balance BC = I⁻¹)
   e.g., I = 0.167 Flop/Byte → BC = 6 Byte/Flop
3. bS = applicable (saturated) peak bandwidth of the slowest data path utilized
   e.g., bS = 56 GByte/s

Expected performance:
P = min(Pmax, I ∙ bS) = min(Pmax, bS/BC), with bS in [Byte/s] and BC in [Byte/Flop]
Preliminary: Estimating Pmax
Remember the Haswell port scheduler model?

(Figure: Haswell core with ports 0–7; ALU, FMA, FMUL, FSHUF and JUMP units on the arithmetic ports; LOAD + AGU on ports 2 and 3, STORE on port 4, AGU on port 7; retire 4 uops/cy; three 32-byte data paths to the L1 cache)
Example: Estimate Pmax of the vector triad on Haswell

  double *A, *B, *C, *D;
  for (int i = 0; i < N; i++) {
    A[i] = B[i] + C[i] * D[i];
  }

Minimum number of cycles to process one AVX-vectorized iteration (one core)?
One AVX iteration is equivalent to 4 scalar iterations.

Cycle 1: LOAD + LOAD + STORE
Cycle 2: LOAD + LOAD + FMA + FMA
Cycle 3: LOAD + LOAD + STORE

Three cycles cover two AVX iterations (6 LOADs, 2 STOREs, 2 FMAs).
Answer: 1.5 cycles
Example: Estimate Pmax of the vector triad on Haswell (2.3 GHz)

  double *A, *B, *C, *D;
  for (int i = 0; i < N; i++) {
    A[i] = B[i] + C[i] * D[i];
  }

What is the performance in GFlop/s per core and the bandwidth in GByte/s?

One AVX iteration (1.5 cycles) does 4 x 2 = 8 flops:

(2.3 ∙ 10⁹ cy/s / 1.5 cy) ∙ 4 updates ∙ 2 flops/update = 12.27 GFlop/s

6.13 ∙ 10⁹ updates/s ∙ 32 bytes/update = 196 GByte/s
Pmax + bandwidth limitations: The vector triad

Example: vector triad A(:) = B(:) + C(:)*D(:) on a 2.3 GHz 14-core Haswell chip (AVX vectorized)
- Peak memory bandwidth: bS = 50 GB/s
- Memory code balance (including write allocate): BC = (4+1) words / 2 flops = 20 B/F → I = 0.05 F/B
- I ∙ bS = 2.5 GF/s (0.5% of peak performance)
- Ppeak = 515.2 GFlop/s (14 cores x (8+8) flops/cy x 2.3 GHz)
- Pmax = 14 x 12.27 GFlop/s = 172 GFlop/s (33% of peak)

P = min(Pmax, I ∙ bS) = min(172, 2.5) GFlop/s = 2.5 GFlop/s
Code balance: more examples

  double a[N], b[N];
  for (int i = 0; i < N; ++i) {
    a[i] = a[i] + b[i];
  }
BC = 24 B / 1 F = 24 B/F → I = 0.042 F/B

  double a[N], b[N];
  for (int i = 0; i < N; ++i) {
    a[i] = a[i] + s * b[i];
  }
BC = 24 B / 2 F = 12 B/F → I = 0.083 F/B
(scalar s can be kept in a register)

  float s = 0, a[N];
  for (int i = 0; i < N; ++i) {
    s = s + a[i] * a[i];
  }
BC = 4 B / 2 F = 2 B/F → I = 0.5 F/B
(scalar s can be kept in a register)

  float s = 0, a[N], b[N];
  for (int i = 0; i < N; ++i) {
    s = s + a[i] * b[i];
  }
BC = 8 B / 2 F = 4 B/F → I = 0.25 F/B
(scalar s can be kept in a register)
Machine balance for hardware characterization

For quick comparisons the concept of machine balance is useful:

Bm = bS / Ppeak

Machine balance = how much input data can be delivered for each FP operation? ("memory gap" characterization), assuming balanced MULT/ADD.
Rough estimate: Bm ≪ BC → strongly memory-bound code

Typical values (main memory):
- Intel Haswell 14-core 2.3 GHz: Bm = 60 GB/s / (14 x 2.3 x 16) GF/s ≈ 0.12 B/F
- Intel Sandy Bridge 8-core 2.7 GHz: ≈ 0.23 B/F
- Nvidia K20x: ≈ 0.14 B/F
- Intel Xeon Phi Knights Corner: ≈ 0.16 B/F
A not so simple Roofline example

Example: do i=1,N; s=s+a(i); enddo
in single precision on a 2.2 GHz Sandy Bridge socket @ "large" N

I = 1 flop / 4 byte (SP!)
P = min(Pmax, I ∙ bS)

(Roofline diagram; roofs from top to bottom – see the first part on how to get these numbers:)
- Machine peak (ADD+MULT): 282 GF/s – out of reach for this code
- ADD peak (best possible code): 141 GF/s
- No SIMD: 17.6 GF/s
- 3-cycle latency per ADD if not unrolled: 5.9 GF/s
The diagram also marks P (worst loop code) and P (better loop code).
Input to the roofline model
… on the example of do i=1,N; s=s+a(i); enddo in single precision

- Code analysis: 1 ADD + 1 LOAD per iteration
- Architecture: throughput 1 ADD + 1 LD/cy; pipeline depth 3 cy (ADD); 8-way SIMD, 8 cores
  → Pmax = 5.9 … 141 GF/s, depending on code quality
- Measurement: maximum memory bandwidth 40 GB/s → I ∙ bS = 10 GF/s

Worst code: P = 5.9 GF/s (core bound)
Better code: P = 10 GF/s (memory bound)
Prerequisites for the Roofline Model

The roofline formalism is based on some (crucial) assumptions:
- There is a clear concept of "work" vs. "traffic":
  "work" = flops, updates, iterations, …
  "traffic" = required data to do the "work"
- Attainable bandwidth of code = input parameter! Determine the effective bandwidth via simple streaming benchmarks to model more complex kernels and applications
- Data transfer and core execution overlap perfectly! Either the limit is core execution or it is data transfer; the slowest limiting factor "wins", all others are assumed to have no impact
- Latency effects are ignored, i.e., perfect streaming mode
- Bandwidth must be saturable → full socket/node scenarios work best
Factors to consider in the roofline model

Bandwidth-bound (simple case):
- Accurate traffic calculation (write-allocate, strided access, …)
- Practical ≠ theoretical BW limits
- Erratic access patterns

Core-bound (may be complex):
- Multiple bottlenecks: LD/ST, arithmetic, pipelines, SIMD, execution ports
- Limit is linear in # of cores
Complexities of in-core execution

Multiple bottlenecks:
- Decode/retirement throughput
- Port contention (direct or indirect)
- Arithmetic pipeline stalls (dependencies)
- Overall pipeline stalls (branching)
- L1 Dcache bandwidth (LD/ST throughput)
- Scalar vs. SIMD execution
- L1 Icache bandwidth
- Alignment issues
- …
Typical code optimizations in the Roofline Model

1. Hit the BW bottleneck by good serial code (e.g., Ninja C++ → Fortran)
2. Increase intensity to make better use of the BW bottleneck (e.g., loop blocking [see later])
3. Increase intensity and go from memory-bound to core-bound (e.g., temporal blocking)
4. Hit the core bottleneck by good serial code (e.g., -fno-alias [see later])
5. Shift Pmax by accessing additional hardware features or using a different algorithm/implementation (e.g., scalar → SIMD)
Shortcomings of the roofline model

- Saturation effects in multicore chips are not explained
  Reason: the "saturation assumption"
- Cache line transfers and core execution do sometimes not overlap perfectly
- It is not sufficient to measure single-core STREAM bandwidth to make it work; only increased "pressure" on the memory interface can saturate the bus → need more cores!
- In-cache performance is not correctly predicted

The ECM performance model gives more insight (example: A(:)=B(:)+C(:)*D(:)):
G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience (2013). DOI: 10.1002/cpe.3180. Preprint: arXiv:1208.2908
Interlude: Hardware Performance Counters
Monitoring “live” jobs on a cluster in the Roofline diagram
Where are the "good" and the "bad" jobs in this diagram?
Roofline Case Studies
- Dense matrix-vector multiplication (dMVM)
- Sparse matrix-vector multiplication (spMVM)
- Jacobi smoother

Case Study: dMVM
Dense matrix-vector multiplication in DP (AVX)

Assume C = R ≈ 10,000.
- Applicable peak performance?
- Relevant data path?
- Computational intensity?

  do c = 1 , C
    do r = 1 , R
      y(r) = y(r) + A(r,c) * x(c)
    enddo
  enddo

With the scalar x(c) hoisted out of the inner loop:

  do c = 1 , C
    tmp = x(c)
    do r = 1 , R
      y(r) = y(r) + A(r,c) * tmp
    enddo
  enddo
DMVM (DP) – AVX vectorization

Vectorization strategy: 4-way inner loop unrolling
- One register holds tmp in each of its 4 entries ("broadcast")
- The loop kernel requires/consumes 3 AVX registers

  do c = 1,C
    tmp = x(c)
    do r = 1,R,4 ! R is a multiple of 4
      y(r)   = y(r)   + A(r,c)  * tmp
      y(r+1) = y(r+1) + A(r+1,c)* tmp
      y(r+2) = y(r+2) + A(r+2,c)* tmp
      y(r+3) = y(r+3) + A(r+3,c)* tmp
    enddo
  enddo
DMVM (DP) – Single core performance vs. column height

Intel Xeon E5 2695 v3 (Haswell-EP), 2.3 GHz, CoD mode, core Ppeak = 18.4 GF/s;
caches: 32 KB / 256 KB / 35 MB, page size: 2 MB; ifort V15.0.1.133; bS = 32 GByte/s

Performance drops as the number of rows (inner loop length) increases.
Does the computational intensity change?
DMVM data traffic analysis

  do c = 1 , C
    tmp = x(c)
    do r = 1 , R
      y(r) = y(r) + A(r,c) * tmp
    enddo
  enddo

- y(:) is loaded and stored in each outer iteration → for c > 1, update y(:) in the cache
- A(:,:) is loaded from memory – no data reuse
- y(:) may not fit into the innermost cache → more traffic from lower-level caches for larger R
- tmp stays in a register during the inner loop

Roofline analysis: distinguish the code balance in memory (BC_mem) from the code balance in the relevant cache level(s) (BC_L3, BC_L2, …)!
DMVM (DP) – Single core data traffic analysis

likwid-perfctr measurements (data volume per cache level vs. R):
- size(y(1:R)) = 16 kB: y exceeds the inner (L1) cache → extra (8+8) Byte for LD + ST on y: BC_L2 = 24B/2F (below this point: BC_L3 = 8B/2F)
- size(y(1:R)) = 160 kB: y exceeds the L2 cache → BC_L3 = 24B/2F
- For even larger R, y streams from memory: BC_mem = 24B/2F
Reducing traffic by blocking

Original loop:

  do c = 1 , C
    tmp = x(c)
    do r = 1 , R
      y(r) = y(r) + A(r,c) * tmp
    enddo
  enddo

y(:) may not fit into some cache → more traffic from the lower level.

Blocked loop:

  do rb = 1 , R , Rb
    rbS = rb
    rbE = min((rb+Rb-1), R)
    do c = 1 , C
      do r = rbS , rbE
        y(r) = y(r) + A(r,c) * x(c)
      enddo
    enddo
  enddo

y(rbS:rbE) may fit into some cache if Rb is small enough → traffic reduction.
Reducing traffic by blocking

- The LHS is only updated once in some cache level if blocking is applied
- Price: the RHS is loaded multiple times instead of once! How often? R / Rb times
- Consequence: traffic reduction of LHS & RHS by a factor of R / (Rb x 2C)
- Still a large reduction if the block size can be made larger than about 10
DMVM (DP) – Reducing traffic by inner loop blocking

"1D blocking" for the inner loop: the blocking factor Rb selects the cache level; fully reuse the subset y(rbS:rbE) from the L1/L2 cache.

  do rb = 1 , R , Rb
    rbS = rb
    rbE = min((rb+Rb-1), R)
    do c = 1 , C
      do r = rbS , rbE
        y(r) = y(r) + A(r,c) * x(c)
      enddo
    enddo
  enddo

(Plot: performance vs. R with L2 cache blocking and L1 cache blocking)
DMVM (DP) – Validation of blocking optimization

(Plot: measured performance vs. R with blocking factor Rb = 2000)
DMVM (DP) – OpenMP parallelization

Plain code:

  !$omp parallel do reduction(+:y)
  do c = 1 , C
    do r = 1 , R
      y(r) = y(r) + A(r,c) * x(c)
  enddo ; enddo
  !$omp end parallel do

Blocked code:

  !$omp parallel do private(rbS,rbE)
  do rb = 1 , R , Rb
    rbS = rb
    rbE = min((rb+Rb-1), R)
    do c = 1 , C
      do r = rbS , rbE
        y(r) = y(r) + A(r,c) * x(c)
  enddo ; enddo ; enddo
  !$omp end parallel do
DMVM (DP) – OpenMP parallelization & saturation

Intel Xeon E5 2695 v3 (Haswell-EP), 2.3 GHz base clock speed, bS = 32 GB/s

- Blocking is good for single-thread performance (reduced in-cache traffic)
- Memory traffic is unchanged → saturation unchanged!
- Saturation is influenced by serial performance
- Roofline limit: BC = 4 Byte/Flop, bS = 32 GB/s

So, is blocking useless? NO (see later)
Can we do anything to improve BC_mem? NO, not here
DMVM (DP) – saturation & clock speed

Intel Xeon E5 2695 v3 (Haswell-EP), 2.3 GHz base clock speed, bS = 32 GB/s

- Clock speed has a high impact on single-thread performance
- Even the worst code almost saturates the bandwidth @ 1.3 GHz
- Saturation is influenced by clock speed and serial performance

Clock speed or parallelism can "heal" slow code!
Conclusions from the dMVM example

- We have found the reason for the breakdown of single-core performance with a growing number of matrix rows: the LHS vector fitting into different levels of the cache hierarchy (validated by performance counter measurements)
- Inner loop blocking was employed to improve the code balance in L3 and/or L2 (validated by performance counter measurements)
- Blocking led to better single-threaded performance
- Saturated performance is unchanged (as predicted by Roofline), because the problem is still small enough to fit the LHS at least into the L3 cache
Case study: A Jacobi smoother
- The basics in two dimensions
- Layer conditions
- Optimization by spatial blocking
Stencil schemes

- Stencil schemes frequently occur in PDE solvers on regular lattice structures
- Basically it is a sparse matrix-vector multiply (spMVM) embedded in an iterative scheme (outer loop), but the regular access structure allows for matrix-free coding
- Complexity of implementation and performance depends on the stencil operator (e.g., Jacobi-type, Gauss-Seidel-type, …) and the spatial extent (e.g., 7-pt or 25-pt in 3D, …)

  do iter = 1, max_iterations
    ! Perform sweep over regular grid: y(:) ← x(:)
    ! Swap y ↔ x
  enddo
Jacobi-type 5-pt stencil in 2D

  do k=1,kmax
    do j=1,jmax
      y(j,k) = const * ( x(j-1,k) + x(j+1,k) &
                       + x(j,k-1) + x(j,k+1) )
    enddo
  enddo

Arrays: y(0:jmax+1,0:kmax+1), x(0:jmax+1,0:kmax+1); the sweep runs over j (inner) and k (outer).

Appropriate performance metric: "lattice updates per second" [LUP/s]
(here: multiply by 4 FLOP/LUP to get the FLOP/s rate)
Jacobi 5-pt stencil in 2D: data transfer analysis

  do k=1,kmax
    do j=1,jmax
      y(j,k) = const * ( x(j-1,k) + x(j+1,k) &
                       + x(j,k-1) + x(j,k+1) )
    enddo
  enddo

Naive balance per LUP (incl. write allocate):
- y(j,k): 1 ST + 1 LD (write allocate)
- x(j+1,k), x(j,k-1), x(j,k+1): 3 LD
- x(j-1,k): available in cache (used 2 updates before)

BC = 5 words / LUP = 40 B / LUP (assuming double precision)
Jacobi 5-pt stencil in 2D: Single core performance

Intel Xeon E5-2690 v2 ("IvyBridge" @ 3 GHz), Intel compiler ifort V13.1

(Plot: performance vs. jmax for jmax = kmax and jmax*kmax = const; the code balance BC measured with likwid-perfctr changes from ~24 B/LUP to ~40 B/LUP around the L3 cache limit)

Questions:
1. How to achieve 24 B/LUP also for large jmax?
2. How to sustain > 600 MLUP/s for jmax > 10⁴?
Case study: A Jacobi smoother
- The basics in two dimensions
- Layer conditions
- Optimization by spatial blocking
Analyzing the data flow

Worst case: the cache is not large enough to hold 3 layers (rows) of the grid
(assume a "least recently used" replacement strategy).

(Figure: grid x(0:jmax+1,0:kmax+1) with halo cells; j is the inner, k the outer direction; the cached rows are marked)
Analyzing the data flow

Reduce the inner (j-) loop dimension successively.
Best case: 3 "layers" of the grid fit into the cache!

(Figure: grids x(0:jmax1+1,0:kmax+1) and x(0:jmax2+1,0:kmax+1) with shrinking inner dimension)
Analyzing the data flow: Layer condition

2D 5-pt Jacobi-type stencil:

  do k=1,kmax
    do j=1,jmax
      y(j,k) = const * ( x(j-1,k) + x(j+1,k) &
                       + x(j,k-1) + x(j,k+1) )
    enddo
  enddo

"Layer condition" (double precision): 3 * jmax * 8B < CacheSize/2
(3 rows of length jmax; the factor 1/2 is a safety margin, rule of thumb)

The layer condition
- does not depend on the outer loop length (kmax)
- is no strict guideline (cache associativity; data traffic for y not included)
- needs to be adapted for other stencils (e.g., 3D 7-pt stencil)
Analyzing the data flow: Layer condition (2D 5-pt Jacobi)

Layer condition (3 * jmax * 8B < CacheSize/2) fulfilled?

YES:
- y: (1 LD + 1 ST) / LUP
- x: 1 LD / LUP
→ BC = 24 B / LUP

NO:
- y: (1 LD + 1 ST) / LUP
- x: 3 LD / LUP
→ BC = 40 B / LUP
Analyzing the data flow: Layer condition (2D 5-pt Jacobi)

Establish the layer condition for all domain sizes. Idea: spatial blocking
- Reuse elements of x() as long as they stay in the cache
- The sweep can be executed in any order, e.g. compute blocks in j-direction

"Spatial blocking" of the j-loop: determine for a given CacheSize an appropriate jblock value:

  do jb=1,jmax,jblock ! Assume jmax is a multiple of jblock
    do k=1,kmax
      do j= jb, (jb+jblock-1) ! Length of inner loop: jblock
        y(j,k) = const * ( x(j-1,k) + x(j+1,k) &
                         + x(j,k-1) + x(j,k+1) )
      enddo
    enddo
  enddo

New layer condition (blocking): 3 * jblock * 8B < CacheSize/2
→ jblock < CacheSize / 48 B
Establish the layer condition by blocking

Split up the domain into subblocks, e.g. block size = 5.
Price: additional data transfers (overhead) at the block boundaries!
Establish the layer condition by spatial blocking

Which cache to block for? Intel Xeon E5-2690 v2 ("IvyBridge" @ 3 GHz), Intel compiler ifort V13.1; caches: L1: 32 KB, L2: 256 KB, L3: 25 MB.

jblock < CacheSize / 48 B:
- L2 (CS = 256 KB): jblock = min(jmax, 5333)
- L3 (CS = 25 MB): jblock = min(jmax, 533333)

(Plot: performance vs. jmax for jmax = kmax and jmax*kmax = const)
Validating the model: Memory code balance

Measured main memory code balance (BC) vs. jmax: 24 B/LUP with blocking, 40 B/LUP without.
Intel Xeon E5-2690 v2 ("IvyBridge" @ 3 GHz), Intel compiler ifort V13.1

- The blocking factor for CS = 25 MB is still a little too large
- Main memory access is not the reason for the different performance (but L3 access is!)
Validating the model: L3 cache balance

Measured L3 cache code balance (BC) vs. jmax.
Intel Xeon E5-2690 v2 ("IvyBridge" @ 3 GHz), Intel compiler ifort V13.1

(Plot: L3 code balance vs. jmax; curves for data served from main memory (via L3), from the L1 cache, and from the L2 or L3 cache with blocking)

Impact of the total L3 traffic: 24 B/LUP vs. 40 B/LUP
From 2D to 3D

Towards a 3D understanding: the 2D picture can be considered as a 2D cut of a 3D domain at a (new) fixed i-coordinate: x(0:jmax+1,0:kmax+1) → x(i, 0:jmax+1,0:kmax+1)

x(0:imax+1, 0:jmax+1, 0:kmax+1) – assume the i-direction is contiguous in main memory (Fortran notation).
Stay in the 2D picture and consider one cell of the j-k plane as a contiguous slab of elements in i-direction: x(0:imax,j,k)
Layer condition: From 2D 5-pt to 3D 7-pt Jacobi-type stencil

2D:

  do k=1,kmax
    do j=1,jmax
      y(j,k) = const * ( x(j-1,k) + x(j+1,k) &
                       + x(j,k-1) + x(j,k+1) )
    enddo
  enddo

Layer condition: 3 * jmax * 8B < CacheSize/2 → BC = 24 B / LUP

3D:

  do k=1,kmax
    do j=1,jmax
      do i=1,imax
        y(i,j,k) = const * ( x(i-1,j,k) + x(i+1,j,k) &
                           + x(i,j-1,k) + x(i,j+1,k) &
                           + x(i,j,k-1) + x(i,j,k+1) )
      enddo
    enddo
  enddo

Layer condition: 3 * jmax * imax * 8B < CacheSize/2 → BC = 24 B / LUP
3D 7-pt Jacobi-type stencil (sequential)

  do k=1,kmax
    do j=1,jmax
      do i=1,imax
        y(i,j,k) = const * ( x(i-1,j,k) + x(i+1,j,k) &
                           + x(i,j-1,k) + x(i,j+1,k) &
                           + x(i,j,k-1) + x(i,j,k+1) )
      enddo
    enddo
  enddo

"Layer condition": 3 * jmax * imax * 8B < CS/2
Layer condition OK → 5 accesses to x() are served by the cache.

Question: does parallelization/multi-threading change the layer condition?
Jacobi Stencil – OpenMP parallelization (3D)

  !$OMP PARALLEL DO SCHEDULE(STATIC)
  do k=1,kmax
    do j=1,jmax
      do i=1,imax
        y(i,j,k) = 1/6. * ( x(i-1,j,k) + x(i+1,j,k) &
                          + x(i,j-1,k) + x(i,j+1,k) &
                          + x(i,j,k-1) + x(i,j,k+1) )
      enddo
    enddo
  enddo

Basic guideline: parallelize the outermost loop → equally large chunks in k-direction → a "layer condition" per thread.

Layer condition for a shared cache: nthreads * 3 * imax * jmax * 8B < CS/2
Layer condition for private caches: 3 * imax * jmax * 8B < CS/2

Layer condition (cubic domain; shared CacheSize = 25 MB):
1 thread: imax = jmax < 720; 10 threads: imax = jmax < 230

(c) RRZE 2017 Roofline Model case studies
Jacobi Stencil – OpenMP parallelization (II)

(Same parallel kernel as on the previous slide.)

Intel Xeon E5-2690 v2, 10 cores @ 3 GHz, CS = 25 MB (L3), bS = 48 GB/s

"Layer condition (shared cache)": nthreads * 3 * jmax * imax * 8B < CS/2 → BC = 24 B / LUP

Roofline model: P = min(Pmax, bS / (24 B/LUP))
"Light speed" performance: P = 2000 MLUP/s
What is Pmax here? → homework!
Socket scaling – validate the Roofline model

Intel Xeon E5-2690 v2 ("IvyBridge" @ 3 GHz), Intel compiler ifort V13.1, OpenMP-parallel; bS = 48 GB/s

P = min(Pmax, bS / BC), with BC = 24 B/LUP or 40 B/LUP
What is Pmax here? → homework!
Case study: A Jacobi smoother
- The basics in two dimensions
- Layer conditions
- Validating the model in 3D
- Optimization by spatial blocking in 3D
Jacobi Stencil – OpenMP parallelization (I)

Validation: measured data traffic from main memory [Bytes/LUP]
- 1 thread: layer condition OK – but cannot saturate the bandwidth
- 10 threads: performance drops around imax = 230 (layer condition violated → 40 B/LUP)
Jacobi Stencil – OpenMP parallelization (I)

(Same parallel kernel as before.)

If the original "layer condition" does not hold (nthreads * 3 * jmax * imax * 8B > CS/2), but assuming nthreads * jmax * imax * 8B < CS/2 (one layer per thread still fits):
- (8+8) B/LUP for y() (ST + write allocate)
- 8 B/LUP for x(i,j,k+1)
- 8 B/LUP for x(i,j+1,k)
- 8 B/LUP for x(i,j,k-1)
→ 40 B/LUP

maxMLUPs = MemBW / (40 B/LUP)
"Light speed" performance: P = 1200 MLUP/s
Case study: A Jacobi smoother
- The basics in two dimensions
- Layer conditions
- Validating the model in 3D
- Spatial blocking in 3D
Jacobi Stencil – simple spatial blocking

  do jb=1,jmax,jblock ! Assume jmax is a multiple of jblock
  !$OMP PARALLEL DO SCHEDULE(STATIC)
    do k=1,kmax
      do j=jb, (jb+jblock-1) ! Loop length jblock
        do i=1,imax
          y(i,j,k) = 1/6. * ( x(i-1,j,k) + x(i+1,j,k) &
                            + x(i,j-1,k) + x(i,j+1,k) &
                            + x(i,j,k-1) + x(i,j,k+1) )
        enddo
      enddo
    enddo
  enddo

"Layer condition" (j-blocking): nthreads * 3 * jblock * imax * 8B < CS/2
Ensure the layer condition by choosing jblock appropriately (cubic domain):
jblock < CS / (imax * nthreads * 48B)

Test system: Intel Xeon E5-2690 v2 (10 cores / 3 GHz), as before
Jacobi Stencil – simple spatial blocking

Determine jblock < CS / (2 * nthreads * 3 * imax * 8B), with imax = jmax = kmax.

Validation: measured data traffic from main memory [Bytes/LUP]
- The number of blocks changes with the domain size
- CS = 10 MB: ~90+ % of the roofline limit
Why not block the inner loop level?

(Plot: inner (i) loop blocking vs. middle (j) loop blocking, against the 24 B/update performance model; the optimum j block size is marked)
Conclusions from the Jacobi example

- We have made sense of the memory-bound performance vs. problem size:
  "Layer conditions" lead to predictions of the code balance
  "What part of the data comes from where" is a crucial question
  The model works only if the bandwidth is "saturated"; in-cache modeling is more involved
- Avoiding slow data paths == re-establishing the most favorable layer condition
- The improved code showed the speedup predicted by the model
- The optimal blocking factor can be estimated:
  Be guided by the cache size → the layer condition
  No need for an exhaustive scan of the "optimization space"

Food for thought:
- Multi-dimensional loop blocking – would it make sense?
- Can we choose a "better" OpenMP loop schedule?
- What would change if we parallelized inner loops?
Roofline diagram – exercise

Draw the roofline diagram, assuming double precision, for one node of the emmy cluster – the single-processor chip data can be found on slide 79 of the first presentation ("Intel Xeon E5 2690 v2").

Add these roofs:
- Code that only performs single-precision computations
- Code that only performs scalar instructions – no vectorization or SIMD instructions
- OpenMP code that ignores the golden rule of ccNUMA