NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER
Effectively Addressing Memory
David Skinner, NERSC Division, Berkeley Lab
Abstract:
A demonstration of how loop control structures impact memory bandwidth and program performance is presented. The performance of various loops and memory addressing idioms in different languages, with varying strides and access patterns, is examined. The focus is on making clean-looking code perform well.
CPU and Memory Imbalance
• As John McCalpin points out with the STREAM benchmark, CPUs are fast outpacing memory.
• The megahertz war in commodity computing impacts HPC.
• Main memory access is often the bottleneck to performance.
The STREAM Benchmark
• Should you expect to see the STREAM memory bandwidth in your code?
• How drastically are you willing to modify your code to improve memory access rates?
• Good main memory bandwidth at the machine level is a necessary but not sufficient condition for good program performance. Good algorithms and compilers are needed too.
Operation   Kernel        Seaborg (MB/s)
Copy        a = b         475
Scale       a = s*c       467
Add         a = b+c       659
Triad       a = b + s*c   660
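The four operations in the table can be sketched in C as follows. This is a minimal illustration, not the benchmark's own source: the function names and the `mb_per_sec` helper are hypothetical, and the helper assumes 1 MB = 10^6 bytes and that the caller supplies how many arrays each kernel touches (2 for copy and scale, 3 for add and triad).

```c
#include <stddef.h>

/* Copy: a = b (one array read, one array written). */
void stream_copy(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] = b[i];
}

/* Scale: a = s*c (one array read, one array written). */
void stream_scale(double *a, const double *c, double s, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] = s * c[i];
}

/* Add: a = b + c (two arrays read, one array written). */
void stream_add(double *a, const double *b, const double *c, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] = b[i] + c[i];
}

/* Triad: a = b + s*c (two arrays read, one array written). */
void stream_triad(double *a, const double *b, const double *c,
                  double s, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] = b[i] + s * c[i];
}

/* Hypothetical helper: bandwidth in MB/s for a kernel that touches
 * `arrays_touched` arrays of n doubles in `seconds` of elapsed time.
 * Assumes 1 MB = 1e6 bytes. */
double mb_per_sec(size_t n, int arrays_touched, double seconds) {
    return (double)n * sizeof(double) * arrays_touched / seconds / 1.0e6;
}
```

To see memory bandwidth rather than cache bandwidth, the arrays would need to be much larger than the largest cache on the machine.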
SMPs themselves are increasingly unbalanced
Hardware Overview
Performance at what cost?
      SUBROUTINE stream_triad (a,b,c,scalar,n)
      REAL*8 c(n),a(n),b(n),scalar
      INTEGER AHEAD
      PARAMETER (AHEAD=128)
      IF (n.LE.AHEAD+1) THEN
        DO j = 1,n
          a(j) = b(j) + scalar*c(j)
        END DO
      ELSE
        DO j = 1,n-AHEAD,16
!IBM* CACHE_ZERO (a(j+AHEAD))
          DO i=0,15
            a(j+i) = b(j+i) + scalar*c(j+i)
          END DO
        END DO
        DO j = n-AHEAD+1,n
          a(j) = b(j) + scalar*c(j)
        END DO
      END IF
      END

      SUBROUTINE stream_triad (a,b,c,scalar,n)
      REAL*8 c(n),a(n),b(n),scalar
      DO i=1,n
        a(i) = b(i) + scalar*c(i)
      END DO
      END
Will improving memory bandwidth make my code unintelligible, not portable, or more prone to breaking?
Scientific Applications
• Linear algebra
• Bioinformatics
• PDEs
• Sparse/Mesh free methods
In what follows we’ll focus on kernels which represent these stanzas, in a variety of programming languages.
All share certain language constructs
All address data structures of appreciable size
Schematic of a Scientific Code
[Figure: repeating FLOP, I/O, and SYNC phases; Loop1, Loop2, Loop3, Loop4; kernels: fill, scale, copy, add, daxpy]
Case Study: MCTDH
  %   cumulative    self               self     total
 time    seconds  seconds    calls  ms/call   ms/call  name
 28.9      20.03    20.03      229    87.47     87.47  .qtxxzz [9]
 20.8      34.39    14.36      200    71.80     71.80  .qtxxzza [12]
 15.3      44.95    10.56     8534     1.24      1.24  .cpvxz [20]
  5.8      48.99     4.04     4574     0.88      0.88  .qtxxdz [32]
  4.4      52.01     3.02     1038     2.91      2.91  .zerovxz [36]
  4.0      54.80     2.79       42    66.43     66.43  .rmmxxxzz [38]
  2.6      56.57     1.77       92    19.24     19.24  .xvxxzza [42]
  2.0      57.96     1.39      408     3.41      3.41  .mmaxzz [47]
  1.7      59.17     1.21      176     6.88      6.88  .mattens [49]
  1.3      60.08     0.91       24    37.92     37.92  .rm1hxxxzz [53]
  1.3      60.99     0.91        3   303.33   1884.19  .csilstep [31]
  0.9      61.61     0.62      176     3.52      3.52  .mqxxzz [61]
  0.9      62.22     0.61      176     3.47      3.47  .mqxtzz [62]
  0.8      62.80     0.58     2741     0.21      0.21  .vvaxzz [65]
  0.8      63.34     0.54     1919     0.28      0.28  .addmxxzo [67]
  0.7      63.84     0.50      790     0.63      0.63  .zeromxz [68]
  0.5      64.17     0.33      184     1.79      1.79  .overmxz [74]
  0.5      64.49     0.32                              .IOWrite [75]

Fill, Copy, Scale, Add, Triad
Loop Constructs
• Direct
  – for(i=0;i<n;i++) {
      a[i] = b[i];
    }
  – do i=1,n
      a(i) = b(i)
    enddo
• Indirect
  – Gather: a[i] = b[ib[i]]
  – Scatter: a[ia[i]] = b[i]
  – Gascat: a[ia[i]] = b[ib[i]]
• Strided
  – for(i=0;i<n;i+=stride) { a[i] = b[i]; }
  – do i=1,n,stride
      a(i) = b(i)
    enddo
• Multidimensional
  – do k = 1,m
      do j = 1,n
        do i = 1,o
          a(i,j,k) = b(i,j,k)
          …
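The indirect, strided, and multidimensional constructs listed above might look like this as self-contained C functions. This is a sketch only: the function names are hypothetical, the index arrays are assumed to hold valid in-range indices, `stride` is assumed to be at least 1, and the 3-D arrays are assumed to be stored flat with `i` as the fastest-varying index to mirror the Fortran loop nest.

```c
#include <stddef.h>

/* Gather: a[i] = b[ib[i]] (indexed reads, contiguous writes). */
void gather(double *a, const double *b, const size_t *ib, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] = b[ib[i]];
}

/* Scatter: a[ia[i]] = b[i] (contiguous reads, indexed writes). */
void scatter(double *a, const size_t *ia, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++) a[ia[i]] = b[i];
}

/* Gascat: a[ia[i]] = b[ib[i]] (indexed reads and indexed writes). */
void gascat(double *a, const size_t *ia,
            const double *b, const size_t *ib, size_t n) {
    for (size_t i = 0; i < n; i++) a[ia[i]] = b[ib[i]];
}

/* Strided copy: touches every stride-th element; stride must be >= 1. */
void strided_copy(double *a, const double *b, size_t n, size_t stride) {
    for (size_t i = 0; i < n; i += stride) a[i] = b[i];
}

/* Multidimensional copy over flat arrays of o*n*m elements.
 * The innermost loop runs over i, which is contiguous in memory. */
void copy3d(double *a, const double *b, size_t o, size_t n, size_t m) {
    for (size_t k = 0; k < m; k++)
        for (size_t j = 0; j < n; j++)
            for (size_t i = 0; i < o; i++) {
                size_t idx = i + o * (j + n * k);
                a[idx] = b[idx];
            }
}
```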
Alternatives to Loops
• libc <string.h>
fill: memset, bzero
copy: memmove, memcpy
• BLAS
copy: dcopy
scale: dcopy, dscal
add: dcopy, daxpy
triad: dcopy, daxpy
• Fortran 90 Intrinsics
copy: a(1:n:stride) = b(1:n:stride)
triad: a = b + s*c
• STL
vector<double> a,b,c;
fill: fill(a.begin(), a.end(), scalar)
copy: copy(b.begin(), b.end(),a.begin())
add: transform(b.begin(),b.end(),
c.begin(),a.begin(), plus<double>())
There’s definitely more than one way to do it.
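As one concrete instance of the libc alternatives above, fill and copy on arrays of doubles can be written without an explicit loop. This is a minimal sketch with hypothetical wrapper names. Note that `memset` writes a byte pattern, so it can only stand in for a fill when the fill value is 0.0 (all-zero bits in IEEE 754), and `memcpy` requires that the two arrays do not overlap, whereas `memmove` does not.

```c
#include <stddef.h>
#include <string.h>

/* Fill with 0.0 only: memset sets bytes, and 0.0 is all-zero bytes. */
void fill_zero(double *a, size_t n) {
    memset(a, 0, n * sizeof *a);
}

/* Copy b into a; the two ranges must not overlap. */
void copy_libc(double *a, const double *b, size_t n) {
    memcpy(a, b, n * sizeof *a);
}

/* Copy b into a; safe even when the two ranges overlap. */
void copy_overlap(double *a, const double *b, size_t n) {
    memmove(a, b, n * sizeof *a);
}
```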
Implementation of Tests
s00513>./xtream -h
usage: xtream [options] ndim dim1 [dim2 ...] [inc1 inc2 ...]
  -stride n   uses stride n
  -libc       1d bzero,memset,memcpy,memmove ops
  -blas       1d blas ops dcopy, dscal, daxpy
  -stl        STL algorithms
  -scatter    a(ia(i)) = b(i)
  -gather     a(i) = b(ib(i))
  -gascat     a(ia(i)) = b(ib(i))
  -null       check timers with a null op
  -nit n      average results over n iterations
  -rate       show MB/sec results instead of times
  -scan       increase sizes as dim(i) += inc(i)
  -scanx      increase sizes as dim(i) += dim(i)/inc(i)+1
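The slides do not show how xtream implements its timers, so the following is only a sketch of what averaging over `-nit` iterations and checking against a `-null` operation could look like. All names here (`kernel_fn`, `null_op`, `copy_op`, `time_kernel`) are hypothetical, and the standard `clock()` function is used for portability even though it reports CPU time rather than wall-clock time.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel signature: operate on n elements of a and b. */
typedef void (*kernel_fn)(double *a, const double *b, size_t n);

/* Null operation: does no work, so timing it exposes timer overhead. */
void null_op(double *a, const double *b, size_t n) {
    (void)a; (void)b; (void)n;
}

/* Direct copy loop, used here as the kernel under test. */
void copy_op(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] = b[i];
}

/* Average seconds per call of kernel k over nit iterations (nit >= 1). */
double time_kernel(kernel_fn k, double *a, const double *b,
                   size_t n, int nit) {
    clock_t t0 = clock();
    for (int it = 0; it < nit; it++) k(a, b, n);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC / nit;
}
```

An optimizing compiler may remove repeated calls whose results are unused, so a real harness would typically consume the output array after timing.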
Example
s00513>./xtream 2 256 256
host s00513 006006564C00/AIX word=8 (Mon Dec 16 08:16:03 2002)
cmd "./xtream 2 256 256 "
dimension 65536 2 { 256 256 }
nit = 128 mb = 5.000e-01
construct   N      t_fill     t_copy     t_scale    t_add      t_triad
c_for,1d    65536  4.749e-04  5.260e-04  4.758e-04  1.896e-03  1.617e-03
f_do,1d     65536  4.745e-04  5.258e-04  4.777e-04  1.846e-03  1.846e-03
c++_stl,1d  65536  5.621e-04  4.840e-04  4.864e-04  7.353e-03  7.834e-03
c_blas      65536  0.000e+00  3.851e-04  3.472e-04  8.185e-04  8.151e-04
c_for       65536  4.752e-04  5.209e-04  4.543e-04  1.870e-03  1.861e-03
c++_for     65536  4.748e-04  5.207e-04  4.494e-04  1.858e-03  1.874e-03
f_do        65536  4.728e-04  5.228e-04  4.734e-04  5.228e-04  1.848e-03
f90_intr    65536  4.733e-04  5.241e-04  4.752e-04  1.830e-03  1.835e-03
wallclock   65536  6.323e+00 sec
A lot of information for the programmer!
Scanning over problem size (copy)
Compiler Options
• By default IBM’s compilers provide no optimization
• By default you get only one (256 MB) memory segment (use -bmaxdata 0x7000000 to get more)
• Optimization Levels
  – none
  – -O2
  – -O3 -qstrict -qarch=auto -qtune=auto
-O2
-O3 -qstrict -qtune=pwr3 -qarch=pwr3
Misalignment
2000x2000 ESSL-SMP DGEMMs

Mflop/s  % Efficiency  Misalignment
5304     100           None; just real*8 arrays in common
4482     85            4-bytes; integer as first item in common
4317     81            4-bytes; character*4 as first item in common
397      7             1-byte; character*1 as first item in common
397      7             2-bytes; character*2 as first item in common
397      7             3-bytes; character*3 as first item in common
Has xlf ever told you?
1514-008 (W) Variable mres is misaligned. This may affect the efficiency of the code.
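The table concerns Fortran COMMON blocks, where items are laid out in sequence so a short leading item shifts every array after it. A rough C analogue, offered only as a sketch with hypothetical names, is to check a pointer's address against the required alignment. In a C struct the compiler inserts padding so that a `double` member after a `char` stays aligned; misalignment of the kind in the table arises in C mainly with packed layouts or hand-computed offsets into byte buffers.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if p is a multiple of `alignment` bytes, else 0.
 * `alignment` must be nonzero. */
int is_aligned(const void *p, size_t alignment) {
    return ((uintptr_t)p % alignment) == 0;
}

/* Analogue of "character*1 as first item in common": a one-byte tag
 * followed by a double array. Unlike the COMMON block, the compiler
 * pads after `tag` so that `x` starts on a suitably aligned offset. */
struct padded {
    char tag;
    double x[4];
};
```

For example, `offsetof(struct padded, x)` is a multiple of the alignment of `double`, while a pointer one byte past an 8-byte-aligned address fails `is_aligned(p, 8)`.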
Data Locality
Indirect Addressing
Multidimensional Loops and Loop Overhead
Multidimensional Loops
References
• STREAM Benchmark
• OOPack Benchmark
• Stepanov C/C++ Benchmarks
• Haney Kernels