Effectively Addressing Memory David Skinner, NERSC Division, Berkeley Lab


NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER


Page 2:

Abstract:

This talk demonstrates how loop control structures affect memory bandwidth and program performance. It examines the performance of various loops and memory-addressing idioms in different languages, with varying strides and access patterns. The focus is on making clean-looking code perform well.

Page 3:

CPU and Memory Imbalance

• As John McCalpin points out with the STREAM benchmark, CPU speeds are fast outpacing memory speeds.

• The megahertz war in commodity computing impacts HPC.

• Main memory access is often the bottleneck to performance.

Page 4:

The STREAM Benchmark

• Should you expect to see the STREAM memory bandwidth in your code?

• How drastically are you willing to modify your code to improve memory access rates?

• Good main memory bandwidth at the machine level is a necessary but not sufficient condition for good program performance. Good algorithms and compilers are needed too.

Operation   Kernel        Seaborg (MB/s)
Copy        a = b         475
Scale       a = s*c       467
Add         a = b+c       659
Triad       a = b + s*c   660
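As a rough illustration of what the four operations in the table measure, the kernels can be written as plain C loops. This is a minimal sketch, not the STREAM source: the function names and the use of `size_t` are assumptions made for the example.

```c
#include <stddef.h>

/* Illustrative versions of the four STREAM kernels.
   Names are hypothetical; they are not taken from the STREAM source. */

/* Copy: a = b (one load, one store per element) */
void stream_copy(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i];
}

/* Scale: a = s*c (one load, one store, one multiply) */
void stream_scale(double *a, const double *c, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = s * c[i];
}

/* Add: a = b + c (two loads, one store, one add) */
void stream_add(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Triad: a = b + s*c (two loads, one store, multiply and add) */
void stream_triad(double *a, const double *b, const double *c,
                  double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}
```

Each kernel does very little arithmetic per byte moved, which is why the measured rate is a statement about memory bandwidth rather than floating-point speed.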

Page 5:

SMPs themselves are increasingly unbalanced

Page 6:

Hardware Overview


Page 9:

Performance at what cost?

      SUBROUTINE stream_triad (a,b,c,scalar,n)
      REAL*8 c(n),a(n),b(n),scalar
      INTEGER AHEAD
      PARAMETER (AHEAD=128)

      IF (n.LE.AHEAD+1) THEN
         DO j = 1,n
            a(j) = b(j) + scalar*c(j)
         END DO
      ELSE
         DO j = 1,n-AHEAD,16
!IBM* CACHE_ZERO (a(j+AHEAD))
            DO i=0,15
               a(j+i) = b(j+i) + scalar*c(j+i)
            END DO
         END DO
         DO j = n-AHEAD+1,n
            a(j) = b(j) + scalar*c(j)
         END DO
      END IF
      END

      SUBROUTINE stream_triad (a,b,c,scalar,n)
      REAL*8 c(n),a(n),b(n),scalar
      DO i=1,n
         a(i) = b(i) + scalar*c(i)
      END DO
      END

Will improving memory bandwidth make my code unintelligible, non-portable, or more prone to breaking?

Page 10:

Scientific Applications

• Linear algebra

• Bioinformatics

• PDEs

• Sparse/mesh-free methods

In what follows, we focus on kernels that represent these application areas, written in a variety of programming languages.

All share certain language constructs

All address data structures of appreciable size

Page 11:

Schematic of a Scientific Code

[Figure: schematic timeline of a scientific code, with FLOP, I/O, and SYNC phases repeating across Loop1, Loop2, Loop3, and Loop4. Loop kernels shown: fill, scale, copy, add, daxpy.]

Page 12:

Case Study: MCTDH

  %    cumulative    self                self     total
 time    seconds    seconds    calls   ms/call   ms/call   name
 28.9      20.03      20.03      229     87.47     87.47   .qtxxzz [9]
 20.8      34.39      14.36      200     71.80     71.80   .qtxxzza [12]
 15.3      44.95      10.56     8534      1.24      1.24   .cpvxz [20]
  5.8      48.99       4.04     4574      0.88      0.88   .qtxxdz [32]
  4.4      52.01       3.02     1038      2.91      2.91   .zerovxz [36]
  4.0      54.80       2.79       42     66.43     66.43   .rmmxxxzz [38]
  2.6      56.57       1.77       92     19.24     19.24   .xvxxzza [42]
  2.0      57.96       1.39      408      3.41      3.41   .mmaxzz [47]
  1.7      59.17       1.21      176      6.88      6.88   .mattens [49]
  1.3      60.08       0.91       24     37.92     37.92   .rm1hxxxzz [53]
  1.3      60.99       0.91        3    303.33   1884.19   .csilstep [31]
  0.9      61.61       0.62      176      3.52      3.52   .mqxxzz [61]
  0.9      62.22       0.61      176      3.47      3.47   .mqxtzz [62]
  0.8      62.80       0.58     2741      0.21      0.21   .vvaxzz [65]
  0.8      63.34       0.54     1919      0.28      0.28   .addmxxzo [67]
  0.7      63.84       0.50      790      0.63      0.63   .zeromxz [68]
  0.5      64.17       0.33      184      1.79      1.79   .overmxz [74]
  0.5      64.49       0.32                                .IOWrite [75]

Fill, Copy, Scale, Add, Triad

Page 13:

Loop Constructs

• Direct
    – for (i = 0; i < n; i++) {
          a[i] = b[i];
      }
    – do i = 1, n
         a(i) = b(i)
      enddo

• Indirect
    – Gather: a[i] = b[ib[i]]
    – Scatter: a[ia[i]] = b[i]
    – Gascat: a[ia[i]] = b[ib[i]]

• Strided
    – for (i = 0; i < n; i += stride) { a[i] = b[i]; }
    – do i = 1, n, stride
         a(i) = b(i)
      enddo

• Multidimensional
      do k = 1, m
        do j = 1, n
          do i = 1, o
            a(i,j,k) = b(i,j,k)
            …
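The indirect and strided constructs listed above can be sketched as small C functions. This is only an illustration: the function names, the `size_t` index arrays, and the argument order are assumptions made for the example, and the caller is expected to supply index values that lie within the arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Gather: contiguous writes, indexed reads. */
void gather(double *a, const double *b, const size_t *ib, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[ib[i]];
}

/* Scatter: indexed writes, contiguous reads. */
void scatter(double *a, const size_t *ia, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[ia[i]] = b[i];
}

/* Gascat: indexed writes and indexed reads. */
void gascat(double *a, const size_t *ia,
            const double *b, const size_t *ib, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[ia[i]] = b[ib[i]];
}

/* Strided copy: touches every stride-th element of both arrays. */
void strided_copy(double *a, const double *b, size_t n, size_t stride)
{
    assert(stride > 0);   /* a zero stride would never terminate */
    for (size_t i = 0; i < n; i += stride)
        a[i] = b[i];
}
```

In the indirect forms the memory access pattern depends on the contents of the index arrays at run time, so the compiler and the hardware prefetcher have much less to work with than in the direct loop.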

Page 14:

Alternatives to Loops

• libc <string.h>
    fill: memset, bzero
    copy: memmove, memcpy

• BLAS
    copy: dcopy
    scale: dcopy, dscal
    add: dcopy, daxpy
    triad: dcopy, daxpy

• Fortran 90 Intrinsics
    copy: a(1:n:stride) = b(1:n:stride)
    triad: a = b + s*c

• STL
    vector<double> a,b,c;
    fill: fill(a.begin(), a.end(), scalar)
    copy: copy(b.begin(), b.end(), a.begin())
    add: transform(b.begin(), b.end(), c.begin(), a.begin(), plus<double>())

There’s definitely more than one way to do it.
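For the libc route, a minimal C sketch of fill and copy on arrays of doubles might look like the following. The wrapper names are hypothetical. Note one limitation that the loop versions do not have: `memset` writes bytes, so it can only fill with zero here (assuming an IEEE 754 platform, where all-bits-zero is 0.0), not with an arbitrary scalar.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fill with 0.0 only: memset sets every byte to the same value. */
void fill_zero(double *a, size_t n)
{
    assert(n <= SIZE_MAX / sizeof *a);   /* guard the byte-count multiply */
    memset(a, 0, n * sizeof *a);
}

/* Copy when the two arrays are known not to overlap. */
void copy_disjoint(double *a, const double *b, size_t n)
{
    memcpy(a, b, n * sizeof *a);
}

/* Copy when source and destination may overlap. */
void copy_overlapping(double *dst, const double *src, size_t n)
{
    memmove(dst, src, n * sizeof *dst);
}
```

The choice between `memcpy` and `memmove` is about correctness first: `memcpy` has undefined behavior on overlapping regions, while `memmove` handles them.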

Page 15:

Implementation of Tests

s00513>./xtream -h
usage: xtream [options] ndim dim1 [dim2 ...] [inc1 inc2 ...]
  -stride n   uses stride n
  -libc       1d bzero,memset,memcpy,memmove ops
  -blas       1d blas ops dcopy, dscal, daxpy
  -stl        STL algorithms
  -scatter    a(ia(i)) = b(i)
  -gather     a(i) = b(ib(i))
  -gascat     a(ia(i)) = b(ib(i))
  -null       check timers with a null op
  -nit n      average results over n iterations
  -rate       show MB/sec results instead of times
  -scan       increase sizes as dim(i) += inc(i)
  -scanx      increase sizes as dim(i) += dim(i)/inc(i)+1

Page 16:

Example

s00513>./xtream 2 256 256
host s00513 006006564C00/AIX word=8 (Mon Dec 16 08:16:03 2002)
cmd "./xtream 2 256 256 "
dimension 65536 2 { 256 256 }
nit = 128 mb = 5.000e-01
construct   N      t_fill     t_copy     t_scale    t_add      t_triad
c_for,1d    65536  4.749e-04  5.260e-04  4.758e-04  1.896e-03  1.617e-03
f_do,1d     65536  4.745e-04  5.258e-04  4.777e-04  1.846e-03  1.846e-03
c++_stl,1d  65536  5.621e-04  4.840e-04  4.864e-04  7.353e-03  7.834e-03
c_blas      65536  0.000e+00  3.851e-04  3.472e-04  8.185e-04  8.151e-04
c_for       65536  4.752e-04  5.209e-04  4.543e-04  1.870e-03  1.861e-03
c++_for     65536  4.748e-04  5.207e-04  4.494e-04  1.858e-03  1.874e-03
f_do        65536  4.728e-04  5.228e-04  4.734e-04  5.228e-04  1.848e-03
f90_intr    65536  4.733e-04  5.241e-04  4.752e-04  1.830e-03  1.835e-03
wallclock   65536  6.323e+00 sec

A lot of information for the programmer!
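To relate the timings above to a bandwidth figure, one can divide the bytes moved by the elapsed time. The sketch below is an assumption about how such a conversion could be done, not a description of how xtream's -rate option computes its numbers: the function name is hypothetical, 1 MB is taken as 10^6 bytes, and the number of arrays touched per element (for example, 2 for copy: one read and one write) is a counting convention chosen by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a kernel timing into MB/s (1 MB = 1e6 bytes).
   arrays_touched is the number of arrays read or written per element,
   which is a convention supplied by the caller. */
double bandwidth_mb_per_s(size_t n, size_t word_bytes,
                          unsigned arrays_touched, double seconds)
{
    assert(seconds > 0.0);
    double bytes = (double)n * (double)word_bytes * (double)arrays_touched;
    return bytes / seconds / 1.0e6;
}
```

With the c_for,1d copy row (N = 65536, word = 8, t_copy = 5.260e-04 s) and two arrays counted, `bandwidth_mb_per_s(65536, 8, 2, 5.260e-04)` comes to roughly 1990 MB/s under these assumptions.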

Page 17:

Scanning over problem size (copy)

Page 18:

Compiler Options

• By default, IBM’s compilers provide no optimization.

• By default you get only one 256 MB memory segment (use -bmaxdata:0x70000000 to get more).

• Optimization levels
    • none
    • -O2
    • -O3 -qstrict -qarch=auto -qtune=auto

Page 19:

-O2

Page 20:

-O3 -qstrict -qtune=pwr3 -qarch=pwr3

Page 21:

Misalignment

2000x2000 ESSL-SMP DGEMMs

Mflop/s   % Efficiency   Misalignment
5304      100            None; just real*8 arrays in common
4482      85             4 bytes; integer as first item in common
4317      81             4 bytes; character*4 as first item in common
397       7              1 byte; character*1 as first item in common
397       7              2 bytes; character*2 as first item in common
397       7              3 bytes; character*3 as first item in common

Has xlf ever told you?

1514-008 (W) Variable mres is misaligned. This may affect the efficiency of the code.
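The table describes Fortran common blocks, where a short leading item shifts every following real*8 off its natural boundary. As a loosely analogous C sketch, the helper below reports how far an address sits past the previous multiple of a given alignment. The function name is hypothetical, and the pointer-to-integer conversion is assumed to yield the numeric address, which holds on mainstream platforms but is implementation-defined in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes by which p lies past the previous multiple of align.
   Returns 0 when p is aligned to align. */
size_t misalignment(const void *p, size_t align)
{
    assert(align != 0);
    return (size_t)((uintptr_t)p % align);
}
```

For example, given `double storage[4]` and a byte pointer `base` to its start, `misalignment(base, _Alignof(double))` is 0, while `misalignment(base + 1, _Alignof(double))` is 1, which corresponds to the character*1 row of the table. In ordinary C the compiler inserts padding in structs to avoid this situation; it arises mainly with packed layouts or hand-computed offsets into byte buffers.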

Page 22:

Data Locality

Page 23:

Indirect Addressing

Page 24:

Multidimensional Loops and Loop Overhead

Page 25:

Multidimensional Loops

Page 26:

References

• STREAM Benchmark

• OOPack Benchmark

• Stepanov C/C++ Benchmarks

• Haney Kernels
