Demanding Parallel FFTs: Slabs & Rods
Ian Kirker
August 22, 2008
MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2008
Abstract
Fourier transforms of multidimensional data are an important component of many scientific codes, and thus the efficient parallelisation of these transforms is key to obtaining high performance on large numbers of processors. For the three-dimensional case, two common decompositions of the input data can be employed to perform this parallelisation – one-dimensional ("slab") and two-dimensional ("rod") processor grid divisions.
In this report, we demonstrate an implementation of the three-dimensional FFT routine in parallel, using component routines from a library capable of performing one-dimensional FFTs, and briefly examine the considerations of creating a portable application in C which can use multiple libraries.
We then examine the performance of the two decompositions on seven different high-performance platforms – HECToR, HPCx, Ness, BlueSky, MareNostrum, HLRB II, and Eddie – using each of the FFT libraries available on each of these platforms. Efficient scaling is demonstrated up to more than a thousand processors for sufficiently large data cube sizes, and the results obtained indicate that the slab decomposition generally obtains greater performance.
Additionally, our results indicate that the FFT libraries installed on the platforms we tested are largely comparable in performance, whether vendor-provided or open-source, except on the esoteric Blue Gene/L architecture, on which ESSL proved to be superior.
Contents
1 Introduction
  1.1 Fourier Transforms
  1.2 All-to-All Communication
  1.3 FFT Libraries
    1.3.1 FFTW
    1.3.2 ESSL
    1.3.3 ACML
    1.3.4 MKL
  1.4 HPC Systems
    1.4.1 HECToR
    1.4.2 HPCx
    1.4.3 Ness
    1.4.4 BlueSky
    1.4.5 Eddie
    1.4.6 MareNostrum
    1.4.7 HLRB II
2 Benchmarking
  2.1 2D FFTs for the Slab Decomposition
  2.2 Benchmarking on an HPC Platform
  2.3 Implementation Details
    2.3.1 Language and Communication
    2.3.2 Test Data and Verification
    2.3.3 Multidimensional FFT Data Rearrangement
    2.3.4 Rod Decomposition Dimensions
    2.3.5 Memory Limitations
  2.4 FFT Library Porting
    2.4.1 FFTW3
    2.4.2 FFTW2
    2.4.3 MKL
    2.4.4 ESSL
    2.4.5 ACML
  2.5 System Porting
  2.6 Scripting
3 Results
  3.1 HECToR
    3.1.1 Libraries
    3.1.2 1D vs 2D FFT
    3.1.3 Slabs vs Rods
    3.1.4 Scaling
  3.2 Ness
    3.2.1 Libraries
    3.2.2 Slabs vs Rods
    3.2.3 Scaling
  3.3 HPCx
    3.3.1 Libraries
    3.3.2 Slabs vs Rods
    3.3.3 Scaling
  3.4 BlueSky
    3.4.1 Libraries
    3.4.2 Slabs vs Rods
    3.4.3 Scaling
  3.5 Eddie
    3.5.1 Libraries
    3.5.2 Slabs vs Rods
    3.5.3 Scaling
  3.6 MareNostrum
    3.6.1 Libraries
    3.6.2 Slabs vs Rods
    3.6.3 Scaling
  3.7 HLRB II
    3.7.1 Libraries
    3.7.2 Slabs vs Rods
  3.8 Automatic Parallel Routines
4 Discussion and Conclusions
  4.1 Rods & Slabs
  4.2 Libraries
  4.3 Automatic Routines
  4.4 Improving the Benchmark
  4.5 Future Work
A All-to-all Data Rearrangement
B Patching FFTW3 to Blue Gene/L
C Readme for Software Package
D Work Plan and Organisation
List of Tables
1.1 Software versions used on each platform
2.1 FFT libraries and their complex double-precision floating-point number specification for the C language interface.
List of Figures
1.1 A directed acyclic graph showing the data dependencies for each step of an FFT operation on an 8 element array.
1.2 Slab and rod decompositions as applied to a 4 x 4 x 4 data cube decomposed over 4 processors.
1.3 Steps and data rearrangement in the slab decomposition of the 3D FFT. Reproduced from [1].
1.4 Steps and data rearrangement in the rod decomposition of the 3D FFT. Reproduced from [1].
1.5 All-to-all communication between processors, as applied to a common matrix transposition.
2.1 Graphical representation of the ability to break down a three-dimensional FFT into different dimensionality function calls. Each method is equivalent, and the operations are commutative. Axes may be arbitrarily reordered.
2.2 2D Decomposition Code
3.1 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 64 processors, on HECToR.
3.2 A comparison of the total time taken for the purely 1D and the 2D & 1D FFT-using slab decomposition 3D FFT, for different libraries, using 16 processors, on HECToR.
3.3 A comparison of the total time taken for the two different decompositions of the 3D FFT on HECToR, using the ACML library.
3.4 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on HECToR.
3.5 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 8 processors, on Ness.
3.6 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for MKL on the first call of the FFT routine and the second, using 8 processors, on Ness.
3.7 A comparison of the total time taken for the two different decompositions of the 3D FFT on Ness, using the FFTW3 library.
3.8 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on Ness.
3.9 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 32 processors, on HPCx.
3.10 A comparison of the total time taken for the two different decompositions of the 3D FFT on HPCx, using the ESSL library.
3.11 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on HPCx.
3.12 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 128 processors, on BlueSky.
3.13 A comparison of the total time taken for the two different decompositions of the 3D FFT on BlueSky, using the ESSL library.
3.14 A comparison of the total time taken for the two different decompositions of the 3D FFT on BlueSky, using the FFTW3 library.
3.15 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on BlueSky using ESSL.
3.16 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on BlueSky using FFTW3.
3.17 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 16 processors, on Eddie.
3.18 A comparison of the total time taken for the two different decompositions of the 3D FFT on Eddie, using the FFTW3 library.
3.19 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on Eddie.
3.20 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 32 processors, on MareNostrum.
3.21 A comparison of the total time taken for the two different decompositions of the 3D FFT on MareNostrum, using the ESSL library.
3.22 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on MareNostrum.
3.23 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 64 processors, on HLRB II.
3.24 A comparison of the total time taken for the two different decompositions of the 3D FFT on HLRB II, using the FFTW3 library.
3.25 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on HLRB II.
3.26 A comparison of the time taken for a slab decomposition 3D FFT, for the automatic parallel routines, as a percentage of the time taken for our written routines, using FFTW2, on HECToR.
3.27 A Vampir trace timeline taken on Ness, showing the barrier preceding, and the communication performed within, the FFTW2 parallel routine.
3.28 A comparison of the time taken for a slab decomposition 3D FFT, for the automatic parallel routines using PESSL, on HPCx.
3.29 A comparison of the time taken for a slab decomposition 3D FFT, for the automatic parallel routines, as a percentage of the time taken for our written routines, using our compiled version of FFTW 3.2a3, on Ness.
A.1 Data rearrangement code
A.2 Data rearrangement code (cont.)
B.1 The FFTW3 Patch
Acknowledgements
Principally, I’d like to thank Dr. Gavin Pringle, for his immeasurable guidance, support, and encouragement throughout this project.
I’d also like to thank Dr. Joachim Hein, for his greatly appreciated assistance, especially concerning the BlueSky system.
Acknowledgement and thanks are also due for the assistance of David Vicente of the Barcelona Supercomputing Centre, who enabled us to have results for MareNostrum, and to the Leibniz-Rechenzentrum München, for allowing us to use their HLRB II system.
Finally, many thanks to David Weir, and my assorted family members, for all their support throughout my time in Edinburgh.
Chapter 1
Introduction
Fourier transforms are an oft-used technique in signal analysis, where they can be used to perform complex operations on signals or simply to retrieve the frequencies present, and in linear algebra, where they are widely used in multiple dimensions to solve problems that would take an untenable amount of time by iterative methods. In such applications the Fourier transform is often the most computationally intensive part of the code, so parallelising the operation should significantly benefit the overall runtime of a calculation.
It is therefore in our interest to determine an optimal method of parallelisation for this transformation. Commonly used methods employ data parallelism, decomposing the multidimensional input data along one or more of its dimensions and giving each processor a subset of the data to transform. For the three-dimensional case, the one-dimensional decomposition is more common, but with the increasing numbers of processors in modern systems, demand for higher degrees of parallelism in applications has made the two-dimensional decomposition a more viable option than before.
Previous work [1][2] has studied these methods, and their tuning, on specific platforms; in this study, however, we aim to provide a broader view of the performance of the two decompositional methods across many platforms, using several libraries together with our own parallel decomposition code.
1.1 Fourier Transforms
The Fourier transform is a mathematical method of representing a function in terms of sinusoidal components, constructing a new function that gives the phase and amplitude of each component required to reconstruct the original function.
It is often used to decompose a function of time into its component frequencies; the resulting transformed function has many useful properties, and can be used to combine functions in ways that would be much more complex to perform on the functions in the time domain.
The transform can be discretised in order to operate on a numerical sequence – in this case, the numerical sequence is assumed to be a single instance of a repeating signal, and the resolution is limited to that of the initial input data. The discrete Fourier transform is used in many computational techniques, often to transform the data into a form where a certain operation can be applied more easily, after which the reverse transformation returns the data to its former domain.
The mathematical form of the discrete Fourier transform is largely irrelevant to the context of this study, except to say that there are essentially two common mechanisms that can be used – the ordinary discrete Fourier transform, which takes O(n²) time, and a recursive method thought to have been first described by Gauss [3], re-formalised by Cooley and Tukey [4], and much investigated since [5], known as the fast Fourier transform (FFT), in which the transform is constructed by recursively performing simpler, smaller transforms via a divide-and-conquer algorithm, taking only O(n log n) time.
Figure 1.1: A directed acyclic graph showing the data dependencies for each step of an FFT operation on an 8 element array.
This technique is thus limited to sequences which can be divided in this way; sequences of prime length must be operated on either using the full O(n²) mechanism, or using one of a set of more advanced techniques in which the FFT is expressed as a convolution of two sequences, which are themselves expressible as the product of two FFTs of sequences of non-prime length [6][7].
The divide-and-conquer mechanism leads to complex data dependencies in the one-dimensional FFT (see Figure 1.1), making it inefficient to parallelise except for very large data sets or on vector platforms. However, the two- and three-dimensional forms of the FFT, which are often used in scientific computing, are mathematically equivalent to performing one-dimensional FFTs along each dimension, in any order. In the three-dimensional case, the data can thus be decomposed over a one-dimensional array of processors along any dimension of the data cube, leaving two dimensions of the cube intact so that whole FFTs can be performed without communication (Figure 1.3). This type of decomposition is commonly known as a ’slab’ decomposition.
Due to the computational intensity of these operations, and the relatively small extents of the data cube (usually smaller than 10^4 in each dimension), it would be ideal for parallelisation to divide up one additional dimension of the data cube, leaving only one dimension that can be operated on without communication. This form is commonly known as a "rod" or "pencil" decomposition. Performing the three-dimensional FFT in this case, however, requires an extra communication step over the slab decomposition, as shown in Figure 1.4. The additional cost of this communication may make the operation perform significantly worse than the simpler one-dimensional decomposition. However, each of the two communication steps of the rod decomposition is simpler than the single step of the slab decomposition, requiring only O(√p) messages to and from each processor, rather than O(p) in the case of the slab decomposition. Each message contains the same quantity of data, so this reduction in message count comes with a corresponding decrease in the overall quantity of data that has to be inserted into the network. These factors may give the rod decomposition an advantage where the slab decomposition's performance is limited by any of the properties of the network.
We therefore investigate which decomposition gives better performance, and to what degree, for a range of data set sizes and processor counts, on a range of platforms. This will also provide expected timings and scaling measures for these three-dimensional FFTs.
Figure 1.2: Slab (1D) and rod (2D) decompositions as applied to a 4 x 4 x 4 data cube decomposed over 4 processors.
[Figure: an L x M x N problem on Procs 0-3; (a) perform the 1st 1D-FFT along the y-dimension and the 2nd 1D-FFT along the z-dimension, then ALL-to-ALL to get data over the x-dimension locally; (b) perform the 1D-FFT along the x-dimension.]
Figure 1.3: Steps and data rearrangement in the slab decomposition of the 3D FFT. Reproduced from [1]. Note that the axes have been rotated to better display data reorganisation.
[Figure: a data cube on Procs 0-15; (a) perform the 1D-FFT along the y-dimension; ALL-to-ALL within each sub-group to get data over the z-dimension locally; perform the 1D-FFT along the z-dimension; ALL-to-ALL within each sub-group to get data over the x-dimension locally; perform the 1D-FFT along the x-dimension.]
Figure 1.4: Steps and data rearrangement in the rod decomposition of the 3D FFT. Reproduced from [1]. Note that the axes have been rotated to better display data reorganisation.
1.2 All-to-All Communication
All-to-all communication, as the name suggests, is a style of communication in which every discrete processing unit in a parallel computing group needs to communicate with every other. In the message-passing paradigm, this means taking n processing units, each holding n items of data, and redistributing them such that unit i receives the ith item held by every other task. This is equivalent to performing a transposition of an n-row matrix, as shown in Figure 1.5.
[Figure: the elements 0-15 of a 4 x 4 matrix distributed over Procs 0-3, before and after the exchange.]
Figure 1.5: All-to-all communication between processors, as applied to a common matrix transposition.
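In MPI this redistribution is what MPI_Alltoall performs. The rule itself can be shown with a serial sketch (simulating the tasks within one process; alltoall and P are illustrative names, not part of any library): item j held by task i ends up as item i held by task j, which is exactly a matrix transposition.

```c
enum { P = 4 };  /* number of simulated tasks (illustrative constant) */

/* Serial simulation of an all-to-all exchange among P tasks, each
 * holding P items: task j's i-th received item is task i's j-th sent
 * item, i.e. recv is the transpose of send. */
void alltoall(const int send[P][P], int recv[P][P])
{
    for (int i = 0; i < P; i++)
        for (int j = 0; j < P; j++)
            recv[j][i] = send[i][j];
}
```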
This is one of the most expensive operations an interconnect between tasks can perform in a distributed-memory environment, and its cost depends heavily on the insertion and bisection bandwidths of the section of the network used. The insertion bandwidth is the limit on individual messages – how fast each processor can send data into the network – while the bisection bandwidth is the limit on connection bandwidth across the network’s minimum cross-section, i.e. the smallest aggregate link bandwidth through which every message must travel.
The quoted figures for such bandwidths are often theoretical maxima – the message-passing library is frequently implemented on top of a lower-level network library, whose overheads can slow messaging operations and reduce performance. Sub-optimal assignment of tasks to physical processors can also add unnecessary latency to each message sent, especially in the case of the rod decomposition, where compartmentalising each all-to-all such that messages from different instances do not route through the same links can provide significant performance benefits [1][8].
For message-passing, libraries implementing the Message-Passing Interface (MPI) have become the de facto standard, being implemented on many platforms even where message passing is not the most powerful paradigm available – e.g. on shared-memory systems, where all memory is shared by all processors, and ccNUMA (cache-coherent non-uniform memory access) systems, where memory is localised but accessible by all processors.
1.3 FFT Libraries
A fast Fourier transform can be implemented from scratch as needed; however, several well-known libraries exist that provide generic implementations which typically perform well for a variety of data extents. The recursive nature of the FFT algorithm means that an FFT library will typically have a number of optimised base cases for small prime factors, and hence a set of prime factors for which it performs very well. This set of optimised factors was once a major consideration in the choice of library; most libraries now support any size of extent, but may not perform optimally for extents that are not factorable into small primes.
Libraries provided by vendors will typically be optimised for their own platforms; however, libraries that attempt to provide good performance across a range of platforms also exist. Below we briefly describe the libraries we used on the platforms we tested.
1.3.1 FFTW
FFTW ("Fastest Fourier Transform in the West") is a free, portable, open-source library for performing Fourier transforms, developed and maintained by Matteo Frigo and Steven Johnson of the MIT Laboratory for Computer Science [9]. It achieves speed on many platforms by self-optimising – different methods of performing the same operation ("codelets") can be speed-tested at run-time and the optimal ones used. The current stable version is 3.1.2, although, as the API changed between versions 2 and 3, many software packages have not been updated and still use the most recent release of version 2, 2.1.5. Version 3 does not yet have fully tested MPI transforms, as these have been introduced only in the 3.2 alpha versions (3.2 alpha 3 is currently the most recent). The MPI transforms in both version 2 and version 3 allow only a slab decomposition of the data operated on.
FFTW claims to perform best on arrays whose extents are products of the small factors 2, 3, 5 and 7, but uses O(n log n) algorithms for all extents, even large primes [9]. It also has "codelet generation" facilities, for advanced use, with which optimal code supporting a particular factor can be generated.
1.3.2 ESSL
IBM’s Engineering and Scientific Subroutine Library [10] is provided on – and typically specifically optimised for – IBM-supplied Linux and AIX systems, and provides functionality used in many scientific applications – not only Fourier transforms, but also linear algebra, matrix operations, and random number generation. Also included is the parallel extension, PESSL, which uses an interface to an implementation of the BLACS library (Basic Linear Algebra Communication Subprograms [11]) and an extension of the ESSL API to perform many of these routines in parallel; for the parallel FFT routines, however, it provides only a slab decomposition method.
ESSL will only perform FFTs with lengths of the form

    n = 2^h × 3^i × 5^j × 7^k × 11^m ≤ 37748736

where h ∈ {1, 2, ..., 25}, i ∈ {0, 1, 2}, and j, k, m ∈ {0, 1}. Attempting to perform an FFT that does not conform to this specification will produce an error.
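A small validity check for this constraint can be written by factoring out the permitted primes and testing the exponent bounds. The helper below is an illustrative sketch, not part of ESSL’s API.

```c
/* Return 1 if n is an acceptable ESSL FFT length, i.e.
 * n = 2^h * 3^i * 5^j * 7^k * 11^m <= 37748736 with
 * h in 1..25, i in 0..2, and j, k, m in 0..1; return 0 otherwise. */
int essl_length_ok(long n)
{
    if (n < 2 || n > 37748736L || n % 2 != 0)
        return 0;   /* h >= 1 means n must be even */
    int h = 0, i = 0, j = 0, k = 0, m = 0;
    while (n % 2 == 0)  { n /= 2;  h++; }
    while (n % 3 == 0)  { n /= 3;  i++; }
    while (n % 5 == 0)  { n /= 5;  j++; }
    while (n % 7 == 0)  { n /= 7;  k++; }
    while (n % 11 == 0) { n /= 11; m++; }
    return n == 1 && h <= 25 && i <= 2 && j <= 1 && k <= 1 && m <= 1;
}
```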
1.3.3 ACML
AMD produces the AMD Core Math Library [12], providing much the same linear algebra, matrix, random-number and transform functionality as ESSL, optimised for AMD processors. It does not include any parallel routines.
ACML claims best efficiency on array extents that are powers of two, and good efficiency on extents that have only small prime factors up to 13; it will, however, perform FFTs on arrays of any length.
1.3.4 MKL
Intel’s Math Kernel Library [13] is an optimised mathematical library for Intel processors, providing similar functionality again to ACML and ESSL. It provides some parallel routines, based on a BLACS library implementation, but unlike PESSL keeps this within the library itself, not requiring explicit BLACS usage by the programmer. Like PESSL, it allows only a slab decomposition of the data.
MKL claims optimal performance on array extents that are powers of 2, 3, 4, and 5 on most architectures, with 7 and 11 added on Intel’s IA-64 architecture, but supports any extent.
1.4 HPC Systems
We wanted to cover a broad range of types of high-performance computing (HPC) platform in our tests, and so enlisted seven different systems, each with a different combination of technologies. These are described below.
1.4.1 HECToR
HECToR [14] is a Cray XT4 system, and serves as the current primary national general-purpose capability computing service for UK universities. It was installed in the University of Edinburgh’s Advanced Computing Facility in 2007 [15].
The HECToR system comprises 1,416 compute blades, each with 4 dual-core 2.8 GHz AMD Opteron processors for a total of 11,328 cores, and 24 service blades, each with 2 similar processors, which manage user access, network services, and I/O for the compute blades.
Each dual-core processor is connected to 6 GB of RAM and to a Cray SeaStar2 communication processor with an embedded PowerPC 440 core. Each SeaStar2 is connected to six others in a 3D toroidal network topology, giving a theoretical point-to-point bandwidth of 2.17 GB/s and a minimum bisection bandwidth of 4.1 TB/s [14].
HECToR compute nodes use Cray’s own operating system, UNICOS/lc, based on a customised Linux kernel called Compute Node Linux (CNL) [16]. Two high-performance compilers are available – the Portland Group compilers [17] and the PathScale compiler [18]. The Portland Group compiler is much more commonly used, and it is the one we have used.
1.4.2 HPCx
HPCx [19] was the national capability computing service for UK universities before HECToR was installed, and is now the complementary service. HPCx is located at STFC’s Daresbury Laboratory in Cheshire, where it was installed in 2004 [20].
HPCx contains 168 IBM eServer 575 nodes, with 160 allocated for computation and 8 for user access and I/O. Each node contains 8 dual-core 1.5 GHz POWER5 processors and 32 GB of shared RAM, giving a total of 2560 compute cores, and nodes are linked by IBM’s Clos-topology High Performance Switch [21].
HPCx uses IBM’s own AIX 5 operating system on both user access and compute nodes.
1.4.3 Ness
Ness [22] is EPCC’s most recent training and testing server, consisting of two Sun Fire X4600 shared-memory compute units, each with 8 dual-core 2.6 GHz AMD Opteron processors and 30 GB of shared RAM, with a Sun Fire X2100 serving as the login node.
The software environment is largely designed to be very similar to HECToR’s, to allow development and testing comparisons. The operating system is based on Scientific Linux, with Portland Group compilers as on HECToR.
1.4.4 BlueSky
BlueSky [23] is a single IBM Blue Gene/L cabinet maintained by EPCC for the University of Edinburgh’s School of Physics. It has 1024 dual-core 700 MHz PowerPC 440 processors, with 512 MB of RAM per processor. The processors are connected by five separate networks, but the principal computational network is the three-dimensional torus, in which each processing unit is connected to all six of its neighbours by 154 MB/s channels [24].
Blue Gene/L systems use a specialised, cut-down OS on the compute CPUs, designed to be as lightweight as possible – even some very common features, such as threading, are not implemented.
1.4.5 Eddie
Eddie [25], operated by the ECDF ("Edinburgh Compute and Data Facility"), is the University of Edinburgh’s own cluster-computing service, consisting of 128 dual-core and 118 quad-core nodes, using 3 GHz Intel Xeon processors and 2 GB of RAM per core, connected by a gigabit Ethernet network. However, a small partition of 60 dual-core nodes is connected by a faster InfiniBand network, and this is the region of the machine we employed.
Eddie uses Scientific Linux for both user access and compute nodes, with the Intel
compiler for high-performance applications.
1.4.6 MareNostrum
MareNostrum [26] is the Barcelona Supercomputing Centre’s computing cluster, formed of 2282 IBM eServer BladeCenter JS20 servers, each with two dual-core 2.3 GHz PowerPC 970MP processors and 8 GB of RAM. The servers are connected to a Myrinet optical-fibre Clos-switched network for performance, as well as to a gigabit Ethernet network for administration.
The software environment is based on a Linux 2.6 kernel and the SUSE distribution, with IBM’s own compiler and library distribution.
1.4.7 HLRB II
HLRB II [27] is the largest system operated by the Leibniz Supercomputing Centre, an
SGI Altix 4500 platform with 4864 dual-core Intel Itanium2 processors. Each processor
is linked to 8GB per core of RAM, except the first processor on each partition, which
is linked to 16GB. The nodes are connected with SGI’s NUMAlink 4 hardware in a
12
Chapter 1. Introduction 1.4. HPC Systems
’fat-tree’ fashion, allowing processors to access any memory installed in the system
directly.
Platform     Compiler             Message-Passing         FFT Libraries
HECToR       pgcc 7.1-4           XT-MPT 3.0.0            FFTW 3.1.1, FFTW 2.1.5, ACML 4.0.1a
HPCx         xlc 08.00.0000.0013  POE 4.2, LAPI 2.4.4.4   FFTW 3.0.1, FFTW 2.1.5, ESSL 4.2.0.0, PESSL 3.2
HLRB II      icc 9.1              SGI MPI/Altix 1.16      FFTW 3.0.1, FFTW 2.1.5, MKL 9.1
MareNostrum  xlc 08               MPICH-GM 1.2.5.2        FFTW 3.1.1, FFTW 2.1.5, ESSL 4.1
Eddie        icc 10.1             Infinipath 2.1          FFTW 3.1.2, FFTW 2.1, MKL 10.0.1.014
Ness         pgcc 7.0-7           MPICH2 1.0.5p4          FFTW 3.1.2, FFTW 2.1.5, ACML 3.6.0, MKL 10.0.1.014
BlueSky      xlc 08.00.0000.0001  MPICH2 1.0.3            FFTW 3.2a3, FFTW 2.1.5, ESSL 4.2.2

Table 1.1: Software versions used on each platform
Chapter 2
Benchmarking
High-performance computing platforms are designed to offer the best performance
possible from their available resources. As such, components, whether hardware or
software, are often upgraded or altered at short notice.
To allow easy comparison of platforms and libraries in a way that would permit rapid
repetition in case of, e.g., changes to the computational environment, we decided to
create a single portable benchmarking system that could be rapidly deployed on all
the targeted platforms within a narrow window of time. This would have to allow
both slab and rod decompositions, and the use of each of the available libraries, as
well as allowing comparison between using a library’s 1D and 2D FFT calls for the
planes of data in the slab decomposition. It would also have to employ a library’s
own parallel routines, if present, as well as ours. To maintain a fair comparison, we
first compare our own slab and rod decomposition routines against each other, and
then compare the libraries’ automatic routines against our own.
2.1 2D FFTs for the Slab Decomposition
As previously stated, for a three-dimensional data set, a three-dimensional Fourier
transform is equivalent to a one-dimensional transform along each dimension. Given
that a two-dimensional transform of a two-dimensional data set is likewise equivalent
to a one-dimensional transform along each dimension, a three-dimensional transform
can also be performed as a two-dimensional transform on each layer followed by a
one-dimensional transform in the remaining dimension (Figure 2.1). We therefore
added to our tests the option of using a two-dimensional transform library routine for
each plane of data in the slab decomposition, as compared with using two
one-dimensional transforms.
While performing a two-dimensional FFT, it is usual to transpose the data in memory
to allow contiguous access to the second dimension. Most of the libraries we used
(FFTW2, FFTW3, ACML, and MKL) offer the ability to return the data transposed,
saving the cost of re-transposing it back into its original alignment. We have used this
option where available, as a programmer can typically account for the transposition
and operate on the data as originally intended without re-transposing it.
[Diagram: 1D FFT (X) + 1D FFT (Y) + 1D FFT (Z) = 2D FFT (XY) + 1D FFT (Z) = 3D FFT (XYZ)]
Figure 2.1: Graphical representation of the ability to break down a three-dimensional
FFT into different dimensionality function calls. Each method is equivalent, and the
operations are commutative. Axes may be arbitrarily re-ordered.
2.2 Benchmarking on an HPC Platform
One of the main issues with benchmarking on an HPC platform is that unless whole
partitions of the machine are entirely reserved for jobs, as on the Blue Gene system,
communications may suffer from congestion with other jobs. The obvious way around
this is to reserve the whole machine before running each job, but this is very expensive
in terms of processor time – for example, reserving thousands of processors to run
a 32-processor job is not economical. A possible solution to this could be to reserve
the whole machine and pack as many jobs into all reserved space as possible – how-
ever, this essentially ensures that other jobs will be running simultaneously with a very
similar communications pattern, meaning that if congestion could occur, it is almost
guaranteed.
We instead chose to intentionally ignore this problem – by running jobs in the normal
way, we get typical timing figures rather than optimal measures, but we also guarantee a
pseudo-random distribution of congestion on the system, if any, as other users run jobs.
There is also an issue, especially with the more layered hardware architectures, that
using a small number of processors can lead to results that do not scale up beyond a
layer. This is especially true of HPCx, with its shared-memory nodes of 16 processors,
where scaling benefits have been observed by performing all-to-all communication
across two boxes linked by a network connection, rather than solely within the
shared-memory environment of one box [8]. We therefore avoided such results,
instead setting our minima for the larger systems at a level that would include at
least two units of the secondary communication layer – e.g. 32 processors (two
shared-memory nodes) on HPCx, or 4 processors (two dual-core processors) on
HECToR. In any case, the rod decomposition can only be constructed for four
processors or more, as three and two are prime and could thus only produce slab
decompositions. For the sake of simplicity, we only used processor counts that were
integer powers of two.
2.3 Implementation Details
In this section, we discuss some of the particular choices we made in implementing the
benchmark software.
2.3.1 Language and Communication
Due to the mutually exclusive compile-time choices that would need to be made in
our code, and our familiarity with the language, we elected to use ISO C99, despite
the complex-number issues that would arise (see Section 2.4). Its powerful standard
preprocessor allows the use of logic and macro definitions to create the code that the
compiler will then operate on.
For communication, we used MPI routines, restricted to strict MPI-1, as Eddie’s
InfiniPath implementation of MPI did not support MPI-2 routines.
2.3.2 Test Data and Verification
To allow verification of a correct transform, test data was created in the form of a
trivariate sine function over the three Cartesian co-ordinates of the distributed cubic
array:

    f(x, y, z) = sin(2πx/X + 2πy/X + 2πz/X)                               (2.1)
where X is the length of one edge of the cube. This function can be analytically shown
[1] to produce transformed data thus:

    F(x, y, z) = −(1/2)·i·X³   if x = y = z = 1
                  (1/2)·i·X³   if x = y = z = X − 1
                  0            otherwise                                  (2.2)
This output signal is conveniently symmetric about the three diagonal axes of the data
cube, which allows us to ignore whether a given library’s 2D FFT call returns data
in transposed form, in the comparison between 1D and 2D FFT calls in the slab
decomposition. One approach to writing a 2D FFT function involves scrambling or
transposing data during the call to improve overall performance, which can often leave
the output data transposed relative to the original; if a library’s function does this, it
will usually offer an option to transpose the data back into the original form, but we
may ignore this to enhance performance.
We may use a simple algorithm to determine where the peaks are located within the
decomposed data, and produce, in an existing buffer, the exact output data for each
processor. We may then calculate the absolute difference between the transformed data
and this exact comparison data, summing over processors, and divide by the number of
elements in the array:

    residue = (1/X³) · Σ_x Σ_y Σ_z |F_output(x, y, z) − F_ideal(x, y, z)|

This value is then checked against a set tolerance, ≤ 10⁻¹⁰, to determine accuracy.
2.3.3 Multidimensional FFT Data Rearrangement
One of the most complex implementation issues is the method used to rearrange the
data for the all-to-all calls required between steps. MPI offers a number of automatic
data-packing routine specifications, commonly called ’derived datatypes’, to aid in
similar operations; however, they are less than ideal for this usage, due to the necessary
combination of at least three datatypes and type-resizing, which is particularly complex
with MPI-1. Rearranging the data into contiguous blocks for each processor using
modulo arithmetic was found to be the simplest approach to implement. Routines to
perform this operation are reproduced in Appendix A; essentially, it involves
performing many small transposes of subarrays within the larger array, with variable,
overlapping strides. This is also expected to perform relatively well, as the limited
implementation may allow more optimisation by the compiler than the generic
datatype implementations of an MPI library.
The MPI_Alltoall call was used to perform the actual communication between FFT
steps. This may not, in fact, be the fastest method for performing this operation on all
systems; however, it is the call designed for this type of use, and is thus the most likely
to be optimised for it. Past studies have examined mechanisms for performing this
operation, and it has been suggested that this method is optimal for many types of
system [28], though this is, of course, entirely problem-dependent.
2.3.4 Rod Decomposition Dimensions
In a two-dimensional decomposition, the obvious choice of dimensions for the
processor grid is the one that gives the most nearly square form – this minimises the
number of processors involved in each communication in both steps. We have assumed
that this is the optimal case, particularly as we are comparing against the slab
decomposition, since it gives us the processor arrangement most different from the slab
form. In the case that the number of processors is not a square number, we have chosen
to make one dimension a factor of two larger than the other, since we always use
power-of-two numbers of processors. The code for obtaining the dimensions of the
array of processors is shown in Figure 2.2. A future investigation of the properties of
altered processor array shapes could be useful.

for (i = (int) sqrt((double) processors); i > 0; i--)
{
    if ((0 == i % 2) && (0 == processors % i))
    {
        dimensions[0] = i;
        dimensions[1] = processors / i;
        break;
    }
}

Figure 2.2: Code used to obtain the dimensions of the two-dimensional processor
array.
2.3.5 Memory Limitations
The data cube uses a significant quantity of memory in every case – a 1024 x 1024 x
1024 data cube is 16 GB of data, and there must be another buffer of identical size, as
MPI_Alltoall cannot be performed in place. Adding working space for common
variables, FFT calls, and additional buffers allocated by MPI, the data required by the
benchmark can easily overflow the memory available to a processor. We attempted to
mitigate this by predicting the data usage and comparing it with the system’s available
RAM, but if not all the processors failed gracefully (for example, if all but one
processor allocates the necessary arrays on a shared-memory node), then evidence
suggests the large core files produced automatically can cripple the I/O handlers of an
HPC system.
2.4 FFT Library Porting
Because the standard specifications of ISO C lacked a native complex type until 1999
[29] (ISO C having first been publicly internationally specified as a copy of ANSI C
in 1990 [30]), many libraries that can interface with C code and use complex numbers
define their own complex-number data structures, as shown in Table 2.1. This can
make interchangeability difficult to achieve. We solved this problem by using
preprocessor directives to select the library being used at compile time, creating one
executable per available FFT library. In retrospect, this was a sub-optimal solution –
despite the different definitions, the types used can be demonstrated to be
bit-compatible, and this fact could have been used to operate on them in a
library-independent manner for arithmetic. The existing implementation already uses
this fact to send the complex data through MPI, by treating every type as merely two
contiguous double-precision floating-point numbers in memory.
All the libraries used have a preparation step, which pre-calculates invariant data used
in the FFT, and a performance step, which uses that data to perform the actual
calculation. Because we were primarily interested in the performance step, we omitted
the preparatory steps from the resultant timing data.
We now discuss our experience of incorporating each library into our benchmark code.
2.4.1 FFTW3
The benchmark was originally written using FFTW3, which was found to be easy to
use, well-documented save for the experimental MPI routines, and trivial to compile,
save for the adjustment between versions with and without MPI-supporting parallel
routines, as with FFTW2.
Both FFTW libraries have multiple interface functions, according to the level of
complexity of the particular user’s needs – for example, a single FFT calculation on a
single array uses a different, simpler preparatory function than multiple FFTs on a
strided array contained within a larger array.

Library       Complex Number Format
(Native C)    _Complex double;
FFTW3         _Complex double;
FFTW3 (alt)   double [2];
FFTW2         struct { double re, im; };
MKL           _Complex double;
ACML          struct { double real, imag; };
ESSL          union { struct { double _re, _im; }; double _align; };

Table 2.1: FFT libraries and their complex double-precision floating-point number
specification for the C language interface. FFTW3 has two entries because it will only
use the native complex number format if complex.h has already been included.
Compiling the FFTW3 libraries on the BlueSky system proved somewhat challenging –
a patch had to be produced and applied to adapt the library’s cycle counters to the
PowerPC 440 processor. This patch is reproduced in Appendix B.
2.4.2 FFTW2
FFTW2 was generally well-documented, C-oriented though with additional
documentation for the Fortran interface, and follows the same structure as FFTW3. It
was, therefore, fairly easy to port from FFTW3 to FFTW2, especially since the main
shift in interface – FFTW3’s introduction of pre-definition of the operating array – did
not significantly affect our use of the library.
The main complication came in compiling with FFTW2 on the different platforms,
where the FFTW2 MPI library was not available (as on Eddie), or the library files were
named differently to specify whether the library used double or single precision (as on
HPCx and HECToR). These were minor complications once identified, but their
presence was not indicated until attempts to link with FFTW2 failed.
2.4.3 MKL
Generally, MKL was quite easy to write with, though when it came to actually
compiling the application, MKL is divided into approximately 35 different linkable
sections, some apparently mutually exclusive, and the documentation does not
explicitly state which sections any given function call will need. A trial-and-error
approach was required to eventually obtain a successfully linked executable. In
addition, the sections available changed between the two major versions we used –
MKL 9 on HLRB II, and MKL 10 on Ness and Eddie.
2.4.4 ESSL
ESSL, while quite explicitly documented, appears to be written with a highly
Fortran-biased view, and has some approaches to data handling that can be quite
unfamiliar to a C programmer. For example, while the FFTW and MKL libraries have
separate mechanisms for preparation and for performing the actual FFT, in ESSL one
merely calls the same library routine first with a preparation flag and then without,
using a large, pre-allocated array for storing pertinent information. In addition, it was
occasionally advantageous to search through the ESSL header files to find the actual C
prototype for functions.
Using the parallel extensions to ESSL (PESSL) in particular proved to be a challenge,
as these use the BLACS (Basic Linear Algebra Communication Subprograms)
interface for parallelism. BLACS is another portable specification which can be used
for performing linear algebra operations in parallel, and can usually be layered on top
of MPI. Indeed, BLACS tools are often written using MPI; however, using both in one
code adds an unnecessary layer of complexity to what is already a fairly complex
operation, especially as BLACS documentation is also often sparse and, again,
Fortran-oriented.
2.4.5 ACML
ACML is a library written in Fortran, and seems to be documented largely for Fortran
users. When using the C interface, this can mean having to fill in missing details as
necessary in a few cases, and to hunt through header files in others. During
compilation, it is often necessary to link against the compiler’s Fortran mathematical
libraries. ACML is quite similar in use to ESSL, having similar "call twice" FFT
routines, similarly styled large pre-allocated arrays, and the same long, complex
function prototypes.
2.5 System Porting
Being unfamiliar with the common automatic configuration tools such as autoconf [31]
and pkg-config [32], we decided instead simply to make one Makefile containing the
necessary flags and commands for every system, and a Bash script to use this Makefile
for each library executable the system supported. Occasionally we found that
configuration tools intended to aid the use of software packages were out of date or
unmaintained, or that, despite them, a search for the library files we needed was still
required. Module files for FFTW3 on Eddie had to be fixed to bring the version
number up to date with the actually installed library before they would function, and
the library paths for MKL on HLRB II had to be located manually.
Allowing the benchmark to compile and run with all the libraries on each platform was
the longest single stage in this study – documentation was often found to be sparse,
especially for Eddie, and as we could only obtain access to MareNostrum via David
Viscente, it was necessary to ensure that everything would work with as little effort on
his part as possible. A "readme" file was provided with a compressed copy of the code
and the surrounding scripts – this is reproduced in Appendix C. Nevertheless, this
package was not designed for any sort of public release – it still requires some manual
configuration.
2.6 Scripting
To allow complete data sets to be obtained easily, a script to create sets of job files was
written in the Bash shell scripting language, using a number of common tools found in
every UNIX operating system [33], including bc, sed, and cat, for portability. These
tools allowed us to create a quick and easy-to-use terminal interface that greatly sped
up the testing and benchmarking process. Initially we had thought it would be useful
to create an additional script to insert output result sets into an SQL database, but this
was later found to be much less efficient for processing data than initially thought, and
results were processed as comma-separated value files.
Chapter 3
Results
The benchmark executable was run on each system, from the smallest number of
processors that still made use of the communication facilities of the system, as
described in Section 2.2, up to the highest number of processors available or 1024,
whichever was smaller. Given that our FFT planning preparation was not parallelised,
every doubling of processors also doubled the compute time used in this step, and
going higher would have used an even more significant portion of our budget with
each step.
All executions were repeated at least three times, four where budgeting allowed, and
the minimum taken.
3.1 HECToR
3.1.1 Libraries
Figure 3.1 shows the total time spent in FFT calls in one run of the executable on
HECToR using 64 processors. In this case, the three libraries are very similar in
performance, though ACML consistently performs slightly better at higher extents, and
worse at small extents. The performance advantage could be due to AMD’s optimised
library having better knowledge of the cache sizes of the AMD processors, and thus
being able to make more use of them. Insufficient time was available to test this, but
the Portland Group compiler can be informed of the cache sizes of the processors used,
and can use this to optimise further during compilation; if this information were
provided when compiling FFTW3, it could result in performance increases over
ACML, though this is purely speculative.

FFTW3 also performs better than FFTW2 at every extent; given the many algorithms
available to both FFTW versions, this could be due to a different choice of algorithm,
or a new algorithm in version 3 that provides slightly better performance, e.g.
improved SSE support.

[Plot: time spent in FFT calls (s) vs. extent, for ACML, FFTW3, and FFTW2]
Figure 3.1: A comparison of the time spent in FFT calls for a rod decomposition 3D
FFT, for different libraries, using 64 processors, on HECToR.
[Plot: total time (s) vs. extent, comparing the 1D-FFT and 2D-FFT slab variants for ACML, FFTW3, and FFTW2]
Figure 3.2: A comparison of the total time taken for the slab decomposition 3D FFT
using purely 1D FFT calls versus a 2D followed by a 1D FFT call, for different
libraries, using 16 processors, on HECToR.
3.1.2 1D vs 2D FFT
Figure 3.2 shows the timings for the slab-decomposed 3D FFT using three 1D FFT
calls, against those using a 2D FFT followed by a 1D FFT call. Aside from a slight
difference when using FFTW2 at the largest extent, no significant difference is
exhibited between the two methods.
We found this was true across all the platforms we tested, and as such have omitted
further comparison of the two methods.
3.1.3 Slabs vs Rods
Figure 3.3 shows the total time taken for the two different decompositions of the
3DFFT on HECToR for the ACML library. As might be expected, the slab
decomposition is faster than the rod decomposition for every extent. It should be
noted, however, that there is no overlap between numbers of processors – according to
these performance figures, using the slower decomposition of the two will never be
slower than using half the number of processors with the faster decomposition.

[Plot: total time (s) vs. extent, for slab and rod decompositions at p = 2 to 1024]
Figure 3.3: A comparison of the total time taken for the two different decompositions
of the 3DFFT on HECToR, using the ACML library.
3.1.4 Scaling
Figure 3.4 shows the effect on the time taken to perform the 3D FFT of increasing the
number of processors, for given extents. As is usual for parallel programs involving
significant communication during calculation, we can see that as the data per processor
increases in quantity, better scaling (i.e. closer to ideal) can be demonstrated on larger
numbers of tasks. In general, this application seems to demonstrate consistent scaling
on HECToR up to large numbers of processors, meaning that it is saturating neither the
insertion nor the bisection bandwidth of the network. Since this is the condition under
which the rod decomposition would benefit, it is unsurprising that the rod
decomposition does not perform better on this platform.

[Plot: time (s) vs. tasks, for slab and rod decompositions at various extents, with ideal gradient]
Figure 3.4: A comparison of the total time taken for the 3DFFT on a data cube of the
given extents (x) on varying numbers of MPI tasks on HECToR, using the ACML
library. The ideal gradient is perfect scaling – in which doubling the number of tasks
halves the time taken.
We can see from these figures, however, that the greater number of messages sent in
the slab decomposition causes it to lose scaling consistency at smaller numbers of
tasks than the rod decomposition, as latency starts to dominate the communication
time – e.g. for x = 128, the slab decomposition shows significant deviation from ideal
scaling at 128 tasks, while the rod decomposition is still equally efficient. This effect
decreases as the quantity of data in each message, and thus the work involved in the
FFT steps, increases; at x = 512 it is unnoticeable at this resolution.
[Plot: time spent in FFT calls (s) vs. extent, for MKL, ACML, FFTW3, and FFTW2]
Figure 3.5: A comparison of the time spent in FFT calls for a rod decomposition 3D
FFT, for different libraries, using 8 processors, on Ness.
3.2 Ness
3.2.1 Libraries
As could be expected, Figure 3.5 shows that the MKL library for Intel processors
performs less well, and seemingly more noisily, than AMD’s library on an AMD
processor platform, though there is a crossover between the extents of 512 and 1024
that might bear investigating to determine whether it is a statistical anomaly. Similarly
to HECToR, and unsurprisingly given that Ness has similar processors, the other FFT
libraries perform quite similarly, though ACML performs better at larger extents, and
the FFTW libraries perform better, by a small margin, at smaller extents.
When using MKL, we were surprised by the large overhead we found to be dominating
the timing; subsequent testing revealed quite a large speed difference between the first
execution of the FFT using a given ’descriptor’ and the second, as shown in Figure 3.6.
We expect that this is because the various data-independent ’twiddle’ values are
calculated on the first execution of the FFT, rather than when the descriptor is
prepared, as we had assumed. To compensate, we altered the code to run the
benchmark for this library twice, but only after we had collected the majority of the
results. For this reason, the figures for Eddie and HLRB II reflect the unaltered code,
and thus show the values with the overhead.

[Plot: time spent in FFT calls (s) vs. extent, comparing the first and second MKL FFT calls]
Figure 3.6: A comparison of the time spent in FFT calls for a rod decomposition 3D
FFT, for MKL on the first call of the FFT routine and the second, using 8 processors,
on Ness.
3.2.2 Slabs vs Rods
Given that Ness is a shared-memory system, we might expect these results to be much
more similar than their equivalents on HECToR, but we can still see a marked
difference in performance between the two decompositions at high extents
(Figure 3.7). Given the erratic results at low extents, it is possible that the uniformity
at high extents is caused by the higher overhead and buffering used to maintain
efficient memory usage when transferring large quantities of data. The difference in
behaviour between high and low extents could bear further investigation.

[Plot: total time (s) vs. extent, for slab and rod decompositions at p = 1 to 16]
Figure 3.7: A comparison of the total time taken for the two different decompositions
of the 3DFFT on Ness, using the FFTW3 library.
3.2.3 Scaling
The two principal features to extract from Figure 3.8 are the uniform sub-ideal scaling
at high extents, and the very poor scaling at small extents using the slab
decomposition, compared with the slightly lesser effect on the rod decomposition
method – the slab decomposition becomes much less efficient (and predictable) at
lower extents, whereas the rod decomposition retains a great degree of uniformity and
maintains time improvements, however slight, up to 16 tasks in all cases save x = 16.
[Plot: time (s) vs. tasks, for slab and rod decompositions at extents x = 16 to 512, with ideal gradient]
Figure 3.8: A comparison of the total time taken for the 3DFFT on a data cube of the
given extents (x) on varying numbers of MPI tasks on Ness, using the FFTW3 library.
The ideal gradient is perfect scaling – in which doubling the number of tasks halves
the time taken.
3.3 HPCx
3.3.1 Libraries
Figure 3.9 shows the library comparison for HPCx. As might be expected, IBM’s own
library, optimised for IBM’s processors and other hardware, performs better than the
more portable libraries; it is notable, however, just how much more clear-cut the
difference between ESSL and the FFTW libraries is here, compared to that between
ACML or MKL and FFTW on the non-IBM systems.
3.3.2 Slabs vs Rods
The decomposition comparison for HPCx, shown in Figure 3.10, seems much less
clear-cut than that for HECToR (Figure 3.3) – in many cases, using more processors
results in slower overall timings for either decomposition. To a certain degree, this
could be attributed to noise in the machine; however, the deviations at high numbers of
processors are fairly extreme, and could indicate that the synchronisations for such
numbers of processors on HPCx are less than optimal.

[Plot: time spent in FFT calls (s) vs. extent, for FFTW3, FFTW2, and ESSL]
Figure 3.9: A comparison of the time spent in FFT calls for a rod decomposition 3D
FFT, for different libraries, using 32 processors, on HPCx.

[Plot: total time (s) vs. extent, for slab and rod decompositions at p = 32 to 1024]
Figure 3.10: A comparison of the total time taken for the two different decompositions
of the 3DFFT on HPCx, using the ESSL library.

[Plot: time (s) vs. tasks, for slab and rod decompositions at various extents, with ideal gradient]
Figure 3.11: A comparison of the total time taken for the 3DFFT on a data cube of the
given extents (x) on varying numbers of MPI tasks on HPCx, using the ESSL library.
The ideal gradient is perfect scaling – in which doubling the number of tasks halves
the time taken.
3.3.3 Scaling
As with Figure 3.4 for HECToR, Figure 3.11 shows that the slab decomposition scales worse at lower numbers of processors than the rod decomposition. HPCx appears to exhibit worse overall scaling than HECToR by this measure – efficiency is generally lower, as indicated by the mean gradient of each line and its deviation from the ideal towards the horizontal.
[Figure 3.12: A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 128 processors, on BlueSky.]
3.4 BlueSky
3.4.1 Libraries
As we expected, the ESSL library performs significantly better on the Blue Gene system than either of the FFTW libraries. The Blue Gene processor is a moderately esoteric type, and the ESSL library distributed with the Blue Gene software packages will have been specifically optimised by IBM to take advantage of the special features of the PowerPC 440 processor, e.g. the double floating-point unit, which provides special paired instructions for complex-number operations. The FFTW packages, compiled straightforwardly, may not have been able to take advantage of these features, and so will not use the processor most efficiently.
[Figure 3.13: A comparison of the total time taken for the two different decompositions of the 3D FFT on BlueSky, using the ESSL library.]
[Figure 3.14: A comparison of the total time taken for the two different decompositions of the 3D FFT on BlueSky, using the FFTW3 library.]
Chapter 3. Results 3.4. BlueSky
3.4.2 Slabs vs Rods
Unfortunately, our processing budget was depleted before we could capture results for the higher numbers of processors with ESSL, but the timings we have obtained for this library – shown in Figure 3.13 – demonstrate a relationship similar to the others shown so far: using the slab decomposition is faster than using a rod decomposition. In this case there is even overlap between the two between 32 and 64 processors – using 64 processors with the rod decomposition proves to be slower than using 32 with the slab decomposition. On Blue Gene this may be solvable with controlled process positioning within the node; here, however, we consider only the naive case. With FFTW3 we have obtained results for higher numbers of processors (Figure 3.14), which demonstrate even more overlap between slab and rod decomposition timings, indicating a communications library that performs well even for high-congestion communications.
3.4.3 Scaling
In the graphs demonstrating scaling on BlueSky, we see the effect of not filling partitions of the machine, combined with non-optimal default process placement within the partition: processor counts that do not use all the processors in a partition scale more poorly, as the mean link count between active processors is higher, leading to higher latency. (Partition sizes are 32, 128, and 512 processors.) This leads to the 'stepped' effect evident in Figure 3.16. It can be eliminated by careful placement of the active processors, as demonstrated by Heike Jagode [1], but usually there is no reason not to fill the partition.

If we discount the unfilled partitions, we see excellent scaling for all extents of 32 upwards – for this smallest size, the communication overheads seem to dominate for p ≥ 512.
[Figure 3.15: A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on BlueSky, using the ESSL library. The ideal gradient is perfect scaling, in which doubling the number of tasks halves the time taken.]
[Figure 3.16: A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on BlueSky, using the FFTW3 library. The ideal gradient is perfect scaling, in which doubling the number of tasks halves the time taken.]
[Figure 3.17: A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 16 processors, on Eddie.]
3.5 Eddie
3.5.1 Libraries
Library performance figures in this case, as with those for Ness, show that MKL has a uniquely high overhead attached to its FFT call. Otherwise, the performance of FFTW2 and FFTW3 is very similar, with FFTW3 appearing to perform slightly better at higher extents, but not to a significant degree.
3.5.2 Slabs vs Rods
Similarly to the other performance figures for this comparison, Eddie generally demonstrates better performance for the slab decomposition. In this case, however, there is much less of a difference between the two, and there is even one case – a stable series for 64 processors with an extent of 256 – in which the rod decomposition has outperformed the slab decomposition.

[Figure 3.18: A comparison of the total time taken for the two different decompositions of the 3D FFT on Eddie, using the FFTW3 library.]

There is some overlap between processor counts, but only at very low extent values and high processor counts, where it could be expected that the synchronisation inherent in the communication would overwhelm the computation time. The lack of difference between the two decompositions would seem to indicate either that the interconnect is not being used to its maximum efficacy, that the communication performance is comparable to the local memory performance, or that the communication steps are insignificant compared to the FFT calls. Comparison with Figure 3.17, and the technology on the Eddie platform, suggests that the first of these is the most likely; if true, the MPI implementation may suffer significantly with higher-complexity all-to-all communication. These issues could bear further investigation on this platform.
[Figure 3.19: A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on Eddie, using the FFTW3 library. The ideal gradient is perfect scaling, in which doubling the number of tasks halves the time taken.]
3.5.3 Scaling
Analysing the timing with respect to extent on Eddie would seem to indicate that network latency may suffer somewhat from random placement, as in the case of BlueSky, or else from network congestion adding to latency – there are slight incongruities in the graph that resemble the steps on BlueSky, but are more random. In general, however, Eddie appears to exhibit good scaling for x ≥ 48 – at this extent we still gain performance at 64 processors, while below it we fail to gain any time benefit from adding more processors past 16, and at x = 16 adding more processors becomes detrimental not only to efficiency but also to total time taken.
[Figure 3.20: A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 32 processors, on MareNostrum.]
3.6 MareNostrum
3.6.1 Libraries
Save for one extent value, 64, IBM's ESSL library slightly outperforms the two FFTW libraries at every extent. Again, this could be due to IBM having had the opportunity to carefully optimise the library for the platform – IBM supplied MareNostrum in its entirety, and could have tailored certain routines to take best advantage of the hardware, or optimally adjusted parameters in compilation.
3.6.2 Slabs vs Rods
Performance figures in this category once again demonstrate that the slab decomposition is significantly faster than the rod decomposition, with a few cases where a lower number of processors with the slab decomposition proves faster than a higher number with the rod decomposition.
[Figure 3.21: A comparison of the total time taken for the two different decompositions of the 3D FFT on MareNostrum, using the ESSL library.]
3.6.3 Scaling
Timings on MareNostrum do not seem to scale particularly well (Figure 3.22), with the performance increase from adding processors tailing off slightly sooner than we would expect. It is possible that this is due to sub-optimal process placement, as suggested for Eddie and BlueSky – high message latency, whatever its cause, could produce figures of this type.
[Figure 3.22: A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on MareNostrum, using the ESSL library. The ideal gradient is perfect scaling, in which doubling the number of tasks halves the time taken.]
3.7 HLRB II
3.7.1 Libraries
As with other platforms, HLRB II shows no significant performance difference between FFTW3 and FFTW2, and shows the significant performance overhead of MKL, though here it seems to be mitigated. This could be due to MKL running on fast, high-performance Intel hardware, rather than an AMD processor or a slightly older Intel chip.
3.7.2 Slabs vs Rods
Results for this platform are somewhat chaotic, with no clear trend emerging other than that, for most processor counts at high extents, using more processors caused the calculation to take less time – but even this is not dependable. It has been suggested that this noise is due to the different regions of the large quantity of memory the program requires not all being allocated locally initially, which would cause unexpected performance when transferring them between tasks. The performance of the interconnect under large all-to-all data transfers could bear further investigation.

[Figure 3.23: A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 64 processors, on HLRB II.]

[Figure 3.24: A comparison of the total time taken for the two different decompositions of the 3D FFT on HLRB II, using the FFTW3 library.]

[Figure 3.25: A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on HLRB II, using the FFTW3 library. The ideal gradient is perfect scaling, in which doubling the number of tasks halves the time taken.]
3.8 Automatic Parallel Routines
For all our comparison timings, we used our own hand-coded routines. However, we also tested the built-in parallel calls of FFTW2 and ESSL on platforms where they were available. Figure 3.26 shows the time taken by the FFTW2 automatic MPI routines as a percentage of the time taken by our routines, on HECToR.
[Figure 3.26: A comparison of the time taken for a slab decomposition 3D FFT, for the automatic parallel routines, as a percentage of the time taken for our written routines, using FFTW2, on HECToR.]
The FFTW2 parallel routines demonstrated an extreme performance increase over our routines for smaller extents, but were slower for larger extents. We assumed that FFTW2 was performing not only optimised FFT routines, but optimised MPI calls. Instrumenting a run on Ness with the MPI profiling tool Vampir [34], however, revealed that, for 4 processors at least, FFTW2 internally uses the same MPI_Alltoall as our routines, as shown in Figure 3.27. The FFTW2 MPI code appears to have two different methods: one using all-to-all calls explicitly, used for out-of-place operations, and one using non-blocking point-to-point communications, used for in-place operations. We only tested the out-of-place operations, however.
We found the results obtained on HECToR to be generally true across platforms – the automatic routines were up to 2.5 times faster, especially for low extents, but could be slightly slower for extents x ≥ 512.
Using PESSL on HPCx, however, we found the opposite to be true – smaller extents were slightly slower than with our routines, and it was only at x ≥ 512 that the PESSL implementation was faster (Figure 3.28). Admittedly, this might be improved upon by using a pure BLACS implementation, rather than our approach of running BLACS on MPI within MPI.
[Figure 3.27: A Vampir trace timeline taken on Ness, showing the barrier preceding, and the communication performed within, the FFTW2 parallel routine.]
For the sake of comparison, we compiled the alpha release of FFTW3, FFTW3.2a3, on Ness, and compared the routines in the same fashion (Figure 3.29). Similarly
[Figure 3.28: A comparison of the time taken for a slab decomposition 3D FFT, for the automatic parallel routines using PESSL, on HPCx.]
[Figure 3.29: A comparison of the time taken for a slab decomposition 3D FFT, for the automatic parallel routines, as a percentage of the time taken for our written routines, using our compiled version of FFTW3.2a3, on Ness.]
to FFTW2, we found that the FFTW3 parallel routines could be much more efficient for small extents; moreover, the improvement, while diminished, was maintained for larger extents. We did not, unfortunately, have time to test this on a larger system, but the results are promising for the parallel routines in FFTW3.
Chapter 4
Discussion and Conclusions
We have made a number of specific observations; we may also bring these together to discuss, in more overarching terms, the results of this investigation and how it can be taken further.
4.1 Rods & Slabs
The results obtained strongly suggest that for our cubic data objects, on a system with a high-performance interconnect, a slab decomposition should outperform a rod decomposition in almost every case, with the slab decomposition tending to lose scaling at approximately p = x/2. If flexibility is needed, however, code to generate the rod decomposition can easily be modified to generate both types – in fact, this is the approach we used – with the slab decomposition used wherever possible.
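This dual-mode selection can be sketched as follows. The fragment below is illustrative code of our own devising, not taken from the benchmark: the function name `chooseGrid` and the rule that slabs require p to divide the extent are our assumptions. A slab decomposition corresponds to a p × 1 process grid, and a rod decomposition to an as-square-as-possible grid.

```c
/* Hypothetical helper: choose a process grid for a cube of extent x on
 * p tasks.  A slab decomposition uses a p x 1 grid and is preferred
 * whenever each task can hold whole planes (p divides x, p <= x); a
 * rod decomposition falls back to the most nearly square factoring of
 * p, which both all-to-all steps then operate over. */
static void chooseGrid(int p, int x, int grid[2])
{
    if (p <= x && x % p == 0) {   /* slab: whole planes per task */
        grid[0] = p;
        grid[1] = 1;
        return;
    }
    /* rod: near-square factoring, largest factor <= sqrt(p) */
    int f;
    for (f = 1; f * f <= p; f++)
        if (p % f == 0)
            grid[1] = f;
    grid[0] = p / grid[1];
}
```

For p = 1024 on a cube of extent 256, for example, the slab branch cannot apply (p exceeds x), so the rod branch produces the square 32 × 32 grid used in our runs.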
4.2 Libraries
Our main intention in testing the speeds of different libraries was to see how the vendor-supplied libraries compared to the oft-used, cross-platform FFTW libraries. It seems that although the FFTW libraries do not achieve the best performance on every platform, it is only on BlueSky, an unusual architecture, that they are outperformed by a uniformly wide margin; elsewhere they achieve very similar, and often slightly better, performance than the vendor libraries. We can attribute this to the optimisation of ESSL for the Blue Gene/L platform – optimising for this means designing routines to take best advantage of the PowerPC 440's unusual double floating-point unit, a process akin to, but different from, optimising for the Streaming SIMD Extensions available in recent x86 processors to speed floating-point calculations. ESSL seems generally to perform well on every platform upon which it is available.
We would therefore suggest that any software that makes use of FFT routines and favours portability over strictly higher performance should probably use the FFTW libraries, unless the developer can be absolutely certain that every platform the software will run on will provide the same vendor library.
4.3 Automatic Routines
Performance and ease of use suggest that if only a slab decomposition is needed, FFTW2's parallel library routines are a particularly good mechanism. It is hoped that when the FFTW3 parallel routines enter the stable version, they will perform as well. PESSL, on the other hand, may be easier to use if BLACS is already being employed – to avoid switching to MPI – but the evidence suggests it should not be chosen purely for performance reasons.
4.4 Improving the Benchmark
Having reviewed the benchmark code, we can identify a number of ways in which it might be improved. The complex number support is currently somewhat fragmented; as previously stated, given the bitwise compatibility of all the complex types, it is possible to operate on them without knowing which library is being used. The data arrays would then only need to be cast when being passed into FFT function calls. Rewriting the complex number handling functions in this way would allow much greater optimisation potential, as there would be only one implementation to attempt to accelerate.
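The casting approach can be sketched as below. This is illustrative code of our own, not the benchmark's: `libComplex`, `complexSum`, and `libFirstReal` are hypothetical names, with `libComplex` standing in for a library type laid out like FFTW's `double[2]`.

```c
#include <complex.h>

/* Many FFT libraries define their complex type as two contiguous
 * doubles; e.g. FFTW's fftw_complex has the layout double[2].  C99's
 * double complex is bitwise compatible with this layout, so arrays
 * need only be cast at the FFT call boundary, rather than converted
 * element by element. */
typedef double libComplex[2];   /* stand-in for a library's complex type */

/* Arithmetic can be written once, in portable C99 terms... */
static double complex complexSum(const double complex *a, int n)
{
    double complex s = 0.0;
    int i;
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* ...and the same storage viewed through the library's type where a
 * library call expects it (here we simply read the first real part). */
static double libFirstReal(libComplex *data)
{
    return data[0][0];
}
```

With one canonical representation, only `complexSum`-style helpers would need to be accelerated, and each FFT library would see the data through a single cast.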
As it stands, the benchmark executable performs only one full 3D FFT before exiting. In retrospect, this is wasteful, as the data-independent setup and planning data must be recalculated on each run. Making the executable perform multiple runs would be a fairly quick fix – in fact, we quickly implemented a temporary form of it for the MKL test, but did not implement the feature fully.
Much of the timing data is amalgamated within the program, being output as a total time plus two numbers which represent communication time and FFT time in the rod decomposition only. A more detailed timing readout could provide much more meaningful data for the same amount of time spent in computation.
4.5 Future Work
This study has explored many parameters, but none in extreme detail; therefore, there
is much related work that we could perform.
We have seen that the slab decomposition generally outperforms the rod decomposition, but both were load-balanced in every case. The rod decomposition offers more flexibility under this constraint, but a comparison between the rod decomposition and an unbalanced slab decomposition – in which the working arrays are padded up to a size suitable for a slab decomposition, or the more flexible MPI_Alltoallv function is used – could yield interesting results. Similarly, we have only used a rod decomposition with a processor grid that is square, or as square as possible. It is possible that this is sub-optimal, especially if the network topology is arranged such that a less balanced decomposition could have lower-latency communications for both all-to-all steps.
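One ingredient of such an unbalanced comparison is the per-task count and displacement arrays that MPI_Alltoallv requires. The sketch below is hypothetical (the name `planeCounts` and the plane-based units are our assumptions, not the benchmark's code) and shows one way of deriving them when p does not divide the extent.

```c
/* Hypothetical sketch: divide 'extent' planes among 'p' tasks when p
 * need not divide extent, producing the counts and displacements (in
 * planes) that an MPI_Alltoallv-based unbalanced slab decomposition
 * would need.  The first (extent % p) tasks receive one extra plane;
 * displacements are the running sum of the counts. */
static void planeCounts(int extent, int p, int counts[], int displs[])
{
    int base = extent / p;     /* minimum planes per task  */
    int rem  = extent % p;     /* tasks that get one extra */
    int offset = 0;
    int i;
    for (i = 0; i < p; i++) {
        counts[i] = base + (i < rem ? 1 : 0);
        displs[i] = offset;
        offset += counts[i];
    }
}
```

Scaled by the number of elements per plane, these arrays could be passed directly as the count and displacement arguments of MPI_Alltoallv, removing the requirement that p divide the extent.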
The MPI_Alltoallv function could in fact be used to construct the whole rod decomposition, but this is a complex operation to perform. If MPI_Alltoall over subsets of processors performs worse than MPI_Alltoallv over the global set, this could give performance gains over the current rod decompositional method. A general benchmark comparing the different possible techniques for performing the communications used in the parallel 3D FFT would be of general interest. This could consist of latency, insertion bandwidth, bisection bandwidth and derived-datatype packing speed tests, as well as a comparison of the speed of all-to-all exchanges using the normal all-to-all calls against non-blocking point-to-point calls, and tests of memory latency and bandwidth. It should be possible to use all these factors to build a picture of what will limit parallel FFT speed in an application, and whether it can be improved.
One of the parameter restrictions to which we have given less attention is the restriction to processor counts p = 2^n, where n is a positive integer. Typical collective algorithms can perform best with this type of restriction in place, and an investigation of the effects of relaxing it may yield interesting results.
After improving the benchmark application, additional work could make it suitable for general distribution and use – currently it can require considerable knowledgeable intervention to compile against the different libraries it supports, and this could be automated given enough time. The kernel of the benchmark could even be made into a generally usable, highly configurable library performing parallel 3D FFTs in other applications. This would require significant work, however, including performance tuning and optimisation research, and given the promise shown by the FFTW3 parallel routines, it is possibly not a useful avenue of research.
With the growing tendency towards the utilisation of greater and greater numbers of processors, and towards multicore processing, making efficient use of these processing units becomes more and more challenging. The slab decomposition may be easier to implement and suitable for many applications, but it may be inadequate for systems like BlueSky, with its large numbers of relatively low-powered processors. At the time of writing, IBM has in development an experimental library for the Blue Gene platform, designed to make optimal use of its toroidal network [35]. It would be interesting to see whether a library making use of FFTW3 could perform comparably – a mature, high-performance, portable, and open-source solution could greatly benefit the myriad fields in which this technique is employed.
Bibliography
[1] H. Jagode, “Fourier Transforms for the BlueGene/L Communication Network,”
Master’s thesis, EPCC, 2006.
[2] U. Sigrist, “Optimizing Parallel 3D Fast Fourier Transformations for a Cluster of
IBM POWER5 SMP Nodes,” Master’s thesis, EPCC, 2007.
[3] C. F. Gauss, “Nachlass: Theoria interpolationis methodo nova tractata,” Werke,
vol. 3, pp. 265–327, 1866.
[4] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of com-
plex Fourier series,” Math. Comput., vol. 19, pp. 297–301, 1965.
[5] M. T. Heideman, D. H. Johnson, and C. S. Burrus, “Gauss and the history of the
fast Fourier transform,” IEEE ASSP Magazine, vol. 1, no. 4, pp. 14–21, 1984.
[6] C. M. Rader, “Discrete Fourier transforms when the number of data samples is
prime,” Proceedings of the IEEE, vol. 56, no. 6, pp. 1107–1108, 1968.
[7] L. I. Bluestein, “A linear filtering approach to the computation of the discrete
Fourier transform,” Northeast Electronics Research and Engineering Meeting
Record, vol. 10, pp. 218–219, 1968.
[8] J. Hein, A. Simpson, A. Trew, H. Jagode, and U. Sigrist, “Parallel 3D-FFTs for
Multi-processing Core nodes on a Meshed Communication Network,” in Proceedings of the CUG 2008, 2008.
[9] M. Frigo and S. G. Johnson, “The Design and Implementation of FFTW3,” Pro-
ceedings of the IEEE, vol. 93, no. 2, pp. 216–231, 2005, special issue on "Program
Generation, Optimization, and Platform Adaptation".
[10] (2008, August) Engineering Scientific Subroutine Library (ESSL) and Parallel
ESSL. [Online].
Available: http://www-03.ibm.com/systems/p/software/essl/index.html
[11] (2008, August) The Basic Linear Algebra Communication Subprograms Project.
[Online].
Available: http://www.netlib.org/blacs/
[12] (2008, August) AMD Core Math Library 4.1.0 User Guide. [Online].
Available: http://developer.amd.com/assets/acml_userguide.pdf
[13] (2008, August) Intel Math Kernel Library Reference Manual version 024.
[Online].
Available: http://softwarecommunity.intel.com/isn/downloads/
softwareproducts/pdfs/347468.pdf
[14] (2008, August) HECToR - UK National Supercomputing Service. [Online].
Available: http://www.hector.ac.uk/
[15] (2008, August) HECToR and the University of Edinburgh. [Online].
Available: http://www.hector.ac.uk/about-us/partners/uoe/
[16] (2008, August) Cray XT4 and XT3 Supercomputers. [Online].
Available: http://www.cray.com/products/xt4/
[17] (2008, August) The Portland Group. [Online].
Available: http://www.pgroup.com/
[18] (2008, August) Pathscale 64-bit Compilers. [Online].
Available: http://www.pathscale.com/
[19] (2008, August) HPCx. [Online].
Available: http://www.hpcx.ac.uk/
[20] (2008, August) Daresbury SIC: HPCx. [Online].
Available: http://www.daresburysic.co.uk/facilities/expertise/hpcx
[21] O. Lascu, Z. Borgosz, P. Pereira, J.-D. S. Davis, and A. Socoliuc, An Introduction
to the New IBM eserver pSeries High Performance Switch. IBM, 2003.
[22] (2008, August) EPCC - Ness. [Online].
Available: http://www2.epcc.ed.ac.uk/∼ness/documentation/ness/index.html
[23] (2008, August) EPCC - Blue Gene. [Online].
Available: http://www2.epcc.ed.ac.uk/∼bgapps/user_info.html
[24] G. Almási, C. Archer, J. G. Castaños, J. A. Gunnels, C. C. Erway, P. Heidelberger,
X. Martorell, J. E. Moreira, K. Pinnow, J. Ratterman, B. D. Steinmacher-Burow,
W. Gropp, and B. Toonen, “Design and implementation of message-passing ser-
vices for the Blue Gene/L supercomputer,” IBM Journal of Research and Devel-
opment, vol. 49, no. 2/3, 2005.
[25] (2008, August) ECDF. [Online].
Available: http://www.ecdf.ed.ac.uk/
[26] (2008, August) Barcelona Supercomputing Centre. [Online].
Available: http://www.bsc.es/
[27] (2008, August) LRZ: Höchstleistungsrechner in Bayern (HLRB II). [Online].
Available: http://www.lrz-muenchen.de/services/compute/hlrb/
[28] A. Dubey and D. Tessera, “Redistribution strategies for portable parallel FFT: a
case study,” Concurrency and Computation: Practice and Experience, vol. 13,
no. 3, pp. 209–220, 2001.
[29] The Current ISO C 99 Standard (with technical corrigenda TC1, TC2 and TC3).
[Online].
Available: http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1256.pdf
[30] ANSI C — ANS X3.159-1989, Programming Language C, 1989.
[31] (2008, August) Autoconf - a tool for generating configure scripts. [Online].
Available: http://www.gnu.org/software/autoconf/
[32] (2008, August) pkg-config. [Online].
Available: http://pkg-config.freedesktop.org/wiki/
[33] X/Open CAE Specification, System Interface Definitions, Issue 4 Version 2,
September 1994.
[34] (2008, August) Vampir - MPI Instrumentation. [Online].
Available: http://www.vampir.eu/
[35] (2008, August) 3D Fast Fourier Transform Library for Blue Gene/L. [Online].
Available: http://www.alphaworks.ibm.com/tech/bgl3dfft
Appendix A
All-to-all Data Rearrangement
The actual code used to rearrange the data before and after the all-to-all calls is somewhat obfuscated, largely due to the complex operation being performed on the current index. It loads input data contiguously, to aid cached data reuse, and uses integer division where appropriate to discard remainders.

It essentially performs transposes across subarrays in whichever dimension an all-to-all is about to take place, both organising the data into the order it should have in its new state, and making it contiguous to allow the use of MPI_Alltoall without a non-primitive datatype specification.

The unpack routine then takes the correctly ordered blocks after the all-to-all and spaces them correctly within the array.
Figure A.1: The code used to rearrange data prior to the all-to-all.
/* domainSize[2] -> an array containing how many rods each processor has,
**                  in each dimension of the 2D decomposition            */
/* extent        -> the extent across one edge of the data cube          */
/* *dataIn       -> a pointer to the input data                          */
/* *dataOut      -> a pointer to the buffer to be used for the all-to-all */

void ataRowRearrange(complexType *dataIn, complexType *dataOut,
                     int domainSize[2], int extent)
{
    /* Rearranges the data in a domain such that all the data
    ** that needs to be sent to one processor is contiguous and
    ** in the right order, for an all-to-all across rows of a
    ** 2D decomposition of a 3D array. */
    int i;

    /* Loop over every element this processor holds */
    for (i = 0; i < domainSize[0] * domainSize[1] * extent; i++) {
        /* Assign complex number from pointer to pointer */
        complexAssign(&dataOut[
            ( ( i % domainSize[0] ) * domainSize[0] ) +
            ( ( ( i % extent ) / domainSize[0] )
                  * domainSize[0] * domainSize[0] * domainSize[1] ) +
            ( ( i / extent ) % domainSize[0] ) +
            ( ( i / ( domainSize[0] * extent ) )
                  * domainSize[0] * domainSize[0] )
        ], dataIn[i]);
    }
}

void ataColRearrange(complexType *dataIn, complexType *dataOut,
                     int domainSize[2], int extent)
{
    /* Rearranges the data in a domain such that all the data
    ** that needs to be sent to one processor is contiguous and
    ** in the right order, for an all-to-all across cols of a
    ** 2D decomposition of a 3D array. */
    int i;

    for (i = 0; i < domainSize[0] * domainSize[1] * extent; i++) {
        complexAssign(&dataOut[
            ( i % extent ) * domainSize[0] * domainSize[1] +
            ( ( i / extent ) % domainSize[0] ) * domainSize[1] +
            ( i / ( domainSize[0] * extent ) )
        ], dataIn[i]);
    }
}
Figure A.2: The code used to unpack data after the all-to-all.
void ataRowUnpack(complexType *dataIn, complexType *dataOut,
                  int domainSize[2], int extent)
{
    /* Unpacks data after all-to-all across rows. */
    int i;

    for (i = 0; i < domainSize[0] * domainSize[1] * extent; i++) {
        complexAssign(&dataOut[
            ( i % domainSize[0] ) +
            ( ( ( i / domainSize[0] ) % ( domainSize[0] * domainSize[1] ) )
                  * extent ) +
            ( ( i / ( domainSize[0] * domainSize[0] * domainSize[1] ) )
                  * domainSize[0] )
        ], dataIn[i]);
    }
}

void ataColUnpack(complexType *dataIn, complexType *dataOut,
                  int domainSize[2], int extent)
{
    /* Unpacks data after all-to-all across cols. */
    int i;

    for (i = 0; i < domainSize[0] * domainSize[1] * extent; i++) {
        complexAssign(&dataOut[
            ( i % domainSize[1] ) +
            ( ( ( i / domainSize[1] ) % ( domainSize[0] * domainSize[1] ) )
                  * extent ) +
            ( ( i / ( domainSize[0] * domainSize[1] * domainSize[1] ) )
                  * domainSize[1] )
        ], dataIn[i]);
    }
}
Appendix B
Patching FFTW3 to Blue Gene/L
FFTW3.2a3 failed to compile "out of the box" on the BlueSky system, and
the following patch, devised with the assistance of Matteo Frigo of MIT,
was applied to kernel/cycle.h. The problem is believed to stem from an
incompatibility between the Blue Gene/L and other, more common PowerPC
chips: the patch adds a Blue Gene/L-specific section which uses an
IBM-compiler-specific call to the chip's native timing routines.
131a132,174
>
>
>
> /*----------------------------------------------------------------*/
> /*
>  * Blue Gene/L version of ``cycle'' counter using the time
>  * base register.
>  */
> /* 64 bit */
> #if defined(__blrts__) && (__64BIT__) && !defined(HAVE_TICK_COUNTER)
> typedef unsigned long long ticks;
>
> static __inline__ ticks getticks(void)
> {
>      return __mftb();
> }
>
> INLINE_ELAPSED(__inline__)
>
> #define HAVE_TICK_COUNTER
> #endif
>
> /* 32 bit */
> #if defined(__blrts__) && !defined(HAVE_TICK_COUNTER)
> typedef unsigned long long ticks;
>
> static __inline__ ticks getticks(void)
> {
>      unsigned int tbl, tbu0, tbu1;
>
>      do {
>           tbu0 = __mftbu();
>           tbl  = __mftb();
>           tbu1 = __mftbu();
>      } while (tbu0 != tbu1);
>      return (((unsigned long long)tbu0) << 32) | tbl;
> }
>
> INLINE_ELAPSED(__inline__)
>
> #define HAVE_TICK_COUNTER
> #endif
>
Figure B.1: The patch, applied to cycle.h.
Appendix C
Readme for Software Package
We could not obtain direct access to the MareNostrum platform, and instead
provided David Vicente of the Barcelona Supercomputing Centre with our
software and a "readme" file explaining how to use it. This file is
reproduced below.
=== 3D FFT Benchmark ===
Unfortunately, I didn’t have time to learn how to
use autoconf for this, so some manual editing of the
Makefile may be required.
The list of steps required to run this consists of:
1) Compile all versions.
2) Make a template file for the job scripts.
3) Run batchmaker.
4) Move jobs and executables to a staging directory if necessary.
5) Submit jobs.
== 1 - Compile all versions ==
In an ideal environment, you can just:
make LIB=fftw3
make sweep
make LIB=fftw2
make sweep
make LIB=essl
make sweep
make LIB=mkl
make sweep
make LIB=acml
... whichever apply on the system.
Other settings are:
CC=[gcc|pgcc|xlc|icc|xlc-bg]
Sets the compiler type underlying the usual MPI compiler wrapper
(for purposes of compiler flags); defaults to gcc.
If you have none of these, you can set the flags used for compilation
separately using:
CFLAGS=
By default, contains optimisation flags appropriate for the above.
MPICC=
Contains the name of the MPI compiler wrapper. Defaults to ’mpicc’.
EXTRAFLAGS=
Empty by default, added to every compilation line. Use for flags you
need to include to specify extra libraries needed to link against on
your system, or -L and -I flags to specify library locations.
The makefile assumes maximum capabilities for each library by default
(for SYSTEM=generic), which means that FFTW2 is assumed to be compiled
with MPI support and without type-prefixes (use LIB=dfftw2 otherwise),
that MKL includes parallel support, and that ESSL includes PESSL.
== 2 - Make a template file ==
There are a number of templates in the templates directory which may be
reconfigurable for your batch system - you may be able to take the PBS
or SGE one directly and merely alter the account code. template.template
contains a list of all the keytags that batchmaker replaces, as well as a
non-specific template form.
== 3 - Run batchmaker ==
./batchmaker.sh fft-*
will usually do the job, assuming you’re in the directory where you
compiled the fft executables. Batchmaker is pretty self-explanatory
to use, and generates a pile of job-version-cpucount.nys files, which
are job files to be submitted.
== 4 - Move files to staging directory ==
If you need to, move all the *.nys files and the fft-* executables to a
staging directory at this point...
== 5 - Submit ==
And submit them, however you do that on your system.
Appendix D
Work Plan and Organisation
Our workplan changed significantly during the project, because we had
underestimated both the complexity of the 3D FFT operation and the
difficulty of porting the benchmarking application to all the libraries
and platforms. Additionally, a number of steps that we had marked as
discrete blended together where practical.
Our original workplan follows:
WP 1 — Write Generic FFT Code – 2.5 weeks
WP 2 — Write Execution and Result Collation Scripts – 2 weeks
WP 3 — Port and Add Library Support – 3 weeks
WP 4 — Perform Experimental Runs and Analyse Results – 2 weeks
WP 5 — Make any Necessary Adjustments to Software – 1 week
WP 6 — Perform Complete Runs – 1 week
WP 7 — Complete Write-up – 4 weeks
In reality, modifications were made to the code almost continuously
throughout the project, though the initial working version took longer
than expected to develop, owing to the complexity of the data
rearrangement algorithm given in Appendix A. Once all the code except
this section was complete, the execution and collation scripts were
written in parallel with work on this algorithm. We considered
portability throughout the implementation, despite not adding the extra
library calls until the porting phase began, and the execution scripts
were designed from the outset to be modular and portable. The porting
phase took longer than expected, as we encountered unforeseen issues on
each platform; and the experimental runs and adjustments, rather than
being performed as discrete, whole tasks, were carried out for each
platform in turn.
In retrospect, we should have planned our various testing procedures
more carefully, to save time and resources. We wasted a considerable
amount of time on non-specific tests, when a step-by-step analysis of
each facet of the process for obtaining our results would probably have
been far more economical.
Planning such projects at this level of detail is hardly an exact and
rigorous procedure, however, and while we may not have adhered to our
plan, designing it at such an early stage did give us a good indication
of the complexity of the project and of the steps and dependencies it
required.