Demanding Parallel FFTs: Slabs & Rods
Ian Kirker
August 22, 2008
MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2008
Abstract
Fourier transforms of multidimensional data are an important component of many scientific codes, and thus the efficient parallelisation of these transforms is key to obtaining high performance on large numbers of processors. For the three-dimensional case, two common decompositions of the input data can be employed to perform this parallelisation – one-dimensional ("slab") and two-dimensional ("rod") processor grid divisions.
In this report, we demonstrate an implementation of the three-dimensional FFT routine in parallel, using component routines from a library capable of performing one-dimensional FFTs, and briefly examine the considerations of creating a portable application in C which can use multiple libraries.
We then examine the performance of the two decompositions on seven different high-performance platforms – HECToR, HPCx, Ness, BlueSky, MareNostrum, HLRB II, and Eddie – using each of the FFT libraries available on each of these platforms. Efficient scaling is demonstrated up to more than a thousand processors for sufficiently large data cube sizes, and the results obtained indicate that the slab decomposition generally obtains greater performance.
Additionally, our results indicate that the FFT libraries installed on the platforms we tested are largely comparable in performance, whether vendor-provided or open-source, except on the esoteric Blue Gene/L architecture, on which ESSL proved to be superior.
Contents
1 Introduction
  1.1 Fourier Transforms
  1.2 All-to-All Communication
  1.3 FFT Libraries
    1.3.1 FFTW
    1.3.2 ESSL
    1.3.3 ACML
    1.3.4 MKL
  1.4 HPC Systems
    1.4.1 HECToR
    1.4.2 HPCx
    1.4.3 Ness
    1.4.4 BlueSky
    1.4.5 Eddie
    1.4.6 MareNostrum
    1.4.7 HLRB II
2 Benchmarking
  2.1 2D FFTs for the Slab Decomposition
  2.2 Benchmarking on an HPC Platform
  2.3 Implementation Details
    2.3.1 Language and Communication
    2.3.2 Test Data and Verification
    2.3.3 Multidimensional FFT Data Rearrangement
    2.3.4 Rod Decomposition Dimensions
    2.3.5 Memory Limitations
  2.4 FFT Library Porting
    2.4.1 FFTW3
    2.4.2 FFTW2
    2.4.3 MKL
    2.4.4 ESSL
    2.4.5 ACML
  2.5 System Porting
  2.6 Scripting
3 Results
  3.1 HECToR
    3.1.1 Libraries
    3.1.2 1D vs 2D FFT
    3.1.3 Slabs vs Rods
    3.1.4 Scaling
  3.2 Ness
    3.2.1 Libraries
    3.2.2 Slabs vs Rods
    3.2.3 Scaling
  3.3 HPCx
    3.3.1 Libraries
    3.3.2 Slabs vs Rods
    3.3.3 Scaling
  3.4 BlueSky
    3.4.1 Libraries
    3.4.2 Slabs vs Rods
    3.4.3 Scaling
  3.5 Eddie
    3.5.1 Libraries
    3.5.2 Slabs vs Rods
    3.5.3 Scaling
  3.6 MareNostrum
    3.6.1 Libraries
    3.6.2 Slabs vs Rods
    3.6.3 Scaling
  3.7 HLRB II
    3.7.1 Libraries
    3.7.2 Slabs vs Rods
  3.8 Automatic Parallel Routines
4 Discussion and Conclusions
  4.1 Rods & Slabs
  4.2 Libraries
  4.3 Automatic Routines
  4.4 Improving the Benchmark
  4.5 Future Work
A All-to-all Data Rearrangement
B Patching FFTW3 to Blue Gene/L
C Readme for Software Package
D Work Plan and Organisation
List of Tables
1.1 Software versions used on each platform
2.1 FFT libraries and their complex double-precision floating-point number specification for the C language interface.
List of Figures
1.1 A directed acyclic graph showing the data dependencies for each step of an FFT operation on an 8 element array.
1.2 Slab and rod decompositions as applied to a 4 x 4 x 4 data cube decomposed over 4 processors.
1.3 Steps and data rearrangement in the slab decomposition of the 3D FFT. Reproduced from [1].
1.4 Steps and data rearrangement in the rod decomposition of the 3D FFT. Reproduced from [1].
1.5 All-to-all communication between processors, as applied to a common matrix transposition.
2.1 Graphical representation of the ability to break down a three-dimensional FFT into different dimensionality function calls. Each method is equivalent, and the operations are commutative. Axes may be arbitrarily reordered.
2.2 2D Decomposition Code
3.1 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 64 processors, on HECToR.
3.2 A comparison of the total time taken for the purely 1D and the 2D & 1D FFT-using slab decomposition 3D FFT, for different libraries, using 16 processors, on HECToR.
3.3 A comparison of the total time taken for the two different decompositions of the 3D FFT on HECToR, using the ACML library.
3.4 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on HECToR.
3.5 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 8 processors, on Ness.
3.6 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for MKL on the first call of the FFT routine and the second, using 8 processors, on Ness.
3.7 A comparison of the total time taken for the two different decompositions of the 3D FFT on Ness, using the FFTW3 library.
3.8 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on Ness.
3.9 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 32 processors, on HPCx.
3.10 A comparison of the total time taken for the two different decompositions of the 3D FFT on HPCx, using the ESSL library.
3.11 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on HPCx.
3.12 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 128 processors, on BlueSky.
3.13 A comparison of the total time taken for the two different decompositions of the 3D FFT on BlueSky, using the ESSL library.
3.14 A comparison of the total time taken for the two different decompositions of the 3D FFT on BlueSky, using the FFTW3 library.
3.15 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on BlueSky using ESSL.
3.16 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on BlueSky using FFTW3.
3.17 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 16 processors, on Eddie.
3.18 A comparison of the total time taken for the two different decompositions of the 3D FFT on Eddie, using the FFTW3 library.
3.19 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on Eddie.
3.20 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 32 processors, on MareNostrum.
3.21 A comparison of the total time taken for the two different decompositions of the 3D FFT on MareNostrum, using the ESSL library.
3.22 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on MareNostrum.
3.23 A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 64 processors, on HLRB II.
3.24 A comparison of the total time taken for the two different decompositions of the 3D FFT on HLRB II, using the FFTW3 library.
3.25 A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on HLRB II.
3.26 A comparison of the time taken for a slab decomposition 3D FFT, for the automatic parallel routines, as a percentage of the time taken for our written routines, using FFTW2, on HECToR.
3.27 A Vampir trace timeline taken on Ness, showing the barrier preceding, and the communication performed within, the FFTW2 parallel routine.
3.28 A comparison of the time taken for a slab decomposition 3D FFT, for the automatic parallel routines using PESSL, on HPCx.
3.29 A comparison of the time taken for a slab decomposition 3D FFT, for the automatic parallel routines, as a percentage of the time taken for our written routines, using our compiled version of FFTW 3.2a3, on Ness.
A.1 Data rearrangement code
A.2 Data rearrangement code (cont.)
B.1 The FFTW3 Patch
Acknowledgements
Principally, I’d like to thank Dr. Gavin Pringle, for his immeasurable guidance, support, and encouragement throughout this project.
I’d also like to thank Dr. Joachim Hein, for his greatly appreciated assistance, especially concerning the BlueSky system.
Acknowledgement and thanks are also due for the assistance of David Vicente of the Barcelona Supercomputing Centre, who enabled us to have results for MareNostrum, and to the Leibniz-Rechenzentrum München, for allowing us to use their HLRB II system.
Finally, many thanks to David Weir, and my assorted family members, for all their support throughout my time in Edinburgh.
Chapter 1
Introduction
Fourier transforms are an oft-used technique in signal analysis, where they can be used to perform complex operations on signals or simply to retrieve the frequencies present, and in linear algebra, where they are widely used in multiple dimensions to solve problems that would take an untenable amount of time by iterative methods. In such applications the Fourier transform is often the most computationally intensive part of the code, so parallelising the operation should significantly benefit the overall runtime of a calculation.
It is therefore in our interest to determine an optimal method of parallelisation for this transformation. Commonly used methods employ data parallelism, decomposing the multidimensional input data along one or more of its dimensions and giving each processor a subset of the data to transform. For the three-dimensional case, the one-dimensional decomposition is more common, but with the increasing numbers of processors in modern systems, demand for higher degrees of parallelism in applications has made the two-dimensional decomposition a more viable option than before.
Previous work [1][2] has studied these methods, and their tuning, on specific platforms; in this study, however, we aim to provide a broader view of the performance of the two decompositional methods across many platforms, using several libraries together with our own parallel decomposition code.
1.1 Fourier Transforms
The Fourier transform is a mathematical method of representing a function in terms of sinusoidal components, constructing a new function that gives the phase and amplitude of each component required to reconstruct the original function.
It is often used to decompose a function of time into its component frequencies; the resulting transformed function has many useful properties, and can be used to combine functions in ways that would be much more complex to perform on the functions in the time domain.
The transform can be discretised in order to operate on a numerical sequence – in this case, the numerical sequence is assumed to be a single instance of a repeating signal, and the resolution is limited to that of the initial input data. The discrete Fourier transform is used in many computational techniques, often to transform the data into a form where a certain operation can be applied more easily, after which the reverse transformation returns the data to its former domain.
The mathematical form of the discrete Fourier transform is largely irrelevant to the context of this study, except to say that there are essentially two common mechanisms that can be used – the ordinary discrete Fourier transform, which takes O(n²) time, and a recursive method thought to have been first described by Gauss [3], re-formalised by Cooley and Tukey [4], and much investigated since [5], known as the fast Fourier transform (FFT), in which the transform is constructed by recursively performing simpler, smaller transforms via a divide-and-conquer algorithm, taking only O(n log n) time.
Figure 1.1: A directed acyclic graph showing the data dependencies for each step of an FFT operation on an 8 element array.
This technique is thus limited to sequences which can be divided in this way; sequences of prime length must be operated on either using the full O(n²) mechanism, or using one of a set of more advanced techniques in which the FFT is expressed as a convolution of two sequences, which are themselves expressible as the product of two FFTs of sequences of non-prime length [6][7].
The divide-and-conquer mechanism leads to complex data dependencies in the one-dimensional FFT (see Figure 1.1), making it inefficient to parallelise except for very large data sets or on vector platforms. However, the two- and three-dimensional forms of the FFT, which are often used in scientific computing, are mathematically equivalent to performing one-dimensional FFTs along each dimension, in any order. In the three-dimensional case, the data can thus be decomposed over a one-dimensional array of processors along any dimension of the data cube, leaving two dimensions of the cube intact so that whole FFTs can be performed without communication (Figure 1.3). This type of decomposition is commonly known as a ’slab’ decomposition.
Due to the computational intensity of these operations, and the relatively small extents of the data cube (usually smaller than 10^4 in each dimension), it would be ideal for parallelisation to divide up one additional dimension of the data cube, leaving only one dimension that can be operated on without communication. This form is commonly known as a "rod" or "pencil" decomposition. Performing the three-dimensional FFT in this case, however, requires an extra communication step over the slab decomposition, as shown in Figure 1.4. The additional cost of this communication may make the operation perform significantly worse than the simpler one-dimensional decomposition. However, each of the two communication steps of the rod decomposition is simpler than the single step of the slab decomposition, requiring only O(√p) messages to and from each processor, rather than O(p) in the case of the slab decomposition. Each message contains the same quantity of data, so this reduction in message count comes with a corresponding decrease in the overall quantity of data that has to be inserted into the network. These factors may give the rod decomposition an advantage where the slab decomposition's performance is limited by any of the properties of the network.
We therefore investigate which decomposition gives better performance, and to what degree, for a range of data set sizes and processor counts, on a range of platforms. This will also provide expected timings and scaling measures for these three-dimensional FFTs.
Figure 1.2: Slab (1D) and rod (2D) decompositions as applied to a 4 x 4 x 4 data cube decomposed over 4 processors.
[Figure: an L x M x N problem on Procs 0-3; (a) perform the 1st 1D-FFT along the y-dimension and the 2nd 1D-FFT along the z-dimension, then ALL-to-ALL to get data over the x-dimension locally; (b) perform the 1D-FFT along the x-dimension.]
Figure 1.3: Steps and data rearrangement in the slab decomposition of the 3D FFT. Reproduced from [1]. Note that the axes have been rotated to better display data reorganisation.
[Figure: a data cube on Procs 0-15; (a) perform the 1D-FFT along the y-dimension; ALL-to-ALL within each sub-group to get data over the z-dimension locally; perform the 1D-FFT along the z-dimension; ALL-to-ALL within each sub-group to get data over the x-dimension locally; perform the 1D-FFT along the x-dimension.]
Figure 1.4: Steps and data rearrangement in the rod decomposition of the 3D FFT. Reproduced from [1]. Note that the axes have been rotated to better display data reorganisation.
1.2 All-to-All Communication
All-to-all communication, as the name suggests, is a style of communication in which every discrete processing unit in a parallel computing group needs to communicate with every other. In the message-passing paradigm, this means taking n processing units, each holding n items of data, and redistributing them such that unit i receives the ith item held by every other task. This is equivalent to performing a transposition of an n-row matrix, as shown in Figure 1.5.
[Figure: the elements 0-15 of a 4 x 4 matrix distributed over Procs 0-3, before and after the exchange.]
Figure 1.5: All-to-all communication between processors, as applied to a common matrix transposition.
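In MPI this redistribution is what MPI_Alltoall performs. The rule itself can be shown with a serial sketch (simulating the tasks within one process; alltoall and P are illustrative names, not part of any library): item j held by task i ends up as item i held by task j, which is exactly a matrix transposition.

```c
enum { P = 4 };  /* number of simulated tasks (illustrative constant) */

/* Serial simulation of an all-to-all exchange among P tasks, each
 * holding P items: task j's i-th received item is task i's j-th sent
 * item, i.e. recv is the transpose of send. */
void alltoall(const int send[P][P], int recv[P][P])
{
    for (int i = 0; i < P; i++)
        for (int j = 0; j < P; j++)
            recv[j][i] = send[i][j];
}
```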
This is one of the most expensive operations an interconnect between tasks can perform in a distributed-memory environment, and its cost depends heavily on the insertion and bisection bandwidths of the section of the network used. The insertion bandwidth is the limit on individual messages – how fast each processor can send data into the network – while the bisection bandwidth is the limit on connection bandwidth across the network’s minimum cross-section, i.e. the smallest aggregate link bandwidth through which every message must travel.
The quoted figures for such bandwidths are often theoretical maxima – the message-passing library is frequently implemented on top of a lower-level network library, whose overheads can slow messaging operations and reduce performance. Sub-optimal assignment of tasks to physical processors can also add unnecessary latency to each message sent, especially in the case of the rod decomposition, where compartmentalising each all-to-all such that messages from different instances do not route through the same links can provide significant performance benefits [1][8].
For message-passing, libraries implementing the Message-Passing Interface (MPI) have become the de facto standard, being implemented on many platforms even where message passing is not the most powerful paradigm available – e.g. on shared-memory systems, where all memory is shared by all processors, and ccNUMA (cache-coherent non-uniform memory access) systems, where memory is localised but accessible by all processors.
1.3 FFT Libraries
A fast Fourier transform can be implemented from scratch as needed; however, several well-known libraries exist that provide generic implementations which typically perform well for a variety of data extents. The recursive nature of the FFT algorithm means that an FFT library will typically have a number of optimised base cases for small prime factors, and hence a set of prime factors for which it performs very well. This set of optimised factors was once a major consideration in the choice of library; most libraries now support any size of extent, but may not perform optimally for extents that are not factorable into small primes.
Libraries provided by vendors will typically be optimised for their own platforms; however, libraries that attempt to provide good performance across a range of platforms also exist. Below we briefly describe the libraries we used on the platforms we tested.
1.3.1 FFTW
FFTW ("Fastest Fourier Transform in the West") is a free, portable, open-source library for performing Fourier transforms, developed and maintained by Matteo Frigo and Steven Johnson of the MIT Laboratory for Computer Science [9]. It achieves speed on many platforms by self-optimising – different methods of performing the same operation ("codelets") can be speed-tested at run-time and the optimal ones used. The current stable version is 3.1.2, although, as the API changed between versions 2 and 3, many software packages have not been updated and still use the most recent release of version 2, 2.1.5. Version 3 does not yet have fully tested MPI transforms, as these have been introduced only in the 3.2 alpha versions (3.2 alpha 3 is currently the most recent). The MPI transforms in both version 2 and version 3 allow only a slab decomposition of the data operated on.
FFTW claims to perform best on arrays whose extents are products of the small factors 2, 3, 5 and 7, but uses O(n log n) algorithms for all extents, even large primes [9]. It also has "codelet generation" facilities, for advanced use, with which optimal code supporting a particular factor can be generated.
1.3.2 ESSL
IBM’s Engineering and Scientific Subroutine Library [10] is provided on – and typically specifically optimised for – IBM-supplied Linux and AIX systems, and provides functionality used in many scientific applications – not only Fourier transforms, but also linear algebra, matrix operations, and random number generation. Also included is the parallel extension, PESSL, which uses an interface to an implementation of the BLACS library (Basic Linear Algebra Communication Subprograms [11]) and an extension of the ESSL API to perform many of these routines in parallel; for the parallel FFT routines, however, it provides only a slab decomposition method.
ESSL will only perform FFTs with lengths of the form

    n = 2^h × 3^i × 5^j × 7^k × 11^m ≤ 37748736

where h ∈ {1, 2, ..., 25}, i ∈ {0, 1, 2}, and j, k, m ∈ {0, 1}. Attempting to perform an FFT that does not conform to this specification will produce an error.
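A small validity check for this constraint can be written by factoring out the permitted primes and testing the exponent bounds. The helper below is an illustrative sketch, not part of ESSL’s API.

```c
/* Return 1 if n is an acceptable ESSL FFT length, i.e.
 * n = 2^h * 3^i * 5^j * 7^k * 11^m <= 37748736 with
 * h in 1..25, i in 0..2, and j, k, m in 0..1; return 0 otherwise. */
int essl_length_ok(long n)
{
    if (n < 2 || n > 37748736L || n % 2 != 0)
        return 0;   /* h >= 1 means n must be even */
    int h = 0, i = 0, j = 0, k = 0, m = 0;
    while (n % 2 == 0)  { n /= 2;  h++; }
    while (n % 3 == 0)  { n /= 3;  i++; }
    while (n % 5 == 0)  { n /= 5;  j++; }
    while (n % 7 == 0)  { n /= 7;  k++; }
    while (n % 11 == 0) { n /= 11; m++; }
    return n == 1 && h <= 25 && i <= 2 && j <= 1 && k <= 1 && m <= 1;
}
```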
1.3.3 ACML
AMD produces the AMD Core Math Library [12], providing much the same linear algebra, matrix, random-number and transform functionality as ESSL, optimised for AMD processors. It does not include any parallel routines.
ACML claims best efficiency on array extents that are powers of two, and good efficiency on extents that have only small prime factors up to 13; it will, however, perform FFTs on arrays of any length.
1.3.4 MKL
Intel’s Math Kernel Library [13] is an optimised mathematical library for Intel processors, providing similar functionality again to ACML and ESSL. It provides some parallel routines, based on a BLACS library implementation, but unlike PESSL keeps this within the library itself, not requiring explicit BLACS usage by the programmer. Like PESSL, it allows only a slab decomposition of the data.
MKL claims optimal performance on array extents that are powers of 2, 3, 4, and 5 on most architectures, with 7 and 11 added on Intel’s IA-64 architecture, but supports any extent.
1.4 HPC Systems
We wanted to cover a broad range of types of high-performance computing (HPC) platform in our tests, and so enlisted seven different systems, each with a different combination of technologies. These are described below.
1.4.1 HECToR
HECToR [14] is a Cray XT4 system, and serves as the current primary national general-purpose capability computing service for UK universities. It was installed in the University of Edinburgh’s Advanced Computing Facility in 2007 [15].
The HECToR system comprises 1,416 compute blades, each with 4 dual-core 2.8 GHz AMD Opteron processors for a total of 11,328 cores, and 24 service blades, each with 2 similar processors, which manage user access, network services, and I/O for the compute blades.
Each dual-core processor is connected to 6 GB of RAM and to a Cray SeaStar2 communication processor with an embedded PowerPC 440 core. Each SeaStar2 is connected to six others in a 3D toroidal network topology, giving a theoretical point-to-point bandwidth of 2.17 GB/s and a minimum bisection bandwidth of 4.1 TB/s [14].
HECToR compute nodes use Cray’s own operating system, UNICOS/lc, based on a customised Linux kernel called Compute Node Linux (CNL) [16]. Two high-performance compilers are available – the Portland Group compilers [17] and the PathScale compiler [18]. The Portland Group compiler is much more commonly used, and it is the one we have used.
1.4.2 HPCx
HPCx [19] was the national capability computing service for UK universities before HECToR was installed, and is now the complementary service. HPCx is located at STFC’s Daresbury Laboratory in Cheshire, where it was installed in 2004 [20].
HPCx contains 168 IBM eServer 575 nodes, with 160 allocated for computation and 8 for user access and I/O. Each node contains 8 dual-core 1.5 GHz POWER5 processors and 32 GB of shared RAM, giving a total of 2560 compute cores, and nodes are linked by IBM’s Clos-topology High Performance Switch [21].
HPCx uses IBM’s own AIX 5 operating system on both user access and compute nodes.
1.4.3 Ness
Ness [22] is EPCC’s most recent training and testing server, consisting of two Sun Fire X4600 shared-memory compute units, each with 8 dual-core 2.6 GHz AMD Opteron processors and 30 GB of shared RAM, with a Sun Fire X2100 serving as the login node.
The software environment is largely designed to be very similar to HECToR’s, to allow development and testing comparisons. The operating system is based on Scientific Linux, with Portland Group compilers as on HECToR.
1.4.4 BlueSky
BlueSky [23] is a single IBM Blue Gene/L cabinet maintained by EPCC for the University of Edinburgh’s School of Physics. It has 1024 dual-core 700 MHz PowerPC 440 processors, with 512 MB of RAM per processor. The processors are connected by five separate networks, but the principal computational network is the three-dimensional torus, in which each processing unit is connected to all six of its neighbours by 154 MB/s channels [24].
Blue Gene/L systems use a specialised, cut-down OS on the compute CPUs, designed to be as lightweight as possible – even some very common features, such as threading, are not implemented.
1.4.5 Eddie
Eddie [25], operated by the ECDF ("Edinburgh Compute and Data Facility"), is the University of Edinburgh’s own cluster-computing service, consisting of 128 dual-core and 118 quad-core nodes, using 3 GHz Intel Xeon processors and 2 GB of RAM per core, connected by a gigabit Ethernet network. However, a small partition of 60 dual-core nodes is connected by a faster InfiniBand network, and this is the region of the machine we employed.
Eddie uses Scientific Linux for both user access and compute nodes, with the Intel
compiler for high-performance applications.
1.4.6 MareNostrum
MareNostrum [26] is the Barcelona Supercomputing Centre’s computing cluster, formed of 2282 IBM eServer BladeCenter JS20 servers, each with two dual-core 2.3 GHz PowerPC 970MP processors and 8 GB of RAM. The servers are connected to a Myrinet optical-fibre Clos-switched network for performance, as well as to a gigabit Ethernet network for administration.
The software environment is based on a Linux 2.6 kernel and the SUSE distribution, with IBM’s own compiler and library distribution.
1.4.7 HLRB II
HLRB II [27] is the largest system operated by the Leibniz Supercomputing Centre, an
SGI Altix 4500 platform with 4864 dual-core Intel Itanium2 processors. Each processor
is linked to 8GB per core of RAM, except the first processor on each partition, which
is linked to 16GB. The nodes are connected with SGI’s NUMAlink 4 hardware in a
12
Chapter 1. Introduction 1.4. HPC Systems
’fat-tree’ fashion, allowing processors to access any memory installed in the system
directly.
Platform     Compiler             Message-Passing         FFT Libraries
HECToR       pgcc 7.1-4           XT-MPT 3.0.0            FFTW 3.1.1, FFTW 2.1.5, ACML 4.0.1a
HPCx         xlc 08.00.0000.0013  POE 4.2, LAPI 2.4.4.4   FFTW 3.0.1, FFTW 2.1.5, ESSL 4.2.0.0, PESSL 3.2
HLRB II      icc 9.1              SGI MPI/Altix 1.16      FFTW 3.0.1, FFTW 2.1.5, MKL 9.1
MareNostrum  xlc 08               MPICH-GM 1.2.5.2        FFTW 3.1.1, FFTW 2.1.5, ESSL 4.1
Eddie        icc 10.1             Infinipath 2.1          FFTW 3.1.2, FFTW 2.1, MKL 10.0.1.014
Ness         pgcc 7.0-7           MPICH2 1.0.5p4          FFTW 3.1.2, FFTW 2.1.5, ACML 3.6.0, MKL 10.0.1.014
BlueSky      xlc 08.00.0000.0001  MPICH2 1.0.3            FFTW 3.2a3, FFTW 2.1.5, ESSL 4.2.2

Table 1.1: Software versions used on each platform
Chapter 2
Benchmarking
High-performance computing platforms are designed to offer the best performance
possible from their available resources. As such, components, whether hardware or
software, are often upgraded or altered at short notice.
To allow easy comparison of platforms and libraries in a way that would permit rapid
repetition in case of, e.g., changes to the computational environment, we decided to
create a single portable benchmarking system that could be rapidly deployed on all
the targeted platforms within a narrow window of time. This would have to allow
both slab and rod decompositions, and the use of each of the available libraries, as
well as allowing comparison between using a library’s 1D and 2D FFT calls for the
planes of data in the slab decomposition. It would also have to employ a library’s
own parallel routines, if present, as well as ours. To maintain a fair comparison, we
first compare our own slab and rod decomposition routines against each other, and
then compare the libraries’ automatic routines against our own.
2.1 2D FFTs for the Slab Decomposition
As previously stated, for a three-dimensional data set, a three-dimensional Fourier
transform is equivalent to a one-dimensional transform along each dimension. Given
that a two-dimensional transform of a two-dimensional data set is likewise equivalent
to a one-dimensional transform along each dimension, a three-dimensional transform
can also be performed as a two-dimensional transform on each layer followed by a
one-dimensional transform in the remaining dimension (Figure 2.1). We therefore
added to our tests the option of using a two-dimensional transform library routine for
each plane of data in the slab decomposition, as compared with using two
one-dimensional transforms.
While performing a two-dimensional FFT, it is usual to transpose the data in memory
to allow contiguous access to the second dimension. Most of the libraries we used
(FFTW2, FFTW3, ACML, and MKL) offer the ability to return the data transposed,
saving the cost of re-transposing it back into its original alignment. We have used this
option where available, as a programmer can typically account for the transposition
and operate on the data as originally intended without re-transposing it.
[Diagram: 1D FFT (X) + 1D FFT (Y) + 1D FFT (Z) = 2D FFT (XY) + 1D FFT (Z) = 3D FFT (XYZ)]
Figure 2.1: Graphical representation of the ability to break down a three-dimensional
FFT into different dimensionality function calls. Each method is equivalent, and the
operations are commutative. Axes may be arbitrarily re-ordered.
2.2 Benchmarking on an HPC Platform
One of the main issues with benchmarking on an HPC platform is that unless whole
partitions of the machine are entirely reserved for jobs, as on the Blue Gene system,
communications may suffer from congestion with other jobs. The obvious way around
this is to reserve the whole machine before running each job, but this is very expensive
in terms of processor time – for example, reserving thousands of processors to run
a 32-processor job is not economical. A possible solution to this could be to reserve
the whole machine and pack as many jobs into all reserved space as possible – how-
ever, this essentially ensures that other jobs will be running simultaneously with a very
similar communications pattern, meaning that if congestion could occur, it is almost
guaranteed.
We instead chose to intentionally ignore this problem – by running jobs in the normal
way, we get typical timing figures rather than optimal measures, but we also guarantee a
pseudo-random distribution of congestion on the system, if any, as other users run jobs.
There is also an issue, especially with the more layered hardware architectures, that
using a small number of processors can lead to results that do not scale up beyond a
layer. This is especially true of HPCx, with its shared-memory nodes of 16 processors,
where scaling benefits have been observed by performing all-to-all communication
across two boxes linked by a network connection, rather than solely within the
shared-memory environment of one box [8]. We therefore avoided such results,
instead setting our minima for the larger systems at a level that would include at
least two units of the secondary communication layer – e.g. 32 processors (two
shared-memory nodes) on HPCx, or 4 processors (two dual-core processors) on
HECToR. In any case, the rod decomposition can only be constructed for four
processors or more, as three and two are prime and could thus only produce slab
decompositions. For the sake of simplicity, we only used processor counts that were
integer powers of two.
2.3 Implementation Details
In this section, we discuss some of the particular choices we made in implementing the
benchmark software.
2.3.1 Language and Communication
Due to the mutually exclusive compile-time choices that would need to be made in
our code, and our familiarity with the language, we elected to use ISO C99, despite
the complex-number issues that would arise (see Section 2.4). Its powerful standard
preprocessor allows the use of logic and macro definitions to create the code that the
compiler will then operate on.
For communication, we used MPI routines, restricted to strict MPI-1, as Eddie’s
InfiniPath implementation of MPI did not support MPI-2 routines.
2.3.2 Test Data and Verification
To allow verification of a correct transform, test data was created in the form of a
trivariate sine function over the three Cartesian co-ordinates of the distributed cubic
array:

    f(x, y, z) = sin(2πx/X + 2πy/X + 2πz/X)                               (2.1)
where X is the length of one edge of the cube. This function can be analytically shown
[1] to produce transformed data thus:

    F(x, y, z) = −(1/2)·i·X³   if x = y = z = 1
                  (1/2)·i·X³   if x = y = z = X − 1
                  0            otherwise                                  (2.2)
This output signal is conveniently symmetric about the three diagonal axes of the data
cube, which allows us to ignore whether a given library’s 2D FFT call returns data
in transposed form, in the comparison between 1D and 2D FFT calls in the slab
decomposition. One approach to writing a 2D FFT function involves scrambling or
transposing data during the call to improve overall performance, which can often leave
the output data transposed relative to the original; if a library’s function does this, it
will usually offer an option to transpose the data back into the original form, but we
may ignore this to enhance performance.
We may use a simple algorithm to determine where the peaks are located within the
decomposed data, and produce, in an existing buffer, the exact output data for each
processor. We may then calculate the absolute difference between the transformed data
and this exact comparison data, summing over processors, and divide by the number of
elements in the array:

    residue = (1/X³) · Σ_x Σ_y Σ_z |F_output(x, y, z) − F_ideal(x, y, z)|

This value is then checked against a set tolerance, ≤ 10⁻¹⁰, to determine accuracy.
2.3.3 Multidimensional FFT Data Rearrangement
One of the most complex implementation issues is the method used to rearrange the
data for the all-to-all calls required between steps. MPI offers a number of automatic
data-packing routine specifications, commonly called ’derived datatypes’, to aid in
similar operations; however, they are less than ideal for this usage, due to the necessary
combination of at least three datatypes and type-resizing, which is particularly complex
with MPI-1. Rearranging the data into contiguous blocks for each processor using
modulo arithmetic was found to be the simplest approach to implement. Routines to
perform this operation are reproduced in Appendix A; essentially, it involves
performing many small transposes of subarrays within the larger array, with variable,
overlapping strides. This is also expected to perform relatively well, as the limited
implementation may allow more optimisation by the compiler than the generic
datatype implementations of an MPI library.
The MPI_Alltoall call was used to perform the actual communication between FFT
steps. This may not, in fact, be the fastest method for performing this operation on all
systems; however, it is the call designed for this type of use, and is thus the most likely
to be optimised for it. Past studies have examined mechanisms for performing this
operation, and it has been suggested that this method is optimal for many types of
system [28], though this is, of course, entirely problem-dependent.
2.3.4 Rod Decomposition Dimensions
In a two-dimensional decomposition, the obvious choice of dimensions for the
processor grid is the one that gives the most nearly square form – this minimises the
number of processors involved in each communication in both steps. We have assumed
that this is the optimal case, particularly as we are comparing against the slab
decomposition, since it gives us the processor arrangement most different from the slab
form. In the case that the number of processors is not a square number, we have chosen
to make one dimension a factor of two larger than the other, since we always use
power-of-two numbers of processors. The code for obtaining the dimensions of the
array of processors is shown in Figure 2.2. A future investigation of the properties of
altered processor array shapes could be useful.

for (i = (int) sqrt((double) processors); i > 0; i--)
{
    if ((0 == i % 2) && (0 == processors % i))
    {
        dimensions[0] = i;
        dimensions[1] = processors / i;
        break;
    }
}

Figure 2.2: Code used to obtain the dimensions of the two-dimensional processor
array.
2.3.5 Memory Limitations
The data cube uses a significant quantity of memory in every case – a 1024 x 1024 x
1024 data cube is 16 GB of data, and there must be another buffer of identical size, as
MPI_Alltoall cannot be performed in place. Adding working space for common
variables, FFT calls, and additional buffers allocated by MPI, the data required by the
benchmark can easily overflow the memory available to a processor. We attempted to
mitigate this by predicting the data usage and comparing it with the system’s available
RAM, but if not all the processors failed gracefully (for example, if all but one
processor allocates the necessary arrays on a shared-memory node), then evidence
suggests the large core files produced automatically can cripple the I/O handlers of an
HPC system.
2.4 FFT Library Porting
Because the standard specifications of ISO C lacked a native complex type until 1999
[29] (ISO C having first been publicly internationally specified as a copy of ANSI C
in 1990 [30]), many libraries that can interface with C code and use complex numbers
define their own complex-number data structures, as shown in Table 2.1. This can
make interchangeability difficult to achieve. We solved this problem by using
preprocessor directives to select the library being used at compile time, creating one
executable per available FFT library. In retrospect, this was a sub-optimal solution –
despite the different definitions, the types used can be demonstrated to be
bit-compatible, and this fact could have been used to operate on them in a
library-independent manner for arithmetic. The existing implementation already uses
this fact to send the complex data through MPI, by treating every type as merely two
contiguous double-precision floating-point numbers in memory.
All the libraries used have a preparation step, which pre-calculates invariant data used
in the FFT, and a performance step, which uses that data to perform the actual
calculation. Because we were primarily interested in the performance step, we omitted
the preparatory steps from the resultant timing data.
We now discuss our experience of incorporating each library into our benchmark code.
2.4.1 FFTW3
The benchmark was originally written using FFTW3, which was found to be easy to
use, well-documented save for the experimental MPI routines, and trivial to compile,
save for the adjustment between versions with and without MPI-supporting parallel
routines, as with FFTW2.
Both FFTW libraries have multiple interface functions, according to the level of
complexity of the particular user’s needs – for example, a single FFT calculation on a
single array uses a different, simpler preparatory function than multiple FFTs on a
strided array contained within a larger array.

Library       Complex Number Format
(Native C)    _Complex double;
FFTW3         _Complex double;
FFTW3 (alt)   double [2];
FFTW2         struct { double re, im; };
MKL           _Complex double;
ACML          struct { double real, imag; };
ESSL          union { struct { double _re, _im; }; double _align; };

Table 2.1: FFT libraries and their complex double-precision floating-point number
specification for the C language interface. FFTW3 has two entries because it will only
use the native complex number format if complex.h has already been included.
Compiling the FFTW3 libraries on the BlueSky system proved somewhat challenging –
a patch had to be produced and applied to adapt the library’s cycle counters to the
PowerPC 440 processor. This patch is reproduced in Appendix B.
2.4.2 FFTW2
FFTW2 was generally well-documented, C-oriented though with additional
documentation for the Fortran interface, and follows the same structure as FFTW3. It
was, therefore, fairly easy to port from FFTW3 to FFTW2, especially since the main
shift in interface – FFTW3’s introduction of pre-definition of the operating array – did
not significantly affect our use of the library.
The main complication came in compiling with FFTW2 on the different platforms,
where the FFTW2 MPI library was not available (as on Eddie), or the library files were
named differently to specify whether the library used double or single precision (as on
HPCx and HECToR). These were minor complications once identified, but their
presence was not indicated until attempts to link with FFTW2 failed.
2.4.3 MKL
Generally, MKL was quite easy to write with, though when it came to actually
compiling the application, MKL is divided into approximately 35 different linkable
sections, some apparently mutually exclusive, and the documentation does not
explicitly state which sections any given function call will need. A trial-and-error
approach was required to eventually obtain a successfully linked executable. In
addition, the sections available changed between the two major versions we used –
MKL 9 on HLRB II, and MKL 10 on Ness and Eddie.
2.4.4 ESSL
ESSL, while quite explicitly documented, appears to be written with a highly
Fortran-biased view, and has some approaches to data handling that can be quite
unfamiliar to a C programmer. For example, while the FFTW and MKL libraries have
separate mechanisms for preparation and for performing the actual FFT, in ESSL one
merely calls the same library routine first with a preparation flag and then without,
using a large, pre-allocated array for storing pertinent information. In addition, it was
occasionally advantageous to search through the ESSL header files to find the actual C
prototype for functions.
Using the parallel extensions to ESSL (PESSL) in particular proved to be a challenge,
as these use the BLACS (Basic Linear Algebra Communication Subprograms)
interface for parallelism. BLACS is another portable specification which can be used
for performing linear algebra operations in parallel, and can usually be layered on top
of MPI. Indeed, BLACS tools are often written using MPI; however, using both in one
code adds an unnecessary layer of complexity to what is already a fairly complex
operation, especially as BLACS documentation is also often sparse and, again,
Fortran-oriented.
2.4.5 ACML
ACML is a library written in Fortran, and seems to be documented largely for Fortran
users. When using the C interface, this can mean having to fill in missing details as
necessary in a few cases, and to hunt through header files in others. During
compilation, it is often necessary to link against the compiler’s Fortran mathematical
libraries. ACML is quite similar in use to ESSL, having similar "call twice" FFT
routines, similarly styled large pre-allocated arrays, and the same long, complex
function prototypes.
2.5 System Porting
Being unfamiliar with the common automatic configuration tools such as autoconf [31]
and pkg-config [32], we decided instead simply to make one Makefile containing the
necessary flags and commands for every system, and a Bash script to use this Makefile
for each library executable the system supported. Occasionally we found that
configuration tools intended to aid the use of software packages were out of date or
unmaintained, or that, despite them, a search for the library files we needed was still
required. Module files for FFTW3 on Eddie had to be fixed to bring the version
number up to date with the actually installed library before they would function, and
the library paths for MKL on HLRB II had to be located manually.
Allowing the benchmark to compile and run with all the libraries on each platform was
the longest single stage in this study – documentation was often found to be sparse,
especially for Eddie, and as we could only obtain access to MareNostrum via David
Viscente, it was necessary to ensure that everything would work with as little effort on
his part as possible. A "readme" file was provided with a compressed copy of the code
and the surrounding scripts – this is reproduced in Appendix C. Nevertheless, this
package was not designed for any sort of public release – it still requires some manual
configuration.
2.6 Scripting
To allow complete data sets to be obtained easily, a script to create sets of job files was
written in the Bash shell scripting language, using a number of common tools found in
every UNIX operating system [33], including bc, sed, and cat, for portability. These
tools allowed us to create a quick and easy-to-use terminal interface that greatly sped
up the testing and benchmarking process. Initially we had thought it would be useful
to create an additional script to insert output result sets into an SQL database, but this
was later found to be much less efficient for processing data than initially thought, and
results were processed as comma-separated value files.
Chapter 3
Results
The benchmark executable was run on each system, from the smallest number of
processors that still made use of the communication facilities of the system, as
described in Section 2.2, up to the highest number of processors available or 1024,
whichever was smaller. Given that our FFT planning preparation was not parallelised,
every doubling of processors also doubled the compute time used in this step, and
going higher would have used an even more significant portion of our budget with
each step.
All executions were repeated at least three times, four where budgeting allowed, and
the minimum taken.
3.1 HECToR
3.1.1 Libraries
Figure 3.1 shows the total time spent in FFT calls in one run of the executable on
HECToR using 64 processors. In this case, the three libraries are very similar in
performance, though ACML consistently performs slightly better at higher extents, and
worse at small extents. The performance advantage could be due to AMD’s optimised
library having better knowledge of the cache sizes of the AMD processors, and thus
being able to make more use of them. Insufficient time was available to test this, but
the Portland Group compiler can be informed of the cache sizes of the processors used,
and can use this to optimise further during compilation; if this information were
provided when compiling FFTW3, it could result in performance increases over
ACML, though this is purely speculative.

FFTW3 also performs better than FFTW2 at every extent; given the many algorithms
available to both FFTW versions, this could be due to a different choice of algorithm,
or a new algorithm in version 3 that provides slightly better performance, e.g.
improved SSE support.

[Plot: time spent in FFT calls (s) vs. extent, for ACML, FFTW3, and FFTW2]
Figure 3.1: A comparison of the time spent in FFT calls for a rod decomposition 3D
FFT, for different libraries, using 64 processors, on HECToR.
[Plot: total time (s) vs. extent, comparing the 1D-FFT and 2D-FFT slab variants for ACML, FFTW3, and FFTW2]
Figure 3.2: A comparison of the total time taken for the slab decomposition 3D FFT
using purely 1D FFT calls versus a 2D followed by a 1D FFT call, for different
libraries, using 16 processors, on HECToR.
3.1.2 1D vs 2D FFT
Figure 3.2 shows the timings for the slab-decomposed 3D FFT using three 1D FFT
calls, against those using a 2D FFT followed by a 1D FFT call. Aside from a slight
difference when using FFTW2 at the largest extent, no significant difference is
exhibited between the two methods.
We found this was true across all the platforms we tested, and as such have omitted
further comparison of the two methods.
3.1.3 Slabs vs Rods
Figure 3.3 shows the total time taken for the two different decompositions of the
3DFFT on HECToR for the ACML library. As might be expected, the slab
decomposition is faster than the rod decomposition for every extent. It should be
noted, however, that there is no overlap between numbers of processors – according to
these performance figures, using the slower decomposition of the two will never be
slower than using half the number of processors with the faster decomposition.

[Plot: total time (s) vs. extent, for slab and rod decompositions at p = 2 to 1024]
Figure 3.3: A comparison of the total time taken for the two different decompositions
of the 3DFFT on HECToR, using the ACML library.
3.1.4 Scaling
Figure 3.4 shows the effect on the time taken to perform the 3D FFT of increasing the
number of processors, for given extents. As is usual for parallel programs involving
significant communication during calculation, we can see that as the data per processor
increases in quantity, better scaling (i.e. closer to ideal) can be demonstrated on larger
numbers of tasks. In general, this application seems to demonstrate consistent scaling
on HECToR up to large numbers of processors, meaning that it is saturating neither the
insertion nor the bisection bandwidth of the network. Since this is the condition under
which the rod decomposition would benefit, it is unsurprising that the rod
decomposition does not perform better on this platform.

[Plot: time (s) vs. tasks, for slab and rod decompositions at various extents, with ideal gradient]
Figure 3.4: A comparison of the total time taken for the 3DFFT on a data cube of the
given extents (x) on varying numbers of MPI tasks on HECToR, using the ACML
library. The ideal gradient is perfect scaling – in which doubling the number of tasks
halves the time taken.
We can see from these figures, however, that the greater number of messages sent in
the slab decomposition causes it to lose scaling consistency at smaller numbers of
tasks than the rod decomposition, as latency starts to dominate the communication
time – e.g. for x = 128, the slab decomposition shows significant deviation from ideal
scaling at 128 tasks, while the rod decomposition is still equally efficient. This effect
decreases as the quantity of data in each message, and thus the work involved in the
FFT steps, increases; at x = 512 it is unnoticeable at this resolution.
[Plot: time spent in FFT calls (s) vs. extent, for MKL, ACML, FFTW3, and FFTW2]
Figure 3.5: A comparison of the time spent in FFT calls for a rod decomposition 3D
FFT, for different libraries, using 8 processors, on Ness.
3.2 Ness
3.2.1 Libraries
As could be expected, Figure 3.5 shows that the MKL library for Intel processors
performs less well, and seemingly more noisily, than AMD’s library on an AMD
processor platform, though there is a crossover between the extents of 512 and 1024
that might bear investigating to determine whether it is a statistical anomaly. Similarly
to HECToR, and unsurprisingly given that Ness has similar processors, the other FFT
libraries perform quite similarly, though ACML performs better at larger extents, and
the FFTW libraries perform better, by a small margin, at smaller extents.
When using MKL, we were surprised by the large overhead we found to be dominating
the timing; subsequent testing revealed quite a large speed difference between the first
execution of the FFT using a given ’descriptor’ and the second, as shown in Figure 3.6.
We expect that this is because the various data-independent ’twiddle’ values are
calculated on the first execution of the FFT, rather than when the descriptor is
prepared, as we had assumed. To compensate, we altered the code to run the
benchmark for this library twice, but only after we had collected the majority of the
results. For this reason, the figures for Eddie and HLRB II reflect the unaltered code,
and thus show the values with the overhead.

[Plot: time spent in FFT calls (s) vs. extent, comparing the first and second MKL FFT calls]
Figure 3.6: A comparison of the time spent in FFT calls for a rod decomposition 3D
FFT, for MKL on the first call of the FFT routine and the second, using 8 processors,
on Ness.
3.2.2 Slabs vs Rods
Given that Ness is a shared-memory system, we might expect these results to be much
more similar than their equivalents on HECToR, but we can still see a marked
difference in performance between the two decompositions at high extents
(Figure 3.7). Given the erratic results at low extents, it is possible that the uniformity
at high extents is caused by the higher overhead and buffering used to maintain
efficient memory usage when transferring large quantities of data. The difference in
behaviour between high and low extents could bear further investigation.

[Plot: total time (s) vs. extent, for slab and rod decompositions at p = 1 to 16]
Figure 3.7: A comparison of the total time taken for the two different decompositions
of the 3DFFT on Ness, using the FFTW3 library.
3.2.3 Scaling
The two principal features to extract from Figure 3.8 are the uniform sub-ideal scaling
at high extents, and the very poor scaling at small extents using the slab
decomposition, compared with the slightly lesser effect on the rod decomposition
method – the slab decomposition becomes much less efficient (and predictable) at
lower extents, whereas the rod decomposition retains a great degree of uniformity and
maintains time improvements, however slight, up to 16 tasks in all cases save x = 16.
[Plot: time (s) vs. tasks, for slab and rod decompositions at extents x = 16 to 512, with ideal gradient]
Figure 3.8: A comparison of the total time taken for the 3DFFT on a data cube of the
given extents (x) on varying numbers of MPI tasks on Ness, using the FFTW3 library.
The ideal gradient is perfect scaling – in which doubling the number of tasks halves
the time taken.
3.3 HPCx
3.3.1 Libraries
Figure 3.9 shows the library comparison for HPCx. As might be expected, IBM’s own
library, optimised for IBM’s processors and other hardware, performs better than the
more portable libraries; it is notable, however, just how much more clear-cut the
difference between ESSL and the FFTW libraries is here, compared to that between
ACML or MKL and FFTW on the non-IBM systems.
3.3.2 Slabs vs Rods
The decomposition comparison for HPCx, shown in Figure 3.10, seems much less
clear-cut than that for HECToR (Figure 3.3) – in many cases, using more processors
results in slower overall timings for either decomposition. To a certain degree, this
could be attributed to noise in the machine; however, the deviations at high numbers of
processors are fairly extreme, and could indicate that the synchronisations for such
numbers of processors on HPCx are less than optimal.

[Plot: time spent in FFT calls (s) vs. extent, for FFTW3, FFTW2, and ESSL]
Figure 3.9: A comparison of the time spent in FFT calls for a rod decomposition 3D
FFT, for different libraries, using 32 processors, on HPCx.

[Plot: total time (s) vs. extent, for slab and rod decompositions at p = 32 to 1024]
Figure 3.10: A comparison of the total time taken for the two different decompositions
of the 3DFFT on HPCx, using the ESSL library.

[Plot: time (s) vs. tasks, for slab and rod decompositions at various extents, with ideal gradient]
Figure 3.11: A comparison of the total time taken for the 3DFFT on a data cube of the
given extents (x) on varying numbers of MPI tasks on HPCx, using the ESSL library.
The ideal gradient is perfect scaling – in which doubling the number of tasks halves
the time taken.
3.3.3 Scaling
As with Figure 3.4 for HECToR, Figure 3.11 shows that the slab decomposition scales worse at lower numbers of processors than the rod decomposition. HPCx appears to exhibit worse overall scaling than HECToR by this measure – efficiency is generally lower, as indicated by the mean gradient of each line and its deviation from the ideal towards the horizontal.
[Figure 3.12: A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 128 processors, on BlueSky.]
3.4 BlueSky
3.4.1 Libraries
As we expected, the ESSL library performs significantly better on the Blue Gene system than either of the FFTW libraries. The Blue Gene processor is a moderately esoteric type, and the ESSL library distributed with the Blue Gene software packages will have been specifically optimised by IBM to take advantage of the special features of the PowerPC 440 processor, e.g. the double floating-point unit, which provides special paired instructions for complex-number operations. The FFTW packages, compiled straightforwardly, may not have been able to take advantage of these features, and so will not use the processor most efficiently.
[Figure 3.13: A comparison of the total time taken for the two different decompositions of the 3D FFT on BlueSky, using the ESSL library.]
[Figure 3.14: A comparison of the total time taken for the two different decompositions of the 3D FFT on BlueSky, using the FFTW3 library.]
Chapter 3. Results 3.4. BlueSky
3.4.2 Slabs vs Rods
Unfortunately, our processing budget was depleted before we could capture results for the higher numbers of processors with ESSL, but the timings we have obtained for this library – shown in Figure 3.13 – demonstrate a relationship similar to the others shown so far: using the slab decomposition is faster than using a rod decomposition. In this case there is even overlap between the two between 32 and 64 processors – using 64 processors with the rod decomposition proves to be slower than using 32 with the slab decomposition. On Blue Gene this may be solvable with controlled process positioning within the node; here, however, we consider only the naive case. With FFTW3 we have obtained results for higher numbers of processors (Figure 3.14), which demonstrate even more overlap between slab and rod decomposition timings, indicating a communications library that performs well even for high-congestion communications.
3.4.3 Scaling
In the graphs demonstrating scaling on BlueSky, we see the effect of not filling partitions of the machine, combined with non-optimal default process placement within the partition: processor counts that do not use all the processors in a partition scale more poorly, as the mean link count between active processors is higher, leading to higher latency. (Partition sizes are 32, 128, and 512 processors.) This leads to the 'stepped' effect evident in Figure 3.16. It can be eliminated by careful placement of the active processors, as demonstrated by Heike Jagode [1], but usually there is no reason not to fill the partition.

If we discount the unfilled partitions, we see excellent scaling for all extents of 32 upwards – for this smallest size, the communication overheads seem to dominate for p ≥ 512.
[Figure 3.15: A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on BlueSky, using the ESSL library. The ideal gradient is perfect scaling, in which doubling the number of tasks halves the time taken.]
[Figure 3.16: A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on BlueSky, using the FFTW3 library. The ideal gradient is perfect scaling, in which doubling the number of tasks halves the time taken.]
[Figure 3.17: A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 16 processors, on Eddie.]
3.5 Eddie
3.5.1 Libraries
Library performance figures in this case, as with those for Ness, show that MKL has a uniquely high overhead attached to its FFT call. Otherwise, the performance of FFTW2 and FFTW3 is very similar, with FFTW3 appearing to perform slightly better at higher extents, but not to a significant degree.
3.5.2 Slabs vs Rods
Similarly to the other performance figures for this comparison, Eddie generally demonstrates better performance for the slab decomposition. In this case, however, there is much less of a difference between the two, and there is even one case – a stable series for 64 processors with an extent of 256 – in which the rod decomposition has outperformed the slab decomposition.

[Figure 3.18: A comparison of the total time taken for the two different decompositions of the 3D FFT on Eddie, using the FFTW3 library.]

There is some overlap between processor counts, but only at very low extent values and high processor counts, where it could be expected that the synchronisation inherent in the communication would overwhelm the computation time. The lack of difference between the two decompositions would seem to indicate either that the interconnect is not being used to its maximum efficacy, that the communication performance is comparable to the local memory performance, or that the communication steps are insignificant compared to the FFT calls. Comparison with Figure 3.17, and the technology on the Eddie platform, suggests that the first of these is the most likely; if true, the MPI implementation may suffer significantly with higher-complexity all-to-all communication. These issues could bear further investigation on this platform.
[Figure 3.19: A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on Eddie, using the FFTW3 library. The ideal gradient is perfect scaling, in which doubling the number of tasks halves the time taken.]
3.5.3 Scaling
Analysing the timing with respect to extent on Eddie would seem to indicate that network latency may suffer somewhat from random placement, as in the case of BlueSky, or else from network congestion adding to latency – there are slight incongruities in the graph that resemble the steps on BlueSky, but are more random. In general, however, Eddie appears to exhibit good scaling for x ≥ 48 – at this extent we still gain performance at 64 processors, while below it we fail to gain any time benefit from adding more processors past 16, and at x = 16 adding more processors becomes detrimental not only to efficiency but also to total time taken.
[Figure 3.20: A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 32 processors, on MareNostrum.]
3.6 MareNostrum
3.6.1 Libraries
Save for one extent value, 64, IBM's ESSL library slightly outperforms the two FFTW libraries at every extent. Again, this could be due to IBM having had the opportunity to carefully optimise the library for the platform – IBM supplied MareNostrum in its entirety, and could have tailored certain routines to take best advantage of the hardware, or optimally adjusted parameters in compilation.
3.6.2 Slabs vs Rods
Performance figures in this category once again demonstrate that the slab decomposition is significantly faster than the rod decomposition, with a few cases where a lower number of processors with the slab decomposition proves faster than a higher number with the rod decomposition.
[Figure 3.21: A comparison of the total time taken for the two different decompositions of the 3D FFT on MareNostrum, using the ESSL library.]
3.6.3 Scaling
Timings on MareNostrum do not seem to scale particularly well (Figure 3.22), with the performance increase from adding processors tailing off slightly sooner than we would expect. It is possible that this is due to sub-optimal process placement, as suggested for Eddie and BlueSky – high message latency, whatever its cause, could produce figures of this type.
[Figure 3.22: A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on MareNostrum, using the ESSL library. The ideal gradient is perfect scaling, in which doubling the number of tasks halves the time taken.]
3.7 HLRB II
3.7.1 Libraries
As with other platforms, HLRB II shows no significant performance difference between FFTW3 and FFTW2, and shows the significant performance overhead of MKL, though here it seems to be mitigated. This could be due to MKL running on fast, high-performance Intel hardware, rather than an AMD processor or a slightly older Intel chip.
3.7.2 Slabs vs Rods
Results for this platform are somewhat chaotic, with no clear trend emerging other than that, for most processor counts at high extents, using more processors caused the calculation to take less time – but even this is not dependable. It has been suggested that this noise is due to the different regions of the large quantity of memory the program requires not all being allocated locally initially, which would cause unexpected performance when transferring them between tasks. The performance of the interconnect under large all-to-all data transfers could bear further investigation.

[Figure 3.23: A comparison of the time spent in FFT calls for a rod decomposition 3D FFT, for different libraries, using 64 processors, on HLRB II.]

[Figure 3.24: A comparison of the total time taken for the two different decompositions of the 3D FFT on HLRB II, using the FFTW3 library.]

[Figure 3.25: A comparison of the total time taken for the 3D FFT on a data cube of the given extents (x) on varying numbers of MPI tasks on HLRB II, using the FFTW3 library. The ideal gradient is perfect scaling, in which doubling the number of tasks halves the time taken.]
3.8 Automatic Parallel Routines
For all our comparison timings, we used our own hand-coded routines. However, we also tested the built-in parallel calls of FFTW2 and ESSL on platforms where they were available. Figure 3.26 shows the time taken by the FFTW2 automatic MPI routines as a percentage of the time taken by our routines, on HECToR.
[Figure 3.26: A comparison of the time taken for a slab decomposition 3D FFT, for the automatic parallel routines, as a percentage of the time taken for our written routines, using FFTW2, on HECToR.]
The FFTW2 parallel routines demonstrated an extreme performance increase over our routines for smaller extents, but were slower for larger extents. We assumed that FFTW2 was performing not only optimised FFT routines, but optimised MPI calls. Instrumenting a run on Ness with the MPI profiling tool Vampir [34], however, revealed that, for 4 processors at least, FFTW2 internally uses the same MPI_Alltoall as our routines, as shown in Figure 3.27. The FFTW2 MPI code appears to have two different methods: one using all-to-all calls explicitly, used for out-of-place operations, and one using non-blocking point-to-point communications, used for in-place operations. We only tested the out-of-place operations, however.
We found the results obtained on HECToR to be generally true across platforms – the automatic routines were up to 2.5 times faster, especially for low extents, but could be slightly slower for extents x ≥ 512.
Using PESSL on HPCx, however, we found the opposite to be true – smaller extents were slightly slower than with our routines, and it was only at x ≥ 512 that the PESSL implementation was faster (Figure 3.28). Admittedly, this might be improved upon by using a pure BLACS implementation, rather than our approach of running BLACS on MPI within MPI.
[Figure 3.27: A Vampir trace timeline taken on Ness, showing the barrier preceding, and the communication performed within, the FFTW2 parallel routine.]
For the sake of comparison, we compiled the alpha release of FFTW3, FFTW3.2a3, on Ness, and compared the routines in the same fashion (Figure 3.29). Similarly
[Figure 3.28: A comparison of the time taken for a slab decomposition 3D FFT, for the automatic parallel routines using PESSL, on HPCx.]
[Figure 3.29: A comparison of the time taken for a slab decomposition 3D FFT, for the automatic parallel routines, as a percentage of the time taken for our written routines, using our compiled version of FFTW3.2a3, on Ness.]
to FFTW2, we found that the FFTW3 parallel routines could be much more efficient for small extents; moreover, the improvement, while diminished, was maintained for larger extents. We did not, unfortunately, have time to test this on a larger system, but the results are promising for the parallel routines in FFTW3.
Chapter 4
Discussion and Conclusions
We have made a number of specific observations; we may also bring these together to discuss, in more overarching terms, the results of this investigation and how it can be taken further.
4.1 Rods & Slabs
The results obtained strongly suggest that for our cubic data objects, on a system with a high-performance interconnect, a slab decomposition should outperform a rod decomposition in almost every case, with the slab decomposition tending to lose scaling at approximately p = x/2. If flexibility is needed, however, code to generate the rod decomposition can easily be modified to generate both types – in fact, this is the approach we used – with the slab decomposition used wherever possible.
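This dual-mode selection can be sketched as follows. The fragment below is illustrative code of our own devising, not taken from the benchmark: the function name `chooseGrid` and the rule that slabs require p to divide the extent are our assumptions. A slab decomposition corresponds to a p × 1 process grid, and a rod decomposition to an as-square-as-possible grid.

```c
/* Hypothetical helper: choose a process grid for a cube of extent x on
 * p tasks.  A slab decomposition uses a p x 1 grid and is preferred
 * whenever each task can hold whole planes (p divides x, p <= x); a
 * rod decomposition falls back to the most nearly square factoring of
 * p, which both all-to-all steps then operate over. */
static void chooseGrid(int p, int x, int grid[2])
{
    if (p <= x && x % p == 0) {   /* slab: whole planes per task */
        grid[0] = p;
        grid[1] = 1;
        return;
    }
    /* rod: near-square factoring, largest factor <= sqrt(p) */
    int f;
    for (f = 1; f * f <= p; f++)
        if (p % f == 0)
            grid[1] = f;
    grid[0] = p / grid[1];
}
```

For p = 1024 on a cube of extent 256, for example, the slab branch cannot apply (p exceeds x), so the rod branch produces the square 32 × 32 grid used in our runs.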
4.2 Libraries
Our main intention in testing the speeds of different libraries was to see how the vendor-supplied libraries compared to the oft-used, cross-platform FFTW libraries. It seems that although the FFTW libraries do not achieve the best performance on every platform, it is only on BlueSky, an unusual architecture, that they are outperformed by a uniformly wide margin; elsewhere they achieve very similar, and often slightly better, performance than the vendor libraries. We can attribute this to the optimisation of ESSL for the Blue Gene/L platform – optimising for this means designing routines to take best advantage of the PowerPC 440's unusual double floating-point unit, a process akin to, but different from, optimising for the Streaming SIMD Extensions available in recent x86 processors to speed floating-point calculations. ESSL seems generally to perform well on every platform upon which it is available.
We would therefore suggest that any software that makes use of FFT routines and favours portability over strictly higher performance should probably use the FFTW libraries, unless the developer can be absolutely certain that every platform the software will run on will provide the same vendor library.
4.3 Automatic Routines
Performance and ease of use suggest that if only a slab decomposition is needed, FFTW2's parallel library routines are a particularly good mechanism. It is hoped that when the FFTW3 parallel routines enter the stable version, they will perform as well. PESSL, on the other hand, may be easier to use if BLACS is already being employed – to avoid switching to MPI – but the evidence suggests it should not be chosen purely for performance reasons.
4.4 Improving the Benchmark
Having reviewed the benchmark code, we can identify a number of ways in which it might be improved. The complex number support is currently somewhat fragmented; as previously stated, given the bitwise compatibility of all the complex types, it is possible to operate on them without knowing which library is being used. The data arrays would then only need to be cast when being passed into FFT function calls. Rewriting the complex number handling functions in this way would allow much greater optimisation potential, as there would be only one implementation to attempt to accelerate.
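The casting approach can be sketched as below. This is illustrative code of our own, not the benchmark's: `libComplex`, `complexSum`, and `libFirstReal` are hypothetical names, with `libComplex` standing in for a library type laid out like FFTW's `double[2]`.

```c
#include <complex.h>

/* Many FFT libraries define their complex type as two contiguous
 * doubles; e.g. FFTW's fftw_complex has the layout double[2].  C99's
 * double complex is bitwise compatible with this layout, so arrays
 * need only be cast at the FFT call boundary, rather than converted
 * element by element. */
typedef double libComplex[2];   /* stand-in for a library's complex type */

/* Arithmetic can be written once, in portable C99 terms... */
static double complex complexSum(const double complex *a, int n)
{
    double complex s = 0.0;
    int i;
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* ...and the same storage viewed through the library's type where a
 * library call expects it (here we simply read the first real part). */
static double libFirstReal(libComplex *data)
{
    return data[0][0];
}
```

With one canonical representation, only `complexSum`-style helpers would need to be accelerated, and each FFT library would see the data through a single cast.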
As it stands, the benchmark executable performs only one full 3D FFT before exiting. In retrospect, this is wasteful, as the data-independent setup and planning data must be recalculated on each run. Making the executable perform multiple runs would be a fairly quick fix – in fact, we quickly implemented a temporary form of it for the MKL test, but did not implement the feature fully.
Much of the timing data is amalgamated within the program, being output as a total time plus two numbers which represent communication time and FFT time in the rod decomposition only. A more detailed timing readout could provide much more meaningful data for the same amount of time spent in computation.
4.5 Future Work
This study has explored many parameters, but none in extreme detail; therefore, there
is much related work that we could perform.
We have seen that the slab decomposition generally outperforms the rod decomposition, but both were load-balanced in every case. The rod decomposition offers more flexibility under this constraint, but a comparison between the rod decomposition and an unbalanced slab decomposition – in which the working arrays are padded up to a size suitable for a slab decomposition, or the more flexible MPI_Alltoallv function is used – could yield interesting results. Similarly, we have only used a rod decomposition with a processor grid that is square, or as square as possible. It is possible that this is sub-optimal, especially if the network topology is arranged such that a less balanced decomposition could have lower-latency communications for both all-to-all steps.
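One ingredient of such an unbalanced comparison is the per-task count and displacement arrays that MPI_Alltoallv requires. The sketch below is hypothetical (the name `planeCounts` and the plane-based units are our assumptions, not the benchmark's code) and shows one way of deriving them when p does not divide the extent.

```c
/* Hypothetical sketch: divide 'extent' planes among 'p' tasks when p
 * need not divide extent, producing the counts and displacements (in
 * planes) that an MPI_Alltoallv-based unbalanced slab decomposition
 * would need.  The first (extent % p) tasks receive one extra plane;
 * displacements are the running sum of the counts. */
static void planeCounts(int extent, int p, int counts[], int displs[])
{
    int base = extent / p;     /* minimum planes per task  */
    int rem  = extent % p;     /* tasks that get one extra */
    int offset = 0;
    int i;
    for (i = 0; i < p; i++) {
        counts[i] = base + (i < rem ? 1 : 0);
        displs[i] = offset;
        offset += counts[i];
    }
}
```

Scaled by the number of elements per plane, these arrays could be passed directly as the count and displacement arguments of MPI_Alltoallv, removing the requirement that p divide the extent.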
The MPI_Alltoallv function could in fact be used to construct the whole rod decomposition, but this is a complex operation to perform. If MPI_Alltoall over subsets of processors performs worse than MPI_Alltoallv over the global set, this could give performance gains over the current rod decompositional method. A general benchmark comparing the different possible techniques for performing the communications used in the parallel 3D FFT would be of general interest. This could consist of latency, insertion bandwidth, bisection bandwidth and derived-datatype packing speed tests, as well as a comparison of the speed of all-to-all exchanges using the normal all-to-all calls against non-blocking point-to-point calls, and tests of memory latency and bandwidth. It should be possible to use all these factors to build a picture of what will limit parallel FFT speed in an application, and whether it can be improved.
One of the parameter restrictions to which we have given less attention is the restriction to processor counts p = 2^n, where n is a positive integer. Typical collective algorithms can perform best with this type of restriction in place, and an investigation of the effects of relaxing it may yield interesting results.
After improving the benchmark application, additional work could make it suitable for general distribution and use – currently it can require considerable knowledgeable intervention to compile against the different libraries it supports, and this could be automated given enough time. The kernel of the benchmark could even be made into a generally usable, highly configurable library performing parallel 3D FFTs in other applications. This would require significant work, however, including performance tuning and optimisation research, and given the promise shown by the FFTW3 parallel routines, it is possibly not a useful avenue of research.
With the growing tendency towards the utilisation of greater and greater numbers of processors, and towards multicore processing, making efficient use of these processing units becomes more and more challenging. The slab decomposition may be easier to implement and suitable for many applications, but it may be inadequate for systems like BlueSky, with its large numbers of relatively low-powered processors. At the time of writing, IBM has in development an experimental library for the Blue Gene platform, designed to make optimal use of its toroidal network [35]. It would be interesting to see whether a library making use of FFTW3 could perform comparably – a mature, high-performance, portable, and open-source solution could greatly benefit the myriad fields in which this technique is employed.
Bibliography
[1] H. Jagode, “Fourier Transforms for the BlueGene/L Communication Network,”
Master’s thesis, EPCC, 2006.
[2] U. Sigrist, “Optimizing Parallel 3D Fast Fourier Transformations for a Cluster of
IBM POWER5 SMP Nodes,” Master’s thesis, EPCC, 2007.
[3] C. F. Gauss, “Nachlass: Theoria interpolationis methodo nova tractata,” Werke,
vol. 3, pp. 265–327, 1866.
[4] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of com-
plex Fourier series,” Math. Comput., vol. 19, pp. 297–301, 1965.
[5] M. T. Heideman, D. H. Johnson, and C. S. Burrus, “Gauss and the history of the
fast Fourier transform,” IEEE ASSP Magazine, vol. 1, no. 4, pp. 14–21, 1984.
[6] C. M. Rader, “Discrete Fourier transforms when the number of data samples is
prime,” Proceedings of the IEEE, vol. 56, no. 6, pp. 1107–1108, 1968.
[7] L. I. Bluestein, “A linear filtering approach to the computation of the discrete
Fourier transform,” Northeast Electronics Research and Engineering Meeting
Record, vol. 10, pp. 218–219, 1968.
[8] J. Hein, A. Simpson, A. Trew, H. Jagode, and U. Sigrist, “Parallel 3D-FFTs for
Multi-processing Core nodes on a Meshed Communication Network,” in Proceedings of the CUG 2008, 2008.
[9] M. Frigo and S. G. Johnson, “The Design and Implementation of FFTW3,” Pro-
ceedings of the IEEE, vol. 93, no. 2, pp. 216–231, 2005, special issue on "Program
Generation, Optimization, and Platform Adaptation".
[10] (2008, August) Engineering Scientific Subroutine Library (ESSL) and Parallel
ESSL. [Online].
Available: http://www-03.ibm.com/systems/p/software/essl/index.html
[11] (2008, August) The Basic Linear Algebra Communication Subprograms Project.
[Online].
Available: http://www.netlib.org/blacs/
[12] (2008, August) AMD Core Math Library 4.1.0 User Guide. [Online].
Available: http://developer.amd.com/assets/acml_userguide.pdf
[13] (2008, August) Intel Math Kernel Library Reference Manual version 024.
[Online].
Available: http://softwarecommunity.intel.com/isn/downloads/
softwareproducts/pdfs/347468.pdf
[14] (2008, August) HECToR - UK National Supercomputing Service. [Online].
Available: http://www.hector.ac.uk/
[15] (2008, August) HECToR and the University of Edinburgh. [Online].
Available: http://www.hector.ac.uk/about-us/partners/uoe/
[16] (2008, August) Cray XT4 and XT3 Supercomputers. [Online].
Available: http://www.cray.com/products/xt4/
[17] (2008, August) The Portland Group. [Online].
Available: http://www.pgroup.com/
[18] (2008, August) Pathscale 64-bit Compilers. [Online].
Available: http://www.pathscale.com/
[19] (2008, August) HPCx. [Online].
Available: http://www.hpcx.ac.uk/
[20] (2008, August) Daresbury SIC: HPCx. [Online].
Available: http://www.daresburysic.co.uk/facilities/expertise/hpcx
[21] O. Lascu, Z. Borgosz, P. Pereira, J.-D. S. Davis, and A. Socoliuc, An Introduction
to the New IBM eserver pSeries High Performance Switch. IBM, 2003.
[22] (2008, August) EPCC - Ness. [Online].
Available: http://www2.epcc.ed.ac.uk/∼ness/documentation/ness/index.html
[23] (2008, August) EPCC - Blue Gene. [Online].
Available: http://www2.epcc.ed.ac.uk/∼bgapps/user_info.html
[24] G. Almási, C. Archer, J. G. Castaños, J. A. Gunnels, C. C. Erway, P. Heidelberger,
X. Martorell, J. E. Moreira, K. Pinnow, J. Ratterman, B. D. Steinmacher-Burow,
W. Gropp, and B. Toonen, “Design and implementation of message-passing ser-
vices for the Blue Gene/L supercomputer,” IBM Journal of Research and Devel-
opment, vol. 49, no. 2/3, 2005.
[25] (2008, August) ECDF. [Online].
Available: http://www.ecdf.ed.ac.uk/
[26] (2008, August) Barcelona Supercomputing Centre. [Online].
Available: http://www.bsc.es/
[27] (2008, August) LRZ: Höchstleistungsrechner in Bayern (HLRB II). [Online].
Available: http://www.lrz-muenchen.de/services/compute/hlrb/
[28] A. Dubey and D. Tessera, “Redistribution strategies for portable parallel FFT: a
case study,” Concurrency and Computation: Practice and Experience, vol. 13,
no. 3, pp. 209–220, 2001.
[29] The Current ISO C 99 Standard (with technical corrigenda TC1, TC2 and TC3).
[Online].
Available: http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1256.pdf
[30] ANSI C — ANS X3.159-1989, Programming Language C, 1989.
[31] (2008, August) Autoconf - a tool for generating configure scripts. [Online].
Available: http://www.gnu.org/software/autoconf/
[32] (2008, August) pkg-config. [Online].
Available: http://pkg-config.freedesktop.org/wiki/
[33] X/Open CAE Specification, System Interface Definitions, Issue 4 Version 2,
September 1994.
[34] (2008, August) Vampir - MPI Instrumentation. [Online].
Available: http://www.vampir.eu/
[35] (2008, August) 3D Fast Fourier Transform Library for Blue Gene/L. [Online].
Available: http://www.alphaworks.ibm.com/tech/bgl3dfft
Appendix A
All-to-all Data Rearrangement
The actual code used to rearrange the data before and after the all-to-all calls is somewhat obfuscated, largely due to the complex operation being performed on the current index. It loads input data contiguously, to aid cached data reuse, and uses integer division where appropriate to discard remainders.

It essentially performs transposes across subarrays in whichever dimension an all-to-all is about to take place, both organising the data into the order it should have in its new state, and making it contiguous to allow the use of MPI_Alltoall without a non-primitive datatype specification.

The unpack routine then takes the correctly ordered blocks after the all-to-all and spaces them correctly within the array.
Figure A.1: The code used to rearrange data prior to the all-to-all.
/* domainSize[2] -> an array containing how many rods each processor has,
**                  in each dimension of the 2D decomposition            */
/* extent        -> the extent across one edge of the data cube          */
/* *dataIn       -> a pointer to the input data                          */
/* *dataOut      -> a pointer to the buffer to be used for the all-to-all */

void ataRowRearrange(complexType *dataIn, complexType *dataOut,
                     int domainSize[2], int extent)
{
    /* Rearranges the data in a domain such that all the data
    ** that needs to be sent to one processor is contiguous and
    ** in the right order, for an all-to-all across rows of a
    ** 2D decomposition of a 3D array. */
    int i;

    /* Loop over every element this processor holds */
    for (i = 0; i < domainSize[0] * domainSize[1] * extent; i++) {
        /* Assign complex number from pointer to pointer */
        complexAssign(&dataOut[
            ( ( i % domainSize[0] ) * domainSize[0] ) +
            ( ( ( i % extent ) / domainSize[0] )
                  * domainSize[0] * domainSize[0] * domainSize[1] ) +
            ( ( i / extent ) % domainSize[0] ) +
            ( ( i / ( domainSize[0] * extent ) )
                  * domainSize[0] * domainSize[0] )
        ], dataIn[i]);
    }
}

void ataColRearrange(complexType *dataIn, complexType *dataOut,
                     int domainSize[2], int extent)
{
    /* Rearranges the data in a domain such that all the data
    ** that needs to be sent to one processor is contiguous and
    ** in the right order, for an all-to-all across cols of a
    ** 2D decomposition of a 3D array. */
    int i;

    for (i = 0; i < domainSize[0] * domainSize[1] * extent; i++) {
        complexAssign(&dataOut[
            ( i % extent ) * domainSize[0] * domainSize[1] +
            ( ( i / extent ) % domainSize[0] ) * domainSize[1] +
            ( i / ( domainSize[0] * extent ) )
        ], dataIn[i]);
    }
}
Figure A.2: The code used to unpack data after the all-to-all.
void ataRowUnpack(complexType *dataIn, complexType *dataOut,
                  int domainSize[2], int extent)
{
    /* Unpacks data after all-to-all across rows. */
    int i;

    for (i = 0; i < domainSize[0] * domainSize[1] * extent; i++) {
        complexAssign(&dataOut[
            ( i % domainSize[0] ) +
            ( ( ( i / domainSize[0] ) % ( domainSize[0] * domainSize[1] ) )
                  * extent ) +
            ( ( i / ( domainSize[0] * domainSize[0] * domainSize[1] ) )
                  * domainSize[0] )
        ], dataIn[i]);
    }
}

void ataColUnpack(complexType *dataIn, complexType *dataOut,
                  int domainSize[2], int extent)
{
    /* Unpacks data after all-to-all across cols. */
    int i;

    for (i = 0; i < domainSize[0] * domainSize[1] * extent; i++) {
        complexAssign(&dataOut[
            ( i % domainSize[1] ) +
            ( ( ( i / domainSize[1] ) % ( domainSize[0] * domainSize[1] ) )
                  * extent ) +
            ( ( i / ( domainSize[0] * domainSize[1] * domainSize[1] ) )
                  * domainSize[1] )
        ], dataIn[i]);
    }
}
Appendix B
Patching FFTW3 to Blue Gene/L
FFTW3.2a3 failed to compile "out of the box" on the BlueSky system, and
the following patch, devised with the assistance of Matteo Frigo of MIT,
was applied to kernel/cycle.h. The problem is believed to stem from an
incompatibility between the Blue Gene/L and other, more common PowerPC
chips: the patch adds a Blue Gene/L-specific section which uses an
IBM-compiler-specific call to the chip's native timing routines.
131a132,174
>
>
>
> /*----------------------------------------------------------------*/
> /*
>  * Blue Gene/L version of ``cycle'' counter using the time
>  * base register.
>  */
> /* 64 bit */
> #if defined(__blrts__) && (__64BIT__) && !defined(HAVE_TICK_COUNTER)
> typedef unsigned long long ticks;
>
> static __inline__ ticks getticks(void)
> {
>      return __mftb();
> }
>
> INLINE_ELAPSED(__inline__)
>
> #define HAVE_TICK_COUNTER
> #endif
>
> /* 32 bit */
> #if defined(__blrts__) && !defined(HAVE_TICK_COUNTER)
> typedef unsigned long long ticks;
>
> static __inline__ ticks getticks(void)
> {
>      unsigned int tbl, tbu0, tbu1;
>
>      do {
>           tbu0 = __mftbu();
>           tbl  = __mftb();
>           tbu1 = __mftbu();
>      } while (tbu0 != tbu1);
>      return (((unsigned long long)tbu0) << 32) | tbl;
> }
>
> INLINE_ELAPSED(__inline__)
>
> #define HAVE_TICK_COUNTER
> #endif
>
Figure B.1: The patch, applied to cycle.h.
Appendix C
Readme for Software Package
We could not obtain direct access to the MareNostrum platform, and instead
provided David Vicente of the Barcelona Supercomputing Centre with our
software and a "readme" file explaining how to use it. This file is
reproduced below.
=== 3D FFT Benchmark ===
Unfortunately, I didn’t have time to learn how to
use autoconf for this, so some manual editing of the
Makefile may be required.
The list of steps required to run this consists of:
1) Compile all versions.
2) Make a template file for the job scripts.
3) Run batchmaker.
4) Move jobs and executables to a staging directory if necessary.
5) Submit jobs.
== 1 - Compile all versions ==
In an ideal environment, you can just:
make LIB=fftw3
make sweep
make LIB=fftw2
make sweep
make LIB=essl
make sweep
make LIB=mkl
make sweep
make LIB=acml
... whichever apply on the system.
Other settings are:
CC=[gcc|pgcc|xlc|icc|xlc-bg]
Sets the compiler type underlying the usual MPI compiler wrapper
(for purposes of compiler flags); defaults to gcc.
If you have none of these, you can set the flags used for compilation
separately using:
CFLAGS=
By default, contains optimisation flags appropriate for the above.
MPICC=
Contains the name of the MPI compiler wrapper. Defaults to ’mpicc’.
EXTRAFLAGS=
Empty by default, added to every compilation line. Use for flags you
need to include to specify extra libraries needed to link against on
your system, or -L and -I flags to specify library locations.
The makefile assumes maximum capabilities for each library by default
(for SYSTEM=generic), which means that FFTW2 is assumed to be compiled
with MPI support and without type-prefixes (use LIB=dfftw2 otherwise),
that MKL includes parallel support, and that ESSL includes PESSL.
== 2 - Make a template file ==
There are a number of templates in the templates directory which may be
reconfigurable for your batch system - you may be able to take the PBS
or SGE one directly and merely alter the account code. template.template
contains a list of all the keytags that batchmaker replaces, as well as a
non-specific template form.
== 3 - Run batchmaker ==
./batchmaker.sh fft-*
will usually do the job, assuming you’re in the directory where you
compiled the fft executables. Batchmaker is pretty self-explanatory
to use, and generates a pile of job-version-cpucount.nys files, which
are job files to be submitted.
== 4 - Move files to staging directory ==
If you need to, move all the *.nys files and the fft-* executables to a
staging directory at this point...
== 5 - Submit ==
And submit them, however you do that on your system.
Appendix D
Work Plan and Organisation
Our workplan changed significantly during the project, because we had
underestimated both the complexity of the 3D FFT operation and the
difficulty of porting the benchmarking application to all the libraries
and platforms. Additionally, a number of steps that we had marked as
discrete blended together where practical.
Our original workplan follows:
WP 1 — Write Generic FFT Code – 2.5 weeks
WP 2 — Write Execution and Result Collation Scripts – 2 weeks
WP 3 — Port and Add Library Support – 3 weeks
WP 4 — Perform Experimental Runs and Analyse Results – 2 weeks
WP 5 — Make any Necessary Adjustments to Software – 1 week
WP 6 — Perform Complete Runs – 1 week
WP 7 — Complete Write-up – 4 weeks
In reality, modifications were made to the code almost continuously
throughout the project, though the initial working version took longer
than expected to develop, owing to the complexity of the data
rearrangement algorithm given in Appendix A. Once all the code except
this section was complete, the execution and collation scripts were
written in parallel with work on this algorithm. We considered
portability throughout the implementation, despite not adding the extra
library calls until the porting phase began, and the execution scripts
were designed from the outset to be modular and portable. The porting
phase took longer than expected, as we encountered unforeseen issues on
each platform; and the experimental runs and adjustments, rather than
being performed as discrete, whole tasks, were carried out for each
platform in turn.
In retrospect, we should have planned our various testing procedures
more carefully, to save time and resources. We wasted a considerable
amount of time on non-specific tests, when a step-by-step analysis of
each facet of the process for obtaining our results would probably have
been far more economical.
Planning such projects at this level of detail is hardly an exact and
rigorous procedure, however, and while we may not have adhered to our
plan, designing it at such an early stage did give us a good indication
of the complexity of the project and of the steps and dependencies it
required.