GRIDDING FOR RADIO ASTRONOMY
ON COMMODITY GRAPHICS HARDWARE
USING OPENCL
Alexander Ottenhoff
School of Electrical and Electronic Engineering
University of Western Australia
Supervisor
Dr Christopher Harris
Research Associate
International Centre for Radio Astronomy Research
Co-Supervisor
Associate Professor Karen Haines
Western Australian Supercomputer Program
October 2010
16 Arenga Crt, Mount Claremont WA 6010
October 29, 2010
The Dean
Faculty of Engineering, Computing and Mathematics
The University of Western Australia
35 Stirling Highway
CRAWLEY WA 6009
Dear Sir,
I submit to you this dissertation entitled "Gridding for Radio Astronomy on Commodity Graphics Hardware using OpenCL" in partial fulfilment of the requirements for the award of Bachelor of Engineering.
Yours faithfully,
Alexander Ottenhoff
Abstract
With the emergence of large radio-telescope arrays, such as the MWA, ASKAP and
SKA, the rate at which data is generated is nearing the limits of what can currently
be processed or stored in real time. Since processor clock rates have plateaued,
computer hardware manufacturers are trying different strategies, such as develop-
ing massively parallel architectures, in order to create more powerful processors.
A major challenge in high performance computing is the development of parallel
programs which can take advantage of new processors. Due to their extremely high
instruction throughput and low power consumption, fully programmable Graphics
Processing Units (GPUs) are an ideal target for radio-astronomy applications. This
research investigates gridding, a very time-consuming stage of the radio astronomy
image synthesis process, and the challenges involved in devising and implementing a
parallel gridding kernel optimised for programmable GPUs using OpenCL. A paral-
lel gridding implementation was developed, which successfully outperformed a single
threaded reference program for gridding in all but the smallest test cases.
Acknowledgements
I thank my supervisors Christopher Harris and Professor Karen Haines for providing
guidance throughout the course of this project. They provided feedback and advice
on my work and helped me refine my research and academic writing skills.
Thanks to Xenon Technologies for providing the computer used throughout this
project.
I’d like to acknowledge the technical support staff at WASP, Jason Tan and Khanh
Ly, for providing me with access to WASP facilities and setting up the computer
used during this project.
Thanks to Paul Bourke for providing me with a small CUDA project that got me
started in GPU programming.
Thanks also to Derek Gerstmann for organising the OpenCL Summer School where
I was able to become familiar with the OpenCL API before starting this project.
I’d also like to thank Ankur Sharda and Stefan Westerlund, with whom I shared
the Hobbit Room, for offering suggestions and feedback on various ideas.
Finally, thanks to my family for supporting me over the course of this project. In
particular my mother for staying up all night proofreading the final version of this
document.
Contents
Abstract iii
Acknowledgements v
List of Figures viii
1 Introduction 1
2 Background 5
2.1 Radio Astronomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Aperture Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Gridding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Parallel Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 OpenCL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3 Literature Review 19
4 Model 23
4.1 Scatter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Gather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3 Pre-sorted Gather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5 Testing 33
6 Discussion 45
6.1 Work-Group Optimisation . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2 Performance Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.3 Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . 48
7 Conclusion 51
7.1 Project Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.2 Future Consideration . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
A Original Proposal 57
List of Figures
2.1 Aperture synthesis data processing pipeline. . . . . . . . . . . . . . . 9
2.2 Overview of the gridding operation . . . . . . . . . . . . . . . . . . . 11
2.3 Comparison of CPU and GPU architectures. . . . . . . . . . . . . . . 14
(a) CPU Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
(b) GPU Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 OpenCL memory hierarchy. . . . . . . . . . . . . . . . . . . . . . . . 17
4.1 Gridding with a scatter kernel. . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Gridding with a gather kernel. . . . . . . . . . . . . . . . . . . . . . . 29
4.3 Gridding with a pre-sorted gather kernel. . . . . . . . . . . . . . . . . 31
5.1 Thread topology optimisation . . . . . . . . . . . . . . . . . . . . . . 35
5.2 Performance profile of GPU gridding implementation compared with
CPU gridding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.3 Performance profile of GPU gridding implementation with sorting
running on the CPU. . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.4 CPU and GPU gridding performance for a varying number of visibilities 40
5.5 Thread optimisation for a range of convolution filter widths . . . . . . 41
5.6 CPU and GPU gridding performance for a varying convolution filter
width . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Chapter 1
Introduction
Astronomers can gain a better understanding of the creation and early evolution
of the universe, test theories and attempt to answer many questions in physics by
producing images from radio waves emitted by distant celestial entities. With the
construction of vast radio-telescope arrays, such as the Murchison Wide-field Array
(MWA), Australian SKA Path-finder (ASKAP) and Square Kilometre Array (SKA),
many engineering challenges have to be overcome. ASKAP alone will generate data
at a rate of 40 Gb/s, producing over 12 PB in a single month [6], and the SKA will produce
several orders of magnitude more, so data processing and storage are major issues.
As we reach the limit of how fast we can make single-core CPUs run, we need to look
to parallel processors such as multi-core CPUs, GPUs and digital signal processors
to process this vast amount of data. One of the biggest problems limiting the popularity
of parallel processors has been the lack of a standard language that runs on a wide
variety of hardware. To address this, the Khronos Group produced the OpenCL
standard.
OpenCL is an open standard for heterogeneous parallel programming [17]. One of
the major advantages of code written in OpenCL is that it allows programmers to
write software capable of running on any device with an OpenCL driver, eliminating
the need to rewrite large amounts of code for each vendor’s hardware. This partially
solves the issue of vendor lock-in, a major problem in general purpose GPU
(GPGPU) programming up until now, where, due to the lack of standardisation,
software is often restricted to running on a series of architectures produced by a
single company.
In this project I aim to develop an efficient parallel algorithm in OpenCL for the
gridding stage of radio-interferometric imaging, which has traditionally been the
most time-consuming stage of the imaging process [30]. Due
to the large amount of data that will be generated by the next generation of radio
telescopes, the amount of data which can be processed in real-time may be a serious
performance bottleneck. Since a cluster of GPUs with equal computational perfor-
mance to a traditional supercomputer consumes a fraction of the energy, an efficient
OpenCL implementation would be a significantly less expensive option. I will
primarily target GPU architectures, in particular the NVIDIA Tesla C1060, although
I will also attempt to benchmark and compare performance on several different
devices.
Chapter 2, the background, will explain the theory behind radio astronomy with
a focus on the aperture synthesis process. It will provide an overview of GPU
architectures, NVIDIA’s Tesla series of graphics cards and OpenCL. In Chapter 3,
the literature review, previous implementations of gridding on other heterogeneous
parallel architectures will be discussed. The model in Chapter 4 will provide a
detailed explanation of gridding and detail several ways of adapting it to GPU
hardware. Chapter 5 will outline and present the results of various tests performed in
order to determine the parameters which result in the best performance of the GPU
based gridding algorithm. The results of these tests will be discussed in Chapter 6,
as well as other discoveries made over the course of this project. Finally, Chapter 7
will summarise the important results of this work and outline possible areas for
future research.
Chapter 2
Background
This chapter will explain background information on various topics that are useful for
understanding this project. It will discuss the theory behind radio astronomy, as well
as what scientists in this field can discover. An overview of the aperture synthesis
process, used to generate 2-dimensional images from multiple radio telescopes, will
be given. General Graphics Processing Unit (GPU) design will then be outlined,
with a particular focus on the NVIDIA GT200 architecture used in the Tesla C1060.
OpenCL will also be discussed, explaining the features used in this project and
why it was chosen over other programming languages.
2.1 Radio Astronomy
Radio astronomy is the branch of astronomy which focuses on observing electro-
magnetic waves emitted by celestial entities lying in the radio band of the electro-
magnetic spectrum. While the visible light spectrum observed by optical telescopes
can pass through the atmosphere with only a small amount of atmospheric
distortion, radio waves with wavelengths ranging from 3 cm to around 30 m pass
through the Earth’s atmosphere almost completely undistorted. Also, unlike visible waves which are
mostly produced by hot thermal sources, such as stars, radio waves can originate
from a wide variety of sources including gas clouds, pulsars and even background
radiation left over from the big bang [31]. It is also possible to observe radio waves
through clouds as well as during the day, when the amount of light emitted by the
Sun vastly exceeds that which reaches Earth from distant sources, which allows ra-
dio telescopes to operate when optical astronomy is impossible.
Due to the long wavelengths of the signals being measured, radio telescopes are
generally far larger than their optical counterparts. For a single dish style radio
telescope, the angular resolution R of the image generated from a signal of wave-
length λ is related to the diameter of the dish D.
R =λ
D(2.1)
Since R is a measure of the finest detail a telescope can resolve, a dish designed to
create detailed images of signals below 1 GHz would need to be several hundred
metres in diameter. Constructing a dish of this size is both difficult and extremely
expensive and, for wavelengths longer than around 3 metres, the diameter required
for a good resolution can surpass what can realistically be constructed. A technique
called radio interferometry makes it possible to combine multiple telescopes
to make observations with a finer angular resolution than any of the telescopes
could achieve individually. When using this technique, no single telescope measures the
brightness of a frequency in the sky directly. Instead, each pair of telescopes in the
array measures a component of the brightness and combines this data in a process
known as aperture synthesis.
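Equation 2.1 can be made concrete with a short calculation. This is an illustrative sketch only: the 1 GHz frequency and one-arcminute target resolution below are assumptions chosen for the example, not values from this chapter.

```python
import math

# Angular resolution of a single dish (Equation 2.1): R = wavelength / D,
# with R in radians. Rearranged, the required diameter is D = wavelength / R.
def required_diameter(wavelength_m, resolution_rad):
    return wavelength_m / resolution_rad

# Illustrative values (assumed): a 1 GHz signal has a wavelength of ~0.3 m.
wavelength = 3e8 / 1e9               # c / f, approximately 0.3 m
one_arcmin = math.radians(1 / 60)    # one arcminute in radians

d = required_diameter(wavelength, one_arcmin)  # roughly a kilometre
```

A dish of roughly a kilometre in diameter for arcminute resolution at 1 GHz illustrates why single dishes of this scale are impractical and interferometry is used instead.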
2.2 Aperture Synthesis
Aperture synthesis works by combining signals from multiple telescopes to produce
an image with a resolution approximately equal to that of a single dish with a
diameter equal to the maximum distance between antennae. The aperture synthesis process
is made up of several stages, which transform the signals measured by each pair of
telescopes in an array into a two dimensional image, as shown in Figure 2.1.
The first stage of this process involves taking the signals from each pair of antennas
and cross-correlating them to form a baseline. The relationship between the number
of antennas in an array, a, and the total number of baselines, b, including each
antenna correlated with itself, is shown in Equation 2.2:

b = a(a − 1)/2 + a    (2.2)
These signals are combined to produce a set of complex visibilities, one for each
baseline, frequency channel and period of time. The complex visibilities for each
baseline are created by cross-correlating sampled voltages from a pair of telescopes.
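The baseline relationship (Equation 2.2) can be checked with a few illustrative antenna counts; the counts below are assumptions for the example, not array sizes taken from the text.

```python
# Number of baselines (Equation 2.2), counting each pair of antennas once
# plus each antenna correlated with itself: b = a(a - 1)/2 + a.
def num_baselines(a):
    return a * (a - 1) // 2 + a

# Illustrative antenna counts (assumed):
counts = {a: num_baselines(a) for a in (8, 36, 128)}
# e.g. 8 antennas give 36 baselines, and 128 antennas give 8256,
# showing how the data rate grows quadratically with array size.
```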
The next stage is to calibrate the visibilities to remove noise introduced by atmo-
spheric interference and small irregularities in the shape and positioning of the radio
dishes. The calibrated visibilities can then be used to generate a two-dimensional
image by converting them to the spatial domain. These visibilities are first mapped
to a regular two-dimensional grid in a process referred to as gridding. This is fol-
lowed by applying the two-dimensional inverse Fast Fourier Transform (FFT) to the
gridded visibilities, converting them to the spatial domain. The output of this oper-
ation is known as the dirty image, because it still contains some artifacts introduced
during the aperture synthesis process.
In order to remove these synthesis artifacts the dirty image is finally processed with
a deconvolution technique. Two common algorithms used to perform this operation
are the CLEAN algorithm [2] and the Maximum Entropy Method (MEM) [25]. The
CLEAN algorithm works by finding and removing point sources in the dirty image,
and then adding them back to the image after removing associated side lobe noise.
The MEM process involves defining a function to describe the entropy of an image
and then searching for the maximum of this function. The result of the deconvolu-
tion process is known as a clean image. Several radio astronomy software packages
exist which are able to perform the aperture synthesis process including Miriad [21],
which is used in this project. Of the stages used in aperture synthesis, gridding is
the focus of this research and will be discussed in more depth.
Figure 2.1: Aperture synthesis data processing pipeline [30]. Shown is an overview
of the major software stages involved in taking sampled radio-wave data from a pair
of radio telescopes and generating an image. The signals from a pair of telescopes
are correlated with each other to provide a stream of visibilities. These visibilities
are then calibrated to correct for irregularities in the telescope’s dish, small errors
in the telescope’s alignment and to account for some atmospheric interference. After
being calibrated these visibilities are converted into a two-dimensional image through
a three stage process consisting of interpolation to a regular grid, transformation to
the spatial domain with a Fast Fourier Transform (FFT) and deconvolution using a
technique such as the CLEAN algorithm [2] or the Maximum Entropy Method [25].
2.3 Gridding
Gridding is the stage of the aperture synthesis process which converts a list of cal-
ibrated visibilities into a form that can be transformed to the spatial domain with
an inverse Fast Fourier Transform. This operation involves sampling the measured
visibilities to a two-dimensional grid aligned with the u and v axes which are defined
for each baseline by the Earth’s rotation. An example of visibilities measured by a
telescope array containing eight baselines is shown in Figure 2.2. In order to
minimise aliasing effects in the image plane, i.e. distortion introduced due to sampling,
each visibility is mapped across a small region of the grid defined by a convolution
window. The convolution function used in this project is the spheroidal function,
which emphasises aliasing suppression near the centre of the image, typically near
the object of interest [23].
Typically, instead of computing the spheroidal function every time it is used,
coefficients are generated ahead of time and stored in an array. Because the same
function is used for gridding a visibility in both the u and v directions, the coefficients of the
the length of the convolution array and the width of the convolution function is
known as the oversampling ratio. Because a high oversampling ratio results in bet-
ter suppression of aliasing in the final image, the convolution array is significantly
larger than the width of the function.
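The lookup scheme described above can be sketched in serial form as follows. This is an illustrative model only, not the Miriad implementation: the Gaussian window stands in for the spheroidal function, and all names and parameter values are assumptions made for the example.

```python
import numpy as np

def make_conv(half_width, oversample):
    # Oversampled 1-D convolution coefficients. A Gaussian is used here as a
    # stand-in for the spheroidal function (assumption, not the real filter).
    n = (2 * half_width + 1) * oversample + 1
    offsets = (np.arange(n) - n // 2) / oversample  # offsets in grid cells
    return np.exp(-offsets ** 2)

def grid_visibilities(vis, grid_size, half_width=3, oversample=8):
    """Serial gridding sketch: convolve each visibility onto a regular grid."""
    conv = make_conv(half_width, oversample)
    centre = len(conv) // 2
    grid = np.zeros((grid_size, grid_size), dtype=complex)
    for u, v, value in vis:              # u, v already in grid coordinates
        iu, iv = int(round(u)), int(round(v))
        for dv in range(-half_width, half_width + 1):
            # The same 1-D array serves both axes, so each 2-D coefficient
            # is the product of two oversampled 1-D lookups.
            cv = conv[centre + round((dv - (v - iv)) * oversample)]
            for du in range(-half_width, half_width + 1):
                cu = conv[centre + round((du - (u - iu)) * oversample)]
                grid[iv + dv, iu + du] += value * cu * cv
    return grid

g = grid_visibilities([(16.3, 16.7, 1 + 0j)], grid_size=32)
```

The oversampling ratio appears here as the number of stored coefficients per grid cell: the fractional part of each visibility's position selects which of the finely spaced coefficients is used.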
Figure 2.2: Overview of the gridding operation. The gridding operation takes a set
of visibilities sampled from multiple baselines of a radio-telescope array and convolves
them to a regular grid. This operation is necessary to prepare the visibility data for
the two dimensional Inverse Fast Fourier Transform (FFT) operation, which is used
to transform these visibilities from the frequency domain into a two dimensional
image in the spatial domain. Each red line represents measurements made by a
separate baseline taken over a period of time.
2.4 Parallel Processors
Computer manufacturers have shifted their focus in recent years from designing fast
single core processors to creating processors which can execute multiple threads si-
multaneously and minimise memory access latency with on chip cache. Since these
multi-core processors are still relatively new, a diverse range of architectures are
available, including multi-core x86 processors such as the AMD Phenom and Intel
Core i7, GPUs like NVIDIA’s Tesla and AMD’s Firestream series as well as other
types of processors including IBM’s Cell/B.E. One of the factors limiting the usage
of parallel processors by developers is the vast amount of code that has been de-
veloped for single processor computers. Often, due to inter-dependencies between
operations, rewriting these legacy programs to take advantage of multiple concur-
rent threads is not a trivial task.
While originally developed as co-processors optimised for graphics calculations,
GPUs are being designed with increasingly flexible instruction sets and are emerging
as affordable massively parallel processors. NVIDIA’s recent Tesla C1060 GPU is
capable of 933 single precision GigaFLOPS [8] (floating point operations per sec-
ond) compared to one of the fastest CPUs available at the time, Intel’s Core i7 975
with a reported theoretical peak of 55.36 GigaFLOPS [14]. Part of the reason that
GPUs can claim such high performance figures is their architecture. As shown in
Figure 2.3, GPUs devote more die space to data processing. GPUs are thus highly
optimised for performing simple vector operations on large amounts of data faster
than a processor using that die space for other purposes. However, this performance
comes at the expense of control circuitry, meaning that GPUs cannot make use of
advanced run time optimisations commonly found on modern desktop CPUs such
as branch prediction and out-of-order execution. GPUs also sacrifice the amount of
circuitry used for local cache, which has a major impact on the average amount of
time a process needs to wait after requesting data from memory.
Figure 2.3: Comparison of CPU and GPU architectures [7]. (a) CPU Layout;
(b) GPU Layout. This figure shows the difference in layout between a CPU and a
GPU. CPUs are designed to be general purpose processors capable of performing a
wide variety of tasks quickly. Because of this, a large amount of space on the chip’s
die is dedicated to control logic and local cache, both of which can be used to optimise
programs at run time. GPUs are highly tuned to perform graphics operations, which
are mostly simple vector operations on large amounts of data. This performance is
achieved by dedicating most of the chip to the Arithmetic and Logic Units (ALUs)
which perform instructions, at the expense of cache and control circuitry. Because
of this, GPUs often lack many of the advanced run time optimisations commonly
found on modern desktop CPUs, such as branch prediction and out-of-order
execution, and accessing system memory has a higher average latency.
2.5 OpenCL
OpenCL is a programming standard created by the Khronos Group with the design
goal of enabling the creation of code that can run across a wide range of parallel
processor architectures without needing to be modified. To deal with the many
different types of processors that can be used for processing data, the OpenCL runtime
separates them into two classes: host and device. The host, which repre-
sents a general purpose computer, is in charge of transferring both device programs
compiled at run-time (kernels) and data to a device. It also instructs devices to run
kernels and sends requests for data to be transferred back from a device to the host.
A host can make use of a command queue object in order to schedule data transfers
and the execution of kernels on various devices asynchronously so that it remains
free to perform other operations while the devices are busy.
The job of a device is simply to execute a kernel in parallel across a range of data,
storing the results locally, and then to alert the host when the kernel has finished
executing, so the results can be transferred back. Each device can be divided into a
collection of compute units and in turn each of these compute units is composed of
one or more Processing Elements (PEs). Memory on a device is organised into four
distinct regions: global, constant, local and private. Global and constant memory
are shared among all compute units on a device and are the only regions of memory
accessible to the host. The only major difference between these two regions is that
constant memory can only be written to by the host while global memory can be
written to by both host and device. Local memory is memory shared by all pro-
cessing elements within a work-group, which can be allocated by the host but only
manipulated by the device. Finally, private memory is memory available to only
a single processing element. Figure 2.4 shows how the hierarchy of processors and
various memory types are linked together.
In order to run a kernel, the host initialises an NDRange, which represents a one,
two or three dimensional array with a specific length in each dimension. The size
of this NDRange, also known as an index space, determines the number of kernel
instances launched. Each instance of a kernel running on a device is known as a
work-item and is provided with an independent global ID representing a position in
index space. Work-items are organised into work-groups, each of which has its own
group ID and provides the work-items within the group with independent local IDs.
When a kernel is executed each work-group is executed on a compute unit
and each work-item maps to a processing element. Limitations on various parameters,
such as the maximum number of work-items a work-group can contain and the
amount of memory available in each region, depend on the architecture of the device.
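The relationship between these IDs can be modelled with a short calculation. This is an illustrative emulation of OpenCL's one-dimensional index space, not the OpenCL API itself; the NDRange and work-group sizes are assumed values.

```python
# Emulation of OpenCL's 1-D index space: a work-item's global ID is derived
# from its work-group's ID, the work-group (local) size, and its local ID.
def global_id(group_id, local_size, local_id):
    return group_id * local_size + local_id

# An illustrative 1-D NDRange of 1024 work-items split into groups of 256:
ndrange, local_size = 1024, 256
num_groups = ndrange // local_size            # 4 work-groups
first = global_id(0, local_size, 0)           # first work-item in the range
last = global_id(num_groups - 1, local_size, local_size - 1)  # last work-item
```

Every global ID from 0 to 1023 is produced exactly once, which is what lets each work-item address a distinct element of the input data.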
For GPU devices based on the CUDA architecture, such as the NVIDIA Tesla
C1060 used in this project, OpenCL compute units correspond to hardware objects
called multiprocessors. While each multiprocessor can process 32 threads in parallel
(known as a warp), it is capable of storing the execution context (program counters,
registers, etc.) of multiple warps simultaneously and switching between them very
quickly [7]. This technique can be used to efficiently run work-groups larger than
32 threads on a single multiprocessor. Since this context switch can occur between
two consecutive instructions, the multiprocessor can instantly switch to a warp with
threads ready to execute if the current context becomes idle, such as when reading
or writing global memory. Each multiprocessor possesses a single set of registers and
Figure 2.4: OpenCL memory hierarchy [7]. A system running OpenCL consists of a
host, which can be any computer capable of running the OpenCL API, and one or
more devices. Each device can be divided into a collection of compute units and, in
turn, each of these compute units is composed of one or more Processing Elements.
Memory on a device is arranged in a similar hierarchy. Global and constant memory
are shared among all compute units on a device, local memory is only available to a
single compute unit and private memory is specific to a single processing element.
OpenCL devices typically represent GPUs, multi-core CPUs, Digital Signal Processors
(DSPs) and other parallel processors. Since a host represents a general purpose
computer, it has its own CPU and memory, which are used to issue commands and
transfer data to the various devices as well as perform other operations outside the
OpenCL environment.
a fixed amount of local memory which are shared between all active warps. Because
of this tradeoff between work-group size and memory available to each work-item,
trying to find a balance between these parameters is essential to obtaining optimal
performance.
Chapter 3
Literature Review
The gridding algorithm used in aperture synthesis is widely documented in scientific
literature [4,5,10,18,23,32]. A large part of the research effort has been focused on
improving the quality of images generated, by devising methods to programmatically
determine the ideal convolution window for a given set of data, as well as
minimising artifacts introduced from oversampling.
There have been various efforts to implement this algorithm on parallel hardware
in [11, 19, 28–30]. Before the OpenCL standard was published, IBM’s Cell Processor
was a major target for research efforts, although recently GPUs have become
cheaper, more powerful and easier to program, leading to more research on paral-
lelisation with GPUs, particularly with NVIDIA’s CUDA based cards.
Gridding is also used in Magnetic Resonance Imaging (MRI) applications and sev-
eral papers have been written on the topic of improving the gridding algorithm
as well as creating various implementations targeting heterogeneous parallel
processors [1, 12, 15, 16, 20, 22, 24, 26]. While the process used to convert MRI data
into images is completely different to the aperture synthesis process used in ra-
dio astronomy, both processes involve transforming irregularly sampled data in the
Fourier domain into a spatial image.
An early attempt to parallelise gridding on IBM’s Cell Broadband Engine is de-
scribed in an article entitled Radio-Astronomy Image Synthesis on the Cell/B.E. [29]
published in 2008. This paper describes an application of gridding and its inverse
function degridding, and compares the perfomance between an Intel Pentium D x86
CPU and two different platforms containing the cell processor: Sony’s Playstation
3 and a IBM’s QS20 Server Blade. On average, the results for the Cell platforms
showed a twentyfold increase in performance compared to the Pentium D, although
speed increase was negligible for small kernels less than 17x17. One of the main
conclusions reached in this paper is that I/O delay and memory latency are the
largest bottlenecks in scaling this algorithm to a cluster of processors.
The parallel gridding implementation detailed in this paper took advantage of the
Cell’s high bandwidth between processors by using the Power Processing Element
(PPE) to distribute the visibility data along with the relevant convolution and grid
indices to the Synergistic Processing Elements (SPEs) on the fly. The PPE stored
separate queues as well as separate copies of the grid for each of the SPEs. Therefore,
if multiple visibilities were located close to each other, they would be
allocated to a single SPE to reduce the number of memory accesses. To prevent too
much work from piling up in a single queue, a maximum queue size was established
in order that the PPE would not continuously fill a single queue while the other
SPEs idled. Each of the SPEs performed a simple loop of polling its queue until
work was available, fetching the appropriate data from system memory with Direct
Memory Access (DMA), performing the gridding operation and writing the results
to its copy of the grid in system memory. Once all visibilities were processed, the
PPE added each of the grids together to produce the output.
A follow-up paper was written by the same research team in 2009, entitled Building
high-resolution sky images using the Cell/B.E. [30], detailing further optimisations
to their Cell-based gridding implementation. The largest optimisation detailed in
this paper was to check consecutive visibilities to see if they had identical u − v
coordinates and if so, add them together and then enqueue the combined visibility.
The result of these further optimisations was a scalable version of the previous gridding
algorithm designed to run on a cluster of Cell processors, with each Cell core able to
process all data generated by 500 baselines and 500 frequency channels at a rate of
one sample per second.
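The duplicate-coordinate optimisation described above can be sketched as follows. This is an illustrative reconstruction, not the authors' Cell code; the function name and data layout are assumptions.

```python
def coalesce(visibilities):
    """Merge consecutive visibilities with identical (u, v) coordinates,
    summing their values, so the combined visibility is gridded only once."""
    merged = []
    for u, v, value in visibilities:
        if merged and merged[-1][0] == u and merged[-1][1] == v:
            # Same grid location as the previous sample: accumulate in place.
            merged[-1] = (u, v, merged[-1][2] + value)
        else:
            merged.append((u, v, value))
    return merged

# Two consecutive samples at (1, 2) collapse into one combined visibility.
vis = [(1, 2, 0.5), (1, 2, 0.25), (3, 4, 1.0)]
```

Because gridding is linear, summing values before convolving gives the same grid as convolving each sample separately, while halving the memory traffic for the duplicated coordinate.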
More recently an effort was made to implement several stages of the aperture syn-
thesis process using CUDA, which is outlined in Enabling a High Throughput Real
Time Data Pipeline for a Large Radio Telescope Array with GPUs [9]. The purpose
of this research was to design a data pipeline capable of processing data generated
by the Murchison Widefield Array in real time. While the data pipeline required
over 500 seconds of processing time running on a single core of an Intel Core-i7 920,
the same pipeline implemented in CUDA could be processed in under 7.5 seconds
on a single NVIDIA Tesla C1060. Excluding data transfer times, the GPU imple-
mentation of gridding developed as part of this research demonstrated an average
speedup of twenty-twofold when compared to the CPU version.
This research demonstrates that gridding has been successfully implemented on sev-
eral different parallel processor architectures with significant performance improve-
ments compared to existing serial implementations. Most of the research conducted
to date has been focused on implementing gridding on a single processor architecture
or on comparing the performance of multiple independent implementations written
for different devices. Due to the portability of software written in OpenCL, a par-
allel version of gridding implemented as an OpenCL kernel could be combined with
kernels implementing other stages of aperture synthesis and run on a system com-
prised of multiple different compute devices.
Chapter 4
Model
The gridding algorithm is used to interpolate a set of visibilities to a regular grid
as illustrated in Figure 2.2. Each visibility sample is projected onto a region of the
grid by convolving its brightness value with a two-dimensional set of coefficients. In
this chapter I will outline the model I developed which implements the gridding
algorithm on the parallel architecture of NVIDIA’s Tesla C1060 GPU. I describe
three approaches. Firstly, the scatter approach, where each visibility is mapped to
an OpenCL work-item and the kernel performs a similar convolution operation to
the original serial implementation. Secondly, the gather approach, where the two-
dimensional location of each pixel on the grid corresponds to a thread on the GPU
and the kernel reads in the entire list of visibilities, only writing to the grid address
corresponding to its global ID. Finally, the pre-sorted gather approach, which is
similar to the normal gather approach except the visibilities are sorted and placed
into bins prior to gridding and each kernel only reads through a subset of the list of
visibilities.
4.1 Scatter
Scatter communication occurs when the ID value given to a kernel processing a
stream of data corresponds to a single input element and the kernel writes to mul-
tiple locations, scattering information to other parts of memory [13]. In the context
of parallel gridding, a scatter kernel is implemented so that the global ID of each
work-item corresponds to a single visibility and the kernel convolves this visibility
over a region of the grid. Of the different parallelisation approaches discussed, scat-
ter is the closest to a traditional serial implementation because the kernel effectively
performs the same operations, although instead of looping through the list of visibilities, these operations are performed simultaneously. An example of this type of kernel is shown in Figure 4.1.
In the case of a scatter kernel operating over a set of v visibilities with a convolution
function of width c, v threads are launched where each thread performs c2 multipli-
cations by looping across the convolution function in two-dimensions. This results
in a computational complexity of O(v · c2). Although the complexity is the same as
that of the serial implementation, a scatter kernel can scale across a large number
of processors with a proportional speed increase.
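As a rough illustration, the work performed by a single scatter work-item can be written as a serial C function. This is a sketch under assumed names and sizes (`scatter_one`, a 3-wide filter, an 8x8 grid), not the thesis kernel; in the OpenCL version this body would run once per work-item, with the visibility selected by the work-item's global ID.

```c
#include <assert.h>

#define C 3   /* convolution function width (kept small for the sketch) */
#define W 8   /* grid width  */
#define H 8   /* grid height */

/* Serial analogue of one scatter work-item: spread a single visibility's
 * value over a C x C region of the grid, weighted by the convolution
 * function.  The C*C multiply-adds per visibility give the O(v * c^2)
 * complexity discussed in the text. */
void scatter_one(float grid[H][W], float conv[C][C],
                 int u, int v, float value) {
    for (int j = 0; j < C; j++)
        for (int i = 0; i < C; i++) {
            int x = u + i - C / 2;
            int y = v + j - C / 2;
            if (x >= 0 && x < W && y >= 0 && y < H)
                grid[y][x] += value * conv[j][i];   /* unguarded against
                                                       concurrent writers */
        }
}
```

The final `+=` is exactly where the write-conflict problem discussed below arises when many work-items execute this in parallel.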
Even though the scatter approach is very fast, it does nothing to prevent multiple
threads attempting to write to the same memory location simultaneously, which can
lead to a write conflict in which the output of one thread is lost. A possible
solution is to provide each processor with a unique copy of the grid, which it writes
to, and to add an extra step at the end of the process that sums all the grids together.
While this solution would be ideal on a multi-core CPU, it would be impractical
on a GPU-like device with hundreds of processing elements, since the amount of
memory needed would likely exceed that which is available for any practical grid
size.
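The final summation step of this private-grid workaround might look like the following sketch (names and sizes are illustrative, and the memory cost of T full grid copies is visible in the array dimensions):

```c
#include <assert.h>

#define W 4
#define H 4
#define T 3   /* number of threads, each holding a private grid copy */

/* Reduction step for the private-grid workaround: after every thread has
 * scattered into its own copy of the grid, the copies are summed
 * element-wise into one result.  Storage grows linearly with T, which is
 * why this approach is impractical for a GPU with hundreds of
 * processing elements. */
void reduce_grids(float grids[T][H][W], float out[H][W]) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            out[y][x] = 0.0f;
            for (int t = 0; t < T; t++)
                out[y][x] += grids[t][y][x];
        }
}
```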
Figure 4.1: Gridding with a scatter kernel.
The scatter strategy involves assigning each visibility to a different thread, where each thread applies the convolution function. While this approach is very fast, it runs into problems when threads attempt to write to the same grid location, as shown in the magenta region where kernels A and B overlap as well as the cyan region where kernels B and C overlap. When this occurs only one of the values being written is saved while all the other values are lost.
4.2 Gather
A gather kernel works by mapping each address in the output of a function to a
thread and processing the set of input data separately at each location. The gather
approach to gridding works by assigning a thread for each pixel on the output grid
and having each thread process the list of visibilities separately. Since each thread
only writes to a single pixel of the output grid, this approach avoids the problem of
write conflicts found in scatter kernels, as shown in Figure 4.2.
Given a set of v visibilities which are to be convolved to a w by h grid, a gather
kernel needs to iterate through the list of visibilities once for each thread. Because
a thread is launched for each position on the grid, this results in a complexity of
O(v ·w·h). Since the grid is always significantly larger than the convolution function,
the gather approach is far more algorithmically complex than the scatter approach
and therefore takes longer to run. When it comes to writing, however, gather kernels
have one major advantage: because each thread writes to a single location, the
number of writes is only w · h. Since all writes can be performed independently,
given a GPU with p processing elements, the complexity of writing is only O(w · h / p).
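The per-pixel loop of a gather kernel can be sketched as a serial C function. The names and the simplified `Visibility` struct are assumptions made for illustration; the OpenCL kernel would run this body once per work-item, with (x, y) taken from the work-item's two-dimensional global ID.

```c
#include <assert.h>
#include <stddef.h>

#define C 3   /* convolution width (illustrative) */

typedef struct { int u, v; float value; } Visibility;  /* simplified sample */

/* Serial analogue of one gather work-item: the thread assigned to grid
 * pixel (x, y) reads the entire visibility list, but only accumulates
 * visibilities whose C x C convolution footprint covers (x, y).  Each
 * thread owns its single output value, so there are no write conflicts,
 * at the cost of O(v * w * h) reads overall. */
float gather_pixel(int x, int y, const Visibility *vis, size_t n,
                   float conv[C][C]) {
    float acc = 0.0f;
    for (size_t k = 0; k < n; k++) {
        int di = x - vis[k].u + C / 2;   /* offset into the filter */
        int dj = y - vis[k].v + C / 2;
        if (di >= 0 && di < C && dj >= 0 && dj < C)
            acc += vis[k].value * conv[dj][di];
        /* visibilities outside the footprint are read but ignored */
    }
    return acc;
}
```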
A major disadvantage of this approach is that the total number of operations per-
formed is significantly larger, since each visibility is processed once for each thread,
whereas the scatter approach only processes each visibility once. Because the con-
volution function used to map visibilities to the grid is significantly smaller than
the grid itself, most visibilities processed by each thread fall outside the convolution
width and can safely be ignored. With this in mind, an optimised version of the
gather approach was developed and is discussed in the following section.
Figure 4.2: Gridding with a gather kernel.
The gather approach to gridding works by assigning a thread for each pixel on the output grid and having each thread process the list of visibilities separately. Since each thread only writes to a single pixel of the output grid, this approach avoids the problem of write conflicts found in scatter kernels. A disadvantage to this approach is that the total number of operations performed is significantly larger, since the list of visibilities is read once for each thread, whereas the scatter approach only reads them in once.
4.3 Pre-sorted Gather
The pre-sorted gather approach attempts to significantly improve the performance
of the regular gather approach by performing an additional series of steps before the
gridding operation. These steps attempt to reduce the number of visibilities processed
by each thread while still producing correct output. This sequence of steps, col-
lectively called binning, works by splitting the list of visibilities into a collection of
shorter lists, whereby each short list contains the visibilities located in a particular
region of the grid, or bin. The binning process begins by determining a bin size,
which must be equal to or larger than the convolution function. This is followed by
identifying which bin each visibility is located in, and using these values to create a
list of keys.
Once the list of keys has been generated, the visibilities are sorted based on the
value stored in each visibility’s corresponding key, which results in a list where
visibilities in each bin are grouped together. The list of visibilities is then processed
a second time in order to generate an array containing the index of the first and last
visibility in each bin. Following this step, a modified gather kernel is launched to
perform the gridding process, with the array of bin indices passed as an additional
argument. The size and position of each work-group corresponds to the size and
location of each bin. Instead of looping through each visibility in the list, each
work-item only iterates through the visibilities located in its own bin and the eight
bins directly adjacent to it. An illustration of the pre-sorted gather approach is
shown in Figure 4.3.
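The key-generation and index-array steps of binning might look like the following C sketch. The helper names and the row-major key layout are assumptions, not taken from the thesis code; only the structure of the steps (key per visibility, then first/last index per bin after sorting) follows the description above.

```c
#include <assert.h>
#include <stddef.h>

#define BIN 8      /* bin edge length; must be >= the convolution width */
#define BINS_X 4   /* bins per grid row, for a hypothetical 32-wide grid */

/* Key generation: label each visibility with the index of the bin
 * containing its u-v coordinate. */
int bin_key(int u, int v) {
    return (v / BIN) * BINS_X + (u / BIN);
}

/* After sorting the visibilities by key, one linear pass records the
 * first and one-past-last sorted index for every bin, so each
 * work-group can jump straight to its own bin's run of visibilities. */
void bin_ranges(const int *sorted_keys, size_t n,
                int nbins, int *first, int *last) {
    for (int b = 0; b < nbins; b++) { first[b] = 0; last[b] = 0; }
    for (size_t i = 0; i < n; i++) {
        int k = sorted_keys[i];
        if (last[k] == 0) first[k] = (int)i;   /* first hit for this bin */
        last[k] = (int)i + 1;
    }
}
```

A bin with `first[b] == last[b]` is empty, and the gather kernel for a given bin iterates from `first` to `last` for that bin and its eight neighbours.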
While the easiest approach to sorting the visibilities into bins would be to sort them
Figure 4.3: Gridding with a pre-sorted gather kernel.
Since the grid is significantly larger than the convolution window, each thread only needs to consider visibilities located nearby. To take advantage of this, the grid is divided into sub-regions called bins and the list of visibilities is sorted into an order where visibilities in each bin are grouped together. Each work-item processes visibilities located in its own bin and adjacent bins. The red, green and blue boxes on the left correspond to the list of visibilities processed by an individual work-item. The tinted bins represent adjacent bins for their corresponding coloured work-items.
on the CPU with a traditional algorithm such as Quicksort, it is possible to do this
on the GPU using a parallel sorting algorithm such as a bitonic sort. Bitonic sorting
is based on a network of threads taking a divide and conquer approach to sorting,
which implement two kernels: Bitonic Sort which orders the data into alternate in-
creasing and decreasing subsequences, and Bitonic Merge which takes a pair of these
ordered subsequences and combines them together. This was implemented with a
modified version of the Bitonic Sorting network example found in NVIDIA’s GPU
Computing SDK, with the datatype of the values converted from uint to float4
in order to handle visibilities.
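A serial C rendering of the bitonic network illustrates the compare-exchange pattern: each pass of the inner `j` loop corresponds to one parallel kernel launch over the whole array on the GPU. This is a generic textbook bitonic sort over floats, not NVIDIA's SDK code, and it assumes a power-of-two length.

```c
#include <assert.h>

/* Serial bitonic sort (ascending).  The outer k loop builds bitonic
 * subsequences of doubling length; the inner j loop is the merge step.
 * On the GPU, every iteration of the i loop for a fixed (k, j) runs as
 * an independent work-item.  n must be a power of two. */
void bitonic_sort(float *a, int n) {
    for (int k = 2; k <= n; k <<= 1)
        for (int j = k >> 1; j > 0; j >>= 1)
            for (int i = 0; i < n; i++) {
                int partner = i ^ j;
                if (partner > i) {
                    /* direction alternates with the k-sized block */
                    if (((i & k) == 0 && a[i] > a[partner]) ||
                        ((i & k) != 0 && a[i] < a[partner])) {
                        float t = a[i];
                        a[i] = a[partner];
                        a[partner] = t;
                    }
                }
            }
}
```

Sorting the visibility keys this way is what allows the whole stage to move onto the GPU, since every compare-exchange within a pass is independent.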
Chapter 5
Testing
A GPU gridding program was successfully implemented and tested, using the pre-
sorted gather model presented in Chapter 4. The testing compared the GPU grid-
ding implementation with a single core CPU implementation in order to determine
the suitability of parallel architectures to the gridding stage of aperture synthesis.
All testing was performed on a Xenon Nitro A6 Tesla workstation. This system
contained a Foxconn Destroyer motherboard featuring a single AM2+ CPU Socket,
four Dual Channel DDR2 Memory slots, four PCIe v2.0 slots, NVIDIA nForce 780a
SLI chipset and 5.2 GT/s HyperTransportTM bus connecting the CPU with the
northbridge. The CPU used was an AMD PhenomTM II X4 955 clocked at 3.2 GHz.
8GB of RAM was installed, consisting of four 2GB DIMMS running at DDR2-800.
Two different Graphics Cards were available, a NVIDIA Tesla C1060 and a NVIDIA
Quadro 5800, both of which include a 240-core GPU clocked at 1.3 GHz, with 8 GB
and 4 GB respectively of on-board RAM clocked at 800 MHz and connected over a
512-bit GDDR3 bus with a bandwidth of 102 GB/s. Both graphics cards were connected to the
motherboard through a PCIe x16 bus.
The operating system used in the tests was the AMD64 release of Ubuntu Linux
9.10 (Karmic Koala), running Linux Kernel 2.6.31-22. Version 4.4.1 of the GNU
compiler collection was used to compile all C and FORTRAN code. The reference
implementation of gridding was a version of the invert function from the 2010-04-22
release of the Miriad data reduction package, modified to measure and output its
run time. The NVIDIA drivers installed were version 195.36.15, along with ver-
sion 3.0 of the NVIDIA Toolkit which includes the OpenCL libraries for NVIDIA
GPUs and version 3.0 of the NVIDIA GPU Computing SDK. All performance tim-
ing data was measured using the gettimeofday function found in the Unix library
sys/time.h.
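Stage timings of this kind are typically taken by sampling gettimeofday before and after a region of interest and subtracting. The helper below is illustrative, not the thesis test harness:

```c
#include <assert.h>
#include <sys/time.h>

/* Elapsed wall-clock time in milliseconds between two gettimeofday
 * samples.  gettimeofday fills a struct timeval with seconds and
 * microseconds since the epoch. */
double elapsed_ms(struct timeval start, struct timeval end) {
    return (end.tv_sec - start.tv_sec) * 1000.0 +
           (end.tv_usec - start.tv_usec) / 1000.0;
}
```

A caller would wrap the timed stage as `gettimeofday(&t0, NULL); /* stage */ gettimeofday(&t1, NULL);` and report `elapsed_ms(t0, t1)`.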
Performance tests were conducted using a sample dataset of 1337545 visibilities
taken by the Australian Telescope Compact Array (ATCA) of Supernova SN1987A.
Unless specified otherwise the grid size used is 1186 by 2101 and the convolution
function width is 6, with the convolution function data comprising a spheroidal func-
tion stored in a 2048 element array. The data used to generate each performance
plot was created by running the relevant program five times in a row and averaging
the execution time of the last three runs in order to minimise the impact of hard
disk seek times and power saving features on the results.
The objective of the first test, shown in Figure 5.1, was to determine the optimal
local work-group size for the gridding kernel, operating on the sample dataset. This
value also determines the size of the bins used in the binning stage of the gridding
Figure 5.1: Thread topology optimisation.
This figure shows how the performance of the GPU gridding kernel varies with a number of local work-group sizes on the sample data. This test was performed in order to find an optimal work-group size for later tests. The x-axis shows the number of work-items in each group, with each datapoint showing the width and height of the work-group it represents. The y-axis is the execution time of the gridding process measured in milliseconds. Because it is not perfectly clear in the diagram, the fastest work-group sizes are 6x10, 6x9, 7x8, 6x8, 10x6 and 8x8.
process. This test was conducted by iterating through each combination of work
group width and height and recording the time taken by the entire gridding process.
This included time taken transferring data between the device and host. A value
of 6 was used for the minimum number of elements in both dimensions since the
gridding kernel is only designed to work with both work-group dimensions equal to
or greater than the convolution function width. Values greater than 16 in either
dimension are not displayed on the plot, since increasing either work-group dimen-
sion past this value significantly decreased performance. While a work-group size of
6x10 resulted in the fastest execution time, 8x8 was used in further tests for reasons
that are explained in Chapter 6.
Figure 5.2 illustrates the execution time of each stage of the GPU gridding imple-
mentation compared to the total execution time taken by Miriad’s gridding imple-
mentation. A performance profile of the GPU gridding process with sorting handled
on the CPU is shown in Figure 5.3. The purpose of these diagrams is to visualise
the amount of time spent at each stage of the gridding process in order to determine
if any stage in particular is acting as a performance bottleneck. Each item listed in
the key located on either diagram represents a distinct stage of the GPU gridding
process. Binning represents the time spent determining which bin each visibility is
located in. Device Transfer represents the total amount of time spent transferring
the binned visibilities and convolution function from host to device. Sorting is a
measure of the total time taken by the sorting stage. Bin Processing represents
the time taken to transfer the sorted visibilities from device to host, build an array
containing indices for the first and last visibility in each bin, and transfer this new
array onto the device. Kernel Execution represents the time spent performing the
actual gridding operation. Finally, Host Transfer represents the time taken trans-
Figure 5.2: Performance profile of GPU gridding implementations compared with CPU gridding.
This diagram illustrates the time spent in each stage of the GPU gridding process compared with the total execution time of Miriad's gridding implementation, using the sample dataset. Each item listed in the key represents a distinct stage of the GPU gridding process. Binning represents the time spent determining which bin each visibility is located in. Device Transfer represents the total amount of time spent transferring the binned visibilities and convolution function from host to device. Sorting is a measure of the total time taken by the sorting stage. Bin Processing represents the time taken to transfer the sorted visibilities from device to host, build an array containing indices for the first and last visibility in each bin, and transfer this new array onto the device. Kernel Execution represents the time spent performing the actual gridding operation. Finally, Host Transfer represents the time taken transferring the grid from the device back to the host.
Figure 5.3: Performance profile of GPU gridding implementation with sorting running on the CPU.
This diagram illustrates the time spent in each stage of the GPU gridding process with sorting of the visibilities handled by the CPU, using the sample dataset. This plot indicates the large amount of processing time needed to sort the visibilities into bins on the CPU. Because of this major performance bottleneck, the sorting stage was adapted to run on the GPU, which led to a significantly faster gridding implementation as shown in the second column of Figure 5.2.
ferring the grid from the device back to the host. Because Miriad’s gridding process
is performed entirely on the host without any pre-processing of visibility data, its
performance profile only consists of the kernel execution stage.
Figure 5.4 compares the execution time of GPU gridding with Miriad for various
size visibility lists. This test was done to compare how the performance of each
program scales when provided with larger datasets to process. The large datasets
used in this test were generated by repeating the visibilities in the SN1987A dataset
as many times as necessary for each test.
In order to compare the performance for convolution windows of various sizes, the
optimal work-group for each convolution width needed to be measured. Because the
graphics card used for testing only allows for work-groups of up to 512 elements in
size, only convolution windows up to 22x22 elements in size could be tested since
the convolution width acts as a lower bound for work-group sizes. The results of
this test are shown in Figure 5.5 and the measured optimal work-group sizes are
listed in Table 5.1.
Figure 5.6 shows how the GPU gridding program performs compared with Miriad
over a number of different convolution filter widths. The work-group sizes used for
the GPU kernel in this test are the optimal values displayed in Table 5.1. This
test was conducted by changing the convolution width parameter provided to both
gridding programs. Since the convolution function used in the sample data has a
large oversampling ratio, the array of convolution coefficients did not require modification.
Figure 5.4: CPU and GPU gridding performance for a varying number of visibilities.
This graph compares the performance of the optimised GPU gridding implementation with Miriad as the number of elements in the visibility list, N, increased. Results were plotted starting at N = 250000 and repeated for every multiple of 250000 up to a maximum of N = 10000000. Each datapoint was generated by averaging the runtime for each value of N over four runs.
Figure 5.5: Thread optimisation for a range of convolution filter widths.
The purpose of this test was to determine the optimal work-group sizes for the GPU gridding kernel for a range of convolution filter widths. Due to the way the gridding kernel handled bins, the minimum possible work-group size needs to be equal to or greater than the convolution width in both dimensions in order to generate correct output, so only datapoints satisfying this criterion were plotted. Another limitation is that the GPU only allows work-groups with 512 elements or fewer, leading to upper bounds of 22 for the convolution filter width and 22x22 for the work-group size.
Figure 5.6: CPU and GPU gridding performance for a varying convolution filter width.
This test was designed to demonstrate the differences in performance between the Miriad and OpenCL gridding implementations over a range of convolution widths. These convolution widths, on the x-axis, are plotted against run time, on the y-axis. Convolution filter widths were tested over a range of 1 to 22. The maximum value was imposed due to the current GPU implementation's requirement of a work-group size equal to or greater than the convolution width, with 22x22 being the maximum work-group size possible on the NVIDIA Tesla C1060.
Convolution width    Optimal work-group
1                    8x8
2                    8x8
3                    7x7
4                    8x8
5                    7x7
6                    8x8
7                    7x7
8                    8x8
9                    11x11
10                   11x11
11                   11x11
12                   12x12
13                   13x13
14                   14x14
15                   16x16
16                   16x16
17                   20x20
18                   21x21
19                   21x21
20                   21x21
21                   21x21
22                   22x22

Table 5.1: Optimal work-group sizes for various convolution filter widths.
This table shows the best performing local work-group sizes for a range of convolution widths, as determined by the results of the test shown in Figure 5.5.
Chapter 6
Discussion
The results presented in the previous chapter are now discussed. I will begin by ex-
amining the selection of an optimal work-group size and explaining the effect of this
parameter on performance. Subsequently, the performance profile of the OpenCL
gridding implementation will be discussed. Finally, the performance of both the Miriad
and OpenCL gridding implementations will be compared and the parameters affecting
each program will be analysed.
6.1 Work-Group Optimisation
The first major goal of testing was to determine the optimal local work-group size
which makes gridding on the GPU run in the shortest amount of time. This parameter
has a major impact on GPU performance in a number of ways, as it deter-
mines how many work-items can be run simultaneously, the number of registers and
amount of shared memory available to each processor, as well as determining the
size of the bins that visibilities are sorted into. For a given number of work-items in
a work-group T and warp size Wsize (which is equal to 32 for GPUs based on NVIDIA's
CUDA architecture), the total number of warps required by a work-group, Wwg, is
given by Equation 6.1 [7]:

Wwg = ceil(T / Wsize, 1)    (6.1)
Given the warp allocation granularity GW (equal to 2 on the Tesla), the number of
registers used by a kernel Rk and the thread allocation granularity GT (512 on the
Tesla), the number of registers allocated to each work-group, Rwg, can be expressed
by Equation 6.2:
Rwg = ceil(ceil(Wwg, GW) × Wsize × Rk, GT)    (6.2)
From Figure 5.1, the best performing work-group size was determined to be 60 work-
items arranged as 6x10. The reason a work-group size of 8x8 was chosen was that it
contains 64 work-items, which happens to be the maximum number that can fit into
2 warps. This maximises the number of work-items capable of running simultane-
ously on the GPU without decreasing the number of registers available to each warp.
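The warp count from Equation 6.1 can be checked numerically with a few lines of C; the helper name is invented for illustration:

```c
#include <assert.h>

#define WARP_SIZE 32   /* Wsize for NVIDIA's CUDA architecture */

/* Warps required by a work-group of t work-items (Equation 6.1):
 * Wwg = ceil(t / Wsize). */
int warps_per_group(int t) {
    return (t + WARP_SIZE - 1) / WARP_SIZE;
}
```

The computation confirms the reasoning above: a 6x10 group (60 work-items), a 7x7 group (49) and an 8x8 group (64) all occupy exactly 2 warps, with 8x8 filling both warps completely.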
The optimal work-group sizes measured for a range of convolution widths, shown
in Figure 5.5 and Table 5.1, reveal several interesting patterns. For convolution
widths up to 8, work-group sizes of 7x7 and 8x8 appear to produce very
similar results, outperforming all the other sizes. While the 8x8 and 7x7 work-groups
contain 64 and 49 work-items respectively, they both take up 2 warps. A possible
explanation for the fast performance of the 7x7 work-group is that, despite running
fewer work-items in parallel, the smaller bin size reduces the number of visibilities pro-
cessed in each work-group. A similar pattern can be seen in the performance of
15x15 and 16x16 work-groups, which both require 8 warps, but contain 225 and 256
work-items respectively.
Another observation is that while small work-groups generally outperform large
work-groups, work-groups below 6x6 in size show an opposite trend. These work-groups
are all smaller than a single warp, and since each multiprocessor processes
a single work-group at a time, such small work-groups do not contain enough
work-items to make use of the full set of processing elements. With a 1x1 work-group,
each multiprocessor only performs operations on one processing element, while the
other 31 idle.
6.2 Performance Profiling
The performance profiles shown in Figures 5.2 and 5.3 show a breakdown of the
time spent in each stage for separate gridding implementations. These plots can
be used to determine the ratio between run-times of different stages to determine
performance bottlenecks for a single plot as well as measure speed-up by comparing
different plots together.
The gridding implementation developed in this project, labelled as OpenCL with
GPU sort, demonstrates a speedup of 1.46x the original Miriad implementation. Ex-
cluding the time taken by the device and host transfer stages, as well as the transfers
listed as part of bin processing, this speedup is 2.28x. While this value is lower than
the performance obtained in other parallel gridding processes, detailed in Chapter 3,
the two values can’t be directly compared due to the different datasets used.
A comparison of both OpenCL implementations in these plots reveals the impact
sorting has on the runtime of pre-sorted gather based gridding. Compared to the
CPU based sort, sorting the visibilities on the GPU is 103x faster. Combined with
the other stages of the gridding process, this resulted in a total speedup of 6.36x.
6.3 Performance Comparison
Figure 5.4 shows several sharp increases in runtime of the GPU gridding algorithm
as the number of visibilities grows. Since the bitonic sorting kernel requires the list
of visibilities to be padded with empty values so that its length is a power of two,
these sudden runtime increases represent a combination of two factors. Firstly, the
sorting kernel needs to process twice as many visibilities which doubles the time
required for the operation. Secondly, the current version of GPU gridding pads the
list of visibilities with zeros on the host, which results in the amount of data needed
to be transferred doubling at each jump in runtime on the graph. These steep in-
creases in runtime could be partially reduced by padding the visibility data with
empty values on the GPU.
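The padding rule behind these jumps is simply rounding up to the next power of two, which can be sketched as:

```c
#include <assert.h>
#include <stddef.h>

/* The bitonic sort operates on power-of-two lengths, so the visibility
 * list is padded with empty entries up to the next power of two.  The
 * padded length doubles at each threshold, producing the step-like
 * jumps in the runtime plot. */
size_t padded_length(size_t n) {
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}
```

For example, every list length from 131073 up to 262144 is padded to 262144 entries, and adding one more visibility doubles the padded length to 524288.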
As shown in Figure 5.6, the relative performance of gridding on a GPU compared to
on a CPU greatly increases with large convolution widths. This is a major benefit
of the gather approach compared to the scatter approach, since, while larger
convolution windows increase the number of calculations performed per visibility in
both algorithms, in a gather based kernel this extra work is spread across a large
number of threads. In both the CPU implementation and scatter kernels, the num-
ber of operations performed on each visibility is proportional to the convolution
width squared.
Chapter 7
Conclusion
This project has implemented the gridding stage of aperture synthesis on a GPU
using OpenCL. Its performance has also been compared with the single threaded
gridding process used in Miriad. This chapter summarises the process of developing
the GPU gridding algorithm and concludes with future considerations for extending
this work.
7.1 Project Summary
The initial target of my research was to write a CPU based gridding implementa-
tion in C. The purpose of this implementation was to gain an understanding of the
gridding process and to develop wrapper code to handle input and output tasks not
supported by the GPU. In order to avoid rewriting a large amount of code unrelated
to the main task of gridding, this program was implemented by replacing the MapIt
function call in Miriad’s Mapper subroutine with a function call to my own gridding
function and returning the gridded output to Miriad. Once I verified that all grid-
ding calculations were being performed within the function, I began research into
different approaches to perform this operation on parallel hardware.
My first attempt at an OpenCL implementation running on the GPU made use of a
kernel based on the scatter approach described in Section 4.1. This implementation
was similar to the original CPU implementation, since the operations performed
by the kernel on each visibility were exactly the same as the original. The major
difference was that these operations were performed in parallel by work-items on
the GPU instead of inside a loop on the CPU. Although this program was able to
run extremely fast, I was unable to overcome the problems caused by simultaneous
writes to the same memory address.
After further examination of the gridding process, I developed a new gridding ker-
nel using the gather approach outlined in Section 4.2. This kernel was primarily
designed to eliminate the issue of simultaneous writing, which was achieved by
launching a separate thread for each pixel in the output grid. Initially this kernel
was incredibly slow, taking over half an hour to grid the sample dataset described in
Chapter 5. Since the gather kernel managed to produce correct results, improving its
performance replaced correcting the scatter kernel as the main focus of development.
It soon became apparent that the gather kernel’s performance could be drastically
improved by sorting the visibilities based on their location in the u − v plane and
modifying the gridding kernel to only process visibilities close to the grid location
designated by its global ID. This additional sorting step is explained in Section 4.3.
The first version of the pre-sorted gather approach performed the visibility sorting
operation on the CPU before transferring the sorted visibility data to the graphics
card and running the gridding kernel. Because I was planning to eventually im-
plement sorting on the GPU, I wrote my own version of the bitonic sort algorithm
which ran on a single CPU core. Performance tests revealed that although this new
approach to gridding was several hundred times faster than the original unsorted
gather approach, it was still five times slower than Miriad.
Profiling this new gridding implementation showed that the visibility sorting stage
was responsible for 90% of the total runtime. In order to improve overall perfor-
mance the sorting algorithm was replaced by a sorting kernel running on the GPU.
This sorting kernel was taken from the OpenCL Sorting Networks example in NVIDIA's
GPU Computing SDK, which closely matched my requirements with only slight
modification. This modification finally managed to improve the performance of my
gridding implementation enough to run faster than Miriad over a wide range of pa-
rameters.
7.2 Future Considerations
This research thoroughly investigated many aspects of a GPU based gridding imple-
mentation. However, there are still many related areas yet to be explored as well as a
number of areas within the scope of this project which warrant further investigation.
These include utilising different memory regions on the device and performing the
remaining CPU based binning stages on the GPU. Combining gridding with other
stages of the aperture synthesis pipeline in OpenCL and adapting this work to other
hardware architectures are also discussed.
The pre-sorted gather approach to gridding consists of four stages: determining
each visibility's bin, sorting the visibilities, constructing an array of indices of the
first and last visibility in each bin, and gridding. Currently only the sorting and
gridding stages are implemented on the GPU, while the other stages are processed
on the host. Determining each visibility's bin on the GPU could be performed faster
than on the CPU. Building the array of indices on the CPU requires the sorted
visibilities to be transferred to the host and the array of bin locations to be transferred
to the device. Performing this calculation on the GPU would not only be faster, but
would also eliminate both of these transfers.
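The host-side stages above can be sketched in a few lines of C. This is an illustrative fragment, not the thesis's implementation: the `Visibility` layout, the row-major bin numbering and the assumption that u and v are already scaled to grid units are all mine.

```c
#include <assert.h>
#include <stdlib.h>

/* A visibility with u-v coordinates pre-scaled to [0, bins_per_side). */
typedef struct { float u, v; int bin; } Visibility;

static int by_bin(const void *a, const void *b)
{
    return ((const Visibility *)a)->bin - ((const Visibility *)b)->bin;
}

/* Stage 1: assign each visibility a bin; stage 2: sort by bin;
 * stage 3: record where each bin starts and ends in the sorted array.
 * first[b] and last[b] are set to -1 for empty bins. */
void bin_and_index(Visibility *vis, int n, int bins_per_side,
                   int *first, int *last)
{
    int nbins = bins_per_side * bins_per_side;
    for (int i = 0; i < n; ++i)
        vis[i].bin = (int)vis[i].v * bins_per_side + (int)vis[i].u;
    qsort(vis, (size_t)n, sizeof *vis, by_bin);
    for (int b = 0; b < nbins; ++b) { first[b] = -1; last[b] = -1; }
    for (int i = 0; i < n; ++i) {
        int b = vis[i].bin;
        if (first[b] < 0) first[b] = i;
        last[b] = i;
    }
}
```

The first and third loops are embarrassingly parallel (one work-item per visibility, with the index array built via compare-with-neighbour), which is why moving them to the GPU alongside the sort would remove the host round-trip entirely.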
Since each work-group is composed of work-items located in the same bin, each
work-item processes the same set of visibilities. Currently the gridding kernel
requests each visibility from global memory individually, waiting after each request.
By allocating a small amount of local memory as a visibility cache, the kernel could
alternate between filling the cache by requesting a series of consecutive visibilities
in parallel and looping through each visibility in the cache. This optimisation does
not guarantee a performance increase on all devices, since the OpenCL specification
allows devices without work-group-specific local memory to map it to a region of
global memory. On such a device, any attempt at caching data from global memory
in local memory would actually slow down a kernel.
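The proposed caching pattern can be illustrated as an OpenCL C kernel fragment. This is a sketch only, not the thesis's kernel: the tile size, the `float2` visibility layout, the one-work-group-per-bin mapping and the `first`/`last` index arrays are assumptions, and the convolution weighting is omitted.

```c
#define TILE 64

__kernel void grid_cached(__global const float2 *vis,
                          __global const int   *first,
                          __global const int   *last,
                          __global float2      *grid)
{
    __local float2 cache[TILE];
    const int bin = get_group_id(0);   /* one work-group per bin */
    const int lid = get_local_id(0);
    float2 acc = (float2)(0.0f, 0.0f);

    for (int base = first[bin]; base <= last[bin]; base += TILE) {
        /* Fill the cache cooperatively: consecutive work-items issue
         * consecutive (coalesced) reads from global memory. */
        if (lid < TILE && base + lid <= last[bin])
            cache[lid] = vis[base + lid];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Every work-item then loops over the cached visibilities,
         * accumulating its own grid point's contribution
         * (convolution weighting omitted for brevity). */
        const int count = min(TILE, last[bin] - base + 1);
        for (int j = 0; j < count; ++j)
            acc += cache[j];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    grid[get_global_id(0)] = acc;
}
```

The two barriers are essential: the first ensures the cache is fully populated before any work-item reads it, and the second prevents the next tile from overwriting entries still being read.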
The aperture synthesis pipeline, outlined in Section 2.2, consists of several sequential
stages that convert radio signals collected by radio telescopes into a two-dimensional
image of the radio source. As discussed in Chapter 3, parallel versions of most stages
of aperture synthesis have been developed in CUDA in order to process data generated
by the Murchison Widefield Array in real time [9]. Future work could focus
on an OpenCL version of this pipeline, which could make use of a wider variety of
GPUs as well as other devices supporting OpenCL.
Since kernels written in OpenCL are capable of running on any OpenCL device
with sufficient memory, the gridding implementation developed in this project is
able to run on a wide variety of hardware without any modification. A subsequent
project could focus on optimising the gridding kernel developed for the NVIDIA
Tesla C1060 for various other devices and compare performance across a wide range
of parameters. Another potential area for further research is implementing
a version of gridding capable of running on multiple OpenCL devices simultaneously.
Appendix A
Original Proposal
Radio interferometric image reconstruction on commodity
graphics hardware using OpenCL
Alexander Ottenhoff
01 April 2010
The Problem
Astronomers can gain a better understanding of the creation and early evolution
of the universe, test theories and attempt to solve many mysteries in physics
by producing images from radio waves emitted by distant celestial entities. With
the construction of vast radio-telescope arrays such as the Square Kilometre Array
(SKA), Australian SKA Pathfinder (ASKAP) and Murchison Wide-field Array
(MWA), many engineering challenges need to be overcome. ASKAP alone will
generate data at a rate of 40 Gb/s, producing over 12 PB in a single month [6], and
SKA will produce many times this, so data processing and storage are major issues.
As we reach the limit of how fast we can make single-core CPUs run, we need to look
to parallel processors such as multi-core CPUs, GPUs and digital signal processors
to process this vast amount of data. One of the biggest problems limiting the
popularity of parallel processors has been the lack of a standard language that runs
on a wide variety of hardware, although a new language named OpenCL may change that.
First published by the Khronos group in late 2008, OpenCL is an open standard
for heterogeneous parallel programming [17]. A major advantage of OpenCL is
that it allows programmers to write software capable of running on any device with
an OpenCL driver, eliminating the need to rewrite large
amounts of code for each vendor's hardware. This partially solves the issue of vendor
lock-in, a major problem in general purpose GPU (GPGPU) programming up until
now, where, due to a lack of standardisation, software is often restricted to running
on a single architecture produced by one company.
In this project I aim to develop an efficient way to adapt radio-interferometric
imaging to parallel processors using OpenCL, in particular the gridding algorithm,
as this has traditionally been the most time-consuming part of the imaging process [30].
Due to the large amount of data that will be generated by ASKAP, the rate at
which data can be processed in real time may be a serious bottleneck.
Since a cluster of GPUs with equal computational performance to a traditional
supercomputer consumes a fraction of the energy, an efficient OpenCL implementation
would be a significantly less expensive option. I will primarily target GPU
architectures, in particular the NVIDIA Tesla C1060, although I will also attempt
to benchmark and compare performance on several different devices.
Background
Radio interferometry background
The goal of radio astronomy is to gain a better understanding of the physical universe
via the observation of radio waves emitted by celestial bodies. Part of this is achieved
by forming images from the signals received by radio telescopes. For a single dish
style radio telescope, the angular resolution R of the image generated from a signal
of wavelength λ is related to the diameter of the dish D.
R = λ / D    (A.1)
Since R is a measure of the finest detail a telescope can resolve, a dish designed
to create detailed images of low frequency signals can be several hundred metres in
diameter. Constructing a dish of this size is, however, both difficult and extremely
expensive, so most modern radio astronomy projects utilise an array of telescopes.
Aperture synthesis is a method of combining signals from multiple telescopes to
produce an image (as shown in Figure A.1) with a resolution approximately equal to
that of a single dish with a diameter equal to the maximum distance between antennae.
The first stage of this process involves taking the signals from each pair of antennas
and cross-correlating them to form a baseline. The relationship between the number
of antennas in an array, a, and the total number of baselines, b, including those
autocorrelated with themselves, is shown by (A.2).

b = a(a − 1)/2 + a    (A.2)
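As a quick sanity check, (A.2) can be evaluated in a few lines of C (the function name is mine):

```c
#include <assert.h>

/* Total number of baselines for an array of a antennas, counting
 * each antenna's autocorrelation with itself, as given by (A.2). */
int baselines(int a)
{
    return a * (a - 1) / 2 + a;
}
```

For example, two antennas yield three baselines: one cross-correlation plus two autocorrelations.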
These signals are combined to produce a set of complex visibilities, one for each
baseline, frequency channel and period of time. The next stage is to generate a
dirty image from these complex visibilities by translating and interpolating them to
a regular grid so that the Fast Fourier Transform (FFT) can be applied. Finally, the
dirty image may be deconvolved to eliminate artifacts introduced during imaging.
A common name for the stage of aperture synthesis where complex visibilities are
mapped to a regular grid is gridding. The relationship between the 2-dimensional
sky brightness I, 3-dimensional visibility V and primary antenna beam pattern A
is shown in (A.3) [27].
A(l,m)I(l,m) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} V(u,v) e^{2πi(ul+vm)} du dv    (A.3)
The primary beam pattern A is removed during the deconvolution stage to obtain
the sky brightness. For radio-telescope arrays with sufficiently large baselines or a
wide field of view, images are distorted because the curvature of the earth introduces
a height element w to the location of each antenna. One technique used to counter
this distortion is faceting, where the sky is divided into patches small enough that
the baselines can be treated as coplanar, and the resulting facets are then combined
into one image.
Another common approach, known as W-projection, involves gridding all visibilities
treating w as 0 and then convolving each point in the dirty image with a function G̃
given in (A.4) [3].

G̃(u, v, w) = (i/w) e^{−πi(u² + v²)/w}    (A.4)
Parallel hardware background
Computer manufacturers have shifted their focus in recent years from designing fast,
single core processors to creating processors which can execute multiple threads
simultaneously and minimise memory access latency. Since these multi-core processors
are still relatively new, a diverse range of architectures is available, including
multi-core x86 processors such as the AMD Phenom and Intel Core i7, IBM's Cell/B.E.
and GPUs like NVIDIA's Tesla and AMD's Radeon 5800 series. One of the factors
limiting the adoption of parallel processors by developers is the vast amount of
code that has been developed for single-processor computers. Often, due to
interdependencies between operations, rewriting these legacy programs to take advantage
of multiple concurrent threads is not a trivial task.
While originally developed as co-processors optimised for graphics calculations,
GPUs are being designed with increasingly flexible instruction sets and are emerging
as economical massively parallel processors. NVIDIA’s recent Tesla C1060 GPU is
capable of 933 single precision GigaFLOPS (floating point operations per second) [8],
compared to the fastest CPU available at the time, Intel's Core i7 Extreme 965,
which has been benchmarked at 69 single precision GigaFLOPS [14]. Part of
the reason that GPUs can claim such high performance figures is their architecture,
as shown in Figure A.2. By devoting more transistors to data processing, GPUs are
highly optimised for performing simple vector operations on large amounts of data
significantly faster than a processor using those transistors for other purposes. This
performance, however, comes at the expense of control circuitry, meaning that GPUs
can't make use of advanced run-time optimisations commonly found on modern desktop
CPUs, such as branch prediction and out-of-order execution. GPUs also sacrifice
the amount of circuitry used for local cache, which has a major impact on the average
amount of time a process must wait after requesting data from memory.
OpenCL is a programming language created by the Khronos group with the design
goal of enabling the creation of code that can run across a wide range of parallel
processor architectures without needing to be modified. To deal with the many
different types of processors that can be used for processing data, the OpenCL runtime
separates them into two classes: hosts and devices. The host, generally a single CPU
core, is in charge of managing memory and transferring programs compiled for the
device at run-time (kernels) and data to and from devices. A device’s job is to
simply execute a kernel in parallel across a range of data, storing the results locally,
and alert the host when finished so the results can be transferred back. Command
queues allow the host to queue up several commands for device execution while
remaining free to perform other operations until the results are ready. An important
feature for code executed on the device is
the availability of vector data types, allowing each ALU on a device with SIMD
instructions to perform an operation on multiple variables simultaneously. Because
of the diverse range of devices supported, memory management on the device is left
up to the programmer, so they can efficiently make use of the limited local memory
available on GPUs as well as take advantage of optional device features such
as texture memory.
Plan
This project aims to evaluate whether GPUs programmed using OpenCL are a suit-
able platform for running the gridding stage of imaging radio astronomy data in
real-time. So far various research papers and journal articles have been read in an
effort to understand the variety of techniques currently being used to improve grid-
ding performance in existing projects [1, 3, 15, 23, 32], as well as previous efforts to
parallelise gridding on similar processors [19, 26, 28–30]. The next step will be to
construct a theoretical model through analysis of the algorithms used in the most
relevant papers and research into the specifications of the target language and plat-
form [7, 17]. This model will be used to determine where any data dependencies
exist in the algorithm and to plan out a GPU optimised solution.
Before implementing this model on a GPU target using OpenCL, a serial version
will be written in ANSI C. The serial implementation will be developed first as a reference
to determine the correctness of the OpenCL version. This will then be followed
by an OpenCL implementation, optimised for NVIDIA's Tesla C1060 processor on
an x86 workstation running Ubuntu Linux. Various optimisations will be tested to
improve the execution time, and the final version will be benchmarked on several
different platforms.
Figure A.1: The software pipeline. [29]
Figure A.2: A comparison of CPU and GPU architectures. [7]
References
[1] PJ Beatty, DG Nishimura, and JM Pauly. Rapid gridding reconstruction with a minimal oversampling ratio. IEEE Transactions on Medical Imaging, 24(6):799–808, 2005.
[2] BG Clark. An efficient implementation of the algorithm 'CLEAN'. Astronomy and Astrophysics, 89:377, 1980.
[3] T. J. Cornwell, K. Golap, and S. Bhatnagar. Wide field imaging problems in radio astronomy. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), volume 5, pages v-861–v-864, March 2005.
[4] TJ Cornwell. Radio-interferometric imaging of very large objects. Astronomy and Astrophysics, 202:316–321, 1988.
[5] TJ Cornwell, MA Holdaway, and JM Uson. Radio-interferometric imaging of very large objects: implications for array design. Astronomy and Astrophysics, 271:697, 1993.
[6] TJ Cornwell and G. van Diepen. Scaling Mount Exaflop: from the pathfinders to the Square Kilometre Array.
[7] NVIDIA Corporation. OpenCL Programming Guide for the CUDA Architecture. Available from: http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingGuide.pdf
[8] NVIDIA Corporation. Tesla C1060 computing processor board specification. Available from: http://www.nvidia.com/docs/IO/56483/Tesla_C1060_boardSpec_v03.pdf
[9] RG Edgar, MA Clark, K. Dale, DA Mitchell, SM Ord, RB Wayth, H. Pfister, and LJ Greenhill. Enabling a high throughput real time data pipeline for a large radio telescope array with GPUs. Computer Physics Communications, 2010.
[10] S. Frey and L. Mosoni. A short introduction to radio interferometric image reconstruction.
[11] K. Golap, A. Kemball, T. Cornwell, and W. Young. Parallelization of Widefield Imaging in AIPS++. In Astronomical Data Analysis Software and Systems X, volume 238, page 408, 2001.
[12] A. Gregerson. Implementing Fast MRI Gridding on GPUs via CUDA.
[13] Mark Harris. Mapping computational concepts to GPUs. In SIGGRAPH '05: ACM SIGGRAPH 2005 Courses, page 50, New York, NY, USA, 2005. ACM.
[14] Intel. Intel microprocessor export compliance metrics. Available from: http://www.intel.com/support/processors/sb/CS-023143.htm
[15] JI Jackson, CH Meyer, DG Nishimura, and A. Macovski. Selection of a convolution function for Fourier inversion using gridding [computerised tomography application]. IEEE Transactions on Medical Imaging, 10(3):473–478, 1991.
[16] W.Q. Malik, H.A. Khan, D.J. Edwards, and C.J. Stevens. A gridding algorithm for efficient density compensation of arbitrarily sampled Fourier-domain data.
[17] A. Munshi. OpenCL: Parallel Computing on the GPU and CPU. SIGGRAPH, Tutorial, 2008.
[18] ST Myers. Image Reconstruction in Radio Interferometry.
[19] S. Ord, L. Greenhill, R. Wayth, D. Mitchell, K. Dale, H. Pfister, and RG Edgar. GPUs for data processing in the MWA. arXiv preprint arXiv:0902.0915, 2009.
[20] D. Rosenfeld. An optimal and efficient new gridding algorithm using singular value decomposition. Magnetic Resonance in Medicine, 40(1):14–23, 1998.
[21] RJ Sault, PJ Teuben, and MCH Wright. A retrospective view of Miriad. arXiv preprint astro-ph/0612759, 2006.
[22] T. Schiwietz, T. Chang, P. Speier, and R. Westermann. MR image reconstruction using the GPU. In Proc. SPIE, volume 6142, pages 1279–90. Citeseer, 2006.
[23] FR Schwab. Optimal gridding of visibility data in radio interferometry. In Indirect Imaging: Measurement and Processing for Indirect Imaging, page 333, 1984.
[24] H. Sedarat and D.G. Nishimura. On the optimality of the gridding reconstruction algorithm. IEEE Transactions on Medical Imaging, 19(4):306–317, 2000.
[25] DJ Smith. Maximum Entropy Method. Marconi Review, 44(222):137–158, 1981.
[26] TS Sorensen, T. Schaeffter, KO Noe, and M.S. Hansen. Accelerating the nonequispaced fast Fourier transform on commodity graphics hardware. IEEE Transactions on Medical Imaging, 27(4):538–547, 2008.
[27] GB Taylor, CL Carilli, and RA Perley. Synthesis imaging in radio astronomy II. In Synthesis Imaging in Radio Astronomy II, volume 180, 1999.
[28] A.S. van Amesfoort, A.L. Varbanescu, H.J. Sips, and R.V. van Nieuwpoort. Evaluating multi-core platforms for HPC data-intensive kernels. In Proceedings of the 6th ACM Conference on Computing Frontiers, pages 207–216. ACM, 2009.
[29] Ana Lucia Varbanescu, Alexander S. van Amesfoort, Tim Cornwell, Andrew Mattingly, Bruce G. Elmegreen, Rob van Nieuwpoort, Ger van Diepen, and Henk Sips. Radio astronomy image synthesis on the Cell/B.E. In Euro-Par '08: Proceedings of the 14th International Euro-Par Conference on Parallel Processing, pages 749–762, Berlin, Heidelberg, 2008. Springer-Verlag.
[31] T.L. Wilson, K. Rohlfs, and S. Huttemeister. Tools of Radio Astronomy.Springer Verlag, 2009.
[32] M. Yashar and A. Kemball. TDP Calibration & Processing Group (CPG) Memo: Computational costs of radio imaging algorithms dealing with the non-coplanar baselines effect: I. 2009.