Diploma Thesis
Hybrid implementation of an
iterative reconstruction algorithm for
Positron Emission Tomography sinograms
on CPUs and GPUs
submitted to University of Applied Sciences Regensburg,
Department of Computer Science and Mathematics
by Sebastian Schaetz
Supervisors
Prof. Dr. Markus Kucera
Dr. Frank Kehren
September 26, 2009
To my parents
Abstract
Positron emission tomography (PET) is a medical imaging modality that al-
lows the detection of tissue with particular properties or the observation of
metabolic processes in living organisms over time. It plays a vital role in can-
cer diagnosis and heart examination and represents a promising method for
early diagnosis of dementia. With the advancement of scanner technology, as well as an ever-increasing demand for higher-quality images and higher patient throughput, the image reconstruction system can become a bottleneck of a PET system. To meet this demand for more computing power in the reconstruction system, a novel way to speed up the image reconstruction process for modern PET systems is presented. A common graphics processing
device in conjunction with the CUDA framework is used to speed up exist-
ing algorithms for the CPU. Projectors for full 3D reconstruction are ported
to the CUDA device and new algorithms are implemented where necessary.
The main focus of this port is to speed up the calculation and to maintain
the numerical accuracy of the result. On the basis of the CPU reconstruction
system a new system that uses GPU projectors is implemented. With this
modification a two-fold speed-up is achieved compared to the highly optimized CPU implementation running on 4 CPU cores. Different optimization
methods are explored and applied where suitable. The results calculated by
GPU and CPU projectors are numerically identical allowing them to be in-
terchanged. The system is validated and the algorithms are integrated into
the reconstruction system of Siemens Healthcare Molecular Imaging PET
scanners.
Contents
1 Introduction
  1.1 Previous work
  1.2 Acknowledgement
2 Positron emission tomography
  2.1 Physics
    2.1.1 Radiation and Radioactive Decay
    2.1.2 Positron Decay
    2.1.3 Positron Annihilation
    2.1.4 Radiation Detection and Scintillation
    2.1.5 Positron Emitter Production
  2.2 Data Acquisition
    2.2.1 Scintillation Detectors
    2.2.2 Detected Events
    2.2.3 2D and 3D Data Acquisition
    2.2.4 Optimization
  2.3 Coordinate System, Data Structures and Compression
    2.3.1 Coordinate System
    2.3.2 Parallel Beam Space and Line of Response Space Sinograms
    2.3.3 Sinogram Compression
    2.3.4 Other Data Structures
  2.4 Image Reconstruction
    2.4.1 Basic Principles
    2.4.2 The Maximum-Likelihood Expectation Maximization Algorithm
3 Hybrid implementation of PET image reconstruction
  3.1 Projectors
    3.1.1 Ray projection through pixel images
    3.1.2 Projector Algorithm
    3.1.3 Implementation
  3.2 Optimization of CPU projector
    3.2.1 Analysis of current Implementation
    3.2.2 Optimization of current Implementation
  3.3 The Compute Unified Device Architecture
    3.3.1 Hardware Architecture
    3.3.2 Programming Model
    3.3.3 Execution Model
    3.3.4 Memory Model and Access
    3.3.5 CUDA Toolchain
    3.3.6 Debugging CUDA code
  3.4 Implementing projectors in CUDA
    3.4.1 Requirements
    3.4.2 Implementation
    3.4.3 Optimization
    3.4.4 Results
    3.4.5 Other algorithms
    3.4.6 Texture Units
  3.5 Debugging and Validation
    3.5.1 Debugging
    3.5.2 Validation
  3.6 Product considerations
    3.6.1 Timing the Tests
    3.6.2 Memory Tests
    3.6.3 Reconstruction Tests
  3.7 Product Integration
    3.7.1 Integration Overview
    3.7.2 Multithreading Implementation
    3.7.3 Parallel CPU and GPU projection
    3.7.4 Hybrid Implementation
4 Conclusion
Chapter 1
Introduction
Positron Emission Tomography (PET) is a tomographic imaging technology
which utilizes the positron emission decay of radioisotopes. PET allows the
detection of tissue with particular properties or the observation of metabolic
processes in living organisms over time.
The subject about to undergo a PET scan is injected with a substance common to its body that is known to accumulate in the region of interest or to take part in the metabolic process that is to be observed. This substance, called a tracer, is prepared to include positron-emitting nuclei. The nuclei are carefully chosen for an optimal balance between patient safety and imaging effectiveness. After injection the subject's body metabolizes the tracer.
The tracer emits positrons at a rate specific to the chosen radioactive nu-
cleus. The subject is placed inside the PET scanner where the radioactive
emissions can be measured very precisely. The PET scanner is able to detect single positron emissions. The subject remains in the scanner for a certain amount of time and the tomograph continues to measure positron emissions. After the scan is completed, complex algorithms are applied to the acquired
data to calculate 3D images of the measurements. The images then show in
which regions of the subject’s body the tracer accumulates the most or how
it is transported and distributed over time.
A common application for this scanning modality is cancer staging and
examination following cancer treatment. Many types of cancer cells have an unusually high metabolism and thus require disproportionate amounts of sugar.
To detect these cancer cells, a positron emitter is attached to sugar molecules
and injected into the patient’s body. Cancerous tissue takes up more of the
tracer than healthy tissue. In the resulting images cancer cells show more
positron emissions than healthy cells.
The current application range of Positron Emission Tomography can be
divided into medical applications and research applications. In the area of
research PET is used in brain studies to explore human brain functions. In
pre-clinical studies (before testing in humans) specific animal PET devices
are very common. PET technology is particularly valuable in cancer research
with animals as it allows repeated probing of the same subject without killing
it, thus substantially reducing the number of animals required.
The three important areas in clinical usage are oncology (cancer diagno-
sis and treatment), cardiology (treatment of heart disorders) and neurology
(treatment of disorders of the nervous system). PET applications for cancer
diagnosis and treatment include the detection of tumors, the grading of ma-
lignancy by studying the uptake of metabolic tracers, the documentation of
how widespread the cancer is in a patient, the testing for returning cancer
after treatment and the evaluation of effectiveness of cancer therapy. In
cardiology the applications include the preliminary examination of patients
before heart transplantation and the diagnosis and assessment of severity of
coronary artery disease. In neurology PET is for example used for the man-
agement of brain tumors. Furthermore, PET has been shown to be a superior method for the early diagnosis of dementias [16]. This could become a very important
application as soon as effective treatments are available.
PET system manufacturers try to increase the performance of their sys-
tem with every generation of new scanners. The performance of PET sys-
tems for clinical use can in general be measured by the quality of the gen-
erated images and the number of patients that can be scanned in a given
time frame. The image quality is optimized by using better, more sensitive
scanners capable of higher resolutions that are able to measure the emitted
positrons in a very exact way. This results in newer scanners producing data
sets that are multiple times larger than data from earlier scanner genera-
tions. An example for this is the switch from 2D data acquisition to 3D.
Additionally, physicists are modeling the data acquisition process in more detail than ever, allowing very powerful image reconstruction techniques to remove noise and compensate for inaccurate measurements. Examples for
this are attenuation correction where the densities of the patient’s body are
taken into account during image calculation and PSF reconstruction where
the point spread function of a scanner is considered.
The measured datasets get larger and larger and the image calculation or
reconstruction process gets more complex and accurate. At the same time
newer scanners are expected to have a higher patient throughput. This is
an important factor for clinical use as it determines how many patients can
undergo the possibly life-saving diagnosis. The number of patients scanned
in a given time frame depends on a lot of factors including scanner sensitivity,
the tracer and the clinical workflow.
For image reconstruction the workflow dictates one rule-of-thumb stan-
dard: as soon as one scan is completed, the image of the last scan should
be ready and available. And this should be true not only for scans from
different patients where there might be a longer time period between scans
but also for whole-body scans. During a whole body scan one part of the
body is scanned for a period of time and then the patient is moved so that
another part of his body can be scanned. Assuming a scanning duration of
5 minutes for one part of the body, the image reconstruction system has at
most 5 minutes to calculate an image.
This requirement in conjunction with the growing data sizes and the
more complex reconstruction techniques increases the demand for better per-
formance of the reconstruction system. Current implementations are highly
optimized but reach the computational limits of modern high-performance
computer systems. It is therefore sensible to look for new ways to speed up
the reconstruction process.
The enormous computational performance of graphics devices for gaming or demanding workstation tasks has long been known, and for a while now academia and industry have tried to leverage the potential of these devices for general purpose computation. With the release in the last two years of several tools that simplify the process of porting algorithms to the GPU, such as BrookGPU, CUDA and CTM (now Stream SDK), general purpose computation on graphics processing units has experienced widespread adoption in research communities as well as in industry.
In this work existing algorithms for PET image reconstruction are ported
to graphics processing units using the CUDA framework. They are modi-
fied to fit the programming paradigms of GPGPU and new algorithms are
implemented where necessary. A number of optimization techniques are in-
vestigated and applied to the implementation. The algorithms are tested
for performance and accuracy, as the goal is to speed up the reconstruction while the calculated results should not differ from those of the original implementation. After successful implementation, optimization and validation the
algorithms are integrated into the reconstruction system of Siemens Medi-
cal PET scanners.
1.1 Previous work
Over the last years a substantial amount of research has gone into speeding
up image reconstruction for positron emission tomography as well as other
imaging modalities such as single photon emission computed tomography
(SPECT) and computed tomography (CT). Scientists and developers came
up with many ideas to keep up with the ever increasing demand for image
quality and reconstruction speed.
In the 1990s researchers tried speeding up reconstruction using powerful
processing clusters such as the Intel iPSC/2 or the BBN Butterfly [11].
Atkins et al. used a Transputer with T800 processors in a master-worker
architecture [3] and Jones et al. developed dedicated VLSI hardware for
image reconstruction [25]. In 1995 the successor of the Intel iPSC/2, the
Intel iPSC/860, a cluster with 128 nodes in a hypercube topology was used
by Johnson et al. [23].
One important improvement of PET image reconstruction was the invention of the ordered subset expectation maximization (OSEM) algorithm by Hudson and Larkin [18] that increased
the performance of iterative image reconstruction algorithms by an order
of magnitude and thus permitted the use of this superior reconstruction
method in clinical environments for the first time. Based on this algorithm
many researchers achieved even greater speedups by using different hard-
ware.
A lot of research focused on the use of parallel processors and larger
computational clusters. In 2001 Vollmar et al. presented impressive results
with an implementation running on a 7 node 4 processors per node Intel
Xeon system [52]. In 2002 Jones et al. [24] used a single program multi-
ple data implementation of the OSEM algorithm running on a cluster of 9
nodes each containing two Intel Xeon processors interlinked via Gigabit Eth-
ernet. The group took two different approaches towards decomposing the
problem domain: projection space decomposition and image space decom-
position. The image space decomposition idea is very similar to the CPU
implementation enhancements presented in this work. The main problem
with these approaches is the costly computational cluster that is required for
reconstruction. Due to their size and infrastructure requirements they are
often not suitable for use in clinical environments. Additionally, in today's environment-aware economy the amount of power the reconstruction system requires is also a factor.
Nevertheless in 2006 Jones et al. published interesting results achieved
with a parallel OSEM implementation using both message passing (MPI)
and shared memory programming (OpenMP). Their algorithm spread all
necessary projections of one subset not only across multiple computational
nodes but also across processor cores inside one node. With this tech-
nique the communication overhead between nodes decreased as multiple
cores could share memory which resulted in an adequate speedup. However,
the communication overhead was still high, as the algorithm required all
nodes to synchronize and exchange all their datasets after the calculation
of one subset to calculate the correction factor and finally the image, which
is used for the next subset projection. They achieved a parallel efficiency
of about 0.5 when utilizing 64 processors and were thus able to speed up
reconstruction time by a factor of 30.
Although the components in such a cluster can be cheap commercial off-the-shelf parts, the cost of the reconstruction system increases with the number of processors, especially when considering not only the processors but also the node interconnects. In addition, a computational cluster with 64 or more nodes consumes a lot more power than a normal system would and requires a lot more space and possibly special cooling. Also the mean time between failures is much lower in a distributed system, as it contains more components, which results in a higher overall failure rate. Thus such large computer
systems are often not practical for products designed for clinical use. For
clinical applications the challenge is to find a compromise between cost and
performance of the reconstruction system.
In search of better solutions researchers came up with the idea of using
graphic processing units for image reconstruction. One of the first works
yielding impressive results early on was presented in 1994 by Cabral et
al. [9]. They showed volume rendering and tomographic reconstruction
algorithms for computed tomography utilizing an SGI Onyx Reality Engine
processor. They implemented the filtered backprojection algorithm for CT and a volume rendering algorithm, which they found to be very similar in nature, on the graphics processing unit.
About 10 years later in 2005 Wang et al. [54] presented an OSEM im-
plementation for quantitative single photon emission computed tomography (SPECT)
for graphics cards including attenuation correction and adjustments for
the characteristic point spread functions of the scanner. Using an NVIDIA GeForce FX 5900 graphics card with 256 MB of video memory they achieved an impressive 10-fold speedup. Their work represents a very cost effective
solution for quantitative SPECT in clinical environments.
In 2006 Bai et al. [4] presented image reconstruction for PET using
graphics hardware. They used hardware very similar to the GPUs used in this work; however, it did not yet feature the G80 core, and the CUDA framework was not available at that time. This constrained the developers to a certain extent, as random writes to graphics memory were not possible and the OpenGL API and Cg had to be used. They utilized the bilinear interpo-
lation routines of the hardware texturing unit and were able to calculate
4 LORs at a time utilizing the RGBA values of a pixel. Although their
system could not outperform a highly optimized CPU implementation, the
GPU represented a cheaper alternative to CPUs.
In 2007 Panin and Kehren filed a patent [43] for an acceleration of Joseph's method for full 3D reconstruction. The patent describes a method to speed up the calculation of projection rays through pixel images by calculating the line integral with linear interpolation. The speedup achieved with the method described in the patent results from reusing intermediary results that are already calculated for the oblique sinogram segments. The projector algorithms presented in this work are based upon this method.
In 2007 and 2008 two groups did work similar to [43] as they
took another close look at the OSEM reconstruction algorithm and in par-
ticular at the projectors and found that they can be implemented more ef-
ficiently. Kadrmas [27] found that the projection can be formulated as two
separate operations, namely image rotation and image slanting. Kadrmas
also found that by using these operations, redundant parts of the projection
algorithm can be omitted. This greatly improved the performance of the
algorithm. Additionally depth compression along the z axis was applied to
further reduce the computational complexity. Also the idea to reshuffle the
indices to efficiently access the memory was explained.
Similarly Hong et al. [17] also presented an algorithm for the CPU that exploits symmetries in the projections. They achieve a 70-fold speed-up when comparing their implementation to an older reconstruction system by Siemens with 8 computational nodes while maintaining good numerical accuracy. Amongst other things they exploited the SSE instructions of modern Intel CPUs to calculate 4 symmetric elements in one step.
Finally in 2007 Scherl et al. [47] presented a fast algorithm for the recon-
struction of computed tomography sinograms on the GPU which used the
CUDA architecture. Although CT reconstruction is typically not iterative like the OSEM algorithm, CT reconstruction and PET reconstruction share the same basic operation: projection. Scherl et al. were able to achieve impressive results utilizing the same
hardware that was used in this thesis. They however used the texturing
unit of the device to calculate the bilinear interpolations. The group com-
pared their work to an implementation on a CELL processor and ascertained
similar performance.
1.2 Acknowledgement
The author thanks his supervisors Markus Kucera for guiding him through
the writing process of this thesis and Frank Kehren for his patience, excellent
advice and for giving him the opportunity to work on such an interesting
topic. Above that the author would like to thank the entire reconstruction
group from Siemens Healthcare Molecular Imaging, especially Zigang Wang,
Jicun Hu and Chuanyu Zhou, as well as Swetha Nandyala, Keith Clark
and Bernd Schaarschmidt. The author very much enjoyed working with the
group and is grateful for their ideas and suggestions as well as their support.
Special acknowledgment is due to Herbert Kopp and Manfred Kraus who
made this thesis possible in the first place.
Chapter 2
Positron emission tomography
2.1 Physics
In the following section the physical principles behind PET are explained.
Three physical phenomena lay the foundation for PET and are of great
importance: positron decay, positron annihilation and scintillation. The
following explanations are based on the Rutherford-Bohr model ([6] and [46])
of the atom as a theoretical model. It makes the following assumptions: An
atom consists of a nucleus and electrons. The nucleus contains Z protons and
N neutrons and their sum is the atom’s mass number A = Z + N . Protons
contribute the positive charge to the atom. Electrons are positioned in
energy levels or shells that surround the nucleus and contribute the negative
charge. The common nomenclature used to denote the configuration of a specific nucleus is $^{A}_{Z}\mathrm{X}$.
2.1.1 Radiation and Radioactive Decay
Radiation in general is energy in the form of particles or waves. Depend-
ing on its effect on matter it can be divided into ionizing and non-ionizing radiation. Examples of non-ionizing radiation are light or radio waves. Of
particular interest for nuclear medicine and radiological imaging is the en-
ergetic ionizing radiation. Ionizing radiation is emitted during radioactive
decay, a process by which an unstable parent nucleus decays into a more
stable descendant nucleus [45] by emitting radiation, ionizing particles or
both. The rate of decay of N atoms can be statistically approximated by the first-order differential equation for exponential decay:

$$\frac{dN}{dt} = -\lambda N \qquad (2.1)$$
The solution for 2.1 describes the number of unstable atoms that are left after a certain amount of time t with N_0 depicting the number of atoms at time t = 0 and λ as the decay constant:

$$N(t) = N_0 e^{-\lambda t} \qquad (2.2)$$
The half-life $t_{1/2} = \frac{\ln 2}{\lambda}$ characterizes the decay rate of unstable atoms by specifying the time required for half of the unstable atoms to decay. Since decay events happen autonomously and continuously the process of radioactive decay is very similar to a Poisson process. Therefore a Poisson distribution can be assumed when creating a mathematical model for nuclear decay:

$$P(k, \lambda) = e^{-\lambda} \frac{\lambda^{k}}{k!} \qquad (2.3)$$
Equation 2.3 describes the probability that within a specific time interval k
decay events occur where the expected number of events during this time
interval is λ. For PET imaging a specific type of radioactive decay is of
particular interest: positron decay.
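To make equations 2.1 to 2.3 concrete, the following short sketch evaluates them numerically for a hypothetical 18F sample; the half-life is taken from Table 2.1, while the initial atom count and the example arguments are illustrative assumptions only.

import math

# Illustrative values: 18F half-life from Table 2.1 (109.8 minutes),
# arbitrary initial number of unstable atoms.
half_life_min = 109.8
decay_const = math.log(2) / half_life_min   # lambda from t_1/2 = ln2 / lambda
n0 = 1.0e9                                  # atoms at t = 0

# Equation 2.2: atoms remaining after t minutes
def atoms_remaining(t_min):
    return n0 * math.exp(-decay_const * t_min)

# Equation 2.3: Poisson probability of observing k decays when the
# expected number of decays in the interval is lam
def poisson_prob(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

if __name__ == "__main__":
    print(atoms_remaining(109.8))      # roughly n0 / 2 after one half-life
    print(poisson_prob(3, 2.5))        # probability of 3 events when 2.5 are expected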
2.1.2 Positron Decay
Positron decay happens in proton rich, unstable atoms. They achieve sta-
bility by converting a proton to a neutron whereas the positive charge is
emitted from the nucleus in the form of a positron as in:
$$^{1}_{1}\mathrm{p}^{+} \longrightarrow {}^{1}_{0}\mathrm{n} + {}^{0}_{1}\beta^{+} + \nu \qquad (2.4)$$
ν is a neutrino that accompanies the positron. Positron decay is in some sense the opposite of β−-decay, where an electron is emitted. The general equation for positron decay from an atom is:
$$^{A}_{Z}\mathrm{X} \longrightarrow {}^{A}_{Z-1}\mathrm{Y} + {}^{0}_{1}\beta^{+} + \nu \qquad (2.5)$$
To balance the atom's charge after decay the resulting nucleus has to eject
one of its orbital electrons. A process called internal conversion is often
responsible for this. The electron is ejected with kinetic energy specific to
the atom. The emitted positron also has kinetic energy and both electron
and positron therefore travel a finite distance outside the nucleus. The range
depends on their kinetic energy and the medium they travel in.
Nuclide   t_1/2 (mins)   E_max (MeV)   E_mode (MeV)   Max range in H2O (mm)   Mean range in H2O (mm)
11C       20.4           0.959         0.326          4.2                     1.1
13N       9.96           1.197         0.432          5.1                     1.5
15O       2.03           1.738         0.696          7.3                     2.5
18F       109.8          0.633         0.202          2.4                     0.6
68Ga      68.3           1.898         0.783          8.2                     2.9
82Rb      1.25           3.40          1.385          14.1                    5.9

Table 2.1: Properties of positron-emitting atoms (reproduced from [5])
Table 2.1 shows the properties of selected positron emitting nuclei. The properties include the half-life $t_{1/2}$ of the nucleus, the maximum and most likely energy of the emitted positrons E_max and E_mode, and the maximum and mean range in water. The energies are roughly proportional to the distances a positron travels. These properties are important when selecting a suitable nuclide for PET. The location of the positron decay event has to be measured as exactly as possible, but the further away the positron travels from the nucleus, the greater the uncertainty about where the event actually occurred.
As a result of emitting a positron and an electron, the atom is at least
2 electron masses lighter than before. There’s also a chance that a proton
rich atom does not achieve stability by positron decay but by capturing an
electron.
2.1.3 Positron Annihilation
After the positron is emitted from the nucleus it has kinetic energy so it
travels a certain distance away from the nucleus until its kinetic energy is
close to zero. When the positron then collides with an electron the two
particles annihilate and leave electromagnetic radiation. The most probable
form of radiation emitted as a result of this collision is a pair of 511 keV photons; with a probability of less than 1% three photons are emitted instead. If the kinetic energy of the colliding particles
is zero the photons are emitted at 180◦ to each other. In many cases the
momentum of the two particles is not precisely zero before annihilation so
photon pairs are not always emitted strictly at 180◦.
Figure 2.1: Decay and Annihilation exemplified by $^{18}_{9}\mathrm{F}$
Figure 2.1 illustrates positron decay and annihilation exemplified by the fluorine isotope $^{18}_{9}\mathrm{F}$, one of the most commonly used nuclei for PET today.
2.1.4 Radiation Detection and Scintillation
A couple of different techniques exist to detect radiation: ionization cham-
bers, Geiger-Mueller tubes, semiconductor radiation detectors and liquid
and solid scintillation detectors. Although there has been research towards
semiconductor detectors for medical imaging using the Schottky CdTe de-
tector diode [14] and CdZnTe detectors [15] solid scintillation detectors are
the predominant technology in PET. For SPECT however usage of semicon-
ductor radiation detectors is on the rise.
A solid scintillator is a transparent crystal that has the ability to emit
light when being hit by γ rays. For this principle the assumed model of
the atom has to be expanded. Electron energy states are limited to discrete
energy levels [50]. A valence band is the highest energy band of a discrete
energy level of an atom with electrons present. The following first unfilled
band is called the conduction band. The energy gap E_g between valence and conduction band is a few electron volts wide [5]. The electrons in the highest energy band can absorb energy from incident γ rays and get excited. They move into the conduction band and de-excite almost instantly, releasing scintillation photons. The emitted light is usually in the ultraviolet spectrum;
however by incorporating impurities into the scintillation crystal the band
structure can be manipulated so that the crystals emit visible light. This
light can be captured by photomultiplier tubes that are thus able to convert
the measured photons into an electric signal. This process is also called
luminescence. The design of scintillation detectors is described in 2.2.1.
Property                     NaI(Tl)   BGO    LSO    YSO    GSO    BaF2
Density ρ (g/cm³)            3.67      7.13   7.4    4.53   6.71   4.89
Effective Z (Z_eff)          50.5      74.2   65.5   34.2   58.6   52.2
Decay constant (ns)          230       300    40     70     60     0.6
Light output (photons/keV)   38        6      29     46     10     2
ΔE/E (%)                     5.8       3.1    9.1    7.5    4.6    4.3

Table 2.2: Properties of commonly used scintillators (reproduced from [5])
Three properties of the scintillator influence the quality of the mea-
surement and thus affect system design and achievable image quality. The
scintillator crystal first of all has to be able to effectively stop the 511 keV photons, because if a photon does not deposit its energy into the crystal the γ ray is not counted. The density ρ of the scintillator and the effective atomic number Z_eff determine how effective a crystal is in stopping 511 keV photons. Secondly the signal decay time is important. Once a photon hits the scintillator, the event
should be detected and processed immediately so that the detector is ready
to measure new events as they hit. This is quantified by the decay constant.
The light output of the scintillator is a third important property. The higher
the light output the less sensitive the photo detectors have to be resulting in
a high ratio of number of crystals to number of photo detectors (see 2.2.1).
The light output also determines the energy resolution △E/E which affects
the detector's ability to recognize and reject scattered events. Scattered
events contribute noise and originate from photons that are deflected and
lose some of their energy. Table 2.2 lists some commonly used scintillators
in PET and their properties. Vendors use different types of scintillators; the most common today are LSO (Siemens), a combination of LSO and YSO [29] (Philips), and BGO [49] (General Electric).
2.1.5 Positron Emitter Production
The positron emitting isotopes are created with the help of a cyclotron.
The halogens (elements from group 17 of the periodic table) are particularly
suitable positron emitters in PET imaging, especially fluorine. The unsta-
ble halogens are produced with the help of a cyclotron, a form of particle
accelerator that is capable of producing fast protons. In the case of 18F fluorine production, 18O oxygen in highly enriched water is bombarded with very fast protons produced in a cyclotron.
Figure 2.2: Functional diagram of a cyclotron
Figure 2.2 illustrates the functional principle of a cyclotron. Protons are emitted from the center S of the system and are accelerated by the difference in potential in the gap between electrodes D1 and D2. The difference in potential is generated with a high frequency oscillator connected to the two electrodes. Particles experience acceleration only in the gap and not inside the electrodes. A magnetic field B is applied orthogonally to the electrodes and forces the protons onto a circular trajectory. Every time they reach the gap between D1 and D2 the polarity of the electrodes has changed and the protons are accelerated again. The protons thus circulate inside the system, getting faster and faster while their trajectory gets larger and larger, until they are fast enough to leave the cyclotron. They then hit oxygen-18 to produce a neutron and fluorine-18 [5].
Figure 2.3: Chemical structure of Glucose compared to Fluorodeoxyglucose
The fluorine-18 is then used to mark sugar molecules and produce fluorodeoxyglucose. Figure 2.3 shows the chemical structure of a regular sugar molecule and the structure of a sugar molecule with the positron emitter 18F attached.
Many medically relevant metabolic processes consume sugar. If sugar
marked with radioactive nuclei is introduced into those processes, the sugar
is consumed. Thus metabolic processes that use a lot of sugar can be made
visible with PET scanners. Cancerous tissue often has a disproportionately
high growth rate and thus uses a lot of sugar as its energy source. Hence it
is possible to detect many different kinds of tumors with PET.
2.2 Data Acquisition
This section discusses how the data is acquired within a PET
system. The design of scintillation detectors is sketched and two different
acquisition modes are explained, namely 2D and 3D measurement. Addi-
tionally the challenges of the acquisition process are listed. Furthermore
data compression principles are explained because they are an integral step
of the reconstruction process.
2.2.1 Scintillation Detectors
One key to accurate data acquisition is to determine the precise location of
the gamma ray hitting the detector. The solution lies in the scintillation
detector design. The idea is to have many detectors that are as small as
possible to get high resolution. One problem arising when trying to use
many detectors is the size and the number of photomultipliers that are used
to detect the photons emitted while scintillation occurs. Their mode of
operation is as follows: they are able to amplify the signal from the crystals
and give off a short current pulse indicating that a gamma ray was detected.
The smaller and closer together the scintillation crystals are, the better the detection efficiency and the sampling frequency. The problem is that
using a phototube for each scintillation crystal is not practical. Therefore
a method devised by Casey and Nutt [10] is used in modern PET scanners
to reduce the number of photomultipliers and at the same time improve the
spatial resolution of the detection system. The idea is to group a block of
scintillation crystals together, for example 8 ∗ 8, and use 4 photomultipliers
to measure their light output. The 4 photomultipliers are able to determine
which one of the 8 ∗ 8 crystals was hit with the help of a lightguide that is
located between the crystals and the photomultipliers.
When a gamma ray hits a crystal the energy is converted to light. The
light is transmitted via the lightguide to all the phototubes. Depending on
their combined measurements they are able to determine which crystal was
hit. The x and y coordinate of the hit crystal within one detector block can
be calculated with
$$x = \frac{(b + d) - (a + c)}{a + b + c + d} \quad \text{and} \quad y = \frac{(a + b) - (c + d)}{a + b + c + d} \qquad (2.6)$$
with a, b, c and d representing the light output of each of the four pho-
tomultipliers. Figure 2.4 shows the schematic layout of one detector block.
The 8 ∗ 8 crystals, the lightguide and the four circular photomultipliers are
shown.
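As a concrete illustration of equation 2.6, the following sketch computes the relative crystal coordinates from four hypothetical photomultiplier readings; the numeric values are made up for illustration and do not correspond to a real detector block.

def crystal_position(a, b, c, d):
    """Compute the relative (x, y) hit position within a detector block
    from the light output a, b, c, d of the four photomultipliers (eq. 2.6)."""
    total = a + b + c + d
    x = ((b + d) - (a + c)) / total
    y = ((a + b) - (c + d)) / total
    return x, y

# Hypothetical readings: most light is seen by phototubes b and d,
# so the hit crystal lies towards the positive x side of the block.
print(crystal_position(a=10.0, b=35.0, c=12.0, d=43.0))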
2.2.2 Detected Events
Whether an event is detected and counted as valid depends on three factors.
The two gamma rays have to hit the detectors within a specific time window
for the system to recognize the rays as belonging together and forming an
Figure 2.4: Schematic layout of scintillation detector block
event. This is known as the coincidence window. Each of the gamma rays has to deposit a specific amount of energy into the scintillation crystals that is within specified boundaries. Lastly the ray formed by the two gamma rays has to be an accepted line of response (LOR), that is a trajectory that the system deems relevant. If an event conforms to all three criteria it is called a prompt event or prompt. However, false positives are possible for various reasons.
Figure 2.5: True and Scattered positron events
Figure 2.5 shows true and scattered positron events as they are detected
by the PET scanner. In the left image a true coincidence is shown that is
detected correctly. The two gamma rays reach the detector ring on opposite
sides, form a valid LOR and are detected within the energy and time window.
The image on the right shows a scattered event. One gamma ray is Compton
scattered [12] inside the subject. The detector system does not recognize
the scattering. Scattered rays can be well inside the energy window of the
detectors and thus the system can not distinguish between scattered events
and prompts. As a result a LOR between the two detectors is assumed
which does not correspond to the actual location of the positron decay.
Figure 2.6: Random and Multiple coincidences
Figure 2.6 shows a random event on the left side where two annihilations take place at almost the same time. The detection system recognizes two gamma rays that did not originate from the same annihilation as a correlated coincidence, as they are detected within the coincidence time window. As a
result a wrong LOR is assumed and a wrong count is registered. The right
image shows almost the same event however three photons are detected
within the time window. The system can not determine the correct line of
response so the counted photons are discarded and the events are lost.
Figure 2.7: Uncertain and Attenuated positron events
Figure 2.7 finally shows an uncertain event on the left side. At the edge of the field of view emitted photons may hit several detector crystals by passing through them. They only graze the first detector and deposit some energy there but are not stopped. Thus they are not counted in the first detector crystal they hit and as a result the photons are detected in the wrong crystal and a wrong line of response is assumed and counted. This is called the parallax effect. Finally the last image on the right shows an attenuated photon that
is not counted by the detection system as it is no longer inside the energy
window. This results in a lost count because inside the time window the
system detects only one photon which is then discarded.
There are different techniques to work around those problems. For ex-
ample the detection system can be optimized so that the detection time
window is very tight. Additionally the detection window can be optimized
so that only protons with the correct energy are detected. Another way to
reduce the effects of inaccurate measurement is to account for those prob-
lems during reconstruction. Multiple algorithms are available to correct for
the effects of random, scattered and attenuated events.
2.2.3 2D and 3D Data Acquisition
There are two different acquisition modes in PET imaging. Older scanners
often operate in 2D acquisition mode. In this mode septa made from lead or
tungsten separate each crystal ring from one another, thus confining the axial angle of possible lines of response. Gamma rays that move towards the
detection system on an oblique angle hit the septa and are absorbed. Only
events that hit crystals on the same ring or on neighboring rings are counted.
Those events follow a trajectory perpendicular or nearly perpendicular to the
axial axis of the scanner. Figure 2.8 shows the different possible gamma ray
trajectories for 2D acquisition mode (top) with the septa in place between
the crystals and for 3D acquisition mode (bottom) without the septa.
2D acquisition is not optimal due to the aforementioned restrictions
that essentially result in loss of events. Only a very small fraction of all
emitted events (about 0.5%) are detected. Removing the septa results in a
lot more possible trajectories, more detected events and therefore a higher
scanner sensitivity but also a higher fraction of scattered events. Photons
with flight paths oblique to the axial axis of the scanner are measured too.
The benefit in this increase of sensitivity can be seen after reconstruction.
Images reconstructed from 3D measurements show a statistically significant
reduction in image noise compared to those reconstructed from 2D mea-
surements [30]. 3D data files however can be multiple times larger than 2D
data files as they contain a lot more different LORs. Reconstruction time is
significantly influenced by the size of the input data and fully 3D reconstruc-
tion adds a certain amount of complexity to the reconstruction algorithm
and its implementation. Fully 3D image reconstruction can take up to an order of magnitude longer than 2D reconstruction.
Figure 2.8: Difference between 2D and 3D data acquisition mode
2.2.4 Optimization
To improve the sampling in axial direction, intermediate slices are created.
They contain LORs from two neighboring crystal rings. Figure 2.9(a) shows
that this results in 2 ∗ Rcr − 1 samples in axial direction, where Rcr is the number of crystal rings in the scanner. The dotted lines represent the ad-
ditional gamma ray trajectories that are taken into account and the dashed
lines visualize the intermediate LORs.
The number of parallel LORs that are accepted by the detection system
is limited. The LORs at the edges of the gantry are not relevant because the
subject is positioned in the middle. For example as shown in Figure 2.9(b)
the horizontal sinogram plane with θ = 0◦ contains LORs from 315◦ to
45◦. This is a very effective method to remove unnecessary data in order to
save memory and reduce the computational complexity of the reconstruction
process. It also avoids bad parallax effects. The volume of all LORs that
are not discarded form the field of view (FOV) of the scanner.
Figure 2.9: Efficient data structuring. (a) Intermediate samples, (b) sinogram width and relevant LORs
2.3 Coordinate System, Data Structures and Compression
This section describes the coordinate system used for PET systems and
two different ways to interpret the data. It explains how the essential data
structures are constructed and how the data can be compressed for easier
storage and faster image reconstruction.
2.3.1 Coordinate System
The coordinate system is defined so that every possible line of response can
be described explicitly. There are 4 parameters describing a line of response:
the inclination along the axial axis of the scanner φ, the projection ρ which
is the transaxial axis intercept, the azimuthal angle in the transaxial plane
θ and the axial axis intercept z . Figure 2.10 illustrates this concept. While
the range of θ is a full cycle the axial inclination φ is only a couple of
degrees depending on the depth of the scanner. This is the coordinate
system for data measured in 3D acquisition mode. 2D data is missing the
first parameter - the inclination along the axial scanner axis.
The sinogram can thus be seen as a four-dimensional array that contains
in each of its elements the number of coincidence events that were detected
along the corresponding LOR. One element of this array is called a bin. A
sinogram view consists of all projections for a given projection angle θ.
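To make this data layout tangible, the sketch below stores a 3D sinogram as a four-dimensional array indexed by (φ, θ, z, ρ) and extracts one view; the ordering of the axes and all dimensions are invented for illustration and do not correspond to a particular scanner.

import numpy as np

# Invented dimensions for illustration: number of axial inclinations,
# azimuthal angles, axial slices and radial bins.
n_phi, n_theta, n_z, n_rho = 5, 168, 109, 336

# One possible layout of a 3D sinogram: each element (bin) holds the number
# of coincidence events counted along the corresponding LOR.
sinogram = np.zeros((n_phi, n_theta, n_z, n_rho), dtype=np.uint32)

# Register a count for an LOR described by its four coordinates.
sinogram[2, 45, 60, 170] += 1

# A sinogram view: all projections belonging to one projection angle theta.
view = sinogram[:, 45, :, :]
print(view.shape)   # (n_phi, n_z, n_rho)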
Figure 2.10: The PET coordinate system
The angle φ depends on the transaxial axis intercept due to the circular
nature of the detector. LORs further away from the center have a steeper
angle φ than LORs at the center. The LORs form a distorted plane. Figure
2.11 shows this issue in an exaggerated manner. This has to be considered
when modeling the scanner geometry for the reconstruction process.
Figure 2.11: Oblique LORs are not parallel
2.3.2 Parallel Beam Space and Line of Response Space Sinograms
Projecting the arc-shaped crystal ring onto a plane results in bins of different
width. The bin size depends on the radial coordinate resulting in wider bins
at the center of the projection and smaller bins at the edges. The sinogram
with varying bin size is called LOR space sinogram.
Figure 2.12: Comparison of LOR and PB space projection
A PB space sinogram can be calculated by applying radial arc correc-
tion with linear interpolation to the LOR space sinogram. The difference is
illustrated in Figure 2.12. Parallel beam space sinograms are easier to op-
erate on because a constant distance can be assumed between neighboring
bins. However, as a result of the linear interpolation during radial arc correction the bins are no longer statistically independent of one another, which can cause problems during further reconstruction steps that require statistically independent data.
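The arc correction described above can be pictured as a one-dimensional resampling per projection: the unevenly spaced radial positions of the LOR-space bins are interpolated onto an evenly spaced parallel-beam grid. The sketch below does this with linear interpolation for one projection row; the detector geometry values are made-up assumptions, not the geometry used later in this work.

import numpy as np

# Made-up geometry: 100 crystals on an arc of radius 400 mm seen by one view.
n_bins = 100
ring_radius = 400.0
crystal_angles = np.linspace(-np.pi / 4, np.pi / 4, n_bins)

# Radial positions of LOR-space bins: projecting the arc onto a plane gives
# unevenly spaced samples (wider near the center, narrower at the edges).
lor_positions = ring_radius * np.sin(crystal_angles)

# Evenly spaced parallel-beam radial positions over the same range.
pb_positions = np.linspace(lor_positions[0], lor_positions[-1], n_bins)

def arc_correct(lor_row):
    """Resample one LOR-space projection row onto the parallel-beam grid
    using linear interpolation (radial arc correction)."""
    return np.interp(pb_positions, lor_positions, lor_row)

# Example: a row of counts with a single hot bin.
row = np.zeros(n_bins); row[50] = 100.0
pb_row = arc_correct(row)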
2.3.3 Sinogram Compression
To reduce the required storage space, data transfer times and to simplify
computations done on 3D sinograms, different methods are applied to com-
press the data. A commonly used compression method is spanning. The
span concept is applied on the axial axis of the scanner and compresses di-
rect (φ = 0) and oblique events (φ ≠ 0) by merging them into one single
LOR. This is possible because the events are emitted from the same loca-
tion, their trajectory just differs in φ. The number of events merged into
one LOR defines the level of spanning. If for example 5 events from inter-
mediate planes and 6 events from direct planes are merged the dataset is
called a span 11 sinogram. So the number of events merged is denoted by a number following the word span. Figure 2.13 illustrates this principle.
Each group of solid and dotted lines is merged into one LOR as they originate from approximately the same location. The spanning reduces resolution at the edges of the FOV due to the information loss, but it also reduces the data size.
Therefore it is a trade-off and a small span is preferable. Siemens Healthcare
PET systems use mostly span 11.
Figure 2.13: The span concept for spans 3, 5, 7 and 9
As there are LORs with a larger angle φ than are covered by the initial
span, segments are created. Segments partition the LORs depending on
the steepness of their angle φ. Segment 0 contains all LORs with an angle
φ within span 11, segments −1 and 1 contain LORs with larger −φ and
φ respectively, and LORs with even larger φ are sorted into segments −2
and 2 and so on. The number of segments depends on the maximum ring
difference, the largest ring difference that is allowed for an LOR. This also
defines the maximum of φ. The segment concept is illustrated in Figure 2.14.
Groups of LORs with steep angles are sorted in Segment 1 and Segment -1
respectively.
As a side effect of the compression the axial resolution close to the border is reduced. In the border areas of the gantry the oblique events cannot
be measured as their trajectories only cross the detector system on one
side. Therefore the number of z slices varies from segment to segment. For
Figure 2.14: Segments 1, 0 and -1 for span 9 and mrd 13
segment 0 it can be calculated by:
$$\text{slices}(0) = 2 \cdot R_{cr} - 1 \qquad (2.7)$$

where R_cr is the number of crystal rings. This is true because of the
intermediate samples that have been introduced between crystals. For all
segments other than 0 the number of slices can be calculated by:
$$\text{slices}(n) = 2 \cdot R_{cr} - 1 - 2 \cdot \left( -\frac{span - 1}{2} + |n| \cdot span \right) \qquad (2.8)$$

where the expression in brackets equals the minimum ring difference of the segment, which equals the number of missing slices on one end of
the detector system. For each segment the angle φ can be calculated by
$$\varphi(n) = \begin{cases} \tan^{-1}\!\left(\dfrac{n \cdot span \cdot z_{crystal}}{G_{radius}}\right) & \text{if } n \geq 0 \\[2ex] -\tan^{-1}\!\left(\dfrac{-n \cdot span \cdot z_{crystal}}{G_{radius}}\right) & \text{if } n < 0 \end{cases}$$
where z_crystal is the size of one crystal in z direction and G_radius
is the radius of the gantry including the depth of interaction of a gamma
ray in the crystal. The depth of interaction is the distance a photon travels
inside the detector crystal before it has released all its energy.
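A small sketch of equations 2.7 and 2.8 and the segment angle formula, evaluated for a hypothetical scanner; the ring count, span, crystal size and gantry radius below are illustrative assumptions only, not a real product configuration.

import math

# Illustrative scanner parameters (assumed, not a real configuration).
n_rings = 55          # crystal rings R_cr
span = 11
z_crystal = 4.0       # axial crystal size in mm
g_radius = 437.0      # gantry radius incl. depth of interaction, in mm

def slices(n):
    """Number of z slices in segment n (equations 2.7 and 2.8)."""
    if n == 0:
        return 2 * n_rings - 1
    min_ring_diff = abs(n) * span - (span - 1) // 2
    return 2 * n_rings - 1 - 2 * min_ring_diff

def phi(n):
    """Axial inclination of segment n in degrees."""
    angle = math.atan(abs(n) * span * z_crystal / g_radius)
    return math.degrees(angle if n >= 0 else -angle)

for seg in (-2, -1, 0, 1, 2):
    print(seg, slices(seg), round(phi(seg), 2))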
A suitable way to visualize the span and segment configuration of a
scanner is the michelogram named after Christian Michel. In the diagram
the horizontal axis shows the crystal rings on one side of the scanner and
the vertical axis the rings on the other side of the scanner. A dot or an
asterisk denotes that those two crystal rings form an allowed gamma ray
trajectory (LOR). An empty bin specifies a forbidden connection of rings
due to the limitation given by the maximum ring difference. Connected
dots indicate that those rings together form one allowed sinogram plane.
Figure 2.15 shows a michelogram of a PET system with 55 crystal rings,
a maximum ring difference of 38 and 7 segments. The segments are the
diagonally connected areas.
To compare the data sizes, a dataset produced by an older scanner with 32 crystal rings, a maximum ring difference of 22 and 5 segments would be about 75.62 MB large (assuming 239 z, 288 ρ and 288 θ samples and 4 bytes for each bin). A typical dataset from a modern scanner like the one shown in the michelogram would be 240.74 MB large (assuming 559 z, 366 ρ and 366 θ samples). Without the compression however (i.e. span 1) the dataset from that scanner would be roughly 1192.51 MB large, assuming the number of possible LORs in z direction to be 2769.
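The sizes quoted above follow directly from multiplying the sinogram dimensions by the bin size; the small helper below reproduces the first example. The older-scanner dimensions are taken from the text, the function itself is just an illustration.

def sinogram_size_mb(z_samples, rho_samples, theta_samples, bytes_per_bin=4):
    """Size of a sinogram array in mebibytes."""
    total_bytes = z_samples * rho_samples * theta_samples * bytes_per_bin
    return total_bytes / (1024 * 1024)

# Older scanner example from the text: 239 z, 288 rho and 288 theta samples.
print(round(sinogram_size_mb(239, 288, 288), 2))   # ~75.62 MB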
Figure 2.15: Michelogram of a 55 ring, 38mrd, span 11 3D PET system
2.3.4 Other Data Structures
The image is a cuboid represented as a three-dimensional array with x, y and z coordinates. x and y correspond to the transaxial axis intercept ρ and its
azimuthal angle θ and thus describe the transaxial plane of the scanner. z
directly corresponds to the axial axis intercept of the sinogram or the depth
coordinate of the scanner.
A circular mask is used in conjunction with the x-y plane of the image to restrict algorithms to only access and process data that is relevant for reconstruction. This is done to restrict the image to the area of the FOV that is covered by all projection angles θ. The mask restricts the x coordinate of the transaxial plane depending on the y coordinate. A two-dimensional array stores x_start and x_stop values for each possible y. So for a given y only x coordinates between x_start and x_stop should be accessed.
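A minimal sketch of such a mask, assuming a square transaxial image with a circular FOV centered in it; the image size and radius are made up, and the array names simply mirror the x_start/x_stop description above.

import math

# Illustrative image geometry: 336 x 336 transaxial pixels, circular FOV.
nx = ny = 336
radius = nx / 2.0
cx = cy = (nx - 1) / 2.0

# For every y row, precompute the first and last x index inside the FOV.
x_start = [0] * ny
x_stop = [-1] * ny          # start > stop marks rows entirely outside the FOV
for y in range(ny):
    dy = y - cy
    if abs(dy) <= radius:
        half_width = math.sqrt(radius * radius - dy * dy)
        x_start[y] = max(0, int(math.ceil(cx - half_width)))
        x_stop[y] = min(nx - 1, int(math.floor(cx + half_width)))

# A reconstruction loop then only touches pixels inside the mask:
# for y in range(ny):
#     for x in range(x_start[y], x_stop[y] + 1):
#         process(image[x, y])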
2.4 Image Reconstruction
Image reconstruction is the process that covers all steps, methods and algo-
rithms that are necessary to generate an image from the scanner’s measure-
ments. It contains pre- and post processing steps, correction methods and
in the center of this process is the reconstruction algorithm. Two different
approaches to image reconstruction algorithms exist.
Analytic algorithms assume a continuous data sampling. The data is
discretized after reconstruction. The most important analytic algorithm is
Filtered Back projection. It has the advantage of being fast and allows an
easy control of the spatial resolution and noise correlations. Its drawbacks
are less spatial resolution due to smoothing in the filtering step and higher
image noise. The reason for this is that the algorithm is based on an in-
accurate model of PET physics that was originally developed for computed
tomography. Despite those shortcomings it still is a relevant reconstruction
method nowadays. The filtered back projection algorithm consists of two
basic steps. The scanner's measurements are first filtered to enhance the image, for example by reducing noise and increasing the contrast. After filtering the data is backprojected into the image.
In contrast to the analytic methods stand the algebraic algorithms. They
depend on the discrete representation of input data and reconstructed image.
The reconstruction problem is described as a linear equation system

$$b = A \cdot x \;\Rightarrow\; x = A^{-1} \cdot b \qquad (2.9)$$

where x is the image, b is the sinogram and A is the geometry description of the gantry. This linear equation system is solved with iterative optimization algorithms since A^{-1} cannot be calculated directly. The most common
method used to solve this LES is the Maximum-likelihood Expectation-
Maximization algorithm (MLEM). One of the most important evolutions in
PET in the last years has been the increasing role of iterative methods. They
are based on a more accurate model of the image acquisition process and
therefore improve the quality of reconstructed images. The major drawback
of those methods is the higher computational complexity which reflects itself
in the time it takes to reconstruct an image. The runtime of an iterative
algorithm can be several orders of magnitude longer than those of analytic
methods.
Figure 2.16: Comparison of analytic and iterative reconstruction algorithms
Figure 2.16 shows the difference in image quality between two common
analytic and iterative image reconstruction algorithms. The top row shows
the image reconstructed with filtered back projection; reconstruction time
was 10 seconds. The bottom row shows the image reconstructed with the
unweighted ordered subset expectation maximization algorithm, reconstruc-
tion time was 170 seconds. Due to less noise in the image calculated with
an iterative method it is easier to read and interpret the image. The image
reconstructed with filtered back projection is noisy and less detail is visible.
The superior image quality achieved with iterative methods makes them
the algorithms of choice. Additionally filtered backprojection does not work
very well with low count rates. Due to the need for short scan times, iterative methods are preferred.
2.4.1 Basic Principles
The LORs contained in the sinogram can also be interpreted as projections.
The process of projecting is equivalent to mapping the positron emissions
towards an angle θ in a three dimensional subject onto a two dimensional
plane. The projection consists of a number of line integrals over emissions with angle θ through the subject. Each line integral represents only one frac-
tion of the positron emissions through the subject, namely all the emissions
in the direction of the integral. For this reason tomographic reconstruction
for PET differs from tomographic reconstruction for CT. In CT the entire
attenuation of the X-ray through the body is known for any given angle
θ. In PET only the counts that were emitted in θ direction are measured.
Therefore the mathematical model for CT does not fit PET imaging per-
fectly, however reasonably good images can be reconstructed by filtered back
projection derived from CT.
Figure 2.17 shows the basic principle of projection. The numbers visible
in the subject are the number of emissions in θ = 90◦ direction. They are
counted over time and represented as a pixel image and a continuous function. This kind of projection is called forward projection as measurements from the subject are projected forward onto a plane. During a PET scan projections from all possible angles are generated by taking all emissions at all angles into account. If for example an event at the angle θ = 36◦ is measured, it is added to the equivalent sinogram view. Because the photons are emitted at approximately 180◦ to each other, the sinogram view at angle θ is equivalent to the view at θ + 180◦ and thus sinograms cover only the range [0◦, 180◦).
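As a rough illustration of forward projection on a 2D pixel image, the sketch below rotates the image by the projection angle and sums along one axis, which approximates the line integrals of one sinogram view; it is a simplified sketch with an invented toy image and not the projector used later in this work.

import numpy as np
from scipy import ndimage

def forward_project(image, theta_deg):
    """Approximate one sinogram view: line integrals of a 2D image along the
    direction given by theta, computed by rotating and summing columns."""
    rotated = ndimage.rotate(image, theta_deg, reshape=False, order=1)
    return rotated.sum(axis=0)

# Toy subject: a single hot spot in a 64 x 64 image.
image = np.zeros((64, 64))
image[40, 20] = 1.0

view_90 = forward_project(image, 90.0)   # projection at theta = 90 degrees
view_36 = forward_project(image, 36.0)   # projection at theta = 36 degrees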
Figure 2.17: Projection of positron emissions at θ = 90◦ into sinogram row
2.4.2 The Maximum-Likelihood Expectation Maximization Algorithm
Today the Maximum-Likelihood Expectation Maximization (ML-EM) algo-
rithm is one of the most widely used iterative algorithms in PET. Different
versions of the algorithm were developed and various improvements to the
algorithm are in use. The basic principle however is the same. A statistical
algorithm for calculating the maximum likelihood from incomplete data was
published by Dempster et al. [13] in 1977. Five years later Shepp and Vardi [48]
developed a new very accurate mathematical model of emission tomography.
Using this model they were able to use the EM algorithm for image recon-
struction. They showed that this algorithm reduced the statistical noise
artifact compared to algorithms in use at that time.
The input data of the algorithm consists of the emission sinogram m∗
and a 3D image λ. In the first iteration the image λ0 is homogeneous, for
example each voxel has the value 1. The algorithm calculates the most
likely image λ∗ that could have produced the emission sinogram at hand.
For that purpose it iteratively modifies the 3D image λn to approximate its emission mn to the measured emission m∗. This is done by applying a correction factor in each iteration. This factor is calculated by back-projecting the quotient of the measured emission sinogram m∗ over the emission sinogram mn that the current image produces. The informal equation 2.10 emphasizes
this principle for an iteration n.
$$\lambda^{n+1} = \lambda^{n} \cdot \text{back-projection}\!\left(\frac{m^{*}}{\text{forward-projection}(\lambda^{n})}\right) \qquad (2.10)$$
The normalization factor in equation 2.11 is necessary after each iteration. It
divides the image by a summation of all coincidence lines. This compensates
for areas of the FOV that are sampled with differing numbers of LORs.
\frac{1}{\text{back-projection}(1)} \qquad (2.11)
The formal termination condition is convergence of the solution. However
in practical applications it can either be a fixed number of iterations or
a quality factor of the reconstructed image. Sometimes visual judgment
of the image is called upon to determine when the reconstruction can be
stopped but PET system manufacturers also measure convergence behavior
and decide then on a recommended number of iterations.
Mathematical Model after Shepp and Vardi
This section reproduces the mathematical model of emission tomography
devised by Shepp and Vardi [48]. The image is seen as a density distribution
of emission events λ = λ(x , y , z ). The measured data is a function m∗(d)
representing the number of counts measured in each of D detector units d .
The problem is to estimate the density distribution λ from the measured
data m∗. The density distribution λ(x, y, z) is discretized into boxes b =
1, . . . ,B giving an unknown number of counts m(b) in each box. So we
want to estimate λ(b) or guess the number of unobserved counts m(b) in
each box.
Shepp and Vardi model this by defining λ(b) as the integral of λ(x , y , z )
over box b and m(b) as a Poisson distributed number with mean λ(b) that
is generated independently in each box as:
P(m(b) = k) = e^{-\lambda(b)} \frac{\lambda(b)^{k}}{k!}, \qquad k = 0, 1, \ldots \qquad (2.12)
They assume a probability
p(b, d) = P(detected in d | emitted in b) (2.13)
and conclude that the probability of an emission in b being detected at all
is given by
p(b) = \sum_{d=1}^{D} p(b, d) \leq 1. \qquad (2.14)
They further show that the density of the detected counts is equal to the den-
sity of emitted counts λ(x , y , z ) so that equality holds in 2.14. They note a
problem with this model regarding the discretization of λ(x , y , z ) into boxes
b = 1, . . . ,B . However they argue that it is an acceptable approximation
and more accurate compared to using transmission tomography models for
emission tomography. They also mention that their physics model is not
entirely accurate as it doesn’t take scatter, randoms and multiple emissions
into account.
ML-EM Algorithm for PET after Shepp and Vardi
From the above model Shepp and Vardi construct a likelihood function, cal-
culate the maximum likelihood and provide an iterative approach to max-
imization based on the EM algorithm. They give the likelihood function
by
L(\lambda) = P(m^{*}|\lambda) = \sum_{A} \prod_{\substack{b=1,\ldots,B \\ d=1,\ldots,D}} e^{-\lambda(b,d)} \frac{\lambda(b,d)^{m(b,d)}}{m(b,d)!} \qquad (2.15)
which gives the likelihood that the emission density distribution λ leads to
measurements m∗. They simplify this equation by taking the logarithm of it.
This is possible because of the increasing nature of the logarithmic function
which does not affect the maximization [28].
l(λ) = logL(λ) (2.16)
They go on to find the maximum of this function and conclude by
providing an iterative method for maximizing the log-likelihood l(λ)
using the EM algorithm [13]:
\lambda_{\text{new}}(b) = \lambda_{\text{old}}(b) \sum_{d=1}^{D} \frac{m^{*}(d)\, p(b,d)}{\sum_{b'=1}^{B} \lambda_{\text{old}}(b')\, p(b',d)}. \qquad (2.17)
λ is defined as the maximized probability distribution and p(b, d) can be
seen as the fraction that b contributes to the projection ray described by d .
create initial image λ0 filled with ones
load emission measurements m∗
n = 0
loop until image converges (λn ≈ λn−1 ≈ λ∗) or another termination condition is reached
    calculate forward projection mn of current image λn
    divide emission by forward projection mn = m∗/mn
    backproject the quotient mn into λtmp
    calculate next image λn+1 = λn ∗ λtmp
    n = n + 1
Figure 2.18: Basic ML-EM algorithm
An implementation of the maximum likelihood expectation maximiza-
tion algorithm would approximate the image λn over n iteration steps until
the result converges. In every iteration the image λn is forward-projected
into mn and then m∗ divided by mn is back projected. Figure 2.18 shows
this algorithm in pseudo-code.
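To make the structure concrete, the following minimal C-style sketch shows one such iteration (the helper functions forward_project() and back_project() as well as the array sizes are hypothetical placeholders, and the normalization factor of equation 2.11 is omitted):

/* assumed projector interfaces (hypothetical signatures) */
void forward_project(const float *image, float *sinogram);
void back_project(const float *sinogram, float *image);

/* One ML-EM iteration following Figure 2.18:
   lambda <- lambda * back_project( m_star / forward_project(lambda) ) */
void mlem_iteration(float *lambda, const float *m_star,
                    float *m_n, float *lambda_tmp,
                    unsigned int nvox, unsigned int nbins)
{
    forward_project(lambda, m_n);                 /* m_n = forward projection of current image */

    for (unsigned int i = 0; i < nbins; i++)      /* element-wise quotient in sinogram space */
        m_n[i] = (m_n[i] > 0.0f) ? m_star[i] / m_n[i] : 0.0f;

    back_project(m_n, lambda_tmp);                /* backproject the quotient into lambda_tmp */

    for (unsigned int b = 0; b < nvox; b++)       /* multiplicative image update */
        lambda[b] *= lambda_tmp[b];
}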
OS-EM Algorithm after Hudson and Larkin
The ML-EM algorithm described above requires many iterations until the
image converges. Each iteration requires a substantial amount of computa-
tion namely the forward- and back-projection of the entire dataset. In 1994
Hudson and Larkin [18] improved the ML-EM algorithm to increase the rate
of convergence thus reducing the algorithm’s complexity.
They introduce subsets and specify an iteration as one pass through all
subsets. Subsets are a means of partitioning the entire dataset in a way that
new, mutually different information is introduced into the calculation as
early as possible. Different information in essence means projections from
angles as far apart as possible.
Figure 2.19 shows the modification of the algorithm. Whereas the ML-
EM uses the entire dataset in each iteration and calculates the correction
factor λtmp for it, the OS-EM algorithm calculates the correction factor
for each subset. One subset contains much less data than the entire set,
and the projection steps are far more costly than the division m∗/mn and
the multiplication λn ∗ λtmp. Hudson and Larkin show that with this
modification to ML-EM a speedup of one order of magnitude is possible.
The speedup originates from the fact that less data has to be forward- and
back projected in one subset compared to one iteration of the original ML-
EM. At the same time the image approximated with one subset calculation
in OS-EM roughly corresponds to the image approximated with one full
iteration of ML-EM. So if 8 subsets are used, the image converges 8 times
faster when compared to regular ML-EM.
create initial image λ0 filled with ones
load emission measurements m∗
partition emission measurements m∗ into s subsets over angle θ
n = 0
for(i=0; i < iterations; i++)
    for(si=0; si < s; si++)
        calculate forward projection mn of current image λn for subset si
        divide emission by forward projection mn = m∗/mn
        backproject the quotient mn into λtmp
        calculate next image λn+1 = λn ∗ λtmp
        n = n + 1
Figure 2.19: Basic OSEM algorithm
It is practical to create the subsets at a view level where one view v
contains all projections with the same azimuthal angle θ. For this reason
the number of subsets s must divide the number of views #θ in the sinogram
(2.18).
#θ mod s = 0 (2.18)
Equation 2.19 shows how the set of views sx belonging to one subset x
can be calculated where #θ/s is the number of views per subset. Subsets
are enumerated within {1, . . . , s}.
s_{x} = \bigcup_{i=1}^{\#\theta/s} \left\{ (i-1) \cdot \frac{\#\theta}{s} + x - 1 \right\} \qquad (2.19)
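As a small illustration (function and variable names chosen for this sketch only), the view indices of subset x according to equation 2.19 could be enumerated like this:

#include <stdio.h>

/* Print the view indices belonging to subset x (1 <= x <= s),
   following equation 2.19; num_views must be divisible by s (equation 2.18). */
void print_subset_views(unsigned int num_views, unsigned int s, unsigned int x)
{
    unsigned int views_per_subset = num_views / s;
    for (unsigned int i = 1; i <= views_per_subset; i++)
        printf("%u ", (i - 1) * views_per_subset + x - 1);
    printf("\n");
}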
Chapter 3
Hybrid implementation of
PET image reconstruction
3.1 Projectors
The projectors are algorithms to project the sinogram into the image (back-
projection) or to project the image into the sinogram (forward-projection).
Forward-projection is essentially what happens during the data acquisition
process. During the iterative reconstruction process the data has to be back-
and forward-projected multiple times. Because of that the projectors are an
integral and at the same time the most time consuming part of the recon-
struction process. About 90% of the reconstruction time is spent calculating
forward- and backward-projections (not counting correction methods for e.g.
randoms or attenuation and other pre- and post processing steps). So when
thinking about speeding up the reconstruction process the projection algo-
rithms are the first thing to look at.
3.1.1 Ray projection through pixel images
The projectors are based on an algorithm called Joseph’s Method derived
from a paper published by Peter Joseph in 1982 [26]. In the paper the two -
at that time state of the art - techniques for reprojecting rays through pixel
images are discussed and a new, more accurate method is introduced. The
algorithm is explained for two dimensional space; however, the method can
easily be expanded to three dimensions. Joseph defines rays as straight lines
y(x) = -\cot(\omega)\, x + y_{0} \quad \text{or} \qquad (3.1)
x(y) = -\tan(\omega)\, y + x_{0} \qquad (3.2)
with ω representing the angle between x -axis and ray. It is assumed that
the image f(x, y) is a smooth function. Depending on ω Joseph defines the
line integral as an integral over x or over y:

\text{if } |\sin(\omega)| \geq \tfrac{1}{\sqrt{2}}: \text{ integral in } x \text{ direction} \qquad (3.3)

\text{if } |\cos(\omega)| \geq \tfrac{1}{\sqrt{2}}: \text{ integral in } y \text{ direction} \qquad (3.4)

which means that for flat angles between 45◦ and 135◦ and between −45◦ and −135◦
the line integral is

\frac{1}{|\sin(\omega)|} \int f(x, y(x))\, dx \qquad (3.5)

and for all other angles it is

\frac{1}{|\cos(\omega)|} \int f(x(y), y)\, dy. \qquad (3.6)
Joseph continues by approximating the respective integral by a Riemann
sum. Due to the case differentiation for ω, a ray never hits more than 2
pixels per row. Joseph calculates the exact location of the ray hitting the
respective row and calculates the linear interpolation between them. The
line integral is then the sum of the interpolations of each row multiplied
by the scaling factor. Assuming the ray hits column n at y(xn) which is
somewhere between pixel m and m + 1 the linear interpolation between the
two pixels P_{n,m} and P_{n,m+1} is calculated by:

P_{lerp} = w \cdot P_{n,m+1} + (1 - w) \cdot P_{n,m} \qquad (3.7)

with w being the non-integer part of y(x_n):

w = y(x_n) - \lfloor y(x_n) \rfloor. \qquad (3.8)
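The following sketch (not taken from the projector code; it assumes a quadratic W × W image stored row-major and a ray for which the integration runs along x, i.e. |sin ω| ≥ 1/√2, with simplified boundary handling) illustrates Joseph's method for a single ray:

#include <math.h>

/* Line integral of a ray y(x) = -cot(omega)*x + y0 through a W x W pixel
   image (row-major, img[y*W + x]) using Joseph's method for the case where
   the integration runs along x. */
float joseph_ray_x(const float *img, int W, float omega, float y0)
{
    float slope = -1.0f / tanf(omega);   /* -cot(omega) */
    float sum = 0.0f;

    for (int x = 0; x < W; x++) {
        float y = slope * (float)x + y0; /* exact ray position in this column */
        int   m = (int)floorf(y);        /* lower neighbouring row */
        float w = y - (float)m;          /* non-integer part, equation 3.8 */

        if (m < 0 || m + 1 >= W)         /* skip columns where the ray misses the image */
            continue;

        /* linear interpolation between the two pixels hit in this column (equation 3.7) */
        sum += w * img[(m + 1) * W + x] + (1.0f - w) * img[m * W + x];
    }
    return sum / fabsf(sinf(omega));     /* scaling factor of equation 3.5 */
}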
3.1.2 Projector Algorithm
The projector algorithm used in this work is based on the sheared projector
by Panin and Kehren [43] which uses Joseph's method for projecting rays
through pixel images. They present a very efficient algorithm for back- and
forward-projection of LOR based 3D sinograms utilizing simple shearing
operations. The algorithm is inspired by the paper ”A fast algorithm for
general raster rotation” by Alan W. Paeth [42] which presented an efficient
way to rotate raster images. To calculate the rotation, the rotation matrix is
separated into shear matrices that are applied consecutively. The algorithm
presented here exploits all the advantages of the Rotate- and Slant algorithm
but uses Joseph’s Method for the actual interpolation.
The idea presented in [42] can be leveraged in 3D projector algorithms
as the basic operation of the projector is to rotate a 3D image volume. To
project the bins of segment 0 with φ = 0◦ a linear interpolation has to be
calculated. The image is only rotated by θ around one single axis. For the
other segments however a bilinear interpolation has to be calculated as both
angles θ and φ are ≠ 0.
The algorithm utilizes the fact that the bilinear interpolation for the
oblique segments can be separated into two linear interpolations one for θ
and one for φ. The linear interpolation for θ is the same for an entire view
independent of φ. So when rotating the image around the axial axis to
calculate segment 0 an intermediate image called the sheared image is saved
and used again for the other segments of this view. As a result only the
linear interpolation has to be calculated for every φ using this sheared image.
Redundant calculations are eliminated and the computational complexity is
reduced. Figure 3.1 illustrates this principle. The first image shows the
projection of segment zero and the calculation of the sheared image at the
same time. The next image shows how the sheared image and segment zero
of the view relate to each other and on the left side an oblique segment is
calculated utilizing the sheared image.
Assuming y(x_n) and x(y_m) are precomputed and there are four times
more oblique events in the sinogram than straight LORs, the number of
operations is reduced by about 45% compared to a method that calculates both
linear and bilinear interpolations.
Back- and forward-projectors are basically the same operations just in
reverse order. The forward-projection algorithm’s input is the 3D image
Figure 3.1: Projector using sheared image
volume. The segment 0 part of the sinogram view is calculated and at
the same time the sheared image is stored. Using this sheared image the
other segments of the sinogram view are calculated. The sinogram is then
scaled by the corresponding factor as described in section 3.1.1 Joseph’s
Method. The back-projector works the other way around. Its input is the
sinogram which is scaled after Joseph’s Method. Secondly its segments ≠ 0
are projected into the image which results in the sheared image. Then the
image is unsheared and segment 0 is projected.
3.1.3 Implementation
The projectors are implemented in C++ optimized for multi-core and multi-
processor systems. They utilize parallel programming paradigms on two
levels.
On the lowest level, the single instruction multiple data (SIMD) princi-
ple as it is common in modern x86 architectures is used. The SSE extension
in x86 CPUs allows programmers to use vector based operations. SSE ca-
pable processors can operate on up to four 32bit floating point values at the
same time with one single instruction. The extension allows the usage of
additional data types, namely __m64 (MMX), __m128 (SSE1) and __m128d
as well as __m128i (SSE2). __m64 is a 64 bit wide variable that can store
multiple data types (8x8bit, 4x16bit etc.). The most interesting data type is
__m128 as it can store four 32bit floating point values. One prerequisite to
void *_mm_malloc(size_t size, size_t align);
void _mm_free(void *p);
Listing 3.1: SSE memory allocation
using __m128 is to align the data in memory to 16 bytes. This is because the
_mm_load and _mm_store intrinsics expect the last 4 bits of the address to
be zeros. This can be achieved by using the functions in listing 3.1 instead
of malloc and free. _mm_malloc takes an additional parameter to specify the
alignment boundary. This is documented in the Intel C++ Intrinsics Reference
[20].
As a consequence apart from using special functions to allocate and free
memory the data arrays, in particular the sinogram and the images have
to be padded in a way so that their size is a multiple of 16. It is done by
simply increasing the z coordinate of sinograms and images. The padding
introduces an overhead, which is outweighed by the considerable speedup
gained from using SSE. Instead of loading every single floating point variable
and operating on it, four variables can be loaded at a time and four variables
can be operated on in parallel.
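As a brief sketch of this principle (not part of the projector code), the following function scales an array of floats with SSE intrinsics; the buffer is assumed to be 16-byte aligned, for example by allocating it with _mm_malloc(n * sizeof(float), 16), and n is assumed to be a multiple of four:

#include <xmmintrin.h>

/* Multiply n floats (n a multiple of 4) by a scalar using SSE.
   The buffer must be 16-byte aligned, as required by _mm_load_ps/_mm_store_ps. */
void scale_sse(float *data, unsigned int n, float factor)
{
    __m128 f = _mm_set1_ps(factor);            /* broadcast factor into all four lanes */
    for (unsigned int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(data + i);      /* aligned load of four floats */
        v = _mm_mul_ps(v, f);                  /* four multiplications with one instruction */
        _mm_store_ps(data + i, v);             /* aligned store back */
    }
}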
The second parallel programming paradigm used in the projector’s im-
plementation is multithreading. The algorithm creates multiple threads and
calculates the projections in parallel. In this particular implementation a
view based parallelism is used, so that multiple sinogram views are pro-
jected at the same time. This means that the inner loop of the OSEM
algorithm is parallelized (compare Figure 2.19) to speed up the calculation
of one subset. During forward-projection every thread accesses the same
image to calculate segment 0 of the sinogram. At the same time every
thread saves its own sheared image because every thread calculates a dif-
ferent view angle θ. Then every thread calculates the other segments using
its sheared image. The sheared image is discarded after projection. During
backward-projection every thread projects the segments ≠ 0 into its own
sheared image and unshears it while projecting segment 0. After all views
are projected, the resulting images from each thread have to be added up
resulting in the final image.
Additional optimization is applied to the algorithm as described in [27].
The most important is the index reshuffling of the image. While the index
order normally would be x , y and z , the projectors use the z , x , y ordering
scheme. This allows the projectors to access memory in a contiguous way
because the innermost projector loop iterates over z . With index order zxy
this loop can access elements right next to each other in memory.
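In other words, with zxy ordering a voxel (x, y, z) of an image with rhosamples entries in x and y and zsamples entries in z is addressed as in the following small sketch (names chosen for illustration only):

/* zxy index order: z is the fastest-running dimension, so the innermost
   loop over z touches consecutive memory addresses. */
inline unsigned int voxel_index(unsigned int x, unsigned int y, unsigned int z,
                                unsigned int zsamples, unsigned int rhosamples)
{
    return z + x * zsamples + y * rhosamples * zsamples;
}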
3.2 Optimization of CPU projector
A significant drawback of the CPU projector implementation described in
3.1.3 can be found in the back-projection part of the algorithm. For each
used thread a separate image is required. After projection, the images of
all views have to be summed up. This can become a scalability problem as
CPUs today contain more and more cores and therefore more images would
have to be added up.
3.2.1 Analysis of current Implementation
An analysis of the existing implementation demonstrates this issue. The
algorithm is executed with varying numbers of threads (1-16) on two different
systems: a 2x dual core system comprised of two dual core 2.66GHz
Intel Xeon processors with a bus speed of 1333MHz and 4MB cache, and a
2x quad core system with two quad core 2.66GHz Intel Xeon processors, also
with a bus speed of 1333MHz but with 8MB cache.
[Plots: reconstruction runtime in seconds and parallel efficiency versus number of threads (1-16), for the 2x dual core and 2x quad core systems]
(a) Runtime in seconds    (b) Parallel efficiency
Figure 3.2: CPU projector efficiency analysis
Figure 3.2(a) shows the reconstruction runtime plotted against the num-
ber of threads running for both the dual core and the quad core systems.
Both systems are fastest when 4 threads are used for reconstruction. Addi-
tionally, the dual core’s system overall performance is slightly better than
that of the quad core system. Similarly, when plotting the parallel efficiency
E = \frac{1}{P} \cdot \frac{T_{\text{seq}}}{T(P)} \qquad (3.9)
where P is the number of threads, Tseq is the sequential time (1 thread) and
T(P) is the runtime using P threads, as in Figure 3.2(b), the efficiency drops
below 0.5 when more than 4 threads are used. The efficiency is also very
similar for both systems, which comes as a surprise as the quad core system
should in theory be able to execute twice as many instructions as the dual
core system.
The reason for this is that the algorithm is inherently memory bound.
There are relatively few computations per memory access and large amounts
of data have to be accessed. Each thread has to iterate over an entire image
and access its elements. Memory access of different threads is thus scat-
tered across system memory which inhibits optimization mechanisms like
caching and fast consecutive memory access. Additionally if more threads
are used more memory is used. The exponential decrease in efficiency (or
the exponential increase in parallel overhead) can be attributed to context
switching between threads and the scattered memory access caused by using
multiple separate images for each thread and the overload of the memory
bus. Furthermore the adding of all images after back projection adds extra
overhead.
3.2.2 Optimization of current Implementation
The algorithm can be optimized by a more fine-grained domain decompo-
sition. The current implementation parallelizes the loop over the different
view angles θ. Each thread calculates its own view. One level below the
view-loop is the loop over the integration direction (compare subsection
3.1.1). So the idea is to let every thread work on the same view at the
same time, but with each thread calculating only a small part of the image. Figure
3.3 shows the algorithm for back projecting one sinogram view. The opti-
mized algorithm would parallelize the outer loops over x . Thread Tn would
loop x from n ∗ ρ-samples/#T to (n + 1) ∗ ρ-samples/#T − 1 where #T
is the number of threads. The problem set is 5 dimensional: view , x , y ,
segment , z and instead of spreading the highest dimension view over multi-
ple threads the integration direction being the second highest dimension is
decomposed. The advantage is that each thread can operate on the same
image which draws memory access closer together and therefore improves
cache usage and also avoids the summation process of each thread’s images
after all views are projected. Additionally some factors are the same for an
entire view calculation and thus can be shared by all threads.
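A minimal sketch of this decomposition (hypothetical names; the real implementation iterates over the remaining dimensions as shown in Figure 3.3): every thread processes its own contiguous range of x while all threads work on the same view and the same output image:

/* Thread n of num_threads back-projects the x range
   [n*rho_samples/num_threads, (n+1)*rho_samples/num_threads) of the current view. */
void backproject_view_slab(unsigned int n, unsigned int num_threads,
                           unsigned int rho_samples)
{
    unsigned int x_begin = (n * rho_samples) / num_threads;
    unsigned int x_end   = ((n + 1) * rho_samples) / num_threads;

    for (unsigned int x = x_begin; x < x_end; x++) {
        /* loops over y, segment and z as in Figure 3.3,
           writing into the single image shared by all threads */
    }
}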
This new method was implemented and runtime tests were performed
to evaluate the efficiency gained from the modification. Figure 3.4(a) compares
the runtime of the two algorithm implementations on the dual core
system and Figure 3.4(b) shows the equivalent data for the quad core system. For
both systems the runtime decreases with an increasing number of threads.
On the dual core system reconstruction takes about 143 seconds with 8
threads using the original algorithm implementation but takes only 113 sec-
onds with 8 threads using the optimized implementation.
Overall the performance increase gained from adding more threads is not
significant. This is independent of the number of CPU cores available in the
system. The reason for this is found in the memory bound nature of the
algorithm and also having more threads than cores does not make a lot of
sense due to computation time that is lost in the increased number of task-
switches. However when looking at the two different implementations it is
important to determine on what level the algorithm should be parallelized.
It affects how well the underlying hardware is able to optimize execution as
well as the magnitude of the parallel overhead.
Joseph's method scaling
for(x=0; x < ρ-samples; x++)
for(y=0; y < ρ-samples; y++)
for(segment=1; segment<segments; segment++)
for(z=0; z < z -samples; z++)
calculate linear interpolation for oblique segments
sheared image calculated, now unshear and project segment 0
for(x=0; x < ρ-samples; x++)
for(y=0; y < ρ-samples; y++)
for(z=0; z < z -samples; z++)
calculate linear interpolation for segment 0 and un-shear
Figure 3.3: Back-projector structogram
[Plots: runtime in seconds versus number of threads (1-16), comparing the original and the modified implementation on each system]
(a) Runtime dual core system    (b) Runtime quad core system
Figure 3.4: Comparison of original and modified implementation
3.3 The Compute Unified Device Architecture
The Compute Unified Device Architecture (CUDA) is a programming frame-
work based upon a new generation of graphics cards produced by Nvidia.
The framework makes it easy for programmers to leverage the power of mod-
ern graphics devices for general purpose computation on GPUs (GPGPU)
without having to understand the graphics pipeline. A paradigm shift in the
design of graphics processing units has made this possible. Prior to CUDA,
graphics cards were designed as special purpose processors - they contained
many small special purpose functional units that all performed a dedicated
task. Those functional units were chained together to form the graphics
pipeline. With new generation GPUs this design is superseded by a more
generalized approach. Instead of dedicated elements that are designed to
only solve one specific task, graphics cards today consist of general purpose
functional units that can be ”dynamically allocated to vertex, pixel, geom-
etry or physics operation” [38] to form the traditional graphics pipeline.
This is important for backward compatibility. Important for general pur-
pose computing on the GPU is the fact that those functional units can not
only be programmed to act as the traditional graphics pipeline but to solve
virtually any existing data parallel problem. This could not be done in an
easy way before as older graphics cards had a graphics-specific instruction
set, could only perform gather and no scatter memory operations and were
in general limited by the graphics pipeline.
Besides Nvidia, several other companies are working on similar technologies.
ATI/AMD offered ”CTM - Close to Metal” which is a GPGPU
interface for their ATI graphics cards [2]. There’s also an open source
framework called BrookGPU that ”abstracts and virtualizes many aspects
of graphics hardware” [7]. Additionally there are a number of commer-
cial products on the market that focus on virtualizing a number of parallel
hardware architectures including GPUs, the Cell processor and multicore
CPUs for high-performance computing (HPC). In 2008 the Khronos Com-
pute Working Group was formed to standardize general purpose parallel
programming. The work group released the OpenCL 1.0 standard by the
end of 2008 [34]. It is very similar to the ideas behind CUDA as Nvidia was
involved in the creation of the standard but incorporates both CPUs and
GPUs in its architecture.
The following sections discuss the hardware architecture of the CUDA
capable Nvidia G80 graphics chip in more detail. This chip is used in Nvidia
Geforce 8800 and Nvidia Tesla products, the devices that were used in this
work.
3.3.1 Hardware Architecture
Producers of graphics processors tend to be secretive about implementation
details of their products. However to write efficient programs for a given
platform one has to have a good understanding of the underlying hardware
architecture. Information about it can be found in the Nvidia Geforce 8800
GPU Architecture Overview [38], the CUDA Programming Guide [39] and
the course material from a CUDA course at University of Illinois [19]. Self-
conducted experiments helped to clarify certain aspects of the architecture.
One Geforce 8800 chip can be seen as a streaming processor array (SPA).
Figure 3.5 shows the principal schematics of the 8800 chip. It contains a
number of texture processing clusters (TPC) which contain texture units
(TEX) and streaming multiprocessors (SM). The streaming multiproces-
sors consist of stream processors (SP) and super function units (SFU). In-
struction fetch and decode units are part of the streaming multiprocessor.
Therefore all stream processors on a multiprocessor always execute the same
instruction. Branching on a multiprocessor is possible by disabling unnec-
essary stream processors. A Geforce 8800 Ultra chip contains 8 TPCs with
2 SMs each and each SM contains 8 SPs resulting in 128 (8 ∗ 2 ∗ 8) stream
processors running at 750MHz. Table 3.1 gives an overview of the different
kinds of memories that are available at different levels of the architecture.
Figure 3.5: Geforce 8800 architecture
Memory   | Location | Cache | Access | Scope       | Latency
Register | On Chip  | N/A   | RW     | 1 Thread    | 1 Cycle
Shared   | On Chip  | N/A   | RW     | Block       | 1 Cycle
Local    | Off Chip | No    | RW     | 1 Thread    | 400 - 600 Cycles
Global   | Off Chip | No    | RW     | All Threads | 400 - 600 Cycles
Constant | Off Chip | Yes   | R      | All Threads | DRAM, cached
Texture  | Off Chip | Yes   | R      | All Threads | DRAM, cached
Table 3.1: CUDA Memory Overview
Every streaming multiprocessor has an on-chip register set with 8192
32bit registers. The shared memory is 16kB on-chip memory per multipro-
cessor divided into 16 banks for CUDA 1.0 devices. It is one of the most
important advantages over old GPU architectures for GPGPU as this mem-
ory represents a programmable cache and when used efficiently can result in
a huge memory access speedup. Local memory is comparatively slow mem-
ory that resides on off-chip DRAM and is only used if not enough registers
are available. Global memory is the main device memory and can be up to
many hundred megabytes large. It is not cached so access is rather costly.
Constant memory is a read only section of the DRAM with a maximum size
of 64kB. Each multiprocessor has a cache working set of 8kB for constant
memory. Texture memory is read only memory of arbitrary size allocated
from global memory. It is accessed with texture units which allow special
addressing modes and filtering. Each multiprocessor contains an 8kB cache
working set for texture memory. Instruction memory is not visible to the
programmer but it is implemented as cached DRAM. In the case of the
Geforce 8800 Ultra the global memory is GDDR3 partitioned in 6 parts and
each partition provides a 64bit interface yielding a 384bit combined inter-
face. The memory clock on Ultra cards is 1080MHz; the memory of the GTX
is clocked at 900MHz. The GPU is connected to the host system via PCIe
1.1 16x which allows data transfer rates of up to 4000MB/s.
3.3.2 Programming Model
The CUDA programming model provides means of partitioning a given data
parallel problem so that it can be computed efficiently on a CUDA device.
The problem as a whole is seen as the grid. The grid contains blocks and
each block contains threads. Figure 3.6 visualizes this concept.
Figure 3.6: CUDA programming model
A thread can be understood as a thread of execution as is known from
modern operating systems. However CUDA threads are lightweight: they
have almost no overhead, switching between them is very cheap and there
is no need for a stack. A thread is identified by its thread ID inside a
thread block. A thread block is a one-, two- or three-dimensional array
void mulVolCPU(float *v, float f)
{
    for (unsigned short z = 0; z < 8; z++)
        for (unsigned short y = 0; y < 8; y++)
            for (unsigned short x = 0; x < 8; x++)
                v[z*8*8 + y*8 + x] *= f;
}
Listing 3.2: Multiply volume CPU code
of threads. Threads in a block can share data through the aforementioned
shared memory and can synchronize their execution through synchronization
points. The number of threads in a block is limited to 512. However blocks
of the same size can be batched together into a grid of blocks. Threads
from different blocks can not communicate or synchronize with each other.
A block is identified inside the grid by a block ID. A grid can also be a two
dimensional array as seen in figure 3.6. The program running on the GPU
is called a kernel. The programmer can choose grid dimensions and specify
the block size and inside the kernel each thread can be identified by block
and thread ID. This programming model is called ’parallel thread execution’
or PTX. It is also a generic virtual instruction set and virtual machine for
data parallel problems. The PTX is described in detail in [40].
A code example makes the programming principle clearer. Consider a
volume v comprised of floating point values implemented as a 3 dimensional
array with each dimension having the size 8. The challenge is now to effi-
ciently multiply each element of the array with a factor f. Listing 3.2 gives
an example implementation of how to solve this problem in C.
The function mulVolCPU takes a pointer to the volume and a factor as
argument. Inside the function 3 nested loops iterate over every dimension
of the volume. The dimensions are defined as x , y and z. Each voxel
is addressed consecutively and multiplied by the specified factor. There is
room to optimize this code by using vectors instead of scalars utilizing single
instruction multiple data processor extensions and a multithreaded approach
could further speed up the solution if multiple CPU cores are available. For
the sake of clarity and simplicity it is left that way.
Listing 3.3 shows how this problem could be solved using CUDA and a
modern Nvidia graphics card. As in the CPU case a function mulVolGPU
is given that takes pointers to the volume and a factor as argument. It is
1   void mulVolGPU(float *v, float f)
2   {
3       dim3 grid(8);
4       dim3 threads(8);
5       mulVolKernel<<<grid, threads>>>(v, f);
6   }
7
8   __global__ void mulVolKernel(float *v, float f)
9   {
10      unsigned short z = blockIdx.x;
11      unsigned short x = threadIdx.x;
12      for (unsigned short y = 0; y < 8; y++)
13          v[z*8*8 + y*8 + x] *= f;
14  }
Listing 3.3: Multiply volume GPU code
assumed that ∗v is a pointer to the volume residing in GPU global memory.
In lines 3 and 4 the size of the grid is defined: 8 blocks with each block
containing 8 threads. Then the kernel mulVolKernel() is started - the grid
dimensions are assigned in the angle brackets and the parameters are passed
in parentheses.
What happens now is that in each block 8 threads and an overall of 8
blocks are started yielding 8 ∗ 8 = 64 threads. Those threads are spread
across all multiprocessors and essentially run at the same time. Each thread
will now define two variables z and x. z is set to blockIdx.x, a predefined
variable specifying the ID of the block the thread runs in, and x is set to
threadIdx.x, the ID of the thread inside the block. Those variables are now
used as the z and x coordinate to address the voxels. The following loop
iterates only over y and multiplies each voxel with the factor as z and x
are defined implicitly by block and thread IDs. In essence on the GPU 64
threads run in parallel and multiply 64 voxels at the same time compared to
one voxel at a time for the CPU example. The programmer uses the block
and thread IDs to select and address the memory each thread works on.
In that respect CUDA is different from parallel programming languages such as
High Performance Fortran [31]. In HPF the programmer is able to specify
how to distribute memory in a top-down manner using its DISTRIBUTE
directive, whereas in CUDA the various threads access their data as they need
it, which is more of a bottom-up approach. Data distribution is not
specified explicitly, which allows for more dynamic memory access. In CUDA
multithreading is implemented directly as each kernel is started multiple
times in various threads depending on the grid parameters.
3.3.3 Execution Model
A thread execution manager is responsible for generating threads and grid
blocks based upon the parameters specified in the kernel calls. Thread
blocks are then serially distributed to the multiprocessors. Each block is
guaranteed to execute entirely on one multiprocessor so that the shared
memory space resides on the same physical chip to guarantee fast memory
access. Depending on the number of registers and the size of the shared
memory space one thread block requires, it is possible to execute up to
8 thread blocks or 768 threads at a time on one streaming multiprocessor.
This is achieved by assigning every thread block time slices. If thread blocks
are finished and terminate, the thread execution manager can schedule other
blocks to run on the multiprocessor. The granularity of scheduling is defined
by warps - they are the scheduling units of the multiprocessors. A warp is
a batch of threads belonging to the same block. The warp size of the G80
chip is 32 threads. Warps are executed in a SIMD fashion meaning every
thread in a warp executes the same instruction but operates on different
data. As a streaming multiprocessor contains 8 stream processors, it takes
4 clock cycles to dispatch the same instruction to all threads in a warp.
The streaming multiprocessor maintains the thread IDs and schedules the
thread execution. Scheduling operations work as follows: the streaming
multiprocessor fetches a warp instruction from the instruction cache and
places it into a free instruction buffer slot. A scoreboarding mechanism
identifies instructions in the buffer that are ready to run. A warp is ready
to run if all its required values are deposited in registers. The warps are
then scheduled with a round robin method whereas the scoreboarding mech-
anism prevents any hazards that might occur when reordering instructions
by assigning instruction priorities. A decoupling of memory and processor
pipeline is achieved by this execution model. The memory pipeline can work
on fetching values for a warp while the processor pipeline executes instruc-
tions of another warp. So it is in fact important to launch more threads than
available stream processors as this makes it possible to hide memory access
latency. To quantify how well memory access latency can be hidden the
occupancy of a kernel can be calculated. It is defined as the ratio of concur-
rent threads for a kernel on one streaming multiprocessor to the maximum
number of threads supported by one streaming multiprocessor. The number
of concurrent threads is a function of the number of threads in one thread
block because each multiprocessor has a limitation on the number of warps it
can execute, the maximum active registers per thread and the shared mem-
ory requirements of a thread block. For memory bound kernels, increasing
the occupancy can speed up its execution by hiding memory access latency.
However for computation bound kernels forcefully increasing occupancy can
result in reduced performance due to side effects such as register spilling to
off-chip memory.
Branching is handled by serializing the different execution paths. This
can result in inefficient execution. However for small branches the com-
piler is able to create predicated instructions - a very efficient way to avoid
branching. Instructions are predicated by conditions - if the condition is
true the instruction is executed; NOP is executed if the condition is not
true. Mahlke et al. [33] contains a comprehensive discussion of instruction
predication.
3.3.4 Memory Model and Access
Registers are assigned to blocks and can not be shared between them and
each thread in a block only accesses registers assigned to itself. Shared
memory is also assigned to blocks and is only accessible by that block. Due to
the parallel nature of the architecture, efficient simultaneous access to shared
memory is possible as shared memory on each streaming multiprocessor is
divided into 16 memory banks. The programmer has to avoid bank conflicts
- simultaneous access to the same memory bank from different threads of
one half warp. Bank conflicts can be circumvented by aligning memory
fields and devising a reasonable memory access pattern. Constant memory
is a very efficient way to access values that are common for all threads in a
block, for example mathematical constants or other kernel parameters.
Global memory access is very slow as it is basically uncached DRAM.
However memory reads by consecutive threads in a warp can be combined by
the hardware into several, wide memory reads which are a lot faster than ran-
dom reads. The requirement is that the threads in the warp must be reading
memory in order. More precisely a thread number N in a half warp should
access address HalfWarpBaseAddress + N and HalfWarpBaseAddress should be of
type type ∗ with sizeof (type) equal to 4, 8 or 16. HalfWarpBaseAddress should
be aligned to 16∗ sizeof (type) bytes. The same applies to memory reads. An
efficient programming pattern is therefore to fetch the data a thread block
needs to operate on into shared memory in a consecutive manner, then op-
erate on that data and finally write it back in the efficient way described
before. Shared memory can thus be seen as a programmable shared cache.
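A small illustrative kernel (not one of the projector kernels) following this pattern could look as follows; each block stages a consecutive run of values in shared memory, operates on it and writes it back, so that thread i of a half warp always touches address base + i:

#define BLOCKSIZE 256

/* Each block stages BLOCKSIZE consecutive floats in shared memory, scales
   them and writes them back; reads and writes are coalesced. Launched for
   example as scaleCoalesced<<<n / BLOCKSIZE, BLOCKSIZE>>>(d_data, 2.0f)
   with n a multiple of BLOCKSIZE. */
__global__ void scaleCoalesced(float *data, float factor)
{
    __shared__ float buffer[BLOCKSIZE];

    unsigned int i = blockIdx.x * BLOCKSIZE + threadIdx.x;

    buffer[threadIdx.x] = data[i];      /* coalesced read from global memory */
    __syncthreads();                    /* needed when threads read data staged by other threads */

    buffer[threadIdx.x] *= factor;      /* operate on the staged data */

    data[i] = buffer[threadIdx.x];      /* coalesced write back to global memory */
}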
Newer CUDA devices with compute capability 1.2 or higher do not have
those restrictions. These devices are able to perform coalesced memory
access as soon as threads from the same half-warp access memory within
one segment of global memory. One segment can be up to 128 bytes wide.
The access pattern, however, does not influence whether or not memory reads
are coalesced.
The hardware background to this optimization is based on the bus width
of the memory subsystem of the GPU. It is between 384bit for first gener-
ation CUDA devices and in newest models up to 512bit wide. Fetching a
32bit floating point value takes a certain time, usually 400-600 GPU cycles,
however fetching 12 32bit floating point values (384bit) that are located
next to each other in global memory takes the same amount of time. The
memory subsystem always fetches the full bus width and discards the data
that was not requested. By following the aforementioned alignment rules one
can always exploit the full potential of the memory bus and thus speed up
memory access.
3.3.5 CUDA Toolchain
The CUDA toolchain allows developers to write code for the graphics card
in C with some additional CUDA specific syntax. Kernels can be started
from the host program with function calls. The CUDA compiler automati-
cally adds functions to upload kernel instructions to the GPU. The CUDA
framework provides C functions to allocate memory on the GPU and to
copy data from and to the GPU. The toolchain integrates nicely into Visual
Studio by Microsoft. Additionally the framework contains FFT and BLAS
libraries.
The CUDA framework allows developers to mix conventional C/C++
code running on the CPU with CUDA specific kernels and function calls
that operate on the GPU. To accomplish this, the toolchain detects CUDA
specific code, extracts it from the source and compiles it with a proprietary
compiler. Conventional C/C++ code is passed to a user defined compiler
(Microsoft Visual C++ MSVC, Intel ICC or GCC). The CUDA kernels are then
injected as load images into the object files along with routines to upload
them to the GPU. In the linking step additional runtime libraries are added
to support the aforementioned C functions to allocate memory and upload
data and to start kernels. When compiling kernels, PTX code is generated.
From this intermediate representation the compiler generates device specific code,
which resembles the load images injected into the object files. The abstract
PTX code is generated as an intermediate step because devices differ in com-
pute capability, for example newer GPUs support double precision floating
point computation or atomic operations.
3.3.6 Debugging CUDA code
CUDA provides an emulation mode to debug kernels. After recompiling
kernels with the compiler option -deviceemu set, instead of running on the GPU
the kernel is executed on the CPU. All threads that would normally run
in parallel on the graphics card run on the host system sequentially. This
enables the developer to set breakpoints, examine variables and read out
memory. Even output to the screen is possible from inside the kernel. A
side effect of this method is that computation is not actually performed on
the GPU, so the emulation mode is not useful for locating errors that only
occur on the device, for example differences in floating point calculations.
The threads run in succession, which might hide race condition errors in the
code or other concurrency related problems.
Nvidia also provides a port of the GNU Project Debugger GDB for
CUDA code. This tool allows realtime debugging of code running on the
graphics card. This is useful to debug a kernel without side effects caused by
emulation. It allows developers to stop execution at any line in the kernel
code, step through kernels, read out current device memory and switch
between blocks as well as threads.
In some cases it might also be necessary to examine what instructions
the compiler generates from the C code to identify bottlenecks and detect
causes for errors. For this purpose the third party tool decuda by Wladimir
J. van der Laan is available [51]. It’s a disassembler for the generated kernel
binary files. It allows developers to not only examine the intermediate PTX
code but also the actual code that is executed by the GPU.
3.4 Implementing projectors in CUDA
The following sections describe how the algorithms for forward- and backward-
projection are implemented in CUDA to run on a CUDA capable graphics
card.
3.4.1 Requirements
The hardware running the projection algorithm has to meet certain require-
ments for the projectors to produce the same numerical results as the CPU.
Furthermore there are basic requirements the CUDA hardware has to fulfill
to match the efficiency of the CPU algorithm.
The first question is whether or not graphics devices have enough mem-
ory capacity to store the data structures that are required for reconstruc-
tion. The minimal datasets required for the projection of one sinogram
view are the sinogram view itself and two images, one for input or out-
put and the other to store intermediate calculations, more precisely the
sheared image. Assuming 32bit floating point values for each voxel and
pixel and a typical scanner geometry of 55 crystal rings with 7 segments,
span 11 and a maximum ring difference of 38 one sinogram view with 559z
and 336ρ samples is 0.72MB large and the 2 images would be together
2∗366ρ∗366ρ∗109z = 111.4MB large. For span 1 reconstruction a sinogram
could have up to 2769z samples yielding 3.6MB per sinogram view. This
is required for some PSF reconstruction algorithms [44]. Modern graphics
cards have memory banks multiple times larger than these requirements. Ta-
ble 3.2 shows memory specifications for a number of selected Nvidia graphics
cards from the last three GPU chip generations and shows that a GPU reconstruction
algorithm would not be constrained by available memory (sources:
[55], [36], [37]).
Device             | Memory  | Memory Interface | Memory Bandwidth
GeForce 8800 GTX   | 768MB   | 384bit           | 86.4GB/s
GeForce 8800 Ultra | 768MB   | 384bit           | 103.7GB/s
Tesla C870         | 1536MB  | 384bit           | 77GB/s
GeForce 9800 GX2   | 2x500MB | 512bit           | 2x64GB/s
GeForce GTX 295    | 2x896MB | 2x448bit         | 2x111.9GB/s
Table 3.2: Selected Nvidia cards memory specification
The second important requirement is the numerical reproduction of the
CPU projector results on the GPU. Therefore it has to be determined if the
device architecture supports floating point operations similar to the CPU
and in particular if it supports the IEEE 754 floating point standard. The
algorithms operate on single precision floating point numbers. Therefore
only single precision operations have to be validated.
According to [41] accumulation and multiplication operations are stan-
dard compliant. However the compiler might merge ADD and MUL to a
combined multiply-add MAD instruction. This instruction truncates the
intermediate result of the multiplication, which might lead to inaccurate results. But the
compiler can be forced to generate separate multiply and add instructions.
Division is specified as to have a maximum error of 2ulp. Trigonometric
functions sin and cos are specified with maximum errors of 2ulp. These
variances lie within the boundaries of the acceptable error.
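One documented way to prevent this merging for individual operations are the single precision intrinsics __fmul_rn and __fadd_rn, which compile to separately rounded multiply and add instructions. A short sketch (illustrative, not taken from the projector kernels) of a linear interpolation written with them:

/* Linear interpolation a + (b - a) * w, written so that the multiplication and
   the addition are rounded separately and cannot be contracted into a MAD. */
__device__ float lerp_no_mad(float a, float b, float w)
{
    return __fadd_rn(a, __fmul_rn(b - a, w));
}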
To conclude the basic requirements analysis it is determined that it is
possible to port the CPU projector algorithm to the graphics device as both
computational accuracy and available memory are adequate.
3.4.2 Implementation
The projector implementation is based on the optimization principle from
listing 3.3 on page 56. In the example the parallel capabilities of the GPU
are exploited to simultaneously perform operations on a large part of an
array. This parallelism can be used in the projection algorithm because its
basic operation is manipulating large arrays. This is discussed in section 3.1.
So the algorithm is implemented in a way that allows an optimization of the
calculations by parallelization. This idea is illustrated in figure 3.3 for the
backprojector: during one view projection the algorithm iterates multiple
times over an image and manipulates it voxel by voxel. This subsection
describes in detail how the projector algorithm is implemented in CUDA.
The forward-projector can be ported by parallelizing the outer loops
of the algorithm and assigning them to thread-blocks and threads. Figure
3.7(a) shows the first part of the forward projector algorithm, the calculation
of the segment 0 projection and the calculation of the sheared image. The
sinogram and one y-slice of the image is shown. The algorithm has to iterate
over the entire image volume and calculate the linear interpolation between
two neighboring voxels on the x axis. The figure shows the interpolation in
x direction with the line integral in y direction. Two arrows converging in
one pixel of the sinogram symbolize one interpolation. The volume indices
are accessed by the different parallelization elements. Along the z -axis all
threads from one thread-block are allocated. The x axis is handled by
different blocks. The threads calculate the interpolation for one y image
slice, save it to the sheared image and add it to the correct sinogram bin.
They loop over the y axis and perform the same calculation for each image
slice. In effect an entire image slice is processed simultaneously. So for each
voxel in the x/z plane there exists a thread divided in blocks along the x
axis and each thread iterates over the y axis. For interpolation along the y
axis with the line integral in x direction the process is the same apart from
exchanging the x and y dimensions.
(a) Calculation of segment 0, sheared image    (b) Projection of segments ≠ 0
Figure 3.7: CUDA implementation of the forward projector
The projection of segments ≠ 0 is shown in figure 3.7(b). Here the
oblique angles are calculated which are always independent of θ or the di-
rection of the line integral. It is an interpolation in z direction calculated
from the sheared image, which is the image that already has incorporated
the rotation θ around the axial axis of the scanner. The interpolation has to
be calculated multiple times with different oblique angles for each segment
≠ 0. Thus one thread loops over y as well as all the oblique segments. For
each iteration over the oblique segments the data is the same however, the
angle and thus the interpolation coefficients change. Due to the angle, the
number of interpolations per sinogram segment is different. Each thread
adds the interpolated values to the correct sinogram pixel in the correct
sinogram segment.
The backward projector is the inverse operation of the forward projec-
tor. First the backward projector has to calculate the sheared image from
the sinogram, and then the image is unsheared. The calculation of the
sheared image during backward-projection is identical to the calculation of
the oblique sinograms during forward projection. During calculation of the
sheared image the z axis corresponds to the threads, the x axis separates
the blocks from each other and each thread loops over y and the number of
segments. Each thread calculates one voxel of the sheared image.
(a) Unshearing of image step 1 (b) Unshearing of image step 2
Figure 3.8: Unshearing of image during backprojection
The unshearing of the image is split up into two separate operations.
Depending on the direction of the line integral the interpolation has to be
calculated in x or y direction. The CPU algorithm iterates over the sheared
image, calculates the interpolation and adds the two interpolation results
to the unsheared image. Parallelizing this operation is not straightforward.
Assuming that there is one thread for each element of an image slice, for
each iteration over the direction of the line integral the thread would write
to two voxels in the output image. A thread with the same thread ID in
a neighboring thread block would also write to two voxels in the output
image but due to the nature of interpolation they would both try to write
to the same voxel. Two different threads from two different blocks trying to
modify the same memory results in undefined behavior if the read-modify-
write operation is not atomic. There is no way to synchronize threads from
different thread blocks and atomic operations are not available on all CUDA
capable devices and are very costly due to serialization of the commands.
A way to work around this issue is illustrated in Figure 3.8. The kernel is
split up into two separate functions, the first function calculating the first
part and the second function calculating the second part of the interpolation.
By doing so writing to the same memory location at the same time is not
possible, as each thread only writes to one single image voxel.
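A strongly simplified one-dimensional sketch of this idea (hypothetical kernels, not the actual bwd-x1/bwd-x2 implementations; shift and the weight w stand in for the shear parameters, and boundary handling is omitted): each of the two kernels writes only one of the two interpolation contributions, so every thread writes exactly one output element per launch:

/* Element i of the sheared row contributes to output elements m and m+1
   with weights (1-w) and w, where m = i + shift. Writing both from one
   thread would collide with the neighbouring thread, so the two
   contributions are written by two separate kernel launches. */
__global__ void unshearPart1(const float *sheared, float *out,
                             int shift, float w, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i + shift] += (1.0f - w) * sheared[i];   /* each thread writes one voxel */
}

__global__ void unshearPart2(const float *sheared, float *out,
                             int shift, float w, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i + shift + 1] += w * sheared[i];        /* targets are disjoint within this launch */
}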
The particular arrangement of threads, blocks and loops to the image
coordinates is sensible because the image is indexed in the order zxy . Map-
ping the z coordinate to threads allows them to consecutively access mem-
ory. This increases the performance of the kernel as the hardware is able
to combine consecutive memory access of threads in a warp to one single
wide memory access that is a lot faster (compare section 3.3.4). The blocks
are mapped to the x or y coordinate depending on the direction of the line
integral or in the case of oblique segment calculation the blocks always rep-
resent the y coordinate. The remaining coordinate and the segments are
handled by loops inside a thread.
Each of the steps described above is implemented as a separate kernel
function. Additionally there are separate kernels for the different line inte-
gral directions along the x or y axis. So for one projection method there are
9 kernel functions required including the Joseph’s scaling kernel. Table 3.3
shows a brief overview of all required kernels and their function.
Name           | Function
fwd-x          | Project segment 0 and sheared image (y direction)
fwd-y          | Project segment 0 and sheared image (x direction)
fwd-ob         | Project segments ≠ 0
bwd-ob         | Project all segments into sheared image
bwd-x1         | Unshearing, first part of interpolation (x direction)
bwd-x2         | Unshearing, second part of interpolation (x direction)
bwd-y1         | Unshearing, first part of interpolation (y direction)
bwd-y2         | Unshearing, second part of interpolation (y direction)
joseph-scaling | Joseph's method scaling
Table 3.3: Projector kernel functions
The main reason for splitting up the kernels like this is synchronization.
The only synchronization possible during kernel execution is a call to a
special kernel function __syncthreads() that synchronizes all threads of one
cudaMalloc((void **) &image, sizeof(float) * image_size);
cudaMemset(image, 0, sizeof(float) * image_size);
cudaMemcpy(image, cpu_image, sizeof(float) * image_size,
           cudaMemcpyHostToDevice);
cudaMemcpy(cpu_image, image, sizeof(float) * image_size,
           cudaMemcpyDeviceToHost);
cudaFree(image);
Listing 3.4: CUDA memory operations
block. Global synchronization however is not possible within the runtime
of one kernel. Therefore the logical parts of the projector that require a
previous operation to be finished entirely before the next operation can be
started are split up into separate functions.
Due to the large amounts of memory available on CUDA devices it is
possible to transfer large chunks of data in one single transfer to the device
and then calculate multiple projections. In particular, devices can hold all
the data required for the projection of one entire subset, which consists
of a series of sinogram views (the number of views depends on how many
subsets are used), an image and a sheared image. Example calls to allocate,
set, copy and free memory are given in listing 3.4.
The image pointer is modified by the cudaMalloc call to point to mem-
ory on the GPU and is not valid for CPU operations. The memory set
and memory copy functions use this pointer to identify a GPU memory ad-
dress. The last parameter of cudaMemcpy specifies whether data should be
downloaded or uploaded from or to the device. cudaFree finally makes the
memory available again for further allocations.
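Putting these calls together, the host side of one subset projection could in simplified form look like the following sketch (hypothetical function and variable names; the kernel launches and error handling are omitted):

/* Sketch of the host flow for one subset: upload the subset's sinogram views,
   run the projection kernels, download the resulting image. */
void project_subset(const float *cpu_views, float *cpu_image,
                    size_t view_size, size_t image_size)
{
    float *d_views, *d_image, *d_sheared;

    cudaMalloc((void **) &d_views,   sizeof(float) * view_size);
    cudaMalloc((void **) &d_image,   sizeof(float) * image_size);
    cudaMalloc((void **) &d_sheared, sizeof(float) * image_size);

    cudaMemcpy(d_views, cpu_views, sizeof(float) * view_size,
               cudaMemcpyHostToDevice);
    cudaMemset(d_image, 0, sizeof(float) * image_size);

    /* launch the kernels of table 3.3 for every view of the subset here */

    cudaMemcpy(cpu_image, d_image, sizeof(float) * image_size,
               cudaMemcpyDeviceToHost);

    cudaFree(d_views);
    cudaFree(d_image);
    cudaFree(d_sheared);
}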
3.4.3 Optimization
To make full use of the parallel capabilities of CUDA devices a number of
optimization techniques are applied. By optimizing the CUDA kernels the
GPU reconstruction is no longer only a port from CPU to GPU but is a
new implementation of the same algorithm.
One straightforward idea to optimize the algorithm is register reduction.
It can speed up the kernel execution for two reasons. The first reason is oc-
cupancy. The more registers a kernel uses, the fewer blocks can run simultane-
ously on one streaming multiprocessor, as the blocks on one multiprocessor
/* register implementation */
float k = d_img[z + x * ZSAMPLES +
                iyy * RHOSAMPLES * ZSAMPLES];
float h = d_img[z + x * ZSAMPLES +
                (iyy + 1) * RHOSAMPLES * ZSAMPLES];
a = k + (h - k) * pc;

/* shared memory implementation */
extern __shared__ float shared[];   /* size defined on CPU at kernel launch */
const unsigned int z = threadIdx.x;
shared[z] = d_img[z + x * ZSAMPLES +
                  iyy * RHOSAMPLES * ZSAMPLES];
shared[z + ZSAMPLES] = d_img[z + x * ZSAMPLES +
                             (iyy + 1) * RHOSAMPLES * ZSAMPLES];
a = shared[z] + (shared[z + ZSAMPLES] - shared[z]) * pc;
Listing 3.5: Register and shared memory implementation of interpolation
share all its resources. Simultaneous block execution however is used to hide
memory latencies when fetching data from global memory. If not enough
blocks are active the memory latencies can not be hidden and the perfor-
mance of the kernel decreases. In extreme cases where a lot of registers are
required, the compiler may generate instructions to offload registers to local
memory. This is known as register spilling. The performance of the kernel
suffers drastically if registers are spilled to local memory as it is uncached
off-chip DRAM. To reduce the number of required registers two techniques
are applied in the projection kernels. The first idea is to replace registers
by shared memory. This however depends on the shared memory usage, the
register usage and the occupancy of a kernel. Occupancy is a function of
how many resources a kernel uses. The resources are registers and shared
memory. If a kernel uses either too many registers or too much of the avail-
able shared memory the occupancy goes down. The projectors generally use
a lot of registers (up to 12 per thread) but not much shared memory (only
up to 2kB of the available 16kB). Therefore some of the registers are moved
to shared memory to even out the resource consumption of the kernels. Listing
3.5 shows two implementations of an interpolation: one using registers and
the other using shared memory to store intermediate results.
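The occupancy figures quoted below can be estimated with a rough back-of-the-envelope calculation, assuming (for illustration only) 128-thread blocks and G80-class hardware with 8192 registers and at most 768 resident threads per multiprocessor:
\[
\Bigl\lfloor \tfrac{8192}{12 \cdot 128} \Bigr\rfloor = 5 \text{ blocks} \;\Rightarrow\; \tfrac{5 \cdot 128}{768} \approx 83\,\%,
\qquad
\Bigl\lfloor \tfrac{8192}{10 \cdot 128} \Bigr\rfloor = 6 \text{ blocks} \;\Rightarrow\; \tfrac{6 \cdot 128}{768} = 100\,\%.
\]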
Using the shared memory implementation the compiler can reduce the
amount of required registers from 12 to 10 resulting in 100% occupancy as
opposed to 83% when using 12 registers. Another method to reduce the
/* unsigned int constant memory defines */
#define RHOSAMPLES 0
#define ZSAMPLES   1
#define K_NUMUI    25

/* define array with implicit allocation on GPU */
__constant__ unsigned int ui[K_NUMUI];

/* define parameters on CPU and copy to GPU */
kpui = (unsigned int *)
    malloc(sizeof(unsigned int) * K_NUMUI);
kpui[RHOSAMPLES] = RhoSamples;
kpui[ZSAMPLES] = ZSamples;
cudaMemcpyToSymbol(ui, kpui,
    K_NUMUI * sizeof(unsigned int));

/* use in kernel on GPU */
s = shear[x * ui[ZSAMPLES] + y * ui[RHOSAMPLES]];
Listing 3.6: Usage of constant memory to reduce register usage
number of registers is constant memory. It is a 64kB region of global
memory, and each multiprocessor has an 8kB dedicated cache for it. Only
read operations are possible when using constant memory, however read ac-
cess on a cache hit is as fast as register access. Therefore it is used to make
constants and parameters available to the kernel. The projector kernels use
two arrays of constant memory, one storing floating point constants, the
other one containing integer constants. The constants are used by access-
ing elements of the array via predefined indices. Listing 3.6 illustrates this
principle by making the image dimensions available to kernels via constant
memory. Global constants are defined to access elements of the array, the
memory on the GPU is allocated using the constant keyword, the pa-
rameters are specified on the host and are copied to the GPU where they
can be used in the kernel. A more elegant method would be to use one
single structure, however as of CUDA version 1.1 the compiler has problems
generating correct code when structures are stored in constant memory due
to the variable lengths of the members.
The second optimization principle is related to memory. Memory access
is by nature very costly and often constrains the performance of memory-intensive
algorithms. Memory access on modern graphics cards takes between
400 and 600 cycles compared to 1 cycle for register access. Therefore it is
important to reduce memory reads and writes by optimizing the code to use
caches and to reduce memory access in general. This can often be achieved
by bundling many small memory accesses into one large bulk access. An-
other technique is to hide memory latency by rearranging instructions or
executing threads in parallel. While some threads wait for a memory fetch
to finish, other threads might use the GPU to calculate results, thus not
wasting any GPU time while waiting.
GPUs support all of those optimization techniques. Some are built into
hardwired parts of the device architecture. For example, latency hiding by
executing some threads while others are blocked on memory is part of the
GPU's integrated functionality. The thread execution manager takes care of
the optimal arrangement of threads and memory access. All the program-
mer has to do in this matter is to start enough threads so that there are
threads ready to run while others wait. The metric to determine if enough
threads are started is occupancy.
Section 3.3.4 describes a way to access memory so that the hardware can
optimize it by fetching large chunks of memory at once instead of accessing
only small amounts with multiple separate memory round trips. This is an
important optimization technique and can drastically
speed up kernel execution. Even if the threads of one block do not have
a memory access pattern that can be optimized, with the help of shared
memory this can be circumvented. The idea is that all threads in one block
copy the data they operate on to shared memory before starting any cal-
culations. The threads can work together to fetch the data in a coalesced
manner that can be optimized by the hardware. After this the threads
synchronize by calling __syncthreads() to ensure that all threads have
finished fetching data from global memory and that the shared memory is
populated with the required data. They can then start their calculations using the data
in shared memory. Shared memory access can be up to 100 times faster
than uncoalesced global memory access [41]. After the threads finish their
calculations the same principle can be applied to write back data. Threads
write their intermediate and final results to shared memory and after every
thread is finished calculating they issue coalesced memory writes.
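The following is a minimal sketch of this read/compute/write-back pattern. It is not taken from the projector code; the kernel name, the neighbour-averaging calculation and the one-thread-per-element mapping are illustrative assumptions only:

__global__ void coalescedPattern(const float *d_in, float *d_out)
{
    extern __shared__ float buf[];              // sized at kernel launch
    const unsigned int i = threadIdx.x;
    const unsigned int base = blockIdx.x * blockDim.x;

    buf[i] = d_in[base + i];                    // coalesced read into shared memory
    __syncthreads();                            // all input data is now available

    // example calculation working only on shared memory
    float result = 0.5f * (buf[i] + buf[(i + 1) % blockDim.x]);
    buf[i] = result;                            // intermediate result to shared memory
    __syncthreads();                            // all threads finished calculating

    d_out[base + i] = buf[i];                   // coalesced write back to global memory
}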
Listing 3.7 shows a typical code section taken from the fwd-ob kernel
function that populates shared memory before calculation. Each thread
fetches one image voxel from the sheared image; one thread-block fetches
const unsigned int z = threadIdx.x;
__syncthreads();
shared[z] = shear[z + x * ZSAMPLES +
                  y * ZSAMPLES * RHOSAMPLES];
__syncthreads();
Listing 3.7: Populating CUDA shared memory
an entire z row. Each thread can now calculate the interpolation (compare
Figure 3.8(b)) along the z axis using the values from shared memory. The
results are also stored in shared memory and written to the sinogram in a
similar fashion after all calculations are finished.
Data transfers between host system and graphics device result in overhead
that is unique to CUDA algorithms. To speed up these transfers the
CUDA API offers special function calls to allocate and use so-called pinned
memory. Pinned memory is page-locked memory that cannot be moved
or paged out of RAM. The CUDA driver is able to track page-locked
memory as its address never changes. If memory transfers are initiated be-
tween pinned memory and the CUDA device the driver is able to perform
direct memory access (DMA) copies between the device and the memory.
On memory transfers between regularly allocated memory and the CUDA
device the memory is copied to small CUDA private pinned memory buffers.
This involves additional CPU time and an extra copy operation which de-
grades performance. Listing 3.8 shows the function that wraps the CUDA
call to allocate pinned memory inside a C function.
In some cases using pinned memory can worsen overall system efficiency
because it lessens the amount of available memory. However during PET
reconstruction all data structures have to fit into user memory anyway to
avoid swapping and guarantee optimal performance. Therefore the recon-
struction systems are equipped with enough memory so that CUDA pinned
memory can be used without any drawbacks.
3.4.4 Results
First tests are done using a workstation test system with two dual-core Intel
Xeon processors clocked at 2660MHz, 8GB of RAM, the Intel S5000PSL
server mainboard (8x PCIe) and a Geforce 8800 Ultra clocked at 1512MHz.
The reconstruction is calculated using the unweighted OSEM algorithm.
extern "C" float *
CUDA_mallocPinned(unsigned long int size)
{
    float *t;
    cudaMallocHost((void **) &t, size);
    return t;
}
Listing 3.8: Wrapper function for pinned memory allocation
Three iterations with 21 subsets are calculated. The sinogram dimensions
are 336x336x559, the final image dimensions are 336x336x109.
The first results are encouraging. Figure 3.9 shows the runtime of the
reconstruction using different setups and optimization. Table 3.4 describes
the meaning of the reconstruction setups and shows the speedup
\[ S_p = \frac{T_{previous}}{T_{current}} \tag{3.10} \]
of each configuration compared to the previous one. The CPU setups implement
the same algorithm as the existing validated software. The "GPU"
setup is a naive implementation with no optimizations applied. It is implemented
to prove feasibility and to serve as a reference point for benchmarking the
optimizations used. In the "GPU: opt1" setup the same optimization that is
used on the CPU is applied to the GPU: the index is reshuffled from xyz to zxy
for faster continuous data access. The "GPU: opt2" setup contains the opt1
optimization and utilizes the pinned memory functions of the CUDA API as
described above. In the "GPU: opt3" setup, in addition to the opt2 optimizations,
all kernels are modified to use coalesced memory accesses where possible.
The GPU reconstruction with best optimizations applied takes only 48% of
the time of the validated reconstruction algorithm used in current produc-
tion systems.
Figure 3.10 shows the absolute error sum, the average error and the
maximum error when comparing the reconstructed image to a reference
image reconstructed using existing validated software. The scale of the y-
axis is logarithmic. It shows a small error for the two CPU implementations
with an error sum of 0.00015. The error of the GPU implementations is
higher yet constant for all setups. The error sum is 0.05, with an average
error of $5.99 \cdot 10^{-9}$ and a maximum error of $3.71 \cdot 10^{-6}$. This is well within
[Bar chart: runtime in seconds (0 to 300s) per setup (CPU: 1 thread, CPU: 4 threads, GPU, GPU: opt1, GPU: opt2, GPU: opt3), broken down into backprojection, forwardprojection and overall reconstruction time.]
Figure 3.9: Runtime comparison of different setups
Setup           Description                                         Sp
CPU: 1 thread   one CPU thread (one core) is used                   -
CPU: 4 threads  4 CPU threads (all four cores) are used             2.14
GPU             GPU with simple implementation (no optimization)    1.19
GPU: opt1       GPU with optimizations: index reshuffle xyz to zxy  1.33
GPU: opt2       GPU with optimizations: opt1 and pinned memory      1.11
GPU: opt3       GPU with optimizations: opt2 and coalesced access   1.19
Table 3.4: Setup description and speedup
acceptable boundaries.
3.4.5 Other algorithms
Apart from the projectors themselves other kernels are created to support
the projector kernels. As mentioned earlier, coalesced memory access is
very important. To meet the coalesced memory access requirements for
a two-dimensional array as it is used in the kernels, indexed for example
with width * blockIdx.y + threadIdx.x, the width has to be a multiple of 16 [41]. The
width of one dataset however is typically 120 (109 padded to 120). So a
preprocessing step is added to the projectors that pads the dataset width
[Plot, logarithmic y-axis (10^-10 to 10^0): error sum, error average and error maximum per setup.]
Figure 3.10: Absolute error comparison of different setups
from 120 to 128 and after projection the padding is removed. Figure 3.11
shows the difference between a padded and an unpadded array in a memory
block that is 128 elements wide.
Figure 3.11: Unpadded and padded memory
In the upper part of the image the unpadded array is displayed. The
width of the array does not match the width of the memory, it is 8 elements
too short. As a result the second dimension of the array always starts at
a different memory address. For the CUDA device to access this data
efficiently the x dimension always has to start at the first address of a mem-
ory block as illustrated in the lower part of the image. At the end of each
z dimension 8 empty elements are added to pad out the array. Listing 3.9
shows a kernel that can pad and unpad a three-dimensional image volume.
__global__ void padunpadVolume(float *in,
    float *out, unsigned short int inDim,
    unsigned short int outDim,
    unsigned short int xy)
{
    const unsigned int x = blockIdx.x;
    const unsigned int z = threadIdx.x;

    for (unsigned short int y = 0; y < xy; y++)
        out[z + y * outDim + x * outDim * xy] =
            in[z + y * inDim + x * inDim * xy];
}
Listing 3.9: Kernel to pad and unpad 3 dimensional arrays on the GPU
Input and output arrays are passed to the kernel as well as their respec-
tive dimensions. The kernel then pads or unpads the input array depending
on inDim and outDim by iterating over the input and output volumes using
blocks for the x dimension and threads for the z dimension.
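A sketch of how such a kernel might be launched, assuming (as described above) one block per x index, one thread per copied z element and the 336x336 transaxial image size; the device pointers are hypothetical and the padded region is assumed to have been cleared beforehand, e.g. with cudaMemset:

dim3 grid(336);    // one block per x index
dim3 block(120);   // one thread per copied z element
/* pad the z extent from 120 to 128 */
padunpadVolume<<<grid, block>>>(d_unpadded, d_padded, 120, 128, 336);
/* unpadding uses the same kernel with pointers and dimensions swapped */
padunpadVolume<<<grid, block>>>(d_padded, d_unpadded, 128, 120, 336);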
Another additional algorithm is implemented in a kernel for index reshuf-
fling. It efficiently reorders the indices from xyz to zxy , the required index
order for the projection kernels. The reshuffling implementation is compa-
rable to the CUDA Matrix Transpose SDK Example [35] where the indices
of a two dimensional array are exchanged. The implementation is based on
the idea of using shared memory as a buffer and reading and writing data
as efficiently as possible from and to memory. The algorithm is illustrated
in Figure 3.12. It shows one half of a 4x4x4 volume on the left side indexed
by xyz , in the middle the intermediate buffer or shared memory and on the
right side the reshuffled image.
Data is accessed as coalesced as possible in global memory and read
and written scattered to and from shared memory as the access to global
memory has to follow strict rules to be efficient, whereas shared memory
operations can be scattered and are still faster. All threads in one block
execute coalesced reads in x direction, write the data to shared memory and
then loop to their next z dimension where again each thread reads data and
writes it to shared memory. After each thread has filled up its part of the shared
memory, which is ensured by a call to __syncthreads(), the threads start to
write back data to the new array, again in a coalesced manner. They scatter
their access to shared memory so that they can execute fully coalesced write
Figure 3.12: Index reshuffling on CUDA device
operations to global memory.
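A minimal sketch of this idea, modeled on the 2D transpose pattern rather than on the actual projector kernel (the 16x16 tile size, the names and the plane-by-plane loop over y are assumptions for illustration):

__global__ void reshuffleXYZtoZXY(const float *in, float *out,
                                  int nx, int ny, int nz)
{
    __shared__ float tile[16][17];           // +1 column avoids bank conflicts
    for (int y = 0; y < ny; y++) {           // one zx plane at a time
        int x = blockIdx.x * 16 + threadIdx.x;
        int z = blockIdx.y * 16 + threadIdx.y;
        if (x < nx && z < nz)                // coalesced read along x
            tile[threadIdx.y][threadIdx.x] = in[x + y * nx + z * nx * ny];
        __syncthreads();

        x = blockIdx.x * 16 + threadIdx.y;   // swap thread roles for the write
        z = blockIdx.y * 16 + threadIdx.x;
        if (x < nx && z < nz)                // coalesced write along z
            out[z + x * nz + y * nz * nx] = tile[threadIdx.x][threadIdx.y];
        __syncthreads();                     // tile is reused for the next plane
    }
}

Such a kernel would be launched with dim3 block(16, 16) and dim3 grid((nx + 15) / 16, (nz + 15) / 16).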
3.4.6 Texture Units
The texture units are an integral part of every graphics processing unit. On
modern G80 chips they support implicit interpolation when accessing data
in 2D and 3D textures with different interpolation methods. In conjunction
with texture caches they provide a very effective way to access memory and
have interpolation calculated implicitly. These powerful resources would be
ideal for the implementation of projectors but are not exploited in this work
for mainly two reasons. First, one of the main goals of this work is to create
a proof of concept that reconstruction algorithm implementations can be
ported to the GPU and that both CPU and GPU implementations calcu-
late the same results. Therefore a more ”literal” port from CPU to GPU is
preferred over implementing an entirely new algorithm. The implicit inter-
polation of the texture units are largely out of the control of the programmer
and can thus not be tuned easily to produce the same results as the CPU in-
terpolation algorithm. Second, at the time this algorithms are implemented
there is no 3D texturing support for CUDA. Now with CUDA version 2.2
3D textures are supported. It is therefore suggested as one of the next steps
of this work to prototype an implementation utilizing the texture units of
the GPU. It is likely that another significant speedup on top of the speedup
of the current GPU implementation might be achieved if CPU results can
be reproduced using the texture units.
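As an illustration only, a prototype based on the CUDA 2.2 texture API might look roughly like the following sketch; the names tex_img, sampleVolume and bindVolume are placeholders, and the hardware's fixed-point interpolation would still have to be compared against the CPU results:

// 3D texture reference with hardware trilinear interpolation (file scope)
texture<float, 3, cudaReadModeElementType> tex_img;

// kernel: one interpolated fetch replaces the manual interpolation
__global__ void sampleVolume(float *out, float px, float py, float pz)
{
    out[threadIdx.x] = tex3D(tex_img, px + threadIdx.x, py, pz);
}

// host side: copy the image into a cudaArray and bind it to the texture
void bindVolume(const float *h_img, cudaExtent extent /* in elements */)
{
    cudaArray *arr;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMalloc3DArray(&arr, &desc, extent);

    cudaMemcpy3DParms p = {0};
    p.srcPtr = make_cudaPitchedPtr((void *) h_img,
        extent.width * sizeof(float), extent.width, extent.height);
    p.dstArray = arr;
    p.extent = extent;
    p.kind = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&p);

    tex_img.filterMode = cudaFilterModeLinear;  // enable trilinear interpolation
    cudaBindTextureToArray(tex_img, arr, desc);
}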
3.5 Debugging and Validation
One crucial requirement of this work is to reproduce the results of the recon-
struction process used in PET products today. An in-depth validation of the
new implementations is therefore necessary. Also, debugging methods and
tools are required to validate the reconstruction process step by step and to
quickly detect, locate and resolve errors. For these tasks a number of special
purpose utilities are created and used next to already existing products.
3.5.1 Debugging
For low level debugging the Visual Studio 2005 built in debugger is used in
conjunction with the CUDA emulation mode described in subsection 3.3.6.
For numeric evaluation of the reconstructed image the hex editor XVI32
[32] is used. It allows easy navigation through the raw image data and
automatically interprets the data as IEEE 754 single precision floating point.
Figure 3.13 shows the hex editor in action.
Figure 3.13: Hexeditor
Another debugging tool is an application called SinogramViewer. It is
specifically developed for the objective of this work and provides a number
of important features for debugging and development. The following is an
incomplete list of the features of the SinogramViewer tool.
• side by side comparison of sinograms and images with 3D axis selection
for images and angle selection for sinograms
• sinogram rotation and index reshuffling
• step by step control of reconstruction steps instrumentalizing the re-
construction binaries for online side by side data comparison during
reconstruction
• support for multiple image and sinogram sizes
• animated image rotation (3D view)
• data comparison (sinogram and image)
• test data creation algorithms (sphere, cylinder, cube)
• C# CPU projectors
The tool is a C# .NET application that is developed and enhanced over
time. It starts out as a simple sinogram and image viewer, provides an envi-
ronment for first tests with projector algorithms and evolves into a powerful
debugging tool with many different features. Figure 3.14 shows the large
image and sinogram loading part of the application with 2 sinograms loaded.
The bottom slider allows scrolling through the different view angles of the
sinograms. Furthermore, a CPU projector can be started to backproject
a specified view angle of both sinograms. The difference between the two
sinograms can be calculated as well.
Another tool used is VINCI, a commercial product ”designed for the
visualization and analysis of volume data generated by medical tomographic
systems with special emphasis on the needs for brain imaging with Positron
Emission Tomography” [53]. It allows among other things the simultaneous
in depth examination of multiple images, plotting of profiles and image
arithmetic.
3.5.2 Validation
For validation a flexible test script and a command line tool for image comparison
are created. The script allows an automated run of reconstruction
tools with various input data and parameters as well as the automated
validation of the reconstruction result. It is implemented as a Windows batch
script. Validation is done with the help of precalculated reference reconstruc-
tion results. These are created with the same input data and parameters
Figure 3.14: Sinogram Viewer Tool
SET VALBIN=rawdf
SET TESTDESC=(Brain, Large, UW-OSEM)
SET TESTCASENAME=BrainLUW
SET PARAMTEST=--algo uw-osem --is 3,28 --fl -l 73,. --force --gpu
SET PARAMGOLD=--algo uw-osem --is 3,28 --fl -l 73,. --force

REM delete existing files
del /Q %TESTOUTDIR%%TESTCASENAME%\*
del /Q %GOLDOUTDIR%%TESTCASENAME%\*

REM create files
%TESTBIN% -e %INDIR%%INFILE% --oi %TESTOUTDIR%%OUTFILE% %PARAMTEST%
%GOLDBIN% -e %INDIR%%INFILE% --oi %GOLDOUTDIR%%OUTFILE% %PARAMGOLD%

REM compare files
%VALBIN% %TESTOUTDIR%%OUTVOLUME% %GOLDOUTDIR%%OUTVOLUME%
Listing 3.10: Testscript excerpt
as the test case but the validated reconstruction process is used to create
them. The test is successful if the reference images can be reproduced by the
new reconstruction process. An alternative that is also possible is to provide
reference binaries. Those are proven binaries of a validated reconstruction
system. Both the test binaries and the reference binaries are executed with
the same parameters and input data. After both binaries finished recon-
struction the results are compared and analyzed. Listing 3.10 shows a small
excerpt of a test script that uses a reference and a test binary to reconstruct
a brain scan using the unweighted OSEM algorithm.
To compare and analyze the results, a command line tool called rawdiff
is created to calculate the difference between two reconstructed images. It is
implemented as a C++ console application. It takes two paths and filenames
to the images as input and compares those two images. It can be called from
within the test script and its output is displayed on the screen or written to
a log file for automated test run protocols.
The tool calculates five different metrics to compare the two images and
thereby determines the accuracy of the result. The metrics are the absolute
error Eabs , the relative error Erel , the mean squared error Emse , the root
mean squared error Ermse and the normalized mean squared error Enmse .
The absolute error
\[ E_{abs}(x,y,z) = \lambda(x,y,z) - \lambda_{gold}(x,y,z) \tag{3.11} \]
is used to determine the overall accuracy of the calculated result compared
to the reference image. The comparison tool outputs the maximum absolute
error max (Eabs) and the mean absolute error
\[ E_{mean,abs} = \frac{1}{xyz} \sum_{z=0}^{z_{max}} \sum_{y=0}^{y_{max}} \sum_{x=0}^{x_{max}} \bigl( \lambda(x,y,z) - \lambda_{gold}(x,y,z) \bigr) \tag{3.12} \]
The second metric is the relative error
\[ E_{rel}(x,y,z) = \frac{\lambda(x,y,z) - \lambda_{gold}(x,y,z)}{\lambda_{gold}(x,y,z)} \tag{3.13} \]
which is used to determine the number of decimal places that match. Again
the maximum max (Erel ) and the mean
\[ E_{mean,rel} = \frac{1}{xyz} \sum_{z=0}^{z_{max}} \sum_{y=0}^{y_{max}} \sum_{x=0}^{x_{max}} \frac{\lambda(x,y,z) - \lambda_{gold}(x,y,z)}{\lambda_{gold}(x,y,z)} \tag{3.14} \]
relative errors are displayed by the comparison tool. The number of inaccu-
rate decimal places n is calculated by
\[ n = \log_{10}\!\left( \frac{E_{rel}}{eps} \right) \tag{3.15} \]
with $eps = 6 \cdot 10^{-8}$ for single precision floating point numbers according
to IEEE 754. Additionally the mean squared error and its normalized
form are used as proposed in [1]. The MSE is defined as
\[ E_{mse} = \frac{1}{xyz} \sum_{z=0}^{z_{max}} \sum_{y=0}^{y_{max}} \sum_{x=0}^{x_{max}} \bigl( \lambda(x,y,z) - \lambda_{gold}(x,y,z) \bigr)^2 \tag{3.16} \]
The root mean squared error, a good indicator of accuracy, is often used to
quantify the discrepancy between two images that may differ. It is calculated
rawdf .\recon.i ..\..\dev\data\image_00.v --abs
comparing .\recon.i with ..\..\dev\data\image_00.v
8339557 errors found;
abs sum: 0.0499405
abs av : 5.98839e-009
abs max: 3.70748e-006 at adr 12259472
Listing 3.11: rawdiff tool output
by
\[ E_{rmse} = \sqrt{E_{mse}} \tag{3.17} \]
The normalized MSE is calculated by first normalizing the images relative to
their intensities with $\lambda(x,y,z)_{norm} = \lambda(x,y,z) / \mu$, where $\mu$ is the intensity
of the image $\lambda$.
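A minimal sketch of how a tool like rawdiff might compute the absolute error statistics shown in Listing 3.11 (the function name and the restriction to the absolute error metric are assumptions; the real tool also computes the relative, MSE, RMSE and NMSE metrics):

#include <cmath>
#include <cstdio>
#include <vector>

// compare two volumes voxel by voxel and print sum, average and maximum absolute error
void compareAbs(const std::vector<float> &img, const std::vector<float> &gold)
{
    double sum = 0.0, maxErr = 0.0;
    std::size_t maxAdr = 0, errors = 0;
    for (std::size_t i = 0; i < img.size(); ++i) {
        double e = std::fabs(img[i] - gold[i]);
        if (e > 0.0) ++errors;
        sum += e;
        if (e > maxErr) { maxErr = e; maxAdr = i; }
    }
    std::printf("%lu errors found;\n", (unsigned long) errors);
    std::printf("abs sum: %g\n", sum);
    std::printf("abs av : %g\n", sum / img.size());
    std::printf("abs max: %g at adr %lu\n", maxErr, (unsigned long) maxAdr);
}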
Listing 3.11 shows the example output for the comparison of a recon-
structed image. The absolute error sum, average and maximum are displayed,
as well as the address of the maximum error.
Different test cases include different input data sizes, different input data
content (e.g. whole-body scans, brain scans, mathematical data such as a
sphere) and different reconstruction parameters and modes such as unweighted
reconstruction, attenuation corrected reconstruction, and LOR space or PB
space reconstruction.
Figure 3.15 shows four of the used test datasets. The first image is a
reconstruction of a neck scan from the nose to the center of the sternum.
The second image shows the scan of a phantom. A phantom is a device filled
with radiating substances that can be placed into PET systems. Depending
on the phantom type they simulate different regions of a patient’s body.
The third image is a full body patient scan and the bottom image shows a
reconstructed image of a uniform phantom. All images have the contrast
enhanced for better visibility and are reconstructed without attenuation
correction.
3.6 Product considerations
The GPU reconstruction algorithms developed are planned to be used in the
PET Reconstruction System - a headless workstation-like computer running
Figure 3.15: Testimages
Windows XP 64bit with 8GB of RAM and 2 multicore processors. The new
version of the reconstruction system should contain a CUDA capable graph-
ics device for GPU image reconstruction. A series of tests are devised to
determine which CUDA card should be used, whether or not the CUDA
device can be used as display adapter and for reconstructing PET sino-
grams at the same time and which mainboard should be selected for the
system. The tests can be categorized in three different groups: memory
tests, reconstruction tests and stability tests. The memory tests are used
to compare the CUDA device memory subsystems. Their performance is an
essential factor for image reconstruction runtimes. Additionally a number of
tests to benchmark the host to device data transfer rates are executed. The
reconstruction benchmarks are implemented to determine the performance
of the different system configurations running real world reconstruction al-
gorithms. The stability tests are done to determine how the systems behave
under constant load.
For all memory tests the data packet size was in the range from 10MB
to 200MB, starting from 10MB and gradually increasing in 10MB steps. For
the analysis only the timings for 10MB, 50MB, 100MB and 200MB transfer
sizes were taken into consideration. They represent different data packets
relevant to reconstruction like the full sinogram, the sinogram subset, the
image or a large image. For the reconstruction tests the algorithm was full
3D UW-OSEM. 3 iterations and 21 subsets were calculated. The dataset is
taken from a real brain scan, the sinogram dimensions are 336x336x559, the
final image dimensions are 336x336x109.
The mainboard choice is narrowed down by availability, price and com-
patibility with processors and RAM. The two mainboards to choose from
are the server board Intel S5000PSL [21] and the workstation board Intel
S5000XVN [22]. The list of CUDA capable cards available at that time for
the reconstruction system are the Geforce 8800 Ultra clocked at 1512MHz, a
special version of the Geforce 8800 Ultra clocked at 1663MHz, the Tesla C870
(a dedicated general purpose GPU without video output) and the Geforce
8800 GTX. In the cases where the tested CUDA device was not the display
adapter (always for the Tesla card), a Nvidia Quadro NVS 290 was used
for video output. Due to driver incompatibilities the card used as display
adapter has to be compatible with the card used for reconstruction, so both
have to be CUDA capable cards.
cudaEventCreate();        // create an event
cudaEventRecord();        // insert event into stream
cudaEventSynchronize();   // synchronize to event
cudaEventElapsedTime();   // calculate time difference
cudaEventDestroy();       // destroy an event
Listing 3.12: CUDA timing events
3.6.1 Timing the Tests
Timing CUDA operations, whether they are kernel calls or other function
calls, is not a trivial task. The problem is that after starting a kernel from the host,
the function call returns immediately. Before CUDA version 1.0 one had to
force the calling context to wait for the CUDA function to return, which it
did by busy waiting. With newer CUDA versions Nvidia introduced event
mechanisms. The entire program execution is seen as a stream into which
events can be inserted. Between events the elapsed time can be calculated
after synchronizing to the last event to make sure the stream execution
has reached it. Listing 3.12 lists the functions available for timing CUDA
executions.
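A minimal usage sketch of these calls with their actual signatures (the kernel launch is a placeholder):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                // insert start event into stream 0
someKernel<<<grid, block>>>(/* ... */);   // asynchronous kernel launch
cudaEventRecord(stop, 0);                 // insert stop event after the kernel
cudaEventSynchronize(stop);               // wait until the stream has reached it

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);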
To simplify the use of the CUDA event system a library is created that
wraps the CUDA event system functionality in a transparent manner. It pro-
vides the function cudaEvent(char *id, char *tag); which handles the event
creation and recording. It takes an id to identify the event inserted into
the stream and also allows tagging of events to group them together. The
function cudaEventFinish() finishes all timing operations by synchronizing on
the last events and then calculates the time difference between consecu-
tive events using cudaEventElapsedTime(). Additionally the time differences
for events with the same tags are added up, and an average is calculated.
cudaEventFinish() then returns a structure containing the calculated timing
information. The library can also handle multiple threads which is nec-
essary if multiple CUDA devices are used together because their contexts
should be spread across different threads. The library is able to distinguish
between threads and CUDA contexts and can log their timing information
separately.
3.6.2 Memory Tests
The main difference between the two mainboards is the width of their PCI
Express interconnects. The PCI Express bus of the server board has 8
interconnect lanes whereas the workstation board has 16. The transfer rate
of each bus lane is 250MB/s for PCI Express 1.0 [8]. Hence the theoretical
transfer rate of the workstation board is twice the rate of the server board.
During reconstruction a significant amount of data has to be transferred from
host to device and back: during forwardprojection the image is uploaded
and the sinogram is downloaded, during backprojection the sinogram
is uploaded and the image is downloaded. Depending on the number of
subsets and iterations the amount of data transferred differs, but when
assuming a regular sized sinogram and 3 iterations with 21 subsets the data
transferred amounts to 2970MB. Therefore the transfer rate of the bus has
an influence on the reconstruction time.
[Plot: transfer rate in GB/s versus data size in MB for PCIe 8x pageable, PCIe 8x pinned, PCIe 16x pageable and PCIe 16x pinned memory.]
Figure 3.16: Host to device PCI Express 8x and 16x pinned and pageable
Figure 3.16 shows a comparison of the transfer rate of the Geforce 8800
Ultra (1663MHz) used in the two different mainboards and using pageable
or pinned memory. The first two graphs (PCIe 8x pageable and PCIe 8x
pinned) show the transfer rate of the card working in the Intel S5000PSL.
From the graphs one can conclude that it does not matter whether pageable
or pinned memory is used when using only an 8 lane wide PCI Express
bus. The transfer rate is hitting the maximum possible transfer rate of the
bus which effectively is about 1.5GB/s. Theoretically it should be 2GB/s
(8x250MB/s). The other two curves (PCIe 16x pageable and PCIe 16x
pinned) show the device operating in the Intel S5000XVN. It can be seen
that for pageable memory the maximum speed is about 2GB/s where the
maximum theoretical bandwidth is 4GB/s. In this case the copy from the
regularly allocated memory to the dedicated area of pinned memory reserved
for CUDA host-to-device transfers, which is performed by the CPU, limits
the transfer rate. The maximum transfer rate of 3GB/s is achieved when
using pinned memory and a 16 lanes wide bus. It again does not reach the
maximum theoretical rate of 4GB/s but is twice as fast as the PCIe 8x bus
and is thus consistent with the other measurements if it is assumed that the
device can reach 75% of the possible transfer rates.
[Plot: transfer rate in GB/s versus data size in MB for the Geforce 8800 Ultra (1512MHz) and the Tesla C870, each on PCIe 8x and PCIe 16x (pinned memory).]
Figure 3.17: Transfer rate for PCI Express 8x and 16x (pinned)
Figure 3.17 shows that this is true for the Geforce 8800 Ultra (1512MHz)
and the Tesla C870 cards as well. The timings are taken from pinned mem-
ory transfers. Devices connected via PCI Express 8x bus are able to transfer
data about half as fast as the devices using the 16 lane bus.
Figure 3.18 shows the maximum host to device memory transfer rates
for all available cards. The Tesla C870 is the fastest card overall when com-
paring host to device copies and the Geforce 8800 GTX performs worst.
The difference however is only marginal and a difference of 0.4GB/s trans-
fer rate does not significantly influence the runtime of the reconstruction
[Plot: maximum host to device transfer rate in GB/s (about 2.97 to 3.04) versus data size in MB for the Geforce 8800 Ultra (1663MHz), Geforce 8800 Ultra (1512MHz), Tesla C870 and Geforce 8800 GTX.]
Figure 3.18: All cards maximum host to device performance
algorithms. Yet the difference between 8x and 16x PCI Express transfer
rates when using pinned memory is significant. The Intel S5000XVN board
allows CUDA devices to transfer data at almost twice the rate compared to
the Intel S5000PSL.
Figure 3.19 shows a comparison of the on-device memory transfers of all
available CUDA cards. It shows the Geforce 8800 Ultra (1663MHz) with its
memory clock at 1125MHz to be the fastest device with a maximum transfer
rate of 87GB/s. The second fastest card with 81GB/s is the same type of
card with lower clock frequency, the Geforce 8800 Ultra (1512MHz). The
Geforce 8800 GTX can copy memory on device at 71GB/s and the Tesla
C870 at 65GB/s. The cards' maximum transfer rates are directly proportional
to their respective memory clock frequencies.
As a result of the memory test it is recommended to equip the new
Reconstruction System with the Intel S5000XVN workstation board due to
its 16 lanes wide PCI Express bus which enables the system to transfer
data from the host to the device at twice the rate compared to the 8x bus.
The fastest CUDA device is the Geforce 8800 Ultra (1663MHz) which is the
recommended device after this test.
[Plot: on-device transfer rate in GB/s (60 to 90) versus data size in MB for the Geforce 8800 Ultra (1663MHz), Geforce 8800 Ultra (1512MHz), Tesla C870 and Geforce 8800 GTX.]
Figure 3.19: On device transfer rates
3.6.3 Reconstruction Tests
The reconstruction tests show which system configuration is able to recon-
struct PET images the fastest. Table 3.5 shows the overall reconstruction
runtime in seconds, sorted in ascending order, for a 3D UW-OSEM reconstruction
with 3 iterations and 21 subsets, calculating a 336x336x109 image from a
336x336x559 sinogram. Figure 3.20 shows the same information in a graph
for better comparison. The PCIe label designates the mainboard used for
the test. In the case of the PCIe 8x the Intel S5000PSL was used, in case of
PCIe 16x the test system contained the Intel S5000XVN mainboard. The
fastest system is comprised of the Geforce 8800 Ultra clocked at 1663MHz,
the PCIe 16x mainboard and an additional graphics card for display. Only 3
seconds slower is the same configuration with the Geforce card handling both
reconstruction and display. The slowest card is the Tesla C870, performing
worst in either mainboard.
The margin between the best and worst performing configuration
is significant at almost 13 seconds. The graph shows that there is no big
difference in performance between the Geforce Ultra cards when ignoring
the first measurement. The influence of an additional graphics device for
display as opposed to using the same graphics device for reconstruction and
display is hard to measure. The timings range from significant influence to
almost no influence at all. During the test it is noticed that moving the
No.  Configuration                                       Time in s
1    Geforce 8800 Ultra (1663MHz)  PCIe 16x  No Display  55.19
2    Geforce 8800 Ultra (1663MHz)  PCIe 16x  Display     58.35
3    Geforce 8800 Ultra (1512MHz)  PCIe 16x  No Display  58.49
4    Geforce 8800 Ultra (1663MHz)  PCIe 8x   Display     59.39
5    Geforce 8800 Ultra (1663MHz)  PCIe 8x   No Display  59.70
6    Geforce 8800 Ultra (1512MHz)  PCIe 16x  Display     59.82
7    Geforce 8800 Ultra (1512MHz)  PCIe 8x   Display     62.66
8    Geforce 8800 Ultra (1512MHz)  PCIe 8x   No Display  63.18
9    Tesla C870                    PCIe 16x  No Display  64.84
10   Tesla C870                    PCIe 8x   No Display  68.07
Table 3.5: Reconstruction time overview
mouse while the system reconstructs an image using the GPU and while the
same GPU is also used for display can influence the test result, i.e. slow down
the reconstruction. When wildly moving the mouse across the screen the
performance of the reconstruction system decreases dramatically. Therefore
the mouse was never moved during any of the tests. When moving the
mouse, the graphics device has to recalculate the screen image. The GPU
requires resources for doing so, hence those resources are not available for
reconstruction. The GPU seems to be able to dynamically allocate resources
to either computation or display on the fly.
Figure 3.21 shows the overall performance of all system configurations
split up into the separate tasks of the reconstruction process. The "Backprojection
of 1" represents the backprojection of an initial sinogram filled
with ones, which is done right after the reconstruction algorithm starts.
This is the smallest task of the process, taking 6 to 9
seconds. The "Backprojection" is the time the algorithm takes for all backprojection
jobs during reconstruction excluding the projection of the initial
sinogram. The "Forwardprojection" constitutes all forwardprojection calculations
during reconstruction. The times for "Other" contain all other
tasks done during reconstruction like data input, data output, calculation
of the quotient and the new image for each subset, initialization including
parameter precalculation and the calculation of the circular mask.
Figure 3.22 shows how those four tasks make up the entire reconstruction
process. 78% of the reconstruction time is spent projecting images and
sinograms backward and forward and 22% of the time is spent doing other
[Plot: overall reconstruction runtime in seconds per configuration number.]
Figure 3.20: Reconstruction time overview
[Plot: time in seconds per configuration number for backprojection of 1, backprojection, forwardprojection and other tasks.]
Figure 3.21: Reconstruction time detailed overview
things. This image also shows that the reconstruction time can be reduced
by about 12% if the backprojection of 1 is only calculated once and reused
for all reconstructions as it is always the same process and data.
[Pie chart of the reconstruction components: backprojection of 1: 12%, backprojection: 34%, forwardprojection: 32%, other: 22%.]
Figure 3.22: Reconstruction components
Figure 3.23 plots the runtime of the reconstruction against the maxi-
mum theoretical floating point operations per second of the hardware (bot-
tom x-axis) and the maximum theoretical data transfer rate (top x-axis).
The devices are from left to right the Tesla C870, the Geforce 8800 GTX,
the Geforce 8800 Ultra (1512MHz) and the Geforce 8800 Ultra (1663MHz).
The fastest reconstruction times for the hardware are plotted using the In-
tel S5000XVN workstation mainboard and an additional graphics device as
display adapter. Table 3.6 shows the plotted values.
No.  Device                         Time in s  Gflop/s  GB/s
1    Tesla C870                     64.84      346      38.4
2    Geforce 8800 GTX               64.55      346      43.2
3    Geforce 8800 Ultra (1512MHz)   58.49      384      51.84
4    Geforce 8800 Ultra (1663MHz)   55.19      425      54
Table 3.6: Reconstruction time overview
[Plot: reconstruction runtime in seconds versus rate of execution in Gflop/s (bottom axis) and transfer rate in GB/s (top axis) for the four devices.]
Figure 3.23: Reconstruction performance transfer and execution rate
It is interesting to compare the different transfer rates and execution
rates and the resulting reconstruction runtimes. The Tesla C870 and the
Geforce 8800 GTX have the same maximum theoretical rate of execution
however the Geforce 8800 GTX is able to transfer data 5GB/s faster. The
influence of this on the reconstruction runtime is marginal. One can conclude
that the reconstruction runtime is not limited by the maximum transfer rate
of the CUDA device but by its execution rate. For this implementation the
execution rate seems to set a maximum for the performance of the algorithm.
The transfer rate specified here is calculated by multiplying the memory
bus width by the memory clock. Official numbers are twice that number
as they take the double data rate into account. The maximum execution
rate is calculated by multiplying the number of floating point ALUs with
their clock speed multiplied by two because multiply-add instructions can be
executed within one cycle. Official numbers are higher than those specified
here (e.g. 520Gflop/s for the G80 chip) as those numbers take graphics-
specific operations into account.
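For example, for the Geforce 8800 Ultra (1663MHz), assuming the G80's 128 scalar ALUs and 384-bit memory bus, the values of Table 3.6 follow from:
\[
128 \cdot 1.663\,\mathrm{GHz} \cdot 2 \approx 425\ \mathrm{Gflop/s},
\qquad
\tfrac{384\,\mathrm{bit}}{8} \cdot 1.125\,\mathrm{GHz} = 54\ \mathrm{GB/s}.
\]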
3.7 Product Integration
After the successful development of a prototype implementation of the re-
construction algorithm using CUDA and a graphics card the algorithm is
integrated into the Siemens PET reconstruction system. The following sec-
tion describes the requirements and some key implementation details. The
most important requirements are:
• loose integration so that CUDA features can be turned on or off at
compile time and at runtime
• in case of an error the reconstruction system is to fail gracefully
• the reconstruction system has to produce the same results when using
CUDA devices or the regular CPU implementation
• flexible integration so that different and or multiple CUDA devices
can be used for reconstruction including simultaneous use of multiple
cards
The loose integration requirement is met by using the preprocessor constant
CUDA_GPU and by implementing the command line switch --gpu for
the executable. The preprocessor constant encloses all GPU specific classes
and function calls and allows the build toolchain to compile a version of
the executable with or without GPU functionality included. The command
line switch can be specified when starting the reconstruction executable to
instruct the system to use the GPU accelerated reconstruction mode if pos-
sible.
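A simplified sketch of how the two switches might interact; only the constant CUDA_GPU and the --gpu switch are taken from the actual system, the surrounding code and function names are hypothetical:

#ifdef CUDA_GPU
    if (useGpu)                   // set when the executable is started with --gpu
        runGpuReconstruction();   // hypothetical GPU code path
    else
#endif
        runCpuReconstruction();   // hypothetical CPU code path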
The graceful fail requirement is fulfilled by the extensive use of C++
exception handling. CUDA functions are only called from a low level C
CUDA wrapper. All calls to CUDA built-in functions return an error code of
type cudaError_t that can be checked against cudaSuccess. If the function does not
succeed the CUDA wrapper method that calls the built-in function returns
with an error. As there are a lot of calls to CUDA built-in functions, this is
simplified by using a macro that wraps all functions, checks for success and
returns if the call failed. Listing 3.13 shows that macro.
It is called like this: CUDA_SAVE_CALL(cudaSetDevice(GPUId)); and
all functions that contain it have to return an integer value. The
definition of a function that uses the macro might look like this:
#define CUDA_SAVE_CALL(call)                    \
    if (true) {                                 \
        cudaError_t my_cuda_error = call;       \
        if (my_cuda_error != cudaSuccess)       \
            return my_cuda_error;               \
        else                                    \
            (void) 0;                           \
    } else (void) 0
Listing 3.13: CUDA error check
extern "C" int CUDA_InitGPUPB(). All C++ methods from upper layers that
call low-level CUDA wrapper functions have to check the return value of
those functions. If the return value does not indicate successful execution of
the called function, the C++ method throws an exception. This exception
bubbles up within the reconstruction system and reaches already existing
mechanisms to deal with exceptions including logging mechanisms.
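A sketch of this pattern (everything except CUDA_InitGPUPB is hypothetical):

#include <stdexcept>

extern "C" int CUDA_InitGPUPB();            // low level C CUDA wrapper function

void initGpuProjector()                     // hypothetical upper layer C++ method
{
    if (CUDA_InitGPUPB() != 0)              // non-zero return indicates a CUDA error
        throw std::runtime_error("GPU initialization failed");
    // the exception bubbles up to the existing error handling and logging mechanisms
}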
The third requirement, that the reconstruction system produces the
same results when using GPU or CPU, is addressed by a very literal port of
the CPU algorithm to the GPU. Extensive testing and comparison of the
reconstructed images ensures that the requirement is met.
The fourth requirement defines how the GPU projectors should be inte-
grated into the reconstruction system. The CPU projectors start one thread
for each CPU core in the system. It is sensible to extend this principle to
GPU projectors. For every GPU core (there are graphics devices available
that contain 2 GPU cores) a single thread is used. This allows the simul-
taneous use of multiple cards and increases expandability. A second card
could be simply plugged in to the system and the reconstruction system
could use two GPUs for reconstruction.
3.7.1 Integration Overview
The existing reconstruction system is a complex C++ application (>100,000
lines of code). The GPU extensions are integrated with care so that only
small portions of existing code need to be changed. Additionally software
engineering methods used in the existing product are applied to the GPU
extensions as well. In particular, the class model for the different projectors is
copied and adapted to work with GPUs. The GPU projectors have to access
low level C functions to work with the GPU. Those functions are part of a
CUDA wrapper that exports its functions via extern "C" int functionname();
and makes them available for C++ classes to use. Figure 3.24 is the class dia-
gram for the projector system with GPU extensions included. The OSEM3D
class is the class that contains the OSEM reconstruction algorithm. It cre-
ates an object prj3D depending on what kind of projector is required for the
current reconstruction algorithm. An identical interface to all different pro-
jectors is achieved by using a base class Prj3D_Sheared that provides virtual
methods for the actual projectors to implement.
This class model is extended in the following way: the base class
Prj3D_Sheared takes a parameter bool gpu in the constructor that indicates
whether GPU projectors should be used or not. In the case of GPU pro-
jection the GPU projector initializes all the same as the CPU projector
which is important because GPU and CPU projectors require the same set
of parameters. But in addition to initialization the class creates an ob-
ject of GPU_Prj3D_Sheared which is organized the same way as Prj3D_Sheared.
Only the base class for the projectors is moved to a separate class called
GPU_Prj3D_Base. It is again a base class that provides virtual methods for
the actual projectors to implement.
If the prj3D object is asked to project data it checks if the GPU flag is
set. If it is set it tries to use its member prjGPU to project the data. If it is
not possible for the GPU to reconstruct the data due to memory limitations
the system falls back to CPU projectors. Based on reconstruction parame-
ters the system calculates how many resources the projector requires. Based
on that information the system searches for CUDA capable GPUs and de-
termines if there are GPUs available that meet the requirements. The CUDA
wrapper provides a function CUDA_GetGPUProperties(tGPUProperty *const Prop, int GPUId)
that queries all required information from the selected GPU such as global
memory size, shared memory size, maximum grid and block dimensions, the
major and minor device revision number and the clock rate.
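A sketch of how this wrapper function might be implemented on top of the CUDA runtime; the layout of the tGPUProperty structure is an assumption:

#include <cuda_runtime.h>

/* hypothetical property structure; the real tGPUProperty may differ */
typedef struct {
    size_t globalMem;
    size_t sharedMemPerBlock;
    int    maxGridSize[3];
    int    maxThreadsDim[3];
    int    major, minor;
    int    clockRate;                       /* in kHz, as reported by the runtime */
} tGPUProperty;

extern "C" int CUDA_GetGPUProperties(tGPUProperty *const Prop, int GPUId)
{
    cudaDeviceProp p;
    cudaError_t err = cudaGetDeviceProperties(&p, GPUId);
    if (err != cudaSuccess)
        return err;
    Prop->globalMem = p.totalGlobalMem;
    Prop->sharedMemPerBlock = p.sharedMemPerBlock;
    for (int i = 0; i < 3; i++) {
        Prop->maxGridSize[i] = p.maxGridSize[i];
        Prop->maxThreadsDim[i] = p.maxThreadsDim[i];
    }
    Prop->major = p.major;
    Prop->minor = p.minor;
    Prop->clockRate = p.clockRate;
    return cudaSuccess;
}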
3.7.2 Multithreading Implementation
A multithreaded approach is necessary to be able to utilize multiple GPUs
at the same time, and it makes it easier to run other algorithms or even CPU
projectors in parallel to the GPU projectors. The latter is possible because
the CUDA framework is designed so that threads yield if they are blocking
because they have to wait for GPU operations to finish. According to CUDA
Figure 3.24: Class diagram
documentation, threads will block after 16 consecutive kernel calls because
of a full queue that stores kernel calls or after a memory operation is issued.
The threading concept implements a controller-worker principle. A controller
thread dispatches worker threads that perform the calculations. For
each GPU one worker thread is created. This is because each thread is assigned
a different GPU context, a CUDA API concept that allows a process
or thread to specify which card it works with. The controller thread signals
the worker thread to start working and the worker thread signals the
controller thread as soon as it is finished.
Figure 3.25: Multithreading
Figure 3.25 shows this principle. The main thread or controller thread
creates a worker thread and can then execute other operations. The worker
thread will initialize and wait (blocking) for a signal from the main thread
to start projection. If the main thread decides to start the projection it will
first assign the projection parameters to the thread and then signal it to
start. The main thread can then do other things such as scheduling other
projections using another GPU or even CPU. The main thread would do
that by signaling yet another thread. As soon as the main thread started
all projections and finished its operations it will wait for the worker thread
by blocking on a signal. As soon as the worker thread finished projection
it will signal the main thread to indicate that the projected data is ready
to be consumed. The worker thread will then block again for the signal
from the main thread to start yet another projection. For the sake of clarity the
figure does not contain the necessary synchronization point after the worker
thread is correctly initialized. This is necessary because otherwise the main
thread would try to access variables of the worker that are not initialized
yet. There is also no exit condition shown in the figure. It is implemented
so that the main thread is able to set a flag that signals the worker thread
to exit. The main thread is able to start any number of worker threads.
Currently the thread and signaling mechanisms are implemented using
POSIX Threads, a POSIX standard for thread implementations. A thread
is started using pthread_create() and closed using pthread_join(). The
signaling and synchronization is currently implemented with 2 semaphores
that are locked and released by either thread using pthread_mutex_lock() and
pthread_mutex_unlock(). This implementation proved to be difficult to maintain,
so an alternative is suggested that better implements the signaling
metaphor.
A signaling class signal is suggested that contains only one mutex, a POSIX Threads
condition variable, a boolean signal variable and an unsigned integer waiting
variable, and that provides two methods, void wait(); and void send();.
Both methods lock a mutex upon entering to guarantee mutual ex-
clusion. This is necessary because both methods access and modify the
same variables and are called from different threads. Additionally the
pthread_cond_wait() and pthread_cond_signal() functions require the mutex to
be locked. For the wait function there are two cases to consider after it
locked the mutex. In the first case the sender already signaled the receiver.
If so the receiver can go ahead without interruption. If however there is no
signal, the waiting thread indicates that one more thread is waiting for the
signal by incrementing the waiting member. After that pthread_cond_wait() is
called which handles two things: first it transfers the thread into a waiting
state until the signal condition variable is set. Second, the mutex is unlocked
as long as the thread is blocked. The send method considers two states as
well. If no thread is waiting, the signal variable is simply set to indicate the
void signal::wait() {
    pthread_mutex_lock(&mutex);
    if (signal)                 // signal already received
        signal = false;
    else {                      // wait for signal
        waiting++;              // count waiting threads
        // release the mutex and block the thread
        pthread_cond_wait(&cond, &mutex);
        // here the mutex is locked again
        waiting--;
    }
    pthread_mutex_unlock(&mutex);
}

void signal::send() {
    pthread_mutex_lock(&mutex);
    if (waiting < 1)            // no threads waiting
        signal = true;
    else                        // signal waiting threads
        pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mutex);
}
Listing 3.14: Signaling methods
signal. If a thread is waiting and thus blocking on the condition variable,
the method sends the signal.
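The corresponding class declaration might look like the following sketch. The member names are assumptions; the boolean flag that the text and Listing 3.14 simply call signal is renamed here because a C++ class member cannot share the name of its class:

#include <pthread.h>

class signal
{
public:
    signal() : signaled(false), waiting(0) {
        pthread_mutex_init(&mutex, 0);
        pthread_cond_init(&cond, 0);
    }
    ~signal() {
        pthread_cond_destroy(&cond);
        pthread_mutex_destroy(&mutex);
    }
    void wait();                  // block until a signal arrives (Listing 3.14)
    void send();                  // wake a waiting thread or store the signal

private:
    pthread_mutex_t mutex;        // guarantees mutual exclusion in wait() and send()
    pthread_cond_t  cond;         // waiting threads block on this condition variable
    bool            signaled;     // set by send() when no thread is waiting
    unsigned int    waiting;      // number of threads currently blocked in wait()
};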
The sequence diagram in figure 3.26 shows initialization, projection and
shutdown of the projector code for parallel beam (PB) projectors. After ini-
tialization of the projector object the number of available GPUs is detected
and their properties are read out. The method TestGPUCapacityPB() checks
whether or not one or more GPUs fit the requirements for the selected re-
construction mode. If one or more GPUs are found a thread is started, the
thread is assigned the correct GPU context and the GPU is prepared for
reconstruction: memory is allocated on the GPU and for transfer on the
host system, grid and block dimensions are calculated, constant memory is
prepared. After the GPU is set up successfully the projectors are ready and
the OSEM algorithm can use them. The diagram shows a call to backproject
a sinogram. The sequence ends with a call to the destructor of the projector
classes.
Figure 3.26: Sequence Diagram
3.7.3 Parallel CPU and GPU projection
The GPU projectors are integrated into the product so that they can eas-
ily be modified to support parallel execution of CPU and GPU projection
threads. A scheduling algorithm could be implemented that spreads the
work across all system resources (CPUs and GPUs). At the time of im-
plementation it is difficult to speed up the reconstruction using both CPUs
and GPUs because using one GPU degraded system performance to a de-
gree. Basically one thread is always busy waiting when the GPU is running.
In newer versions of the CUDA framework the thread yields, thus allowing
other threads to occupy the CPU. Additionally the schedule granularity of
one view of the CPU projectors also worsens the efficiency of CPU and GPU
projectors operating in parallel. Taking a ratio of reconstruction times between
one GPU and all CPU cores of 1/2 as a basis, which means that the
CPU cores require twice as much time to project a given data set as the
GPU, a parallel execution could take 1/3 less time.
Taking into account the original speed gain of GPU over CPU of 1/2, the
parallel execution only accounts for 1/6 less time. This example assumes no
additional overhead for a parallel implementation. The speedup for parallel
utilization of GPU and CPU projectors is highest if CPU and GPU take
roughly the same amount of time to project. Therefore projectors currently
don’t use CPU and GPU at the same time but use exclusively either one or
the other. Instead of focusing on the challenge of speeding up reconstruc-
tion with projectors running on both computing resources at the same time
more effort is put into trying to recreate CPU results using the GPU for all
different projector methods including LOR and span-1 projectors for PSF
reconstruction.
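As a quick check of the arithmetic above, assuming ideal load balancing and no overhead, with $T_{GPU} = T$ and $T_{CPU} = 2T$:
\[
T_{parallel} = \frac{1}{\frac{1}{T_{GPU}} + \frac{1}{T_{CPU}}} = \frac{2T}{3},
\]
which is 1/3 less than the GPU-only time $T$; measured against the CPU-only baseline $2T$, the GPU alone already reduces the runtime to $\frac{1}{2} \cdot 2T$, so the parallel version only gains another 1/6 of the baseline ($\frac{1}{2} - \frac{1}{3} = \frac{1}{6}$).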
3.7.4 Hybrid Implementation
The reconstruction system can be called a hybrid system employing both
CPU and GPU computational resources because both system components
are used during reconstruction and for different purposes. The GPU is used
to calculate all projections that are required for reconstruction and some
minor tasks that can be efficiently implemented and optimized for GPU
use. An example for this is the index reshuffling of images before and after
projection. All other tasks such as the actual OSEM calculation, optimiza-
tion and correction methods such as normalization and scatter correction
are executed on the CPU. Also some parameters that are required by the
projectors are calculated by the CPU beforehand. Important parameters
that affect the numerical stability of the projectors are even calculated in
double precision.
Chapter 4
Conclusion
In this thesis an algorithm for reconstructing images from positron emission
tomography sinograms was ported from the CPU to a modern graphics
processing unit using the CUDA framework. The algorithm was extended
to fit the GPU programming model and optimized for speed to utilize the
GPU to its full potential. The results produced from the GPU implemen-
tation are numerically identical to those calculated by the CPU. The final
and intermediate results can thus be interchanged, which allows a seamless
integration of the GPU routines into existing reconstruction systems as well
as a hybrid implementation that utilizes both CPU and GPU.
The results of this work not only help to speed up the reconstruction
process and thereby potentially enhance the clinical workflow and improve
the patient throughput. They also show that data parallel problems can
be solved efficiently using modern GPU technology. With the introduction
of new frameworks those powerful platforms have been made available not
only to graphics experts but also to scientists and developers of numerical
algorithms. Frameworks such as CUDA and BrookGPU allow for the quick
development of algorithms on GPUs.
As GPUs are a mass market product they are a comparably cheap al-
ternative to high-end multicore multi-CPU workstations. Top-of-the line
GPUs usually sell for about $500 to $700. This allows clinics to upgrade
regular workstations to reconstruction systems simply by installing a rela-
tively cheap graphics device in the system. This would allow radiologists
to reconstruct data on their workstations with different parameters and al-
gorithms somewhat detached from the regular clinical workflow. This gives
radiologists the chance to reevaluate interesting scans and run special pur-
pose reconstructions. With more tools at hand, physicians might ultimately
arrive at a more accurate diagnosis for the patient.
The next step is to create a hybrid system that extends the functional
decomposition of the problem domain, as it is implemented now, to a true
domain decomposition that utilizes the computing power of both CPU and GPU for
projection. This was not sensible in the current system because the CPU was
too slow to contribute any major speedup, but it is definitely an area where
the current system could be improved.
Another option is to use multiple GPUs in a single system and parallelize the
problem space across those devices. This would introduce another layer of
parallelization: all GPUs could work on one reconstruction problem, for
example by distributing projections across views, or multiple reconstructions
could be run at the same time.
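As a rough illustration of distributing projections across views on several
devices, the following host-side sketch uses only standard CUDA runtime calls;
the function projectViewsOnDevice() is a hypothetical placeholder for the
existing per-device projector code and is not part of the current system.

#include <cuda_runtime.h>
#include <algorithm>
#include <thread>
#include <vector>

// Hypothetical placeholder: run the existing CUDA projectors for the views
// in [firstView, lastView) on the given device.
void projectViewsOnDevice(int device, int firstView, int lastView)
{
    cudaSetDevice(device);   // bind this host thread to one GPU
    // ... allocate device memory, upload the image, project the views ...
    (void)firstView; (void)lastView;
}

// Distribute all projection views evenly across the available CUDA devices.
void projectAllViews(int numViews)
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount == 0) return;

    int viewsPerDevice = (numViews + deviceCount - 1) / deviceCount;
    std::vector<std::thread> workers;

    for (int d = 0; d < deviceCount; ++d) {
        int first = d * viewsPerDevice;
        int last  = std::min(numViews, first + viewsPerDevice);
        if (first < last)
            workers.emplace_back(projectViewsOnDevice, d, first, last);
    }
    for (std::thread& w : workers)
        w.join();            // wait until every device has finished
}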
Additionally, other time-consuming algorithms that are required for PET
reconstruction could be ported to the GPU. The filtering routines that are
applied to the sinograms during PSF-enabled PET reconstruction, for example,
could be implemented efficiently on a CUDA device. It is even possible to
implement the entire OSEM algorithm on the GPU.
Other projectors that require even more processing power than those currently
implemented, such as time-of-flight projectors, could also be ported to GPUs.
In time-of-flight data acquisition the location of the positron emission is
measured more accurately by considering the difference in the time the two
gamma rays took to travel towards the detectors. This effectively adds one
more dimension to the sinogram datasets, which increases the size of the input
data and requires a more complex projection algorithm that could be
implemented effectively on GPUs.
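As an illustration of the time-of-flight principle (a standard relation, not
taken from the implementation described here), the displacement $\Delta x$ of
the annihilation point from the center of the line of response follows from
the measured arrival-time difference $\Delta t$ of the two gamma rays:
\[
  \Delta x = \frac{c \, \Delta t}{2},
\]
where $c$ is the speed of light. A coincidence timing resolution of, say,
600 ps therefore localizes the emission to within roughly 9 cm along the line
of response, which is the additional information that enters the sinogram as
an extra dimension.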
Another possibility for future development is a move away from CUDA towards
OpenCL. OpenCL would allow the execution of the algorithms on both CPU and
GPU platforms. Apart from this advantage, it is also an open industry
standard, which ensures code reuse across hardware vendors. This would allow
PET system manufacturers to choose from all vendors instead of restricting
themselves to, for example, only Nvidia.
The existing code base could also be refactored to use LOR projectors for
both PB and LOR reconstruction. This is possible because PB projectors
are only a subset of LOR projectors. Doing so would greatly improve the
maintainability of the code.
The introduction of modern GPUs into PET products used in the regular
clinical workflow has given developers a powerful new tool that is easy to
use thanks to frameworks such as CUDA and OpenCL. It enables them to
fulfill the demands made by physicians and clinics for high quality images
and short reconstruction times.
List of Abbreviations
CPU Central Processing Unit
CT Computed Tomography
CTM Close to Metal
CUDA Compute Unified Device Architecture
FDG Fluorodeoxyglucose
FLOP Floating Point Operation
FOV Field of View
GPGPU General-Purpose computation on GPUs
GPU Graphics Processing Unit
HPC High Performance Computing
LES Linear Equation System
LOR Line of Response
ML-EM Maximum-Likelihood Expectation Maximization
mrd maximum ring difference
MSE Mean Squared Error
NMSE Normalized Mean Squared Error
OSEM Ordered Subset Expectation Maximization
PB Parallel Beam
PET Positron Emission Tomography
PTX Parallel Thread Execution
SIMD Single Instruction Multiple Data
SPECT Single Photon Emission Computed Tomography
SSE Streaming SIMD Extensions
ULP Unit of Least Precision
List of Figures
2.1 Decay and Annihilation exemplified by ¹⁸₉F . . . . . . . . . . 19
2.2 Functional diagram of cyclotron . . . . . . . . . . . . . . . . . 21
2.3 Chemical structure of Glucose compared to Fluorodeoxyglucose 22
2.4 Schematic layout of scintillation detector block . . . . . . . . 24
2.5 True and Scattered positron events . . . . . . . . . . . . . . . 24
2.6 Random and Multiple coincidences . . . . . . . . . . . . . . . 25
2.7 Uncertain and Attenuated positron events . . . . . . . . . . . 25
2.8 Difference between 2D and 3D data acquisition mode . . . . . 27
2.9 Efficient data structuring . . . . . . . . . . . . . . . . . . . . 28
2.10 The PET coordinate system . . . . . . . . . . . . . . . . . . . 29
2.11 Oblique LORs are not parallel . . . . . . . . . . . . . . . . . . 29
2.12 Comparison of LOR and PB space projection . . . . . . . . . 30
2.13 The span concept for spans 3, 5, 7 and 9 . . . . . . . . . . . . 31
2.14 Segments 1, 0 and -1 for span 9 and mrd 13 . . . . . . . . . . 32
2.15 Michelogram of a 55 ring, 38 mrd, span 11 3D PET system . . 33
2.16 Comparison of analytic and iterative reconstruction algorithms 35
2.17 Projection of positron emissions at θ = 90◦ into sinogram row 37
2.18 Basic ML-EM algorithm . . . . . . . . . . . . . . . . . . . . . 40
2.19 Basic OSEM algorithm . . . . . . . . . . . . . . . . . . . . . . 41
3.1 Projector using sheared image . . . . . . . . . . . . . . . . . . 46
3.2 CPU projector efficiency analysis . . . . . . . . . . . . . . . . 48
3.3 Back-projector structogram . . . . . . . . . . . . . . . . . . . 50
3.4 Comparison of original and modified implementation . . . . . 51
3.5 Geforce 8800 architecture . . . . . . . . . . . . . . . . . . . . 53
3.6 CUDA programming model . . . . . . . . . . . . . . . . . . . 54
3.7 CUDA implementation of the forward projector . . . . . . . . 63
3.8 Unshearing of image during backprojection . . . . . . . . . . 64
3.9 Runtime comparison of different setups . . . . . . . . . . . . 72
3.10 Absolute error comparison of different setups . . . . . . . . . 73
3.11 Unpadded and padded memory . . . . . . . . . . . . . . . . . 73
3.12 Index reshuffling on CUDA device . . . . . . . . . . . . . . . 75
3.13 Hexeditor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.14 Sinogram Viewer Tool . . . . . . . . . . . . . . . . . . . . . . 78
3.15 Test images . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.16 Host to device PCI Express 8x and 16x pinned and pageable 85
3.17 Transfer rate for PCI Express 8x and 16x (pinned) . . . . . . 86
3.18 All cards maximum host to device performance . . . . . . . . 87
3.19 On device transfer rates . . . . . . . . . . . . . . . . . . . . . 88
3.20 Reconstruction time overview . . . . . . . . . . . . . . . . . . 90
3.21 Reconstruction time detailed overview . . . . . . . . . . . . . 90
3.22 Reconstruction components . . . . . . . . . . . . . . . . . . . 91
3.23 Reconstruction performance transfer and execution rate . . . 92
3.24 Class diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.25 Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.26 Sequence Diagram . . . . . . . . . . . . . . . . . . . . . . . . 100
List of Tables
2.1 Properties of positron-emitting atoms (reproduced from [5]) . 18
2.2 Properties of commonly used scintillators (reproduced from [5]) 20
3.1 CUDA Memory Overview . . . . . . . . . . . . . . . . . . . . 53
3.2 Selected Nvidia cards memory specification . . . . . . . . . . 61
3.3 Projector kernel functions . . . . . . . . . . . . . . . . . . . . 65
3.4 Setup description and speedup . . . . . . . . . . . . . . . . . 72
3.5 Reconstruction time overview . . . . . . . . . . . . . . . . . . 89
3.6 Reconstruction time overview . . . . . . . . . . . . . . . . . . 91
Listings
3.1 SSE memory allocation . . . . . . . . . . . . . . . . . . . . . 47
3.2 Multiply volume CPU code . . . . . . . . . . . . . . . . . . . 55
3.3 Multiply volume GPU code . . . . . . . . . . . . . . . . . . . 56
3.4 CUDA memory operations . . . . . . . . . . . . . . . . . . . . 66
3.5 Register and shared memory implementation of interpolation 67
3.6 Usage of constant memory to reduce register usage . . . . . . 68
3.7 Populating CUDA shared memory . . . . . . . . . . . . . . . 70
3.8 Wrapper function for pinned memory allocation . . . . . . . . 71
3.9 Kernel to pad and unpad 3 dimensional arrays on the GPU . 74
3.10 Test script excerpt . . . . . . . . . . . . . . . . . . . . . . . 79
3.11 rawdiff tool output . . . . . . . . . . . . . . . . . . . . . . . . 81
3.12 CUDA timing events . . . . . . . . . . . . . . . . . . . . . . . 84
3.13 CUDA error check . . . . . . . . . . . . . . . . . . . . . . . . 94
3.14 Signaling methods . . . . . . . . . . . . . . . . . . . . . . . . 99
Bibliography
[1] Comparative evaluation of visualization and experimental results using
image comparison metrics, Washington, DC, USA, 2002. IEEE Com-
puter Society.
[2] ATI. CTM Guide - Technical Reference Manual. Web
site: http://ati.amd.com/companyinfo/researcher/documents/
ATI_CTM_Guide.pdf, Last accessed: 05/06/2008.
[3] MS Atkins, D. Murray, and R. Harrop. Use of transputers in a 3-D
Positron Emission Tomograph. IEEE transactions on medical imaging,
10(3):276–283, 1991.
[4] B. Bai and AM Smith. Fast 3D iterative reconstruction of PET images
using PC graphics hardware. In IEEE Nuclear Science Symposium
Conference Record, 2006, volume 5, 2006.
[5] Dale L. Bailey, David W. Townsend, Peter E. Valk, and Michael N.
Maisey, editors. Positron Emission Tomography: Basic Sciences.
Springer, 1 edition, 4 2005.
[6] N. Bohr. On the constitution of atoms and molecules, Part 1, Binding of
electrons by positive nuclei. Philosophical Magazine, 26(1):1–24, 1913.
[7] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston,
and P. Hanrahan. Brook for GPUs: stream computing on graphics
hardware. ACM Transactions on Graphics, 23(3):777–786, 2004.
[8] R. Budruk. PCI express system architecture. Addison-Wesley Profes-
sional, 2003.
[9] B. Cabral, N. Cam, and J. Foran. Accelerated volume rendering and
tomographic reconstruction using texture mapping hardware. In Pro-
ceedings of the 1994 symposium on Volume visualization, pages 91–98.
ACM New York, NY, USA, 1994.
[10] ME Casey and R. Nutt. A multicrystal two dimensional BGO detec-
tor system for positron emission tomography. Nuclear Science, IEEE
Transactions on, 33(1):460–463, 1986.
[11] CM Chen, S.Y. Lee, and ZH Cho. Parallelization of the EM algorithm
for 3-D PET image reconstruction. IEEE Transactions on Medical Imag-
ing, 10(4):513–522, 1991.
[12] Arthur H. Compton. A Quantum Theory of the Scattering of X-rays
by Light Elements. Physical Review, 21(5):483–502, May 1923.
[13] AP Dempster, NM Laird, and DB Rubin. Maximum Likelihood from
Incomplete Data via the EM Algorithm. Journal of the Royal Statistical
Society. Series B (Methodological), 39(1):1–38, 1977.
[14] K.L. Giboni, E. Aprile, T. Doke, M. Hirasawa, and M. Yamamoto. Co-
incidence timing of Schottky CdTe detectors for tomographic imaging.
Nuclear Inst. and Methods in Physics Research, A, 450(2-3):307–312,
2000.
[15] Z. He, W. Li, GF Knoll, DK Wehe, J. Berry, and CM Stahle. 3-D
position sensitive CdZnTe gamma-ray spectrometers. Nuclear Instru-
ments and Methods in Physics Research-Section A Only, 422(1):173–
178, 1999.
[16] K. Herholz, E. Salmon, D. Perani, JC Baron, V. Holthoff, L. Frölich,
P. Schönknecht, K. Ito, R. Mielke, E. Kalbe, et al. Discrimination between
Alzheimer dementia and controls by automated analysis of multicenter
FDG PET. Neuroimage, 17(1):302–316, 2002.
[17] IK Hong, ST Chung, HK Kim, YB Kim, YD Son, and ZH Cho. Ul-
tra fast symmetry and SIMD-based projection-backprojection (SSP)
algorithm for 3-D PET image reconstruction. IEEE Transactions on
Medical Imaging, 26(6):789–803, 2007.
[18] HM Hudson and RS Larkin. Accelerated image reconstruction using
ordered subsets of projection data. Medical Imaging, IEEE Transactions
on, 13(4):601–609, 1994.
[19] Wen-Mei Hwu and David Kirk. Programming Massively Parallel Pro-
cessors. Web site: http://courses.ece.uiuc.edu/ece498/al1/, Last
accessed: 05/06/2008.
[20] Intel. Intel C++ Compiler for Linux Intrinsics Reference. Intel.
[21] Intel. Server Board S5000PSL, Product Brief. Intel.
[22] Intel. Workstation Board S5000XVN, Product Brief. Intel.
[23] CA Johnson, Y. Yan, RE Carson, RL Martino, and ME Daube-
Witherspoon. A system for the 3D reconstruction of retracted-septa
PET data using the EM algorithm. IEEE Transactions on Nuclear Sci-
ence, 42(4 Part 1):1223–1227, 1995.
[24] JP Jones, WF Jones, F. Kehren, DF Newport, JH Reed, MW Lenox,
LG Byars, K. Baker, C. Michel, ME Casey, et al. SPMD cluster-based
parallel 3D OSEM. In 2002 IEEE Nuclear Science Symposium Confer-
ence Record, volume 3, 2002.
[25] W.F. Jones, M.E. Casey, and L.G. Byars. Design of super-fast three-
dimensional projection system for Positron Emission Tomography,
June 29 1993. US Patent 5,224,037.
[26] P.M. Joseph. Improved algorithm for reprojecting rays through pixel
images. Medical Imaging, IEEE Transactions on, 1(3):192–196, 1982.
[27] DJ Kadrmas. Rotate-and-Slant Projector for Fast LOR-Based Fully-3-
D Iterative PET Reconstruction. IEEE Transactions on Medical Imag-
ing, 27(8):1071–1083, 2008.
[28] F. Kehren. Vollständige iterative Rekonstruktion von dreidimension-
alen Positronen-Emissions-Tomogrammen unter Einsatz einer spe-
icherresidenten Systemmatrix auf Single- und Multiprozessor-Systemen.
PhD thesis, Rheinisch-Westfälischen Technischen Hochschule (RWTH)
Aachen, 2001.
[29] T. Kimble, M. Chou, and BHT Chai. Scintillation properties of LYSO
crystals. Nuclear Science Symposium Conference Record, 3:1434–1437,
2002.
[30] M.A. Lodge, R.D. Badawi, R. Gilbert, P.E. Dibos, and B.R. Line. Com-
parison of 2-Dimensional and 3-Dimensional Acquisition for 18F-FDG
PET Oncology Studies Performed on an LSO-Based Scanner. Journal
of Nuclear Medicine, 47(1):23–31, 2006.
[31] DB Loveman. High Performance Fortran. IEEE [see also IEEE Con-
currency] Parallel & Distributed Technology: Systems & Applications,
1(1):25–42, 1993.
[32] Christian Maas. Freeware Hex Editor XVI32. Web site: http://www.
chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm, Last ac-
cessed: 02/24/2008.
[33] S.A. Mahlke, R.E. Hank, J.E. McCormick, D.I. August, and W.W.
Hwu. A Comparison of Full and Partial Predicated Execution Support
for ILP Processors. International Symposium on Computer Architec-
ture, 23:138–150, 1995.
[34] A. Munshi. OpenCL Specification Version 1.0, 12 2008.
[35] Nvidia. CUDA Matrix Transpose SDK Example. Web
site: http://developer.download.nvidia.com/compute/cuda/sdk/
website/samples.html, Last accessed: 09/13/2007.
[36] Nvidia. GeForce 8 Series Overview. Web site: http://www.nvidia.
com/page/geforce8.html, Last accessed: 3/4/2009.
[37] Nvidia. GeForce Family Overview. Web site: http://www.nvidia.
com/object/geforce_family.html, Last accessed: 3/4/2009.
[38] Nvidia. Technical Brief - NVIDIA GeForce 8800 GPU Archi-
tecture Overview. Web site: http://www.nvidia.com/content/
PDF/Geforce_8800/GeForce_8800_GPU_Architecture_Technical_
Brief.pdf, Last accessed: 05/02/2008, 2007.
[39] Nvidia. NVIDIA Compute PTX: Parallel Thread Execution Version
1.1. Nvidia, 2008.
[40] Nvidia. NVIDIA CUDA Compute Unified Device Architecture Program-
ming Guide Version 1.1. Nvidia, 2008.
[41] Nvidia. NVIDIA CUDA Compute Unified Device Architecture Program-
ming Guide Version 2.1. Nvidia, 2009.
[42] AW Paeth. A fast algorithm for general raster rotation. In Proceed-
ings on Graphics Interface’86/Vision Interface’86 table of contents,
pages 77–81. Canadian Information Processing Society Toronto, Ont.,
Canada, Canada, 1986.
[43] Vladimir Panin and Frank Kehren. Acceleration of Joseph's method for
full 3D reconstruction of nuclear medical images from projection data,
2008.
[44] VY Panin, F. Kehren, H. Rothfuss, D. Hu, C. Michel, and ME Casey.
PET reconstruction with system matrix derived from point source mea-
surements. Nuclear Science, IEEE Transactions on, 53(1 Part 1):152–
159, 2006.
[45] Ervin B. Podgorsak. Radiation Physics for Medical Physicists (Biolog-
ical and Medical Physics, Biomedical Engineering). Springer, 1 edition,
10 2005.
[46] E. Rutherford. The Scattering of α and β Particles by Matter and the
Structure of the Atom. Philosophical Magazine, 21:669, 1911.
[47] H. Scherl, B. Keck, M. Kowarschik, and J. Hornegger. Fast GPU-
based CT reconstruction using the common unified device architecture
(CUDA). In IEEE Nuclear Science Symposium Conference Record,
2007. NSS’07, volume 6, 2007.
[48] LA Shepp and Y. Vardi. Maximum likelihood reconstruction for emis-
sion tomography. Medical Imaging, IEEE Transactions on, 1(2):113–
122, 1982.
[49] M. Teras, T. Tolvanen, J.J. Johansson, J.J. Williams, and J. Knuuti.
Performance of the new generation of whole-body PET/CT scanners:
Discovery STE and Discovery VCT. European Journal of Nuclear
Medicine and Molecular Imaging, 34(10):1683–1692, 2007.
[50] P.A. Tipler. Physics for Scientists and Engineers. Worth Publishers
New York, NY, 1991.
[51] Wladimir J. van der Laan. Cubin Utilities. Web site: http://www.cs.
rug.nl/~wladimir/decuda/, Last accessed: 03/26/2009.
[52] S. Vollmar, C. Michel, JT Treffert, DF Newport, M. Casey, C. Knoss,
K. Wienhard, X. Liu, M. Defrise, and W.D. Heiss. HeinzelCluster:
accelerated reconstruction for FORE and OSEM3D. In 2001 IEEE
Nuclear Science Symposium Conference Record, volume 3, 2001.
[53] Stefan Vollmar. VINCI: Volume Imaging in Neurological Research, Co-
Registration and ROIs included. Web site: http://www.nf.mpg.de/
vinci3/, Last accessed: 4/2/2008.
[54] Z. Wang, G. Han, T. Li, and Z. Liang. Speedup OS-EM image re-
construction by PC graphics card technologies for quantitative SPECT
with varying focal-length fan-beam collimation. IEEE transactions on
nuclear science, 52(5 Part 1):1274–1280, 2005.
[55] Wikipedia. Nvidia Tesla. Web site: http://en.wikipedia.org/wiki/
NVIDIA_Tesla, Last accessed: 3/4/2009.