SPECIAL ISSUE PAPER
GPU implementation of non-local maximum likelihood estimation method for denoising magnetic resonance images
Adithya H. K. Upadhya1 • Basavaraj Talawar1 • Jeny Rajan1
Received: 17 June 2015 / Accepted: 21 December 2015 / Published online: 5 January 2016
© Springer-Verlag Berlin Heidelberg 2016
Abstract Magnetic resonance imaging (MRI) is a widely
deployed medical imaging technique used for various
applications such as neuroimaging, cardiovascular imaging
and musculoskeletal imaging. However, MR images
degrade in quality due to noise. In the presence of noise, magnitude MRI data acquired with single-coil systems generally follow a Rician distribution. Among the methods proposed in the literature for denoising MR images corrupted with Rician noise, the non-local maximum likelihood (NLML) method and its variants are popular. Despite its denoising quality, the NLML algorithm suffers from a high time complexity of $O(m^3N^3)$, where $m^3$ and $N^3$ represent the search window size and the image size, respectively, for a 3D image. This makes the algorithm challenging to deploy in real-time applications where prompt results are required. A viable solution to this shortcoming is a data parallel processing framework such as Nvidia CUDA, which can exploit the mutually exclusive and computationally intensive calculations. The GPU-based implementation of NLML-based image denoising achieves significant speedup compared to the serial implementation. This paper describes the first successful attempt to implement a GPU-accelerated version of the NLML algorithm. The main focus of the research was on the parallelization and acceleration of one computationally intensive section of the algorithm, so as to demonstrate the execution time improvement obtained through parallel processing on a GPU. Our results suggest the possibility of practical deployment of NLML and its variants for MRI denoising.
Keywords Non-local maximum likelihood estimation (NLML) · MRI · Rician distribution · GPGPU · Parallel processing · Nvidia CUDA framework
1 Introduction
Random (stochastic) noise is the major contributor towards
the degradation in quality of the MR (magnetic resonance)
images [1]. In consideration of the importance of MRI
denoising for various medical applications, several
denoising algorithms and noise models have been devel-
oped to accurately estimate the true underlying intensity of
the noisy MR images. Noisy data in magnitude MR images are now commonly modelled by the Rician distribution, which is a better fit for noise-corrupted MR image data [3]. The noise in the MRI k-space data (raw
data) is generally assumed to be normally distributed. The
k-space data are then Fourier transformed to obtain the
magnetization distribution. The data distribution in the
resultant real and imaginary components of the complex
image still remains Gaussian due to the linearity and
orthogonality of the Fourier transform. However, the complex image as such is not used for analysis. To use both parts of the complex data values, magnitude and phase images are calculated [4]. However, since the computation of the
magnitude image is a non-linear operation (root sum of
squares of the Gaussian distributed complex image), the
data in the magnitude image will be Rician distributed (in
single-coil systems) [1, 4, 5]. Several denoising schemes
& Jeny Rajan
1 Department of Computer Science and Engineering, National
Institute of Technology Karnataka, Surathkal, India
J Real-Time Image Proc (2017) 13:181–192
DOI 10.1007/s11554-015-0559-6
have been formulated for denoising MR images corrupted by Rician noise (e.g. [1, 2, 15, 24–30]). Among them, the non-local maximum likelihood (NLML) method [2] and its variants [1, 2, 24, 25, 28] are very popular.
NLML method is based on non-local concepts [6] and
maximum likelihood estimation techniques (MLE) [7, 8].
NLML demonstrates a good performance at preserving
edges and fine details and is designed to exploit the
redundancies and similarities within the noise corrupted
image and thus, it is one among the best algorithms utilized
for denoising MR image. The superior performance of
NLML methods over other approaches in denoising MRI is
demonstrated in [1, 2, 24, 25]. Having asserted the
importance of good-quality MR denoising and the superior
performance of NLML algorithm, it is obligatory to elab-
orate the drawbacks and disadvantages of this algorithm.
NLML suffers from a very high time complexity [2] sim-
ilar to that of non-local means (NLM) owing to the simi-
larity measures and comparisons between neighbourhoods
determined by the search window dimensions. For the best denoising quality, the search window should span the whole image, i.e. similarity comparisons should be made throughout the image for every target voxel's neighbourhood. Under these circumstances, for a 3D MR image dataset of dimensions $P \times Q \times R$, the complexity would be in the order of $O(N^6)$ (for complexity analysis, the dimensions of the MR image dataset are considered to be $N \times N \times N$). In order to improve the execution speed, the search window is usually restricted to lower dimensions, i.e. $m \times m \times m$ [2]. Lower search
window dimensions could also imply compromising the
quality of resultant denoised image. This is demonstrated in
Fig. 1. From Fig. 1, it can be observed that as search
window size increases, there is an improvement in the
PSNR (peak signal-to-noise ratio) of the denoised image.
Hence, there is a pressing necessity for a viable solution
which could improve the execution time of the NLML
algorithm without incurring the trade-off involving uti-
lization of lower search window dimensions.
A solution to the aforementioned disadvantage would be
the application of parallel processing concepts so as to
exploit the various mutually exclusive and computationally
intensive calculations. GPU computing is a practical and
effective solution to address the requirement of a powerful
and efficient hardware specialized in parallel processing.
Nvidia CUDA framework provides a powerful yet intelli-
gible developer API toolset for accomplishment of GPU-
accelerated variant of this algorithm. GPU hardware has
outperformed CPU by a great extent with respect to the
theoretical floating point operations per second (GFLOP)
and the theoretical memory bandwidth [16]. Computations
involving NLML estimation for all target noisy image
voxels (to determine the true underlying intensity of the
target voxel) are independent of each other and are mutu-
ally exclusive. This concept is the main principle and
rationale behind the attempt to accelerate the NLML
algorithm where, on theoretically perfect GPU hardware,
$P \times Q \times R$ (image dimensions) threads could be launched
to compute and generate denoised voxels for every corre-
sponding noisy image voxel in parallel.
GPU computing has contributed to several other medical
image processing algorithms such as functional magnetic
resonance imaging (fMRI), diffusion tensor imaging (DTI),
image denoising and image registration [9]. CUDA-en-
abled GPU acceleration has also found applications in the
Fig. 1 a Simulated noisy image, b PSNR plot after denoising with NLML method with different search window sizes
fast and efficient non-local means denoising of 3D ultra-
sound images [10]. GPU-accelerated lattice Boltzmann
model is used to solve partial differential equations for
anisotropic diffusion which is widely adopted for image
denoising [11]. In the field of MRI, GPGPU has made
several strides and contributions as well. Reconstruction
after data acquisition in the frequency domain on a
Cartesian grid through standard scanning protocols could
be performed by GPU-accelerated inverse Fourier trans-
formations [12]. In MRI, GPUs have been utilized to
increase the speed of geometric distortion corrections in
EPI (echo planar imaging) and it has also been used for
high-speed design of multidimensional radio frequency
pulses so as to improve high-field imaging [13, 14].
CUDA-based parallel algorithms have also found applica-
tions in acceleration of image restoration methods such as
Retinex [20]. Level-set segmentations which are com-
monly used in medical applications for segmenting images
and volumes have also been accelerated by GPGPU-based
methods [21].
This paper is organized as follows. Section 2 provides
an elaborate description of NLML concepts and MRI data
distribution. Section 3 explains the proposed work and
complexity analysis. Experimental results and observations
are discussed in Sect. 4. Finally, conclusion and remarks
are drawn in Sect. 5.
2 Data distribution in MRI and signal estimation using NLML
The k-space complex raw MR image data is distinguished
and characterized by the Gaussian PDF (probability density
function). Owing to the orthogonality and linearity of the
Fourier transform, the noise distribution of the real and
imaginary constituents can still be modelled as Gaussian
after the application of inverse Fourier transformation.
However, the magnitude of the complex MR image data
will no longer follow Gaussian distribution, but rather
follow Rician distribution due to the sum of square (SoS)
operation.
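The construction described above can be illustrated numerically. The following Python snippet is a minimal sketch rather than the authors' code: it adds independent Gaussian noise to the real and imaginary channels of a constant signal and takes the magnitude. The function name `rician_samples` and its parameters are hypothetical, and the signal is assumed to lie entirely in the real channel.

```python
import math
import random

def rician_samples(amplitude, sigma, n, seed=0):
    """Draw n magnitude samples sqrt(R^2 + I^2) for a noise-free amplitude.

    Assumption: the whole signal sits in the real channel, so the
    imaginary channel has zero mean.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        real = amplitude + rng.gauss(0.0, sigma)  # Gaussian noise on the real part
        imag = rng.gauss(0.0, sigma)              # Gaussian noise on the imaginary part
        samples.append(math.hypot(real, imag))    # non-linear magnitude operation
    return samples

# With zero signal the magnitude is Rayleigh distributed (the zero-amplitude
# case of the Rician PDF), so its mean is sigma * sqrt(pi / 2), not 0.
background = rician_samples(amplitude=0.0, sigma=15.0, n=20000)
background_mean = sum(background) / len(background)
```

The non-zero background mean is one way to see why the magnitude data cannot be treated as Gaussian at low SNR.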
Let $\sigma_g$ represent the standard deviation of the complex MR image data corrupted with stationary noise with the mean values $\mu_R$ and $\mu_I$, where R and I represent the real and imaginary components. Consequently, the reconstructed magnitude data M will follow a Rician distribution with PDF given by [18]:

$$p_M(M \mid A, \sigma_g) = \frac{M}{\sigma_g^2}\, e^{-\frac{M^2 + A^2}{2\sigma_g^2}}\, I_0\!\left(\frac{AM}{\sigma_g^2}\right) \epsilon(M), \qquad (1)$$

where $M = \sqrt{R^2 + I^2}$, $A = \sqrt{\mu_R^2 + \mu_I^2}$, $I_0(\cdot)$ represents the 0th order modified Bessel function of the first kind and $\epsilon(\cdot)$ represents the Heaviside step function. The signal-to-noise ratio (SNR), i.e. the ratio $A/\sigma_g$, determines the shape of the Rician PDF.
The unknown variables in the PDF can be estimated by
maximizing the corresponding likelihood function, if the
observed data and the model of interest are available. Let
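$M_1, M_2, \ldots, M_n$ be n statistically independent observations within a region of constant signal intensity A. Then, the joint PDF of the observations is:

$$p_M(\{M_i\} \mid A, \sigma_g) = \prod_{i=1}^{n} \frac{M_i}{\sigma_g^2}\, e^{-\frac{M_i^2 + A^2}{2\sigma_g^2}}\, I_0\!\left(\frac{A M_i}{\sigma_g^2}\right). \qquad (2)$$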
If $\sigma_g^2$ is known in advance, the only unknown parameter in Eq. (2) is the true underlying intensity A, which can be estimated by maximizing the likelihood function L or equivalently ln L, with respect to A. However, if $\sigma_g^2$ is unknown, it can also be estimated along with A using Eq. (3) [8]:

$$\{\hat{A}_{ML}, \hat{\sigma}^2_{ML}\} = \arg\max_{A,\,\sigma_g^2} (\ln L), \qquad (3)$$

where

$$\ln L = \sum_{i=1}^{n} \ln\!\left(\frac{M_i}{\sigma_g^2}\right) - \sum_{i=1}^{n} \frac{M_i^2 + A^2}{2\sigma_g^2} + \sum_{i=1}^{n} \ln I_0\!\left(\frac{A M_i}{\sigma_g^2}\right) \qquad (4)$$

and $\hat{A}_{ML}$ and $\hat{\sigma}^2_{ML}$ represent the estimated underlying true intensity and the noise variance, respectively.
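To make Eq. (4) concrete, the following Python sketch evaluates the log-likelihood and maximizes it over A for a known noise level. It is an illustration under stated assumptions, not the implementation used in this work: the helper names (`log_bessel_i0`, `log_likelihood`, `ml_amplitude`) are hypothetical, $\sigma_g$ is assumed known, a golden-section search stands in for a general-purpose optimizer, and $\ln I_0$ is approximated by its power series for small arguments and by the leading terms of its large-argument expansion otherwise.

```python
import math
import random

def log_bessel_i0(x):
    """Approximate ln I0(x) for x >= 0."""
    if x < 30.0:
        term, total, k = 1.0, 1.0, 0
        while term > 1e-17 * total:
            k += 1
            term *= (x * x / 4.0) / (k * k)  # ratio of consecutive series terms
            total += term
        return math.log(total)
    # Large-argument expansion: I0(x) ~ e^x / sqrt(2 pi x) * (1 + 1/(8x) + 9/(128 x^2))
    correction = 1.0 / (8.0 * x) + 9.0 / (128.0 * x * x)
    return x - 0.5 * math.log(2.0 * math.pi * x) + math.log1p(correction)

def log_likelihood(amplitude, sigma, samples):
    """Eq. (4): Rician log-likelihood of the samples for a candidate A."""
    var = sigma * sigma
    return sum(
        math.log(m / var)
        - (m * m + amplitude * amplitude) / (2.0 * var)
        + log_bessel_i0(amplitude * m / var)
        for m in samples
    )

def ml_amplitude(samples, sigma, tol=1e-3):
    """Maximize ln L over A in [0, max(samples)] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 0.0, max(samples)
    while hi - lo > tol:
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if log_likelihood(a, sigma, samples) < log_likelihood(b, sigma, samples):
            lo = a  # the maximum lies to the right of a
        else:
            hi = b  # the maximum lies to the left of b
    return 0.5 * (lo + hi)

# Synthetic check: magnitudes of a constant signal A = 100 with sigma_g = 15.
rng = random.Random(1)
true_a, sigma_g = 100.0, 15.0
samples = [math.hypot(true_a + rng.gauss(0.0, sigma_g), rng.gauss(0.0, sigma_g))
           for _ in range(1000)]
estimate = ml_amplitude(samples, sigma_g)
```

A one-dimensional search suffices here only because the noise level is fixed; estimating A and $\sigma_g^2$ jointly, as in Eq. (3), requires a two-parameter optimizer.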
Non-local concepts exploit the redundancies and repetitions within an image. The non-local approach has attracted intense research interest from the image processing community. The NLML estimation method
drew inspiration from non-local means (NLM) which is a
proficient non-local concept-based denoising algorithm.
Unlike NLM, for MRI denoising, NLML strives to estimate
the true underlying intensity through Eq. (3) based on the
Rician noise nature instead of estimating noise-free signal
through computation of weighted average of non-local vox-
els. NLML selects the samples required for denoising through
non-local concepts where NL voxels get selected based on
the intensity similarity of the neighbourhood of the voxels. If
the neighbourhoods of two voxels are similar, then their
central voxels should have a similar meaning and thus similar
grey values [19]. The similarity of the voxel neighbourhoods can be measured by the intensity distance (Euclidean distance) between the neighbourhoods [2]:

$$d_{P,Q} = \lVert N_P - N_Q \rVert, \qquad (5)$$

where $d_{P,Q}$ denotes the intensity distance or Euclidean distance between the neighbourhoods $N_P$ and $N_Q$ of the
voxels P and Q. For each voxel P in the image, the intensity distance $d_{P,Q}$, based on a similarity window, between P and all other non-local voxels Q within the search window is computed using Eq. (5). After sorting the non-local voxels in increasing order of the intensity distance d, the set $\{\zeta_P\}$ of the k closest non-local neighbours of P is generated, and from this set maximum likelihood estimation (MLE) using Eq. (3) is performed, which gives the true underlying intensity value of the noisy voxel P. The aforementioned procedure is performed on every noisy voxel in the image to generate the denoised image.
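The sample selection step can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' code: the function names are hypothetical, the volume is a nested list indexed as `volume[z][y][x]`, the search window is described by a radius, candidates whose similarity window would leave the volume are skipped, and the target voxel itself is excluded (following "all other non-local voxels").

```python
import math

def neighbourhood(volume, center, radius=1):
    """Flattened (2*radius+1)^3 neighbourhood around center, or None at the border."""
    z, y, x = center
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    inside = (radius <= z < depth - radius
              and radius <= y < height - radius
              and radius <= x < width - radius)
    if not inside:
        return None  # assumption: border voxels are skipped in this sketch
    return [volume[z + dz][y + dy][x + dx]
            for dz in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)]

def k_closest(volume, target, search_radius, k):
    """Intensities of the k voxels whose neighbourhoods are closest to the target's.

    The target must be an interior voxel so that its neighbourhood exists.
    """
    n_p = neighbourhood(volume, target)
    tz, ty, tx = target
    candidates = []
    for z in range(tz - search_radius, tz + search_radius + 1):
        for y in range(ty - search_radius, ty + search_radius + 1):
            for x in range(tx - search_radius, tx + search_radius + 1):
                n_q = neighbourhood(volume, (z, y, x))
                if n_q is None or (z, y, x) == target:
                    continue
                distance = math.dist(n_p, n_q)  # Eq. (5)
                candidates.append((distance, volume[z][y][x]))
    candidates.sort(key=lambda pair: pair[0])   # increasing intensity distance
    return [intensity for _, intensity in candidates[:k]]

# Toy volume whose intensity depends only on x: voxels sharing the target's
# x coordinate have identical neighbourhoods and are selected first.
volume = [[[float(x) for x in range(5)] for _ in range(5)] for _ in range(5)]
closest = k_closest(volume, (2, 2, 2), search_radius=1, k=8)
```

The returned intensities are the samples $M_1, \ldots, M_k$ that would be passed to the maximum likelihood estimator for voxel P.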
3 Proposed method
The steps and rationale in the parallelization of the NLML
algorithm are described in this section. Sections 3.1 and 3.2
describe the serial implementation and the complexity
analysis of the NLML algorithm. Sections 3.3 and 3.4
describe the GPU implementation of the NLML algorithm.
The GPU implementation of the NLML algorithm is
referred to as ‘‘GPU NLML’’ henceforth.
3.1 Algorithm modifications and implementation
details
For the GPU implementation, the serial NLML algorithm
has been bifurcated into two phases, namely A1 and A2.
Phase A2 estimates the true underlying intensity AX for
every corresponding noisy image voxel X in the 3D MR
image. This estimation requires computation of local
maximum of the multivariate function or local minimum of
the negative multivariate function defined by Eq. (4). GPU
NLML is the parallel implementation of phase A1. The
parallelization of the unconstrained optimization function
(phase A2) and integration has been undertaken as future
work. Details of phase A1 and of the GPU implementation
are as follows.
Phase A1 of the NLML algorithm deals with the generation of the vector $\lambda_X$ comprising the sets $\zeta_{X_I}$ generated for every noisy 3D MR image voxel $X_I$. Here, $\zeta_{X_I}$ represents the set of k closest non-local neighbours of a given voxel $X_I$, where I ranges over $1, 2, 3, \ldots, P \times Q \times R$ (3D MR image of dimensions $P \times Q \times R$). Hence, the vector $\lambda_X$ is an aggregation of all the sets $\zeta_{X_1}, \zeta_{X_2}, \zeta_{X_3}, \zeta_{X_4}, \ldots, \zeta_{X_{P \times Q \times R}}$. As a result, the size of the $\lambda_X$ vector is $k \times (P \times Q \times R)$, i.e. it contains the k closest non-local neighbours for every voxel in the noisy 3D MR image.

$$\lambda_X = \left\{ \zeta_{X_1}, \zeta_{X_2}, \zeta_{X_3}, \zeta_{X_4}, \ldots, \zeta_{X_{P \times Q \times R}} \right\} \qquad (6)$$
Phase A2 of the NLML algorithm deals with the computation of the NLML estimate of every noisy image voxel so as to determine their true underlying intensity. This is accomplished through retrieval of the set $\zeta_{X_I}$ from the vector $\lambda_X$ for every corresponding noisy image voxel $X_I$, i.e. the retrieval of the set of k closest non-local neighbours for every noisy image voxel $X_I$ from the vector $\lambda_X$. Following this, the set $\zeta_{X_I}$ is used to compute the true underlying intensity $A_X$ or the denoised voxel for every noisy image voxel $X_I$ through Eq. (3). This is accomplished through function maximization (computing a local maximum) of the multivariate equation or function minimization (computing a local minimum) of the negative of the multivariate equation defined by Eq. (4).
The pseudo-code for the serial version of the algorithm
(Algorithm 1) is given below. In the serial implementation,
phase A1 of the algorithm was developed in C++ and compiled into MEX (Matlab executable) which was invoked by the executing Matlab program. The output of phase A1, the vector $\lambda_X$, is input to phase A2 by the executing Matlab program. Phase A2 is implemented through
Matlab’s ‘fminunc’ function which computes the local
minimum of the negative of the equation defined by
Eq. (4). Matlab’s optimization toolbox provides several
functions and procedures for finding the parameters that
maximize or minimize the objective functions while sat-
isfying constraints.
3.2 Serial NLML: complexity analysis
The NLML algorithm demonstrates a very high computational complexity similar to that of the NL means algorithm due to the cost of similarity comparison between the neighbourhoods [2]. In order to reduce the execution time, the search window dimensions are restricted to a smaller value of $m \times m \times m$, and the maximum threshold for the search window would be $N \times N \times N$ $(m \le N)$ (image dimensions $P \times Q \times R$ are equal to $N \times N \times N$ for complexity analysis). In phase A1, for each voxel of the image (total of $N^3$ voxels), calculations with every other voxel within the search window ($m^3$ voxels) will be performed to generate the corresponding set of closest non-local neighbours. This results in a time complexity of $O(m^3N^3)$ for phase A1.
In phase A2, for every voxel $X_I$ in the image, a corresponding denoised voxel is generated using the set $\zeta_{X_I}$ retrieved from the vector $\lambda_X$. This set $\zeta_{X_I}$ will be used for maximization of the likelihood function L (or minimization of negative L) as per Eq. (3). The overall complexity of phase A2 would be in the order of $O(N^3)$ since NLML
estimation computations are carried out for each voxel in the noisy image. Consequently, the total time complexity with both phase A1 and A2 combined would be $O(m^3N^3 + N^3)$, which would result in a complexity of $O(m^3N^3)$.
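As a rough worked example of these orders of growth, consider the 181 × 217 × 181 volume and the smallest and largest search windows (10 × 10 × 10 and 50 × 50 × 50) used in the experiments of Sect. 4. The counts below are neighbourhood comparisons only and ignore constant factors such as the 27 voxels of a 3 × 3 × 3 similarity window; they are an illustration, not measured figures.

```latex
\begin{aligned}
N_{\text{vox}} &= 181 \times 217 \times 181 = 7{,}109{,}137 \\
m = 10:\quad m^3 N_{\text{vox}} &= 10^3 \times 7{,}109{,}137 \approx 7.1 \times 10^{9} \\
m = 50:\quad m^3 N_{\text{vox}} &= 1.25 \times 10^5 \times 7{,}109{,}137 \approx 8.9 \times 10^{11} \\
\text{phase A2:}\quad N_{\text{vox}} &\approx 7.1 \times 10^{6} \ \text{estimations, independent of } m
\end{aligned}
```

Moving from the smallest to the largest window multiplies the phase A1 work by $(50/10)^3 = 125$, while the number of phase A2 estimations stays fixed.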
3.3 GPU implementation of the algorithm
As discussed in Sect. 3.1, the sequential algorithm was
bifurcated into two phases A1 and A2. In the GPU NLML, phase A1 was developed using CUDA C++ and embedded into Matlab through CUDA MEX (CUDA Matlab Executable). Phase A1 generates the vector $\lambda_X$ of size $P \times Q \times R \times k$ (image dimensions are $P \times Q \times R$). This vector contains the set of k closest non-local neighbours of every noisy image voxel. The vector $\lambda_X$ will be utilized by the second stage of the algorithm, i.e. phase A2
which undergoes execution in the CPU for the generation
of denoised image through computation of NLML for
every noisy image voxel.
Figure 2 illustrates the stages involved in the generation
of the denoised image through the application of GPU-
accelerated NLML estimation algorithm. The modified
algorithm discussed in Sect. 3.1 and pseudo-code presented
in Algorithm 1 builds the foundation for the development
of the GPU NLML estimation algorithm.
3.3.1 Parallelization of phase A1
Parallelization strategies can be employed to improve the
performance, owing to the inherent data level parallelism
in NLM [15] or NLML. The potential for performance improvement and parallelization is exhibited in phase A1 of the algorithm, where the generation of the vector $\lambda_X$ consists of data-independent calculations for every noisy image voxel $X_I$. The GPU is utilized to perform the heavy computations of phase A1 of the algorithm. The results calculated by the GPU are then returned to the main program executing in the Matlab environment for further processing through phase A2. The steps involved in phase A1 of the GPU implementation of the NLML estimation algorithm, as illustrated by Fig. 2, are described below.
1. The noisy image data are loaded into the main memory (RAM). Next, memory is allocated on the GPU device (using the cudaMalloc() function) for storing the image data and the vector $\lambda_X$. Following this, the image data are copied from the main memory to GPU memory (using the cudaMemcpy() function).
2. The GPU computes and generates the vector $\lambda_X$ from the noisy image data through application of the parallel algorithm presented in Algorithm 2.
3. The computed $\lambda_X$ vector is transferred from the GPU memory to the main memory (using the cudaMemcpy() function). Subsequently, the runtime memory allocated on the GPU is deallocated to prevent memory leaks (using the cudaFree() function).
The GPU NLML implementation of phase A1 is presented in Algorithm 2. GPU NLML is developed using the NVIDIA CUDA C++ framework. A single CUDA thread computes the set $\zeta_X$ and appends it to the vector $\lambda_X$. A single thread is launched for each noisy image voxel of the 3D MR image ($N^3$ threads are created). In each thread, the Euclidean distance measures are computed between the neighbourhood of the target voxel and the neighbourhood of every voxel within the search window of the target voxel. The Euclidean distances are sorted and stored in the set $\zeta_X$ of fixed size k using a variant of insertion sort. For this implementation, the value of k, i.e. the number of closest non-local neighbours (size of the set $\zeta_{X_I}$), is selected as 25 based on the original algorithm described in He et al. [2]. Finally, the set $\zeta_X$ is appended to the vector $\lambda_X$. After all the threads complete execution, a complete vector $\lambda_X$ is obtained which contains the aggregation of all the sets $\zeta_X$ for every noisy image voxel, i.e. $\lambda_X$ contains the set of closest non-local neighbours for every noisy image voxel.
In the CUDA GPU NLML implementation, the $N^3$ threads created are organized as grid blocks and thread blocks. The dimensions of the grid block along the x, y and z axes are N/8, N/8 and N/4, respectively. The dimensions of every thread block were 8, 8 and 4 threads along the x, y and z axes, respectively. Therefore, the total number of threads per block = 256 ($8 \times 8 \times 4$). This specific thread block size was chosen as the Nvidia Quadro K2000 GPU used for the experiments demonstrates optimal performance for a thread block of size 256 threads. The optimal number of threads in a thread block depends on the specific GPU used.

Fig. 2 The hybrid CPU–GPU model and the two algorithm phases A1 and A2. Stages shown: transfer of image data from main memory to GPU memory; phase A1, generation of the vector $\lambda_X$ by the GPU (parallel algorithm); transfer of the computed vector to main memory; phase A2, computation of the NLML estimate from the vector on the CPU (serial algorithm); generation of the denoised image
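The grid dimensions N/8, N/8 and N/4 are exact only when the image dimensions are multiples of the block dimensions. The following Python sketch shows one common way to handle other sizes, using ceiling division so that every voxel is covered; this is an assumption about how such a launch could be configured, not a detail stated here, and the function name `launch_config` and the mapping of image axes to x, y and z are hypothetical.

```python
def launch_config(dims, block=(8, 8, 4)):
    """Grid size by ceiling division, plus the number of surplus threads."""
    grid = tuple((d + b - 1) // b for d, b in zip(dims, block))
    threads_per_block = block[0] * block[1] * block[2]
    launched = grid[0] * grid[1] * grid[2] * threads_per_block
    voxels = dims[0] * dims[1] * dims[2]
    # Surplus threads fall outside the volume and must return early
    # after a bounds check inside the kernel.
    return grid, threads_per_block, launched, launched - voxels

# Example with the 181 x 217 x 181 volume used in the experiments.
grid, threads_per_block, launched, surplus = launch_config((181, 217, 181))
```

For this volume the sketch yields a 23 × 28 × 46 grid of 256-thread blocks, with roughly 6 % of the launched threads lying outside the image.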
Following the GPU execution phase, the true underlying
intensity is computed and the denoised voxel is generated
through retrieval of the set $\zeta_X$ from the vector $\lambda_X$ for each
corresponding noisy image voxel in phase A2. Phase A2
has not been implemented in GPU and is executed in
Matlab using the ‘‘fminunc’’ function, similar to serial
NLML. The usage of the GPU implementation of phase A1
and the CPU implementation for phase A2 follows the
hybrid CPU–GPU computation model where the GPU runs
one of the computationally intensive sections of the pro-
gram, whereas the CPU runs a serial optimization section
of the program [9].
3.4 GPU implementation complexity analysis
Phase A1 underwent parallelization (through GPU implementation) and phase A2 remained intact and unchanged. In phase A1, a total of $N^3$ (image dimensions are $P \times Q \times R$, this is considered as $N^3$ for complexity analysis) threads were launched to perform computations. For this complexity analysis, we consider a perfect parallel machine or a perfect GPU capable of spawning $N^3$ threads and performing $N^3$ parallel thread executions in which each thread performs $m^3$ operations (the search window size is $m \times m \times m$). The computational complexity of a single thread is $O(m^3)$. In an
ideal parallel machine, the computational complexity of the GPU NLML algorithm would be $O(m^3)$.

In practice, the execution time depends on the number of threads executing in parallel on the GPU. The maximum number of concurrent threads that can run on the GPU hardware at a time is the product of the number of GPU streaming multiprocessors (SMs), the number of resident warps per SM and the number of threads per warp, which is determined by the GPU architecture. The computation time on the GPU is the time taken to execute all the thread blocks. The overall execution time includes the memory transfer latencies, grid block and thread block management latencies, and the computation time of individual threads. The second phase A2 remains unmodified, and hence its time complexity remains $O(N^3)$. Consequently, the time complexity of the overall algorithm is the sum of the time complexities of phases A1 and A2, which is in the order of $O(m^3 + N^3)$. The search window dimensions cannot exceed the image dimensions, so the value of m will always be less than or equal to N $(m \le N)$. Therefore, the resultant time complexity for a perfectly parallel machine will be $O(N^3)$.
3.5 Optimization strategies
One of the noticeable costs in the GPGPU environment is the communication overhead and latency caused by the allocation of memory on the GPU and by memory operations, namely the transfer of data from main memory to the GPU memory and the transfer of results from the GPU memory to the main memory [22]. This drawback is comprehensively discussed in Sect. 4.3. To reduce the amount of data transfer between the CPU and the GPU memories, we use single precision floating point values instead of double precision values (saving 50 % of memory space and communication bandwidth). Though this might result in loss of accuracy, single precision is adequate to represent the MR grey scale data without any perceptible loss of accuracy.
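The effect of the precision choice on the output vector can be estimated with a short calculation. The Python sketch below assumes, for illustration only, that the vector of closest neighbours holds one floating point value per neighbour and uses the image size (181 × 217 × 181) and k = 25 reported elsewhere in this paper; the function name `neighbour_vector_bytes` is hypothetical.

```python
def neighbour_vector_bytes(dims, k, bytes_per_value):
    """Storage for k values per voxel of a P x Q x R volume."""
    p, q, r = dims
    return p * q * r * k * bytes_per_value

dims, k = (181, 217, 181), 25
single_precision = neighbour_vector_bytes(dims, k, 4)  # 32-bit floats
double_precision = neighbour_vector_bytes(dims, k, 8)  # 64-bit floats
```

Under these assumptions the vector occupies about 0.71 GB in single precision and about 1.42 GB in double precision, which matters on a GPU with 2 GB of memory (Table 1) that must also hold the image itself.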
The original algorithm [2] computes the Euclidean dis-
tance measure between the neighbourhood of the target
voxel and the neighbourhood of every other voxel within
the search window. The results obtained were stored in
another large vector from which k closest non-local
neighbours were extracted after sorting. Owing to the limitations of GPU memory, the aforementioned process is inefficient because a large vector of size $m^3$ would be required for every noisy image voxel in order to store the distance measure between the neighbourhood of the target voxel and the neighbourhood of every other voxel within the search window. Alternatively, a vector of k elements can be utilized along with a derivative of insertion sort, as depicted in the GPU variant of the algorithm, so as to keep the computed Euclidean distance measures in ascending order. This approach offers a twofold advantage. Firstly, the memory requirements of the GPU are reduced since a vector of size k is required for every voxel in the noisy image instead of a vector of size $m^3$. Secondly, insertion sort is efficient for sorting and storing the incoming computed values during the runtime [23].
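The bounded buffer idea can be sketched in Python as follows. This is one possible variant of insertion sort with a fixed capacity, written for illustration; it is not the kernel code of this work, the function name `insert_bounded` is hypothetical, and k is assumed to be at least 1. In the actual method each distance would be stored together with the corresponding voxel intensity.

```python
def insert_bounded(buffer, value, k):
    """Keep the k smallest values seen so far in ascending order (k >= 1)."""
    if len(buffer) == k and value >= buffer[-1]:
        return                  # not among the k closest, discard
    if len(buffer) < k:
        buffer.append(value)    # buffer not yet full
    else:
        buffer[-1] = value      # overwrite the current largest entry
    i = len(buffer) - 1
    while i > 0 and buffer[i - 1] > buffer[i]:
        # Shift the new value left until the order is restored.
        buffer[i - 1], buffer[i] = buffer[i], buffer[i - 1]
        i -= 1

# Stream of distances, as they would be produced while scanning a search window.
distances = [5.0, 1.0, 4.0, 2.0, 8.0, 3.0, 0.5]
nearest = []
for d in distances:
    insert_bounded(nearest, d, k=3)
```

Each insertion costs at most k comparisons, so a thread needs only k storage slots instead of $m^3$ and never performs a separate sorting pass.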
4 Experiments and results
The experiments pertaining to the conventional NLML
algorithm and the GPU-accelerated NLML algorithm were
performed on a simulated brain database (SBD) acquired
from BrainWeb [17]. The parameters of the 3D MR image
include T1 modality, 3 mm slice thickness, 0 % noise and
20 % intensity non-uniformity (INU). Rician noise with a standard deviation ($\sigma_g$) of 15 was added to the ground truth and all denoising experiments were conducted on the resultant noisy image. The dimensions of the 3D MR image used for the experiments were 181 × 217 × 181.
The experiments were performed in a system equipped
with Intel Xeon E5-2650 (2.6 GHz clock speed) Octa core
processor, 64 GB RAM and Nvidia Quadro K2000
graphics card. The software specifications include Matlab
2014b, Nvidia CUDA 5.5 and Microsoft Visual Studio 2010. GPU and CPU execution times for generating the $\lambda_X$ vector and for denoised image generation were evaluated for
performance comparison. A detailed hardware specifica-
tion of the CPU and the GPU is given below in Table 1.
The execution times of phases A1 and A2 for both the CPU and GPU variants of the algorithm were measured using the Matlab TIC and TOC routines. The execution times measured for the GPU variant were inclusive of all the latency incurred due to the memory operations on the GPU device. The
denoising qualities of the resultant images generated
through NLML estimation algorithm were evaluated
through peak signal-to-noise ratio (PSNR) image quality
metric. PSNR is the ratio of the maximum signal power to the power of the corrupting noise. Mathematically, PSNR is defined as follows:

$$\mathrm{PSNR} = 10 \times \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right) \qquad (7)$$

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(Y_i - \hat{Y}_i\right)^2. \qquad (8)$$

In Eq. (7), MAX denotes the maximum value that can be attained by a voxel in the 3D MR image. MSE stands for mean squared error, which is the average of the squared differences between the ground truth $Y_i$ (voxel of the ground truth image, i.e. noise-free source image) and the estimated voxel intensity $\hat{Y}_i$ (voxel of a denoised image generated through the NLML estimation algorithm). In Eq. (8), N denotes the number of voxels compared.
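Eqs. (7) and (8) translate directly into code. The following Python function is a minimal sketch for flattened voxel sequences; the name `psnr` and the choice to return infinity for identical inputs are assumptions made for illustration.

```python
import math

def psnr(truth, estimate, max_value):
    """PSNR in dB from Eqs. (7) and (8) for two equally sized voxel sequences."""
    if len(truth) != len(estimate) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    # Eq. (8): mean of the squared voxel-wise differences.
    mse = sum((y - y_hat) ** 2 for y, y_hat in zip(truth, estimate)) / len(truth)
    if mse == 0.0:
        return math.inf  # identical images
    # Eq. (7)
    return 10.0 * math.log10(max_value ** 2 / mse)

# Two voxels, one of which is off by 10 grey levels: MSE = 50.
value = psnr([0.0, 255.0], [0.0, 245.0], max_value=255.0)
```

Because PSNR is logarithmic in the MSE, halving the mean squared error raises it by about 3 dB.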
Figure 3 demonstrates the results of the denoising
experiments after the noisy MR image has been subjected
to the CPU and the GPU variant NLML denoising. A
search window of dimensions $10 \times 10 \times 10$ and a local
(similarity) window of dimensions $3 \times 3 \times 3$ were used for
denoising the 3D MR image. A particular slice from the
CPU-generated denoised image and the GPU-generated
denoised image is extracted for comparison and qualitative
analysis. The CPU and GPU variants of the algorithm yield
identical results and exhibit effective denoising with
improvement in visual quality and PSNR.
4.1 CPU vs GPU performance assessment (generation of kX vector)

The execution time measurements for the two phases A1 and A2, for
both the CPU and GPU implementations of NLML, were performed using the
Matlab TIC and TOC routines. This section deals with the execution
time assessment for the generation of the kX vector by the
conventional NLML algorithm and by its GPU implementation, i.e. phase
A1 of the algorithm. Throughout this experiment, the similarity window
dimensions were kept at 3 × 3 × 3, whereas the search window
dimensions were varied from 10 × 10 × 10 to 50 × 50 × 50. The
execution time for the completion of the first phase of the algorithm
was recorded at each search window size for the conventional NLML
algorithm and its CUDA GPU counterpart.
Table 2 records the execution times and the speedup of the GPU NLML
over the serial NLML implementation. The similarity window dimensions
were 3 × 3 × 3 voxels and the search window dimensions were varied
from 10 × 10 × 10 to 50 × 50 × 50 voxels. As analysed in Sect. 3.2,
the serial execution time grows as $O(m^3 N^3)$. From Table 2, the
serial implementation execution time ranges from 474.65 to 45,460.93 s
for the same 3D MR image. This 96× increase in execution time tracks
the 125× increase in search window volume (10 × 10 × 10 to
50 × 50 × 50 voxels). The GPU NLML reduces the execution time through
parallel execution of threads on the GPU. However, its execution time
still increases 78× (from 92.72 to 7288.72 s), in proportion to the
work done by a single thread: the number of tasks in a single thread
is of the order of $m^3$ (the size of the search window). The steady
increase in speedup (from 5.12 to 6.24) with increasing window size
can be attributed to the lower rate of increase in execution time of
the GPU version (78×) relative to the serial version (96×).

Table 1 CPU and GPU hardware specification

| CPU | Intel Xeon E5-2650 v2 | GPU | Nvidia Quadro K2000 |
|---|---|---|---|
| Clock speed | 2.6 GHz | Clock speed | 950 MHz |
| Cache memory | 2 MB (L2), 20 MB (L3) | GPU memory | 2 GB |
| Number of cores | 8 | Number of cores | 384 CUDA cores |
| Maximum memory bandwidth | 59.7 GB/s | Memory bandwidth | 64 GB/s |

Fig. 3 A slice of the 3-D MR image. a Ground truth, b image corrupted with Rician noise of standard deviation 15, c CPU-generated denoised image (PSNR = 29.98), d GPU-generated denoised image (PSNR = 29.98)
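The speedups and growth factors discussed for Table 2 can be re-derived from the reported timings. The Python sketch below is only a consistency check on those figures, not part of the original implementation; the names `TABLE_2`, `speedup` and `growth` are illustrative.

```python
# Phase A1 timings copied from Table 2.
# Key: search window edge length m (the window is m x m x m voxels).
# Value: (CPU seconds, GPU seconds).
TABLE_2 = {
    10: (474.65, 92.72),
    20: (3315.40, 571.14),
    30: (10659.22, 1771.54),
    40: (24218.01, 3949.17),
    50: (45460.93, 7288.72),
}


def speedup(cpu_seconds, gpu_seconds):
    """Speedup of the GPU variant over the serial CPU variant."""
    return cpu_seconds / gpu_seconds


def growth(values):
    """Factor by which the last value exceeds the first."""
    return values[-1] / values[0]


edges = sorted(TABLE_2)
speedups = [round(speedup(*TABLE_2[m]), 2) for m in edges]
cpu_growth = growth([TABLE_2[m][0] for m in edges])   # serial time, m = 10 -> 50
gpu_growth = growth([TABLE_2[m][1] for m in edges])   # GPU time, m = 10 -> 50
window_growth = growth([m ** 3 for m in edges])       # search window volume

if __name__ == "__main__":
    print(speedups, cpu_growth, gpu_growth, window_growth)
```

Because the GPU time grows by a smaller factor than the serial time over the same range of windows, the speedup rises with the search window size.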
Figure 4 plots the execution time for the completion of the first
phase (A1) for the CPU variant and the GPU variant of the algorithm
against increasing search window dimensions.
4.2 CPU vs GPU performance assessment (denoised image generation)

The modified algorithm consists of two phases, A1 and A2. Phase A1
offered an opportunity for parallelization, whereas phase A2 was
executed serially on the CPU. The total execution time for denoised
image generation is the sum of the execution times of phases A1 and
A2.
The second phase (A2) of the algorithm generates the denoised voxel
for every noisy image voxel by computing the non-local estimate as per
Eq. (3). As mentioned in the complexity analysis in Sects. 3.2 and
3.4, the time complexity of phase A2 depends only on the MR image
dimensions ($O(N^3)$). The average time required to generate the
non-local estimate for every noisy image voxel is 38,000 s for an
image of dimensions 181 × 217 × 181 when the value of the parameter k
is chosen as 25. Adding this time to the time required for the
generation of the kX vector gives the total time required for the
generation of the denoised image (refer Table 3).

Table 2 CPU vs GPU NLML execution time comparison for generation of kX vector (phase A1)

| Local window dimensions | Search window dimensions | Execution time (CPU variant) (s) | Execution time (GPU variant) (s) | Speedup |
|---|---|---|---|---|
| 3 × 3 × 3 | 10 × 10 × 10 | 474.65 | 92.72 | 5.12 |
| 3 × 3 × 3 | 20 × 20 × 20 | 3315.40 | 571.14 | 5.80 |
| 3 × 3 × 3 | 30 × 30 × 30 | 10,659.22 | 1771.54 | 6.02 |
| 3 × 3 × 3 | 40 × 40 × 40 | 24,218.01 | 3949.17 | 6.13 |
| 3 × 3 × 3 | 50 × 50 × 50 | 45,460.93 | 7288.72 | 6.24 |

Fig. 4 CPU vs GPU NLML execution time for the generation of kX vector (phase A1)

Table 3 Speedup of GPU NLML over the CPU version for denoised image generation

| Local window dimensions | Search window dimensions | Execution time (CPU variant) (s) | Execution time (GPU variant) (s) | Speedup |
|---|---|---|---|---|
| 3 × 3 × 3 | 10 × 10 × 10 | 38,474.65 | 38,092.72 | 1.01 |
| 3 × 3 × 3 | 20 × 20 × 20 | 41,315.4 | 38,571.14 | 1.07 |
| 3 × 3 × 3 | 30 × 30 × 30 | 48,659.22 | 39,771.54 | 1.22 |
| 3 × 3 × 3 | 40 × 40 × 40 | 62,218.01 | 41,949.17 | 1.5 |
| 3 × 3 × 3 | 50 × 50 × 50 | 83,460.93 | 45,288.72 | 1.84 |

Fig. 5 CPU vs GPU NLML execution time for the denoised image generation (phase A1 and A2)
On average, the second phase (A2) of the algorithm takes 38,000 s to
complete. The GPU NLML execution times for phase A1 range from 92.72
to 7288.72 s when the search window dimensions are varied from
10 × 10 × 10 to 50 × 50 × 50 voxels. The phase A2 execution time
therefore dominates the total NLML execution time and offsets the
speedup gained by the GPU implementation of phase A1. For the same
range of search windows, the overall speedup ranges from 1.01 to 1.84.
This reduction in speedup motivates us to implement the multivariate
minimization algorithm of phase A2 on the GPU as future work. For the
serial implementation, the execution time increases 2.16 times (from
38,474.65 to 83,460.93 s) when the search window dimensions are
increased from 10 × 10 × 10 to 50 × 50 × 50. Owing to the GPU
implementation of phase A1, the overall execution time of the GPU NLML
increases by a factor of only 1.18 over the same range of search
window sizes. This trend is evident in Fig. 5, which plots the total
denoised image generation time for the serial NLML and the GPU NLML
against increasing search window dimensions.
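The way the serial phase A2 limits the overall gain can be checked numerically. The Python sketch below adds the reported average A2 time of 38,000 s to the phase A1 timings and recomputes the overall speedup, together with the ceiling that would apply if phase A1 cost nothing at all. The function names and the `PHASE_A1_SECONDS` table layout are illustrative, and treating A2 as exactly 38,000 s for every run is a simplifying assumption.

```python
PHASE_A2_SECONDS = 38000.0  # average phase A2 time reported in the text

# Phase A1 timings: search window edge m -> (CPU seconds, GPU seconds).
PHASE_A1_SECONDS = {
    10: (474.65, 92.72),
    20: (3315.40, 571.14),
    30: (10659.22, 1771.54),
    40: (24218.01, 3949.17),
    50: (45460.93, 7288.72),
}


def total_times(m, a2_seconds=PHASE_A2_SECONDS):
    """Total (CPU, GPU) time: phase A1 plus the serial phase A2."""
    cpu_a1, gpu_a1 = PHASE_A1_SECONDS[m]
    return cpu_a1 + a2_seconds, gpu_a1 + a2_seconds


def overall_speedup(m, a2_seconds=PHASE_A2_SECONDS):
    """End-to-end speedup once the serial phase A2 is included."""
    cpu_total, gpu_total = total_times(m, a2_seconds)
    return cpu_total / gpu_total


def speedup_ceiling(m, a2_seconds=PHASE_A2_SECONDS):
    """Upper bound on the overall speedup if phase A1 took no time."""
    cpu_total, _ = total_times(m, a2_seconds)
    return cpu_total / a2_seconds
```

Under these assumptions, even an infinitely fast phase A1 would cap the overall speedup at about 2.2 for the 50 × 50 × 50 window, which is why moving phase A2 to the GPU is the natural next step. The recomputed value for the 40 × 40 × 40 window is 1.48, which Table 3 reports to one decimal place as 1.5.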
4.3 Memory transfers between the RAM and the GPU memory: latency analysis

Table 4 records the average latencies incurred for the GPU memory
operations during program execution. These high-precision latency
measurements were carried out using CUDA events. One event was
recorded immediately before each operation and another as soon as the
operation completed. After synchronization of these events, the
elapsed time was obtained using the cudaEventElapsedTime() routine.
The execution times measured in Sects. 4.1 and 4.2 include the
latency caused by GPU memory operations. The Quadro K2000 GPU connects
to the host's main memory through the PCIe v2.x interface, with peak
data transfer rates of up to 8 GB/s. For the small image size used in
our experiments (181 × 217 × 181), the memory transfer and allocation
latencies do not present a significant bottleneck.
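To put the Table 4 latencies in proportion, the Python sketch below sums them and compares the total with the shortest GPU phase A1 run (92.72 s, for the 10 × 10 × 10 search window). It assumes that each memory operation occurs once per run, which the text does not state explicitly; the dictionary keys and variable names are illustrative.

```python
# Average latencies from Table 4, converted to seconds.
LATENCIES_SECONDS = {
    "allocate image data on GPU": 1.52e-6,
    "allocate kX vector on GPU": 1.664e-6,
    "copy image data, RAM to GPU": 9.69e-3,
    "copy kX vector, GPU to RAM": 361.843e-3,
}

# Shortest GPU phase A1 time in Table 2 (10 x 10 x 10 search window).
FASTEST_GPU_RUN_SECONDS = 92.72

# Assumption: each operation above happens once per run.
total_latency = sum(LATENCIES_SECONDS.values())
latency_share = total_latency / FASTEST_GPU_RUN_SECONDS

if __name__ == "__main__":
    print(f"total latency: {total_latency:.6f} s")
    print(f"share of fastest GPU run: {latency_share:.4%}")
```

Under that assumption the memory operations account for well under 1% of even the shortest GPU run, and the share shrinks further for larger search windows.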
5 Conclusion and future work

The paper describes the implementation and analysis of the GPU NLML.
The serial NLML is divided into two phases: finding similar pixels for
ML estimation (phase A1), and estimating the true underlying intensity
for every noisy image voxel in the 3D MR image (phase A2). The paper
presents the GPU implementation of phase A1 of the NLML algorithm. We
observe that the GPU NLML achieves a speedup of 5.12–6.24 in this
phase for search window sizes from 10 × 10 × 10 to 50 × 50 × 50
voxels. We also show that the PSNR of the resultant denoised image
increases as the search window dimensions increase. This is because
the probability of discovering similar neighbourhoods increases with
the search window dimensions. Ideally, the search window would be as
large as the image itself, i.e. the entire image would be explored to
find similar pixels for high-quality denoising results. However, owing
to the tremendous execution time and computational requirements, this
approach is infeasible. For practical purposes, a threshold is placed
on the search window dimensions of the NLML algorithm.

However, the GPU NLML has demonstrated that a real-time
implementation of the algorithm could be made practical even for
larger search window dimensions through the use of high-performance
GPUs. There is further scope for improving the speedup of the GPU NLML
by implementing the multivariate minimization algorithm on the GPU.
This work has been undertaken as an extension to the current work.
Future work could consist of parallelization of the entire algorithm,
and further research in this direction could result in the deployment
of such denoising algorithms in real-time MRI systems. Similar work
could also be extended to other prominent denoising algorithms in
which mutually exclusive, computationally intensive calculations can
be exploited using leading parallel processing frameworks on
current-day multiprocessors, coprocessors and GPUs.
Table 4 Average latency for different GPU memory operations

| GPU memory operation | Average execution time (latency) |
|---|---|
| GPU memory allocation for image data | 1.52 µs |
| GPU memory allocation for the kX vector | 1.664 µs |
| Transfer of image data from main memory (RAM) to GPU memory | 9.69 ms |
| Transfer of generated kX vector from the GPU memory to the main memory | 361.843 ms |
Adithya H. K. Upadhya received his B.Tech. degree in Computer
Engineering from the National Institute of Technology Karnataka
(NITK), India, in the year 2015. His research interests include parallel
processing, image processing, data science and algorithms.
Basavaraj Talawar did his M.Tech. in Networking and Internet
Engineering in VTU, India, in 2005 and Ph.D. in Electrical
Engineering from the Indian Institute of Science, Bangalore, India,
in 2013. At present, he is working as an Assistant Professor at the
Department of Computer Science and Engineering at the National
Institute of Technology Karnataka (NITK), India. His major areas of
interests are Network-on-Chips, Warehouse scale computing and
DNA computing.
Jeny Rajan did his M.Tech. in Image Processing from the University
of Kerala, India, and received his PhD from the University of
Antwerp, Belgium. He is currently working as an Assistant Professor
at the Department of Computer Science and Engineering, National
Institute of Technology Karnataka (NITK), India. Before joining
NITK, he was working as a post-doctoral researcher at the Vision
Lab, University of Antwerp in Belgium. His main research interests
are MRI and Ultrasound image processing.