
SPECIAL ISSUE PAPER

GPU implementation of non-local maximum likelihood estimation method for denoising magnetic resonance images

Adithya H. K. Upadhya1 • Basavaraj Talawar1 • Jeny Rajan1

Received: 17 June 2015 / Accepted: 21 December 2015 / Published online: 5 January 2016

© Springer-Verlag Berlin Heidelberg 2016

Abstract Magnetic resonance imaging (MRI) is a widely

deployed medical imaging technique used for various

applications such as neuroimaging, cardiovascular imaging

and musculoskeletal imaging. However, MR images

degrade in quality due to noise. The magnitude MRI data in

the presence of noise generally follows a Rician distribution if acquired with single-coil systems. Among the methods proposed in the literature for denoising MR images corrupted with Rician noise, the non-local maximum likelihood (NLML) method and its variants are popular. Despite its denoising quality, the NLML algorithm suffers from a high time complexity of $O(m^3 N^3)$, where $m^3$ and $N^3$ represent the search window and image size,

respectively, for a 3D image. This makes the algorithm

challenging for deployment in the real-time applications

where fast and prompt results are required. A viable solu-

tion to this shortcoming would be the application of a data

parallel processing framework such as Nvidia CUDA so as

to utilize the mutually exclusive and computationally

intensive calculations to our advantage. The GPU-based

implementation of NLML-based image denoising achieves

significant speedup compared to the serial implementation.

This research paper describes the first successful attempt to

implement a GPU-accelerated version of the NLML algo-

rithm. The main focus of the research was on the paral-

lelization and acceleration of one computationally

intensive section of the algorithm so as to demonstrate the

execution time improvement through the application of

parallel processing concepts on a GPU. Our results suggest

the possibility of practical deployment of NLML and its

variants for MRI denoising.

Keywords Non-local maximum likelihood estimation (NLML) · MRI · Rician distribution · GPGPU · Parallel processing · Nvidia CUDA framework

1 Introduction

Random (stochastic) noise is the major contributor towards

the degradation in quality of the MR (magnetic resonance)

images [1]. In consideration of the importance of MRI

denoising for various medical applications, several

denoising algorithms and noise models have been devel-

oped to accurately estimate the true underlying intensity of

the noisy MR images. Currently, noisy data in magnitude MR images are modelled by the Rician distribution, which is a better fit for general MR image data corrupted by noise [3]. The noise in the MRI k-space data (raw

data) is generally assumed to be normally distributed. The

k-space data are then Fourier transformed to obtain the

magnetization distribution. The data distribution in the

resultant real and imaginary components of the complex

image still remains Gaussian due to the linearity and

orthogonality of the Fourier transform. However, the complex image as such is not used for analysis. To use both parts of the complex data values, magnitude and phase images are calculated [4]. However, since the computation of the

magnitude image is a non-linear operation (root sum of

squares of the Gaussian distributed complex image), the

data in the magnitude image will be Rician distributed (in

single-coil systems) [1, 4, 5]. Several denoising schemes

Corresponding author: Jeny Rajan

1 Department of Computer Science and Engineering, National

Institute of Technology Karnataka, Surathkal, India


J Real-Time Image Proc (2017) 13:181–192

DOI 10.1007/s11554-015-0559-6


have been formulated for denoising MR images corrupted

by Rician noise (e.g. [1, 2, 15, 24–30]). Among them, the non-local maximum likelihood (NLML) method [2] and its variants [1, 2, 24, 25, 28] are very popular.

The NLML method is based on non-local concepts [6] and maximum likelihood estimation (MLE) techniques [7, 8]. NLML preserves edges and fine details well and is designed to exploit the redundancies and similarities within the noise-corrupted image, which makes it one of the best algorithms for denoising MR images. The superior performance of NLML methods over other approaches in denoising MRI is demonstrated in [1, 2, 24, 25]. The algorithm nevertheless has a significant drawback. NLML suffers from a very high time complexity [2], similar to that of non-local means (NLM), owing to the similarity measures and comparisons between neighbourhoods determined by the search window dimensions. For the best denoising quality, the search window dimensions should equal the image dimensions, i.e. similarity comparisons should be made throughout the image for every target voxel's neighbourhood. Under these circumstances, for a 3D MR image dataset of dimensions $P \times Q \times R$, the complexity would be of the order of $O(N^6)$ (for complexity analysis, the dimensions of the MR image dataset are taken to be $N \times N \times N$). To improve the execution speed, the search window is usually restricted to lower dimensions, i.e. $m \times m \times m$ [2]. Smaller search windows, however, can compromise the quality of the resultant denoised image. This is demonstrated in Fig. 1: as the search window size increases, the PSNR (peak signal-to-noise ratio) of the denoised image improves. Hence, there is a pressing need for a solution that improves the execution time of the NLML algorithm without resorting to smaller search windows.

A solution to the aforementioned disadvantage would be

the application of parallel processing concepts so as to

exploit the various mutually exclusive and computationally

intensive calculations. GPU computing is a practical and

effective solution to address the requirement of a powerful

and efficient hardware specialized in parallel processing.

The Nvidia CUDA framework provides a powerful yet accessible API toolset for developing a GPU-accelerated variant of this algorithm. GPU hardware outperforms CPUs by a wide margin in theoretical floating point operations per second (GFLOP/s) and theoretical memory bandwidth [16]. Computations

involving NLML estimation for all target noisy image

voxels (to determine the true underlying intensity of the

target voxel) are independent of each other and are mutu-

ally exclusive. This concept is the main principle and

rationale behind the attempt to accelerate the NLML

algorithm where, on a theoretically perfect GPU, $P \times Q \times R$ (image dimensions) threads could be launched

to compute and generate denoised voxels for every corre-

sponding noisy image voxel in parallel.

GPU computing has contributed to several other medical

image processing algorithms such as functional magnetic

resonance imaging (fMRI), diffusion tensor imaging (DTI),

image denoising and image registration [9]. CUDA-en-

abled GPU acceleration has also found applications in the

Fig. 1 a Simulated noisy image, b PSNR plot after denoising with NLML method with different search window sizes


fast and efficient non-local means denoising of 3D ultra-

sound images [10]. GPU-accelerated lattice Boltzmann

model is used to solve partial differential equations for

anisotropic diffusion which is widely adopted for image

denoising [11]. In the field of MRI, GPGPU has made

several strides and contributions as well. Reconstruction

after data acquisition in the frequency domain on a

Cartesian grid through standard scanning protocols could

be performed by GPU-accelerated inverse Fourier transformations [12]. In MRI, GPUs have been used to speed up geometric distortion correction in EPI (echo planar imaging) and for the high-speed design of multidimensional radio frequency pulses to improve high-field imaging [13, 14].

CUDA-based parallel algorithms have also found applica-

tions in acceleration of image restoration methods such as

Retinex [20]. Level-set segmentations which are com-

monly used in medical applications for segmenting images

and volumes have also been accelerated by GPGPU-based

methods [21].

This paper is organized as follows. Section 2 provides

an elaborate description of NLML concepts and MRI data

distribution. Section 3 explains the proposed work and

complexity analysis. Experimental results and observations

are discussed in Sect. 4. Finally, conclusion and remarks

are drawn in Sect. 5.

2 Data distribution in MRI and signal estimation using NLML

The noise in the complex raw k-space MR data is characterized by a Gaussian PDF (probability density function). Owing to the orthogonality and linearity of the

Fourier transform, the noise distribution of the real and

imaginary constituents can still be modelled as Gaussian

after the application of inverse Fourier transformation.

However, the magnitude of the complex MR image data

will no longer follow Gaussian distribution, but rather

follow Rician distribution due to the sum of square (SoS)

operation.

Let $\sigma_g$ represent the standard deviation of the stationary noise corrupting the complex MR image data, and let $\mu_R$ and $\mu_I$ be the mean values of the real and imaginary components $R$ and $I$. Consequently, the reconstructed magnitude data $M$ will follow a Rician distribution with PDF given by [18]:

$$p_M(M \mid A, \sigma_g) = \frac{M}{\sigma_g^2} \exp\left(-\frac{M^2 + A^2}{2\sigma_g^2}\right) I_0\left(\frac{A M}{\sigma_g^2}\right) \epsilon(M), \tag{1}$$

where $M = \sqrt{R^2 + I^2}$, $A = \sqrt{\mu_R^2 + \mu_I^2}$, $I_0(\cdot)$ represents the 0th order modified Bessel function of the first kind and $\epsilon(\cdot)$ represents the Heaviside step function. The signal-to-noise ratio (SNR), i.e. the ratio $A/\sigma_g$, determines the shape of the Rician PDF.
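Eq. (1) can be evaluated numerically as in the following minimal C++ sketch. The function names `bessel_i0` and `rician_pdf` are illustrative and not part of the original implementation, and the Bessel function is computed from its power series under the assumption that its argument stays moderate (below roughly 700), so that the sum fits in a double.

```cpp
#include <cmath>

// Zeroth-order modified Bessel function of the first kind, I0(x),
// from its power series: sum over j >= 0 of (x^2/4)^j / (j!)^2.
// Assumption: |x| is below roughly 700 so the result fits in a double.
double bessel_i0(double x) {
    const double q = 0.25 * x * x;
    double term = 1.0;  // j = 0 term
    double sum = 1.0;
    for (int j = 1; j < 5000; ++j) {
        term *= q / (static_cast<double>(j) * j);
        sum += term;
        if (term < 1e-17 * sum) break;  // remaining terms are negligible
    }
    return sum;
}

// Rician PDF of Eq. (1) for magnitude M, underlying intensity A
// and noise standard deviation sigma_g.
double rician_pdf(double M, double A, double sigma_g) {
    if (M < 0.0) return 0.0;  // Heaviside step: no density for negative magnitudes
    const double s2 = sigma_g * sigma_g;
    return (M / s2) * std::exp(-(M * M + A * A) / (2.0 * s2)) * bessel_i0(A * M / s2);
}
```

With $A = 0$ the expression reduces to the Rayleigh PDF, which gives a simple sanity check: `rician_pdf(1.0, 0.0, 1.0)` equals $e^{-1/2}$.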

The unknown variables in the PDF can be estimated by

maximizing the corresponding likelihood function, if the

observed data and the model of interest are available. Let

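$M_1, M_2, \ldots, M_n$ be $n$ statistically independent observations within a region of constant signal intensity $A$. Then, the joint PDF of the observations is:

$$p_M(\{M_i\} \mid A, \sigma_g) = \prod_{i=1}^{n} \frac{M_i}{\sigma_g^2} \exp\left(-\frac{M_i^2 + A^2}{2\sigma_g^2}\right) I_0\left(\frac{A M_i}{\sigma_g^2}\right). \tag{2}$$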

If $\sigma_g^2$ is known in advance, the only unknown parameter in Eq. (2) is the true underlying intensity $A$, which can be estimated by maximizing the likelihood function $L$, or equivalently $\ln L$, with respect to $A$. However, if $\sigma_g^2$ is unknown, it can also be estimated along with $A$ using Eq. (3) [8]:

$$\{\hat{A}_{ML}, \hat{\sigma}^2_{ML}\} = \arg\max_{A, \sigma_g^2} (\ln L), \tag{3}$$

where

$$\ln L = \sum_{i=1}^{n} \ln\left(\frac{M_i}{\sigma_g^2}\right) - \sum_{i=1}^{n} \frac{M_i^2 + A^2}{2\sigma_g^2} + \sum_{i=1}^{n} \ln I_0\left(\frac{A M_i}{\sigma_g^2}\right) \tag{4}$$

and $\hat{A}_{ML}$ and $\hat{\sigma}^2_{ML}$ represent the estimated underlying true intensity and the noise variance, respectively.
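The following C++ sketch evaluates the log-likelihood of Eq. (4) and maximizes it over $A$ for the simpler case in which $\sigma_g$ is known. It is only an illustration: the names are hypothetical, a one-dimensional golden-section search on $[0, \max_i M_i]$ is used here in place of a general-purpose optimizer, all samples are assumed to be strictly positive, and the Bessel argument is assumed to stay below roughly 700.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// ln I0(x) from the power series of I0 (assumes |x| below roughly 700).
double log_bessel_i0(double x) {
    const double q = 0.25 * x * x;
    double term = 1.0, sum = 1.0;
    for (int j = 1; j < 5000; ++j) {
        term *= q / (static_cast<double>(j) * j);
        sum += term;
        if (term < 1e-17 * sum) break;  // remaining terms are negligible
    }
    return std::log(sum);
}

// Log-likelihood of Eq. (4) for samples M (all > 0), intensity A, noise std sigma_g.
double rician_log_likelihood(const std::vector<double>& M, double A, double sigma_g) {
    const double s2 = sigma_g * sigma_g;
    double lnL = 0.0;
    for (double m : M) {
        lnL += std::log(m / s2) - (m * m + A * A) / (2.0 * s2)
             + log_bessel_i0(A * m / s2);
    }
    return lnL;
}

// Golden-section search for the A in [0, max(M)] that maximizes ln L
// (sigma_g is assumed known; this is not the optimizer used in the paper).
double estimate_intensity(const std::vector<double>& M, double sigma_g) {
    double lo = 0.0, hi = 0.0;
    for (double m : M) hi = std::max(hi, m);
    auto f = [&](double A) { return rician_log_likelihood(M, A, sigma_g); };
    const double r = (std::sqrt(5.0) - 1.0) / 2.0;  // inverse golden ratio
    double a = hi - r * (hi - lo), b = lo + r * (hi - lo);
    double fa = f(a), fb = f(b);
    for (int it = 0; it < 200 && (hi - lo) > 1e-9; ++it) {
        if (fa < fb) {  // maximum lies in [a, hi]
            lo = a; a = b; fa = fb;
            b = lo + r * (hi - lo); fb = f(b);
        } else {        // maximum lies in [lo, b]
            hi = b; b = a; fb = fa;
            a = hi - r * (hi - lo); fa = f(a);
        }
    }
    return 0.5 * (lo + hi);
}
```

For high-SNR samples such as {100, 102, 98, 101, 99} with $\sigma_g = 5$, the estimate lies slightly below the sample mean, as expected from the bias of Rician magnitude data.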

Non-local concepts exploit the redundancies and repetitions within an image. The non-local approach has attracted intense research from the image processing community. The NLML estimation method drew inspiration from non-local means (NLM), which is an effective denoising algorithm based on non-local concepts.

Unlike NLM, for MRI denoising, NLML strives to estimate

the true underlying intensity through Eq. (3) based on the

Rician noise nature instead of estimating noise-free signal

through computation of weighted average of non-local vox-

els. NLML selects the samples required for denoising through

non-local concepts where NL voxels get selected based on

the intensity similarity of the neighbourhood of the voxels. If

the neighbourhoods of two voxels are similar, then their

central voxels should have a similar meaning and thus similar

grey values [19]. The similarity of the voxel neighbourhoods can be measured by the intensity distance (Euclidean distance) between the neighbourhoods [2]:

$$d_{P,Q} = \lVert N_P - N_Q \rVert, \tag{5}$$

where $d_{P,Q}$ denotes the intensity distance or Euclidean distance between the neighbourhoods $N_P$ and $N_Q$ of the voxels $P$ and $Q$. For each voxel $P$ in the image, the intensity distance $d_{P,Q}$, based on a similarity window, between $P$ and all other non-local voxels $Q$ within the search window is computed using Eq. (5). After sorting the non-local voxels in increasing order of the intensity distance $d$, the set $\{\zeta_P\}$ of the $k$ closest non-local neighbours of $P$ is generated, and maximum likelihood estimation (MLE) using Eq. (3) is performed on this set, which gives the true underlying intensity value of the noisy voxel $P$. This procedure is performed on every noisy voxel in the image to generate the denoised image.
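The intensity distance of Eq. (5) can be sketched in C++ as follows. The `Volume` type, the x-fastest memory layout and the clamping of coordinates at the image border are assumptions made for this illustration; the text does not specify how borders are handled.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 3D volume stored as a flat array, x varying fastest.
struct Volume {
    int nx, ny, nz;
    std::vector<float> data;

    // Voxel access with coordinates clamped to the volume (assumed border rule).
    float at(int x, int y, int z) const {
        x = std::min(std::max(x, 0), nx - 1);
        y = std::min(std::max(y, 0), ny - 1);
        z = std::min(std::max(z, 0), nz - 1);
        return data[(static_cast<std::size_t>(z) * ny + y) * nx + x];
    }
};

// Euclidean distance of Eq. (5) between the neighbourhoods of voxels P and Q.
// radius = 1 corresponds to a 3 x 3 x 3 similarity window.
double neighbourhood_distance(const Volume& v, int px, int py, int pz,
                              int qx, int qy, int qz, int radius) {
    double sum = 0.0;
    for (int dz = -radius; dz <= radius; ++dz)
        for (int dy = -radius; dy <= radius; ++dy)
            for (int dx = -radius; dx <= radius; ++dx) {
                const double diff = static_cast<double>(v.at(px + dx, py + dy, pz + dz))
                                  - v.at(qx + dx, qy + dy, qz + dz);
                sum += diff * diff;
            }
    return std::sqrt(sum);
}
```

Because only the ordering of the distances matters for selecting the closest neighbours, an implementation could equally compare squared distances and skip the square root.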

3 Proposed method

The steps and rationale in the parallelization of the NLML

algorithm are described in this section. Sections 3.1 and 3.2

describe the serial implementation and the complexity

analysis of the NLML algorithm. Sections 3.3 and 3.4

describe the GPU implementation of the NLML algorithm.

The GPU implementation of the NLML algorithm is

referred to as ‘‘GPU NLML’’ henceforth.

3.1 Algorithm modifications and implementation details

For the GPU implementation, the serial NLML algorithm has been divided into two phases, namely A1 and A2. Phase A1 generates, for every noisy voxel, the set of its closest non-local neighbours. Phase A2 estimates the true underlying intensity $A_X$ for every corresponding noisy image voxel $X$ in the 3D MR image. This estimation requires computing a local maximum of the multivariate function defined by Eq. (4), or equivalently a local minimum of its negative. GPU NLML is the parallel implementation of phase A1. The parallelization of the unconstrained optimization (phase A2) and its integration are left as future work. Details of phase A1 and of the GPU implementation are as follows.

Phase A1 of the NLML algorithm generates the vector $\lambda_X$ comprising the sets $\zeta_{X_I}$ generated for every noisy 3D MR image voxel $X_I$. Here, $\zeta_{X_I}$ represents the set of $k$ closest non-local neighbours of a given voxel $X_I$, where $I = 1, 2, 3, \ldots, P \times Q \times R$ (for a 3D MR image of dimensions $P \times Q \times R$). Hence, the vector $\lambda_X$ is an aggregation of all the sets $\zeta_{X_1}, \zeta_{X_2}, \zeta_{X_3}, \zeta_{X_4}, \ldots, \zeta_{X_{P \times Q \times R}}$. As a result, the size of the $\lambda_X$ vector is $k \times (P \times Q \times R)$, since it holds the $k$ closest non-local neighbours for every voxel in the noisy 3D MR image:

$$\lambda_X = \{\zeta_{X_1}, \zeta_{X_2}, \zeta_{X_3}, \zeta_{X_4}, \ldots, \zeta_{X_{P \times Q \times R}}\} \tag{6}$$

Phase A2 of the NLML algorithm computes the NLML estimate of every noisy image voxel to determine its true underlying intensity. For every noisy image voxel $X_I$, the set $\zeta_{X_I}$ of its $k$ closest non-local neighbours is retrieved from the vector $\lambda_X$. The set $\zeta_{X_I}$ is then used to compute the true underlying intensity $A_X$, i.e. the denoised voxel, through Eq. (3). This is done by maximizing the multivariate function defined by Eq. (4) (computing a local maximum) or, equivalently, by minimizing its negative (computing a local minimum).

The pseudo-code for the serial version of the algorithm is given in Algorithm 1. In the serial implementation, phase A1 of the algorithm was developed in C++ and compiled into a MEX (Matlab executable) file, which was invoked by the executing Matlab program. The output of phase A1, the vector $\lambda_X$, is passed to phase A2 by the executing Matlab program. Phase A2 is implemented through Matlab's 'fminunc' function, which computes a local minimum of the negative of the function defined by Eq. (4). Matlab's optimization toolbox provides several functions for finding the parameters that maximize or minimize objective functions while satisfying constraints.

3.2 Serial NLML: complexity analysis

The NLML algorithm has a very high computational complexity, similar to that of the NL means algorithm, due to the cost of similarity comparison between the neighbourhoods [2]. To reduce the execution time, the search window dimensions are restricted to a smaller value of $m \times m \times m$, and the upper bound for the search window is $N \times N \times N$ ($m \le N$) (image dimensions $P \times Q \times R$ are taken as $N \times N \times N$ for complexity analysis). In phase A1, for each voxel of the image (a total of $N^3$ voxels), calculations with every other voxel within the search window ($m^3$ voxels) are performed to generate the corresponding set of closest non-local neighbours. This results in a time complexity of $O(m^3 N^3)$ for phase A1.

In phase A2, for every voxel $X_I$ in the image, a corresponding denoised voxel is generated using the set $\zeta_{X_I}$ retrieved from the vector $\lambda_X$. This set is used to maximize the likelihood function $L$ (or minimize its negative) as per Eq. (3). The overall complexity of phase A2 is of the order of $O(N^3)$, since NLML estimation is carried out for each voxel in the noisy image. Consequently, the total time complexity of phases A1 and A2 combined is $O(m^3 N^3 + N^3)$, which simplifies to $O(m^3 N^3)$.
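As a rough worked example of the phase A1 term, the number of neighbourhood comparisons can be counted for the $181 \times 217 \times 181$ volume used in Sect. 4. This count ignores border effects, and $N_{\text{vox}}$ is a symbol introduced here for the total number of voxels.

```latex
\begin{align*}
N_{\text{vox}} &= 181 \times 217 \times 181 = 7{,}109{,}137, \\
m = 10: \quad m^3 N_{\text{vox}} &= 10^3 \times 7{,}109{,}137 \approx 7.1 \times 10^{9}, \\
m = 50: \quad m^3 N_{\text{vox}} &= 1.25 \times 10^{5} \times 7{,}109{,}137 \approx 8.9 \times 10^{11}.
\end{align*}
```

Phase A2, by comparison, performs one estimation per voxel, i.e. about $7.1 \times 10^{6}$ estimations regardless of $m$.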

3.3 GPU implementation of the algorithm

As discussed in Sect. 3.1, the sequential algorithm was divided into two phases, A1 and A2. In GPU NLML, phase A1 was developed using CUDA C++ and embedded into Matlab through CUDA MEX (CUDA Matlab executable). Phase A1 generates the vector $\lambda_X$ of size $P \times Q \times R \times k$ (image dimensions are $P \times Q \times R$). This vector contains the set of $k$ closest non-local neighbours of every noisy image voxel. The vector $\lambda_X$ is used by the second stage of the algorithm, phase A2, which executes on the CPU and generates the denoised image by computing the NLML estimate for every noisy image voxel.

Figure 2 illustrates the stages involved in the generation

of the denoised image through the application of GPU-

accelerated NLML estimation algorithm. The modified

algorithm discussed in Sect. 3.1 and pseudo-code presented

in Algorithm 1 builds the foundation for the development

of the GPU NLML estimation algorithm.



3.3.1 Parallelization of phase A1

Parallelization strategies can be employed to improve performance, owing to the inherent data-level parallelism in NLM [15] and NLML. The potential for parallelization lies in phase A1 of the algorithm, where the generation of the vector $\lambda_X$ consists of data-independent calculations for every noisy image voxel $X_I$. The GPU performs the heavy computations of phase A1. The results calculated by the GPU are then returned to the main program executing in the Matlab environment for further processing in phase A2. The steps involved in phase A1 of the GPU implementation of the NLML estimation algorithm, as illustrated in Fig. 2, are described below.

1. The noisy image data are loaded into the main memory (RAM). Next, memory is allocated on the GPU device (using the cudaMalloc() function) for storing the image data and the vector $\lambda_X$. Following this, the image data are copied from the main memory to the GPU memory (using the cudaMemcpy() function).

2. The GPU computes the vector $\lambda_X$ from the noisy image data by applying the parallel algorithm given in Algorithm 2.

3. The computed $\lambda_X$ vector is transferred from the GPU memory to the main memory (using the cudaMemcpy() function). Subsequently, the memory allocated on the GPU is deallocated to prevent memory leaks (using the cudaFree() function).

The GPU NLML implementation of phase A1 is presented in Algorithm 2. GPU NLML is developed using the NVIDIA CUDA C++ framework. A single CUDA thread computes the set $\zeta_X$ and appends it to the vector $\lambda_X$. One thread is launched for each noisy image voxel of the 3D MR image ($N^3$ threads are created). In each thread, the Euclidean distance is computed between the neighbourhood of the target voxel and the neighbourhood of every voxel within the search window of the target voxel. The Euclidean distances are kept sorted in the set $\zeta_X$ of fixed size $k$ using a variant of insertion sort. For this implementation, the value of $k$, i.e. the number of closest non-local neighbours (the size of the set $\zeta_{X_I}$), is set to 25, based on the original algorithm described in He et al. [2]. After all the threads complete execution, the vector $\lambda_X$ contains the sets $\zeta_X$ for every noisy image voxel, i.e. the set of closest non-local neighbours for every noisy image voxel.

Fig. 2 The hybrid CPU–GPU model and the two algorithm phases A1 and A2. Stages shown: transfer of image data from main memory to GPU memory; phase A1, generation of the vector $\lambda_X$ by the GPU (parallel algorithm); transfer of the computed vector to main memory; phase A2, computation of the NLML estimate from the vector on the CPU (serial algorithm); generation of the denoised image

In the CUDA GPU NLML implementation, the $N^3$ threads are organized into a grid of thread blocks. The dimensions of the grid along the x, y and z axes are N/8, N/8 and N/4 blocks, respectively. Each thread block has 8, 8 and 4 threads along the x, y and z axes, respectively. Therefore, the total number of threads per block is 256 ($8 \times 8 \times 4$). This thread block size was chosen because the Nvidia Quadro K2000 GPU used for the experiments shows optimal performance for thread blocks of 256 threads. The optimal number of threads in a thread block depends on the specific GPU used.
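The grid sizing and the mapping from a thread to its voxel can be illustrated in plain C++ as below. This is a sketch under assumptions rather than the paper's kernel code: `Dim3`, `grid_for` and `voxel_of_thread` are hypothetical names, and the grid is rounded up with a bounds check because image dimensions such as 181 are not multiples of the block dimensions.

```cpp
// Hypothetical stand-in for CUDA's dim3 / uint3 triples.
struct Dim3 { int x, y, z; };

// Number of blocks needed to cover `extent` voxels with `block` threads
// along one axis (ceiling division).
int blocks_needed(int extent, int block) {
    return (extent + block - 1) / block;
}

// Grid dimensions (in blocks) that cover the whole image.
Dim3 grid_for(Dim3 image, Dim3 block) {
    return Dim3{blocks_needed(image.x, block.x),
                blocks_needed(image.y, block.y),
                blocks_needed(image.z, block.z)};
}

// Mirrors the usual CUDA index computation blockIdx * blockDim + threadIdx.
// Returns false when the thread falls outside the image and should do no work.
bool voxel_of_thread(Dim3 image, Dim3 block, Dim3 blockIdx, Dim3 threadIdx,
                     Dim3& voxel) {
    voxel = Dim3{blockIdx.x * block.x + threadIdx.x,
                 blockIdx.y * block.y + threadIdx.y,
                 blockIdx.z * block.z + threadIdx.z};
    return voxel.x < image.x && voxel.y < image.y && voxel.z < image.z;
}
```

For a $181 \times 217 \times 181$ image and $8 \times 8 \times 4$ blocks, this gives a grid of $23 \times 28 \times 46$ blocks, i.e. 7,583,744 threads, of which 7,109,137 map to a voxel.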

Following the GPU execution phase, in phase A2 the set $\zeta_X$ is retrieved from the vector $\lambda_X$ for each noisy image voxel, the true underlying intensity is computed and the denoised voxel is generated. Phase A2 has not been implemented on the GPU and is executed in Matlab using the 'fminunc' function, as in serial NLML. Using the GPU implementation of phase A1 and the CPU implementation of phase A2 follows the hybrid CPU–GPU computation model, in which the GPU runs one of the computationally intensive sections of the program, whereas the CPU runs a serial optimization section of the program [9].

3.4 GPU implementation complexity analysis

Phase A1 was parallelized (through the GPU implementation), while phase A2 remained unchanged. In phase A1, a total of $N^3$ threads (image dimensions are $P \times Q \times R$, taken as $N^3$ for complexity analysis) were launched to perform computations. For this complexity analysis, we consider a perfect parallel machine, i.e. a perfect GPU capable of spawning $N^3$ threads and executing them in parallel, in which each thread performs $m^3$ operations (the search window size is $m \times m \times m$). The computational complexity of a single thread is $O(m^3)$. On an ideal parallel machine, the computational complexity of the GPU NLML algorithm would therefore be $O(m^3)$.

In practice, the execution time depends on the number of threads executing in parallel on the GPU. The maximum number of concurrent threads that can run on the GPU hardware at a time is the product of the number of GPU streaming multiprocessors (SMs), the number of resident warps per SM and the number of threads per warp, all of which are determined by the GPU architecture. The computation time on the GPU is the time taken to execute all the thread blocks. The overall execution time includes the memory transfer latencies, grid and thread block management latencies, and the computation time of individual threads.
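The product described above, and the number of successive "waves" of threads it implies, can be written as a small C++ helper. The function names are hypothetical, and the figures used in the usage note (2 SMs, 64 resident warps per SM, 32 threads per warp) are illustrative assumptions that should be checked against the specifications of the actual device.

```cpp
#include <cstdint>

// Maximum number of threads resident on the GPU at one time:
// SM count x resident warps per SM x threads per warp.
std::int64_t max_concurrent_threads(int sm_count, int resident_warps_per_sm,
                                    int threads_per_warp) {
    return static_cast<std::int64_t>(sm_count) * resident_warps_per_sm * threads_per_warp;
}

// Number of successive batches needed to run `total_threads`
// when at most `concurrent` threads are resident at once (ceiling division).
std::int64_t waves_needed(std::int64_t total_threads, std::int64_t concurrent) {
    return (total_threads + concurrent - 1) / concurrent;
}
```

With the assumed figures, `max_concurrent_threads(2, 64, 32)` gives 4096 resident threads, so a launch of 7,583,744 threads would need at least 1852 such batches. This is why the measured GPU time still grows with the image size rather than following the ideal $O(m^3)$ bound.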

The second phase A2 remains unmodified, and hence its time complexity remains $O(N^3)$. Consequently, the time complexity of the overall algorithm is the sum of the time complexities of phases A1 and A2, which is of the order of $O(m^3 + N^3)$. The search window dimensions cannot exceed the image dimensions, so the value of $m$ is always less than or equal to $N$ ($m \le N$). Therefore, the resultant time complexity for a perfectly parallel machine is $O(N^3)$.

3.5 Optimization strategies

One of the noticeable costs in the GPGPU environment is the communication overhead and latency caused by allocating memory on the GPU, transferring data from the main memory to the GPU memory and transferring results from the GPU memory back to the main memory [22]. This drawback is discussed in Sect. 4.3. To reduce the amount of data transferred between the CPU and GPU memories, we use single precision floating point values instead of double precision values (saving 50 % of memory space and communication bandwidth). Though this might result in some loss of accuracy, single precision is adequate to represent the MR grey scale data without any perceptible loss of accuracy.
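The effect of the precision choice on buffer sizes can be estimated with the short C++ helper below. The function name is hypothetical, and the usage note assumes, for illustration only, that both the image and the $\lambda_X$ vector hold one floating point value per entry.

```cpp
#include <cstdint>

// Size in bytes of a buffer holding `values_per_voxel` values of
// `bytes_per_value` bytes for each of `voxels` voxels.
std::uint64_t buffer_bytes(std::uint64_t voxels, std::uint64_t values_per_voxel,
                           std::uint64_t bytes_per_value) {
    return voxels * values_per_voxel * bytes_per_value;
}
```

For the 7,109,137-voxel volume of Sect. 4, the image occupies about 28.4 MB in single precision (4 bytes per value) instead of 56.9 MB in double precision. Under the stated assumption, a $\lambda_X$ vector with $k = 25$ would take about 711 MB in single precision and about 1.42 GB in double precision, which is a large share of the 2 GB of GPU memory listed in Table 1.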

The original algorithm [2] computes the Euclidean distance between the neighbourhood of the target voxel and the neighbourhood of every other voxel within the search window. The results are stored in a large vector from which the $k$ closest non-local neighbours are extracted after sorting. Given the limited GPU memory, this process is inefficient because a vector of size $m^3$ would be required for every noisy image voxel to store the distance between the neighbourhood of the target voxel and the neighbourhood of every other voxel within the search window. Instead, a vector of $k$ elements is used together with a variant of insertion sort, as in the GPU variant of the algorithm, to keep the computed Euclidean distances sorted in ascending order. This approach offers a twofold advantage. Firstly, the GPU memory requirements are reduced, since a vector of size $k$ rather than $m^3$ is required for every voxel in the noisy image. Secondly, insertion sort is efficient for sorting and storing incoming values as they are computed at runtime [23].
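One possible form of such a bounded insertion is sketched below in C++. It is not the paper's Algorithm 2: the `Candidate` type and the function name are hypothetical, and `std::vector` is used for brevity where a GPU thread would more likely use a fixed-size per-thread array.

```cpp
#include <cstddef>
#include <vector>

// A candidate non-local neighbour: its neighbourhood distance to the
// target voxel and its own intensity (hypothetical layout).
struct Candidate {
    float distance;
    float intensity;
};

// Keeps the k smallest-distance candidates seen so far, sorted in
// ascending order of distance. Each call costs O(k) time and the
// buffer never grows beyond k elements.
void insert_bounded(std::vector<Candidate>& best, std::size_t k, Candidate c) {
    if (k == 0) return;
    if (best.size() == k) {
        if (c.distance >= best.back().distance) return;  // not among the k closest
        best.pop_back();                                 // drop the current worst
    }
    std::size_t i = best.size();
    best.push_back(c);
    // Shift larger entries one slot to the right, then place c.
    while (i > 0 && best[i - 1].distance > c.distance) {
        best[i] = best[i - 1];
        --i;
    }
    best[i] = c;
}
```

Calling `insert_bounded` once for each of the $m^3$ voxels in the search window leaves the $k$ closest neighbours in `best`, so no $m^3$-element distance vector is ever stored.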

4 Experiments and results

The experiments pertaining to the conventional NLML

algorithm and the GPU-accelerated NLML algorithm were

performed on a simulated brain database (SBD) acquired

from BrainWeb [17]. The parameters of the 3D MR image

include T1 modality, 3 mm slice thickness, 0 % noise and

20 % intensity non-uniformity (INU). Rician noise with a standard deviation ($\sigma_g$) of 15 was added to the ground truth, and all denoising experiments were conducted on the resultant noisy image. The dimensions of the 3D MR image used for the experiments were $181 \times 217 \times 181$.

The experiments were performed on a system equipped with an Intel Xeon E5-2650 (2.6 GHz clock speed) octa-core processor, 64 GB RAM and an Nvidia Quadro K2000 graphics card. The software included Matlab 2014b, Nvidia CUDA 5.5 and Microsoft Visual Studio 2010. GPU and CPU execution times for generating the $\lambda_X$ vector and for denoised image generation were evaluated for performance comparison. Detailed hardware specifications of the CPU and the GPU are given in Table 1.

The execution times of phases A1 and A2 for both the CPU and GPU variants of the algorithm were measured using the Matlab TIC and TOC routines. The execution times measured for the GPU variant include all the latency incurred by memory operations on the GPU device. The denoising quality of the images generated by the NLML estimation algorithm was evaluated using the peak signal-to-noise ratio (PSNR) image quality metric. PSNR is the ratio of the maximum signal power to the power of the corrupting noise. Mathematically, PSNR is defined as follows:

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right) \tag{7}$$

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2. \tag{8}$$

In Eq. (7), MAX denotes the maximum value that can be attained by a voxel in the 3D MR image. MSE stands for mean squared error, which is the average of the squared differences between the ground truth $Y_i$ (voxel of the ground truth image, i.e. the noise-free source image) and the estimated voxel intensity $\hat{Y}_i$ (voxel of a denoised image generated through the NLML estimation algorithm). In Eq. (8), $N$ denotes the number of voxels.
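Eqs. (7) and (8) translate directly into the following C++ sketch. The function names are illustrative; the two volumes are assumed to be flattened into equally sized, non-empty arrays, and the MSE is assumed to be non-zero when the PSNR is taken.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Mean squared error of Eq. (8) between ground truth and estimate.
// Assumes both vectors have the same, non-zero length.
double mean_squared_error(const std::vector<float>& truth,
                          const std::vector<float>& estimate) {
    double sum = 0.0;
    for (std::size_t i = 0; i < truth.size(); ++i) {
        const double diff = static_cast<double>(truth[i]) - estimate[i];
        sum += diff * diff;
    }
    return sum / static_cast<double>(truth.size());
}

// PSNR of Eq. (7) in dB; max_value is the largest attainable voxel value.
// Assumes the MSE is non-zero (the images are not identical).
double psnr(const std::vector<float>& truth, const std::vector<float>& estimate,
            double max_value) {
    return 10.0 * std::log10(max_value * max_value / mean_squared_error(truth, estimate));
}
```

For example, a ground truth of {0, 0}, an estimate of {1, 1} and a maximum value of 10 give an MSE of 1 and a PSNR of 20 dB.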



Figure 3 shows the results of the denoising experiments after the noisy MR image has been denoised with the CPU and GPU variants of NLML. A search window of dimensions $10 \times 10 \times 10$ and a local (similarity) window of dimensions $3 \times 3 \times 3$ were used for denoising the 3D MR image. One slice from the CPU-generated denoised image and from the GPU-generated denoised image is extracted for comparison and qualitative analysis. The CPU and GPU variants of the algorithm yield identical results and exhibit effective denoising, with improvement in visual quality and PSNR.

4.1 CPU vs GPU performance assessment (generation of $\lambda_X$ vector)

The execution times of phases A1 and A2 for both the CPU and GPU implementations of NLML were measured using the TIC and TOC Matlab routines. This section deals with the execution time of phase A1, i.e. the generation of the $\lambda_X$ vector, by the conventional NLML algorithm and its GPU implementation. Throughout this experiment, the similarity window dimensions were kept at $3 \times 3 \times 3$, whereas the search window dimensions were varied from $10 \times 10 \times 10$ to $50 \times 50 \times 50$. The execution time of the first stage of the algorithm was recorded for these search window dimensions for the conventional NLML algorithm and its CUDA GPU counterpart.

Table 1 CPU and GPU hardware specification

| CPU | Intel Xeon E5-2650 v2 | GPU | Nvidia Quadro K2000 |
| --- | --- | --- | --- |
| Clock speed | 2.6 GHz | Clock speed | 950 MHz |
| Cache memory | 2 MB (L2), 20 MB (L3) | GPU memory | 2 GB |
| Number of cores | 8 | Number of cores | 384 CUDA cores |
| Maximum memory bandwidth | 59.7 GB/s | Memory bandwidth | 64 GB/s |

Fig. 3 A slice of the 3-D MR image. a Ground truth, b image corrupted with Rician noise of standard deviation 15, c CPU-generated denoised image (PSNR = 29.98), d GPU-generated denoised image (PSNR = 29.98)

Table 2 records the execution times of the serial and GPU NLML implementations and the speedup of the GPU version over the serial one. The similarity window dimensions were 3 × 3 × 3 voxels and the search window dimensions were varied from 10 × 10 × 10 to 50 × 50 × 50 voxels. As analysed in Sect. 3.2, the serial execution time grows as $O(m^3 N^3)$. From Table 2, the execution time of the serial implementation ranges from 474.65 to 45,460.93 s for the same 3D MR image. This 96× increase in execution time tracks the 125× increase in search window volume (10 × 10 × 10 to 50 × 50 × 50 voxels). The lower execution time of the GPU NLML, on the other hand, is a result of the parallel execution of threads on the GPU. However, the execution time of the GPU NLML still increases 78× (from 92.72 to 7288.72 s), in proportion to the work performed by a single thread: the number of tasks in a single thread is of the order of $m^3$ (the size of the search window). The steady increase in speedup (from 5.12 to 6.24) with increasing window size can be attributed to the lower rate of increase in the execution time of the GPU version (78×) compared with the serial version (96×).
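The speedup and growth factors quoted in this paragraph can be re-derived directly from the Table 2 timings. The following Python sketch is only an arithmetic check and is not part of the original implementation; the dictionary and variable names are illustrative.

```python
# Phase A1 execution times (s) copied from Table 2.
# Keys are the search window edge length m (the window is m x m x m voxels).
cpu_s = {10: 474.65, 20: 3315.40, 30: 10659.22, 40: 24218.01, 50: 45460.93}
gpu_s = {10: 92.72, 20: 571.14, 30: 1771.54, 40: 3949.17, 50: 7288.72}

# Speedup of the GPU variant over the serial variant for each window size.
speedup = {m: cpu_s[m] / gpu_s[m] for m in cpu_s}

# Growth between the smallest and the largest search window.
window_growth = (50 ** 3) / (10 ** 3)   # growth in search window volume
cpu_growth = cpu_s[50] / cpu_s[10]      # growth in serial execution time
gpu_growth = gpu_s[50] / gpu_s[10]      # growth in GPU execution time

for m in sorted(speedup):
    print(f"{m}^3 window: speedup {speedup[m]:.2f}")
print(f"window volume x{window_growth:.0f}, "
      f"CPU time x{cpu_growth:.1f}, GPU time x{gpu_growth:.1f}")
```

Because the GPU time grows by a smaller factor than the serial time over the same range of windows, the ratio of the two (the speedup) necessarily rises with the window size.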

Figure 4 plots the execution time of phase A1 for the CPU variant and the GPU NLML against the increasing search window dimensions.
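One way to summarise the trend of the Table 2 timings is to estimate the exponent p in a power law t ≈ c·m^p, where m is the search window edge length. The sketch below is an illustrative post hoc analysis rather than something reported in the paper; it fits a least-squares line in log-log space using only the standard library, and the function name `loglog_slope` is hypothetical.

```python
import math

# Search window edge lengths and phase A1 execution times (s) from Table 2.
m_values = [10, 20, 30, 40, 50]
cpu_s = [474.65, 3315.40, 10659.22, 24218.01, 45460.93]
gpu_s = [92.72, 571.14, 1771.54, 3949.17, 7288.72]


def loglog_slope(xs, ys):
    """Return the least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den


cpu_exponent = loglog_slope(m_values, cpu_s)
gpu_exponent = loglog_slope(m_values, gpu_s)
print(f"empirical exponent: CPU {cpu_exponent:.2f}, GPU {gpu_exponent:.2f}")
```

A slope of exactly 3 would correspond to execution time growing strictly with the window volume m^3. Both fitted exponents fall somewhat below 3, and the GPU exponent is the smaller of the two, which is consistent with the rising speedup.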

4.2 CPU vs GPU performance assessment (denoised image generation)

The modified algorithm consists of two phases, A1 and A2. Phase A1 offered an opportunity for parallelization, whereas phase A2 was executed serially on the CPU. The total execution time for the denoised image generation is the sum of the execution times of phases A1 and A2.

Table 2 CPU vs GPU NLML execution time comparison for generation of kX vector (phase A1)

| Local window dimensions | Search window dimensions | Execution time (CPU variant) (s) | Execution time (GPU variant) (s) | Speedup |
| --- | --- | --- | --- | --- |
| 3 × 3 × 3 | 10 × 10 × 10 | 474.65 | 92.72 | 5.12 |
| 3 × 3 × 3 | 20 × 20 × 20 | 3315.40 | 571.14 | 5.80 |
| 3 × 3 × 3 | 30 × 30 × 30 | 10,659.22 | 1771.54 | 6.02 |
| 3 × 3 × 3 | 40 × 40 × 40 | 24,218.01 | 3949.17 | 6.13 |
| 3 × 3 × 3 | 50 × 50 × 50 | 45,460.93 | 7288.72 | 6.24 |

Fig. 4 CPU vs GPU NLML execution time for the generation of kX vector (phase A1)

Table 3 Speedup of GPU NLML over the CPU version for denoised image generation

| Local window dimensions | Search window dimensions | Execution time (CPU variant) (s) | Execution time (GPU variant) (s) | Speedup |
| --- | --- | --- | --- | --- |
| 3 × 3 × 3 | 10 × 10 × 10 | 38,474.65 | 38,092.72 | 1.01 |
| 3 × 3 × 3 | 20 × 20 × 20 | 41,315.4 | 38,571.14 | 1.07 |
| 3 × 3 × 3 | 30 × 30 × 30 | 48,659.22 | 39,771.54 | 1.22 |
| 3 × 3 × 3 | 40 × 40 × 40 | 62,218.01 | 41,949.17 | 1.5 |
| 3 × 3 × 3 | 50 × 50 × 50 | 83,460.93 | 45,288.72 | 1.84 |

Fig. 5 CPU vs GPU NLML execution time for the denoised image generation (phase A1 and A2)

The second phase (A2) of the algorithm generates the denoised voxel for every noisy image voxel by computing the non-local estimate as per Eq. (3). As mentioned in the complexity analysis in Sects. 3.2 and 3.4, the time complexity of phase A2 depends only on the MR image dimensions ($O(N^3)$). The average time required to generate the non-local estimate for every noisy image voxel is 38,000 s for an image of dimensions 181 × 217 × 181 when the value of the parameter k is chosen as 25. Hence, adding this figure to the time required for the generation of the kX vector gives the total time required for the generation of the denoised image (see Table 3).
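The construction of the Table 3 totals described here is simple enough to check numerically. The sketch below adds the stated average phase A2 time of 38,000 s to each phase A1 time from Table 2; it is an arithmetic illustration only, with hypothetical variable names.

```python
# Phase A1 execution times (s) from Table 2, keyed by search window edge length m.
a1_cpu_s = {10: 474.65, 20: 3315.40, 30: 10659.22, 40: 24218.01, 50: 45460.93}
a1_gpu_s = {10: 92.72, 20: 571.14, 30: 1771.54, 40: 3949.17, 50: 7288.72}

# Average phase A2 execution time (s) quoted in the text; A2 runs serially
# on the CPU in both variants, so the same value is added to each.
A2_SECONDS = 38000.0

total_cpu_s = {m: t + A2_SECONDS for m, t in a1_cpu_s.items()}
total_gpu_s = {m: t + A2_SECONDS for m, t in a1_gpu_s.items()}
overall_speedup = {m: total_cpu_s[m] / total_gpu_s[m] for m in total_cpu_s}

for m in sorted(overall_speedup):
    print(f"{m}^3 window: CPU {total_cpu_s[m]:.2f} s, "
          f"GPU {total_gpu_s[m]:.2f} s, speedup {overall_speedup[m]:.2f}")
```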

On average, the second phase A2 of the algorithm takes 38,000 s to complete, whereas the GPU NLML execution times for phase A1 range from 92.72 to 7288.72 s as the search window dimensions are varied from 10 × 10 × 10 to 50 × 50 × 50 voxels. The phase A2 execution time therefore dominates the total NLML execution time and offsets the speedup gained by the GPU implementation of phase A1. For the same range of search windows, the overall speedup ranges from 1.01 to 1.84. This reduction in speedup motivates us to implement the multivariate minimization algorithm of phase A2 on the GPU as future work. For the serial implementation, the total execution time increases 2.16 times (from 38,474.65 to 83,460.93 s) when the search window dimensions are increased from 10 × 10 × 10 to 50 × 50 × 50. Owing to the GPU implementation of phase A1, the corresponding increase in the overall execution time of the GPU NLML is only 1.18 times. This trend is evident in Fig. 5, which plots the total denoised image generation time of the serial NLML and the GPU NLML against the increasing search window dimensions.
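The way the serial phase A2 limits the overall speedup follows the pattern usually described by Amdahl's law, a framing the paper itself does not use. The sketch below applies that standard formula to the 50 × 50 × 50 case, assuming the average A2 time of 38,000 s; the function name `overall_speedup` is hypothetical.

```python
def overall_speedup(t_parallel_s, t_serial_s, phase_speedup):
    """Amdahl-style overall speedup when only one phase is accelerated.

    t_parallel_s:  original time of the phase that is accelerated (A1)
    t_serial_s:    time of the phase that stays serial (A2)
    phase_speedup: speedup achieved on the accelerated phase alone
    """
    original = t_parallel_s + t_serial_s
    accelerated = t_parallel_s / phase_speedup + t_serial_s
    return original / accelerated


A1_CPU_S = 45460.93  # serial phase A1 time for the 50 x 50 x 50 window (Table 2)
A2_S = 38000.0       # average phase A2 time quoted in the text

# Overall speedup with the measured phase A1 speedup of 6.24.
measured = overall_speedup(A1_CPU_S, A2_S, 6.24)

# Limit as the phase A1 time tends to zero: only A2 remains.
upper_bound = (A1_CPU_S + A2_S) / A2_S

print(f"overall speedup at 6.24x on A1: {measured:.2f}")
print(f"upper bound with A2 left serial: {upper_bound:.2f}")
```

Under these assumptions, even an arbitrarily fast phase A1 could not push the overall speedup for this window size much beyond 2, which supports the case for moving phase A2 to the GPU as well.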

4.3 Memory transfers between the RAM and the GPU memory: latency analysis

Table 4 records the average latencies incurred by the GPU memory operations during program execution. These high-precision latency measurements were carried out using CUDA events. An event was recorded immediately before each operation and another as soon as the operation completed. After the events were synchronized, the elapsed time between them was obtained using the cudaEventElapsedTime() routine.

The execution times reported in Sects. 4.1 and 4.2 include the latency of the GPU memory operations. The Quadro K2000 GPU connects to the host's main memory through a PCIe v2.x interface with peak data transfer rates of up to 8 GB/s. For the small image size used in our experiments (181 × 217 × 181), the memory transfer and allocation latencies do not present a significant bottleneck.
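The claim that memory operations are not a bottleneck can be illustrated by comparing the Table 4 latencies with the shortest GPU run in Table 2. The Python sketch below is a back-of-the-envelope check, not a measurement. It reads the two allocation latencies as microseconds, and the bandwidth estimate additionally assumes 4 bytes per voxel, which the paper does not state.

```python
# Average latencies from Table 4, converted to seconds.
# The two allocation figures are read as microseconds (an interpretation
# of the extracted table, not confirmed by the text).
alloc_image_s = 1.52e-6
alloc_vector_s = 1.664e-6
copy_image_to_gpu_s = 9.69e-3
copy_vector_to_host_s = 361.843e-3

memory_overhead_s = (alloc_image_s + alloc_vector_s
                     + copy_image_to_gpu_s + copy_vector_to_host_s)

# Shortest GPU phase A1 run (10 x 10 x 10 search window, Table 2).
shortest_gpu_run_s = 92.72
overhead_fraction = memory_overhead_s / shortest_gpu_run_s

# Effective host-to-device bandwidth for the image copy.
# ASSUMPTION: each voxel is stored in 4 bytes (e.g. single-precision float).
BYTES_PER_VOXEL = 4
voxels = 181 * 217 * 181
effective_gb_per_s = voxels * BYTES_PER_VOXEL / copy_image_to_gpu_s / 1e9

print(f"memory operations: {memory_overhead_s * 1e3:.1f} ms "
      f"({overhead_fraction:.2%} of the shortest GPU run)")
print(f"effective image copy rate (4 B/voxel assumed): "
      f"{effective_gb_per_s:.2f} GB/s")
```

Even against the shortest GPU run, the combined memory operations account for well under 1% of the execution time. Under the 4-byte assumption, the image copy runs at roughly 2.9 GB/s, below the 8 GB/s PCIe peak quoted above.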

5 Conclusion and future work

This paper describes the implementation and analysis of the GPU NLML. The serial NLML is divided into two phases: finding similar pixels for ML estimation, and estimating the true underlying intensity for every noisy image voxel in the 3D MR image. The paper presents the GPU implementation of phase A1 of the NLML algorithm. We observe that the GPU NLML achieves a speedup of 5.12–6.24 for search window sizes from 10 × 10 × 10 to 50 × 50 × 50 voxels. We also show that the PSNR of the resultant denoised image increases with the search window dimensions, since the probability of discovering similar neighbourhoods increases as the search window grows. In an ideal scenario, the search window would be as large as the image itself, i.e. the entire image would be explored to find similar pixels for high-quality denoising results. However, owing to the tremendous execution time and computational requirements, this approach is infeasible. For practical purposes, a threshold is therefore placed on the search window dimensions of the NLML algorithm.

However, the GPU NLML has demonstrated that a real-time implementation of the algorithm could be made practical even for larger search window dimensions through the application of high-performance GPUs. There is further scope for improving the speedup of the GPU NLML by implementing the multivariate minimization algorithm on the GPU; this has been undertaken as an extension to the current work. Future work could consist of parallelizing the entire algorithm, and further research in this direction could result in the deployment of such denoising algorithms in real-time MRI systems. Similar work could also be extended to other prominent denoising algorithms in which mutually exclusive, computationally intensive calculations could be exploited using leading parallel processing frameworks on current-day multiprocessors, coprocessors and GPUs.

Table 4 Average latency for different GPU memory operations

| GPU memory operation | Average execution time (latency) |
| --- | --- |
| GPU memory allocation for image data | 1.52 µs |
| GPU memory allocation for the kX vector | 1.664 µs |
| Transfer of image data from main memory (RAM) to GPU memory | 9.69 ms |
| Transfer of generated kX vector from the GPU memory to the main memory | 361.843 ms |

Adithya H. K. Upadhya received his B.Tech. degree in Computer Engineering from the National Institute of Technology Karnataka (NITK), India, in 2015. His research interests include parallel processing, image processing, data science and algorithms.

Basavaraj Talawar received his M.Tech. in Networking and Internet Engineering from VTU, India, in 2005 and his Ph.D. in Electrical Engineering from the Indian Institute of Science, Bangalore, India, in 2013. At present, he is working as an Assistant Professor at the Department of Computer Science and Engineering at the National Institute of Technology Karnataka (NITK), India. His major areas of interest are Network-on-Chips, Warehouse scale computing and DNA computing.

Jeny Rajan received his M.Tech. in Image Processing from the University of Kerala, India, and his Ph.D. from the University of Antwerp, Belgium. He is currently working as an Assistant Professor at the Department of Computer Science and Engineering, National Institute of Technology Karnataka (NITK), India. Before joining NITK, he was working as a post-doctoral researcher at the Vision Lab, University of Antwerp in Belgium. His main research interests are MRI and Ultrasound image processing.

