Gaussian Image Blurring in CUDA C++
Image Processing: Gaussian Smoothing
201301032 Darshan Parsana
Blurring/smoothing
Mathematically, applying a Gaussian blur to an image is the same as convolving the image with a Gaussian function.
The Gaussian blur is a type of image-blurring filter that uses a Gaussian function for calculating the transformation to apply to each pixel in the image. Gaussian blur takes a weighted average around the pixel, while "normal" blur just averages all the pixels in the radius of the single pixel together.
Gaussian function: G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²)), where σ is the standard deviation that controls the amount of blur.
How does it work? Kernel type: Gaussian.
Complexity = O(N·r·r), where r is the blur radius and N is the total number of pixels.
It is a widely used effect in graphics software, typically to reduce image noise and reduce image detail.
Ref: https://en.wikipedia.org/wiki/Gaussian_blur
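The 5×5 integer mask used in the code later (weights 1 to 41, summing to 273) is roughly a scaled version of these Gaussian weights for σ ≈ 1. As a sketch of how such a mask can be derived from G(x, y), assuming the helper names `gaussian_weight` and `build_gaussian_mask` (not from the slides):

```cpp
#include <cmath>

// G(x, y) = 1/(2*pi*sigma^2) * exp(-(x^2 + y^2) / (2*sigma^2))
double gaussian_weight(int x, int y, double sigma) {
    const double PI = 3.14159265358979323846;
    double s2 = 2.0 * sigma * sigma;
    return std::exp(-(x * x + y * y) / s2) / (PI * s2);
}

// Fill a (2r+1)x(2r+1) mask with Gaussian weights and normalise
// it so the weights sum to 1 (here r = 2 gives a 5x5 mask).
void build_gaussian_mask(double mask[5][5], int r, double sigma) {
    double sum = 0.0;
    for (int i = -r; i <= r; i++)
        for (int j = -r; j <= r; j++)
            sum += (mask[i + r][j + r] = gaussian_weight(i, j, sigma));
    for (int i = 0; i < 2 * r + 1; i++)
        for (int j = 0; j < 2 * r + 1; j++)
            mask[i][j] /= sum;
}
```

The integer mask in the slides simply scales such weights so they can be accumulated in integer arithmetic and divided by their sum (273) at the end.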
Examples: input image and output images blurred with radius 1.2, 2.5, and 5.0 pixels. [Example images omitted from transcript.]
Serial code complexity = O(N·r·r), where N is the total number of pixels. So, in the parallel code we can simply launch one thread per output pixel (as in matrix multiplication).
for (row = 0; row < height; row++) {
    for (col = 0; col < width; col++) {
        int sumX = 0;
        for (i = -filterWidth / 2; i <= filterWidth / 2; i++) {
            for (j = -filterWidth / 2; j <= filterWidth / 2; j++) {
                // clamp neighbour coordinates at the image borders
                int r = min(max(0, row + i), height - 1);
                int c = min(max(0, col + j), width - 1);
                int pixel = input[r][c];
                sumX += pixel * Mx[i + filterWidth / 2][j + filterWidth / 2];
            }
        }
        // 273 is the sum of the 5x5 mask weights (normalisation factor)
        int ans = abs(sumX / 273);
        if (ans > 255) ans = 255;
        if (ans < 0) ans = 0;
        output[row][col] = ans;
    }
}
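The loop above can be packaged as a self-contained function for testing. The function name `gaussian_blur_serial` and the flat `std::vector` image layout are choices made here for illustration; the mask and normalisation factor are from the slides:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// 5x5 Gaussian mask from the slides; its weights sum to 273
static const int Mx[5][5] = { { 1, 4, 7, 4, 1 },  { 4, 16, 26, 16, 4 },
                              { 7, 26, 41, 26, 7 }, { 4, 16, 26, 16, 4 },
                              { 1, 4, 7, 4, 1 } };

std::vector<int> gaussian_blur_serial(const std::vector<int>& input,
                                      int width, int height) {
    std::vector<int> output(input.size());
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            int sumX = 0;
            for (int i = -2; i <= 2; i++) {
                for (int j = -2; j <= 2; j++) {
                    // clamp neighbour coordinates at the image borders
                    int r = std::min(std::max(0, row + i), height - 1);
                    int c = std::min(std::max(0, col + j), width - 1);
                    sumX += input[r * width + c] * Mx[i + 2][j + 2];
                }
            }
            int ans = std::abs(sumX) / 273;   // normalise by the mask sum
            output[row * width + col] = std::min(ans, 255);
        }
    }
    return output;
}
```

A quick sanity check: a constant image stays constant (the weights sum to exactly 273), and a single impulse of value 273 produces the mask's centre weight, 41, at that pixel.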
Serial code
Timings by image size (load vs. convolution, from the chart):

Image size   Load     Convolution
64×64        1.01     0.45
228×221      6.75     4.92
749×912      70.8     90.027
Strategy & Naïve Implementation
Each thread generates a single output pixel. Simple implementation: load the image, launch the kernel, compute the output.
For the optimised version, a block of pixels from the image is loaded into an array in shared memory, and the filter is loaded into constant memory.
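With one thread per output pixel and 16×16 blocks, the grid must round the image dimensions up to whole blocks. A minimal sketch of that host-side arithmetic (the helper name `div_up` is a choice made here, not from the slides):

```cpp
// Ceiling division: how many blocks of size `block` cover `n` elements
int div_up(int n, int block) { return (n + block - 1) / block; }

// For a width x height image with 16x16 thread blocks, the launch
// configuration in CUDA would look like:
//   dim3 block(16, 16);
//   dim3 grid(div_up(width, 16), div_up(height, 16));
//   image<<<grid, block>>>(d_in, d_out, width);
```

For the 228×221 test image this gives a 15×14 grid of blocks, so some threads in the edge blocks fall outside the image and must be guarded against.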
Parallel code (without shared memory). Block size = 16×16.
// One thread per output pixel; assumes a square width x width image
__global__ void image(int *in, int *out, int width)
{
    // 5x5 Gaussian mask (weights sum to 273)
    int Mx[5][5] = { { 1, 4, 7, 4, 1 },  { 4, 16, 26, 16, 4 },
                     { 7, 26, 41, 26, 7 }, { 4, 16, 26, 16, 4 },
                     { 1, 4, 7, 4, 1 } };
    int sumX = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // zero the 2-pixel border, where the full mask does not fit
    if (row < 2 || row >= width - 2 || col < 2 || col >= width - 2) {
        out[row * width + col] = 0;
    } else {
        for (int i = -2; i < 3; i++)
            for (int j = -2; j < 3; j++)
                sumX += in[(row + i) * width + (col + j)] * Mx[i + 2][j + 2];
        // normalise by the mask sum and clamp to the valid pixel range
        int ans = abs(sumX) / 273;
        if (ans > 255) ans = 255;
        if (ans < 0) ans = 0;
        out[row * width + col] = ans;
    }
}
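The per-thread logic can be checked on the CPU by looping over all (row, col) pairs. The emulation below (function name `naive_blur_cpu` chosen here for illustration, with the same square-image assumption as the kernel) zeroes the 2-pixel border and convolves the interior:

```cpp
#include <cstdlib>
#include <vector>

std::vector<int> naive_blur_cpu(const std::vector<int>& in, int width) {
    // 5x5 Gaussian mask (weights sum to 273), as in the kernel
    static const int Mx[5][5] = { { 1, 4, 7, 4, 1 },  { 4, 16, 26, 16, 4 },
                                  { 7, 26, 41, 26, 7 }, { 4, 16, 26, 16, 4 },
                                  { 1, 4, 7, 4, 1 } };
    std::vector<int> out(in.size());
    for (int row = 0; row < width; row++) {
        for (int col = 0; col < width; col++) {
            // border pixels: the full mask does not fit, output zero
            if (row < 2 || row >= width - 2 || col < 2 || col >= width - 2) {
                out[row * width + col] = 0;
                continue;
            }
            int sumX = 0;
            for (int i = -2; i <= 2; i++)
                for (int j = -2; j <= 2; j++)
                    sumX += in[(row + i) * width + (col + j)] * Mx[i + 2][j + 2];
            int ans = std::abs(sumX) / 273;   // normalise by the mask sum
            out[row * width + col] = ans > 255 ? 255 : ans;
        }
    }
    return out;
}
```

On a constant image the interior stays unchanged (the weights sum to 273) while the border is zeroed, which matches the kernel's behaviour pixel for pixel.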
Parallel code (shared): uses constant and shared memory. Tile size = 16×16; the thread block additionally covers an R-pixel apron around the tile, so BLOCK_W = TILE_W + 2·R.
// Gaussian mask in constant memory; must be declared at file scope
__constant__ int Mx[5][5] = { { 1, 4, 7, 4, 1 },  { 4, 16, 26, 16, 4 },
                              { 7, 26, 41, 26, 7 }, { 4, 16, 26, 16, 4 },
                              { 1, 4, 7, 4, 1 } };

// BLOCK_W = TILE_W + 2*R and BLOCK_H = TILE_H + 2*R: each block loads
// its output tile plus an R-pixel apron into shared memory
__global__ void image(int *in, int *out, int width, int height)
{
    __shared__ int smem[BLOCK_W * BLOCK_H];
    int x = blockIdx.x * TILE_W + threadIdx.x - R;
    int y = blockIdx.y * TILE_H + threadIdx.y - R;
    // clamp global coordinates to the image borders
    x = min(max(0, x), width - 1);
    y = min(max(0, y), height - 1);
    unsigned int index = y * width + x;
    unsigned int bindex = threadIdx.y * blockDim.x + threadIdx.x;
    smem[bindex] = in[index];          // cooperative load of the tile
    __syncthreads();
    // only the inner threads (inside the apron) compute an output pixel
    if (threadIdx.x >= R && threadIdx.x < BLOCK_W - R &&
        threadIdx.y >= R && threadIdx.y < BLOCK_H - R) {
        int sum = 0;
        for (int dy = -R; dy <= R; dy++)
            for (int dx = -R; dx <= R; dx++)
                sum += Mx[dy + R][dx + R] * smem[bindex + dy * blockDim.x + dx];
        out[index] = sum / 273;        // 273 = sum of the mask weights
    }
}
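Because of the apron, each block has more threads than output pixels. A quick sketch of that bookkeeping, assuming BLOCK_W = TILE_W + 2·R as the index arithmetic above implies (the `TileConfig` struct is a name chosen here for illustration):

```cpp
// Apron bookkeeping for the tiled kernel: each block loads a
// BLOCK_W x BLOCK_H region (tile plus an R-pixel apron) but only the
// inner TILE_W x TILE_H threads produce output.
struct TileConfig {
    int tile_w, tile_h, r;
    int block_w() const { return tile_w + 2 * r; }
    int block_h() const { return tile_h + 2 * r; }
    // fraction of threads in the block that compute an output pixel
    double useful_fraction() const {
        return double(tile_w * tile_h) / (block_w() * block_h());
    }
};
```

With a 16×16 tile and R = 2, each 20×20 block computes only 256 of its 400 threads' worth of output (64%); the remaining threads exist purely to load the apron.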
Comparison: effect of block/tile size on time (fixed input size 228×221):

Block size   Without shared   Shared
4×4          0.28             0.08
8×8          0.16             0.07
16×16        0.14             0.064
32×32        0.167            0.081
Load vs. convolution time by image size, without shared memory:

Image size   Load     Convolution
64×64        0.03     0.0649
228×221      0.176    0.1453
749×912      1.89     1.93

With shared memory:

Image size   Load     Convolution
64×64        0.03     0.05
228×221      0.181    0.064
749×912      1.88     0.2658
Speed up:
Speedup over the serial convolution, by image size:

Image size   Without shared   Shared
64×64        15               15
228×221      33.86            76.875
749×912      46.65            338.7
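The speedup figures follow directly from the convolution timings in the earlier charts, e.g. serial 90.027 vs. shared-memory 0.2658 for the 749×912 image. A hedged arithmetic check (assuming that pairing of chart values):

```cpp
// Speedup = serial time / parallel time, using the convolution
// timings reported in the charts above.
double speedup(double serial_t, double parallel_t) {
    return serial_t / parallel_t;
}
```

For example, `speedup(90.027, 0.2658)` is about 338.7 and `speedup(4.92, 0.064)` is 76.875, matching the shared-memory entries in the speedup chart.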
From the graph, we can see that the use of shared memory improves performance.
Conclusion
Using shared memory and constant memory, we can get a much larger speedup (here ~10x) than with the naïve implementation.
Thank you