Gaussian Image Blurring in CUDA C++
Image Processing: Gaussian Smoothing
201301032 Darshan Parsana
Blurring/smoothing
Mathematically, applying a Gaussian blur to an image is the same as convolving the image with a Gaussian function.
The Gaussian blur is a type of image-blurring filter that uses a Gaussian function for calculating the transformation to apply to each pixel in the image. Gaussian blur takes a weighted average around the pixel, while "normal" blur just averages all the pixels in the radius of the single pixel together.
Gaussian function: G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²)), where σ is the standard deviation that controls the amount of blur.
How does it work? Kernel type: Gaussian.
Complexity = O(N·r·r), where r is the blur radius and N is the total number of pixels.
It is a widely used effect in graphics software, typically to reduce image noise and reduce image detail.
Ref: https://en.wikipedia.org/wiki/Gaussian_blur
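The 5×5 integer mask used in the code later (weights 1 to 41, summing to 273) is roughly a scaled version of these Gaussian weights for σ ≈ 1. As a sketch of how such a mask can be derived from G(x, y), assuming the helper names `gaussian_weight` and `build_gaussian_mask` (not from the slides):

```cpp
#include <cmath>

// G(x, y) = 1/(2*pi*sigma^2) * exp(-(x^2 + y^2) / (2*sigma^2))
double gaussian_weight(int x, int y, double sigma) {
    const double PI = 3.14159265358979323846;
    double s2 = 2.0 * sigma * sigma;
    return std::exp(-(x * x + y * y) / s2) / (PI * s2);
}

// Fill a (2r+1)x(2r+1) mask with Gaussian weights and normalise
// it so the weights sum to 1 (here r = 2 gives a 5x5 mask).
void build_gaussian_mask(double mask[5][5], int r, double sigma) {
    double sum = 0.0;
    for (int i = -r; i <= r; i++)
        for (int j = -r; j <= r; j++)
            sum += (mask[i + r][j + r] = gaussian_weight(i, j, sigma));
    for (int i = 0; i < 2 * r + 1; i++)
        for (int j = 0; j < 2 * r + 1; j++)
            mask[i][j] /= sum;
}
```

The integer mask in the slides simply scales such weights so they can be accumulated in integer arithmetic and divided by their sum (273) at the end.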
Examples: input image and output images blurred with radius 1.2, 2.5, and 5.0 pixels. [Example images omitted from transcript.]
Serial code complexity = O(N·r·r), where N is the total number of pixels. So, in the parallel code we can simply launch one thread per output pixel (as in matrix multiplication).
for (row = 0; row < height; row++) {
    for (col = 0; col < width; col++) {
        int sumX = 0;
        for (i = -filterWidth / 2; i <= filterWidth / 2; i++) {
            for (j = -filterWidth / 2; j <= filterWidth / 2; j++) {
                // clamp neighbour coordinates at the image borders
                int r = min(max(0, row + i), height - 1);
                int c = min(max(0, col + j), width - 1);
                int pixel = input[r][c];
                sumX += pixel * Mx[i + filterWidth / 2][j + filterWidth / 2];
            }
        }
        // 273 is the sum of the 5x5 mask weights (normalisation factor)
        int ans = abs(sumX / 273);
        if (ans > 255) ans = 255;
        if (ans < 0) ans = 0;
        output[row][col] = ans;
    }
}
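The loop above can be packaged as a self-contained function for testing. The function name `gaussian_blur_serial` and the flat `std::vector` image layout are choices made here for illustration; the mask and normalisation factor are from the slides:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// 5x5 Gaussian mask from the slides; its weights sum to 273
static const int Mx[5][5] = { { 1, 4, 7, 4, 1 },  { 4, 16, 26, 16, 4 },
                              { 7, 26, 41, 26, 7 }, { 4, 16, 26, 16, 4 },
                              { 1, 4, 7, 4, 1 } };

std::vector<int> gaussian_blur_serial(const std::vector<int>& input,
                                      int width, int height) {
    std::vector<int> output(input.size());
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            int sumX = 0;
            for (int i = -2; i <= 2; i++) {
                for (int j = -2; j <= 2; j++) {
                    // clamp neighbour coordinates at the image borders
                    int r = std::min(std::max(0, row + i), height - 1);
                    int c = std::min(std::max(0, col + j), width - 1);
                    sumX += input[r * width + c] * Mx[i + 2][j + 2];
                }
            }
            int ans = std::abs(sumX) / 273;   // normalise by the mask sum
            output[row * width + col] = std::min(ans, 255);
        }
    }
    return output;
}
```

A quick sanity check: a constant image stays constant (the weights sum to exactly 273), and a single impulse of value 273 produces the mask's centre weight, 41, at that pixel.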
Serial code
Timings by image size (load vs. convolution, from the chart):

Image size   Load     Convolution
64×64        1.01     0.45
228×221      6.75     4.92
749×912      70.8     90.027
Strategy & Naïve Implementation
Each thread generates a single output pixel. Simple implementation: load the image, launch the kernel, compute the output.
For the optimised version, a block of pixels from the image is loaded into an array in shared memory, and the filter is loaded into constant memory.
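With one thread per output pixel and 16×16 blocks, the grid must round the image dimensions up to whole blocks. A minimal sketch of that host-side arithmetic (the helper name `div_up` is a choice made here, not from the slides):

```cpp
// Ceiling division: how many blocks of size `block` cover `n` elements
int div_up(int n, int block) { return (n + block - 1) / block; }

// For a width x height image with 16x16 thread blocks, the launch
// configuration in CUDA would look like:
//   dim3 block(16, 16);
//   dim3 grid(div_up(width, 16), div_up(height, 16));
//   image<<<grid, block>>>(d_in, d_out, width);
```

For the 228×221 test image this gives a 15×14 grid of blocks, so some threads in the edge blocks fall outside the image and must be guarded against.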
Parallel code (without shared memory). Block size = 16×16.
// One thread per output pixel; assumes a square width x width image
__global__ void image(int *in, int *out, int width)
{
    // 5x5 Gaussian mask (weights sum to 273)
    int Mx[5][5] = { { 1, 4, 7, 4, 1 },  { 4, 16, 26, 16, 4 },
                     { 7, 26, 41, 26, 7 }, { 4, 16, 26, 16, 4 },
                     { 1, 4, 7, 4, 1 } };
    int sumX = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // zero the 2-pixel border, where the full mask does not fit
    if (row < 2 || row >= width - 2 || col < 2 || col >= width - 2) {
        out[row * width + col] = 0;
    } else {
        for (int i = -2; i < 3; i++)
            for (int j = -2; j < 3; j++)
                sumX += in[(row + i) * width + (col + j)] * Mx[i + 2][j + 2];
        // normalise by the mask sum and clamp to the valid pixel range
        int ans = abs(sumX) / 273;
        if (ans > 255) ans = 255;
        if (ans < 0) ans = 0;
        out[row * width + col] = ans;
    }
}
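The per-thread logic can be checked on the CPU by looping over all (row, col) pairs. The emulation below (function name `naive_blur_cpu` chosen here for illustration, with the same square-image assumption as the kernel) zeroes the 2-pixel border and convolves the interior:

```cpp
#include <cstdlib>
#include <vector>

std::vector<int> naive_blur_cpu(const std::vector<int>& in, int width) {
    // 5x5 Gaussian mask (weights sum to 273), as in the kernel
    static const int Mx[5][5] = { { 1, 4, 7, 4, 1 },  { 4, 16, 26, 16, 4 },
                                  { 7, 26, 41, 26, 7 }, { 4, 16, 26, 16, 4 },
                                  { 1, 4, 7, 4, 1 } };
    std::vector<int> out(in.size());
    for (int row = 0; row < width; row++) {
        for (int col = 0; col < width; col++) {
            // border pixels: the full mask does not fit, output zero
            if (row < 2 || row >= width - 2 || col < 2 || col >= width - 2) {
                out[row * width + col] = 0;
                continue;
            }
            int sumX = 0;
            for (int i = -2; i <= 2; i++)
                for (int j = -2; j <= 2; j++)
                    sumX += in[(row + i) * width + (col + j)] * Mx[i + 2][j + 2];
            int ans = std::abs(sumX) / 273;   // normalise by the mask sum
            out[row * width + col] = ans > 255 ? 255 : ans;
        }
    }
    return out;
}
```

On a constant image the interior stays unchanged (the weights sum to 273) while the border is zeroed, which matches the kernel's behaviour pixel for pixel.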
Parallel code (shared): uses constant and shared memory. Tile size = 16×16; the thread block additionally covers an R-pixel apron around the tile, so BLOCK_W = TILE_W + 2·R.
// Gaussian mask in constant memory; must be declared at file scope
__constant__ int Mx[5][5] = { { 1, 4, 7, 4, 1 },  { 4, 16, 26, 16, 4 },
                              { 7, 26, 41, 26, 7 }, { 4, 16, 26, 16, 4 },
                              { 1, 4, 7, 4, 1 } };

// BLOCK_W = TILE_W + 2*R and BLOCK_H = TILE_H + 2*R: each block loads
// its output tile plus an R-pixel apron into shared memory
__global__ void image(int *in, int *out, int width, int height)
{
    __shared__ int smem[BLOCK_W * BLOCK_H];
    int x = blockIdx.x * TILE_W + threadIdx.x - R;
    int y = blockIdx.y * TILE_H + threadIdx.y - R;
    // clamp global coordinates to the image borders
    x = min(max(0, x), width - 1);
    y = min(max(0, y), height - 1);
    unsigned int index = y * width + x;
    unsigned int bindex = threadIdx.y * blockDim.x + threadIdx.x;
    smem[bindex] = in[index];          // cooperative load of the tile
    __syncthreads();
    // only the inner threads (inside the apron) compute an output pixel
    if (threadIdx.x >= R && threadIdx.x < BLOCK_W - R &&
        threadIdx.y >= R && threadIdx.y < BLOCK_H - R) {
        int sum = 0;
        for (int dy = -R; dy <= R; dy++)
            for (int dx = -R; dx <= R; dx++)
                sum += Mx[dy + R][dx + R] * smem[bindex + dy * blockDim.x + dx];
        out[index] = sum / 273;        // 273 = sum of the mask weights
    }
}
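Because of the apron, each block has more threads than output pixels. A quick sketch of that bookkeeping, assuming BLOCK_W = TILE_W + 2·R as the index arithmetic above implies (the `TileConfig` struct is a name chosen here for illustration):

```cpp
// Apron bookkeeping for the tiled kernel: each block loads a
// BLOCK_W x BLOCK_H region (tile plus an R-pixel apron) but only the
// inner TILE_W x TILE_H threads produce output.
struct TileConfig {
    int tile_w, tile_h, r;
    int block_w() const { return tile_w + 2 * r; }
    int block_h() const { return tile_h + 2 * r; }
    // fraction of threads in the block that compute an output pixel
    double useful_fraction() const {
        return double(tile_w * tile_h) / (block_w() * block_h());
    }
};
```

With a 16×16 tile and R = 2, each 20×20 block computes only 256 of its 400 threads' worth of output (64%); the remaining threads exist purely to load the apron.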
Comparison: effect of block/tile size on time (fixed input size 228×221):

Block size   Without shared   Shared
4×4          0.28             0.08
8×8          0.16             0.07
16×16        0.14             0.064
32×32        0.167            0.081
Load vs. convolution time by image size, without shared memory:

Image size   Load     Convolution
64×64        0.03     0.0649
228×221      0.176    0.1453
749×912      1.89     1.93

With shared memory:

Image size   Load     Convolution
64×64        0.03     0.05
228×221      0.181    0.064
749×912      1.88     0.2658
Speed up:
Speedup over the serial convolution, by image size:

Image size   Without shared   Shared
64×64        15               15
228×221      33.86            76.875
749×912      46.65            338.7
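The speedup figures follow directly from the convolution timings in the earlier charts, e.g. serial 90.027 vs. shared-memory 0.2658 for the 749×912 image. A hedged arithmetic check (assuming that pairing of chart values):

```cpp
// Speedup = serial time / parallel time, using the convolution
// timings reported in the charts above.
double speedup(double serial_t, double parallel_t) {
    return serial_t / parallel_t;
}
```

For example, `speedup(90.027, 0.2658)` is about 338.7 and `speedup(4.92, 0.064)` is 76.875, matching the shared-memory entries in the speedup chart.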
From the graph, we can see that the use of shared memory improves performance.
Conclusion
Using shared memory and constant memory, we can get a much larger speedup (here ~10x) than with the naïve implementation.
Thank you