CUDA ITK Won-Ki Jeong SCI Institute University of Utah.
Transcript:

CUDA ITK
Won-Ki Jeong
SCI Institute
University of Utah
NVIDIA G80
• New architecture for computing on the GPU
  – GPU as a massively parallel multithreaded machine
    • One step further from the streaming model
  – New hardware features
    • Unified shaders (ALUs)
    • Flexible memory access (scatter)
    • Fast user-controllable on-chip memory
    • Integer and bitwise operations
NVIDIA CUDA
• C-extension NVIDIA GPU programming language
  – No graphics API overhead
  – Easy to learn
  – Development tool support
• Extensions / API
  – Function type qualifiers: __global__, __device__, __host__
  – Variable type qualifiers: __shared__, __constant__
  – cudaMalloc(), cudaFree(), cudaMemcpy(), …
  – __syncthreads(), atomicAdd(), …
• Program types
  – Device program (kernel): runs on the GPU
  – Host program: runs on the CPU to launch device programs
CUDA ITK
• ITK powered by CUDA
  – Many registration / image processing functions are still computationally expensive and parallelizable
  – Current ITK parallelization is bound by the number of CPUs (cores)
• Our approach
  – Implement several well-known ITK image filters using NVIDIA CUDA
  – Focus on 3D volume processing
    • CT / MRI datasets are mostly 3D volumes
CUDA ITK
• CUDA code is integrated into ITK
  – Transparent to ITK users
  – No need to modify existing code that uses ITK
• Check environment variable ITK_CUDA
  – Entry point: GenerateData() or ThreadedGenerateData()
  – If ITK_CUDA == 0
    • Execute original ITK code
  – If ITK_CUDA == 1
    • Execute CUDA code
ITK image space filters
• Convolution filters
  – Mean filter
  – Gaussian filter
  – Derivative filter
  – Hessian of Gaussian filter
• Statistical filter
  – Median filter
• PDE-based filter
  – Anisotropic diffusion filter
Speed up using CUDA
• Mean filter: ~140x
• Median filter: ~25x
• Gaussian filter: ~60x
• Anisotropic diffusion: ~70x
Convolution filters
• Separable filter
  – N-dimensional convolution = N 1D convolutions
  – For filter radius $r$, per-pixel cost drops from $O(r^N)$ to $O(Nr)$
• Example
  – 2D Gaussian = 2 1D Gaussians:

$G_\sigma * I = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2+y^2}{2\sigma^2}} * I = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}}\right) * \left(\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{y^2}{2\sigma^2}}\right) * I$
GPU implementation
• Apply 1D convolution along each axis
  – Minimize overlapping
[Figure: the 1D kernel is convolved with tiles staged from the input in global memory into shared memory; results are written to the output in global memory]
Minimize overlapping
• Usually kernel width is large (> 20 taps for a Gaussian)
  – Max block size ~ 8×8×8
  – Each pixel has 6 neighbors in 3D
• Use long and thin blocks to minimize overlapping
[Figure: square blocks reload their halo pixels multiple times (overlap counts of 1, 2, and 4 at edges and corners), while long thin blocks spanning the convolution axis have no overlap]
Median filter
• Viola et al. [VIS 03]
  – Find the median by bisection of histogram bins
  – log₂(# bins) iterations (e.g., 8-bit pixels: 8 iterations)
[Figure: four bisection steps over the intensity histogram (bins 0–7), halving the candidate range around the median at each step]
Pseudo code (GPU median filter)

    copy current block from global to shared memory
    min = 0;
    max = 255;
    pivot = (min + max) / 2.0f;
    for (i = 0; i < 8; i++) {
        count = 0;
        for (j = 0; j < kernelsize; j++) {
            if (kernel[j] > pivot)
                count++;
        }
        if (count < kernelsize / 2)
            max = floor(pivot);
        else
            min = ceil(pivot);
        pivot = (min + max) / 2.0f;
    }
    return floor(pivot);
Perona & Malik anisotropic PDE
• Nonlinear diffusion
  – Fall-off function $c$ (conductance) controls anisotropy
  – Less smoothing across high gradients
  – Contrast parameter $k$

$\frac{\partial I}{\partial t} = \nabla \cdot \left(c(|\nabla I|)\,\nabla I\right), \qquad c(x) = e^{-\frac{x^2}{k^2}}$

• Numerical solution
  – Euler explicit integration (iterative method)
  – Finite differences for derivative computation
Gradient & conductance map
• Half x / y / z direction gradients / conductance for each pixel
• 2D example
  – For an n×n block, 4(n+1)² + (n+2)² shared memory entries are required
[Figure: an (n+2)×(n+2) input tile is read from global memory; shared memory holds four (n+1)×(n+1) arrays (grad x, grad y, cond x, cond y); the n×n result is written back to global memory]
Euler integration
• Use pre-computed gradients and conductance
  – Each gradient / conductance is used twice
  – Avoid redundant computation by using the pre-computed gradient / conductance map

$I_{new}(i,j) = I_{old}(i,j) + dt \cdot \nabla\cdot\left(c(|\nabla I|)\,\nabla I\right)$
$\phantom{I_{new}(i,j)} = I_{old}(i,j) + dt \cdot \big(c(i{+}1,j)\,g(i{+}1,j) - c(i{-}1,j)\,g(i{-}1,j) + c(i,j{+}1)\,g(i,j{+}1) - c(i,j{-}1)\,g(i,j{-}1)\big)$
Experiments
• Test environment
  – CPU: AMD Opteron dual-core, 1.8 GHz
  – GPU: NVIDIA Tesla C870
• Input volume is 128³
Result
• Mean filter

| Kernel size | 3 | 5 | 7 | 9 |
| --- | --- | --- | --- | --- |
| ITK | 1.03 | 2.13 | 7.17 | 18.5 |
| CUDA | 0.0705 | 0.05 | 0.08 | 0.132 |
| Speed up | 13 | 41 | 86 | 140 |

• Gaussian filter

| Variance | 1 | 2 | 4 | 8 |
| --- | --- | --- | --- | --- |
| ITK | 0.773 | 1.07 | 1.36 | 2.12 |
| CUDA | 0.0279 | 0.0316 | 0.0317 | 0.0327 |
| Speed up | 27 | 33 | 42 | 64 |
Result
• Median filter

| Kernel size | 3 | 5 | 7 | 9 |
| --- | --- | --- | --- | --- |
| ITK | 1.03 | 4.18 | 14.1 | 23.1 |
| CUDA | 0.0705 | 0.232 | 0.544 | 1.07 |
| Speed up | 14 | 18 | 25 | 21 |

• Anisotropic diffusion

| Iteration | 2 | 4 | 8 | 16 |
| --- | --- | --- | --- | --- |
| ITK | 3.21 | 6.37 | 12.7 | 25.5 |
| CUDA | 0.0715 | 0.106 | 0.172 | 0.306 |
| Speed up | 44 | 60 | 73 | 83 |
Summary
• ITK powered by CUDA
  – Image space filters using CUDA
  – Up to 140x speed up
• Future work
  – GPU image class for ITK
    • Reduce CPU-to-GPU memory I/O
    • Pipelining support
  – Image registration
  – Numerical library (vnl)
  – Out-of-GPU-core processing
    • Seismic volumes (~10s to 100s of GB)
Questions?