CUDA ITK. Won-Ki Jeong, SCI Institute, University of Utah.


Transcript of "CUDA ITK" by Won-Ki Jeong, SCI Institute, University of Utah.

Page 1:

CUDA ITK

Won-Ki Jeong

SCI Institute

University of Utah

Page 2:

NVIDIA G80
• New architecture for computing on the GPU
  – GPU as massively parallel multithreaded machine
    • One step further from the streaming model
  – New hardware features
    • Unified shaders (ALUs)
    • Flexible memory access (scatter)
    • Fast user-controllable on-chip memory
    • Integer, bitwise operations

Page 3:

NVIDIA CUDA
• C-extension NVIDIA GPU programming language
  – No graphics API overhead
  – Easy to learn
  – Development tool support
• Extensions / API
  – Function types: __global__, __device__, __host__
  – Variable types: __shared__, __constant__
  – cudaMalloc(), cudaFree(), cudaMemcpy(), …
  – __syncthreads(), atomicAdd(), …
• Program types (see the sketch below)
  – Device program (kernel): runs on the GPU
  – Host program: runs on the CPU to call device programs
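
To make these pieces concrete, here is a minimal CUDA sketch (not from the talk; the kernel and buffer names are made up) showing a __global__ device program, the cudaMalloc / cudaMemcpy / cudaFree API, and a host program that launches the kernel:

  // Minimal CUDA sketch (hypothetical example, not from the talk).
  #include <cuda_runtime.h>
  #include <cstdio>
  #include <cstdlib>

  // Device program (kernel): runs on the GPU, one thread per element.
  __global__ void scale(float *data, float factor, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          data[i] *= factor;
  }

  // Host program: runs on the CPU and calls the device program.
  int main()
  {
      const int n = 1 << 20;
      float *h = (float *)malloc(n * sizeof(float));
      for (int i = 0; i < n; i++) h[i] = 1.0f;

      float *d;
      cudaMalloc(&d, n * sizeof(float));                            // allocate GPU memory
      cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // CPU -> GPU

      scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                  // kernel launch

      cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // GPU -> CPU
      printf("h[0] = %f\n", h[0]);                                  // prints 2.000000

      cudaFree(d);
      free(h);
      return 0;
  }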

Page 4:

CUDA ITK
• ITK powered by CUDA
  – Many registration / image processing functions are still computationally expensive and parallelizable
  – Current ITK parallelization is bound by the number of CPUs (cores)
• Our approach
  – Implement several well-known ITK image filters using NVIDIA CUDA
  – Focus on 3D volume processing
    • CT / MRI datasets are mostly 3D volumes

Page 5:

CUDA ITK
• CUDA code is integrated into ITK
  – Transparent to ITK users
  – No need to modify existing code that uses ITK
• Check the environment variable ITK_CUDA (see the sketch below)
  – Entry point: GenerateData() or ThreadedGenerateData()
  – If ITK_CUDA == 0
    • Execute the original ITK code
  – If ITK_CUDA == 1
    • Execute the CUDA code
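
As a rough sketch of that dispatch (the filter class and method names here are hypothetical, not the actual CUDA ITK sources), GenerateData() might branch like this:

  // Hypothetical sketch of the ITK_CUDA dispatch inside a filter.
  #include <cstdlib>

  void MyCudaMeanFilter::GenerateData()
  {
      const char *flag = std::getenv("ITK_CUDA");
      if (flag && std::atoi(flag) == 1)
          this->GenerateDataCUDA();   // CUDA implementation (hypothetical name)
      else
          this->GenerateDataCPU();    // original ITK code path (hypothetical name)
  }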

Page 6:

ITK image space filters
• Convolution filters
  – Mean filter
  – Gaussian filter
  – Derivative filter
  – Hessian of Gaussian filter
• Statistical filter
  – Median filter
• PDE-based filter
  – Anisotropic diffusion filter

Page 7:

Speed up using CUDA
• Mean filter: ~140x
• Median filter: ~25x
• Gaussian filter: ~60x
• Anisotropic diffusion: ~70x

Page 8:

Convolution filters
• Separable filter
  – N-dimensional convolution = N × 1D convolutions
  – For filter radius r, the per-pixel cost drops from O(r^N) to O(N·r)
• Example
  – 2D Gaussian = 2 × 1D Gaussians:

$$G \otimes I = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2+y^2}{2\sigma^2}} \otimes I = \left( \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}} \right) \otimes \left( \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{y^2}{2\sigma^2}} \otimes I \right)$$

$$O(r^N) \;\rightarrow\; O(N\,r)$$

Page 9:

GPU implementation
• Apply 1D convolution along each axis (see the kernel sketch below)
  – Minimize overlapping

[Figure: one 1D convolution pass; a block of the input is staged from global memory into shared memory, convolved with the kernel, and written to the output in global memory]
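
A minimal sketch of one such axis pass (here the x axis of a 2D slice; the radius, block size, and all names are assumptions, not the talk's code) that stages each block plus its halo into shared memory:

  // Sketch of a 1D convolution pass along x (hypothetical names and sizes).
  #define R 10              // filter radius (assumed)
  #define BLOCK 256         // threads per block, laid out along x

  __constant__ float c_kernel[2 * R + 1];   // filter weights in constant memory

  __global__ void convolveX(const float *in, float *out, int width)
  {
      __shared__ float tile[BLOCK + 2 * R];   // block plus halo on each side

      int x   = blockIdx.x * BLOCK + threadIdx.x;   // column index
      int row = blockIdx.y * width;                 // one image row per blockIdx.y

      // Stage the block and its halo into shared memory (clamped at borders).
      tile[threadIdx.x + R] = in[row + min(x, width - 1)];
      if (threadIdx.x < R) {
          tile[threadIdx.x]             = in[row + max(x - R, 0)];
          tile[threadIdx.x + BLOCK + R] = in[row + min(x + BLOCK, width - 1)];
      }
      __syncthreads();

      if (x < width) {
          float sum = 0.0f;
          for (int k = -R; k <= R; k++)
              sum += c_kernel[k + R] * tile[threadIdx.x + R + k];
          out[row + x] = sum;
      }
  }

Launched per 2D slice as convolveX<<<dim3((width + BLOCK - 1) / BLOCK, height), BLOCK>>>(d_in, d_out, width) after copying the weights with cudaMemcpyToSymbol(c_kernel, h_kernel, (2 * R + 1) * sizeof(float)); the y and z passes are analogous but index along the other axes.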

Page 10:

Minimize overlapping
• Usually the kernel width is large (> 20 for Gaussian)
  – Max block size ~ 8x8x8
  – Each pixel has 6 neighbors in 3D
• Use long and thin blocks to minimize overlapping

[Figure: with compact blocks each halo pixel is fetched by several blocks (multiple overlapping); long, thin blocks along the filter axis fetch each pixel once (no overlapping)]

Page 11:

Median filter
• Viola et al. [VIS 03]
  – Find the median by bisection of histogram bins
  – log(# bins) iterations (e.g., 8-bit pixels: 8 iterations)

[Figure: worked example of four bisection steps on a sample pixel window, halving the candidate intensity range at each step]

Page 12:

Pseudo code (GPU median filter)

  // Copy the current block from global to shared memory first.
  min = 0; max = 255;
  pivot = (min + max) / 2.0f;
  for (i = 0; i < 8; i++) {               // log2(256) bisection steps
      count = 0;
      for (j = 0; j < kernelsize; j++) {
          if (kernel[j] > pivot) count++; // count values above the pivot
      }
      if (count < kernelsize / 2)
          max = floor(pivot);             // median lies at or below the pivot
      else
          min = ceil(pivot);              // median lies above the pivot
      pivot = (min + max) / 2.0f;
  }
  return floor(pivot);
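
Typed out as a compilable device function (a sketch under the assumption that the window values have already been gathered, e.g. from shared memory; the names are hypothetical):

  // Hypothetical device function: median of n 8-bit pixel values already
  // gathered into 'window' (e.g., from shared memory).
  __device__ unsigned char median8(const unsigned char *window, int n)
  {
      float lo = 0.0f, hi = 255.0f;
      float pivot = 0.5f * (lo + hi);
      for (int i = 0; i < 8; i++) {            // log2(256) bisection steps
          int count = 0;
          for (int j = 0; j < n; j++)
              if (window[j] > pivot) count++;  // values above the pivot
          if (count < n / 2) hi = floorf(pivot);   // median in lower half
          else               lo = ceilf(pivot);    // median in upper half
          pivot = 0.5f * (lo + hi);
      }
      return (unsigned char)floorf(pivot);
  }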

Page 13:

Perona & Malik anisotropic PDE
• Nonlinear diffusion
  – Fall-off function c (conductance) controls anisotropy
  – Less smoothing across high gradients
  – Contrast parameter k
• Numerical solution
  – Explicit Euler integration (iterative method)
  – Finite differences for derivative computation

$$\frac{\partial I}{\partial t} = \nabla \cdot \big( c(\|\nabla I\|)\, \nabla I \big), \qquad c(x) = e^{-\frac{x^2}{k^2}}$$

Page 14:

Gradient & Conductance map
• Half x / y / z direction gradients and conductances for each pixel
• 2D example
  – For an n x n block, 4(n+1)^2 + (n+2)^2 shared memory entries are required

[Figure: an (n+2) x (n+2) input tile is read from global memory into shared memory, which also holds (n+1) x (n+1) entries for each of grad x, grad y, cond x, cond y; the n x n result block is written back to global memory]

Page 15:

Euler integration
• Use pre-computed gradients and conductances
  – Each gradient / conductance value is used twice (by the two pixels it lies between)
  – Avoid redundant computation by using the pre-computed gradient / conductance map (see the kernel sketch below)

$$I^{new}(i,j) = I^{old}(i,j) + dt \,\big( \nabla \cdot ( c(\|\nabla I\|)\, \nabla I ) \big)$$
$$\phantom{I^{new}(i,j)} = I^{old}(i,j) + dt \,\big( c(i{+}1,j)\, g(i{+}1,j) + c(i{-}1,j)\, g(i{-}1,j) + c(i,j{+}1)\, g(i,j{+}1) + c(i,j{-}1)\, g(i,j{-}1) \big)$$

where g(·) is the half-direction gradient from (i,j) toward the indicated neighbor.
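
A 2D sketch of this update as a CUDA kernel; the map layout and names are assumptions (gx/cx hold the half-gradient and conductance between pixel (i,j) and (i+1,j), gy/cy between (i,j) and (i,j+1)):

  // Hypothetical 2D Euler update using precomputed gradient/conductance maps.
  __global__ void eulerStep(float *img, const float *gx, const float *cx,
                            const float *gy, const float *cy,
                            int w, int h, float dt)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      int j = blockIdx.y * blockDim.y + threadIdx.y;
      if (i < 1 || j < 1 || i >= w - 1 || j >= h - 1) return;  // skip borders

      int idx = j * w + i;
      // Discrete divergence of c * grad(I). Each half-gradient/conductance
      // pair is shared with a neighboring pixel, so it is computed only once;
      // the left/down terms get a minus sign because the stored values point
      // in the +x / +y direction.
      float div = cx[idx] * gx[idx] - cx[idx - 1] * gx[idx - 1]
                + cy[idx] * gy[idx] - cy[idx - w] * gy[idx - w];
      img[idx] += dt * div;   // explicit Euler step; maps come from the old image
  }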

Page 16:

Experiments
• Test environment
  – CPU: AMD Opteron dual-core, 1.8 GHz
  – GPU: NVIDIA Tesla C870
• Input volume: 128^3

Page 17:

Result
• Mean filter

  Kernel size      3        5       7       9
  ITK (sec)        1.03     2.13    7.17    18.5
  CUDA (sec)       0.0705   0.05    0.08    0.132
  Speed up         13x      41x     86x     140x

• Gaussian filter

  Variance         1        2       4       8
  ITK (sec)        0.773    1.07    1.36    2.12
  CUDA (sec)       0.0279   0.0316  0.0317  0.0327
  Speed up         27x      33x     42x     64x

Page 18:

Result
• Median filter

  Kernel size      3        5       7       9
  ITK (sec)        1.03     4.18    14.1    23.1
  CUDA (sec)       0.0705   0.232   0.544   1.07
  Speed up         14x      18x     25x     21x

• Anisotropic diffusion

  Iterations       2        4       8       16
  ITK (sec)        3.21     6.37    12.7    25.5
  CUDA (sec)       0.0715   0.106   0.172   0.306
  Speed up         44x      60x     73x     83x

Page 19:

Summary
• ITK powered by CUDA
  – Image space filters implemented in CUDA
  – Up to 140x speed up
• Future work
  – GPU image class for ITK
    • Reduce CPU-to-GPU memory I/O
    • Pipelining support
  – Image registration
  – Numerical library (vnl)
  – Out-of-GPU-core processing
    • Seismic volumes (~10s to 100s of GB)

Page 20:

Questions?