Introduction to Productive GPU Programming | GTC...

Introduction to Productive GPU Programming

Umar Arshad

ArrayFire

● World’s leading GPU experts
  ○ In the industry since 2007
  ○ NVIDIA Partner

● Deep experience working with thousands of customers
  ○ Analysis
  ○ Acceleration
  ○ Algorithm development

● GPU Training
  ○ Hands-on course with a CUDA engineer
  ○ Customized to meet your needs

Productivity

● Software Development
  ○ Development Costs
  ○ Features
  ○ User Experience

● Limited resources
● Tools and Libraries
  ○ Reduce R&D costs
  ○ Lower testing and deployment time
  ○ More time for features

GPU Libraries

● Programmed by GPU experts
  ○ Years of experience

● Abstract low level details
● Target multiple architectures
  ○ Some kernels might run better on older hardware

● Free improvements on new hardware
  ○ Update to the latest version

● No need to reinvent the wheel

Library Types

● Specialized GPU Libs
  ○ Targeted at a specific set of operations
  ○ C interface
  ○ Raw pointer interface

● General GPU Libs
  ○ Manage GPU resources using containers
  ○ Targeted for general computation
  ○ Higher level functions
  ○ C++ interface

Specialized GPU Libraries

● Fast Fourier Transforms
  ○ cuFFT

● Random Number Generation
  ○ cuRAND

● Linear Algebra
  ○ cuBLAS
  ○ CULA Tools
  ○ MAGMA

● Signal and Image Processing
  ○ NPP

Specialized GPU Libraries

● C Interface
  ○ Use pointers to reference data

● Do not manage memory
● Mimic existing libraries
  ○ cuBLAS ≈ BLAS
  ○ CULA ≈ BLAS + LAPACK
  ○ cuFFT ≈ FFTW

● Minimizes the amount of code necessary to integrate into existing projects

cuFFT

● 1D, 2D and 3D transforms
● Both real and complex data types supported
● Single and double-precision supported
● Batch execution for multiple transforms
● Available as part of the CUDA Toolkit

cuRAND

● Bulk random number generation on the GPU
● Usable from both host code and kernels
● Single and double precision support
● Four different RNG algorithms
  ○ MRG32k3a
  ○ MTGP Mersenne Twister
  ○ XORWOW
  ○ Sobol' quasi-RNG

● Multiple RNG distributions (uniform, normal, log-normal, Poisson)

cuBLAS, CULA and MAGMA

● Support most popular linear algebra routines
● Real and complex data types support
● Single and double-precision support

NPP

● Signal and image processing functions
● Avoid unnecessary data copies
  ○ Can process data that is already on the GPU
  ○ Keeps processed data on GPU for further processing

● Arithmetic and logical operations
● Color conversions
● Filtering
● Geometric transforms
● Statistical functions

General-Purpose GPU Libraries

● Thrust
● OpenCV
● ArrayFire

Thrust

● GPU library resembling C++ STL
  ○ STL-like data structures
  ○ Iterators
  ○ Fully interoperable with CUDA C

● Parallel vector operation methods
  ○ Reductions
  ○ Sorting
  ○ Prefix-Sum

● Customizable GPU kernels using functors

Thrust - Data Structures

● Two types of containers
  ○ host_vector: stores data on the host
  ○ device_vector: stores data on the device

● Supports same data types as C++
  ○ host_vector<float> foo(2e6, 4);
  ○ device_vector<double> bar(2e6);

● Explicit data transfer
  ○ bar = foo;

Thrust iterators

● Like the C++ STL, Thrust uses iterators to define the range an operation works on

● Thrust functions require a begin and an end iterator
  ○ The begin iterator points to the first element of the range
  ○ The end iterator points one past the last element of the range

int sum = thrust::reduce(hdata.begin(), hdata.end()); // sum of the hdata vector
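The begin/end convention is the same half-open range used by the C++ standard library. The sketch below illustrates it on the host, with std::accumulate standing in for thrust::reduce so that it runs without a GPU. The helper name sum_range and its index parameters are hypothetical, chosen only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sum the elements of v in the half-open index range [first, last).
// std::accumulate stands in for thrust::reduce (an assumption made so the
// sketch runs on the host only).
int sum_range(const std::vector<int>& v, std::size_t first, std::size_t last)
{
    assert(first <= last && last <= v.size());
    const auto begin = v.begin() + static_cast<std::ptrdiff_t>(first);
    // "end" is one past the final element that gets summed.
    const auto end = v.begin() + static_cast<std::ptrdiff_t>(last);
    return std::accumulate(begin, end, 0);
}
```

Passing the same index for first and last gives an empty range, which sums to 0.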

Thrust Functions

● Thrust includes many basic algorithms for general computation
  ○ Reductions
  ○ Sorting
  ○ Prefix-sum
  ○ Scan
  ○ Reordering
  ○ Transformation
  ○ Generation
  ○ Random number generation
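Prefix-sum and scan name the same family of operations. As a rough host-side reference for what Thrust's inclusive_scan and exclusive_scan compute, the sketch below uses only the standard library. The function names ending in _host are hypothetical, and the caller-supplied output vector mirrors how the Thrust examples in this deck write into a preallocated z.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive prefix sum: out[i] = in[0] + ... + in[i].
void inclusive_scan_host(const std::vector<int>& in, std::vector<int>& out)
{
    assert(out.size() == in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
}

// Exclusive prefix sum: out[i] = in[0] + ... + in[i - 1], with out[0] = 0.
void exclusive_scan_host(const std::vector<int>& in, std::vector<int>& out)
{
    assert(out.size() == in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // sum of everything before position i
        running += in[i];
    }
}
```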

Thrust Functions

● Many Thrust functions can be altered using built-in function objects or custom functions

● Basic operations
  ○ plus, minus, multiplies, etc.

● Custom Functors
  ○ Create your own function objects
  ○ Overload the operator() function
    ■ decorated with __host__ and __device__

Thrust Functors

struct my_plus
{
    __host__ __device__
    float operator()(const float& x, const float& y) const
    {
        return x + y;
    }
};

void thrust_functor_example()
{
    // Define input vectors x and y and output vector z with equal lengths
    ...

    // z <- x + y, first with the custom functor, then with the built-in one
    thrust::transform(x.begin(), x.end(), y.begin(), z.begin(), my_plus());
    thrust::transform(x.begin(), x.end(), y.begin(), z.begin(), thrust::plus<float>());
}

Thrust Functors

● Caveats
  ○ Cannot use shared memory
  ○ Do not have control over block/grid size
  ○ No stream support

OpenCV (GPU)

● Manipulate matrices
● Perform complex image processing operations
  ○ Image filtering
  ○ Resizing

● Dozens of available computer vision algorithms
  ○ Object recognition
  ○ Human detection

OpenCV (GPU) - Data Structures

● GpuMat container to store data
● Signed/unsigned 8, 16 and 32 bit integers, single and double-precision floating-point numbers
● Multi-channel support
● Only 2D matrices - no arbitrary dimensions

OpenCV (GPU) - Usage Example

void opencv_gpu_example()
{
    float f[4] = {100, 200, 400, 800};
    cv::Mat h(4, 1, CV_32F, f);       // 4 x 1 host matrix that wraps f

    cv::gpu::GpuMat d(h);             // uploads h to the device
    cv::Scalar sum = cv::gpu::sum(d); // sum of all elements of d
}

ArrayFire

● Hundreds of parallel functions
  ○ Targeting image processing, machine learning, etc.

● Support for multiple languages
  ○ C/C++, Fortran, Java and R

● Linux, Windows, Mac OS X
● OpenGL based graphics
● Based around one data structure
● JIT
  ○ Combine multiple operations into one kernel

● GFOR, the only data parallel loop

ArrayFire Functions

● Hundreds of parallel functions
  ○ Building blocks
    ■ Reductions
    ■ Scan
    ■ Set operations
    ■ Sorting
    ■ Statistics
    ■ Basic matrix manipulation

ArrayFire Functions

● Hundreds of parallel functions
  ○ Signal/image processing
    ■ Convolution
    ■ FFT
    ■ Histograms
    ■ Interpolation
    ■ Connected components
  ○ Linear Algebra
    ■ Matrix multiply
    ■ Linear system solving
    ■ Factorization

ArrayFire - Data Structures

● Built around a flexible data structure named "array"
  ○ Lightweight wrapper around the data on the compute device
  ○ Manages the data and basic metadata such as size, type and dimensions

● You can transfer data into an array object using one of its constructors

float hA[] = {0, 1, 2, 3, 4, 5};
array A(2, 3, hA);

ArrayFire - Indexing

#include <arrayfire.h>
#include <af/utils.h> // required for print()

void af_example()
{
    float f[4] = {100, 200, 400, 800};
    array a(4, f);    // 4 rows x 1 col array initialized with f values
    array b = sum(a); // performs reduce-sum over all elements of a
}

Case Study 1 — Dot Product

● Have two N-dimensional vectors, x and y
● Multiply the corresponding components of x and y, for dimensions 1 to N
● Sum up all N results
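As a serial baseline for the GPU versions of this case study, the steps above can be sketched on the host with std::inner_product, which multiplies element pairs and accumulates the products. The function name dot_product is hypothetical, and double precision is chosen here only to keep the reference result accurate.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Serial reference: sum over i of x[i] * y[i].
double dot_product(const std::vector<double>& x, const std::vector<double>& y)
{
    assert(x.size() == y.size());
    // The initial value 0.0 also fixes the accumulator type to double.
    return std::inner_product(x.begin(), x.end(), y.begin(), 0.0);
}
```

For example, x = {1, 2, 3} and y = {4, 5, 6} give 1*4 + 2*5 + 3*6 = 32.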

Case Study 1 — Dot Product — Thrust

double dotproduct_thrust()
{
    // float vectors in device memory
    thrust::device_vector<float> x(samples);
    thrust::device_vector<float> y(samples);
    thrust::device_vector<float> z(samples);

    // generate sequences starting from 0.0f, with steps of 1.0f (x) and 0.001f (y)
    thrust::sequence(x.begin(), x.end(), 0.f, 1.f);
    thrust::sequence(y.begin(), y.end(), 0.f, 0.001f);

    // multiplies vectors x and y, storing result in z
    thrust::transform(x.begin(), x.end(), y.begin(), z.begin(), thrust::multiplies<float>());

    // returns sum-reduction of z
    return thrust::reduce(z.begin(), z.end());
}

Case Study 1 — Dot Product — ArrayFire

static double dotproduct_af()
{
    // array in device memory, set to sequence {0.0f, 1.0f, 2.0f, ..., samples-1}
    array x = seq(samples);
    // array in device memory, set to sequence {0.0f, 1.0f, 2.0f, ..., samples-1} * 0.001f
    array y(seq(samples)*0.001f);

    // multiplies x and y and returns sum-reduction of resulting vector
    return sum<float>(x*y);
}

Case Study 2 — Pi Estimation

● Generate millions of uniformly distributed random samples
● Each sample will include x and y coordinates
● Estimate the fraction of samples that fall within the unit circle; with coordinates drawn from [0, 1), this fraction approximates π/4
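A serial host-side sketch of the same estimate may help make the method concrete. It assumes coordinates drawn uniformly from [0, 1), so the samples cover the unit square and the quarter circle inside it has area π/4. The function name estimate_pi and the explicit seed parameter are hypothetical, added so that runs are repeatable.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Monte Carlo estimate of pi from `samples` points in the unit square.
double estimate_pi(std::size_t samples, unsigned seed)
{
    assert(samples > 0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);

    std::size_t inside = 0;
    for (std::size_t i = 0; i < samples; ++i) {
        const double x = uniform(gen);
        const double y = uniform(gen);
        if (x * x + y * y <= 1.0) ++inside;   // point lies in the quarter circle
    }
    // inside / samples approximates pi / 4.
    return 4.0 * static_cast<double>(inside) / static_cast<double>(samples);
}
```

The statistical error shrinks roughly with the square root of the sample count, which is why the slide calls for millions of samples.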

Case Study 2 — Pi Estimation — ArrayFire

double pi_af()
{
    array x = randu(samples, f32), y = randu(samples, f32);
    return 4 * sum<float>(x*x + y*y <= 1) / samples;
}

Speedups With ArrayFire

Field                      Application                  Speedup

Academia                   Power Systems Simulations    35x
Finance                    Option Pricing               52x
Government                 Radar Image Formation        45x
Life Sciences              Pathology Advances           > 100x
Manufacturing              Tomography of Vegetation     10x
Media & Computer Vision    Digital Holography           17x
Oil & Gas                  Ground Water Simulations     > 20x

OpenACC — Programming Standard

● Write (almost) serial code
● Use directives to tell the compiler what is executed in parallel
● Let the compiler do the parallel work for you
● Under constant development: constant performance improvements!
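As a minimal sketch of what "almost serial" can look like, the loop below carries a single OpenACC directive. The function name saxpy and the choice of data clauses are assumptions made for this illustration. A compiler without OpenACC support ignores the pragma and runs the loop serially, so the same source works either way.

```cpp
#include <cassert>
#include <vector>

// y <- a * x + y, written as an ordinary loop plus one directive.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y)
{
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();
    float* yp = y.data();

    // copyin: xp is only read on the device; copy: yp is read and written back.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```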

Implementation From Scratch — When?

● You are writing a novel algorithm implementation
● Your code uses a modified version of the standard algorithm in question
● You want to learn about how parallel code works

Serial to Parallel

● Check if serial -> parallel is feasible (e.g., loops with inputs that do not rely on previous results)

● Identify performance bottlenecks (e.g., memory bandwidth, register usage)

● Profile the code (e.g., check where your code is spending most of its time and resources)

● Optimize the code (e.g., use shared memory where possible)
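The feasibility check in the first bullet can be illustrated with two serial loops, one whose iterations are independent and one that carries a dependency from each iteration to the next. Both functions are hypothetical examples written for this sketch, not code from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Independent iterations: c[i] reads only a[i] and b[i], so every i could be
// handled by a different thread.
std::vector<int> vector_add(const std::vector<int>& a, const std::vector<int>& b)
{
    assert(a.size() == b.size());
    std::vector<int> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) c[i] = a[i] + b[i];
    return c;
}

// Loop-carried dependency: out[i] needs out[i - 1], so iteration i cannot
// start before iteration i - 1 has finished.
std::vector<int> running_average(const std::vector<int>& a)
{
    std::vector<int> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = (i == 0) ? a[0] : (out[i - 1] + a[i]) / 2;
    return out;
}
```

The first loop maps directly onto one GPU thread per element. The second needs to be restructured, or left serial, before it can benefit from a GPU.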

Tools To Improve Productivity

● Debugging, memory correctness checking and profiling tools
  ○ Debugging: cuda-gdb
  ○ Memory correctness checking: cuda-memcheck
  ○ Profiling: nvprof

● Integrated Development Environments (IDEs)
  ○ NVIDIA Nsight

cuda-gdb

● Break host code on condition
● Check variable values
● Do everything else that GNU gdb is capable of

cuda-gdb - Example Code

#include <cuda.h> // Include for debugging

__global__ void vadd(int * a, int * b, int * c, int length)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < length) c[idx] = a[idx] + b[idx]; // Line 6
}

int main()
{
    int samples = 1000;
    ...
    // d_A = {0, 1, 2, 3, ..., 999};
    // d_B = {0, 2, 4, 6, ..., 1998};
    vadd<<<2, 512>>>(d_A, d_B, d_C, samples);

    return 0;
}

cuda-gdb - Usage

$ cuda-gdb ./vadd
(cuda-gdb) break vadd.cu:6 if idx == 900
Breakpoint 1 (vadd.cu:6 if idx == 900) pending.
(cuda-gdb) run
[Launch of CUDA Kernel 0 (vadd<<<(2,1,1),(512,1,1)>>>) on Device 0]
[Switching focus to CUDA kernel 0, grid 1, block (1,0,0), thread (388,0,0), device 0, sm 6, warp 12, lane 4]

Breakpoint 1, vadd(int * @global, int * @global, int * @global, int)<<<(2,1,1),(512,1,1)>>> (a=0x500140000, b=0x500141000, c=0x500142000, length=1000) at vadd.cu:6
6           if (idx < length) c[idx] = a[idx] + b[idx];
(cuda-gdb) print a[idx]
$1 = 900
(cuda-gdb) print b[idx]
$2 = 1800
(cuda-gdb) print c[idx]
$3 = 2700
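The session stops in block (1,0,0), thread (388,0,0) because that is the thread whose global index is 900 in a launch with 512 threads per block. The host-side sketch below works through that arithmetic. The struct and function names are hypothetical, and a 1D launch is assumed.

```cpp
#include <cassert>

struct ThreadCoord {
    int block;   // blockIdx.x
    int thread;  // threadIdx.x
};

// Same formula as in the kernel: idx = blockDim.x * blockIdx.x + threadIdx.x.
int global_index(int block_dim, int block_idx, int thread_idx)
{
    return block_dim * block_idx + thread_idx;
}

// Inverse mapping for a 1D launch: which block and thread own a given idx.
ThreadCoord locate(int idx, int block_dim)
{
    assert(idx >= 0 && block_dim > 0);
    return ThreadCoord{idx / block_dim, idx % block_dim};
}
```

With block_dim = 512, locate(900, 512) yields block 1 and thread 388, since 512 * 1 + 388 = 900.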

cuda-gdb - CUDA Information

● Get information on device running your code

(cuda-gdb) info cuda devices
  Dev  Description  SM Type  SMs  Warps/SM  Lanes/Warp  Max Regs/Lane  Active SMs Mask
*   0  gk104        sm_30      8        64          32             64  0x000000c0

● Get information on how your kernel is called

(cuda-gdb) info cuda kernels
  Kernel  Parent  Dev  Grid  Status  SMs Mask    GridDim  BlockDim   Invocation
*      0  -         0     1  Active  0x000000c0  (2,1,1)  (512,1,1)  vadd(a=0x500140000, b=0x500141000, c=0x500142000, length=1000)

● Get information on the threads running your kernel

(cuda-gdb) info cuda threads
  BlockIdx  ThreadIdx  To BlockIdx  ThreadIdx  Count  Virtual PC          Filename  Line
Kernel 0
*  (0,0,0)  (0,0,0)        (1,0,0)  (511,0,0)   1024  0x0000000000678f88  vadd.cu      6

● There is even more: SMs, warps, lanes, blocks, ...

cuda-memcheck

● Check for memory out of bounds
● Errors will be generated when reading or writing unallocated memory

cuda-memcheck - Example Code

#include <cuda.h> // Include for debugging

__global__ void vadd(int * a, int * b, int * c, int length)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < length) c[idx] = a[idx] + b[idx]; // Line 6
}

int main()
{
    int samples = 1000;
    ...
    // d_A = {0, 1, 2, 3, ..., 999};
    // d_B = {0, 2, 4, 6, ..., 1998};
    vadd<<<2, 512>>>(d_A, d_B, d_C, samples);

    return 0;
}

cuda-memcheck - Usage

$ cuda-memcheck ./vadd
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x000000e0 in vadd.cu:9:vadd(int*, int*, int*, int)
=========     by thread (511,0,0) in block (1,0,0)
=========     Address 0x500140ffc is out of bounds
=========
========= Invalid __global__ read of size 4
=========     at 0x000000e0 in vadd.cu:9:vadd(int*, int*, int*, int)
=========     by thread (510,0,0) in block (1,0,0)
=========     Address 0x500140ff8 is out of bounds
...
========= Program hit error 4 on CUDA API call to cudaMemcpy
=========     Saved host backtrace up to driver entry point at error
=========
========= ERROR SUMMARY: 25 errors
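The reported addresses are what unguarded reads of a 1000-element int array would produce for the last threads of a <<<2, 512>>> launch. The sketch below reproduces that arithmetic on the host. It assumes 4-byte ints and an array starting at 0x500140000, the value of a shown in the cuda-gdb session. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Count launched threads whose global index
// idx = blockDim.x * blockIdx.x + threadIdx.x lies outside [0, length).
int threads_out_of_range(int grid_dim, int block_dim, int length)
{
    assert(grid_dim > 0 && block_dim > 0 && length >= 0);
    int count = 0;
    for (int block = 0; block < grid_dim; ++block)
        for (int thread = 0; thread < block_dim; ++thread)
            if (block_dim * block + thread >= length) ++count;
    return count;
}

// Byte address of element idx in an array of 4-byte ints starting at base.
std::uint64_t int_element_address(std::uint64_t base, int idx)
{
    assert(idx >= 0);
    return base + 4u * static_cast<std::uint64_t>(idx);
}
```

Thread 511 of block 1 has idx 1023, which maps to 0x500140ffc, and thread 510 maps to 0x500140ff8, matching the two errors shown. The launch creates 1024 threads for 1000 elements, so 24 threads fall out of range. That would be consistent with 24 invalid reads plus the one cudaMemcpy API error adding up to the 25 errors in the summary, although the elided part of the output does not confirm this.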

nvprof

● Measure the time consumed by each call to an operation involving a device

● Useful information to identify bottlenecks

nvprof - Usage

$ nvprof ./vadd
==10506== NVPROF is profiling process 10506, command: ./vadd
==10506== Profiling application: ./vadd
==10506== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 38.03%  2.5920us      2  1.2960us  1.2800us  1.3120us  [CUDA memcpy HtoD]
 37.09%  2.5280us      1  2.5280us  2.5280us  2.5280us  [CUDA memcpy DtoH]
 24.88%  1.6960us      1  1.6960us  1.6960us  1.6960us  vadd(int*, int*, int*, int)
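The Time(%) column is each entry's share of the summed Time column. The small host-side sketch below recomputes those shares from the times in the profile. The function name percent_shares is hypothetical.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Percentage share of each entry in the total of times_us (microseconds).
std::vector<double> percent_shares(const std::vector<double>& times_us)
{
    const double total = std::accumulate(times_us.begin(), times_us.end(), 0.0);
    assert(total > 0.0);
    std::vector<double> shares;
    shares.reserve(times_us.size());
    for (double t : times_us) shares.push_back(100.0 * t / total);
    return shares;
}
```

Applied to {2.5920, 2.5280, 1.6960}, this gives roughly 38.03%, 37.09% and 24.88%. The two memory copies together take about three quarters of the measured device time in this run, the kind of bottleneck the profile is meant to expose.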
