Introduction to Productive GPU Programming | GTC...
Introduction to Productive GPU Programming
Umar Arshad
ArrayFire
● World’s leading GPU experts
  ○ In the industry since 2007
  ○ NVIDIA Partner
● Deep experience working with thousands of customers
  ○ Analysis
  ○ Acceleration
  ○ Algorithm development
● GPU Training
  ○ Hands-on course with a CUDA engineer
  ○ Customized to meet your needs
Productivity
● Software Development
  ○ Development Costs
  ○ Features
  ○ User Experience
● Limited resources
● Tools and Libraries
  ○ Reduce R&D costs
  ○ Lower testing and deployment time
  ○ More time for features
GPU Libraries
● Programmed by GPU experts
  ○ Years of experience
● Abstract low level details
● Target multiple architectures
  ○ Some kernels might run better on older hardware
● Free improvements on new hardware
  ○ Update to the latest version
● No need to reinvent the wheel
Library Types
● Specialized GPU Libs
  ○ Targeted at a specific set of operations
  ○ C interface
  ○ Raw pointer interface
● General GPU Libs
  ○ Manage GPU resources using containers
  ○ Targeted for general computation
  ○ Higher level functions
  ○ C++ interface
Specialized GPU Libraries
● Fast Fourier Transforms
  ○ cuFFT
● Random Number Generation
  ○ cuRAND
● Linear Algebra
  ○ cuBLAS
  ○ CULA Tools
  ○ MAGMA
● Signal and Image Processing
  ○ NPP
Specialized GPU Libraries
● C Interface
  ○ Use pointers to reference data
● Do not manage memory
● Mimic existing libraries
  ○ cuBLAS ≈ BLAS
  ○ CULA ≈ BLAS + LAPACK
  ○ cuFFT ≈ FFTW
● Minimizes the amount of code necessary to integrate into existing projects
cuFFT
● 1D, 2D and 3D transforms
● Both real and complex data types supported
● Single and double-precision supported
● Batch execution for multiple transforms
● Available as part of the CUDA Toolkit
cuRAND
● Bulk random number generation on the GPU
● Usable from host code and from within kernels
● Single and double precision support
● Four different RNG algorithms
  ○ MRG32k3a
  ○ MTGP Mersenne Twister
  ○ XORWOW
  ○ Sobol' quasi-RNG
● Multiple RNG distributions (uniform, normal, log-normal, Poisson)
cuBLAS, CULA and MAGMA
● Support most popular linear algebra routines
● Real and complex data types support
● Single and double-precision support
NPP
● Signal and image processing functions
● Avoid unnecessary data copies
  ○ Can process data that is already on the GPU
  ○ Keeps processed data on GPU for further processing
● Arithmetic and logical operations
● Color conversions
● Filtering
● Geometric transforms
● Statistical functions
General-Purpose GPU Libraries
● Thrust
● OpenCV
● ArrayFire
Thrust
● GPU library resembling the C++ STL
  ○ STL-like data structures
  ○ Iterators
  ○ Fully interoperable with CUDA C
● Parallel vector operation methods
  ○ Reductions
  ○ Sorting
  ○ Prefix-Sum
● Customizable GPU kernels using functors
Thrust - Data Structures
● Two types of containers
  ○ host_vector: stores data on the host
  ○ device_vector: stores data on the device
● Supports same data types as C++
  ○ host_vector<float> foo(2e6, 4);
  ○ device_vector<double> bar(2e6);
● Explicit data transfer
  ○ bar = foo;
Thrust iterators
● Like the C++ STL, Thrust uses iterators to define the range an operation works on
● Thrust functions require a begin and an end iterator
  ○ The begin iterator points to the first element of the range
  ○ The end iterator points one past the last element of the range
int sum = thrust::reduce(hdata.begin(), hdata.end()); // sum of the hdata vector
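Since Thrust follows the STL's iterator conventions, the begin/end idea can be tried on the host with the standard library alone. The sketch below is a host-only analogue, not Thrust code: `sum_range` is a hypothetical helper name, and `std::accumulate` stands in for `thrust::reduce`. It sums a half-open range, where the end position is one past the last element included.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: sums data[first] ... data[last - 1].
// The range is half-open, so `last` itself is not included.
int sum_range(const std::vector<int>& data, std::size_t first, std::size_t last)
{
    assert(first <= last && last <= data.size());
    return std::accumulate(data.begin() + first, data.begin() + last, 0);
}
```

Passing `0` and `data.size()` covers the whole vector, which corresponds to the `begin()`/`end()` pair in the Thrust call; passing equal positions gives an empty range and a sum of zero.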
Thrust Functions
● Thrust includes many basic algorithms for general computation
  ○ Reductions
  ○ Sorting
  ○ Prefix-sum
  ○ Scan
  ○ Reordering
  ○ Transformation
  ○ Generation
  ○ Random number generation
Thrust Functions
● Many Thrust functions can be customized with built-in function objects or custom functors
● Basic operations
  ○ plus, minus, multiplies, etc.
● Custom Functors
  ○ Create your own function objects
  ○ Overload the operator() function
    ■ decorated with __host__ and __device__
Thrust Functors
struct my_plus
{
    __host__ __device__
    float operator()(const float& x, const float& y) const
    {
        return x + y;
    }
};

void thrust_functor_example()
{
    // Define input vectors x and y and output vector z with equal lengths
    ...

    // z <- x + y, using the custom functor
    thrust::transform(x.begin(), x.end(), y.begin(), z.begin(), my_plus());
    // z <- x + y, using Thrust's built-in function object
    thrust::transform(x.begin(), x.end(), y.begin(), z.begin(), thrust::plus<float>());
}
Thrust Functors
● Caveats
  ○ Cannot use shared memory
  ○ Do not have control over block/grid size
  ○ No stream support
OpenCV (GPU)
● Manipulate matrices
● Perform complex image processing operations
  ○ Image filtering
  ○ Resizing
● Dozens of available computer vision algorithms
  ○ Object recognition
  ○ Human Detection
OpenCV (GPU) - Data Structures
● GpuMat container to store data
● Signed/unsigned 8, 16 and 32 bit integers, single and double-precision floating-point numbers
● Multi-channel support
● Only 2D matrices - no arbitrary dimensions
OpenCV (GPU) - Usage Example
void opencv_gpu_example()
{
    float f[4] = {100, 200, 400, 800};
    cv::Mat h(4, 1, CV_32F, f);        // 4 x 1 host matrix wrapping f
    cv::gpu::GpuMat d(h);              // upload to the device
    cv::Scalar sum = cv::gpu::sum(d);  // reduce-sum on the GPU
}
ArrayFire
● Hundreds of parallel functions
  ○ Targeting image processing, machine learning, etc.
● Support for multiple languages
  ○ C/C++, Fortran, Java and R
● Linux, Windows, Mac OS X
● OpenGL based graphics
● Based around one data structure
● JIT
  ○ Combine multiple operations into one kernel
● GFOR, the only data parallel loop
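To make "combine multiple operations into one kernel" concrete, here is a CPU-only illustration of what fusing operations means. It is a sketch of the idea under stated assumptions, not a description of how ArrayFire's JIT is implemented: both function names are hypothetical, and plain loops stand in for GPU kernels. The unfused version makes one pass and one temporary per operation; the fused version computes the same result in a single pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: three passes over memory and two temporaries,
// comparable to launching one kernel per operation.
std::vector<float> squared_norm_unfused(const std::vector<float>& x,
                                        const std::vector<float>& y)
{
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    std::vector<float> xx(n), yy(n), out(n);
    for (std::size_t i = 0; i < n; ++i) xx[i] = x[i] * x[i];
    for (std::size_t i = 0; i < n; ++i) yy[i] = y[i] * y[i];
    for (std::size_t i = 0; i < n; ++i) out[i] = xx[i] + yy[i];
    return out;
}

// Fused: the same arithmetic in one pass with no temporaries,
// comparable to a single combined kernel.
std::vector<float> squared_norm_fused(const std::vector<float>& x,
                                      const std::vector<float>& y)
{
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = x[i] * x[i] + y[i] * y[i];
    return out;
}
```

The fused form reads each input once and writes the output once, which matters on a GPU because such element-wise work is usually limited by memory traffic rather than arithmetic.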
ArrayFire Functions
● Hundreds of parallel functions
  ○ Building blocks
    ■ Reductions
    ■ Scan
    ■ Set operations
    ■ Sorting
    ■ Statistics
    ■ Basic matrix manipulation
ArrayFire Functions
● Hundreds of parallel functions
  ○ Signal/image processing
    ■ Convolution
    ■ FFT
    ■ Histograms
    ■ Interpolation
    ■ Connected components
  ○ Linear Algebra
    ■ Matrix multiply
    ■ Linear system solving
    ■ Factorization
ArrayFire - Data Structures
● Built around a flexible data structure named "array"
  ○ Lightweight wrapper around the data on the compute device
  ○ Manages the data and basic metadata such as size, type and dimensions
● You can transfer data into an array object using one of its constructors
float hA[] = {0, 1, 2, 3, 4, 5};
array A(2, 3, hA);
ArrayFire - Indexing
#include <arrayfire.h>
#include <af/utils.h> // required for print()

using namespace af;

void af_example()
{
    float f[4] = {100, 200, 400, 800};
    array a(4, f);     // 4 rows x 1 col array initialized with f values
    array b = sum(a);  // performs reduce-sum over all elements of a
}
Case Study 1 - Dot Product
● Have two N-dimensional vectors, x and y
● Multiply the corresponding components (1 to N) of x and y
● Sum up all N results
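Before moving to the GPU, it can help to have a serial reference to compare results against. The following is a minimal host-only sketch using the standard library; `dot_reference` is a hypothetical name, and accumulating in `double` is an assumption made here to limit rounding error on long vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-only reference: multiply matching components of
// x and y, then sum the products.
double dot_reference(const std::vector<float>& x, const std::vector<float>& y)
{
    assert(x.size() == y.size());
    // The 0.0 initial value makes the running sum a double.
    return std::inner_product(x.begin(), x.end(), y.begin(), 0.0);
}
```

For x = {1, 2, 3} and y = {4, 5, 6} this computes 1*4 + 2*5 + 3*6 = 32. A GPU result for the same inputs should agree up to floating-point rounding.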
Case Study 1 - Dot Product - Thrust
double dotproduct_thrust()
{
    // float vectors in device memory
    thrust::device_vector<float> x(samples);
    thrust::device_vector<float> y(samples);
    thrust::device_vector<float> z(samples);

    // generate sequences, starting from 0.0f, with steps of 1.0f and 0.001f
    thrust::sequence(x.begin(), x.end(), 0.f, 1.f);
    thrust::sequence(y.begin(), y.end(), 0.f, 0.001f);

    // multiplies vectors x and y, storing result in z
    thrust::transform(x.begin(), x.end(), y.begin(), z.begin(), thrust::multiplies<float>());

    // returns sum-reduction of z
    return thrust::reduce(z.begin(), z.end());
}
Case Study 1 - Dot Product - ArrayFire
static double dotproduct_af()
{
    // array in device memory, set to sequence {0.0f, 1.0f, 2.0f, ..., samples-1}
    array x = seq(samples);
    // array in device memory, set to sequence {0.0f, 1.0f, 2.0f, ..., samples-1} * 0.001f
    array y(seq(samples)*0.001f);

    // multiplies x and y and returns sum-reduction of resulting vector
    return sum<float>(x*y);
}
Case Study 2 - Pi Estimation
● Generate millions of uniformly distributed random samples
● Each sample includes an x and a y coordinate
● Estimate the fraction of samples that fall within the unit-radius circle
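The reasoning behind the estimate: points drawn uniformly from the unit square land inside the quarter circle of radius 1 with probability pi/4, so four times the observed fraction approximates pi. Below is a serial host-only sketch of that calculation using the standard `<random>` facilities. The function name `pi_reference`, the Mersenne Twister engine, and the default seed are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Hypothetical serial reference for the Monte Carlo pi estimate.
double pi_reference(std::size_t samples, unsigned seed = 42)
{
    assert(samples > 0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);

    std::size_t inside = 0;
    for (std::size_t i = 0; i < samples; ++i) {
        const float x = dist(gen);
        const float y = dist(gen);
        // Count samples that fall within the unit-radius circle.
        if (x * x + y * y <= 1.0f) ++inside;
    }
    // The hit fraction approximates pi/4.
    return 4.0 * static_cast<double>(inside) / static_cast<double>(samples);
}
```

The statistical error shrinks roughly with the square root of the sample count, which is why millions of samples are used and why the problem benefits from bulk parallel random number generation.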
Case Study 2 - Pi Estimation - ArrayFire
double pi_af()
{
    array x = randu(samples,f32), y = randu(samples,f32);
    return 4 * sum<float>(x*x + y*y <= 1) / samples;
}
Speedups With ArrayFire
Field                     Application                 Speedup
Academia                  Power Systems Simulations   35x
Finance                   Option Pricing              52x
Government                Radar Image Formation       45x
Life Sciences             Pathology Advances          > 100x
Manufacturing             Tomography of Vegetation    10x
Media & Computer Vision   Digital Holography          17x
Oil & Gas                 Ground Water Simulations    > 20x
OpenACC - Programming Standard
● Write (almost) serial code
● Use directives to tell the compiler what is executed in parallel
● Let the compiler do the parallel work for you
● Under constant development, with ongoing performance improvements
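As a sketch of what "almost serial code plus directives" can look like, here is a SAXPY loop annotated with an OpenACC `parallel loop` directive. The function name and the data clauses are illustrative choices, not taken from the talk. A compiler without OpenACC support ignores the pragma (possibly with a warning) and runs the loop serially, so the same source works either way.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative SAXPY: y <- a * x + y
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y)
{
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const float* xp = x.data();
    float* yp = y.data();

    // The directive asks an OpenACC compiler to parallelize the loop:
    // copy x to the device, and copy y to the device and back.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Whether the loop is actually offloaded depends on the compiler and its flags; the loop body itself stays ordinary serial C++.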
Implementation From Scratch - When?
● You are writing a novel algorithm implementation
● Your code uses a modified version of the standard algorithm in question
● You want to learn about how parallel code works
Serial to Parallel
● Check if serial -> parallel is feasible (e.g., loops with inputs that do not rely on previous results)
● Identify performance bottlenecks (e.g., memory bandwidth, register usage)
● Profile the code (e.g., check where your code is spending most of its time and resources)
● Optimize the code (e.g., use shared memory where possible)
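The feasibility check in the first bullet comes down to whether one loop iteration needs the result of another. The two hypothetical host functions below sketch the difference: the first loop has independent iterations and maps directly to one thread per element, while the second carries a value from one iteration to the next.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Independent iterations: out[i] depends only on in[i], so every
// element could be computed by a separate thread.
void scale(const std::vector<float>& in, float factor, std::vector<float>& out)
{
    assert(out.size() == in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = factor * in[i];
}

// Loop-carried dependency: out[i] needs the total accumulated over
// all earlier elements, so a one-thread-per-element mapping of this
// loop would give wrong results.
void running_sum(const std::vector<float>& in, std::vector<float>& out)
{
    assert(out.size() == in.size());
    float total = 0.0f;
    for (std::size_t i = 0; i < in.size(); ++i) {
        total += in[i];
        out[i] = total;
    }
}
```

A dependency like the one in `running_sum` does not always rule out the GPU. This particular pattern is a prefix sum, which the libraries covered earlier provide as a parallel scan, but it needs a different algorithm rather than a direct translation of the loop.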
Tools To Improve Productivity
● Debugging, memory correctness checking and profiling tools
  ○ Debugging: cuda-gdb
  ○ Memory correctness checking: cuda-memcheck
  ○ Profiling: nvprof
● Integrated Development Environments (IDEs)
  ○ NVIDIA Nsight
cuda-gdb
● Break on a condition, in host or device code
● Check variable values
● Do everything else that GNU gdb is capable of
cuda-gdb - Example Code
#include <cuda.h> // Include for debugging

__global__ void vadd(int * a, int * b, int * c, int length)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < length) c[idx] = a[idx] + b[idx]; // Line 6
}

int main()
{
    int samples = 1000;
    ...
    // d_A = {0, 1, 2, 3, ..., 999};
    // d_B = {0, 2, 4, 6, ..., 1998};
    vadd<<<2, 512>>>(d_A, d_B, d_C, samples);

    return 0;
}
cuda-gdb - Usage
$ cuda-gdb ./vadd
(cuda-gdb) break vadd.cu:6 if idx == 900
Breakpoint 1 (vadd.cu:6 if idx == 900) pending.
(cuda-gdb) run
[Launch of CUDA Kernel 0 (vadd<<<(2,1,1),(512,1,1)>>>) on Device 0]
[Switching focus to CUDA kernel 0, grid 1, block (1,0,0), thread (388,0,0), device 0, sm 6, warp 12, lane 4]

Breakpoint 1, vadd(int * @global, int * @global, int * @global, int)<<<(2,1,1),(512,1,1)>>> (a=0x500140000, b=0x500141000, c=0x500142000, length=1000) at vadd.cu:6
6    if (idx < length) c[idx] = a[idx] + b[idx];
(cuda-gdb) print a[idx]
$1 = 900
(cuda-gdb) print b[idx]
$2 = 1800
(cuda-gdb) print c[idx]
$3 = 2700
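The numbers in this session can be checked by hand with the same index arithmetic the kernel uses. The two hypothetical host helpers below sketch it: one computes how many blocks of a given size are needed to cover a vector, the other rebuilds the global index from the block and thread coordinates that cuda-gdb reports.

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed to cover `length`
// elements, rounding up.
int blocks_needed(int length, int threads_per_block)
{
    assert(length >= 0 && threads_per_block > 0);
    return (length + threads_per_block - 1) / threads_per_block;
}

// Same formula as in the kernel:
// idx = blockDim.x * blockIdx.x + threadIdx.x
int global_index(int block_dim, int block_idx, int thread_idx)
{
    return block_dim * block_idx + thread_idx;
}
```

With 1000 elements and 512 threads per block, `blocks_needed` returns 2, matching the `<<<2, 512>>>` launch. Block (1,0,0), thread (388,0,0) gives 512 * 1 + 388 = 900, which is the index the breakpoint condition selected. The launch creates 1024 threads, so the last 24 have no element to process and are held back by the `idx < length` guard.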
cuda-gdb - CUDA Information
● Get information on device running your code
(cuda-gdb) info cuda devices
  Dev  Description  SM Type  SMs  Warps/SM  Lanes/Warp  Max Regs/Lane  Active SMs Mask
*   0  gk104        sm_30      8        64          32             64  0x000000c0
● Get information on how your kernel is called
(cuda-gdb) info cuda kernels
  Kernel  Parent  Dev  Grid  Status  SMs Mask    GridDim  BlockDim   Invocation
*      0  -         0     1  Active  0x000000c0  (2,1,1)  (512,1,1)  vadd(a=0x500140000, b=0x500141000, c=0x500142000, length=1000)
● Get information on the threads running your kernel
(cuda-gdb) info cuda threads
  BlockIdx  ThreadIdx  To BlockIdx  ThreadIdx  Count  Virtual PC          Filename  Line
Kernel 0
* (0,0,0)   (0,0,0)    (1,0,0)      (511,0,0)   1024  0x0000000000678f88  vadd.cu      6
● There is even more: SMs, warps, lanes, blocks, ...
cuda-memcheck
● Check for out-of-bounds memory accesses
● Errors will be generated when reading or writing unallocated memory
cuda-memcheck - Example Code
#include <cuda.h> // Include for debugging

__global__ void vadd(int * a, int * b, int * c, int length)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < length) c[idx] = a[idx] + b[idx]; // Line 6
}

int main()
{
    int samples = 1000;
    ...
    // d_A = {0, 1, 2, 3, ..., 999};
    // d_B = {0, 2, 4, 6, ..., 1998};
    vadd<<<2, 512>>>(d_A, d_B, d_C, samples);
cuda-memcheck - Usage
$ cuda-memcheck ./vadd
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x000000e0 in vadd.cu:9:vadd(int*, int*, int*, int)
=========     by thread (511,0,0) in block (1,0,0)
=========     Address 0x500140ffc is out of bounds
=========
========= Invalid __global__ read of size 4
=========     at 0x000000e0 in vadd.cu:9:vadd(int*, int*, int*, int)
=========     by thread (510,0,0) in block (1,0,0)
=========     Address 0x500140ff8 is out of bounds
...
========= Program hit error 4 on CUDA API call to cudaMemcpy
=========     Saved host backtrace up to driver entry point at error
=========
========= ERROR SUMMARY: 25 errors
nvprof
● Measure the time consumed by each operation that involves the device
● Useful for identifying bottlenecks
nvprof - Usage
$ nvprof ./vadd
==10506== NVPROF is profiling process 10506, command: ./vadd
==10506== Profiling application: ./vadd
==10506== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 38.03%  2.5920us      2  1.2960us  1.2800us  1.3120us  [CUDA memcpy HtoD]
 37.09%  2.5280us      1  2.5280us  2.5280us  2.5280us  [CUDA memcpy DtoH]
 24.88%  1.6960us      1  1.6960us  1.6960us  1.6960us  vadd(int*, int*, int*, int)