GPGPU - UCLouvain
Patrice Rondao Alface, 7 March 2011
A Guided Tour on the Migration of Real-time Multimedia Applications on Next-Gen GPGPU
ALCATEL-LUCENT BELL LABS BELGIUM
• Bell Labs established in 1925
• Global presence with 8 research centers in the USA, France, Belgium, Germany, Ireland, India, China and South Korea.
• Well known for inventions such as the transistor, the laser, DSL, UNIX, DWDM, MIMO, C, C++…
• 27,600 Active Patents and 400 publications and conference papers per year.
• 7 Nobel Prizes in Physics, 9 U.S. National Medals of Science and 12 U.S. National Medals of Technology.
• Bell Labs Belgium (Antwerp):
• 150 researchers: largest ICT research center in Belgium
• video & immersion
• next generation access
• telco cloud
• connected devices
RESEARCH ACTIVITIES
Layered Panoramic and Omnidirectional A/V Capturing
Video Analysis and Automated Editing
Automated shot framing
Region-of-Interest Detection and Tracking
Immersive and Interactive Applications for End-Users
Gesture-based user interfaces
Flexible and Interactive A/V Rendering
Scalable delivery and in-Network Adaptation of A/V flows
ACKNOWLEDGMENTS
• CUDA research done at and in collaboration with IMEC, Leuven
• Special thanks to
• Gauthier Lafruit
• Sammy Rogmans
• Qiong Yang
• Pradip Mainali
• Rajat Phull
• …
OUTLINE
• Introduction to CUDA
• CUDA Programming Model
• Feature-Point Detection Acceleration in CUDA
• Free-Viewpoint Video Hybrid Acceleration
• Summary
INTRODUCTION TO CUDA - COMPUTE UNIFIED DEVICE ARCHITECTURE
source: http://www.nvidia.com
INTRODUCTION TO CUDA - RESEARCH
[Chart: number of IEEE publications on CUDA per year, 2007-2010, rising from near 0 to about 350]
INTRODUCTION TO CUDA - WHY GPGPU PROGRAMMING?
source: http://www.nvidia.com
INTRODUCTION TO CUDA - WHY GPGPU PROGRAMMING?
• GPU = Massively parallel processors
• Calculation:
• 800 GFLOPS vs. 80 GFLOPS
• Memory Bandwidth:
• 86.4 GB/s vs. 8.4 GB/s
• Until 2006, mostly programmed through graphics API
• Now next-gen GPGPU programming is available with Brook+ and CUDA
source: http://www.nvidia.com
INTRODUCTION TO CUDA - PARALLEL PROGRAMMING AND ARCHITECTURES
• Necessary pre-requisites before optimizing code on parallel architectures:
• “Patterns for Parallel Programming” by T. G. Mattson et al., 2004, ISBN-13: 978-0321228116
• “The Landscape of Parallel Computing Research: A View from Berkeley 2.0” by Asanovic et al. 2007: http://www.cs.bris.ac.uk/Teaching/Resources/COMS35101/resources/berkeleyview2.0-ACACES20070716.pdf
• 13 dwarfs
1. Finite State Machine
2. Combinational Logic
3. Graph Traversal
4. Structured Grids
5. Dense Linear Algebra
6. Sparse Linear Algebra
7. Spectral Methods (FFT)
8. Dynamic Programming
9. N-Body Methods
10. MapReduce
11. Back-track/Branch & Bound
12. Graphical Model Inference
13. Unstructured Grids
Asanovic:
“Claim: parallel arch., lang., compiler … must do at least these well to do future parallel apps well
Note: MapReduce is embarrassingly parallel; perhaps FSM is embarrassingly sequential?”
INTRODUCTION TO CUDA - PARALLEL PROGRAMMING AND ARCHITECTURES
source: “The Landscape of Parallel Computing Research: A View from Berkeley 2.0”
INTRODUCTION TO CUDA - AMDAHL'S LAW
• If P is the proportion of a program that can be made parallel, and
• (1 - P) is the proportion that cannot be parallelized,
• then the maximum speedup that can be achieved by using N processors is

speedup = 1 / ((1 - P) + P/N)
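The formula can be tried out directly; a minimal sketch (the function name `amdahl_speedup` is illustrative):

```c
#include <assert.h>

/* Maximum speedup per Amdahl's law: parallel fraction p, n processors. */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

Even a 95%-parallel program tops out below 20x speedup, no matter how many processors are thrown at it: the serial 5% dominates.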
INTRODUCTION TO CUDA - GPU: WHAT IS IT GOOD AT?
• The GPU is good at data-parallel processing
• The same computation executed on many data elements in parallel – low control flow overhead
with high floating point arithmetic intensity
• Many calculations per memory access
• (Currently also need high floating point to integer ratio)
• High floating-point arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches – Still need to avoid bandwidth saturation!
OUTLINE
• Introduction to CUDA
• CUDA Programming Model
• Feature-Point Detection Acceleration in CUDA
• Free-Viewpoint Video Hybrid Acceleration
• Summary
CUDA PROGRAMMING MODEL - GeForce 8800
source: http://www.nvidia.com
CUDA PROGRAMMING MODEL
source: http://www.nvidia.com
CUDA PROGRAMMING MODEL - MEMORY SPACES
source: http://www.nvidia.com
CUDA PROGRAMMING MODEL
• The GPU is viewed as a compute device that:
• Is a coprocessor to the CPU or host
• Has its own DRAM (device memory)
• Runs many threads in parallel
• Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads
• Differences between GPU and CPU threads
• GPU threads are extremely lightweight
• Very little creation overhead
• GPU needs 1000s of threads for full efficiency
• Multi-core CPU needs only a few
CUDA PROGRAMMING MODEL
• Warps
• Each block is split into SIMD groups of threads called warps
• Warps are swapped in and out via thread scheduling
• Threads within a warp execute in lock step
• Threads are assigned to warps consecutively by their thread ID
• Issue order of warps and blocks is undefined, but there are synchronization primitives
• Performance
• Branches are predicated
• Divergence within a warp should be avoided if possible
• Memory coherence extremely important
• (Always try to read/write in a coalesced manner)
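The warp bookkeeping above can be sketched with plain C helpers (warp size 32, as stated; `warp_of` and `warp_diverges` are illustrative names, not CUDA API):

```c
#include <assert.h>

#define WARP_SIZE 32

/* Threads are assigned to warps consecutively by their thread ID. */
int warp_of(int tid) { return tid / WARP_SIZE; }

/* A branch diverges inside a warp if its predicate differs between two
   threads of the same warp. Predicate here: "tid < boundary". */
int warp_diverges(int warp, int boundary) {
    int first = warp * WARP_SIZE, last = first + WARP_SIZE - 1;
    return (first < boundary) != (last < boundary);
}
```

A condition aligned to a multiple of 32 (e.g. `tid < 32`) splits cleanly between warps and costs nothing; `tid < 16` forces both sides of the branch to be predicated through warp 0.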
CUDA PROGRAMMING MODEL
• Compute Unified Device Architecture
• Unified hardware and software specification for parallel computation
• Simple extensions to C language to allow code to run on the GPU
• Developed by and for NVIDIA
• Benefits and Features
• Application controlled SIMD program structure
• Fully general load/store to GPU memory
• Totally untyped (not limited to texture storage)
• No limits on branching, looping, etc.
• Full integer and bit instructions
• Supports pointers
• Explicitly managed memory down to cache level
• No graphics code (although interoperability with OpenGL/D3D is supported)
CUDA PROGRAMMING MODEL - APPLICATION PROGRAMMING INTERFACE
• The API is an extension to the C/C++ programming language
• It consists of:
• Language extensions
• To target portions of the code for execution on the device
• Two stage compilation (e.g. nvcc + gcc)
• A runtime library split into:
• A common component providing built-in vector types and a subset of the C runtime library in both host and device codes
• A host component to control and access one or more devices from the host
• A device component providing device-specific functions
CUDA PROGRAMMING MODEL - EXTENDED C
CUDA PROGRAMMING MODEL - HINTS FOR ACCELERATION
1. Identify parallel code: Amdahl's law
2. Select the best memory to optimize read/write access
3. If possible, exploit data reuse using shared memory, but avoid bank conflicts
4. Minimize control flow in a kernel (execution granularity is a warp of 32 threads) and avoid unnecessary __syncthreads()
5. Optimize GPU occupancy, taking block size, registers per thread and shared memory into consideration
(Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps)

speedup = 1 / ((1 - P) + P/N)
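Hint 5 can be made concrete with a small occupancy estimate. A hedged sketch, assuming the resource limits of a compute-capability-1.3 multiprocessor (32 warps / 1024 threads, 16384 registers, 16 KB shared memory, 8 resident blocks):

```c
#include <assert.h>

/* Assumed per-multiprocessor limits for a GTX 280 class device. */
#define MAX_WARPS  32
#define MAX_REGS   16384
#define MAX_SMEM   16384
#define MAX_BLOCKS 8

static int imin(int a, int b) { return a < b ? a : b; }

/* Active warps per multiprocessor, limited by whichever resource runs
   out first: warp slots, registers, shared memory or block slots. */
int active_warps(int block_threads, int regs_per_thread, int smem_per_block) {
    int warps_per_block = (block_threads + 31) / 32;
    int by_warps = MAX_WARPS / warps_per_block;
    int by_regs  = MAX_REGS / (regs_per_thread * block_threads);
    int by_smem  = smem_per_block ? MAX_SMEM / smem_per_block : MAX_BLOCKS;
    int blocks = imin(imin(by_warps, by_regs), imin(by_smem, MAX_BLOCKS));
    return blocks * warps_per_block;
}
```

With 256-thread blocks, 10 registers per thread and 4 KB of shared memory per block, four blocks fit and occupancy is 100% (32 of 32 warps); raising register use to 32 per thread halves it.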
OUTLINE
• Introduction to CUDA
• CUDA Programming Model
• Feature-Point Detection Acceleration in CUDA
• Free-Viewpoint Video Hybrid Acceleration
• Summary
FEATURE POINT DETECTOR ACCELERATION IN CUDA
• Real-time feature-point detection is needed for robot navigation, video mosaicing, video stabilization, etc.
• Feature-point detection is a computationally intensive task.
• It must be performed in real time to meet application requirements.
FEATURE POINT DETECTOR ACCELERATION IN CUDA
• Applications
• Video Stabilization
• Stereo Matching
• Medical Image partial co-registration
• Watermarking
• Morphing
Pipeline: Feature Detection → Feature Tracking → Homography Estimation → Projection → Video Stabilization
FEATURE POINT DETECTOR ACCELERATION IN CUDA
• Change of intensity for the shift (u,v):

  E(u,v) = Σ_{x,y} w(x,y) [ I(x+u, y+v) - I(x,y) ]²

• This can be seen as the quadratic approximation of the autocorrelation function:

  E(u,v) ≅ [u v] C [u v]ᵀ

  with

  C = Σ_{x ∈ W} [ gx·gx  gx·gy ]
                [ gx·gy  gy·gy ]

• The measure of cornerness is now given by:

  R = |C| - k·(trace(C))²    and    R_min = λ_min = min(λ1, λ2)

• R is then compared to a threshold, and the result is filtered and sorted using Non-Maximum Suppression
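The cornerness measure can be sanity-checked with a small sequential sketch (unit window weights and k = 0.04 are assumptions of this sketch):

```c
#include <assert.h>

/* Cornerness R = det(C) - k * trace(C)^2 from the structure tensor of
   n gradient samples (gx, gy), with unit weights and k = 0.04. */
double cornerness(const double *gx, const double *gy, int n) {
    double sxx = 0, sxy = 0, syy = 0;
    for (int i = 0; i < n; i++) {
        sxx += gx[i] * gx[i];
        sxy += gx[i] * gy[i];
        syy += gy[i] * gy[i];
    }
    double det = sxx * syy - sxy * sxy;
    double trace = sxx + syy;
    return det - 0.04 * trace * trace;
}
```

Gradients all in one direction (an edge) give one near-zero eigenvalue and hence negative R; gradients in both directions (a corner) give two large eigenvalues and positive R.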
FEATURE POINT DETECTOR ACCELERATION IN CUDA
P. Mainali, Q. Yang, G. Lafruit, R. Lauwereins and L. Van Gool, “LOCOCO: Low Complexity Corner Detector”, ICASSP 2010.
• Harris detector
1. Compute x and y derivatives Gx, Gy of the image filtered by a Gaussian
2. Compute the products of derivatives Gxx, Gxy, Gyy
3. Compute weighted averages of these products Sxx, Sxy, Syy
4. Form the matrix H = [Sxx, Sxy; Sxy, Syy] and estimate the cornerness R = det(H) - k (trace(H))²
5. Non-maximum suppression
• Lococo detector
1. Approximate the Gaussian derivative filter by a box filter on the integral image and compute G'x, G'y
2. Compute the products of derivatives Gxx, Gxy, Gyy
3. Compute the sums of these products S'xx, S'xy, S'yy
4. Form the matrix H = [S'xx, S'xy; S'xy, S'yy] and estimate the cornerness R = det(H) - k (trace(H))²
5. Non-maximum suppression
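The box-filter step of Lococo rests on a property of the integral image: any rectangular sum costs only four corner fetches. A minimal sequential sketch (the fixed 8x8 size and the names `integral`, `box_sum` are assumptions for illustration):

```c
#include <assert.h>

#define W 8  /* image side for this sketch */

/* Inclusive integral image: ii[y][x] = sum of img over [0..x] x [0..y]. */
void integral(int img[W][W], long ii[W][W]) {
    for (int y = 0; y < W; y++)
        for (int x = 0; x < W; x++)
            ii[y][x] = img[y][x]
                     + (x ? ii[y][x-1] : 0)
                     + (y ? ii[y-1][x] : 0)
                     - (x && y ? ii[y-1][x-1] : 0);
}

/* Sum of img over the rectangle [x0..x1] x [y0..y1]: four fetches. */
long box_sum(long ii[W][W], int x0, int y0, int x1, int y1) {
    long s = ii[y1][x1];
    if (x0) s -= ii[y1][x0-1];
    if (y0) s -= ii[y0-1][x1];
    if (x0 && y0) s += ii[y0-1][x0-1];
    return s;
}
```

A box-filtered derivative is then just the difference of two adjacent box sums (the `region1 - region2` of the kernel pseudocode), independent of the filter width.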
FEATURE POINT DETECTOR ACCELERATION IN CUDA
• Lococo in CUDA
• Image resolution: 960x960
• Speedup 16x
• Integral Image
• Sorting
Rajat Phull, Pradip Mainali, Qiong Yang, Patrice Rondao Alface, Henk Sips, “Low Complexity Corner
Detector Using CUDA for Multimedia Applications”, IARIA MMEDIA’11 International Conference,
Budapest, Hungary, 17-22 April 2011, accepted.
FEATURE POINT DETECTOR ACCELERATION IN CUDA
• Hardware used
• CPU
• Intel Core i7
• 5.06 GHz
• 8 MB Intel smart cache
• 4GB RAM
• GPU
• Nvidia’s GeForce GTX 280
• 1.3 GHz clock speed
• 240 CUDA cores
• 65535 threads
• 1 GB global memory
• 16 KB shared memory per core
• Memory bandwidth 147 GB/sec
• CPU-GPU bandwidth 1.4 GB/sec
• Compute capability 1.3
FEATURE POINT DETECTOR ACCELERATION IN CUDA
• Integral image
• Mostly a sequential algorithm, but…
• A prefix-sum parallel algorithm computes the sums along the rows
• Transpose the result using shared memory and block pre-fetches
• Run the prefix sum on the rows again
• The transpose step is needed in order to optimize the memory reads
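The three steps above can be checked with a small sequential reference (a sketch; the fixed 4x4 size and the helper names are illustrative): row prefix sums, transpose, row prefix sums again yield exactly the integral image.

```c
#include <assert.h>

#define N 4  /* matrix side for this sketch */

/* Inclusive prefix sum along each row. */
void scan_rows(long m[N][N]) {
    for (int y = 0; y < N; y++)
        for (int x = 1; x < N; x++)
            m[y][x] += m[y][x-1];
}

void transpose(long m[N][N]) {
    for (int y = 0; y < N; y++)
        for (int x = y + 1; x < N; x++) {
            long t = m[y][x]; m[y][x] = m[x][y]; m[x][y] = t;
        }
}

/* Integral image = row scan, transpose, row scan (transposed back so
   that m[y][x] indexes as usual). */
void integral_image(long m[N][N]) {
    scan_rows(m);
    transpose(m);
    scan_rows(m);
    transpose(m);
}
```

On the GPU the point of the transpose is that both scan passes then read rows, i.e. consecutive addresses, keeping the global-memory accesses coalesced.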
FEATURE POINT DETECTOR ACCELERATION IN CUDA
• Prefix-sum parallel implementation
• Scan: Mark Harris, Shubhabrata Sengupta, and John D. Owens. “Parallel Prefix Sum (Scan) with CUDA”. In Hubert Nguyen, editor, GPU Gems 3, chapter 39, pages 851–876. Addison Wesley, August 2007
• Exploits shared memory
Up-sweep (reduction) Down-sweep
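The up-sweep/down-sweep pattern of the work-efficient scan from the chapter cited above can be sketched sequentially (this sketch assumes a power-of-two length and computes an exclusive scan, as in the original):

```c
#include <assert.h>

/* Work-efficient exclusive scan (Blelchoch-style): up-sweep builds a
   reduction tree in place, down-sweep distributes the sums back down.
   n must be a power of two. */
void exclusive_scan(int *a, int n) {
    /* up-sweep (reduction) */
    for (int d = 1; d < n; d *= 2)
        for (int i = 2*d - 1; i < n; i += 2*d)
            a[i] += a[i - d];
    a[n - 1] = 0;  /* clear the root */
    /* down-sweep */
    for (int d = n / 2; d >= 1; d /= 2)
        for (int i = 2*d - 1; i < n; i += 2*d) {
            int t = a[i - d];
            a[i - d] = a[i];
            a[i] += t;
        }
}
```

Both sweeps do O(n) additions in total, which is why this variant beats the naive O(n log n) scan on the GPU; in the CUDA version each inner loop becomes one parallel step over threads in shared memory.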
FEATURE POINT DETECTOR ACCELERATION IN CUDA
2. Approximate the Gaussian derivative filter by a box filter on the integral image and compute Gx, Gy
• Box filters can be computed easily in CUDA. Options:
1. Compute from global memory directly, with memory fetches at the x±4, y±4 positions
2. Pre-store the input from global memory into shared memory for optimized reads
Pseudo code (executed by all the threads):
  xBlock = blockDim.x * blockIdx.x;
  yBlock = blockDim.y * blockIdx.y;
  index = pitch * (yBlock + threadIdx.y) + xBlock + threadIdx.x;
  region1 = iiA + iiD - iiB - iiC;
  region2 = iiE + iiH - iiF - iiG;
  out[index] = (region1 - region2) / (WIN * WIN);
FEATURE POINT DETECTOR ACCELERATION IN CUDA
3. Compute the products Gxx, Gxy, Gyy
4. Compute their sums Sxx, Sxy, Syy
• Combines the first two kernels. Options:
• Pre-compute Gxx, Gyy, Gxy, store them in global memory, then compute integral images of the results
• Fuse both kernels, avoiding a pre-store of the three gradient products in global memory
• But the bandwidth cost is high…
• Extra computations make the scan implementation less efficient by using more resources per thread
• Both options give similar results
FEATURE POINT DETECTOR ACCELERATION IN CUDA
5. Evaluate the cornerness R from the matrix H
• Pixel-wise evaluation of a simple expression from Sxx, Sxy, Syy
• Optimized reads by coalescing the pointers
• Integral images computed with the prefix sum
• gx and gy are squared and multiplied during the scan operation
Pseudo code (executed by all the threads):
  xBlock = blockDim.x * blockIdx.x;
  yBlock = blockDim.y * blockIdx.y;
  index = pitch * (yBlock + threadIdx.y) + xBlock + threadIdx.x;
  Gxx = iiA + iiD - iiB - iiC;  // corner fetches from the integral image of gx*gx
  Gyy = iiA + iiD - iiB - iiC;  // idem, from the integral image of gy*gy
  Gxy = iiA + iiD - iiB - iiC;  // idem, from the integral image of gx*gy
  out[index] = Gxx * Gyy - Gxy * Gxy - 0.04 * (Gxx + Gyy) * (Gxx + Gyy);
FEATURE POINT DETECTOR ACCELERATION IN CUDA
6. Non-maximum suppression
• Sequential algorithm, but the sorting can be launched in parallel using radix sort
Nadathur Satish, Mark Harris, and Michael Garland. “Designing Efficient Sorting Algorithms for Manycore GPUs”. Proc. 23rd IEEE International Parallel & Distributed Processing Symposium, May 2009.
1. Load a 2dx2d block from global memory into shared memory
2. Compute the max value in each dxd region
3. If a candidate maximum > threshold, then compute the maximum in its neighborhood
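The thresholded neighborhood test of step 3 can be sketched sequentially (the 3x3 neighborhood, the fixed 8x8 image and the name `is_feature` are assumptions of this sketch; the tiled shared-memory version evaluates the same predicate per dxd region):

```c
#include <assert.h>

#define SZ 8  /* image side for this sketch */

/* A pixel survives non-maximum suppression if its response exceeds the
   threshold and is the strict maximum of its 3x3 neighbourhood. */
int is_feature(float r[SZ][SZ], int x, int y, float thresh) {
    if (r[y][x] <= thresh) return 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            int nx = x + dx, ny = y + dy;
            if ((dx || dy) && nx >= 0 && nx < SZ && ny >= 0 && ny < SZ
                && r[ny][nx] >= r[y][x])
                return 0;
        }
    return 1;
}
```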
FEATURE POINT DETECTOR ACCELERATION IN CUDA
Image size: 960x960
Feature count: 1000
Speedup: 16x
CPU: Intel Core 2 Duo, 2.66 GHz And 2 GB RAM. GPU: Nvidia GeForce GTX 280
FEATURE POINT DETECTOR ACCELERATION IN CUDA
Efficient for applications that require multi-scale estimation
LOCOCO is 2-3 times faster than Harris on the same GPU
FEATURE POINT DETECTOR ACCELERATION IN CUDA
Method           Time (ms)   Image size   Platform
LOCOCO-CUDA      2.4         640x480      GF 280 GTX
L. Teixeira [3]  7.3         640x480      GF 8800 GTX
Sinha [4]        61.7        720x576      GF 8800 GTX + AMD Athlon 64 X2 Dual Core
[3] L. Teixeira, W. Celes and M. Gattass, “Accelerated Corner Detector Algorithms”, in BMVC, 2008.
[4] S. Sinha, J. Frahm and M. Pollefeys, “GPU-based video feature tracking and matching”, in EDGE, 2006.
FEATURE POINT DETECTOR ACCELERATION IN CUDA
OUTLINE
• Introduction to CUDA
• CUDA Programming Model
• Feature-Point Detection Acceleration in CUDA
• Free-Viewpoint Video Hybrid Acceleration
• Summary
FREE-VIEWPOINT VIDEO HYBRID ACCELERATION
FREE-VIEWPOINT VIDEO HYBRID ACCELERATION
• 5 Kernels corresponding to different “dwarfs”…
Sammy Rogmans, Maarten Dumont, Gauthier Lafruit, and Philippe Bekaert, "Migrating Real-Time
Image-Based Rendering from Traditional to Next-Gen GPGPU," in proceedings of 3DTV-CON: The
True Vision Capture, Transmission and Display of 3D Video, Potsdam, Germany, May 2009.
FREE-VIEWPOINT VIDEO HYBRID ACCELERATION
• Traditional GPGPU vs CUDA
• Sparse matrices: Traditional GPGPU; dense matrices: CUDA
FREE-VIEWPOINT VIDEO HYBRID ACCELERATION
• CUDA: Shared memory allows for user controlled data cache management
• Example: N x M Convolution kernels
FREE-VIEWPOINT VIDEO HYBRID ACCELERATION
• Sparse computational masks:
FREE-VIEWPOINT VIDEO HYBRID ACCELERATION
• cost computation (K1), cost aggregation (K2), disparity selection (K3), image warping (K4) and occlusion handling (K5)
FREE-VIEWPOINT VIDEO HYBRID ACCELERATION
• Interoperability between CUDA and DirectX or OpenGL makes it possible to design a hybrid chain:
• Only K2 (cost aggregation) is significantly faster in CUDA
• Only K5 (occlusion handling) is significantly faster in Traditional GPGPU
• Kernel to Kernel communication considerations:
• Interoperability uses a sort of semaphore lock/unlock of the VRAM: minimize transitions between CUDA and Traditional GPGPU as much as possible!
• K1 and K2 in CUDA
• K3, K4 and K5 in Traditional GPGPU
CONCLUSIONS
• Think parallel / Amdahl’s law / 7 or 13 dwarfs…
• There are other platforms such as FPGA, Cell, TILERA, CPU…
• CUDA offers an easy extended-C programming model
• Traditional GPGPU behaves better on sparse data processing, CUDA on dense data processing but can be combined
• CPU-GPU bandwidth is ruled by PCI Express (a few GB/s)
• Have a look at the forums and libraries: CuFFT, CuBLAS, CuLAPACK, CuPP…
• CUDA 4.0 is coming…