GPGPU - UCLouvain
Patrice Rondao Alface, 7 March 2011
A Guided Tour on the Migration of Real-time Multimedia Applications on Next-Gen GPGPU
ALCATEL-LUCENT BELL LABS BELGIUM
• Bell Labs established in 1925
• Global presence with 8 research centers in the USA, France, Belgium, Germany, Ireland, India, China and South Korea.
• Well known for inventions such as the transistor, the laser, DSL, UNIX, DWDM, MIMO, C, C++…
• 27,600 Active Patents and 400 publications and conference papers per year.
• 7 Nobel Prizes in Physics, 9 U.S. National Medals of Science and 12 U.S. National Medals of Technology.
• Bell Labs Belgium (Antwerp):
• 150 researchers: largest ICT research center in Belgium
• video & immersion
• next generation access
• telco cloud
• connected devices
RESEARCH ACTIVITIES
Layered Panoramic and Omnidirectional A/V Capturing
Video Analysis and Automated Editing
Automated shot framing
Region-of-Interest Detection and Tracking
Immersive and Interactive Applications for End-Users
Gesture-based user interfaces
Flexible and Interactive A/V Rendering
Scalable delivery and in-Network Adaptation of A/V flows
ACKNOWLEDGMENTS
• CUDA research done at and in collaboration with IMEC, Leuven
• Special thanks to
• Gauthier Lafruit
• Sammy Rogmans
• Qiong Yang
• Pradip Mainali
• Rajat Phull
• …
OUTLINE
• Introduction to CUDA
• CUDA Programming Model
• Feature-Point Detection Acceleration in CUDA
• Free-Viewpoint Video Hybrid Acceleration
• Summary
INTRODUCTION TO CUDA - COMPUTE UNIFIED DEVICE ARCHITECTURE
source: http://www.nvidia.com
INTRODUCTION TO CUDA - RESEARCH
[Chart: number of IEEE publications on CUDA per year, 2007-2010, rising from near 0 to about 350]
INTRODUCTION TO CUDA - WHY GPGPU PROGRAMMING?
source: http://www.nvidia.com
INTRODUCTION TO CUDA - WHY GPGPU PROGRAMMING?
• GPU = Massively parallel processors
• Calculation:
• 800 GFLOPS vs. 80 GFLOPS
• Memory Bandwidth:
• 86.4 GB/s vs. 8.4 GB/s
• Until 2006, mostly programmed through graphics API
• Now next-gen GPGPU programming is available with Brook+ and CUDA
source: http://www.nvidia.com
INTRODUCTION TO CUDA - PARALLEL PROGRAMMING AND ARCHITECTURES
• Necessary pre-requisites before optimizing code on parallel architectures:
• “Patterns for Parallel Programming” by T. G. Mattson et al., 2004, ISBN-13: 978-0321228116
• “The Landscape of Parallel Computing Research: A View from Berkeley 2.0” by Asanovic et al. 2007: http://www.cs.bris.ac.uk/Teaching/Resources/COMS35101/resources/berkeleyview2.0-ACACES20070716.pdf
• 13 dwarfs
1. Finite State Machine
2. Combinational Logic
3. Graph Traversal
4. Structured Grids
5. Dense Linear Algebra
6. Sparse Linear Algebra
7. Spectral Methods (FFT)
8. Dynamic Programming
9. N-Body Methods
10. MapReduce
11. Back-track/Branch & Bound
12. Graphical Model Inference
13. Unstructured Grids
Asanovic:
“Claim: parallel arch., lang., compiler … must do at least these well to do future parallel apps well
Note: MapReduce is embarrassingly parallel; perhaps FSM is embarrassingly sequential?”
INTRODUCTION TO CUDA - PARALLEL PROGRAMMING AND ARCHITECTURES
source: “The Landscape of Parallel Computing Research: A View from Berkeley 2.0”
INTRODUCTION TO CUDA - AMDAHL'S LAW
• If P is the proportion of a program that can be made parallel, and
• (1 - P) is the proportion that cannot be parallelized,
• then the maximum speedup that can be achieved by using N processors is

speedup = 1 / ((1 - P) + P/N)
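The formula can be tried out directly; a minimal sketch (the function name `amdahl_speedup` is illustrative):

```c
#include <assert.h>

/* Maximum speedup per Amdahl's law: parallel fraction p, n processors. */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

Even a 95%-parallel program tops out below 20x speedup, no matter how many processors are thrown at it: the serial 5% dominates.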
INTRODUCTION TO CUDA - GPU: WHAT IS IT GOOD AT?
• The GPU is good at data-parallel processing
• The same computation executed on many data elements in parallel – low control flow overhead
with high floating point arithmetic intensity
• Many calculations per memory access
• (Currently also need high floating point to integer ratio)
• High floating-point arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches – Still need to avoid bandwidth saturation!
OUTLINE
• Introduction to CUDA
• CUDA Programming Model
• Feature-Point Detection Acceleration in CUDA
• Free-Viewpoint Video Hybrid Acceleration
• Summary
CUDA PROGRAMMING MODEL - GeForce 8800
source: http://www.nvidia.com
CUDA PROGRAMMING MODEL
source: http://www.nvidia.com
CUDA PROGRAMMING MODEL - MEMORY SPACES
source: http://www.nvidia.com
CUDA PROGRAMMING MODEL
• The GPU is viewed as a compute device that:
• Is a coprocessor to the CPU or host
• Has its own DRAM (device memory)
• Runs many threads in parallel
• Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads
• Differences between GPU and CPU threads
• GPU threads are extremely lightweight
• Very little creation overhead
• GPU needs 1000s of threads for full efficiency
• Multi-core CPU needs only a few
CUDA PROGRAMMING MODEL
• Warps
• Each block is split into SIMD groups of threads called warps
• Warps are swapped in and out via thread scheduling
• Threads within a warp execute in lock step
• Threads are assigned to warps consecutively by their thread ID
• Issue order of warps and blocks is undefined, but there are synchronization primitives
• Performance
• Branches are predicated
• Divergence within a warp should be avoided if possible
• Memory coherence extremely important
• (Always try to read/write in a coalesced manner)
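The warp bookkeeping above can be sketched with plain C helpers (warp size 32, as stated; `warp_of` and `warp_diverges` are illustrative names, not CUDA API):

```c
#include <assert.h>

#define WARP_SIZE 32

/* Threads are assigned to warps consecutively by their thread ID. */
int warp_of(int tid) { return tid / WARP_SIZE; }

/* A branch diverges inside a warp if its predicate differs between two
   threads of the same warp. Predicate here: "tid < boundary". */
int warp_diverges(int warp, int boundary) {
    int first = warp * WARP_SIZE, last = first + WARP_SIZE - 1;
    return (first < boundary) != (last < boundary);
}
```

A condition aligned to a multiple of 32 (e.g. `tid < 32`) splits cleanly between warps and costs nothing; `tid < 16` forces both sides of the branch to be predicated through warp 0.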
CUDA PROGRAMMING MODEL
• Compute Unified Device Architecture
• Unified hardware and software specification for parallel computation
• Simple extensions to C language to allow code to run on the GPU
• Developed by and for NVIDIA
• Benefits and Features
• Application controlled SIMD program structure
• Fully general load/store to GPU memory
• Totally untyped (not limited to texture storage)
• No limits on branching, looping, etc.
• Full integer and bit instructions
• Supports pointers
• Explicitly managed memory down to cache level
• No graphics code (although interoperability with OpenGL/D3D is supported)
CUDA PROGRAMMING MODEL - APPLICATION PROGRAMMING INTERFACE
• The API is an extension to the C/C++ programming language
• It consists of:
• Language extensions
• To target portions of the code for execution on the device
• Two stage compilation (e.g. nvcc + gcc)
• A runtime library split into:
• A common component providing built-in vector types and a subset of the C runtime library in both host and device codes
• A host component to control and access one or more devices from the host
• A device component providing device-specific functions
CUDA PROGRAMMING MODEL - EXTENDED C
CUDA PROGRAMMING MODEL - HINTS FOR ACCELERATION
1. Identify parallel code: Amdahl's law
2. Select the best memory to optimize read/write access
3. If possible, exploit data reuse using shared memory, but avoid bank conflicts
4. Minimize control flow in a kernel (execution granularity is a warp of 32 threads) and avoid unnecessary __syncthreads()
5. Optimize GPU occupancy, taking block size, registers per thread and shared memory into consideration
(Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps)

speedup = 1 / ((1 - P) + P/N)
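Hint 5 can be made concrete with a small occupancy estimate. A hedged sketch, assuming the resource limits of a compute-capability-1.3 multiprocessor (32 warps / 1024 threads, 16384 registers, 16 KB shared memory, 8 resident blocks):

```c
#include <assert.h>

/* Assumed per-multiprocessor limits for a GTX 280 class device. */
#define MAX_WARPS  32
#define MAX_REGS   16384
#define MAX_SMEM   16384
#define MAX_BLOCKS 8

static int imin(int a, int b) { return a < b ? a : b; }

/* Active warps per multiprocessor, limited by whichever resource runs
   out first: warp slots, registers, shared memory or block slots. */
int active_warps(int block_threads, int regs_per_thread, int smem_per_block) {
    int warps_per_block = (block_threads + 31) / 32;
    int by_warps = MAX_WARPS / warps_per_block;
    int by_regs  = MAX_REGS / (regs_per_thread * block_threads);
    int by_smem  = smem_per_block ? MAX_SMEM / smem_per_block : MAX_BLOCKS;
    int blocks = imin(imin(by_warps, by_regs), imin(by_smem, MAX_BLOCKS));
    return blocks * warps_per_block;
}
```

With 256-thread blocks, 10 registers per thread and 4 KB of shared memory per block, four blocks fit and occupancy is 100% (32 of 32 warps); raising register use to 32 per thread halves it.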
OUTLINE
• Introduction to CUDA
• CUDA Programming Model
• Feature-Point Detection Acceleration in CUDA
• Free-Viewpoint Video Hybrid Acceleration
• Summary
FEATURE POINT DETECTOR ACCELERATION IN CUDA
• Real-time feature-point detection is needed for robot navigation, video mosaicing, video stabilization, etc.
• Feature-point detection is a computationally intensive task.
• It must be performed in real time to meet application requirements.
FEATURE POINT DETECTOR ACCELERATION IN CUDA
• Applications
• Video Stabilization
• Stereo Matching
• Medical Image partial co-registration
• Watermarking
• Morphing
Pipeline: Feature Detection → Feature Tracking → Homography Estimation → Projection → Video Stabilization
FEATURE POINT DETECTOR ACCELERATION IN CUDA
• Change of intensity for the shift (u,v):

  E(u,v) = Σ_{x,y} w(x,y) [ I(x+u, y+v) - I(x,y) ]²

• This can be seen as the quadratic approximation of the autocorrelation function:

  E(u,v) ≅ [u v] C [u v]ᵀ

  with

  C = Σ_{x ∈ W} [ gx·gx  gx·gy ]
                [ gx·gy  gy·gy ]

• The measure of cornerness is now given by:

  R = |C| - k·(trace(C))²    and    R_min = λ_min = min(λ1, λ2)

• R is then compared to a threshold, and the result is filtered and sorted using Non-Maximum Suppression
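The cornerness measure can be sanity-checked with a small sequential sketch (unit window weights and k = 0.04 are assumptions of this sketch):

```c
#include <assert.h>

/* Cornerness R = det(C) - k * trace(C)^2 from the structure tensor of
   n gradient samples (gx, gy), with unit weights and k = 0.04. */
double cornerness(const double *gx, const double *gy, int n) {
    double sxx = 0, sxy = 0, syy = 0;
    for (int i = 0; i < n; i++) {
        sxx += gx[i] * gx[i];
        sxy += gx[i] * gy[i];
        syy += gy[i] * gy[i];
    }
    double det = sxx * syy - sxy * sxy;
    double trace = sxx + syy;
    return det - 0.04 * trace * trace;
}
```

Gradients all in one direction (an edge) give one near-zero eigenvalue and hence negative R; gradients in both directions (a corner) give two large eigenvalues and positive R.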
FEATURE POINT DETECTOR ACCELERATION IN CUDA
P. Mainali, Q. Yang, G. Lafruit, R. Lauwereins and L. Van Gool, “LOCOCO: Low Complexity Corner Detector”, ICASSP 2010.
• Harris detector
1. Compute x and y derivatives Gx, Gy of the image filtered by a Gaussian
2. Compute the products of derivatives Gxx, Gxy, Gyy
3. Compute weighted averages of these products Sxx, Sxy, Syy
4. Form the matrix H = [Sxx, Sxy; Sxy, Syy] and estimate the cornerness R = det(H) - k (trace(H))²
5. Non-maximum suppression
• Lococo detector
1. Approximate the Gaussian derivative filter by a box filter on the integral image and compute G'x, G'y
2. Compute the products of derivatives Gxx, Gxy, Gyy
3. Compute the sums of these products S'xx, S'xy, S'yy
4. Form the matrix H = [S'xx, S'xy; S'xy, S'yy] and estimate the cornerness R = det(H) - k (trace(H))²
5. Non-maximum suppression
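The box-filter step of Lococo rests on a property of the integral image: any rectangular sum costs only four corner fetches. A minimal sequential sketch (the fixed 8x8 size and the names `integral`, `box_sum` are assumptions for illustration):

```c
#include <assert.h>

#define W 8  /* image side for this sketch */

/* Inclusive integral image: ii[y][x] = sum of img over [0..x] x [0..y]. */
void integral(int img[W][W], long ii[W][W]) {
    for (int y = 0; y < W; y++)
        for (int x = 0; x < W; x++)
            ii[y][x] = img[y][x]
                     + (x ? ii[y][x-1] : 0)
                     + (y ? ii[y-1][x] : 0)
                     - (x && y ? ii[y-1][x-1] : 0);
}

/* Sum of img over the rectangle [x0..x1] x [y0..y1]: four fetches. */
long box_sum(long ii[W][W], int x0, int y0, int x1, int y1) {
    long s = ii[y1][x1];
    if (x0) s -= ii[y1][x0-1];
    if (y0) s -= ii[y0-1][x1];
    if (x0 && y0) s += ii[y0-1][x0-1];
    return s;
}
```

A box-filtered derivative is then just the difference of two adjacent box sums (the `region1 - region2` of the kernel pseudocode), independent of the filter width.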
FEATURE POINT DETECTOR ACCELERATION IN CUDA
• Lococo in CUDA
• Image resolution: 960x960
• Speedup 16x
• Integral Image
• Sorting
Rajat Phull, Pradip Mainali, Qiong Yang, Patrice Rondao Alface, Henk Sips, “Low Complexity Corner
Detector Using CUDA for Multimedia Applications”, IARIA MMEDIA’11 International Conference,
Budapest, Hungary, 17-22 April 2011, accepted.
FEATURE POINT DETECTOR ACCELERATION IN CUDA
• Hardware used
• CPU
• Intel Core i7
• 5.06 GHz
• 8 MB Intel smart cache
• 4GB RAM
• GPU
• Nvidia’s GeForce GTX 280
• 1.3 GHz clock speed
• 240 CUDA cores
• 65535 threads
• 1 GB global memory
• 16 KB shared memory per core
• Memory bandwidth 147 GB/sec
• CPU-GPU bandwidth 1.4 GB/sec
• Compute capability 1.3
FEATURE POINT DETECTOR ACCELERATION IN CUDA
• Integral image
• Mostly a sequential algorithm, but…
• A prefix-sum parallel algorithm computes the sums along the rows
• Transpose the result using shared memory and block pre-fetches
• Run the prefix sum on the rows again
• The transpose step is needed in order to optimize the memory reads
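The three steps above can be checked with a small sequential reference (a sketch; the fixed 4x4 size and the helper names are illustrative): row prefix sums, transpose, row prefix sums again yield exactly the integral image.

```c
#include <assert.h>

#define N 4  /* matrix side for this sketch */

/* Inclusive prefix sum along each row. */
void scan_rows(long m[N][N]) {
    for (int y = 0; y < N; y++)
        for (int x = 1; x < N; x++)
            m[y][x] += m[y][x-1];
}

void transpose(long m[N][N]) {
    for (int y = 0; y < N; y++)
        for (int x = y + 1; x < N; x++) {
            long t = m[y][x]; m[y][x] = m[x][y]; m[x][y] = t;
        }
}

/* Integral image = row scan, transpose, row scan (transposed back so
   that m[y][x] indexes as usual). */
void integral_image(long m[N][N]) {
    scan_rows(m);
    transpose(m);
    scan_rows(m);
    transpose(m);
}
```

On the GPU the point of the transpose is that both scan passes then read rows, i.e. consecutive addresses, keeping the global-memory accesses coalesced.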
FEATURE POINT DETECTOR ACCELERATION IN CUDA
• Prefix-sum parallel implementation
• Scan: Mark Harris, Shubhabrata Sengupta, and John D. Owens. “Parallel Prefix Sum (Scan) with CUDA”. In Hubert Nguyen, editor, GPU Gems 3, chapter 39, pages 851–876. Addison Wesley, August 2007
• Exploits shared memory
Up-sweep (reduction) Down-sweep
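The up-sweep/down-sweep pattern of the work-efficient scan from the chapter cited above can be sketched sequentially (this sketch assumes a power-of-two length and computes an exclusive scan, as in the original):

```c
#include <assert.h>

/* Work-efficient exclusive scan (Blelchoch-style): up-sweep builds a
   reduction tree in place, down-sweep distributes the sums back down.
   n must be a power of two. */
void exclusive_scan(int *a, int n) {
    /* up-sweep (reduction) */
    for (int d = 1; d < n; d *= 2)
        for (int i = 2*d - 1; i < n; i += 2*d)
            a[i] += a[i - d];
    a[n - 1] = 0;  /* clear the root */
    /* down-sweep */
    for (int d = n / 2; d >= 1; d /= 2)
        for (int i = 2*d - 1; i < n; i += 2*d) {
            int t = a[i - d];
            a[i - d] = a[i];
            a[i] += t;
        }
}
```

Both sweeps do O(n) additions in total, which is why this variant beats the naive O(n log n) scan on the GPU; in the CUDA version each inner loop becomes one parallel step over threads in shared memory.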
FEATURE POINT DETECTOR ACCELERATION IN CUDA
2. Approximate the Gaussian derivative filter by a box filter on the integral image and compute Gx, Gy
• Box filters can be computed easily in CUDA. Options:
1. Compute from global memory directly, with memory fetches at the x±4, y±4 positions
2. Pre-store the input from global memory into shared memory for optimized reads
Pseudo code (executed by all the threads):
  xBlock = blockDim.x * blockIdx.x;
  yBlock = blockDim.y * blockIdx.y;
  index = pitch * (yBlock + threadIdx.y) + xBlock + threadIdx.x;
  region1 = iiA + iiD - iiB - iiC;
  region2 = iiE + iiH - iiF - iiG;
  out[index] = (region1 - region2) / (WIN * WIN);
FEATURE POINT DETECTOR ACCELERATION IN CUDA
3. Compute the products Gxx, Gxy, Gyy
4. Compute their sums Sxx, Sxy, Syy
• Combines the first two kernels. Options:
• Pre-compute Gxx, Gyy, Gxy, store them in global memory, then compute integral images of the results
• Fuse both kernels, avoiding a pre-store of the three gradient products in global memory
• But the bandwidth cost is high…
• Extra computations make the scan implementation less efficient by using more resources per thread
• Both options give similar results
FEATURE POINT DETECTOR ACCELERATION IN CUDA
5. Evaluate the cornerness R from the matrix H
• Pixel-wise evaluation of a simple expression from Sxx, Sxy, Syy
• Optimized reads by coalescing the pointers
• Integral images computed with the prefix sum
• gx and gy are squared and multiplied during the scan operation
Pseudo code (executed by all the threads):
  xBlock = blockDim.x * blockIdx.x;
  yBlock = blockDim.y * blockIdx.y;
  index = pitch * (yBlock + threadIdx.y) + xBlock + threadIdx.x;
  Gxx = iiA + iiD - iiB - iiC;  // corner fetches from the integral image of gx*gx
  Gyy = iiA + iiD - iiB - iiC;  // idem, from the integral image of gy*gy
  Gxy = iiA + iiD - iiB - iiC;  // idem, from the integral image of gx*gy
  out[index] = Gxx * Gyy - Gxy * Gxy - 0.04 * (Gxx + Gyy) * (Gxx + Gyy);
FEATURE POINT DETECTOR ACCELERATION IN CUDA
6. Non-maximum suppression
• Sequential algorithm, but the sorting can be launched in parallel using radix sort
Nadathur Satish, Mark Harris, and Michael Garland. “Designing Efficient Sorting Algorithms for Manycore GPUs”. Proc. 23rd IEEE International Parallel & Distributed Processing Symposium, May 2009.
1. Load a 2dx2d block from global memory into shared memory
2. Compute the max value in each dxd region
3. If a candidate maximum > threshold, then compute the maximum in its neighborhood
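The thresholded neighborhood test of step 3 can be sketched sequentially (the 3x3 neighborhood, the fixed 8x8 image and the name `is_feature` are assumptions of this sketch; the tiled shared-memory version evaluates the same predicate per dxd region):

```c
#include <assert.h>

#define SZ 8  /* image side for this sketch */

/* A pixel survives non-maximum suppression if its response exceeds the
   threshold and is the strict maximum of its 3x3 neighbourhood. */
int is_feature(float r[SZ][SZ], int x, int y, float thresh) {
    if (r[y][x] <= thresh) return 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            int nx = x + dx, ny = y + dy;
            if ((dx || dy) && nx >= 0 && nx < SZ && ny >= 0 && ny < SZ
                && r[ny][nx] >= r[y][x])
                return 0;
        }
    return 1;
}
```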
FEATURE POINT DETECTOR ACCELERATION IN CUDA
Image size: 960x960
Feature count: 1000
Speedup: 16x
CPU: Intel Core 2 Duo, 2.66 GHz And 2 GB RAM. GPU: Nvidia GeForce GTX 280
FEATURE POINT DETECTOR ACCELERATION IN CUDA
Efficient for applications that require multi-scale estimation
LOCOCO is 2-3 times faster than Harris on the same GPU
FEATURE POINT DETECTOR ACCELERATION IN CUDA
Method           Time (ms)   Image size   Platform
LOCOCO-CUDA      2.4         640x480      GF 280 GTX
L. Teixeira [3]  7.3         640x480      GF 8800 GTX
Sinha [4]        61.7        720x576      GF 8800 GTX + AMD Athlon 64 X2 Dual Core
[3] L. Teixeira, W. Celes and M. Gattass, “Accelerated Corner Detector Algorithms”, in BMVC, 2008.
[4] S. Sinha, J. Frahm and M. Pollefeys, “GPU-based video feature tracking and matching”, in EDGE, 2006.
FEATURE POINT DETECTOR ACCELERATION IN CUDA
OUTLINE
• Introduction to CUDA
• CUDA Programming Model
• Feature-Point Detection Acceleration in CUDA
• Free-Viewpoint Video Hybrid Acceleration
• Summary
FREE-VIEWPOINT VIDEO HYBRID ACCELERATION
FREE-VIEWPOINT VIDEO HYBRID ACCELERATION
• 5 Kernels corresponding to different “dwarfs”…
Sammy Rogmans, Maarten Dumont, Gauthier Lafruit, and Philippe Bekaert, "Migrating Real-Time
Image-Based Rendering from Traditional to Next-Gen GPGPU," in proceedings of 3DTV-CON: The
True Vision Capture, Transmission and Display of 3D Video, Potsdam, Germany, May 2009.
FREE-VIEWPOINT VIDEO HYBRID ACCELERATION
• Traditional GPGPU vs CUDA
• Sparse matrices: Traditional GPGPU; dense matrices: CUDA
FREE-VIEWPOINT VIDEO HYBRID ACCELERATION
• CUDA: Shared memory allows for user controlled data cache management
• Example: N x M Convolution kernels
FREE-VIEWPOINT VIDEO HYBRID ACCELERATION
• Sparse computational masks:
FREE-VIEWPOINT VIDEO HYBRID ACCELERATION
• cost computation (K1), cost aggregation (K2), disparity selection (K3), image warping (K4) and occlusion handling (K5)
FREE-VIEWPOINT VIDEO HYBRID ACCELERATION
• Interoperability between CUDA and DirectX or OpenGL makes it possible to design a hybrid chain:
• Only K2 (cost aggregation) is significantly faster in CUDA
• Only K5 (occlusion handling) is significantly faster in Traditional GPGPU
• Kernel to Kernel communication considerations:
• Interoperability uses a sort of semaphore lock/unlock of the VRAM: minimize transitions between CUDA and Traditional GPGPU as much as possible!
• K1 and K2 in CUDA
• K3, K4 and K5 in Traditional GPGPU
CONCLUSIONS
• Think parallel / Amdahl’s law / 7 or 13 dwarfs…
• There are other platforms such as FPGA, Cell, TILERA, CPU…
• CUDA offers an easy extended-C programming model
• Traditional GPGPU behaves better on sparse data processing, CUDA on dense data processing but can be combined
• CPU-GPU bandwidth is ruled by PCI Express (a few GB/s)
• Have a look at the forums and libraries: CuFFT, CuBLAS, CuLAPACK, CuPP…
• CUDA 4.0 is coming…