GPU - Graphical Processing Unit

8/7/2019 GPU - Graphical Processing Unit

http://slidepdf.com/reader/full/gpu-graphical-processing-unit 1/69



Université de Mons

Thanks GPU



Table of contents

1. History & Summary

2. GPU and 3D rendering

3. Architecture of a GPU

4. GPU programming

5. CUDA

6. Conclusion



What is a GPU?

The GPU is a processor specialized in 3D graphics tasks

It offloads several tasks from the CPU (central processing unit)

Its highly parallel structure makes it more effective than a CPU for a range of complex algorithms

Floating-point calculation



Central Processing Unit: CPU

• An essential component of a computer.

• Interprets the instructions and processes the data of a program.

• Sequential processing (little data, but higher complexity)

• Needs to process more and more data for multimedia applications (games, CAD, …)



Evolution of the CPU

• Multimedia applications rely on dedicated algorithms

• These are linear algorithms that apply the same instructions to a large amount of data: this is called « vector processing »

• CPU architectures were adapted with multimedia extensions: Intel Pentium MMX, AMD Opteron (3DNow!)


Limitation of the CPU

• Each new generation of CPU with higher performance brings more features and functions to users

• Users want more and more functions, and expect the technology to keep up

• But the technology is limited, because the internal clock frequency of a CPU is physically limited



Solutions to work around the problem

• Multi-core: combine several CPU cores in one CPU

• Add a processor dedicated to multimedia applications: the GPU

BUT this requires parallel programming



Multi-core CPU

• Classic programming is not suited to multi-core architectures, because a sequential program uses one core and no more

• Classic programming + multi-core brings no improvement!

• Parallel programming is needed: the problem is divided into elementary tasks, which are processed simultaneously by several cores to decrease computation time
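As a minimal sketch of the decomposition described above, the following Python example splits a sum of squares into independent chunks and hands them to a pool of workers. The function names (`split`, `sum_of_squares`, `parallel_sum_of_squares`) and the choice of four workers are assumptions made for illustration. Note that CPython threads do not run Python bytecode simultaneously, so this shows the structure of a parallel program rather than an actual speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """Elementary task: process one slice of the data independently."""
    return sum(x * x for x in chunk)


def split(data, n_parts):
    """Divide the problem into at most n_parts contiguous chunks."""
    size = max(1, (len(data) + n_parts - 1) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, n_workers=4):
    """Run one elementary task per chunk, then combine the partial results."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


data = list(range(1000))
sequential = sum_of_squares(data)
parallel = parallel_sum_of_squares(data)
```

Both versions return the same value; only the way the work is divided differs. The final combination step (`sum(partial_sums)`) stays sequential, which is typical of this kind of decomposition.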



Multi-core CPU

• Parallel programming is complex

• Parallel programming is already used by scientists on supercomputers

• Multi-core CPUs are good, but not enough compared to GPUs



GPU Vs CPU


• Comparison of FLOPS performance (floating-point operations per second)



Origin of GPU


• Need to display a 2D projection of a 3D model in real time

CAD: to visualize a virtual object in 3D

Video games: to represent a virtual world

• 2 techniques: ray tracing and rasterization



The graphics card is often called the GPU

• The graphics card is an important part of the computer

• It is composed of memory, processors, registers and communication chipsets

• GPU = the graphics processor on this card

• Up to 240 parallel stream processors on a GPU @ 1500 MHz

• Single Instruction, Multiple Data [SIMD]
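To connect these figures with the FLOPS metric used in the GPU vs CPU comparison, here is a rough Python estimate of a theoretical peak. The processor count and clock come from the slide; the number of floating-point operations each processor completes per cycle (`flops_per_cycle`) is not given in the slides and is an assumption, so both results are only indicative.

```python
def peak_gflops(n_processors, clock_mhz, flops_per_cycle):
    """Theoretical peak in GFLOPS: processors x clock (MHz) x operations per cycle."""
    return n_processors * clock_mhz * flops_per_cycle / 1000.0


# 240 processors at 1500 MHz, as stated on the slide.
# flops_per_cycle is an assumed value: 1 for a single operation per cycle,
# 2 if a multiply and an add are counted as two operations.
one_op_per_cycle = peak_gflops(240, 1500, 1)
two_ops_per_cycle = peak_gflops(240, 1500, 2)
```

Real applications rarely reach such a peak, since memory bandwidth and instruction mix usually limit throughput first.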



The graphics card is often called the GPU


• GPU processors are organized in a pipeline



GPU Programming


Languages

Shading Language



GPGPU languages


CUDA

OpenCL

Accelerator

…



Programming Model


Array = texture

Kernel = fragment shader

Computation = graphics rendering

Feedback

GPGPU complexity

Memory access

Bandwidth

…






Basic need?

Show, in real time, a 2D projection (on the screen) of a 3D model

Ray tracing

Rasterization



There is a specific vocabulary for the GPU

Vertex

Texture

Pixel & fragment

Shader

Pipeline



Vertex

A vertex (plural: vertices) is commonly used to define a corner of a surface in a 3D model; each such point is given as a vector.

A vertex is represented by its X, Y and Z coordinates

This cube has 8 vertices
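A small Python sketch can make the cube example concrete. It builds the 8 vertices of a unit cube as (X, Y, Z) tuples and describes one face as two triangles that index into the vertex list. The unit size, the vertex ordering and the name `bottom_face` are assumptions chosen for illustration.

```python
from itertools import product

# Every combination of 0/1 on the three axes gives one corner of a unit cube.
cube_vertices = [(float(x), float(y), float(z))
                 for x, y, z in product((0, 1), repeat=3)]

# A surface is built from triangles; each triangle stores three indices
# into the vertex list. These two triangles cover the face where z == 0.
bottom_face = [(0, 4, 6), (0, 6, 2)]

bottom_face_corners = {cube_vertices[i] for tri in bottom_face for i in tri}
```

Storing indices rather than repeating coordinates lets neighbouring triangles share the same vertices.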



Texture

A texture is a 2D image that is applied to a 3D object

It defines the perceived surface quality of the object



Pixel & Fragment

• A pixel is the smallest item of information in an image seen by the viewer

• A fragment is the data necessary to generate a single pixel of a drawing primitive. It consists of:

X, Y, Z coordinates

A color

A visibility depth

A fragment is NOT seen by the user



Shader

A shader is a simple program that describes the traits of either a vertex or a pixel (via the fragments).

It gives control over a subset of the GPU processors

Many special shading functions are defined by the major graphics software libraries (OpenGL and Direct3D)

3 types of shaders:

Vertex shader

Runs once for each vertex given to the processor

Transforms each vertex's 3D position into the 2D coordinates of the screen

Geometry shader

Adds and removes vertices

New type of shader (not present on every GPU)

Pixel (or fragment) shader

Calculates the color of individual pixels (lighting/shadow effects)



Pipeline

A pipeline is an ordered sequence of stages.

Each stage gets the data of the previous one, performs its own operation and sends the results to the next one.

A pipeline is « full » when all stages are working simultaneously: optimal use
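The data flow of a pipeline can be sketched in Python with chained generators, where each stage consumes the output of the previous one. The three stages here (`stage_scale`, `stage_offset`, `stage_clamp`) are arbitrary illustrative operations, not actual GPU stages, and Python generators interleave their work rather than running simultaneously as hardware stages do.

```python
def stage_scale(items):
    """First stage: double every incoming value."""
    for x in items:
        yield x * 2


def stage_offset(items):
    """Second stage: add one to the result of the previous stage."""
    for x in items:
        yield x + 1


def stage_clamp(items, lo=0, hi=10):
    """Last stage: keep every value inside [lo, hi]."""
    for x in items:
        yield max(lo, min(hi, x))


def run_pipeline(source):
    """Chain the stages: each one pulls its input from the stage before it."""
    return list(stage_clamp(stage_offset(stage_scale(source))))


result = run_pipeline([0, 1, 2, 7])
```

Each element travels through all stages in order, so a stage never needs to know anything about the stages around it.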



Current graphics pipeline

The graphics pipeline typically accepts some representation of a three-dimensional scene as input and produces a 2D raster image (an image made of pixels) as output.

OpenGL and Direct3D are two notable graphics pipeline models accepted as widespread industry standards.

The graphics pipeline contains 4 stages:

3 programmable stages

driven by the shaders

1 non-programmable stage

The rasterizer



Vertex stream from the CPU to the GPU




Pre-stage : Tessellation



Stage 1: Vertex shader (programmable)

• Objects are transformed from 3D world-space coordinates into a 3D coordinate system based on the position and orientation of a virtual camera

• Used to add special effects to objects in a 3D environment

• Runs once for each vertex given to the GPU

• Can change a vertex's properties such as position, color, texture coordinates, …

• One element in / one element out

• Cannot create new vertices



Stage 2: Geometry shader (programmable)

• One element in / 0 to ~100 elements out

• Can add and remove vertices

• Can be used to add volumetric detail (too costly for the CPU) or to refine the mesh

• Ex: 20 triangles → 100 smaller triangles

• Displacement mapping

• Most recently created type of shader (not always present in the pipeline)

Mesh size = size of the cells of the mesh



Stage 3: Rasterization (non-programmable) (1)

• Most popular technique for producing real-time 3D computer graphics (faster than ray tracing)

• Projection of the polygons of the 3D scene onto a 2D grid of the size of the output image

• Output fragments have the final image coordinates

2D vector to raster

Vector image (vertices) → raster image (fragments)

Polygon = set of triangles

Triangle = 3 vertices in 3D space



Stage 3: Rasterization (2)

The rasterization algorithm has at least 3 steps:

1. Calculation of the 2D coordinates (transformation)

2. Filtering of the vertices (clipping)

3. Rasterization itself (scan conversion)

4. Acceleration techniques (optional)

5. Further refinements



1. Calculation of the 2D coordinates (transformation)

Set of mathematical transformations:

• Translation, scaling, rotation: to put the 3D figure at the desired location (example: the origin)

• Projection: from 3D to 2D (orthogonal projection, which removes the z-component, or perspective projection)

These operations are done by multiplying the vertex's augmented (homogeneous) coordinate vector by different matrices

Ex: translation matrix

Ex: a man who turns his head
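As an illustration of the transformation step, this Python sketch applies a 4×4 translation matrix to a vertex written in augmented (homogeneous) form (x, y, z, 1), then performs an orthogonal projection by dropping the z-component. The helper names (`mat_vec`, `translation`, `orthographic_2d`) and the numeric values are assumptions for the example; real pipelines chain several such matrices.

```python
def mat_vec(matrix, vector):
    """Multiply a 4x4 matrix by a 4-component column vector."""
    return [sum(matrix[row][col] * vector[col] for col in range(4))
            for row in range(4)]


def translation(tx, ty, tz):
    """Translation matrix: the offsets sit in the last column."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]


def orthographic_2d(vector):
    """Orthogonal projection: keep x and y, remove the z-component."""
    return (vector[0], vector[1])


vertex = [1.0, 2.0, 3.0, 1.0]  # (x, y, z) plus the augmented coordinate 1
moved = mat_vec(translation(4.0, -2.0, 0.5), vertex)
screen = orthographic_2d(moved)
```

The fourth coordinate is what allows a translation, which is not a linear operation in 3D, to be written as a matrix product like scaling and rotation.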



2. Filtering of the vertices (clipping)

• The 2D locations of the triangles' vertices have been calculated BUT may lie outside the window (the area of the screen where the pixels will be written)

• Clipping is the process of truncating triangles to fit them inside the viewing area.

3. Rasterization itself (scan conversion)

• Fills the 2D triangles that are now in the image plane with pixels

• Example: drawing a line from coordinates (1,1) to (5,1) with a color gradient from blue to green will fill pixels (1,1), (2,1), (3,1), (4,1) and (5,1);

For each pixel, one has to determine its characteristics with a good balance:

(1,1) being totally blue, (2,1) less blue, (3,1) blue-green, …

• This is much more complicated for shapes like triangles, but the principle remains the same

• Difficulty: pixel aliasing

Use of a Z-buffer to determine which pixel is closest to the camera
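The line example above can be sketched in Python as follows. The code fills the pixels from (1,1) to (5,1) and interpolates the color linearly from blue to green. The RGB encoding, the rounding of color channels and the function names are assumptions for illustration; it handles only a horizontal line with x0 <= x1, whereas a real rasterizer deals with arbitrary triangles.

```python
BLUE = (0, 0, 255)
GREEN = (0, 255, 0)


def lerp_color(c0, c1, t):
    """Blend two RGB colors: t = 0 gives c0, t = 1 gives c1."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))


def rasterize_horizontal_line(x0, x1, y, c0, c1):
    """Return one (position, color) fragment per pixel from x0 to x1 on row y."""
    span = x1 - x0
    fragments = []
    for x in range(x0, x1 + 1):
        t = (x - x0) / span if span else 0.0
        fragments.append(((x, y), lerp_color(c0, c1, t)))
    return fragments


line = rasterize_horizontal_line(1, 5, 1, BLUE, GREEN)
positions = [position for position, _ in line]
```

The first fragment is fully blue, the last fully green, and the ones in between mix both channels in proportion to their distance along the line.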



4. Acceleration techniques

I. Backface culling: determines whether a polygon of a graphical object is visible; if it is not (it shows its back to the camera), it is culled

II. Spatial data structures
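Backface culling is commonly implemented with a sign test on the triangle's normal. The Python sketch below assumes that triangles whose vertices appear counter-clockwise from the camera are front-facing; that winding convention, like the helper names, is an assumption and not something stated in the slides.

```python
def sub(a, b):
    """Component-wise difference of two 3D vectors."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def is_back_facing(v0, v1, v2, camera):
    """True when the triangle's normal points away from the camera."""
    normal = cross(sub(v1, v0), sub(v2, v0))
    to_camera = sub(camera, v0)
    return dot(normal, to_camera) <= 0


camera = (0.0, 0.0, 5.0)
a, b, c = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
front = is_back_facing(a, b, c, camera)  # counter-clockwise seen from the camera
back = is_back_facing(a, c, b, camera)   # same triangle, reversed winding
```

For a closed object, roughly half of the triangles fail this test and can be skipped before rasterization.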



Stage 4: Fragment shader (programmable)

• Gives each pixel its final color (a function of lighting, reflection or refraction of light, …)

• Biggest consumer of computational resources

• Performs complex per-pixel effects and refinement techniques such as:

I. Texture filtering: to create clean images at any distance

II. Environment mapping: a form of texture mapping in which the texture coordinates are view-dependent, to simulate reflection on a shiny object

III. Shadows: traditionally not processed in the rasterizer; handled by modern techniques

Fragment shader (OpenGL) = Pixel shader (Direct3D)



Exit of the pipeline

• The fragment stream can:

Either be written to a framebuffer and then displayed on the screen

Or, if it needs more processing, be written to a texture and then read back by the CPU



Summary



The unified architecture arrived with the 6th generation of GPUs

Before: 2 types of processors in the GPU

Vertex units

Fragment units

A bottleneck appeared when one type was overloaded: not optimal

Since the GeForce 8, processors are no longer specialized

Optimal use of the pipeline: unified architecture



GPU evolution through the different generations

Gen | Year       | Nvidia               | AMD/ATI                        | Particularities
1   | 96         | TNT2                 | Rage                           | DirectX 6 = standard; rasterization of triangles and textures; limitation: no vertex processing; other vendor: 3dfx (Voodoo)
2   | 99         | GeForce 256          | Radeon 7500                    | OpenGL supported; vertex processing supported
3   | 01, 02     | GeForce 3, GeForce 4 | Radeon 8500                    | Nvidia buys 3dfx; programmable vertex processing
4   | 02         | GeForce FX           | Radeon 9700                    | Programmable fragment processing; first GPGPU operations
5   | 04, 05     | GeForce 6, GeForce 7 | Radeon X800, Radeon X1800      | Processing speed increases; GPGPU operations develop
6   | 06, 07, 08 | GeForce 8, GeForce 9 | Radeon HD 2000, Radeon HD 3000 | Geometry shader appears; unified architecture; Nvidia creates the CUDA language
7   | 08         | GeForce 200          | Radeon HD 4000                 | Not yet widespread; technical improvements (frequency, memory, number of processors, bandwidth, …)







Architecture of a GPU




Overview

Quick reminder:

Architecture of a CPU

CPU and its evolution

Drawbacks

Needs

SIMD/MIMD

Brief discussion of data management

Gathering/scattering and PRAM



Architecture of a CPU

Arithmetic logic unit (ALU), or computation unit:

• Performs all operations

Control unit:

• Manages all instructions

Cache:

• Fast memory access

• Expensive

• High volume

DRAM:

• Dynamic random-access memory

• Cheap but needs to be refreshed

Analogy: control = brain, ALU = hands, memory = tools

[Figure: CPU block diagram with a control unit, four ALUs, a cache and DRAM]



CPU processing

For a computer:

Program = several sequential instructions

Simple CPU: SISD (single instruction, single data)

Program code: Instruction1 → Instruction2 → Instruction3

• Instructions are computed one by one

• On a single piece of data at a time



CPU and its evolution

At first: SISD

In-order processors

Out-of-order processors (higher performance)

Instructions are dispatched to an instruction queue; the results are queued

The process is still sequential

High volume of cache memory

Need for fast access to instructions and data

Lots of « back and forth » on the data




Evolution and drawbacks

Evolution (Pentium 3)

SIMD (single instruction, multiple data)

Higher vector-processing performance

Reasons

Only a little « back and forth » on the data

The complexity of the algorithm is very

A high volume of cache memory and out-of-order execution are superfluous for multimedia applications

The CPU is perfect for sequential programs but weak for multimedia applications



Needs of the GPU

A GPU is a SIMD processor

To be able to process a lot of data

A high memory bandwidth

10 × the CPU bandwidth, to process lots of data in real time




Remaining to do

Discuss the new generation of GPUs

MIMD (multiple instruction, multiple data)

Compare MIMD and SIMD

Discuss data management

Gathering

Scattering

Discuss the PRAM model used in GPUs







CUDA

• CUDA (Compute Unified Device Architecture) is a development library created by NVIDIA in 2007.

• It makes it possible to use the power of a compatible graphics card for general-purpose computing.

• Programmers can use C, C++ or Fortran to develop applications using CUDA.

• Interfaces (wrappers) make it possible to use high-level languages such as Java, .NET or Python.



Different components of CUDA

• CUDA consists of a set of software layers to communicate with the GPU: a driver, a runtime and a few libraries.



CUDA Libraries

• They include the code of all the functions to be executed on the GPU.

• Using those libraries, developers can only use a set of predefined functions.

• They do not have direct access to the GPU itself.

• Examples:

• CUBLAS, which has a set of building blocks for linear algebra calculations on the GPU

• CUFFT, which can handle the calculation of Fourier transforms



High-level API: CUDA Runtime

• Also called « C for CUDA »

• The high-level API is implemented “above” the low-level API: each call to a Runtime function is broken down into more basic instructions managed by the Driver API

• The term “high-level API” is relative. Even the Runtime API is still what a lot of people would consider very low-level; yet it still offers functions that are highly practical for initialization.



Low-level API: CUDA Driver

• The Driver API is more complex to manage; it requires more work to launch processing on the GPU.

• The upside is that it’s more flexible, giving the programmer additional control.

• Note that the high-level and low-level APIs are mutually exclusive: the programmer must use one or the other, and it is not possible to mix function calls from both.



CUDA from the Hardware Point of View

• Nvidia’s shader core is made up of several clusters that Nvidia calls Texture Processor Clusters.

• Each cluster is made up of a texture unit and 2 streaming multiprocessors.



The streaming multiprocessor

• These processors consist of a front end that reads, decodes and launches instructions, and a back end made up of a group of eight calculating units and two SFUs (Special Function Units), where the instructions are executed in SIMD fashion.

• The same instruction is applied to all the threads in the warp. Nvidia calls this mode of execution SIMT (single instruction, multiple threads).

• The back end operates at double the frequency of the front end.



Streaming multiprocessors: operating mode

• At each cycle, a warp ready for execution is selected by the front end, which launches execution of an instruction.

• To apply the instruction to all 32 threads in the warp, the back end takes four cycles, but since it operates at double the frequency of the front end, from the front end's point of view only two cycles elapse.

• To avoid having the front end remain unused for one cycle, the ideal is to alternate types of instructions every cycle: a classic instruction for one cycle and an SFU instruction for the other.
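The cycle counts quoted above follow from simple arithmetic, sketched here in Python. The three constants are the figures given in the slides; the function name is hypothetical.

```python
WARP_SIZE = 32            # threads per warp
UNITS_PER_SM = 8          # calculating units in the back end
BACKEND_CLOCK_RATIO = 2   # the back end runs at twice the front-end frequency


def cycles_per_warp_instruction(warp_size=WARP_SIZE,
                                units=UNITS_PER_SM,
                                ratio=BACKEND_CLOCK_RATIO):
    """Return (back-end cycles, front-end cycles) needed for one warp instruction."""
    backend_cycles = warp_size // units        # 8 threads are served per cycle
    frontend_cycles = backend_cycles // ratio  # same duration, slower clock
    return backend_cycles, frontend_cycles


backend_cycles, frontend_cycles = cycles_per_warp_instruction()
```

Because one warp instruction occupies the calculating units for two front-end cycles, the front end has a spare cycle in which it can issue an instruction to the SFUs.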



Shared Memory

• Each multiprocessor has a small memory area called shared memory, with a size of 16 KB per multiprocessor.

• This memory area provides a way for threads in the same block to communicate. All the threads in a given block are executed by the same multiprocessor.

• The assignment of blocks to the different multiprocessors is completely undefined, meaning that two threads from different blocks can’t communicate during their execution.



Cache Memory - Registers

• To limit too-frequent access to the shared memory, Nvidia has also provided its multiprocessors with a cache (approximately 8 KB per multiprocessor) for access to constants and textures.

• The multiprocessors also have 8,192 registers that are shared among all the threads of all the blocks active on that multiprocessor. The number of active blocks per multiprocessor can’t exceed eight, and the number of active warps is limited to 24 (768 threads).



Optimizing a CUDA program

• The key is finding the optimum balance between the number of blocks and their size: more threads per block help mask the latency of memory operations, but at the same time the number of registers available per thread is reduced.

• Blocks of 512 threads would be particularly inefficient, since only one block might be active on a multiprocessor, potentially wasting 256 threads. So Nvidia advises using blocks of 128 to 256 threads, which offers the best compromise between masking latency and the number of registers needed for most kernels.
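The trade-off described above can be explored with a simplified occupancy estimate in Python. The limits (8,192 registers, 8 active blocks, 24 active warps of 32 threads) are the ones quoted in these slides; the function name and the choice of 10 registers per thread are assumptions. The model ignores shared memory usage and any register allocation granularity, so it is only a first approximation.

```python
REGISTERS_PER_SM = 8192
MAX_BLOCKS_PER_SM = 8
MAX_WARPS_PER_SM = 24
WARP_SIZE = 32


def occupancy(threads_per_block, regs_per_thread):
    """Return (active blocks, active threads) on one multiprocessor."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    limit_by_warps = MAX_WARPS_PER_SM // warps_per_block
    limit_by_registers = REGISTERS_PER_SM // (regs_per_thread * threads_per_block)
    blocks = min(MAX_BLOCKS_PER_SM, limit_by_warps, limit_by_registers)
    return blocks, blocks * threads_per_block


# 10 registers per thread is an assumed figure: it is the most each thread
# can use if all 768 threads are to be active (8192 // 768 == 10).
large_blocks = occupancy(512, 10)
medium_blocks = occupancy(256, 10)
small_blocks = occupancy(128, 10)
```

With 512-thread blocks only one block fits, leaving 256 of the 768 thread slots idle, which matches the inefficiency described in the slide; 128 or 256 threads per block fill all 768 slots under the same assumptions.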



Definitions

• Host: the CPU

• Device: the GPU

• Kernel: function executed on the GPU

• Thread: basic element of the data to be processed (very lightweight)

• Warp: group of 32 threads

• Block: set of 64 to 512 threads

• Grid: array of blocks
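To show how these terms fit together, the Python sketch below emulates a one-dimensional kernel launch on the CPU: a grid of blocks, each made of threads, where every thread derives a global index from its block and thread numbers. This is not CUDA code. The `launch` helper and the kernel signature are hypothetical, and the threads run one after another here, whereas a GPU executes them in parallel.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Emulated launch: call the kernel once per thread of every block."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vector_add(block_idx, block_dim, thread_idx, a, b, out):
    """Kernel: each thread processes exactly one element of the data."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(out):                        # the last block may overshoot
        out[i] = a[i] + b[i]


n = 1000
a = list(range(n))
b = [2 * x for x in a]
out = [0] * n

block_dim = 256                               # threads per block
grid_dim = (n + block_dim - 1) // block_dim   # enough blocks to cover n elements
launch(vector_add, grid_dim, block_dim, a, b, out)
```

The bounds check matters because the grid covers 1,024 threads for only 1,000 elements; the extra threads of the last block simply do nothing.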



Definitions (2)



CUDA from a Software Point of View

CUDA = a set of extensions to the C language

Type qualifiers for functions:

__global__ void function()

Function called by the CPU, executed on the GPU

__device__ void function()

Function called by and executed on the GPU

__host__ void function()

Standard function (executed on the CPU)




Software Point of View (2)

Restrictions on __device__ and __global__ functions:

1. Cannot be recursive

2. Must have a fixed number of arguments

Type qualifier for variables:

__shared__ variable

This variable will be stored in the multiprocessor’s shared memory




Compilation

1. CPU code is extracted and handed to the standard compiler

2. GPU code is converted into PTX code (assembly code) and scanned for inefficiencies

3. PTX is translated into GPU-specific commands that are encapsulated in the executable




A few application examples



ATI equivalent to Nvidia’s CUDA
