Mapping Computational Concepts to GPUs
Mark Harris, NVIDIA



Outline

• Data Parallelism and Stream Processing
• Computational Resources Inventory
• CPU-GPU Analogies
• Example: N-body gravitational simulation
• Parallel reductions
• Overview of Branching Techniques


The Importance of Data Parallelism

• GPUs are designed for graphics
  • Highly parallel tasks
• GPUs process independent vertices & fragments
  • Temporary registers are zeroed
  • No shared or static data
  • No read-modify-write buffers
• Data-parallel processing
  • GPU architecture is ALU-heavy
  • Multiple vertex & pixel pipelines, multiple ALUs per pipe
  • Hide memory latency (with more computation)


Arithmetic Intensity

• Arithmetic intensity
  • Ops per word transferred
  • Computation / bandwidth
• Best to have high arithmetic intensity
• Ideal GPGPU apps have
  • Large data sets
  • High parallelism
  • High independence between data elements
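
As a rough worked example (my own illustration, not from the slides): a SAXPY-style loop does about 2 floating-point ops per element while moving 3 words (two reads, one write), so its arithmetic intensity is only ~2/3 op per word and it is bandwidth-bound. By contrast, the N-body force kernel later in this talk does roughly 25 flops per body pair while reading just two 4-float position/mass values.

  /* Illustrative only: arithmetic intensity = ops / words transferred. */
  #include <stddef.h>

  /* ~2 flops per element, 3 words of memory traffic (read x, read y,
     write y): arithmetic intensity ~ 2/3, so memory bandwidth, not
     ALU throughput, limits performance. */
  void saxpy(size_t n, float a, const float *x, float *y)
  {
      for (size_t i = 0; i < n; ++i)
          y[i] = a * x[i] + y[i];
  }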


Data Streams & Kernels

• Streams
  • Collection of records requiring similar computation
    • Vertex positions, voxels, FEM cells, etc.
  • Provide data parallelism
• Kernels
  • Functions applied to each element in the stream
    • Transforms, PDEs, …
  • Few dependencies between stream elements
    • Encourages high arithmetic intensity


Example: Simulation Grid

• Common GPGPU computation style
  • Textures represent computational grids = streams
• Many computations map to grids
  • Matrix algebra
  • Image & volume processing
  • Physically-based simulation
  • Global illumination
    • Ray tracing, photon mapping, radiosity
• Non-grid streams can be mapped to grids


Stream Computation

• Grid simulation algorithm
  • Made up of steps
  • Each step updates the entire grid
  • Must complete before the next step can begin
• Grid is a stream, steps are kernels
  • Kernel applied to each stream element

[Figure: cloud simulation algorithm]


Scatter vs. Gather

• Grid communication
  • Grid cells share information
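
The extracted slide drops its usual illustration, so here is a minimal C sketch of the standard distinction (my own example): a gather reads from a computed address, a scatter writes to a computed address. As later slides note, fragment processors can gather (texture reads at arbitrary coordinates) but cannot scatter (the output pixel is fixed).

  /* Minimal sketch of the scatter/gather distinction (not from the slides). */
  void gather(float *out, const float *in, const int *idx, int n)
  {
      for (int i = 0; i < n; ++i)
          out[i] = in[idx[i]];    /* gather: read from a computed address  */
  }

  void scatter(float *out, const float *in, const int *idx, int n)
  {
      for (int i = 0; i < n; ++i)
          out[idx[i]] = in[i];    /* scatter: write to a computed address */
  }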


Computational Resources Inventory

• Programmable parallel processors
  • Vertex & fragment pipelines
• Rasterizer
  • Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants
• Texture unit
  • Read-only memory interface
• Render to texture
  • Write-only memory interface


Vertex Processor

• Fully programmable (SIMD / MIMD)
• Processes 4-vectors (RGBA / XYZW)
• Capable of scatter but not gather
  • Can change the location of the current vertex
  • Cannot read info from other vertices
  • Can only read a small constant memory
• Latest GPUs: vertex texture fetch
  • Random-access memory for vertices
  • Arguably still not gather


Fragment Processor

• Fully programmable (SIMD)
• Processes 4-component vectors (RGBA / XYZW)
• Random-access memory read (textures)
• Capable of gather but not scatter
  • RAM read (texture fetch), but no RAM write
  • Output address fixed to a specific pixel
• Typically more useful than the vertex processor
  • More fragment pipelines than vertex pipelines
  • Direct output (fragment processor is at the end of the pipeline)


CPU-GPU Analogies

• CPU programming is familiar
  • GPU programming is graphics-centric
• Analogies can aid understanding


CPU-GPU Analogies

• Stream / data array (CPU)  =  Texture (GPU)
• Memory read (CPU)  =  Texture sample (GPU)


Kernels

• Kernel / loop body / algorithm step (CPU)  =  Fragment program (GPU)
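
A minimal sketch of this analogy, in the style of the Cg kernel shown later in the talk (my own illustration; the names scaleAndBias and src are hypothetical): the body of a CPU loop over an array becomes a fragment program that runs once per output pixel, with the array read replaced by a texture sample.

  // CPU version: loop body applied to every array element
  //   for (int i = 0; i < n; ++i)
  //       out[i] = a * in[i] + b;

  // GPU version: the same "loop body" as a Cg fragment program.
  // It is invoked once per pixel of the output texture; the array
  // read becomes a texture sample at this pixel's coordinates.
  float4 scaleAndBias(float2 coords : TEXCOORD0,
                      uniform sampler2D src,   // input array as a texture
                      uniform float a,
                      uniform float b) : COLOR0
  {
      return a * tex2D(src, coords) + b;
  }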


Feedback

• Each algorithm step depends on the results of previous steps
• Each time step depends on the results of the previous time step


Feedback

• Array write (CPU), e.g. Grid[i][j] = x;  =  Render to texture (GPU)
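
As a hedged host-side sketch of what "render to texture" means: the code below uses the later OpenGL framebuffer-object API purely as an illustration (the course-era implementations used pbuffers, as a later slide mentions), so treat the specific calls and the GLEW dependency as assumptions about your GL setup.

  /* Sketch only: create a float texture and attach it to a framebuffer
     object, so that fragment-program output becomes the "array write"
     for the next pass. Assumes an extension loader such as GLEW exposes
     the FBO and float-texture entry points. */
  #include <GL/glew.h>

  GLuint create_render_target(int width, int height, GLuint *fbo_out)
  {
      GLuint tex, fbo;

      glGenTextures(1, &tex);
      glBindTexture(GL_TEXTURE_2D, tex);
      glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, width, height, 0,
                   GL_RGBA, GL_FLOAT, NULL);

      glGenFramebuffers(1, &fbo);
      glBindFramebuffer(GL_FRAMEBUFFER, fbo);
      glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                             GL_TEXTURE_2D, tex, 0);

      *fbo_out = fbo;
      return tex;   /* sample this texture as input in the following pass */
  }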


GPU Simulation Overview

• Analogies lead to implementation
  • Algorithm steps are fragment programs
    • Computational kernels
  • Current state is stored in textures
  • Feedback via render to texture
• One question: how do we invoke computation?


Invoking Computation

• Must invoke computation at each pixel
  • Just draw geometry!
  • Most common GPGPU invocation is a full-screen quad
• Other useful analogies
  • Rasterization = Kernel invocation
  • Texture coordinates = Computational domain
  • Vertex coordinates = Computational range


Typical “Grid” Computation

• Initialize “view” (so that pixels:texels::1:1)

  glMatrixMode(GL_MODELVIEW);
  glLoadIdentity();
  glMatrixMode(GL_PROJECTION);
  glLoadIdentity();
  glOrtho(0, 1, 0, 1, 0, 1);
  glViewport(0, 0, outTexResX, outTexResY);

• For each algorithm step:
  • Activate render-to-texture
  • Set up input textures, fragment program
  • Draw a full-screen quad (1x1)
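
The "draw a full-screen quad" step, in the fixed-function immediate-mode style of the era, looks roughly like this (a sketch under the 1:1 pixel:texel setup above, not the course's exact code):

  /* Draw one quad covering the unit square set up by glOrtho(0,1,0,1,0,1).
     Every pixel of the output texture receives exactly one fragment, so
     the bound fragment program runs once per grid cell. */
  glBegin(GL_QUADS);
      glTexCoord2f(0.0f, 0.0f); glVertex2f(0.0f, 0.0f);
      glTexCoord2f(1.0f, 0.0f); glVertex2f(1.0f, 0.0f);
      glTexCoord2f(1.0f, 1.0f); glVertex2f(1.0f, 1.0f);
      glTexCoord2f(0.0f, 1.0f); glVertex2f(0.0f, 1.0f);
  glEnd();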


Example: N-Body Simulation

• Brute force
  • N = 8192 bodies
  • N^2 gravity computations
  • 64M force comps. / frame
  • ~25 flops per force
  • 7.5 fps
  • 12.5+ GFLOPS sustained
    • GeForce 6800 Ultra

Nyland, Harris, Prins, GP2 2004 poster


Computing Gravitational Forces

• Each body attracts all other bodies
  • N bodies, so N^2 forces
• Draw into an NxN buffer
  • Pixel (i,j) computes force between bodies i and j
  • Very simple fragment program
    • More than 2048 bodies makes it trickier
      – Limited by max pbuffer size…
      – “exercise for the reader”


Computing Gravitational Forces

F(i,j) = g * M_i * M_j / r(i,j)^2,  where  r(i,j) = |pos(i) - pos(j)|

Force is proportional to the inverse square of the distance between bodies


Computing Gravitational Forces

[Figure: N x N force texture holding force(i,j), alongside the body position texture]

F(i,j) = g * M_i * M_j / r(i,j)^2,  where  r(i,j) = |pos(i) - pos(j)|

Coordinates (i,j) in the force texture are used to find bodies i and j in the body position texture


Computing Gravitational Forces

float4 force(float2 ij : WPOS,
             uniform sampler2D pos) : COLOR0
{
    // Pos texture is 2D, not 1D, so we need to
    // convert body indices into 2D coords for the pos texture
    float4 iCoords  = getBodyCoords(ij);
    float4 iPosMass = tex2D(pos, iCoords.xy);  // xyz = position, w = mass of body i
    float4 jPosMass = tex2D(pos, iCoords.zw);  // xyz = position, w = mass of body j

    float3 dir = iPosMass.xyz - jPosMass.xyz;
    float  r2  = dot(dir, dir);                // squared distance (zero when i == j)
    dir = normalize(dir);

    // g is the gravitational constant (a uniform declared elsewhere)
    return float4(dir * g * iPosMass.w * jPosMass.w / r2, 0);
}
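
getBodyCoords is not defined on the slide. One plausible implementation, entirely my assumption about the layout (bodies packed row-major into a posTexWidth x posTexHeight position texture, with the force buffer's x pixel index naming body i and y naming body j), might look like this:

  // Hypothetical helper: map the force-buffer pixel (i, j) to the
  // texture coordinates of bodies i and j in the position texture.
  uniform float posTexWidth;
  uniform float posTexHeight;

  float2 indexToPosCoord(float index)
  {
      float x = fmod(index, posTexWidth);
      float y = floor(index / posTexWidth);
      // +0.5 to hit texel centers; divide to get [0,1] texture coords
      return (float2(x, y) + 0.5) / float2(posTexWidth, posTexHeight);
  }

  float4 getBodyCoords(float2 ij)
  {
      return float4(indexToPosCoord(floor(ij.x)),    // body i in .xy
                    indexToPosCoord(floor(ij.y)));   // body j in .zw
  }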


Computing Total Force

• Have: array of (i,j) forces
• Need: total force on each particle i

[Figure: N x N force texture of force(i,j) values]


Computing Total Force

• Have: array of (i,j) forces
• Need: total force on each particle i
  • Sum of each column of the force array

[Figure: N x N force texture of force(i,j) values]


Computing Total Force

• Have: array of (i,j) forces
• Need: total force on each particle i
  • Sum of each column of the force array
• Can do all N columns in parallel

This is called a Parallel Reduction

[Figure: N x N force texture of force(i,j) values]


Parallel Reductions

• 1D parallel reduction:
  • Sum N columns or rows in parallel
  • Add two halves of the texture together

[Figure: N x N texture, two halves added together]


Parallel Reductions

• 1D parallel reduction:
  • Sum N columns or rows in parallel
  • Add two halves of the texture together
  • Repeatedly...

[Figure: N x (N/2) texture, two halves added together]


Parallel Reductions

• 1D parallel reduction:
  • Sum N columns or rows in parallel
  • Add two halves of the texture together
  • Repeatedly...

[Figure: N x (N/4) texture, two halves added together]


Parallel Reductions

• 1D parallel reduction:
  • Sum N columns or rows in parallel
  • Add two halves of the texture together
  • Repeatedly...
• Until we’re left with a single row of texels

[Figure: N x 1 result]

Requires log2(N) steps
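
A sketch of one reduction pass, in the same Cg style as the force kernel (my own illustration; the name reducePass and the uniforms are assumptions): each pass renders into a buffer half as tall as its input, and each output pixel adds the texel at its own position to the texel half the input height below it. On the host, the viewport height is halved and the input/output textures are swapped each pass until a single row remains.

  // One pass of a column-wise parallel reduction. The output buffer is
  // N wide and half as tall as the input. Output pixel (x, y) sums
  // input texels (x, y) and (x, y + halfHeight).
  float4 reducePass(float2 winPos : WPOS,          // output pixel position
                    uniform sampler2D src,
                    uniform float  halfHeight,     // input height / 2, in texels
                    uniform float2 invSrcSize) : COLOR0  // 1 / input dimensions
  {
      float2 a = winPos * invSrcSize;                           // upper-half texel
      float2 b = (winPos + float2(0, halfHeight)) * invSrcSize; // lower-half texel
      return tex2D(src, a) + tex2D(src, b);
  }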


Update Positions and Velocities

• Now we have a 1-D array of total forces
  • One per body
• Update velocity
  • u(i, t+dt) = u(i, t) + F_total(i) * dt
  • Simple pixel shader reads previous velocity and force textures, creates new velocity texture
• Update position
  • x(i, t+dt) = x(i, t) + u(i, t) * dt
  • Simple pixel shader reads previous position and velocity textures, creates new position texture
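
The two update shaders are not shown on the slide; as a hedged sketch in the same Cg style (assuming, as the slide's formula does, that mass is already folded into the total force, and that dt arrives as a uniform):

  // Velocity update: u(i, t+dt) = u(i, t) + F_total(i) * dt
  float4 updateVelocity(float2 coords : TEXCOORD0,
                        uniform sampler2D oldVel,
                        uniform sampler2D totalForce,
                        uniform float dt) : COLOR0
  {
      return tex2D(oldVel, coords) + tex2D(totalForce, coords) * dt;
  }

  // Position update: x(i, t+dt) = x(i, t) + u(i, t) * dt
  float4 updatePosition(float2 coords : TEXCOORD0,
                        uniform sampler2D oldPos,   // xyz = position, w = mass
                        uniform sampler2D vel,
                        uniform float dt) : COLOR0
  {
      float4 p = tex2D(oldPos, coords);
      p.xyz += tex2D(vel, coords).xyz * dt;   // keep the mass in .w unchanged
      return p;
  }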


GPGPU Flow Control Strategies

Branching and Looping


Branching Techniques

• Fragment program branches can be expensive
  • No true fragment branching on GeForce FX or Radeon 9x00-X850
  • SIMD branching on GeForce 6+ Series
  • Incoherent branching hurts performance
• Sometimes better to move decisions up the pipeline
  • Replace with math
  • Occlusion query
  • Static branch resolution
  • Z-cull
  • Pre-computation


Branching with Occlusion Query

• Use it for iteration termination (a concrete OpenGL sketch follows below)

  Do {  // outer loop on CPU
      BeginOcclusionQuery
      {
          // Render with a fragment program that
          // discards fragments that satisfy the
          // termination criteria
      }
      EndQuery
  } While query returns > 0

• Can be used for subdivision techniques
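
A minimal host-side sketch of the loop above, using the ARB occlusion query API (my own illustration; the fragment program is assumed to discard fragments that have converged, so the query counts the fragments still active, and draw_full_screen_quad() is the hypothetical helper from earlier):

  /* Iterate the kernel until no fragment survives the discard test.
     Assumes GL_ARB_occlusion_query (or GL 1.5 core queries). */
  GLuint query;
  GLuint samplesPassed = 0;

  glGenQueries(1, &query);

  do {
      glBeginQuery(GL_SAMPLES_PASSED, query);

      /* Render the full-screen quad with a fragment program that
         discards fragments meeting the termination criteria. */
      draw_full_screen_quad();

      glEndQuery(GL_SAMPLES_PASSED);

      /* Note: reading the result stalls until the GPU finishes the pass. */
      glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samplesPassed);
  } while (samplesPassed > 0);

  glDeleteQueries(1, &query);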


Static Branch Resolution

• Avoid branches where the outcome is fixed
  • One region is always true, another always false
  • Separate fragment programs for each region, no branches
• Example: boundaries
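
For the boundary example, the usual decomposition (a sketch; bind_program() and draw_rect() are hypothetical placeholders rather than a specific API) is to draw the grid interior with a branch-free interior program and the four one-cell-wide edges with a separate boundary program:

  /* Sketch: static branch resolution for an R x C grid with a one-cell
     boundary. draw_rect(x, y, w, h) is assumed to rasterize exactly
     that pixel rectangle of the output texture. */
  void run_step(int R, int C)
  {
      bind_program(INTERIOR_PROGRAM);        /* no boundary test inside */
      draw_rect(1, 1, C - 2, R - 2);         /* interior cells only     */

      bind_program(BOUNDARY_PROGRAM);        /* boundary-specific math  */
      draw_rect(0,     0,     C, 1);         /* bottom row    */
      draw_rect(0,     R - 1, C, 1);         /* top row       */
      draw_rect(0,     1,     1, R - 2);     /* left column   */
      draw_rect(C - 1, 1,     1, R - 2);     /* right column  */
  }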


Z-Cull

• In an early pass, modify the depth buffer
  • Clear Z to 1
  • Draw quad at Z=0
  • Discard pixels that should be modified in later passes
• Subsequent passes
  • Enable depth test (GL_LESS)
  • Draw full-screen quad at Z=0.5
  • Only pixels with previous depth=1 will be processed
• Can also use stencil cull on GeForce 6 Series
  • Not available on GeForce FX (NV3X)
• Discard and shader depth output disable Z-cull
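
A hedged host-side sketch of the two phases (illustrative OpenGL only; the masking fragment program and the draw_full_screen_quad_at_z() helper are assumptions based on the steps above):

  /* Phase 1: lay down the depth mask. The fragment program discards the
     pixels that SHOULD be processed later, so they keep the cleared
     depth of 1.0; all other pixels get the quad's depth of 0.0. */
  glClearDepth(1.0);
  glClear(GL_DEPTH_BUFFER_BIT);
  glEnable(GL_DEPTH_TEST);
  glDepthFunc(GL_ALWAYS);
  glDepthMask(GL_TRUE);
  /* ... bind the masking fragment program here ... */
  draw_full_screen_quad_at_z(0.0f);

  /* Phase 2: the real computation. With GL_LESS and the quad at z = 0.5,
     fragments over depth-0.0 pixels fail the depth test and are culled
     before shading; only pixels whose depth is still 1.0 run the kernel. */
  glDepthFunc(GL_LESS);
  glDepthMask(GL_FALSE);                  /* don't overwrite the mask */
  /* ... bind the computation fragment program here ... */
  draw_full_screen_quad_at_z(0.5f);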


Pre-computation

• Pre-compute anything that will not change every iteration!
• Example: static obstacles in fluid sim
  • When the user draws obstacles, compute a texture containing boundary info for cells
  • Reuse that texture until obstacles are modified
• Combine with Z-cull for higher performance!


GeForce 6 Series Branching

• True SIMD branching
  • Lots of incoherent branching can hurt performance
  • Should have coherent regions of ≥ 1000 pixels
    • That is only about 30x30 pixels, so still very usable!
• Don’t ignore the overhead of branch instructions
  • Branching over < 5 instructions may not be worth it
• Use branching for early exit from loops
  • Saves a lot of computation


Summary

• Presented mappings of basic computational concepts to GPUs
  • Basic concepts and terminology
  • For introductory “Hello GPGPU” sample code, see http://www.gpgpu.org/developer
• Only the beginning:
  • Rest of the course presents advanced techniques, strategies, and specific algorithms.