Mapping Computational Concepts to GPUs
Mark Harris, NVIDIA



Outline

• Data Parallelism and Stream Processing
• Computational Resources Inventory
• CPU-GPU Analogies
• Example: N-body gravitational simulation
• Parallel reductions
• Overview of Branching Techniques


The Importance of Data Parallelism

• GPUs are designed for graphics
  • Highly parallel tasks
• GPUs process independent vertices & fragments
  • Temporary registers are zeroed
  • No shared or static data
  • No read-modify-write buffers
• Data-parallel processing
  • GPU architecture is ALU-heavy
  • Multiple vertex & pixel pipelines, multiple ALUs per pipe
  • Hide memory latency (with more computation)


Arithmetic Intensity

• Arithmetic intensity
  • Ops per word transferred
  • Computation / bandwidth
• Best to have high arithmetic intensity
• Ideal GPGPU apps have
  • Large data sets
  • High parallelism
  • High independence between data elements
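
As a rough worked example (my own illustration, not from the slides): a SAXPY-style loop does about 2 floating-point ops per element while moving 3 words (two reads, one write), so its arithmetic intensity is only ~2/3 op per word and it is bandwidth-bound. By contrast, the N-body force kernel later in this talk does roughly 25 flops per body pair while reading just two 4-float position/mass values.

  /* Illustrative only: arithmetic intensity = ops / words transferred. */
  #include <stddef.h>

  /* ~2 flops per element, 3 words of memory traffic (read x, read y,
     write y): arithmetic intensity ~ 2/3, so memory bandwidth, not
     ALU throughput, limits performance. */
  void saxpy(size_t n, float a, const float *x, float *y)
  {
      for (size_t i = 0; i < n; ++i)
          y[i] = a * x[i] + y[i];
  }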


Data Streams & Kernels

• Streams
  • Collection of records requiring similar computation
    • Vertex positions, voxels, FEM cells, etc.
  • Provide data parallelism
• Kernels
  • Functions applied to each element in the stream
    • Transforms, PDEs, …
  • Few dependencies between stream elements
    • Encourages high arithmetic intensity


Example: Simulation Grid

• Common GPGPU computation style
  • Textures represent computational grids = streams
• Many computations map to grids
  • Matrix algebra
  • Image & volume processing
  • Physically-based simulation
  • Global illumination
    • Ray tracing, photon mapping, radiosity
• Non-grid streams can be mapped to grids


Stream Computation

• Grid simulation algorithm
  • Made up of steps
  • Each step updates the entire grid
  • Must complete before the next step can begin
• Grid is a stream, steps are kernels
  • Kernel applied to each stream element

[Figure: cloud simulation algorithm]


Scatter vs. Gather

• Grid communication
  • Grid cells share information
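
The extracted slide drops its usual illustration, so here is a minimal C sketch of the standard distinction (my own example): a gather reads from a computed address, a scatter writes to a computed address. As later slides note, fragment processors can gather (texture reads at arbitrary coordinates) but cannot scatter (the output pixel is fixed).

  /* Minimal sketch of the scatter/gather distinction (not from the slides). */
  void gather(float *out, const float *in, const int *idx, int n)
  {
      for (int i = 0; i < n; ++i)
          out[i] = in[idx[i]];    /* gather: read from a computed address  */
  }

  void scatter(float *out, const float *in, const int *idx, int n)
  {
      for (int i = 0; i < n; ++i)
          out[idx[i]] = in[i];    /* scatter: write to a computed address */
  }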


Computational Resources Inventory

• Programmable parallel processors
  • Vertex & fragment pipelines
• Rasterizer
  • Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants
• Texture unit
  • Read-only memory interface
• Render to texture
  • Write-only memory interface


Vertex Processor

• Fully programmable (SIMD / MIMD)
• Processes 4-vectors (RGBA / XYZW)
• Capable of scatter but not gather
  • Can change the location of the current vertex
  • Cannot read info from other vertices
  • Can only read a small constant memory
• Latest GPUs: vertex texture fetch
  • Random-access memory for vertices
  • Arguably still not gather


Fragment Processor

• Fully programmable (SIMD)
• Processes 4-component vectors (RGBA / XYZW)
• Random-access memory read (textures)
• Capable of gather but not scatter
  • RAM read (texture fetch), but no RAM write
  • Output address fixed to a specific pixel
• Typically more useful than the vertex processor
  • More fragment pipelines than vertex pipelines
  • Direct output (fragment processor is at the end of the pipeline)


CPU-GPU Analogies

• CPU programming is familiar
  • GPU programming is graphics-centric
• Analogies can aid understanding


CPU-GPU Analogies

• Stream / data array (CPU)  =  Texture (GPU)
• Memory read (CPU)  =  Texture sample (GPU)


Kernels

• Kernel / loop body / algorithm step (CPU)  =  Fragment program (GPU)
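
A minimal sketch of this analogy, in the style of the Cg kernel shown later in the talk (my own illustration; the names scaleAndBias and src are hypothetical): the body of a CPU loop over an array becomes a fragment program that runs once per output pixel, with the array read replaced by a texture sample.

  // CPU version: loop body applied to every array element
  //   for (int i = 0; i < n; ++i)
  //       out[i] = a * in[i] + b;

  // GPU version: the same "loop body" as a Cg fragment program.
  // It is invoked once per pixel of the output texture; the array
  // read becomes a texture sample at this pixel's coordinates.
  float4 scaleAndBias(float2 coords : TEXCOORD0,
                      uniform sampler2D src,   // input array as a texture
                      uniform float a,
                      uniform float b) : COLOR0
  {
      return a * tex2D(src, coords) + b;
  }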


Feedback

• Each algorithm step depends on the results of previous steps
• Each time step depends on the results of the previous time step


Feedback

• Array write (CPU), e.g. Grid[i][j] = x;  =  Render to texture (GPU)
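
As a hedged host-side sketch of what "render to texture" means: the code below uses the later OpenGL framebuffer-object API purely as an illustration (the course-era implementations used pbuffers, as a later slide mentions), so treat the specific calls and the GLEW dependency as assumptions about your GL setup.

  /* Sketch only: create a float texture and attach it to a framebuffer
     object, so that fragment-program output becomes the "array write"
     for the next pass. Assumes an extension loader such as GLEW exposes
     the FBO and float-texture entry points. */
  #include <GL/glew.h>

  GLuint create_render_target(int width, int height, GLuint *fbo_out)
  {
      GLuint tex, fbo;

      glGenTextures(1, &tex);
      glBindTexture(GL_TEXTURE_2D, tex);
      glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, width, height, 0,
                   GL_RGBA, GL_FLOAT, NULL);

      glGenFramebuffers(1, &fbo);
      glBindFramebuffer(GL_FRAMEBUFFER, fbo);
      glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                             GL_TEXTURE_2D, tex, 0);

      *fbo_out = fbo;
      return tex;   /* sample this texture as input in the following pass */
  }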


GPU Simulation Overview

• Analogies lead to implementation
  • Algorithm steps are fragment programs
    • Computational kernels
  • Current state is stored in textures
  • Feedback via render to texture
• One question: how do we invoke computation?


Invoking Computation

• Must invoke computation at each pixel
  • Just draw geometry!
  • Most common GPGPU invocation is a full-screen quad
• Other useful analogies
  • Rasterization = Kernel invocation
  • Texture coordinates = Computational domain
  • Vertex coordinates = Computational range


Typical “Grid” Computation

• Initialize “view” (so that pixels:texels::1:1)

  glMatrixMode(GL_MODELVIEW);
  glLoadIdentity();
  glMatrixMode(GL_PROJECTION);
  glLoadIdentity();
  glOrtho(0, 1, 0, 1, 0, 1);
  glViewport(0, 0, outTexResX, outTexResY);

• For each algorithm step:
  • Activate render-to-texture
  • Set up input textures, fragment program
  • Draw a full-screen quad (1x1)
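
The "draw a full-screen quad" step, in the fixed-function immediate-mode style of the era, looks roughly like this (a sketch under the 1:1 pixel:texel setup above, not the course's exact code):

  /* Draw one quad covering the unit square set up by glOrtho(0,1,0,1,0,1).
     Every pixel of the output texture receives exactly one fragment, so
     the bound fragment program runs once per grid cell. */
  glBegin(GL_QUADS);
      glTexCoord2f(0.0f, 0.0f); glVertex2f(0.0f, 0.0f);
      glTexCoord2f(1.0f, 0.0f); glVertex2f(1.0f, 0.0f);
      glTexCoord2f(1.0f, 1.0f); glVertex2f(1.0f, 1.0f);
      glTexCoord2f(0.0f, 1.0f); glVertex2f(0.0f, 1.0f);
  glEnd();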


Example: N-Body Simulation

• Brute force
  • N = 8192 bodies
  • N^2 gravity computations
  • 64M force comps. / frame
  • ~25 flops per force
  • 7.5 fps
  • 12.5+ GFLOPS sustained
    • GeForce 6800 Ultra

Nyland, Harris, Prins, GP2 2004 poster


Computing Gravitational Forces

• Each body attracts all other bodies
  • N bodies, so N^2 forces
• Draw into an NxN buffer
  • Pixel (i,j) computes force between bodies i and j
  • Very simple fragment program
    • More than 2048 bodies makes it trickier
      – Limited by max pbuffer size…
      – “exercise for the reader”


Computing Gravitational Forces

F(i,j) = g * M_i * M_j / r(i,j)^2,  where  r(i,j) = |pos(i) - pos(j)|

Force is proportional to the inverse square of the distance between bodies


Computing Gravitational Forces

[Figure: N x N force texture holding force(i,j), alongside the body position texture]

F(i,j) = g * M_i * M_j / r(i,j)^2,  where  r(i,j) = |pos(i) - pos(j)|

Coordinates (i,j) in the force texture are used to find bodies i and j in the body position texture


Computing Gravitational Forces

float4 force(float2 ij : WPOS,
             uniform sampler2D pos) : COLOR0
{
    // Pos texture is 2D, not 1D, so we need to
    // convert body indices into 2D coords for the pos texture
    float4 iCoords  = getBodyCoords(ij);
    float4 iPosMass = tex2D(pos, iCoords.xy);  // xyz = position, w = mass of body i
    float4 jPosMass = tex2D(pos, iCoords.zw);  // xyz = position, w = mass of body j

    float3 dir = iPosMass.xyz - jPosMass.xyz;
    float  r2  = dot(dir, dir);                // squared distance (zero when i == j)
    dir = normalize(dir);

    // g is the gravitational constant (a uniform declared elsewhere)
    return float4(dir * g * iPosMass.w * jPosMass.w / r2, 0);
}
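
getBodyCoords is not defined on the slide. One plausible implementation, entirely my assumption about the layout (bodies packed row-major into a posTexWidth x posTexHeight position texture, with the force buffer's x pixel index naming body i and y naming body j), might look like this:

  // Hypothetical helper: map the force-buffer pixel (i, j) to the
  // texture coordinates of bodies i and j in the position texture.
  uniform float posTexWidth;
  uniform float posTexHeight;

  float2 indexToPosCoord(float index)
  {
      float x = fmod(index, posTexWidth);
      float y = floor(index / posTexWidth);
      // +0.5 to hit texel centers; divide to get [0,1] texture coords
      return (float2(x, y) + 0.5) / float2(posTexWidth, posTexHeight);
  }

  float4 getBodyCoords(float2 ij)
  {
      return float4(indexToPosCoord(floor(ij.x)),    // body i in .xy
                    indexToPosCoord(floor(ij.y)));   // body j in .zw
  }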


Computing Total Force

• Have: array of (i,j) forces
• Need: total force on each particle i

[Figure: N x N force texture of force(i,j) values]


Computing Total Force

• Have: array of (i,j) forces
• Need: total force on each particle i
  • Sum of each column of the force array

[Figure: N x N force texture of force(i,j) values]


Computing Total Force

• Have: array of (i,j) forces
• Need: total force on each particle i
  • Sum of each column of the force array
• Can do all N columns in parallel

This is called a Parallel Reduction

[Figure: N x N force texture of force(i,j) values]


Parallel Reductions

• 1D parallel reduction:
  • Sum N columns or rows in parallel
  • Add two halves of the texture together

[Figure: N x N texture, two halves added together]


Parallel Reductions

• 1D parallel reduction:
  • Sum N columns or rows in parallel
  • Add two halves of the texture together
  • Repeatedly...

[Figure: N x (N/2) texture, two halves added together]


Parallel Reductions

• 1D parallel reduction:
  • Sum N columns or rows in parallel
  • Add two halves of the texture together
  • Repeatedly...

[Figure: N x (N/4) texture, two halves added together]


Parallel Reductions

• 1D parallel reduction:
  • Sum N columns or rows in parallel
  • Add two halves of the texture together
  • Repeatedly...
• Until we’re left with a single row of texels

[Figure: N x 1 result]

Requires log2(N) steps
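
A sketch of one reduction pass, in the same Cg style as the force kernel (my own illustration; the name reducePass and the uniforms are assumptions): each pass renders into a buffer half as tall as its input, and each output pixel adds the texel at its own position to the texel half the input height below it. On the host, the viewport height is halved and the input/output textures are swapped each pass until a single row remains.

  // One pass of a column-wise parallel reduction. The output buffer is
  // N wide and half as tall as the input. Output pixel (x, y) sums
  // input texels (x, y) and (x, y + halfHeight).
  float4 reducePass(float2 winPos : WPOS,          // output pixel position
                    uniform sampler2D src,
                    uniform float  halfHeight,     // input height / 2, in texels
                    uniform float2 invSrcSize) : COLOR0  // 1 / input dimensions
  {
      float2 a = winPos * invSrcSize;                           // upper-half texel
      float2 b = (winPos + float2(0, halfHeight)) * invSrcSize; // lower-half texel
      return tex2D(src, a) + tex2D(src, b);
  }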


Update Positions and Velocities

• Now we have a 1-D array of total forces
  • One per body
• Update velocity
  • u(i, t+dt) = u(i, t) + F_total(i) * dt
  • Simple pixel shader reads previous velocity and force textures, creates new velocity texture
• Update position
  • x(i, t+dt) = x(i, t) + u(i, t) * dt
  • Simple pixel shader reads previous position and velocity textures, creates new position texture
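
The two update shaders are not shown on the slide; as a hedged sketch in the same Cg style (assuming, as the slide's formula does, that mass is already folded into the total force, and that dt arrives as a uniform):

  // Velocity update: u(i, t+dt) = u(i, t) + F_total(i) * dt
  float4 updateVelocity(float2 coords : TEXCOORD0,
                        uniform sampler2D oldVel,
                        uniform sampler2D totalForce,
                        uniform float dt) : COLOR0
  {
      return tex2D(oldVel, coords) + tex2D(totalForce, coords) * dt;
  }

  // Position update: x(i, t+dt) = x(i, t) + u(i, t) * dt
  float4 updatePosition(float2 coords : TEXCOORD0,
                        uniform sampler2D oldPos,   // xyz = position, w = mass
                        uniform sampler2D vel,
                        uniform float dt) : COLOR0
  {
      float4 p = tex2D(oldPos, coords);
      p.xyz += tex2D(vel, coords).xyz * dt;   // keep the mass in .w unchanged
      return p;
  }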


GPGPU Flow Control Strategies

Branching and Looping


Branching Techniques

• Fragment program branches can be expensive
  • No true fragment branching on GeForce FX or Radeon 9x00-X850
  • SIMD branching on GeForce 6+ Series
  • Incoherent branching hurts performance
• Sometimes better to move decisions up the pipeline
  • Replace with math
  • Occlusion query
  • Static branch resolution
  • Z-cull
  • Pre-computation


Branching with Occlusion Query

• Use it for iteration termination (a concrete OpenGL sketch follows below)

  Do {  // outer loop on CPU
      BeginOcclusionQuery
      {
          // Render with a fragment program that
          // discards fragments that satisfy the
          // termination criteria
      }
      EndQuery
  } While query returns > 0

• Can be used for subdivision techniques
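
A minimal host-side sketch of the loop above, using the ARB occlusion query API (my own illustration; the fragment program is assumed to discard fragments that have converged, so the query counts the fragments still active, and draw_full_screen_quad() is the hypothetical helper from earlier):

  /* Iterate the kernel until no fragment survives the discard test.
     Assumes GL_ARB_occlusion_query (or GL 1.5 core queries). */
  GLuint query;
  GLuint samplesPassed = 0;

  glGenQueries(1, &query);

  do {
      glBeginQuery(GL_SAMPLES_PASSED, query);

      /* Render the full-screen quad with a fragment program that
         discards fragments meeting the termination criteria. */
      draw_full_screen_quad();

      glEndQuery(GL_SAMPLES_PASSED);

      /* Note: reading the result stalls until the GPU finishes the pass. */
      glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samplesPassed);
  } while (samplesPassed > 0);

  glDeleteQueries(1, &query);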


Static Branch Resolution

• Avoid branches where the outcome is fixed
  • One region is always true, another always false
  • Separate fragment programs for each region, no branches
• Example: boundaries
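
For the boundary example, the usual decomposition (a sketch; bind_program() and draw_rect() are hypothetical placeholders rather than a specific API) is to draw the grid interior with a branch-free interior program and the four one-cell-wide edges with a separate boundary program:

  /* Sketch: static branch resolution for an R x C grid with a one-cell
     boundary. draw_rect(x, y, w, h) is assumed to rasterize exactly
     that pixel rectangle of the output texture. */
  void run_step(int R, int C)
  {
      bind_program(INTERIOR_PROGRAM);        /* no boundary test inside */
      draw_rect(1, 1, C - 2, R - 2);         /* interior cells only     */

      bind_program(BOUNDARY_PROGRAM);        /* boundary-specific math  */
      draw_rect(0,     0,     C, 1);         /* bottom row    */
      draw_rect(0,     R - 1, C, 1);         /* top row       */
      draw_rect(0,     1,     1, R - 2);     /* left column   */
      draw_rect(C - 1, 1,     1, R - 2);     /* right column  */
  }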


Z-Cull

• In an early pass, modify the depth buffer
  • Clear Z to 1
  • Draw quad at Z=0
  • Discard pixels that should be modified in later passes
• Subsequent passes
  • Enable depth test (GL_LESS)
  • Draw full-screen quad at Z=0.5
  • Only pixels with previous depth=1 will be processed
• Can also use stencil cull on GeForce 6 Series
  • Not available on GeForce FX (NV3X)
• Discard and shader depth output disable Z-cull
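
A hedged host-side sketch of the two phases (illustrative OpenGL only; the masking fragment program and the draw_full_screen_quad_at_z() helper are assumptions based on the steps above):

  /* Phase 1: lay down the depth mask. The fragment program discards the
     pixels that SHOULD be processed later, so they keep the cleared
     depth of 1.0; all other pixels get the quad's depth of 0.0. */
  glClearDepth(1.0);
  glClear(GL_DEPTH_BUFFER_BIT);
  glEnable(GL_DEPTH_TEST);
  glDepthFunc(GL_ALWAYS);
  glDepthMask(GL_TRUE);
  /* ... bind the masking fragment program here ... */
  draw_full_screen_quad_at_z(0.0f);

  /* Phase 2: the real computation. With GL_LESS and the quad at z = 0.5,
     fragments over depth-0.0 pixels fail the depth test and are culled
     before shading; only pixels whose depth is still 1.0 run the kernel. */
  glDepthFunc(GL_LESS);
  glDepthMask(GL_FALSE);                  /* don't overwrite the mask */
  /* ... bind the computation fragment program here ... */
  draw_full_screen_quad_at_z(0.5f);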


Pre-computation

• Pre-compute anything that will not change every iteration!
• Example: static obstacles in fluid sim
  • When the user draws obstacles, compute a texture containing boundary info for cells
  • Reuse that texture until obstacles are modified
• Combine with Z-cull for higher performance!


GeForce 6 Series Branching

• True SIMD branching
  • Lots of incoherent branching can hurt performance
  • Should have coherent regions of ≥ 1000 pixels
    • That is only about 30x30 pixels, so still very usable!
• Don’t ignore the overhead of branch instructions
  • Branching over < 5 instructions may not be worth it
• Use branching for early exit from loops
  • Saves a lot of computation


Summary

• Presented mappings of basic computational concepts to GPUs
  • Basic concepts and terminology
  • For introductory “Hello GPGPU” sample code, see http://www.gpgpu.org/developer
• Only the beginning:
  • Rest of the course presents advanced techniques, strategies, and specific algorithms.