Mapping Computational Concepts to GPUs
Mark Harris
NVIDIA Developer Technology

Outline

• Data Parallelism and Stream Processing

• Computational Resources Inventory

• CPU-GPU Analogies

• Overview of Branching Techniques

Importance of Data Parallelism

• GPUs are designed for graphics
  – Highly parallel tasks
• GPUs process independent vertices & fragments
  – Temporary registers are zeroed
  – No shared or static data
  – No read-modify-write buffers
• Data-parallel processing
  – GPU architecture is ALU-heavy
    • Multiple vertex & pixel pipelines, multiple ALUs per pipe
  – Hide memory latency (with more computation)

Arithmetic Intensity

• Arithmetic intensity = ops per word transferred

• “Classic” graphics pipeline
  – Vertex
    • BW: 1 triangle = 32 bytes
    • OP: 100-500 f32-ops / triangle
  – Rasterization
    • Create 16-32 fragments per triangle
  – Fragment
    • BW: 1 fragment = 10 bytes
    • OP: 300-1000 i8-ops / fragment

Courtesy of Pat Hanrahan
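As a rough worked example using the vertex numbers above (not from the original slide): 32 bytes per triangle is 8 32-bit words, so 100-500 f32-ops per triangle corresponds to an arithmetic intensity of roughly 100/8 ≈ 12 to 500/8 ≈ 62 ops per word.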

Data Streams & Kernels

• Streams
  – Collection of records requiring similar computation
    • Vertex positions, voxels, FEM cells, etc.
  – Provide data parallelism
• Kernels
  – Functions applied to each element in the stream
    • Transforms, PDEs, …
  – No dependencies between stream elements
    • Encourages high arithmetic intensity

Courtesy of Ian Buck
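A minimal C sketch of the stream/kernel pattern described above (illustrative, not from the slides): the kernel is a function applied independently to each stream element.

  /* Kernel: a pure function of a single stream element.             */
  float kernel(float x) { return x * x + 1.0f; }  /* arbitrary example */

  /* Apply the kernel over the whole stream; iterations are          */
  /* independent, which is exactly what provides data parallelism.   */
  void runKernel(const float *in, float *out, int n) {
      for (int i = 0; i < n; ++i)
          out[i] = kernel(in[i]);
  }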

Example: Simulation Grid

• Common GPGPU computation style
  – Textures represent computational grids = streams
• Many computations map to grids
  – Matrix algebra
  – Image & volume processing
  – Physical simulation
  – Global illumination
    • Ray tracing, photon mapping, radiosity
• Non-grid streams can be mapped to grids

Stream Computation

• Grid simulation algorithm
  – Made up of steps
  – Each step updates the entire grid
  – Must complete before the next step can begin
• Grid is a stream, steps are kernels
  – Kernel applied to each stream element

Scatter vs. Gather

• Grid communication
  – Grid cells share information
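As a quick CPU-level reminder of the terminology (not part of the slide): gather reads from a computed address, while scatter writes to one.

  /* Gather: read from a computed address.  */
  float gather(const float *a, int i)      { return a[i]; }

  /* Scatter: write to a computed address. */
  void  scatter(float *a, int i, float x)  { a[i] = x; }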

Computational Resource Inventory

• Programmable parallel processors
  – Vertex & Fragment pipelines
• Rasterizer
  – Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants
• Texture unit
  – Read-only memory interface
• Render to texture
  – Write-only memory interface

Vertex Processor

• Fully programmable (SIMD / MIMD)

• Processes 4-vectors (RGBA / XYZW)

• Capable of scatter but not gather
  – Can change the location of the current vertex
  – Cannot read info from other vertices
  – Can only read a small constant memory
• Latest GPUs: Vertex Texture Fetch
  – Random access memory for vertices
  – Arguably still not gather

Fragment Processor

• Fully programmable (SIMD)

• Processes 4-vectors (RGBA / XYZW)

• Random access memory read (textures)

• Capable of gather but not scatter
  – RAM read (texture), but no RAM write
  – Output address fixed to a specific pixel
• Typically more useful than the vertex processor
  – More fragment pipelines than vertex pipelines
  – Gather
  – Direct output (fragment processor is at the end of the pipeline)

CPU-GPU Analogies

• CPU programming is familiar
  – GPU programming is graphics-centric

• Analogies can aid understanding

CPU-GPU Analogies

  CPU                     GPU
  Stream / Data Array  =  Texture
  Memory Read          =  Texture Sample

CPU-GPU Analogies

  CPU                                     GPU
  Kernel / loop body / algorithm step  =  Fragment Program

Feedback

• Each algorithm step depends on the results of previous steps

• Each time step depends on the results of the previous time step
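A minimal CPU analog of this feedback (illustrative, with an arbitrary update rule): the output buffer of one step becomes the input of the next, so the two buffers are swapped each iteration.

  /* Double-buffered time stepping: 'curr' holds step t, 'next' receives t+1. */
  void simulate(float *curr, float *next, int n, int numSteps) {
      for (int t = 0; t < numSteps; ++t) {
          for (int i = 0; i < n; ++i)
              next[i] = 0.5f * (curr[i] + curr[(i + 1) % n]);  /* arbitrary example update   */
          float *tmp = curr; curr = next; next = tmp;          /* feed result into next step */
      }
  }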

CPU-GPU Analogies

  CPU                                 GPU
  Array Write (Grid[i][j] = x; …)  =  Render to Texture
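Putting the analogies together, a minimal CPU-side sketch of one grid update (illustrative code, not from the slides); each comment names the GPU analog, and the loop itself corresponds to drawing a full-screen quad, as the next slides describe.

  /* One simulation step on the CPU, annotated with its GPU analogs. */
  void stepGrid(const float *in, float *out, int w, int h) {
      for (int j = 0; j < h; ++j) {
          for (int i = 0; i < w; ++i) {      /* loop over grid = draw a full-screen quad */
              float x = in[j * w + i];       /* array read     = texture sample          */
              float y = 0.25f * x;           /* loop body      = fragment program        */
              out[j * w + i] = y;            /* array write    = render to texture       */
          }
      }
  }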

GPU Simulation Overview

• Analogies lead to implementation
  – Algorithm steps are fragment programs
    • Computational kernels
  – Current state variables stored in textures
  – Feedback via render to texture

• One question: how do we invoke computation?

Invoking Computation

• Must invoke computation at each pixel
  – Just draw geometry!
  – Most common GPGPU invocation is a full-screen quad

Typical “Grid” Computation

• Initialize “view” (so that pixels:texels::1:1)

glMatrixMode(GL_MODELVIEW);
glLoadIdentity();
glMatrixMode(GL_PROJECTION);
glLoadIdentity();
glOrtho(0, 1, 0, 1, 0, 1);
glViewport(0, 0, outTexResX, outTexResY);

• For each algorithm step:
  – Activate render-to-texture
  – Set up input textures and the fragment program
  – Draw a full-screen quad (1x1); see the sketch below
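A minimal sketch of that full-screen quad draw in fixed-function OpenGL, consistent with the glOrtho(0, 1, 0, 1, 0, 1) setup above (assumes the render target, input textures, and fragment program are already bound):

  /* One quad covering the unit square: the fragment program runs once per output texel. */
  glBegin(GL_QUADS);
    glTexCoord2f(0.0f, 0.0f); glVertex2f(0.0f, 0.0f);
    glTexCoord2f(1.0f, 0.0f); glVertex2f(1.0f, 0.0f);
    glTexCoord2f(1.0f, 1.0f); glVertex2f(1.0f, 1.0f);
    glTexCoord2f(0.0f, 1.0f); glVertex2f(0.0f, 1.0f);
  glEnd();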

Branching Techniques

• Fragment program branches can be expensive
  – No true fragment branching on GeForce FX or Radeon
  – SIMD branching on GeForce 6 Series
    • Incoherent branching hurts performance
• Sometimes better to move decisions up the pipeline
  – Replace with math
  – Occlusion Query
  – Static Branch Resolution
  – Z-cull
  – Pre-computation

Branching with OQ

• Use it for iteration termination (a concrete GL sketch follows below):

  Do { // outer loop on CPU
      BeginOcclusionQuery {
          // Render with fragment program that
          // discards fragments that satisfy
          // termination criteria
      } EndQuery
  } While query returns > 0

• Can be used for subdivision techniques
  – Demo
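A hedged sketch of the same loop using the standard GL occlusion-query API (GL 1.5 / ARB_occlusion_query); drawFullScreenQuad and the terminating fragment program are hypothetical application code, not from the slides:

  GLuint query, samples = 0;
  glGenQueries(1, &query);
  do {                                        /* outer loop on the CPU           */
      glBeginQuery(GL_SAMPLES_PASSED, query);
      drawFullScreenQuad();                   /* fragment program discards (KIL) */
                                              /* fragments that have terminated  */
      glEndQuery(GL_SAMPLES_PASSED);
      glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samples);
  } while (samples > 0);                      /* stop when no fragments survive  */
  glDeleteQueries(1, &query);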

Static Branch Resolution

• Avoid branches where the outcome is fixed
  – One region is always true, another always false
  – Separate FPs for each region, no branches

• Example: boundaries
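A hedged CPU-side sketch of resolving the boundary branch statically; bindFragmentProgram, drawInteriorQuad, and drawBoundaryEdges are hypothetical application helpers, not slide code:

  /* Interior cells: fragment program with no boundary test at all.  */
  bindFragmentProgram(interiorProgram);
  drawInteriorQuad(1, 1, gridW - 1, gridH - 1);   /* inner region            */

  /* Boundary cells: a separate fragment program handles the edges.  */
  bindFragmentProgram(boundaryProgram);
  drawBoundaryEdges(gridW, gridH);                /* four 1-cell-wide strips */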

Z-Cull

• In an early pass, modify the depth buffer
  – Clear Z to 1
  – Draw quad at Z=0
  – Discard pixels that should be modified in later passes
• Subsequent passes
  – Enable depth test (GL_LESS)
  – Draw full-screen quad at Z=0.5
  – Only pixels with previous depth=1 will be processed
• Can also use early stencil test
• Not available on NV3X
  – Depth replace disables Z-cull
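A hedged GL sketch of the two z-cull passes described above; drawFullScreenQuad(z) and the masking fragment program are hypothetical application pieces:

  /* Early pass: build the depth mask.                                  */
  glClearDepth(1.0);
  glClear(GL_DEPTH_BUFFER_BIT);        /* Z cleared to 1                */
  glEnable(GL_DEPTH_TEST);
  glDepthFunc(GL_ALWAYS);
  /* Fragment program discards (KIL) pixels that should be modified     */
  /* in later passes, so they keep depth = 1; the rest write depth 0.   */
  drawFullScreenQuad(0.0f);

  /* Computation passes: a quad at z = 0.5 is shaded only where the     */
  /* stored depth is 1 (0.5 < 1 passes; 0.5 < 0 fails and is culled).   */
  glDepthFunc(GL_LESS);
  glDepthMask(GL_FALSE);               /* preserve the mask             */
  drawFullScreenQuad(0.5f);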

Pre-computation

• Pre-compute anything that will not change every iteration!
• Example: arbitrary boundaries
  – When the user draws boundaries, compute a texture containing boundary info for cells
  – Reuse that texture until the boundaries are modified
  – Combine with Z-cull for higher performance!
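A minimal CPU-side sketch of the idea; updateBoundaryTexture and bindTexture stand in for the application's own routines (hypothetical names):

  /* Recompute the boundary texture only when the user edits boundaries; */
  /* every simulation iteration afterwards just reuses it.               */
  if (boundariesChanged) {
      updateBoundaryTexture(boundaryTex);   /* one render-to-texture pass   */
      boundariesChanged = 0;
  }
  bindTexture(boundaryTex);                 /* input to this step's kernels */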

GeForce 6 Series Branching

• True SIMD branching
  – Lots of incoherent branching can hurt performance
  – Should have coherent regions of 1000 pixels
    • That is only about 30x30 pixels, so still very usable!
• Don’t ignore the overhead of branch instructions
  – Branching over < 5 instructions may not be worth it
• Use branching for early exit from loops
  – Saves a lot of computation

Summary

• Presented mappings of basic computational concepts to GPUs
  – Basic concepts and terminology
  – For introductory “Hello GPGPU” sample code, see http://www.gpgpu.org/developer
• Only the beginning:
  – Rest of course presents advanced techniques, strategies, and specific algorithms.