Prepared 5/24/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

CUDA Lecture 2: History of GPUs

Make great images: intricate shapes, complex optical effects, seamless motion

Make them fast: invent clever techniques, use every trick imaginable, build monster hardware

Eugene d’Eon, David Luebke, Eric Enderton, In Proc. EGSR 2007 and GPU Gems 3

History of GPUs – Slide 2

Graphics in a Nutshell

History of GPUs – Slide 3

The Graphics Pipeline

Vertex Transform & Lighting

Triangle Setup & Rasterization

Texturing & Pixel Shading

Depth Test & Blending

Framebuffer

History of GPUs – Slide 4


The Graphics Pipeline: Vertex Transform & Lighting

Transform from “world space” to “image space”

Compute per-vertex lighting
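
As a concrete illustration of what this stage computes, here is a minimal C sketch (names are illustrative, not from the lecture): a 4x4 matrix transform taking a vertex from world space toward image space, plus a simple diffuse term standing in for per-vertex lighting.

    /* Hypothetical sketch of per-vertex work in the Vertex Transform & Lighting stage. */
    typedef struct { float x, y, z, w; } Vec4;

    /* Multiply a vertex by a row-major 4x4 transform matrix. */
    Vec4 transform(const float M[16], Vec4 v)
    {
        Vec4 r;
        r.x = M[0]*v.x  + M[1]*v.y  + M[2]*v.z  + M[3]*v.w;
        r.y = M[4]*v.x  + M[5]*v.y  + M[6]*v.z  + M[7]*v.w;
        r.z = M[8]*v.x  + M[9]*v.y  + M[10]*v.z + M[11]*v.w;
        r.w = M[12]*v.x + M[13]*v.y + M[14]*v.z + M[15]*v.w;
        return r;
    }

    /* Simple diffuse (Lambert) lighting: clamped dot of unit normal and light direction. */
    float diffuse(float nx, float ny, float nz, float lx, float ly, float lz)
    {
        float d = nx*lx + ny*ly + nz*lz;
        return d > 0.0f ? d : 0.0f;
    }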

History of GPUs – Slide 6

The Graphics Pipeline: Triangle Setup & Rasterization

Convert geometric representation (vertex) to image representation (fragment)

Interpolate per-vertex quantities across pixels
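
A minimal sketch of the interpolation step (hypothetical names): rasterization gives each covered pixel barycentric weights for the triangle's three vertices, and any per-vertex quantity is blended with those weights.

    /* Hypothetical sketch: interpolate a per-vertex attribute at one pixel.
     * w0 + w1 + w2 == 1 inside the triangle (barycentric weights). */
    float interpolate(float a0, float a1, float a2,
                      float w0, float w1, float w2)
    {
        return w0 * a0 + w1 * a1 + w2 * a2;
    }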

History of GPUs – Slide 7


The Graphics Pipeline
Key abstraction of real-time graphics

Hardware used to look like this

One chip/board per stage

Fixed data flow through pipeline

[Pipeline diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

History of GPUs – Slide 8

The Graphics Pipeline
Everything fixed function, with a certain number of modes

Number of modes for each stage grew over time

Hard to optimize hardware

Developers always wanted more flexibility

[Pipeline diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

History of GPUs – Slide 9

The Graphics Pipeline
Remains a key abstraction

Hardware used to look like this

Vertex and pixel processing became programmable, new stages added

GPU architecture increasingly centers around shader execution

[Pipeline diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

History of GPUs – Slide 10

The Graphics Pipeline
Exposing an (at first limited) instruction set for some stages

Limited instructions and instruction types, and no control flow at first

Expanded to full ISA

[Pipeline diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]

History of GPUs – Slide 11

Workload and programming model provide lots of parallelism

Applications provide large groups of vertices at once
Vertices can be processed in parallel
Apply same transform to all vertices

Triangles contain many pixels
Pixels from a triangle can be processed in parallel
Apply same shader to all pixels (see the sketch below)

Very efficient hardware to hide serialization bottlenecks

History of GPUs – Slide 12
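
To make the previous slide's point concrete, here is a hypothetical CUDA C sketch (names are illustrative): one thread per pixel, and every thread runs the same small shading function on its own data.

    /* Illustrative sketch: the same "shader" applied to every pixel in parallel. */
    __device__ float shade(float input)
    {
        return input * 0.5f + 0.25f;   /* stand-in for real shading math */
    }

    __global__ void shadeAllPixels(const float* in, float* out, int numPixels)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numPixels)
            out[i] = shade(in[i]);     /* same code, different data */
    }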

Why GPUs Scale So Nicely

History of GPUs – Slide 13

With Moore’s Law…

[Diagram: a single Vertex → Raster → Pixel → Blend pipeline, and the same pipeline widened with replicated units: Vrtx 0–2, Raster, Pixel 0–3, Blend.]

Note that we do the same thing for lots of pixels/vertices

A warp = 32 threads launched together
Usually execute together as well
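
As a small hypothetical snippet (illustrative names), a thread can compute which warp and which lane it occupies within its block using the built-in warpSize, which is 32 on the hardware discussed here.

    /* Illustrative: warp index and lane of each thread within its block. */
    __global__ void warpInfo(int* warpIdOut, int* laneIdOut)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread id         */
        int warp = threadIdx.x / warpSize;                 /* warp index in the block  */
        int lane = threadIdx.x % warpSize;                 /* position within the warp */
        warpIdOut[tid] = warp;
        laneIdOut[tid] = lane;
    }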

History of GPUs – Slide 14

More Efficiency

[Diagram: several Control + ALU pairs versus a single Control unit driving multiple ALUs.]

All this performance attracted developers
To use GPUs, they re-expressed their algorithms as general purpose computations using the GPU and graphics API in applications other than 3-D graphics
Pretend to be graphics: disguise data as textures or geometry, disguise the algorithm as render passes
Fool the graphics pipeline into doing computation to take advantage of the GPU's massive parallelism
GPU accelerates the critical path of the application

History of GPUs – Slide 15

What Is (Historical) GPGPU?

Data parallel algorithms leverage GPU attributes
Large data arrays, streaming throughput
Fine-grain SIMD parallelism
Low-latency floating point (FP) computation

Applications – see http://GPGPU.org
Game effects (FX) physics, image processing
Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
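
As one hedged example of the kind of data-parallel algorithm listed above, here is a hypothetical CUDA C sketch of a 1-D correlation (convolution with an unflipped kernel): one output element per thread, every thread running the same code.

    /* Illustrative: fine-grained data-parallel 1-D correlation. */
    __global__ void correlate1d(const float* in, const float* weights, float* out,
                                int n, int k)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float acc = 0.0f;
        for (int j = 0; j < k; ++j) {
            int src = i + j - k / 2;               /* window centered on element i */
            if (src >= 0 && src < n)
                acc += in[src] * weights[j];       /* zero padding at the edges    */
        }
        out[i] = acc;
    }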

History of GPUs – Slide 16

General Purpose GPUs (GPGPUs)

Previous GPGPU Constraints
Dealing with graphics API: working with the corner cases of the graphics API
Addressing modes: limited texture size/dimension
Shader capabilities: limited outputs
Instruction sets: lack of integer & bit ops
Communication limited: between pixels; scatter a[i] = p
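
To illustrate the scatter limitation: a gather (reading from a computed address) mapped naturally onto texture fetches, but a scatter (writing to a computed address, a[i] = p) was hard to express through the graphics API. In CUDA both are ordinary memory operations, as in this hypothetical sketch.

    /* Illustrative: gather vs. scatter, both plain memory operations in CUDA. */
    __global__ void gatherScatter(const float* src, float* dst,
                                  const int* idx, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float p = src[idx[i]];   /* gather:  read from a computed address         */
        dst[idx[i]] = p;         /* scatter: write to a computed address
                                    (assumes idx entries are unique, no races)    */
    }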

History of GPUs – Slide 17

[Diagram: Input Registers → Fragment Program → Output Registers → FB Memory, with Constants, Texture, and Temp Registers available per thread / per shader / per context.]

To use GPUs, re-expressed algorithms as graphics computations

Very tedious, limited usability
Still had some very nice results

This was the lead-up to CUDA

History of GPUs – Slide 18

Summary: Early GPGPUs

General purpose programming model
User kicks off batches of threads on the GPU
GPU = dedicated super-threaded, massively data parallel co-processor

Targeted software stack
Compute oriented drivers, language, and tools
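
A hedged sketch of "kicking off batches of threads" (the kernel and sizes here are illustrative): the host launches a grid of thread blocks, and the grid size grows with the problem size.

    /* Illustrative: the host kicks off a batch of GPU threads. */
    __global__ void scale(float* data, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;
    }

    void launchBatch(float* d_data, int n)
    {
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  /* grid sized from n */
        scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);       /* batch of threads  */
    }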

History of GPUs – Slide 19

Compute Unified Device Architecture (CUDA)

Driver for loading computation programs into GPU
Standalone driver, optimized for computation
Interface designed for compute: graphics-free API
Data sharing with OpenGL buffer objects
Guaranteed maximum download & readback speeds
Explicit GPU memory management
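
The explicit memory management mentioned above uses real CUDA runtime calls (cudaMalloc, cudaMemcpy, cudaFree); the surrounding host-side sketch and names are illustrative.

    /* Illustrative host-side sketch: explicit download, compute, readback. */
    #include <cuda_runtime.h>

    void roundTrip(const float* h_in, float* h_out, int n)
    {
        float* d_buf = NULL;
        size_t bytes = (size_t)n * sizeof(float);

        cudaMalloc((void**)&d_buf, bytes);                        /* allocate device memory */
        cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);   /* explicit download      */
        /* ... launch kernels that operate on d_buf here ... */
        cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost);  /* explicit readback      */
        cudaFree(d_buf);                                          /* release device memory  */
    }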

History of GPUs – Slide 20

Compute Unified Device Architecture (CUDA)

History of GPUs – Slide 21

Example of Physical Reality behind CUDA


[Diagram: CPU (host) connected to a GPU with local DRAM (device).]

8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
Available in laptops, desktops, and clusters

GPU parallelism is doubling every year

Programming model scales transparently

History of GPUs – Slide 22

Parallel Computing on a GPU

GeForce 8800

Tesla D870

Programmable in C with CUDA tools

Multithreaded SPMD model uses application data parallelism and thread parallelism

History of GPUs – Slide 23

Parallel Computing on a GPU

Tesla S870

GPUs evolve as hardware and software evolve

The five-stage graphics pipeline

An example of GPGPU

Intro to CUDA

History of GPUs – Slide 24

Final Thoughts

Reading: Chapter 2, “Programming Massively Parallel Processors” by Kirk and Hwu.

Based on original material from:
The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
The University of Minnesota: Weijun Xiao
Stanford University: Jared Hoberock, David Tarjan

Revision history: last updated 5/24/2011.

History of GPUs – Slide 25

End Credits