GPU evolution Will Graphics Morph Into Compute?

Norm Rubin, Fellow, GPG (Graphics Products Group), AMD

PACT 2008

Pact 2008 | Oct, 2008

Performance is King

Cinematic world: studios typically use about 100,000 minutes of compute per minute of image

Blinn's Law (the flip side of Moore's Law):

– Time per frame is ~constant

– Audience expectation of quality per frame is higher every year!

– Expectation of quality increases with compute increases

GPUs are real time – 1 min of compute per min of image

so users want 100,000 times faster machines

Cars. Courtesy of Pixar Animation Studios

GPU Chip Design Focus Point

CPU

Lots of instructions, little data

– Out of order exec

– Branch prediction

Reuse and locality

Task parallel

Needs OS

Complex sync

Latency machines

Instructions stay the same

GPU

Few instructions, lots of data

– SIMD

– Hardware threading

Little reuse

Data parallel

No OS

Simple sync

Throughput machines

Instructions keep changing

Effects of Hardware Changes

Call of Juarez

Why do ISA instructions change rapidly?

1) A new game becomes popular

2) API designers (Microsoft) add new functions to make the game simpler to write

3) Hardware vendors look at the game, line by line, and add new hardware to speed up the game

4) New Hardware

5) Game developers look at the new hardware and think of interesting new effects (more realism) by pushing past what anyone thought the instruction could do

6) New Games

repeat!

No backward compatibility at the hardware level

Co-evolution in action

Photograph courtesy of Peter Chew (www.brisbaneInsects.com)

Is the GPU just a lot of CPU devices?

GPU / CPU have been co-evolving

Not a process of radical change

Current GPUs have rendering-specific hardware

Transitional features

Ex: fixed function systems

Memory systems with filtering (==texture)

Depth buffer

Rasterizer

…

These are crucial for graphics performance!

Latency or throughput

CPU and GPU are fundamentally different

CPU strength is single thread performance

(latency machine)

GPU strength is massive thread performance

(throughput machine)

What classes of problems can be solved by massive parallel processing?

What exactly does latency or throughput mean?

GPU vs. CPU performance

thread:

// load

r1 = load (index)

// series of adds

r1 = r1 + r1

r1 = r1 + r1

… Run lots of threads

Can you get peak performance/multi-core/cluster?

Peak performance = do alu ops every cycle

Typical CPU Operation

Fetch Alu

Wait for memory; gaps prevent peak performance

Gap size varies dynamically

Hard to tolerate latency

One iteration at a time

Single CPU unit

Cannot reach 100%

Hard to prefetch data

Multi-core does not help

Cluster does not help

Limited number of outstanding fetches

100% ALU utilization

GPU THREADS Throughput (Lower Clock – Different Scale)

Overlapped fetch and ALU

Many outstanding fetches

Lots of threads

Fetch unit + ALU unit

Fast thread switch

In-order finish

ALU units reach 100% utilization

Hardware sync for final output
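
A rough way to see why many threads let the ALU units reach 100% utilization is a toy model. It assumes every thread alternates a memory fetch with a short burst of ALU work and that thread switches are free. The function name and the cycle counts below are illustrative assumptions, not measurements from the talk.

```python
def alu_utilization(threads: int, alu_cycles: int, fetch_latency: int) -> float:
    """Fraction of cycles the ALU is busy in a simple interleaving model.

    Assumes every thread repeats: wait fetch_latency cycles for memory,
    then use the ALU for alu_cycles. Thread switching costs nothing.
    """
    if threads < 1 or alu_cycles < 1 or fetch_latency < 0:
        raise ValueError("need threads >= 1, alu_cycles >= 1, fetch_latency >= 0")
    # One thread keeps the ALU busy alu_cycles out of every
    # (alu_cycles + fetch_latency) cycles; extra threads fill the gap
    # until the ALU is saturated.
    return min(1.0, threads * alu_cycles / (alu_cycles + fetch_latency))


# Hypothetical numbers: 4 ALU cycles of work per 200-cycle fetch.
for n in (1, 8, 32, 51):
    print(n, round(alu_utilization(n, 4, 200), 3))
```

Under these assumed numbers a single thread keeps the ALU busy about 2% of the time, while 51 threads in flight are enough to cover the whole fetch latency.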

One wavefront is 64 threads

Two wavefronts/SIMD (running)

16 processing elements/SIMD

10 SIMD engines

20 program counters

Once enough resources are available a thread goes into the run queue

16 instructions finish per SIMD

Each instruction is 5-way VLIW

[Diagram: thread dispatch. Blocks: Vertex Fetch Seq, Texture Fetch Seq, Output, Filter Resources, PS and VS wavefronts, Wf status, Select waves to execute, and SIMD engines each running 2 wavefronts]

Threads in Run Queue

Each simd has 256 sets of registers

64 registers in a set (each holds 128 bits)

If each thread needs 5 (128 bit) registers, then

256/5 = 51 wavefronts can get into run queue

51 wavefronts = 3264 threads per SIMD or 32,640 running or waiting threads

256 * 64 * 10 vector registers

256 * 64 * 10 * 4 (32-bit registers) = 655,360 registers
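
The run-queue arithmetic on this slide can be checked with a few lines of Python. The constants come from the slide; the function and variable names are made up for this sketch, and it assumes registers are the only resource that limits how many wavefronts fit. The register product works out to 655,360.

```python
REGISTER_SETS_PER_SIMD = 256   # from the slide
REGISTERS_PER_SET = 64         # from the slide; matches the 64 threads of a wavefront
THREADS_PER_WAVEFRONT = 64
SIMD_ENGINES = 10
LANES_PER_VECTOR_REGISTER = 4  # 128 bits = 4 x 32 bits


def run_queue_capacity(registers_per_thread: int) -> dict:
    """Wavefronts and threads that fit when only registers limit occupancy."""
    wavefronts = REGISTER_SETS_PER_SIMD // registers_per_thread
    threads_per_simd = wavefronts * THREADS_PER_WAVEFRONT
    return {
        "wavefronts_per_simd": wavefronts,
        "threads_per_simd": threads_per_simd,
        "threads_per_chip": threads_per_simd * SIMD_ENGINES,
    }


vector_registers = REGISTER_SETS_PER_SIMD * REGISTERS_PER_SET * SIMD_ENGINES
scalar_registers = vector_registers * LANES_PER_VECTOR_REGISTER

print(run_queue_capacity(5))
print(vector_registers, scalar_registers)
```

Raising `registers_per_thread` shrinks the number of resident wavefronts, which is why the compiler works to keep register use low.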

Implications

CPU: Loads determine performance

Compiler works hard to

– Minimize ALU code

– Reduce memory overhead

– Try to use prefetch and other magic to reduce the amount of time waiting for memory

GPU: Threads determine performance

Compiler works hard to

– Minimize ALU code

– Maximize threads

– Try to reorder instructions to reduce synchronization and other magic to reduce the amount of time waiting for threads

Graphics programming model

Pipeline stages: Input assembler → Vertex shader → Geometry shader → Rasterizer → Pixel shader → Output merger

Parallel loop over all input points

Combine points (vertices) into shape

Parallel loop over all shapes

One round of nested parallelism

Generate one thread per pixel in the shape

Parallel loop over all pixels

Combine the outputs

Multiple passes

Parallelism Model

All parallel operations are hidden via domain specific API calls

Developers write sequential code + kernels

Kernels operate on one vertex or pixel

Developers never deal with parallelism directly

No need for auto parallel compilers
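
As a loose analogy for this model, the sketch below separates a sequential per-pixel kernel from a library call that owns the parallel loop. The names `shade_pixel` and `draw` and the shading formula are hypothetical, and a CPU thread pool stands in for the GPU runtime. This is not how a graphics API is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def shade_pixel(xy):
    """Kernel: sequential code for one pixel, touching no shared state."""
    x, y = xy
    return (x * 31 + y * 17) % 256  # hypothetical shading formula


def draw(width, height, kernel, workers=4):
    """Stand-in for the API call that hides the parallel loop."""
    pixels = [(x, y) for y in range(height) for x in range(width)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, so the image does not
        # depend on how many workers run or how they are scheduled.
        return list(pool.map(kernel, pixels))


image = draw(4, 2, shade_pixel)
print(image)
```

The kernel author never sees `workers`: the same kernel runs unchanged whether the runtime uses one processor or many.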

Observations

Developers write only a small part of the program; the rest of the code comes from libraries

200-300 small kernels each < 100 lines

No race conditions are possible

No error reporting (just keep going)

Can view the program as serial (per vertex/shape/pixel)

No developer knows the number of processors

Not like pthreads

Result: Lots of success, simple enough to program

Power and cost have appeared

Vendors always release a family of cards (change the number of simd engines per chip)

– Programs need to scale

– Avoid huge monolithic cores

Goal:

Best performance comes from two gpu chips on a card

Mid range performance from one gpu,

Low range – just remove simd engines

Change in the design metric (GPU Efficiency)

[Chart: Processor Efficiency vs. Release Date, 2003 to 2008; series: GFLOPS/watt and GFLOPS/$; both vertical axes run from 0 to 10]

Graph based on historical performance of ATI Radeon™ GPUs

Changes in the last generation ATI Radeon™ HD 38XX vs ATI Radeon™ HD 48XX

GPU ATI Radeon ™ HD 4870

• 800 stream processors

• = 160 x 5-way VLIW

• = 10 SIMD cores

• New SIMD core layout

• New memory architecture

• Optimized render back-ends for faster anti-aliasing performance

• Enhanced geometry shader & tessellator performance

• 1.2 teraflops performance

• 2.4 teraflops for ATI Radeon™ HD 4870 X2
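
The headline numbers on this slide can be reproduced with simple arithmetic. The slide gives the unit counts and the teraflops totals. The 750 MHz engine clock and the convention of counting a multiply-add as two floating-point operations per cycle are assumptions added here, not taken from the slide.

```python
VLIW_UNITS = 160   # from the slide
VLIW_WIDTH = 5     # from the slide
SIMD_CORES = 10    # from the slide

stream_processors = VLIW_UNITS * VLIW_WIDTH
vliw_units_per_simd = VLIW_UNITS // SIMD_CORES


def peak_gflops(processors, clock_ghz, flops_per_cycle=2, gpus=1):
    """Peak single-precision rate, assuming one multiply-add per cycle."""
    return processors * flops_per_cycle * clock_ghz * gpus


# Assumed clock: 0.75 GHz.
print(stream_processors, vliw_units_per_simd)
print(peak_gflops(stream_processors, 0.75))          # single GPU
print(peak_gflops(stream_processors, 0.75, gpus=2))  # two GPUs on one card
```

Under these assumptions the single-GPU figure comes to 1200 GFLOPS and the dual-GPU figure to 2400 GFLOPS, in line with the slide.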

ATI Radeon™ HD 4800 Series Architecture

SIMD cores:

Changed 4 to 10

Memory Bandwidth:

changed 72 GB/s to 115 GB/s

GDDR3 to GDDR5

[Block diagram labels: UVD & Display Controllers; GDDR5 Memory Interface; Texture Units; SIMD Cores; PCI Express Bus Interface]

SIMD Cores: What is new for GPGPU?

Double precision – 5-way VLIW allows pairs to do double add, 4 to do double multiply

5-way VLIW: 4 normal functional units + 1 fat unit

Local memory on simd (communication)

Global memory on chip (more communication)

Better thread scheduling

General scatter/gather operations

[Chart: Percent programmable area (0 to 100 percent) for the 9700, X1800, X1900, HD3850 and HD4850; series: pixel/combined, vertex, gpgpu, fixed]

All columns refer to ATI Radeon™ (data from ATI engineering)

Transitional applications

Written by graphics programmers

Real connection with graphics is that the result is rendered and looks cute

Really programming physics and AI

Evaluating physics, simulations, and artificial intelligence on a GPU is becoming an element of future game programs.

Massively parallel algorithm formulations

Combined with responsive gameplay and rendering

How is software evolving?

Graphics APIs have added compute shaders

Dx11 compute shaders

Another stage in pipeline

Some programming languages which are “sort of C”

OpenCL

CUDA

CT

Streaming languages

Brook+

Two transitional apps: Toyshop

Two transitional apps: Froblins

ToyShop demo

Puddles

Dynamic realistic wave motion of interacting ripples over the water surface

Treat water surface as a thin elastic membrane

Simulate response due to surface tension

Numeric solution to a PDE on the GPU
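
To make the "numeric solution to a PDE" concrete, here is a minimal sketch of one explicit time step of a damped 2D wave equation on a height grid. This is a common way to animate ripples, but the slide does not say which scheme the demo uses, so the discretization, the `c2` and `damping` parameters, and the fixed border are all assumptions.

```python
def step(prev, curr, c2=0.25, damping=0.99):
    """One explicit time step of the 2D wave equation on a grid.

    prev, curr: surface heights at the two previous time steps (lists of rows).
    c2: (wave_speed * dt / dx) ** 2; keep it <= 0.5 for stability.
    Border cells are held at zero (membrane fixed at the edge).
    """
    rows, cols = len(curr), len(curr[0])
    nxt = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Discrete Laplacian: how far this cell sits from its neighbours.
            lap = (curr[i - 1][j] + curr[i + 1][j]
                   + curr[i][j - 1] + curr[i][j + 1]
                   - 4.0 * curr[i][j])
            nxt[i][j] = damping * (2.0 * curr[i][j] - prev[i][j] + c2 * lap)
    return nxt


# A single raindrop in the middle of a 5 x 5 puddle.
prev = [[0.0] * 5 for _ in range(5)]
curr = [[0.0] * 5 for _ in range(5)]
curr[2][2] = 1.0
nxt = step(prev, curr)
```

Each output cell depends only on the two previous height fields, so every cell can be computed by an independent thread, which is what makes this kind of solver a natural fit for a pixel shader.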

Rain on window

Rain on window

Physics-based movement of drops on window surface

The droplet shape and motion are influenced by gravity, interfacial tension forces, and air resistance

The surface of the glass is represented by a lattice of cells

Rain

Looks great, but do not try to predict rain using this random number technique.

Random number generator: done by a load from a small table based on screen xy coordinate + an offset that changes each frame.

GPU generation allowed only 16 persistent outputs per thread – Could not save seed
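
The table-lookup generator described above can be sketched as follows. The table size, the way the table is filled, and the index hash are hypothetical; the slide only says the lookup is keyed on the screen xy coordinate plus an offset that changes each frame.

```python
import random

TABLE_SIZE = 64  # hypothetical; the slide only says "small table"
_rng = random.Random(2008)
NOISE_TABLE = [_rng.random() for _ in range(TABLE_SIZE)]


def pseudo_random(x, y, frame_offset):
    """Stateless lookup: no per-thread seed is carried between frames."""
    index = (x * 73 + y * 151 + frame_offset) % TABLE_SIZE  # hypothetical hash
    return NOISE_TABLE[index]


# Same pixel, successive frames.
samples = [pseudo_random(10, 20, frame) for frame in range(3)]
```

The sequence repeats as soon as the offset wraps around the table, which is acceptable for visual noise but is why the technique should not be used for real simulation.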

Reference: Tatarchuk, N., Isidoro, J. R. 2006. Artist-Directable Real-Time Rain Rendering in City Environments. In Proceedings of Eurographics Workshop on Natural Phenomena, Vienna, Austria.

Programming

256 MB of memory on card; aggressive compression to fit the texture data into 250 MB (originally 478 MB)

~ 500 small shader programs

½ for the rain

Misty objects in rain

Halos around light sources

Water surface simulation for ripples/splashes

Streaming water

Warped reflections

….

The ToyShop Team

Lead Artist: Dan Roeger

Lead Programmer: Natalya Tatarchuk

David Gosselin

Artists: Daniel Szecket, Eli Turner, and Abe Wiley

Engine / Shader Programming: John Isidoro, Dan Ginsburg, Thorsten Scheuermann and Chris Oat

Producer: Lisa Close

Manager: Callan McInally

Froblin demo

Simulation and Rendering Massive Crowds of Intelligent and Detailed Creatures on GPU

A Smörgåsbord of Features

Dynamic pathfinding AI computations on GPU

Massive crowd rendering with LOD management

Tessellation for high quality close-ups and stable performance

HDR lighting and post-processing effects with gamma-correct rendering

Terrain system

Cascade shadows for large-range environments

Advanced global illumination system

Actual 0.9 teraflops performance

Run froblin demo

Tessellation allowed by hardware support for limited nested parallelism

Pathfinding on GPU

Numerically solve a 2nd order PDE on GPU with a computational iterative approach (eikonal solver)

• Represent environment as a cost field

• Through discretization of the eikonal equation

Applicable to many general algorithms and areas
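
A minimal sketch of an iterative eikonal solve over a cost field follows. It uses the standard first-order upwind update on a unit-spaced grid with repeated in-place sweeps. The slide does not specify the demo's actual discretization or iteration order, so treat the scheme, the function name `solve_eikonal`, and its parameters as assumptions.

```python
import math

INF = float("inf")


def solve_eikonal(cost, goal, max_iters=1000, tol=1e-9):
    """Travel-time field u with |grad u| = cost and u(goal) = 0."""
    rows, cols = len(cost), len(cost[0])
    u = [[INF] * cols for _ in range(rows)]
    u[goal[0]][goal[1]] = 0.0
    for _ in range(max_iters):
        change = 0.0
        for i in range(rows):
            for j in range(cols):
                if (i, j) == tuple(goal):
                    continue
                # Smallest neighbour value along each axis (upwind choice).
                a = min(u[i - 1][j] if i > 0 else INF,
                        u[i + 1][j] if i < rows - 1 else INF)
                b = min(u[i][j - 1] if j > 0 else INF,
                        u[i][j + 1] if j < cols - 1 else INF)
                lo, hi = min(a, b), max(a, b)
                if lo == INF:
                    continue  # no reachable neighbour yet
                f = cost[i][j]
                if hi - lo >= f:
                    new = lo + f  # front arrives along one axis only
                else:
                    new = 0.5 * (lo + hi + math.sqrt(2 * f * f - (hi - lo) ** 2))
                if new < u[i][j]:
                    change = max(change, u[i][j] - new)
                    u[i][j] = new
        if change <= tol:
            break
    return u


field = solve_eikonal([[1.0] * 3 for _ in range(3)], goal=(1, 1))
```

An agent standing in any cell can then walk toward the neighbour with the smallest value to approach the goal while avoiding high-cost regions.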

Froblin Land: Terrain Rendering

Smooth, crack-free LOD without degenerates

Tessellation and instancing

Leverages Direct3D® 10.1 functionality to help minimize memory footprint

Complex material system

All slides © 2008 Advanced Micro Devices, Inc. Used with permission.

Froblin demo

512 MB of memory on card, aggressive compression to fit the data.

One giant shader program for AI and animation logic (executed per-agent): 3200 instructions

~6-8 Million triangles rendered each frame

Atmospheric scattering to convey sense of scene depth

Cascaded shadows with soft shadow edges

GPU scene management using stream-out

High dynamic range imaging with post-processing for light blooms and tone mapping

Characters are tessellated and displaced dynamically giving higher detail near the viewer

More information about the Froblins demo

Reference: Shopf, J., Barczak, J., Oat, C., and Tatarchuk, N. 2008. March of the Froblins: simulation and rendering massive crowds of intelligent and detailed creatures on GPU. In ACM SIGGRAPH 2008 Classes (Los Angeles, California, August 11 - 15, 2008). SIGGRAPH '08.

Acknowledgements: Froblins

Game Computing Applications Group

Josh Barczak

Jeremy Shopf

Abe Wiley

Natalya Tatarchuk

Christopher Oat

Exigent

Chaingun Studios

Allegorithmic

What did it take to program these demos?

ToyShop

~120 engineer and artist man-months

256 MB of video memory

Over 500 individual shaders – explicit permutations

Froblins

~56 engineer and artist man-months

512 MB of video memory

More flexible shader programming model – quicker development

Many shaders – more dynamic permutations

Froblin character control shader alone has around 3200 lines of code!

Architecture/software issue: Improve the Video Interface

To lower power, the GPU has a dedicated processor for video decode; this offloads work from the CPU

Programmable cores are used for de-interlace, scale, color space convert, composition

Today this is hand coded; the challenge is to design a programming model for heterogeneous compute

Challenge

Most of the successful parallel applications seem to have dedicated languages (DirectX® 11/map-reduce/Sawzall) for limited domains

Small programs can build interesting applications

Programmers are not super experts in the hardware

Programs survive machine generations

Can you replicate this success in other domains?

Disclaimer and Attribution

DISCLAIMER

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2008 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI logo, CrossFireX, PowerPlay and Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.

Questions?
