GPU evolution Will Graphics Morph Into Compute?
Norm Rubin, Fellow, GPG (graphics products group), AMD
PACT 2008 | Oct, 2008
Performance is King
Cinematic world: studios typically use 100,000 min of compute per min of image
Blinn's Law (the flip side of Moore's Law):
– Time per frame is ~constant
– Audience expectation of quality per frame is higher every year!
– Expectation of quality increases with compute increases
GPUs are real time – 1 min of compute per min of image
so users want 100,000 times faster machines
Cars. Courtesy of Pixar Animation Studios
GPU Chip Design Focus Point
CPU
Lots of instructions little data
– Out of order exec
– Branch prediction
Reuse and locality
Task parallel
Needs OS
Complex sync
Latency machines
Instructions stay the same
GPU
Few instructions lots of data
– SIMD
– Hardware threading
Little reuse
Data parallel
No OS
Simple sync
Throughput machines
Instructions keep changing
Effects of Hardware Changes
Call of Juarez
Why do ISA instructions change rapidly?
1) A new Game becomes popular,
2) API designers (Microsoft) add new functions to make the game simpler to write
3) Hardware vendors look at the game – line by line
and add new hardware to speed up the game
4) New Hardware
5) Game developers look at the new hardware and think of interesting new effects (more realism) by pushing past what anyone thought the instruction could do
6) New Games
repeat!
No backward compatibility at the hardware level
Co-evolution in action
Photograph courtesy of Peter Chew (www.brisbaneInsects.com)
Is the GPU just a lot of CPU devices?
GPU / CPU have been co-evolving
Not a process of radical change
Current GPUs have rendering-specific hardware
Transitional features
Ex: fixed function systems
Memory systems with filtering (==texture)
Depth buffer
Rasterizer
…
These are crucial for graphics performance!
Latency or throughput
CPU and GPU are fundamentally different
CPU strength is single thread performance
(latency machine)
GPU strength is massive thread performance
(throughput machine)
What classes of problems can be solved by massive parallel processing?
What exactly does latency or throughput mean?
GPU vs. CPU performance
thread:
// load
r1 = load (index)
// series of adds
r1 = r1 + r1
r1 = r1 + r1
…
Run lots of threads
Can you get peak performance on one core, on multi-core, or on a cluster?
Peak performance = do alu ops every cycle
Typical CPU Operation
[Figure: fetch and ALU timeline]
Wait for memory, gaps prevent peak performance
Gap size varies dynamically
Hard to tolerate latency
One iteration at a time
Single CPU unit
Cannot reach 100%
Hard to prefetch data
Multi-core does not help
Cluster does not help
Limited number of outstanding fetches
100% ALU utilization
GPU threads: throughput (lower clock, different scale)
Overlapped fetch and ALU
Many outstanding fetches
Lots of threads
Fetch unit + ALU unit
Fast thread switch
In-order finish
ALU units reach 100% utilization
Hardware sync for final output
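The contrast between the CPU and GPU timelines can be approximated with a toy model. This is only a sketch: it assumes every thread repeats one fetch followed by a short burst of ALU work, and that switching between ready threads costs nothing. The function name, parameters, and cycle counts are illustrative and do not come from the slides.

```python
def alu_utilization(num_threads, alu_cycles, fetch_latency):
    """Fraction of cycles the ALU is busy in a toy interleaving model.

    Assumption (not from the slides): each thread repeats
    'fetch for fetch_latency cycles, then compute for alu_cycles cycles',
    and switching between ready threads is free.
    """
    period = alu_cycles + fetch_latency   # one thread's fetch + compute loop
    busy = num_threads * alu_cycles       # ALU work available in that period
    return min(1.0, busy / period)


# One thread: the ALU idles while the fetch is outstanding.
single = alu_utilization(num_threads=1, alu_cycles=4, fetch_latency=396)

# Enough threads to cover the latency: the ALU never idles.
many = alu_utilization(num_threads=100, alu_cycles=4, fetch_latency=396)

print(f"1 thread:    {single:.0%}")
print(f"100 threads: {many:.0%}")
```

In this model, adding threads raises utilization linearly until the fetch latency is fully covered, which is the sense in which thread count rather than load latency sets GPU performance.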
One wavefront is 64 threads
Two wavefronts/SIMD (running)
16 processing elements/SIMD
10 SIMD engines
20 program counters
Once enough resources are available a thread goes into the run queue
16 instructions finish per SIMD
Each instruction is 5-way VLIW
[Figure: scheduler block diagram. Labels: vertex fetch sequencer, texture fetch sequencer, output, filter resources, PS/VS wavefronts, wavefront status, select waves to execute, four SIMDs with 2 wavefronts each]
Threads in Run Queue
Each SIMD has 256 sets of registers
64 registers in a set (each holds 128 bits)
If each thread needs 5 (128-bit) registers, then
256/5 = 51 wavefronts can get into the run queue
51 wavefronts = 3,264 threads per SIMD, or 32,640 running or waiting threads
256 * 64 * 10 = 163,840 vector registers
256 * 64 * 10 * 4 (32-bit registers) = 655,360 registers
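The register arithmetic on this slide can be checked with a few lines of Python. The constants are taken from the slides; the variable names are made up for illustration, and the reading of a "set" as one 128-bit register for each of the 64 threads in a wavefront is an assumption.

```python
THREADS_PER_WAVEFRONT = 64     # from the slides
SIMD_ENGINES = 10              # from the slides
REGISTER_SETS_PER_SIMD = 256   # from the slides
REGISTERS_PER_SET = 64         # assumed: one 128-bit register per thread
LANES_PER_REGISTER = 4         # 128 bits = 4 x 32-bit values


def wavefronts_in_run_queue(registers_per_thread):
    """Wavefronts per SIMD whose registers fit in the register file."""
    return REGISTER_SETS_PER_SIMD // registers_per_thread


wavefronts = wavefronts_in_run_queue(5)                 # 256 // 5
threads_per_simd = wavefronts * THREADS_PER_WAVEFRONT
threads_per_chip = threads_per_simd * SIMD_ENGINES
vector_registers = REGISTER_SETS_PER_SIMD * REGISTERS_PER_SET * SIMD_ENGINES
scalar_registers = vector_registers * LANES_PER_REGISTER

print(wavefronts, threads_per_simd, threads_per_chip)
print(vector_registers, scalar_registers)
```

Halving the per-thread register count roughly doubles the number of wavefronts that can wait in the run queue, which is why register pressure matters so much to a GPU compiler.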
Implications
CPU: Loads determine performance
Compiler works hard to
– Minimize ALU code
– Reduce memory overhead
– Try to use prefetch and other magic to reduce the amount of time waiting for memory
GPU: Threads determine performance
Compiler works hard to
– Minimize ALU code
– Maximize threads
– Try to reorder instructions to reduce synchronization and other magic to reduce the amount of time waiting for threads
Graphics programming model
[Figure: pipeline stages: input assembler, vertex shader, geometry shader, rasterizer, pixel shader, output merger]
Parallel loop over all input points
Combine points (vertices) into shape
Parallel loop over all shapes
One round of nested parallelism
Generate one thread per pixel in the shape
Parallel loop over all pixels
Combine the outputs
Multiple passes
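The pipeline can be read as a few nested loops whose bodies are the shader kernels. The sketch below is sequential Python written only to show that loop structure; every function in it is hypothetical, and the bounding-box "rasterizer" is a deliberate simplification, not how a real rasterizer covers a triangle.

```python
def vertex_shader(point):
    # Hypothetical per-vertex kernel: scale a 2D point.
    x, y = point
    return (2 * x, 2 * y)


def assemble(vertices, size=3):
    # Combine consecutive vertices into shapes (triangles here).
    return [tuple(vertices[i:i + size])
            for i in range(0, len(vertices) - size + 1, size)]


def rasterize(shape):
    # Toy rasterizer: every integer pixel in the shape's bounding box.
    xs = [v[0] for v in shape]
    ys = [v[1] for v in shape]
    return [(x, y)
            for x in range(min(xs), max(xs) + 1)
            for y in range(min(ys), max(ys) + 1)]


def pixel_shader(pixel):
    # Hypothetical per-pixel kernel: derive a colour from the position.
    x, y = pixel
    return (pixel, (x + y) % 256)


def merge(fragments):
    # Combine the outputs; the last write to a pixel wins.
    framebuffer = {}
    for pixel, color in fragments:
        framebuffer[pixel] = color
    return framebuffer


def render(points):
    vertices = [vertex_shader(p) for p in points]   # loop over all input points
    fragments = []
    for shape in assemble(vertices):                # loop over all shapes
        for pixel in rasterize(shape):              # one "thread" per pixel
            fragments.append(pixel_shader(pixel))
    return merge(fragments)


frame = render([(0, 0), (1, 0), (0, 1)])
print(len(frame))
```

On a GPU each of the three loops runs in parallel, and the inner loop over pixels nested in the loop over shapes is the single round of nested parallelism the slide mentions.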
Parallelism Model
All parallel operations are hidden via domain specific API calls
Developers write sequential code + kernels
Kernels operate on one vertex or pixel
Developers never deal with parallelism directly
No need for auto parallel compilers
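A rough analogy in Python: the developer writes a kernel for one element plus ordinary sequential code, and a library call decides how the work is spread out. `run_kernel` and `brighten` are hypothetical names invented for this sketch, and a thread pool merely stands in for the graphics API and hardware scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def run_kernel(kernel, elements, workers=4):
    """Hypothetical stand-in for an API call that hides the parallelism."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the caller sees a serial result.
        return list(pool.map(kernel, elements))


def brighten(pixel):
    # Kernel: operates on exactly one pixel and touches no shared state.
    r, g, b = pixel
    return (min(r + 16, 255), min(g + 16, 255), min(b + 16, 255))


# Sequential developer code: no threads, locks, or processor counts appear.
image = [(0, 0, 0), (100, 150, 250), (255, 255, 255)]
result = run_kernel(brighten, image)
print(result)
```

Because the kernel sees only its own element, the worker count can change without touching the developer's code.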
Observations
Developers only write a small part of the program, rest of the code comes from libraries
200-300 small kernels each < 100 lines
No race conditions are possible
No error reporting (just keep going)
Can view the program as serial (per vertex/shape/pixel)
No developer knows the number of processors
Not like pthreads
Result: Lots of success, simple enough to program
Power and cost have appeared
Vendors always release a family of cards (change the number of simd engines per chip)
-Programs need to scale
-Avoid huge monolithic cores
Goal:
Best performance comes from two gpu chips on a card
Mid range performance from one gpu,
Low range – just remove simd engines
Change in the design metric (GPU Efficiency)
[Chart: processor efficiency (GFLOPS/watt and GFLOPS/$, scale 0 to 10) versus release date, 2003 to 2008]
Graph based on historical performance of ATI Radeon™ GPUs
Changes in the last generation ATI Radeon™ HD 38XX vs ATI Radeon™ HD 48XX
GPU: ATI Radeon™ HD 4870
• 800 stream processors
• = 160 x 5-way VLIW
• = 10 SIMD cores
• New SIMD core layout
• New memory architecture
• Optimized render back-ends for faster anti-aliasing performance
• Enhanced geometry shader & tessellator performance
• 1.2 teraflops performance
• 2.4 teraflops for ATI Radeon™ HD 4870 X2
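The headline numbers are consistent with each other under two assumptions that are not stated on the slide: a 750 MHz engine clock, and counting a multiply-add as two floating-point operations per stream processor per cycle. A short sketch of that arithmetic, with illustrative variable names:

```python
VLIW_UNITS = 160               # from the slide
LANES_PER_VLIW = 5             # from the slide (5-way VLIW)
SIMD_CORES = 10                # from the slide
CLOCK_GHZ = 0.75               # ASSUMPTION: 750 MHz engine clock
FLOPS_PER_LANE_PER_CYCLE = 2   # ASSUMPTION: multiply-add counted as 2 flops

stream_processors = VLIW_UNITS * LANES_PER_VLIW
vliw_units_per_simd = VLIW_UNITS // SIMD_CORES
peak_gflops = stream_processors * FLOPS_PER_LANE_PER_CYCLE * CLOCK_GHZ
x2_gflops = 2 * peak_gflops    # two GPU chips on one card, same clock assumed

print(stream_processors, vliw_units_per_simd, peak_gflops, x2_gflops)
```

The 16 VLIW units per SIMD core match the 16 processing elements per SIMD given earlier in the deck.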
ATI Radeon™ HD 4800 Series Architecture
SIMD cores: changed from 4 to 10
Memory bandwidth: changed from 72 GB/s to 115 GB/s (GDDR3 to GDDR5)
[Figure: chip block diagram. Labels: UVD & display controllers, GDDR5 memory interface, texture units, SIMD cores, PCI Express bus interface]
SIMD Cores: What is new for GPGPU?
Double precision: the 5-way VLIW allows pairs of units to do a double add and 4 units to do a double multiply
5-way VLIW = 4 normal functional units + 1 fat unit
Local memory on simd (communication)
Global memory on chip (more communication)
Better thread scheduling
General scatter/gather operations
Percent programmable area
[Chart: percent (0 to 100) for the 9700, X1800, X1900, HD3850, and HD4850; legend: pixel/combined, vertex, gpgpu, fixed]
All columns refer to ATI Radeon™ (data from ATI engineering)
Transitional applications
Written by graphics programmers
Real connection with graphics is that the result is rendered and looks cute
Really programming physics and AI
Evaluating physics, simulations, and artificial intelligence on a GPU is becoming an element of future game programs.
Massively parallel algorithm formulations
Combined with responsive gameplay and rendering
How is software evolving?
Graphics APIs have added compute shaders
DX11 compute shaders
Another stage in the pipeline
Some programming languages which are “sort of C”
OpenCL
CUDA
Ct
Streaming languages
Brook+
Two transitional apps: Toyshop
Two transitional apps: Froblins
Toyshop demo
Puddles
Dynamic realistic wave motion of interacting ripples over the water surface
Treat water surface as a thin elastic membrane
Simulate response due to surface tension
Numeric solution to a PDE on the GPU
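The slides do not say which equation or numerical scheme the demo uses. As one plausible illustration, the sketch below advances a damped 2D wave equation with an explicit finite-difference step, where each cell update depends only on its four neighbours and so maps naturally onto one thread per pixel. The function name, the `c2` and `damping` parameters, and the grid size are all assumptions.

```python
def step(prev, curr, c2=0.25, damping=0.99):
    """One explicit finite-difference step of a damped 2D wave equation.

    c2 is (wave_speed * dt / dx) ** 2; values up to 0.5 keep this explicit
    scheme stable. The damping factor is an ad hoc way to let ripples fade.
    Border cells are held at zero (membrane pinned at the edges).
    """
    n, m = len(curr), len(curr[0])
    nxt = [[0.0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            laplacian = (curr[i - 1][j] + curr[i + 1][j] +
                         curr[i][j - 1] + curr[i][j + 1] - 4.0 * curr[i][j])
            nxt[i][j] = damping * (2.0 * curr[i][j] - prev[i][j] + c2 * laplacian)
    return nxt


size = 9
prev = [[0.0] * size for _ in range(size)]
prev[4][4] = 1.0                    # a raindrop displaces the centre cell
curr = [row[:] for row in prev]     # same heights twice: zero initial velocity
nxt = step(prev, curr)
print(round(nxt[4][4], 4), round(nxt[3][4], 4))
```

After one step the displacement has moved from the centre cell into its four neighbours, which is the start of an expanding ripple.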
Rain on window
Physics-based movement of drops on window surface
The droplet shape and motion are influenced by gravity, interfacial tension, and air resistance
The surface of the glass is represented by a lattice of cells
Rain
Looks great, but do not try to predict rain using this random number technique.
Random number generator: done by a load from a small table based on screen xy coordinate + an offset that changes each frame.
That GPU generation allowed only 16 persistent outputs per thread, so a per-thread seed could not be saved
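A minimal sketch of the table-lookup idea, assuming a 64-entry table and a simple index hash; the demo's actual table size, contents, and indexing are not given in the slides, and all names here are hypothetical.

```python
import random

TABLE_SIZE = 64                   # ASSUMPTION: the slide only says "small"
_rng = random.Random(2008)        # fixed seed so the table is reproducible
NOISE_TABLE = [_rng.random() for _ in range(TABLE_SIZE)]


def pixel_random(x, y, frame_offset):
    """Stateless pseudo-random value for a pixel.

    Nothing is stored per thread: the value depends only on the screen
    position and an offset that the host changes every frame.
    """
    index = (x + 31 * y + frame_offset) % TABLE_SIZE   # illustrative hash
    return NOISE_TABLE[index]


print(pixel_random(3, 5, frame_offset=0))
print(pixel_random(3, 5, frame_offset=1))   # same pixel, next frame
```

The sequence repeats with the table size, which is acceptable for visual noise and is exactly why the slide warns against using it for prediction.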
Reference: Tatarchuk, N., Isidoro, J. R. 2006. Artist-Directable Real-Time Rain Rendering in City Environments. In Proceedings of Eurographics Workshop on Natural Phenomena, Vienna, Austria.
Programming
256 MB of memory on the card; aggressive compression was needed to fit the texture data into 250 MB (originally 478 MB)
~ 500 small shader programs
½ for the rain
Misty objects in rain
Halos around light sources
Water surface simulation for ripples/splashes
Streaming water
Warped reflections
….
The ToyShop Team
Lead Artist, Lead Programmer
Dan Roeger, Natalya Tatarchuk
David Gosselin
Artists
Daniel Szecket, Eli Turner, and Abe Wiley
Engine / Shader Programming
John Isidoro, Dan Ginsburg, Thorsten Scheuermann and Chris Oat
Producer, Manager
Lisa Close, Callan McInally
Froblin demo
Simulation and Rendering Massive Crowds of Intelligent and Detailed Creatures on GPU
A Smörgåsbord of Features
Dynamic pathfinding AI computations on GPU
Massive crowd rendering with LOD management
Tessellation for high quality close-ups and stable performance
HDR lighting and post-processing effects with gamma-correct rendering
Terrain system
Cascade shadows for large-range environments
Advanced global illumination system
Actual 0.9 teraflops performance
Run froblin demo
Tessellation is enabled by hardware support for limited nested parallelism
Pathfinding on GPU
Numerically solve a 2nd order PDE on GPU with a computational iterative approach (eikonal solver)
• Represent environment as a cost field
• Through discretization of the eikonal equation
Applicable to many general algorithms and areas
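To make the iterative approach concrete, here is a small CPU sketch of one common discretization: a first-order upwind update of |grad u| = cost on a unit grid, applied Jacobi-style so that every cell could be updated by its own thread. It is not the demo's solver; the function name, the grid, and the stopping rule are assumptions.

```python
import math

INF = float("inf")


def solve_eikonal(cost, goal, max_sweeps=1000):
    """Jacobi-style iterative solver for |grad u| = cost on a unit grid.

    Each sweep reads only the previous sweep's values, so all cells could
    be updated in parallel. Returns the travel-cost field u, with u = 0
    at the goal cell.
    """
    rows, cols = len(cost), len(cost[0])
    u = [[INF] * cols for _ in range(rows)]
    u[goal[0]][goal[1]] = 0.0

    def at(i, j):
        # Cells outside the grid are treated as unreachable.
        return u[i][j] if 0 <= i < rows and 0 <= j < cols else INF

    for _ in range(max_sweeps):
        new = [row[:] for row in u]
        for i in range(rows):
            for j in range(cols):
                if (i, j) == tuple(goal):
                    continue
                a = min(at(i - 1, j), at(i + 1, j))   # best vertical neighbour
                b = min(at(i, j - 1), at(i, j + 1))   # best horizontal neighbour
                lo, hi = min(a, b), max(a, b)
                if lo == INF:
                    continue                          # no information yet
                f = cost[i][j]
                if hi - lo >= f:
                    candidate = lo + f                # one-sided update
                else:
                    candidate = 0.5 * (a + b + math.sqrt(2.0 * f * f - (a - b) ** 2))
                new[i][j] = min(u[i][j], candidate)
        if new == u:                                  # nothing changed: converged
            break
        u = new
    return u


field = solve_eikonal([[1.0] * 5 for _ in range(5)], goal=(0, 0))
print(round(field[0][3], 3), round(field[1][1], 3))
```

An agent could then move toward the goal by repeatedly stepping to the neighbouring cell with the smallest value; raising the cost of a cell makes paths bend around it.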
Froblin Land: Terrain Rendering
Smooth, crack-free LOD without degenerates
Tessellation and instancing
Leverages Direct3D® 10.1 functionality to help minimize memory footprint
Complex material system
All slides © 2008 Advanced Micro Devices, Inc. Used with permission.
Froblin demo
512 MB of memory on card, aggressive compression to fit the data.
One giant shader program for AI and animation logic (executed per-agent): 3200 instructions
~6-8 Million triangles rendered each frame
Atmospheric scattering to convey sense of scene depth
Cascaded shadows with soft shadow edges
GPU scene management using stream-out
High dynamic range imaging with post-processing for light blooms and tone mapping
Characters are tessellated and displaced dynamically giving higher detail near the viewer
More information about the Froblins demo
Reference: Shopf, J., Barczak, J., Oat, C., and Tatarchuk, N. 2008. March of the Froblins: simulation and rendering massive crowds of intelligent and detailed creatures on GPU. In ACM SIGGRAPH 2008 Classes (Los Angeles, California, August 11 - 15, 2008). SIGGRAPH '08.
Game Computing Applications Group
Acknowledgements: Froblins
Josh Barczak
Jeremy Shopf
Abe Wiley
Exigent
Chaingun Studios
Allegorithmic
Natalya Tatarchuk
Christopher Oat
What did it take to program these demos?
ToyShop
~120 engineer and artist man-months
256 MB of video memory
Over 500 individual shaders – explicit permutations
Froblins
~56 engineer and artist man-months
512 MB of video memory
More flexible shader programming model – quicker development
Many shaders – more dynamic permutations
Froblin character control shader alone has around 3200 lines of code!
Architecture/software issue: Improve the Video Interface
To lower power, the GPU has a dedicated processor that does video decode; this offloads work from the CPU
Programmable cores are used for de-interlace, scale, color space convert, composition
Today this is hand coded; the challenge is to design a programming model for heterogeneous compute
Challenge
Most of the successful parallel applications seem to have dedicated languages (DirectX® 11/map-reduce/Sawzall) for limited domains
Small programs can build interesting applications
Programmers are not super experts in the hardware
Programs survive machine generations
Can you replicate this success in other domains?
Disclaimer and Attribution
DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2008 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI logo, CrossFireX, PowerPlay and Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.
Questions?