Graphics Performance: Balancing the Rendering Pipeline
Cem Cebenoyan and Matthias Wloka
NVIDIA PROPRIETARY AND CONFIDENTIAL
Introduction
At a minimum, a PC is a two-processor system
CPU
GPU
Maximum efficiency IFF
All processors are busy
All the time
[Diagram: CPU and GPU, connected by the AGP bus]
Actually, It’s Worse
[Diagram: CPU (Application, API) connected through a large cache and the AGP bus to the GPU (Vertex Processing, Triangle Setup, Fragment Shading, Framebuffer Access)]
Multi-Processor System
Conceptually, 5 processors
CPU
Vertex-processor(s)
Setup processor(s)
Fragment processor(s)
Blending processor(s)
All connected via some form of cache
To smooth data flow
To keep things humming
MP Systems Become Inefficient If…
One or more processors sync to each other
For example, frame-buffer lock
Ensures that all caches drain
Ensures that all processors idle (CPU and GPU!)
Overhead in restarting the processors
A single processor bottlenecks all others
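The effect of a single bottleneck can be illustrated with a toy model. This is only a sketch, assuming each stage can be summarized by one sustained rate; the stage names follow the slides, but the rates are made-up numbers, not measurements.

```python
# Toy model: a pipeline sustains no more than its slowest stage.
# All rates are hypothetical and only illustrate the idea.

def pipeline_rate(stage_rates):
    """Return (bottleneck_stage, sustained_rate) for a dict of stage -> rate."""
    stage = min(stage_rates, key=stage_rates.get)
    return stage, stage_rates[stage]

rates = {
    "CPU": 8e6,
    "Vertex processing": 30e6,
    "Triangle setup": 60e6,
    "Fragment shading": 25e6,
    "Framebuffer access": 20e6,
}

stage, rate = pipeline_rate(rates)
print(f"Bottleneck: {stage} at {rate / 1e6:.0f} M units/s")
```

In this model, speeding up any stage other than the reported one leaves the sustained rate unchanged.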
Overview
CPU
AGP Bus
Vertex Processing
Triangle Setup
Rasterization
Memory bandwidth
Writing to and blending with video memory
Overview: For Each Stage
What are its characteristics?
How does it behave?
How to measure whether it is the bottleneck
How to influence it
CPU Characteristics
Stay within on-chip cache for maximum performance
Use CPU for
Collision detection
Physics
AI
Etc.
CPU Characteristics (cont.)
Note that graphics is capable of
20+ MTri/s (2 year old high-end)
20+ MTri/s (integrated graphics)
100+ MTri/s (current high-end)
CPU also responsible for pushing data to GPU
Cannot look at every triangle
Don’t limit graphics with CPU processing
CPU Measurement
Use VTune
Or any other profiler
Most games are CPU-limited
Little to no time in the graphics driver:
CPU is the bottleneck
Faster GPU will NOT result in faster graphics
Use VTune to track where you spend your time
Optimize those places
CPU Measurement (cont.)
But even if most time is spent in graphics driver:
CPU might still be the bottleneck
Faster GPU will NOT result in faster graphics
Use Nvidia Stats-driver (NVTune) to trace into the GPU
Timing graphics calls is pointless
Remember the large cache between CPU/GPU
Use Nvidia Stats-driver (NVTune) instead
NVTune available from Nvidia’s registered developer site
CPU Common Problems
Small batches of geometry being sent to the GPU
100 triangles per batch should be your minimum
Would like to see ~500 triangles/batch
Up to 10,000 triangles/batch
A combination of causes kills your performance
Runtime
Driver
Hardware
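Why small batches hurt can be sketched with a simple model: if the CPU can only issue a fixed number of draw calls per second, triangle throughput is capped by batch size times that call rate. The call rate and GPU peak below are hypothetical values chosen for illustration, not measured figures.

```python
# Simple model: throughput is the lower of the CPU-side limit
# (draw calls per second times triangles per call) and the GPU peak.
# calls_per_second and gpu_peak are assumed, illustrative values.

def triangles_per_second(batch_size, calls_per_second, gpu_peak):
    cpu_limited = batch_size * calls_per_second
    return min(cpu_limited, gpu_peak)

CALLS_PER_SECOND = 40_000      # hypothetical CPU/runtime/driver limit
GPU_PEAK = 20_000_000          # hypothetical GPU triangle rate

for batch in (10, 100, 500, 10_000):
    rate = triangles_per_second(batch, CALLS_PER_SECOND, GPU_PEAK)
    print(f"{batch:>6} tris/batch -> {rate / 1e6:.1f} MTri/s")
```

Under these assumptions the GPU peak is only reached once batches hold about 500 triangles.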
CPU: Batch Size Characteristic
[Chart: MTris/sec vs. batch size (all draw calls use the same render state). X axis: batch size in vertices. Y axis: MTris/sec, 0 to 16]
CPU: Batching Solutions
Sort by render-state
Texture switches
Combine textures into one large (4kx4k) texture
Modify uv-coordinates accordingly
Tessellate geometry to overcome mirroring and wrapping
Mip-mapping works just fine
Transform switches
Pre-transform on the CPU into world-space
Replicate data into VBs (costs AGP memory)
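Sorting by render-state can be sketched as follows. The draw-call tuples and state names are hypothetical; a real engine would merge the vertex and index data as well, which this sketch only represents by summing triangle counts.

```python
from itertools import groupby

# Each draw call is a hypothetical (texture, transform, triangle_count) tuple.
# Calls that share texture and transform can be submitted as one batch.

def merge_batches(draw_calls):
    def state(call):
        return (call[0], call[1])
    ordered = sorted(draw_calls, key=state)
    return [
        (tex, xform, sum(call[2] for call in group))
        for (tex, xform), group in groupby(ordered, key=state)
    ]

calls = [
    ("grass", "world", 40),
    ("rock", "world", 60),
    ("grass", "world", 80),
    ("rock", "world", 20),
]
print(merge_batches(calls))
```

Four calls of 20 to 80 triangles collapse into two larger batches, which is the point of combining textures and pre-transforming into world-space: more calls end up with identical state.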
Other Common CPU Problems
Specify vertex buffers as WRITEONLY
Minimize state changes
Consider using a PURE device, iff you are optimal
Do not lock and read data from GPU
Multi-processor sync!
AGP Bus Characteristics
AGP 4x supports 20+ MTri/s
Even if all vertices and indices are dynamic
BenMark5 does just that
http://developer.nvidia.com/view.asp?IO=BenMark5
Too often AGP 4x support is busted
Use BenMark5 to test for AGP 4x support
AGP bus throughput is influenced by
Size of vertex format of dynamically written vertices
How many vertices are dynamically written
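A back-of-the-envelope check of how vertex format size and dynamic vertex count translate into bus traffic is sketched below. It assumes roughly one new vertex per triangle (well-indexed meshes), 16-bit indices, and the nominal AGP 4x peak of about 1066 MB/s; real sustained throughput is lower, so treat the numbers as rough.

```python
# Rough bus-traffic estimate for fully dynamic vertices and indices.
# Assumptions: ~1 vertex per triangle, 3 indices of 2 bytes per triangle,
# nominal AGP 4x peak (sustained throughput in practice is lower).

AGP_4X_PEAK_BYTES_PER_S = 1_066_000_000

def bus_bytes_per_second(tris_per_s, vertex_size, verts_per_tri=1.0, index_size=2):
    bytes_per_tri = verts_per_tri * vertex_size + 3 * index_size
    return tris_per_s * bytes_per_tri

for size in (32, 64):
    traffic = bus_bytes_per_second(20_000_000, size)
    fits = traffic <= AGP_4X_PEAK_BYTES_PER_S
    print(f"{size}-byte vertices at 20 MTri/s: {traffic / 1e6:.0f} MB/s, fits: {fits}")
```

With these assumptions, 32-byte vertices at 20 MTri/s stay under the nominal peak while 64-byte vertices do not.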
AGP Bus Characteristics (cont.)
But if frame-buffer and textures exceed video-memory, AGP is also used
to transfer STATIC vertices to GPU every frame
to transfer textures to GPU every frame
Make sure you avoid partial writes
See “Fast AGP Writes for Dynamic Vertex Data” by Dean Macri for details
Always modify all vertex-data,
even if only some data changes
Pentium 3: write in 32 byte chunks
Pentium 4: write in 64 byte chunks
AGP Bus Characteristics (cont.)
GPU caches vertex fetches
Hitting this cache causes no data to cross the bus
Cache has 32-byte lines
Vertex sizes that are multiples of 32 are beneficial
See also http://developer.nvidia.com/view.asp?IO=Vertex_Buffer_Statistics
AGP Bus Characteristics
[Chart: MTris/sec vs. VB size vs. FVF size. X axis: ordered VB size in vertices (100, 500, 1000, 5000, 10000, 20000, 30000). Y axis: MTris/sec, 0 to 16. One series each for 24, 32, 40, 48, 56, and 64 byte FVF]
AGP Bus Measurement
You can tell you’re bound by the bus if:
Increasing/decreasing vertex format size significantly impacts performance
Best to increase vertex format size using components not needed by rasterizer
for example, normals
Increasing AGP Bus Performance
Make sure frame buffer and textures fit into video-memory
Decrease number of dynamic objects (vertices)
Use vertex-shaders to animate static VBs!
Decrease vertex size
Let vertex-shader generate vertex-components!
Compress components and use vertex shader to decompress
For example, use 16bit short normals
Reorder vertices in VB to be sequential in use
Can use NVTriStrip to do this
Pad to multiples of 32-bytes
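The compression and padding suggestions above can be sketched on the CPU side like this. The quantization scale (32767) and the helper names are assumptions for illustration; the matching decompression would be a multiply in the vertex shader.

```python
import struct

# Quantize a unit normal to three signed 16-bit shorts (6 bytes instead of
# 12 bytes for three floats). Scale factor 32767 is an assumed convention.

def pack_normal_16(nx, ny, nz):
    shorts = [max(-32767, min(32767, round(c * 32767))) for c in (nx, ny, nz)]
    return struct.pack("<3h", *shorts)

def unpack_normal_16(data):
    # What the vertex shader would do: scale back to the [-1, 1] range.
    return tuple(c / 32767 for c in struct.unpack("<3h", data))

def padded_stride(size, line=32):
    # Round a vertex size up to the next multiple of the cache-line size.
    return -(-size // line) * line

packed = pack_normal_16(0.0, 0.0, 1.0)
print(len(packed), unpack_normal_16(packed))
print(padded_stride(24), padded_stride(40))
```

Padding trades a little memory for vertices that never straddle a 32-byte line, so it only pays off after the format has been made as small as possible.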
Vertex Processing Characteristics
Each vertex is transformed and lit
Performance correlates directly to
Number of vertices processed
Length of vertex shader or
Fixed-function factors, such as
Number of active lights
Type of lights
Specular on/off
LOCALVIEWER on/off
Texgen on/off
GPU core clock frequency
Vertex Processing Characteristics
[Chart: Vertex Processing Performance. X axis: instructions per vertex shader (1 to 126). Y axis: verts/s]
Vertex Processing Characteristics
After processing, vertices land in post-TnL FIFO
GeForce1/2/4 MX: effectively 10 entries
GeForce3/4 Ti: effectively 18 entries
Cache-hit saves:
all TnL work!
Everything before TnL in the pipeline
Only works with indexed primitives
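The benefit of the post-TnL FIFO can be estimated by replaying an index list against a small FIFO, as sketched below. The FIFO sizes come from the slide; the replay itself is a simplified model, not a description of the actual hardware.

```python
from collections import deque

# Count how many vertices must actually be transformed when an index list
# is replayed against a FIFO of recently transformed vertices.

def transformed_vertices(indices, cache_size):
    cache = deque(maxlen=cache_size)
    misses = 0
    for index in indices:
        if index not in cache:
            misses += 1
            cache.append(index)  # FIFO: the oldest entry drops out when full
    return misses

# Four triangles of a strip, written out as an indexed triangle list.
strip = [0, 1, 2, 2, 1, 3, 2, 3, 4, 4, 3, 5]
print(transformed_vertices(strip, 10))   # GeForce1/2/4 MX size from the slide
print(transformed_vertices(strip, 18))   # GeForce3/4 Ti size from the slide
```

Without indices every one of the 12 vertices would be transformed; with the FIFO only the 6 distinct ones are.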
Vertex Processing Performance
Do not be afraid to use triangles
Rarely the bottleneck
Even if it is, it would make us happy
A lot of vertex processing power available
6 * 6 pixel-quad with 2 tris is not vertex bound
If you can tell an object is made from triangles, you are not using enough triangles
~10k triangles/frame is off by 2 (two!) orders of magnitude
Code Creatures Demo
Grass scenes are NOT vertex-bound
In excess of 1,000,000 tris/frame for opening scene
~250k tris/frame minimum
CodeCreatures demo available from: http://www.codecult.de/
Vertex Processing Measurement
You are bound by vertex processing if:
Increasing/decreasing vertex shader length significantly influences performance
Unnecessary instructions you add may be optimized out by the driver, though
Instead, use instructions that access constant memory to add zero to a result, for example
Fixed-function TnL performance improves when
Reducing number of lights
Turning off texgen
Simplifying light types
Improving Vertex Processing
Optimize for the post-TnL vertex cache
Use indexed primitives
Access vertices mostly sequentially, revisiting only recently accessed vertices
Let NVTriStrip or ID3DXMesh do the work
Turn off unnecessary calculations
LOCALVIEWER often unnecessary for specular
Prefer cheap approximations for lighting and other math when using vertex shaders
Improving Vertex Processing (cont.)
Optimize your vertex shaders
Use swizzling/masking extensively
Question all MOV instructions
Store lookup tables in constant memory
for example, to compute sin/cos
See “Implementation of ‘Missing’ Vertex Shader Instructions” for more ideas
http://developer.nvidia.com/view.asp?IO=Implementation_Missing_Instructions
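The lookup-table idea can be prototyped on the CPU before writing the shader. The sketch below assumes a hypothetical 64-entry sine table with linear interpolation; the table size and function names are illustrative and not taken from the referenced paper.

```python
import math

# A sine table as it might be laid out in shader constant memory.
# TABLE_SIZE is an assumed value; larger tables cost more constants.
TABLE_SIZE = 64
SIN_TABLE = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]

def table_sin(angle):
    position = angle / (2 * math.pi) * TABLE_SIZE
    index = math.floor(position)
    fraction = position - index
    low = SIN_TABLE[index % TABLE_SIZE]
    high = SIN_TABLE[(index + 1) % TABLE_SIZE]
    return low + (high - low) * fraction  # linear interpolation

def table_cos(angle):
    return table_sin(angle + math.pi / 2)

print(table_sin(1.0), math.sin(1.0))
```

Prototyping this way makes it easy to see how much precision a given table size buys before spending constant registers on it.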
Improving Vertex Processing (cont.)
Consider moving per-vertex work to per-pixel
Consider using ‘shader-LODing’
Do far-away objects really need 4-bone skinning?
Can always increase screen-res/use AA to NOT be vertex-processing bound!
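Shader-LODing for skinning can be as simple as picking a cheaper shader by distance. The distance thresholds and bone counts below are hypothetical and would need tuning per title.

```python
# Pick the number of bones per vertex from the object's distance to the camera.
# Thresholds are illustrative: (maximum distance, bones per vertex).

def bones_for_distance(distance, thresholds=((10.0, 4), (30.0, 2))):
    for max_distance, bones in thresholds:
        if distance <= max_distance:
            return bones
    return 1  # far away: single-bone skinning

for d in (5.0, 20.0, 100.0):
    print(d, bones_for_distance(d))
```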
Triangle Setup Characteristics
Triangle setup is never the bottleneck
Except when rating the GPU
Since it is the fastest stage
Setup speed influenced by:
Number of triangles
Vertex attributes needed by rasterization
Extremely small triangles running very simple TnL
i.e., degenerate triangles!
No TnL cost, since most likely hits post-TnL cache
No fill-cost, since rejected in setup
Measuring/Improving Triangle Setup
Has never come up
Reduce ratio of degenerate triangles to real triangles
Reduce unnecessary components written out from the vertex shader
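To see how many degenerate triangles a mesh carries, an indexed triangle list can be scanned for triangles that repeat an index. This is a minimal sketch; it only catches index-level degenerates, not triangles that are zero-area because of coincident positions.

```python
# Count degenerate vs. real triangles in an indexed triangle list.
# A triangle is treated as degenerate if any index repeats.

def count_degenerates(indices):
    triangles = [indices[i:i + 3] for i in range(0, len(indices) - 2, 3)]
    degenerate = sum(1 for tri in triangles if len(set(tri)) < 3)
    return degenerate, len(triangles) - degenerate

stitched = [0, 1, 2, 2, 2, 3, 2, 3, 3, 3, 4, 5]
print(count_degenerates(stitched))
```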
Rasterization Characteristics
Prefer the term “fragment” to “pixel”
May not correspond to any pixel in framebuffer, for example, due to z/stencil/alpha tests
May correspond to more than one pixel due to multisampling
Commonly referred to as “fill-rate”
Fill-Rate Characteristics
Fill-rate is a function of
Number of fragments filled
Cost of each fragment
GPU’s core clock
Parallel SIMD operation, processes
Up to 4 pixels per clock on GeForce1/2/3/4 Ti
Up to 2 pixels per clock on GeForce2 MX / 4 MX
Broken into a number of parts:
Texture fetching
Texture addressing operations
Color blending operations
Texture Fetching Characteristics
Texture fetches are
From AGP to local video-memory, only if frame-buffer and textures exceed video-memory (to be avoided), then
From local video-memory to on-chip cache
Texture Fetching Characteristics (cont.)
Minimize cache-misses:
Use mip-mapping!
Avoid LOD bias to sharpen: it hurts caching and adds aliasing
Prefer anisotropic filtering for sharpening
Use DXT everywhere you can
Texture size as big as needed and no bigger
Texture format as small as possible
16 vs. 32 bit
Localize texture access
E.g., normal texture reads
Dependent texture reads are less local
Per-pixel reflection potentially really bad
Texture Fetching Characteristics (cont.)
Number of samples taken also affects performance:
Trilinear filtering cuts fillrate in half
Anisotropic even worse
Depending on level of anisotropy
The hardware is intelligent in this regard: you only pay for the anisotropy you use
Texture Addressing Characteristics
Different texture addressing operations have wildly different performance characteristics
But texture cache hits/misses more significant
[Chart: Texture Shader Performance. Pixels/s by shader program type: 1D, 2D, Cubemap, Passthrough, Pixel kill, Dependent AR, Dependent GB, Offset 2D (no luminance), Offset 2D (luminance), Dot product 2D, Dot product depth, Dot product cubemap, Dot product reflection]
Texture Addressing Characteristics
Also, going beyond two textures cuts fill-rate in half:
1 or 2 textures runs at full speed
3 or 4 textures runs at half speed (two clocks)
Color Blending Characteristics
Color blending operations also called ‘Register Combiners’
1 or 2 instructions (combiners) – full speed
3 or 4 instructions (combiners) – half speed
5 or 6 instructions (combiners) – one third speed
7 or 8 instructions (combiners) – one quarter speed
These numbers are for GF3 / 4 Ti
But if using 4 textures
Already at half-speed or less
Using up to 4 combiners is free
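The texture and combiner rules above can be folded into one small cost function. This is a reading of the slides for GF3 / 4 Ti, assuming the slower of the two paths determines the clocks per fragment; it is a sketch, not a hardware specification.

```python
from math import ceil

# Clocks per fragment under the rules on the slides: every pair of textures
# and every pair of combiners costs one clock, and the slower path wins.

def clocks_per_fragment(textures, combiners):
    texture_clocks = max(1, ceil(textures / 2))
    combiner_clocks = max(1, ceil(combiners / 2))
    return max(texture_clocks, combiner_clocks)

for textures, combiners in ((2, 2), (4, 4), (2, 5), (1, 8)):
    clocks = clocks_per_fragment(textures, combiners)
    print(f"{textures} textures, {combiners} combiners: 1/{clocks} speed")
```

With 4 textures the cost is already two clocks, which is why up to 4 combiners add nothing in that case.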
Fill-Rate Measurement
You are bound by fill-rate, if
Reducing texture sizes
Or better, turning off texturing
Increases performance significantly
Turning on / off trilinear affects performance
Increasing texture units used to 4, but not actually fetching from any textures (using pixel shader instructions like texcoord), causes you to slow down
Improving Fill-Rate
Render z-only pass first
Because z-optimizations happen before rasterization
Helps with memory bandwidth as well
Even for older chips without z-optimizations
Do everything to reduce texture cache misses
Turn on anisotropic, but turn off trilinear filtering
Mip-map transitions are less visible with anisotropic filtering on
Improving Fill-Rate (cont.)
Consider palettized normal maps for compression
Consider moving per-pixel work to per-vertex
Consider ‘shader LODing’
Turn off detail map computations in the distance
Memory Bandwidth Characteristics
Memory bandwidth is often the bottleneck
especially at high resolutions
Memory bandwidth influenced by:
Screen and render-target resolutions
Render-target color / z bit depth
FSAA
Texture sizes and formats (texture fetching)
Overdraw complexity
Alpha blending
GPU’s memory-interface width
Memory clock
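How several of these factors multiply can be sketched with a rough frame-buffer traffic estimate. The model assumes every fragment reads z, writes z, and writes color, and it ignores texture fetches, FSAA, and any z-optimizations; the resolution, overdraw, and frame rate are example values.

```python
# Rough frame-buffer traffic in bytes per second.
# Per fragment: one color write, one z read, one z write;
# alpha blending adds a read of the destination color.

def framebuffer_bytes_per_second(width, height, fps, overdraw,
                                 color_bytes=4, z_bytes=4, blending=False):
    per_fragment = color_bytes + 2 * z_bytes
    if blending:
        per_fragment += color_bytes
    return width * height * overdraw * per_fragment * fps

for label, kwargs in (("32-bit", {}), ("16-bit", {"color_bytes": 2, "z_bytes": 2})):
    traffic = framebuffer_bytes_per_second(1024, 768, 60, 3, **kwargs)
    print(f"{label}: {traffic / 1e9:.2f} GB/s")
```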
Memory Bandwidth Characteristics
FSAA hits memory bandwidth exclusively
no fill-rate hit with multi-sample
Failing the z/stencil/alpha test means
Pixel color is not written
Z is not written
Measuring Memory Bandwidth
Switch frame-buffer format to 16bit
Switch all render-targets to 16bit
If performance doubles
App was 100% memory-bandwidth bound
If performance unchanged
App is not memory-bandwidth bound
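Results between those two extremes can be interpreted with a simple model: assume frame time splits into a bandwidth-bound part that halves at 16 bit and a remainder that does not change. The model and the function name are assumptions; real pipelines overlap work, so the result is only an estimate.

```python
# Estimate the bandwidth-bound fraction of frame time from two measurements.
# Model: t32 = a + b and t16 = a + b / 2, so b / t32 = 2 * (1 - t16 / t32).
# In frame rates, t16 / t32 equals fps_32bit / fps_16bit.

def bandwidth_bound_fraction(fps_32bit, fps_16bit):
    fraction = 2.0 * (1.0 - fps_32bit / fps_16bit)
    return min(1.0, max(0.0, fraction))

print(bandwidth_bound_fraction(30.0, 60.0))  # performance doubles
print(bandwidth_bound_fraction(60.0, 60.0))  # performance unchanged
print(bandwidth_bound_fraction(40.0, 50.0))  # somewhere in between
```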
Improving Memory Bandwidth
Overdraw
Reduce as much as possible
Lightly sort objects front to back
All architectures benefit, since z-test fails
Reduce blending as much as possible
Always enable alpha-test when blending
Tweak test-value as much as possible
Consider using 2-pass alpha-test/-blend technique
Always clear z/stencil (using clear())
Do not clear color if not necessary
Writing z from shader destroys early z
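The "lightly sort objects front to back" advice above needs only a coarse per-object sort, for example by the depth of each object's center along the view direction. The object representation below (name plus center point) is hypothetical.

```python
# Coarse front-to-back ordering: sort objects by the depth of their center
# along the camera's view direction. Objects are (name, center) pairs.

def sort_front_to_back(objects, camera_pos, view_dir):
    def depth(obj):
        _, center = obj
        return sum((c - p) * d for c, p, d in zip(center, camera_pos, view_dir))
    return sorted(objects, key=depth)

scene = [
    ("far", (0.0, 0.0, 50.0)),
    ("near", (0.0, 0.0, 5.0)),
    ("mid", (0.0, 0.0, 20.0)),
]
ordered = sort_front_to_back(scene, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
print([name for name, _ in ordered])
```

A per-object sort like this is cheap on the CPU and can be combined with render-state sorting by bucketing depths coarsely first.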
Improving Memory Bandwidth (cont.)
Prefer FSAA over high resolution
Consider using z-only pass
Turn off z-writing for all subsequent passes
Conclusion
A lot of different performance bottlenecks
Know which one to tweak
Use suggestions here to
Make things faster without making them visibly worse
Make things prettier for free!
Questions…
http://developer.nvidia.com