Graphics Performance: Balancing the Rendering Pipeline Cem Cebenoyan and Matthias Wloka.

Graphics Performance: Balancing the Rendering Pipeline

Graphics Performance: Balancing the Rendering Pipeline

Cem Cebenoyan and Matthias Wloka

NVIDIA PROPRIETARY AND CONFIDENTIAL

Introduction

At a minimum, PC is a 2 processor system

CPU

GPU

Maximum efficiency IFF

All processors are busy

All the time

CPU GPUAGP Bus


Actually, It’s Worse

CPU

AGP

Bus

Application

Large Cache

API

GPU

Vertex Processing

Triangle Setup

Fragment Shading

Framebuffer Access


Multi-Processor System

Conceptually, 5 processors

CPU

Vertex-processor(s)

Setup processor(s)

Fragment processor(s)

Blending processor(s)

All connected via some form of cache

To smooth data flow

To keep things humming


MP Systems Become Inefficient If…

One or more processors sync to each other

For example, frame-buffer lock

Insures that all caches drain

Insures that all processors idle (CPU and GPU!)

Overhead in restarting the processors

A single processor bottlenecks all others


Overview

CPU

AGP Bus

Vertex Processing

Triangle Setup

Rasterization

Memory bandwidth

Writing to and blending with video memory


Overview: For Each Stage

What are its characteristics?

How does it behave?

How to measure whether it is the bottleneck

How to influence it


CPU Characteristics

Stay within on-chip cache for maximum performance

Use CPU for

Collision detection

Physics

AI

Etc.


CPU Characteristics (cont.)

Note that graphics is capable of

20+ MTri/s (2 year old high-end)

20+ MTri/s (integrated graphics)

100+ MTri/s (current high-end)

CPU also responsible for pushing data to GPU

Cannot look at every triangle

Don’t limit graphics with CPU processing


CPU Measurement

Use VTune

Or any other profiler

Most games are CPU-limited

Little to no time in the graphics driver:

CPU is the bottleneck

Faster GPU will NOT result in faster graphics

Use VTune to track where you spend your time

Optimize those places


CPU Measurement (cont.)

But even if most time is spent in graphics driver:

CPU might still be the bottleneck

Faster GPU will NOT result in faster graphics

Use Nvidia Stats-driver (NVTune) to trace into the GPU

Timing graphics calls is pointless

Remember the large cache between CPU/GPU

Use Nvidia Stats-driver (NVTune) instead

NVTune available from Nvidia’s registered developer site


CPU Common Problems

Small batches of geometry being sent to the GPU

100 triangles per batch should be your minimum

Would like to see ~500 triangles/batch

Up to 10,000 triangles/batch

Combination of causes kill your performance

Runtime

Driver

Hardware


CPU: Batch Size Characteristic

MTris/sec vs. Batch Size(all draw-calls use same render-state)

0

2

4

6

8

10

12

14

16

Batch Size in vertices

MT

ris/

sec


CPU: Batching Solutions

Sort by render-state

Texture switches

Combine textures into one large (4kx4k) texture

Modify uv-coordinates accordingly

Tessellate geometry to overcome mirroring and wrapping

Mip-mapping works just fine

Transform switches

Pre-transform on the CPU into world-space

Replicate data into VBs (costs AGP memory)


Other Common CPU Problems

Specify vertex buffers as WRITEONLY

Minimize state changes

consider using a PURE device, iff you are optimal

Do not lock and read data from GPU

Multi-processor sync!


AGP Bus Characteristics

AGP 4x supports 20+ MTri/s

Even if all vertices and indices are dynamic

BenMark5 does just that

http://developer.nvidia.com/view.asp?IO=BenMark5

Too often AGP 4x support is busted

Use BenMark5 to test for AGP 4x support

AGP Bus through-put influenced by

Size of vertex format of dynamically written vertices

How many vertices are dynamically written


AGP Bus Characteristics (cont.)

But if frame-buffer and textures exceed video-memory, AGP is also used

to transfer STATIC vertices to GPU every frame

to transfer textures to GPU every frame

Make sure you avoid partial writes

See “Fast AGP Writes for Dynamic Vertex Data” by Dean Macri for details

Always modify all vertex-data,

even if only some data changes

Pentium 3: write in 32 byte chunks

Pentium 4: write in 64 byte chunks


AGP Bus Characteristics (cont.)

GPU caches vertex fetches

Hitting this cache causes no data to cross the bus

Cache has 32-byte lines

Vertex sizes that are multiples of 32 are beneficial

See also http://developer.nvidia.com/view.asp?IO=Vertex_Buffer_Statistics


AGP Bus Characteristics

MTris/sec vs. VB Size vs. FVF size

0

2

4

6

8

10

12

14

16

100 500 1000 5000 10000 20000 30000

Ordered VB Size, in vertices

MT

ris/

sec

24 byte FVF

32 byte FVF

40 byte FVF

48 byte FVF

56 byte FVF

64 byte FVF


AGP Bus Measurement

You can tell you’re bound by the bus if:

Increasing/decreasing vertex format size significantly impacts performance

Best to increase vertex format size using components not needed by rasterizer

for example, normals


Increasing AGP Bus Performance

Make sure frame buffer and textures fit into video-memory

Decrease number of dynamic objects (vertices)

Use vertex-shaders to animate static VBs!

Decrease vertex size

Let vertex-shader generate vertex-components!

Compress components and use vertex shader to decompress

For example, use 16bit short normals

Reorder vertices in VB to be sequential in use

Can use NVTriStrip to do this

Pad to multiples of 32-bytes


Vertex Processing Characteristics

Each vertex is transformed and lit

Performance correlates directly to

Number of vertices processed

Length of vertex shader or

Fixed-function factors, such as

Number of active lights

Type of lights

Specular on/off

LOCALVIEWER on/off

Texgen on/off

GPU core clock frequency



Vertex Processing Performance

0

1

1 6

11

16

21

26

31

36

41

46

51

56

61

66

71

76

81

86

91

96

10

1

10

6

11

1

11

6

12

1

12

6

Instructions per Vertex Shader

Ve

rts

/s_



After processing, vertices land in post-TnL FIFO

GeForce1/2/4 MX: effectively 10 entries

GeForce3/4 Ti: effectively 18 entries

Cache-hit saves:

all TnL work!

Everything before TnL in the pipeline

Only works with indexed primitives


Vertex Processing Performance

Do not be afraid to use triangles

Rarely the bottleneck

Even if it is, it would make us happy

A lot of vertex processing power available

6 * 6 pixel-quad with 2 tris is not vertex bound

If you can tell an object is made from triangles, you are not using enough triangles

~10k triangles/frame is off by 2 (two!) orders of magnitude


Code Creatures Demo

Grass scenes are NOT vertex-bound

In excess of 1,000,000 tris/frame for opening scene

~250k tris/frame minimum

CodeCreatures demo available from: http://www.codecult.de/


Vertex Processing Measurement

You are bound by vertex processing if:

Increasing/decreasing vertex shader length significantly influences performance

Adding unnecessary instructions may be optimized out by driver, though

Instead, use instructions that access constant memory to add zero to a result, for example

Fixed-function TnL performance improves when

Reducing number of lights

Turning off texgen

Simplifying light types


Improving Vertex Processing

Optimize for the post-TnL vertex cache

Use indexed primitives

Access vertices mostly sequentially, revisiting only recently accessed vertices

Let NVTriStrip or ID3DXMesh do the work

Turn off unnecessary calculations

LOCALVIEWER often unnecessary for specular

Prefer cheap approximations for lighting and other math when using vertex shaders


Improving Vertex Processing (cont.)

Optimize your vertex shaders

Use swizzling/masking extensively

Question all MOV instructions

Storing lookup tables in constant memory

for example, to compute sin/cos

See “Implementation of ‘Missing’ Vertex Shader Instructions” for more ideas

http://developer.nvidia.com/view.asp?IO=Implementation_Missing_Instructions


Improving Vertex Processing (cont.)

Consider moving per-vertex work to per-pixel

Consider using ‘shader-LODing’

Do far-away objects really need 4-bone skinning?

Can always increase screen-res/use AA to NOT be vertex-processing bound!


Triangle Setup Characteristics

Triangle setup is never the bottleneck

Except when rating the GPU

Since it is the fastest stage

Setup speed influenced by:

Number of triangles

Vertex attributes needed by rasterization

Extremely small triangles running very simple TnL

i.e., degenerate triangles!

No TnL cost, since most likely hits post-TnL cache

No fill-cost, since rejected in setup


Measuring/Improving Triangle Setup

Has never come up

Reduce ratio of degenerate triangles to real triangles

Reduce unnecessary components written out from the vertex shader


Rasterization Characteristics

Prefer the term “fragment” to “pixel”

May not correspond to any pixel in framebuffer, for example, due to z/stencil/alpha tests

May correspond to more than one pixel due to multisampling

Commonly referred to as “fill-rate”


Fill-Rate Characteristics

Fill-rate is function of number of fragments filledcost of each fragmentGPU’s core clock

Parallel SIMD operation, processesUp to 4 pixels per clock on GeForce1/2/3/4 TiUp to 2 pixels per clock on GeForce2 MX / 4 MX

Broken into a number of parts:Texture fetchingTexture addressing operationsColor blending operations


Texture Fetching Characteristics

Texture fetches are

From AGP to local video-memory, only if frame-buffer and textures exceed video-memory (to be avoided), then

From local video-memory to on-chip cache


Texture Fetching Characteristics (cont.)

Minimize cache-misses:

Use mip-mapping!

Avoid LOD bias to sharpen: it hurts caching and adds aliasing

Prefer anisotropic filtering for sharpening

Use DXT everywhere you can

Texture size as big as needed and no bigger

Texture format as small as possible

16 vs. 32 bit

Localize texture access

E.g., normal texture reads

Dependent texture reads are less localPer-pixel reflection potentially really bad


Texture Fetching Characteristics (cont.)

Number of samples taken also affects performance:

Trilinear filtering cuts fillrate in half

Anisotropic even worse

Depending on level of anisotropy

The hardware is intelligent in this regard, you only pay for the anisotropy you use


Texture Addressing Characteristics

Different texture addressing operations have wildly different performance characteristics

But texture cache hits/misses more significant

Pixels/s

1D

2D

Cubemap

Passthrough

Pixel kill

Dependent AR

Dependent GB

Offset 2D (no luminance)

Offset 2D (luminance)

Dot product 2D

Dot product depth

Dot product cubemap

Dot product reflection

Sh

ad

er

pro

gra

m t

yp

e

Texture Shader Performance


Texture Addressing Characteristics

Also, every two textures cuts fill-rate in half:

1 or 2 textures runs at full speed

3 or 4 textures runs at half speed (two clocks)


Color Blending Characteristics

Color blending operations also called ‘Register Combiners’

1 or 2 instructions (combiners) – full speed

3 or 4 instructions (combiners) – half speed

5 or 6 instructions (combiners) – one third speed

7 or 8 instructions (combiners) – one quarter speed

These numbers are for GF3 / 4 Ti

But if using 4 textures

Already at half-speed or less

Using up to 4 combiners is free


Fill-Rate Measurement

You are bound by fill-rate, if

Reducing texture sizes

Or better turning off texturing

Increases performance significantly

Turning on / off trilinear affects performance

Increasing texture units used to 4, but not actually fetching from any textures (using pixel shader instructions like texcoord), causes you to slow down


Improving Fill-Rate

Render z-only pass first

Because z-optimizations happen before rasterization

Helps with memory bandwidth as well

Even for older chips without z-optimizations

Do everything to reduce texture cache misses

Turn on anisotropic, but turn off trilinear filtering

Mip-map transitions are less visible with anisotropic filtering on


Improving Fill-Rate (cont.)

Consider palletized normal maps for compression

Consider moving per-pixel work to per-vertex

Consider ‘shader LODing’

Turn off detail map computations in the distance


Memory Bandwidth Characteristics

Memory bandwidth is often the bottleneck

especially at high resolutions

Memory bandwidth influenced by:

Screen and render-target resolutions

Render-target color / z bit depth

FSAA

Texture sizes and formats (texture fetching)

Overdraw complexity

Alpha blending

GPU’s memory-interface width

Memory clock


Memory Bandwidth Characteristics

FSAA hits memory bandwidth exclusively

no fill-rate hit with multi-sample

Failing the z/stencil/alpha test means

Pixel color is not written

Z is not written


Measuring Memory Bandwidth

Switch frame-buffer format to 16bit

Switch all render-targets to 16bit

If performance doubles

App was 100% memory-bandwidth bound

If performance unchanged

App is not memory-bandwidth bound


Improving Memory Bandwidth

Overdraw

Reduce as much as possible

Lightly sort objects front to back

All architectures benefit, since z-test fails

Reduce blending as much as possible

Always enable alpha-test when blendingTweak test-value as much as possible

Consider using 2-pass alpha-test/-blend technique

Always clear z/stencil (using clear())

Do not clear color if not necessary

Writing z from shader destroys early z


Improving Memory Bandwidth (cont.)

Prefer FSAA over high resolution

Consider using z-only pass

Turn off z-writing for all subsequent passes


Conclusion

A lot of different performance bottle-necks

Know which one to tweak

Use suggestions here to

make things faster w/o making it visibly worse

Make things prettier for free!


Questions…

[email protected]

[email protected]

http://developer.nvidia.com

Graphics Performance: Balancing the Rendering Pipeline Cem Cebenoyan and Matthias Wloka.

Documents

Transcript of Graphics Performance: Balancing the Rendering Pipeline Cem Cebenoyan and Matthias Wloka.