Windows to Reality: Getting the Most out of Direct3D 10 Graphics in Your Games

Shanon Drone
Software Development Engineer
XNA Developer Connection
Microsoft

Key areas

Debug Layer
Draw Calls
Constant Updates
State Management
Shader Linkage
Resource Updates
Dynamic Geometry
Porting Tips

Debug Layer: Use it!

The D3D10 debug layer can help find performance issues.

The application enables it by passing D3D10_CREATE_DEVICE_DEBUG into D3D10CreateDevice.

Use the D3DX10 debug runtime: link against D3DX10d.lib.

Only do this for debug builds!

Look for performance warnings in the debug output.

Draw Calls

Draw calls are still “not free”.

Draw overhead is reduced in D3D10, but not enough that you can be lazy.

Reducing the number of draw calls will still give a performance win.

Draw Calls: Excess baggage

An increase in the number of draw calls generally increases the number of API calls associated with those draws:

Constant Buffer updates
Resource changes (VBs, IBs, Textures)
Input Layout changes

The performance cost of each of these varies with the draw call count.

Constant Updates

Updating shader constants was often a bottleneck in D3D9.

It can still be a bottleneck in D3D10.

The main difference between the two is the new Constant Buffer object in D3D10.

This is the largest section of this talk.

Constant Updates: Constant Buffer Recap

Constant Buffers are buffer objects that hold shader constant data.

They are updated by calling Map with D3D10_MAP_WRITE_DISCARD or by calling UpdateSubresource.

There are 16 Constant Buffer slots available to each shader in the pipeline. Try not to use all 16, to leave some headroom.

Constant Updates: Porting Issues

D3D9 constants were updated individually by calling SetXXXXXShaderConstantX.

In D3D10, you have to update the entire constant buffer all at once.

A naïve port from D3D9 to D3D10 can have crippling performance implications if Constant Buffers are not handled correctly!

Rule of thumb: Do not update more data than you need to.

Constant Updates: Naïve Port (AKA how to cripple perf)

Each shader uses one big constant buffer.

Submitting one value submits them all!

If you have one 4096-byte Constant Buffer and you only need to update your World matrix, you still have to update all 4096 bytes of data and send them across the bus.

Don’t do this!

Constant Updates: Naïve Port (AKA how to cripple perf)

Scene: 100 skinned meshes (100 materials), 900 static meshes (400 materials), 1 shadow pass + 1 lighting pass

cbuffer VSGlobalsCB
{
    matrix ViewProj;
    matrix Bones[100];
    matrix World;
    float SpecPower;
    float4 BDRFCoefficients;
    float AppTime;
    uint2 RenderTargetSize;
};

Size: 6560 bytes

Shadow Pass
Update VSGlobalsCB for each skinned mesh: 6560 bytes x 100 = 656,000 bytes
Update VSGlobalsCB for each static mesh: 6560 bytes x 900 = 5,904,000 bytes

Light Pass
Update VSGlobalsCB for each skinned mesh: 6560 bytes x 100 = 656,000 bytes
Update VSGlobalsCB for each static mesh: 6560 bytes x 900 = 5,904,000 bytes

Total = 13,120,000 bytes
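The arithmetic on this slide can be reproduced with a small sketch. It sums the raw member sizes of the single big buffer and multiplies by the number of meshes and passes. The names `kGlobalsCB` and `naiveBytesPerFrame` are made up for illustration, and the sizes are raw sums that ignore any HLSL packing padding, which is how the slide's 6560-byte figure comes out.

```cpp
#include <cassert>
#include <cstddef>

// Raw member sizes in bytes (a matrix is 4x4 floats). Packing padding is ignored,
// matching the slide's arithmetic.
constexpr std::size_t kMatrix = 64, kFloat = 4, kFloat4 = 16, kUint2 = 8;

// Size of the single monolithic VSGlobalsCB.
constexpr std::size_t kGlobalsCB =
      kMatrix          // ViewProj
    + 100 * kMatrix    // Bones[100]
    + kMatrix          // World
    + kFloat           // SpecPower
    + kFloat4          // BDRFCoefficients
    + kFloat           // AppTime
    + kUint2;          // RenderTargetSize

static_assert(kGlobalsCB == 6560, "matches the size shown on the slide");

// Bytes uploaded per frame when the whole buffer is rewritten
// for every mesh in every pass.
constexpr std::size_t naiveBytesPerFrame(std::size_t meshes, std::size_t passes) {
    return kGlobalsCB * meshes * passes;
}
```

With 1000 meshes (100 skinned + 900 static) and 2 passes, `naiveBytesPerFrame(1000, 2)` gives the slide's total.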

Constant Updates: Organize Constants

The first step is to organize constants by frequency of update.

One shader will generally be used to draw several objects.

Some data in this shader doesn’t need to be set for every draw, for example Time and ViewProj matrices.

Split these out into their own buffers.

cbuffer VSGlobalPerFrameCB
{
    float AppTime;
};
Size: 4 bytes

cbuffer VSPerPassCB
{
    matrix ViewProj;
    uint2 RenderTargetSize;
};
Size: 72 bytes

cbuffer VSPerSkinnedCB
{
    matrix Bones[100];
};
Size: 6400 bytes

cbuffer VSPerStaticCB
{
    matrix World;
};
Size: 64 bytes

cbuffer VSPerMaterialCB
{
    float SpecPower;
    float4 BDRFCoefficients;
};
Size: 20 bytes

Updates per frame:

Update VSGlobalPerFrameCB (begin frame): 4 bytes x 1 = 4 bytes
Update VSPerSkinnedCBs: 6400 bytes x 100 = 640,000 bytes
Update VSPerStaticCBs: 64 bytes x 900 = 57,600 bytes
Update VSPerPassCB (shadow pass): 72 bytes x 1 = 72 bytes
Update VSPerPassCB (light pass): 72 bytes x 1 = 72 bytes
Update VSPerMaterialCBs: 20 bytes x 500 = 10,000 bytes

Total = 707,748 bytes

Constant Updates

13,120,000 bytes / 707,748 bytes ≈ 18x
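As a check on the comparison above, the following sketch adds up the per-frame uploads for the split buffers and divides the naïve total by the result. `Update`, `totalBytes`, and `splitBytesPerFrame` are hypothetical names, and the buffer sizes are the slide's raw figures. A real constant buffer would be rounded up to a multiple of 16 bytes, so actual totals would be slightly higher.

```cpp
#include <cassert>
#include <cstddef>

// One line of the slide's table: buffer size and number of updates per frame.
struct Update {
    std::size_t bytes;
    std::size_t count;
};

// Sum of bytes uploaded for a list of (size, count) pairs.
inline std::size_t totalBytes(const Update* updates, std::size_t n) {
    std::size_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += updates[i].bytes * updates[i].count;
    }
    return sum;
}

// Per-frame uploads with constants split by update frequency.
inline std::size_t splitBytesPerFrame() {
    const Update updates[] = {
        {4, 1},       // VSGlobalPerFrameCB, once per frame
        {6400, 100},  // VSPerSkinnedCB, once per skinned mesh
        {64, 900},    // VSPerStaticCB, once per static mesh
        {72, 2},      // VSPerPassCB, once per pass (shadow + light)
        {20, 500},    // VSPerMaterialCB, once per material
    };
    return totalBytes(updates, sizeof(updates) / sizeof(updates[0]));
}
```

Dividing 13,120,000 by `splitBytesPerFrame()` gives roughly 18.5, which the slide rounds down to 18x.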

Constant Updates: Managing Buffers

Constant buffers need to be managed in the application.

Creating a few buffers that are used for all shader constants just won’t work: with large buffers, we update more data than necessary.

Constant Updates: Managing Buffers

Solution 1 (Fastest)

Create Constant Buffers that line up exactly with the number of elements of each frequency group:

Global CBs
CBs per Mesh
CBs per Material
CBs per Pass

This ensures that EVERY constant buffer is no larger than it absolutely needs to be.

This also ensures the most efficient update of CBs based upon frequency.

Constant Updates: Managing Buffers

Solution 2 (Second Best)

If you cannot create CBs that line up exactly with elements, you can create a tiered constant buffer system.

Create arrays of 32-byte, 64-byte, 128-byte, 256-byte, etc. constant buffers.

Keep a shadow copy of the constant data in system memory.

When it comes time to render, select the smallest CB from the array that will hold the necessary constant data.

You may have to resubmit redundant data for separate passes.

Hybrid approach?
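The "select the smallest CB that will hold the data" step can be sketched as follows. This is a minimal illustration, not the talk's implementation. The tier bounds (32 to 4096 bytes) and the names `selectTierBytes` and `selectTierIndex` are assumptions, and the constant buffers themselves are left out so the sketch builds without the D3D10 SDK.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tier layout: sizes double from 32 bytes up to 4096 bytes.
constexpr std::size_t kMinTierBytes = 32;
constexpr std::size_t kMaxTierBytes = 4096;

// Returns the size of the smallest tier that can hold `neededBytes`.
inline std::size_t selectTierBytes(std::size_t neededBytes) {
    assert(neededBytes > 0 && neededBytes <= kMaxTierBytes);
    std::size_t tier = kMinTierBytes;
    while (tier < neededBytes) {
        tier *= 2;
    }
    return tier;
}

// Index of that tier in an array ordered 32, 64, 128, ...
inline std::size_t selectTierIndex(std::size_t neededBytes) {
    assert(neededBytes > 0 && neededBytes <= kMaxTierBytes);
    std::size_t index = 0;
    for (std::size_t tier = kMinTierBytes; tier < neededBytes; tier *= 2) {
        ++index;
    }
    return index;
}
```

At render time, an engine would copy the shadow data for the draw into a buffer from the chosen tier. A 72-byte per-pass block, for example, lands in the 128-byte tier, so some padding is still uploaded.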

Constant Updates: Case Study (Skinning using Solution 1)

Skinning in D3D9 (or a bad D3D10 port): multiple passes cause redundant bone data uploads to the GPU.

Skinning in D3D10: using Constant Buffers, we only need to upload the bone data once.

Constant Updates: D3D9 Version / Naïve D3D10 Version

Constant Data
Mesh1: Bone0, Bone1, Bone2, Bone3, Bone4, ..., BoneN
Mesh2: Bone0, Bone1, Bone2, Bone3, Bone4, ..., BoneN

Pass1
Set Mesh1 Bones
Draw Mesh1
Set Mesh2 Bones
Draw Mesh2

Pass2
Set Mesh1 Bones
Draw Mesh1
Set Mesh2 Bones
Draw Mesh2

Constant Updates: Preferred D3D10 Version

Constant Data
Mesh1 CB: Mesh1 Bone0, Bone1, Bone2, Bone3, Bone4, ..., BoneN
Mesh2 CB: Mesh2 Bone0, Bone1, Bone2, Bone3, Bone4, ..., BoneN

Frame Start
Update Mesh1 CB
Update Mesh2 CB

Pass1
Bind Mesh1 CB
Draw Mesh1
Bind Mesh2 CB
Draw Mesh2

Pass2
Bind Mesh1 CB
Draw Mesh1
Bind Mesh2 CB
Draw Mesh2

CB Slot 0: Mesh1 CB, then Mesh2 CB

Constant Updates: Advanced D3D10 Version

Why not store all of our characters’ bones in a 128-bit FP texture?

We can upload bones for all visible characters at the start of a frame.

We can draw similar characters using instancing instead of individual draws. Use SV_InstanceID to select the start of the character’s bone data in the texture.

Stream the skinned meshes to memory using Stream Output and render all subsequent passes from the post-skinned buffer.

State Management

Individual state setting is no longer possible in D3D10.

State in D3D10 is stored in state objects.

These state objects are immutable.

Changing even one aspect of a state object requires creating an entirely new state object with that one change.

State Management: Managing State Objects

Solution 1 (Fastest)

If you have a known set of materials and required states, you can create all state objects at load time.

State objects are small and there is a finite set of permutations.

With all state objects created at load time, all that needs to be done during rendering is to bind the object.

State Management: Managing State Objects

Solution 2 (Second Best)

Use this if your content is not finalized, or if you CANNOT get your engine to lump state together.

Create a state object hash table.

Hash off of the setting that has the most unique states.

Grab pre-created states from the hash table.

Why not give your tools pipeline the ability to do this for a level and save out the results?
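A state object hash table along these lines might look like the sketch below. Everything here is hypothetical: `BlendDesc` and `BlendState` are simplified stand-ins rather than the real D3D10 types, and the hash covers the whole description for simplicity instead of only the setting with the most unique states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <unordered_map>

// Hypothetical, simplified stand-in for a blend state description.
struct BlendDesc {
    bool blendEnable;
    std::uint8_t srcBlend;
    std::uint8_t destBlend;
    std::uint8_t writeMask;

    bool operator==(const BlendDesc& o) const {
        return blendEnable == o.blendEnable && srcBlend == o.srcBlend &&
               destBlend == o.destBlend && writeMask == o.writeMask;
    }
};

struct BlendDescHash {
    std::size_t operator()(const BlendDesc& d) const {
        // Pack all fields into one 32-bit key and hash that.
        const std::uint32_t packed =
            (d.blendEnable ? 1u : 0u) |
            (static_cast<std::uint32_t>(d.srcBlend) << 8) |
            (static_cast<std::uint32_t>(d.destBlend) << 16) |
            (static_cast<std::uint32_t>(d.writeMask) << 24);
        return std::hash<std::uint32_t>()(packed);
    }
};

// Stand-in for an immutable state object created by the device.
struct BlendState {
    BlendDesc desc;
};

class BlendStateCache {
public:
    // Returns the cached object for `desc`, creating it on first use.
    const BlendState* get(const BlendDesc& desc) {
        auto it = cache_.find(desc);
        if (it == cache_.end()) {
            // A real engine would ask the device to create the state object here.
            std::unique_ptr<BlendState> state(new BlendState{desc});
            it = cache_.emplace(desc, std::move(state)).first;
            ++created_;
        }
        return it->second.get();
    }

    std::size_t createdCount() const { return created_; }

private:
    std::unordered_map<BlendDesc, std::unique_ptr<BlendState>, BlendDescHash> cache_;
    std::size_t created_ = 0;
};
```

Two draws that request identical settings get the same pointer back, so the renderer only has to bind it. The set of descriptions seen while playing a level could also be written out by the tools pipeline and used to pre-populate the cache at load time.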

Shader Linkage

D3D9 shader linkage was based on semantics (POSITION, NORMAL, TEXCOORDN).

D3D10 linkage is based on offsets and sizes.

This means stricter linkage rules.

This also means that the driver doesn’t have to link shaders together at every draw call!

Shader Linkage: No Holes Allowed!

Elements must be read in the order they are output from the previous stage.

Cannot have “holes” between linkages.

All three examples use the same vertex shader output:

struct VS_OUTPUT
{
    float3 Norm : NORMAL;
    float2 Tex : TEXCOORD0;
    float2 Tex2 : TEXCOORD1;
    float4 Pos : SV_POSITION;
};

Example 1 (Tex is read before Norm, out of order):

struct PS_INPUT
{
    float2 Tex : TEXCOORD0;
    float3 Norm : NORMAL;
    float2 Tex2 : TEXCOORD1;
};

Example 2 (TEXCOORD0 is skipped, leaving a hole):

struct PS_INPUT
{
    float3 Norm : NORMAL;
    float2 Tex2 : TEXCOORD1;
};

Example 3:

struct PS_INPUT
{
    float3 Norm : NORMAL;
    float3 Tex : TEXCOORD0;
    float2 Tex2 : TEXCOORD1;
};

Holes at the end are OK
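The two rules on this slide (read elements in output order, and allow holes only at the end) can be checked mechanically. The sketch below models a signature as an ordered list of elements and accepts an input only if it is a prefix of the previous stage's output. `Element` and `linksWithoutHoles` are hypothetical names. Requiring equal component counts is a simplifying assumption of this sketch, not something the slide states.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One element of a shader signature: semantic name plus
// the number of 32-bit components (float2 = 2, float3 = 3, ...).
struct Element {
    std::string semantic;
    int components;
};

// True if `input` reads `output` in order with no holes.
// Trailing outputs may be left unread ("holes at the end are OK").
inline bool linksWithoutHoles(const std::vector<Element>& output,
                              const std::vector<Element>& input) {
    if (input.size() > output.size()) {
        return false;
    }
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (input[i].semantic != output[i].semantic) {
            return false;  // reordered element or a hole in the middle
        }
        if (input[i].components != output[i].components) {
            return false;  // size mismatch (assumption of this sketch)
        }
    }
    return true;
}
```

Against the VS_OUTPUT above, an input of NORMAL, TEXCOORD0, TEXCOORD1 (with matching sizes) passes because only the trailing SV_POSITION is left unread. The reordered and skipped-element inputs from the first two examples are rejected.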

Shader Linkage: Input Assembler to Vertex Shader

Input Layouts define the signature of the vertex stream data.

Input Layouts are similar to Vertex Declarations in D3D9. Strict linkage rules are a big difference.

Creating Input Layouts on the fly is not recommended.

CreateInputLayout requires a shader signature to validate against.

Shader Linkage: Input Assembler to Vertex Shader

Solution 1 (Fastest)

Create an Input Layout for each unique Vertex Stream / Vertex Shader combination up front.

Input Layouts are small.

This assumes that the shader input signature is available when you call CreateInputLayout.

Try to normalize Input Layouts across the level, or have them be art directed.

Shader Linkage: Input Assembler to Vertex Shader

Solution 2 (Second Best)

If you load meshes and create input layouts before loading shaders, you might have a problem.

You can use a hashing scheme similar to the one used for State Objects.

When the Input Layout is needed, search the hash table for an Input Layout that matches the Vertex Stream and Vertex Shader signature.

Why not store this data to a file and pre-populate the Input Layouts after your content is tuned?

Shader Linkage (Aside): Instancing

Instancing is a first-class citizen in D3D10!

Stream source frequency is now part of the Input Layout.

Multiple frequencies will mean multiple Input Layouts.

Resource Updates

Updating resources is different in D3D10.

The Create / Lock / Fill / Unlock paradigm is no longer necessary (although you can still do it).

Texture data can be passed into the texture at create time.

Resource Updates: Resource Usage Types

D3D10_USAGE_DEFAULT
D3D10_USAGE_IMMUTABLE
D3D10_USAGE_DYNAMIC
D3D10_USAGE_STAGING

Resource Updates: D3D10_USAGE_DEFAULT

Use for resources that need fast GPU read and write access.

Can only be updated using UpdateSubresource.

Render targets are good candidates.

Textures that are updated infrequently (less than once per frame) are good candidates.

Resource Updates: D3D10_USAGE_IMMUTABLE

Use for resources that need fast GPU read access only.

Once they are created, they cannot be updated... ever.

Initial data must be passed in during the creation call.

Resources that will never change (static textures, VBs / IBs) are good candidates.

Don’t bend over backwards trying to make everything D3D10_USAGE_IMMUTABLE.

Resource Updates: D3D10_USAGE_DYNAMIC

Use for resources that need fast CPU write access (at the expense of slower GPU read access).

No CPU read access.

Can only be updated using Map with:
D3D10_MAP_WRITE_DISCARD
D3D10_MAP_WRITE_NO_OVERWRITE

Dynamic Vertex Buffers are good candidates.

Dynamic textures (updated more than once per frame) are good candidates.

Resource Updates: D3D10_USAGE_STAGING

This is the only way to read data back from the GPU.

Can only be updated using Map.

Cannot map with D3D10_MAP_WRITE_DISCARD or D3D10_MAP_WRITE_NO_OVERWRITE.

Might want to double buffer to keep from stalling the GPU.

The GPU cannot directly use these.

Resource Updates: Summary

CPU updates the resource frequently (more than once per frame)
Use D3D10_USAGE_DYNAMIC

CPU updates the resource infrequently (once per frame or less)
Use D3D10_USAGE_DEFAULT

CPU doesn’t update the resource
Use D3D10_USAGE_IMMUTABLE

CPU needs to read the resource
Use D3D10_USAGE_STAGING
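The summary above amounts to a small decision function, sketched here. The `Usage` enum is a hypothetical mirror of the four D3D10_USAGE values so the sketch builds without the SDK, and `CpuAccess` and `chooseUsage` are illustrative names only.

```cpp
#include <cassert>

// Hypothetical mirror of the four D3D10_USAGE values.
enum class Usage { Default, Immutable, Dynamic, Staging };

// How the CPU touches the resource after creation.
struct CpuAccess {
    bool cpuReads;          // CPU needs to read the contents back
    double writesPerFrame;  // average CPU updates per frame (0 = never)
};

// Applies the rules from the summary slide, in priority order.
inline Usage chooseUsage(const CpuAccess& access) {
    assert(access.writesPerFrame >= 0.0);
    if (access.cpuReads) {
        return Usage::Staging;    // only staging resources can be read back
    }
    if (access.writesPerFrame == 0.0) {
        return Usage::Immutable;  // never updated after creation
    }
    if (access.writesPerFrame > 1.0) {
        return Usage::Dynamic;    // updated more than once per frame
    }
    return Usage::Default;        // updated once per frame or less
}
```

Constant buffers are the exception the talk calls out separately: for those, the choice is driven by which update path moves the least memory, not by this table.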

Resource Updates: Example (Vertex Buffer)

The vertex buffer is touched by the CPU less than once per frame
Create it with D3D10_USAGE_DEFAULT
Update it with UpdateSubresource

The vertex buffer is used for dynamic geometry and the CPU needs to update it multiple times per frame
Create it with D3D10_USAGE_DYNAMIC
Update it with Map

Resource Updates: The Exception (Constant Buffers)

CBs are always expected to be updated frequently.

Select CB usage based upon which one causes the least amount of system memory to be transferred. This covers not just transfers to the GPU, but system-to-system memory copies as well.

Resource Updates: UpdateSubresource

UpdateSubresource requires a system memory buffer and incurs an extra copy.

Use it if you have system copies of your constant data already in one place.

Resource Updates: Map

Map requires no extra system memory but may hit driver renaming limits if abused.

Use it if compositing values on the fly or collecting values from other places.

Resource Updates: A note on overusing discard

Use D3D10_MAP_WRITE_DISCARD carefully with buffers!

D3D10_MAP_WRITE_DISCARD tells the driver to give us a new memory buffer if the current one is busy.

There is a LIMITED set of temporary buffers.

If these run out, then your app will stall until another buffer can be freed.

This can happen if you do dynamic geometry using one VB and D3D10_MAP_WRITE_DISCARD.

Dynamic Geometry

DrawIndexedPrimitiveUP is gone!

DrawPrimitiveUP is gone!

Your well-behaved D3D9 app isn’t using these anyway, right?

Dynamic Geometry: Solution (Same as in D3D9)

Use one large buffer, and map it with D3D10_MAP_WRITE_NO_OVERWRITE.

Advance the write position with every draw, and wrap to the beginning when you reach the end.

Make sure your buffer is large enough that you’re not overwriting data that the GPU is reading.

This is what happens under the covers for D3D9 when using DIPUP or DUP in Windows Vista.
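The write-position bookkeeping described above can be sketched independently of the API. `DynamicBufferCursor` and `Reservation` are hypothetical names. The actual Map call with D3D10_MAP_WRITE_NO_OVERWRITE and the copy into the mapped pointer are omitted so the sketch stays self-contained.

```cpp
#include <cassert>
#include <cstddef>

// Result of reserving space in the shared dynamic buffer.
struct Reservation {
    std::size_t offset;  // byte offset to add to the mapped pointer
    bool wrapped;        // true if the write position returned to the start
};

class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacityBytes)
        : capacity_(capacityBytes) {}

    // Reserves `bytes` for the next draw and advances the write position.
    Reservation reserve(std::size_t bytes) {
        assert(bytes > 0 && bytes <= capacity_);
        bool wrapped = false;
        if (writePos_ + bytes > capacity_) {
            // Not enough room at the end: wrap to the beginning.
            writePos_ = 0;
            wrapped = true;
        }
        const Reservation result{writePos_, wrapped};
        writePos_ += bytes;
        return result;
    }

private:
    std::size_t capacity_;
    std::size_t writePos_ = 0;
};
```

Each draw would then use `offset` as its starting position in the buffer. The `wrapped` flag marks the point where the slide's warning applies: the buffer has to be large enough that the GPU has finished reading the start by the time the cursor comes back around. Mapping with D3D10_MAP_WRITE_DISCARD on wrap is one option an engine might choose there, though the slide does not prescribe it.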

Porting Tips

StretchRect is gone
Work around using render-to-texture.

A8R8G8B8 formats have been replaced with R8G8B8A8 formats
Swizzle on texture load or swizzle in the shader.

Fixed Function AlphaTest is gone
Add logic to the shader and call discard.

Fixed Function Fog is gone
Add it to the shader.

Porting Tips: Continued

User Clip Planes usage has changed
They’ve moved to the shader.
Experiment with the SV_ClipDistance semantic vs discard in the PS to determine which is faster for your shader.

Query data sizes might have changed
Occlusion queries are UINT64 vs DWORD.

No Triangle Fan support
Work around in the content pipeline or on load.

SetCursorProperties and ShowCursor are gone
Use Win32 APIs to handle cursors now.

Porting Tips: Continued

No offsets on Map calls
This was basically API clutter in D3D9.
Calculate the offset from the returned pointer.

Clears are no longer bound to pipeline state
If you want a clear call to respect scissor, stencil, or other state, draw a full-screen quad.
This is closer to the HW.
The driver/HW has been doing this for you for years.

OMSetBlendState
Never set the SampleMask to 0 in OMSetBlendState.

Porting Tips: Continued

Input Layout conversions tightened up
D3DDECLTYPE_UBYTE4 in the vertex stream could be converted to a float4 in the VS in D3D9, i.e. 255u in the stream would show up as 255.0 in the VS.
In D3D10 you get either a normalized [0..1] value or a 255 (u)int.

Register keyword
It doesn’t mean the same thing in D3D10.
Use register to determine which CB slot a CB binds to.
Use packoffset to place a variable inside a CB.

Porting Tips: Continued

Sampler and Texture bindings
Samplers can be bound independently of textures.
This is very flexible!
Sampler and Texture slots are not always the same.

Register Packing
In D3D9 all variables took up at least one float4 register (even if you only used a single float!).
In D3D10 variables are packed together.
This saves a lot of space.
Make sure your engine doesn’t do everything based upon register offsets, or your variables might alias.
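To make the packing point concrete, here is a rough sketch of how scalars and vectors are laid out when they are packed into 16-byte registers and no variable is allowed to straddle a register boundary. It covers only float-sized scalars and vectors. Matrices, arrays, and structs follow additional rules that are not modeled here. `PackedLayout` and `packVariables` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct PackedLayout {
    std::vector<std::size_t> offsets;  // byte offset of each variable
    std::size_t sizeBytes;             // total size, rounded up to whole registers
};

// componentCounts[i] is the number of 32-bit components (1..4) of variable i,
// e.g. float = 1, float2 = 2, float4 = 4.
inline PackedLayout packVariables(const std::vector<int>& componentCounts) {
    const std::size_t kRegister = 16;  // one float4 register
    PackedLayout layout{{}, 0};
    std::size_t cursor = 0;
    for (int n : componentCounts) {
        assert(n >= 1 && n <= 4);
        const std::size_t bytes = static_cast<std::size_t>(n) * 4;
        // A variable that would cross a register boundary starts a new register.
        if (cursor % kRegister + bytes > kRegister) {
            cursor = (cursor / kRegister + 1) * kRegister;
        }
        layout.offsets.push_back(cursor);
        cursor += bytes;
    }
    layout.sizeBytes = (cursor + kRegister - 1) / kRegister * kRegister;
    return layout;
}
```

A float3 followed by a float shares one register (offsets 0 and 12), whereas under the D3D9 behaviour described above each would take a full register. A float followed by a float4 still needs two registers, because the float4 cannot straddle the boundary. Engine code that assumes one variable per register would compute the wrong offsets in the first case.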

Porting Tips: Continued

D3DSAMP_SRGBTEXTURE
This sampler state setting does not exist in D3D10.
Instead it’s included in the texture format.
This is more like the Xbox 360.

Consider re-optimizing resource usage and upload for better D3D10 performance
But use D3D10_USAGE_DEFAULT resources and UpdateSubresource as a baseline.

Summary

Use the debug runtime!

More draw calls usually means more constant updating and state changing calls.

Be frugal with constant updates. Avoid resubmitting redundant data!

Create as much state and input layout information up front as possible.

Select D3D10_USAGE for resources based upon the CPU access patterns needed.

Use D3D10_MAP_WRITE_NO_OVERWRITE and a big buffer as a replacement for DIPUP and DUP.

Call to Action

Actually exploit D3D10!

This talk tells you how to get performance gains from a straight port.

You can get a whole lot more by using D3D10’s advanced features!

StreamOut to minimize skinning costs
First-class instancing support
Store some vertex data in textures
Move some systems to the GPU (Particles?)
Aggressive use of Constant Buffers

© 2007 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

http://www.xna.com