Windows to Reality: Getting the Most out of Direct3D 10 Graphics in Your Games
Shanon Drone
Software Development Engineer
XNA Developer Connection
Microsoft
Key areas
Debug Layer
Draw Calls
Constant Updates
State Management
Shader Linkage
Resource Updates
Dynamic Geometry
Porting Tips
Debug Layer
Use it!
The D3D10 debug layer can help find performance issues
The app enables it by passing D3D10_CREATE_DEVICE_DEBUG into D3D10CreateDevice
Use the D3DX10 Debug Runtime
Link against D3DX10d.lib
Only do this for debug builds!
Look for performance warnings in the debug output
Draw Calls
Draw calls are still “not free”
Draw overhead is reduced in D3D10
But not enough that you can be lazy
Efficiency in the number of draw calls will still give a performance win
Draw Calls: Excess baggage
An increase in the number of draw calls generally increases the number of API calls associated with those draws
ConstantBuffer updates
Resource changes (VBs, IBs, Textures)
InputLayout changes
These all have effects on performance that vary with draw call count
Constant Updates
Updating shader constants was often a bottleneck in D3D9
It can still be a bottleneck in D3D10
The main difference between the two is the new Constant Buffer object in D3D10
This is the largest section of this talk
Constant Updates: Constant Buffer Recap
Constant Buffers are buffer objects that hold shader constant data
They are updated by calling Map with D3D10_MAP_WRITE_DISCARD or by calling UpdateSubresource
There are 16 Constant Buffer slots available to each shader in the pipeline
Try not to use all 16 to leave some headroom
Constant Updates: Porting Issues
D3D9 constants were updated individually by calling SetXXXXXShaderConstantX
In D3D10, you have to update the entire constant buffer all at once
A naïve port from D3D9 to D3D10 can have crippling performance implications if Constant Buffers are not handled correctly!
Rule of thumb: Do not update more data than you need to
Constant Updates: Naïve Port, AKA how to cripple perf
Each shader uses one big constant buffer
Submitting one value submits them all!
If you have one 4096 byte Constant Buffer, and you only need to update your World matrix, you will still have to update 4096 bytes of data and send it across the bus
Don’t do this!
Constant Updates: Naïve Port, AKA how to cripple perf
Scene: 100 skinned meshes (100 materials), 900 static meshes (400 materials), 1 shadow + 1 lighting pass

cbuffer VSGlobalsCB
{
    matrix ViewProj;
    matrix Bones[100];
    matrix World;
    float SpecPower;
    float4 BDRFCoefficients;
    float AppTime;
    uint2 RenderTargetSize;
};
Size: 6560 Bytes

Shadow Pass
    Update VSGlobalsCB for each skinned mesh: 6560 Bytes x 100 = 656,000 Bytes
    Update VSGlobalsCB for each static mesh: 6560 Bytes x 900 = 5,904,000 Bytes
Light Pass
    Update VSGlobalsCB for each skinned mesh: 6560 Bytes x 100 = 656,000 Bytes
    Update VSGlobalsCB for each static mesh: 6560 Bytes x 900 = 5,904,000 Bytes
Total = 13,120,000 Bytes
Constant Updates: Organize Constants
The first step is to organize constants by frequency of update
One shader will generally be used to draw several objects
Some data in this shader doesn’t need to be set for every draw
For example: Time, ViewProj matrices
Split these out into their own buffers
cbuffer VSGlobalPerFrameCB { float AppTime; };                            (4 Bytes)
cbuffer VSPerSkinnedCB     { matrix Bones[100]; };                        (6400 Bytes)
cbuffer VSPerStaticCB      { matrix World; };                             (64 Bytes)
cbuffer VSPerPassCB        { matrix ViewProj; uint2 RenderTargetSize; };  (72 Bytes)
cbuffer VSPerMaterialCB    { float SpecPower; float4 BDRFCoefficients; }; (20 Bytes)

Updates over one frame (Begin Frame, Shadow Pass, Light Pass):
    Update VSGlobalPerFrameCB: 4 Bytes x 1 = 4 Bytes
    Update VSPerSkinnedCBs: 6400 Bytes x 100 = 640,000 Bytes
    Update VSPerStaticCBs: 64 Bytes x 900 = 57,600 Bytes
    Update VSPerPassCB (Shadow Pass): 72 Bytes x 1 = 72 Bytes
    Update VSPerPassCB (Light Pass): 72 Bytes x 1 = 72 Bytes
    Update VSPerMaterialCBs: 20 Bytes x 500 = 10,000 Bytes
Total = 707,748 Bytes
Constant Updates
13,120,000 Bytes / 707,748 Bytes = more than 18x less data
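The arithmetic on the preceding slides can be checked with a short C++ sketch. The byte sizes and object counts come from the slides; the constant and function names are made up for illustration, and the sizes are the slides' unpadded totals rather than anything the HLSL compiler was asked to report.

```cpp
#include <cstdint>

// Counts taken from the slides; the names are illustrative only.
constexpr std::uint64_t kSkinnedMeshes = 100;
constexpr std::uint64_t kStaticMeshes  = 900;
constexpr std::uint64_t kMaterials     = 500;  // 100 skinned + 400 static
constexpr std::uint64_t kPasses        = 2;    // shadow + lighting

// Naive port: one 6560-byte buffer is re-sent for every mesh in every pass.
constexpr std::uint64_t naiveBytesPerFrame() {
    return 6560 * (kSkinnedMeshes + kStaticMeshes) * kPasses;
}

// Organized: each buffer is sent only as often as its contents change.
constexpr std::uint64_t organizedBytesPerFrame() {
    return 4    * std::uint64_t(1)  // VSGlobalPerFrameCB, once per frame
         + 6400 * kSkinnedMeshes    // VSPerSkinnedCB, once per skinned mesh
         + 64   * kStaticMeshes     // VSPerStaticCB, once per static mesh
         + 20   * kMaterials        // VSPerMaterialCB, once per material
         + 72   * kPasses;          // VSPerPassCB, once per pass
}

static_assert(naiveBytesPerFrame() == 13120000, "matches the slide total");
static_assert(organizedBytesPerFrame() == 707748, "matches the slide total");
```

Almost all of the saving comes from uploading the bone arrays once per frame instead of once per pass, and from not re-sending them with every static mesh.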
Constant Updates: Managing Buffers
Constant buffers need to be managed in the application
Creating a few buffers that are used for all shader constants just won’t work
We update more data than necessary due to large buffers
Constant Updates: Managing Buffers
Solution 1 (Fastest)
Create Constant Buffers that line up exactly with the number of elements of each frequency group
Global CBs
CBs per Mesh
CBs per Material
CBs per Pass
This ensures that EVERY constant buffer is no larger than it absolutely needs to be
This also ensures the most efficient update of CBs based upon frequency
Constant Updates: Managing Buffers
Solution 2 (Second Best)
If you cannot create CBs that line up exactly with elements, you can create a tiered constant buffer system
Create arrays of 32-byte, 64-byte, 128-byte, 256-byte, etc. constant buffers
Keep a shadow copy of the constant data in system memory
When it comes time to render, select the smallest CB from the array that will hold the necessary constant data
May have to resubmit redundant data for separate passes
Hybrid approach?
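The "select the smallest CB that will hold the data" step of the tiered scheme might look roughly like the following. This is only a sketch: the tier list extends the 32/64/128/256 progression from the slide up to an assumed 4096-byte ceiling, and `selectTier` is a hypothetical helper name, not a D3D10 call.

```cpp
#include <cstddef>

// Assumed tier sizes in bytes, continuing the progression named on the slide.
constexpr std::size_t kTierSizes[] = {32, 64, 128, 256, 512, 1024, 2048, 4096};
constexpr std::size_t kTierCount = sizeof(kTierSizes) / sizeof(kTierSizes[0]);

// Returns the index of the smallest tier that can hold `bytes`,
// or kTierCount when the data does not fit in any tier.
constexpr std::size_t selectTier(std::size_t bytes) {
    std::size_t i = 0;
    while (i < kTierCount && kTierSizes[i] < bytes) {
        ++i;
    }
    return i;
}
```

At render time the engine would copy the shadow data from system memory into a buffer from the selected tier and bind that buffer. A 72-byte per-pass block, for example, lands in the 128-byte tier, so some padding is still uploaded, which is why the slide ranks this below Solution 1.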
Constant Updates: Case Study, Skinning using Solution 1
Skinning in D3D9 (or a bad D3D10 port)
Multiple passes cause redundant bone data uploads to the GPU
Skinning in D3D10
Using Constant Buffers we only need to upload it once
Constant Updates: D3D9 Version, or Naïve D3D10 Version
Constant Data
    Mesh1 Bone0, Bone1, Bone2, Bone3, Bone4, …, BoneN
    Mesh2 Bone0, Bone1, Bone2, Bone3, Bone4, …, BoneN
Pass1
    Set Mesh1 Bones
    Draw Mesh1
    Set Mesh2 Bones
    Draw Mesh2
Pass2
    Set Mesh1 Bones
    Draw Mesh1
    Set Mesh2 Bones
    Draw Mesh2
Constant Updates: Preferred D3D10 Version
Constant Data
    Mesh1 CB: Mesh1 Bone0, Bone1, Bone2, Bone3, Bone4, …, BoneN
    Mesh2 CB: Mesh2 Bone0, Bone1, Bone2, Bone3, Bone4, …, BoneN
Frame Start
    Update Mesh1 CB
    Update Mesh2 CB
Pass1
    Bind Mesh1 CB
    Draw Mesh1
    Bind Mesh2 CB
    Draw Mesh2
Pass2
    Bind Mesh1 CB
    Draw Mesh1
    Bind Mesh2 CB
    Draw Mesh2
CB Slot 0: Mesh1 CB, then Mesh2 CB
Constant Updates: Advanced D3D10 Version
Why not store all of our characters’ bones in a 128-bit FP texture?
We can upload bones for all visible characters at the start of a frame
We can draw similar characters using instancing instead of individual draws
Use SV_InstanceID to select the start of the character’s bone data in the texture
Stream the skinned meshes to memory using Stream Output and render all subsequent passes from the post-skinned buffer
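As a rough illustration of how an instance ID could select the start of a character's bone data, the sketch below computes texel addresses on the CPU side. The layout is entirely an assumption made for this example: each bone is a 4x4 matrix stored as four consecutive 128-bit texels, every character has the same bone count, and characters are packed back to back in row-major texel order. The slide does not specify any of this, and all names are hypothetical.

```cpp
#include <cstdint>

struct TexelCoord {
    std::uint32_t x;
    std::uint32_t y;
};

// Assumed layout: one float4 matrix row per 128-bit texel, four per bone.
constexpr std::uint32_t kTexelsPerBone = 4;

// Linear index of the first texel of `bone` for the character drawn as
// instance `instanceId`.
constexpr std::uint32_t boneTexelIndex(std::uint32_t instanceId,
                                       std::uint32_t bonesPerCharacter,
                                       std::uint32_t bone) {
    return (instanceId * bonesPerCharacter + bone) * kTexelsPerBone;
}

// Converts a linear texel index into 2D coordinates for a texture that is
// `textureWidth` texels wide.
constexpr TexelCoord toCoord(std::uint32_t linearIndex,
                             std::uint32_t textureWidth) {
    return TexelCoord{linearIndex % textureWidth, linearIndex / textureWidth};
}
```

In a shader the same arithmetic would run per vertex, with SV_InstanceID supplying `instanceId` and the vertex's bone indices supplying `bone`.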
State Management
Individual state setting is no longer possible in D3D10
State in D3D10 is stored in state objects
These state objects are immutable
To change even one aspect of a state object requires that you create an entirely new state object with that one change
State Management: Managing State Objects
Solution 1 (Fastest)
If you have a known set of materials and required states, you can create all state objects at load time
State objects are small and there is a finite set of permutations
With all state objects created at load time, all that needs to be done during rendering is to bind the object
State Management: Managing State Objects
Solution 2 (Second Best)
If your content is not finalized, or if you CANNOT get your engine to lump state together
Create a state object hash table
Hash off of the setting that has the most unique states
Grab pre-created states from the hash table
Why not give your tools pipeline the ability to do this for a level and save out the results?
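The hash-table idea could be sketched along these lines. To keep the example compilable without the D3D10 headers, `BlendDesc` and `StateObject` are hypothetical stand-ins for a description struct such as D3D10_BLEND_DESC and the immutable state object created from it. The sketch hashes every field, whereas the slide suggests hashing on whichever setting varies the most.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <unordered_map>

// Hypothetical, heavily reduced state description.
struct BlendDesc {
    bool blendEnable;
    std::uint8_t srcBlend;
    std::uint8_t destBlend;
    bool operator==(const BlendDesc& o) const {
        return blendEnable == o.blendEnable && srcBlend == o.srcBlend &&
               destBlend == o.destBlend;
    }
};

struct BlendDescHash {
    std::size_t operator()(const BlendDesc& d) const {
        // Pack the fields into one integer, then hash that integer.
        std::uint32_t key = (d.blendEnable ? 1u : 0u) |
                            (std::uint32_t(d.srcBlend) << 8) |
                            (std::uint32_t(d.destBlend) << 16);
        return std::hash<std::uint32_t>()(key);
    }
};

// Stand-in for an immutable state object.
struct StateObject {
    BlendDesc desc;
};

class StateCache {
public:
    // Returns the cached object for `desc`, creating it on first request.
    const StateObject* get(const BlendDesc& desc) {
        auto it = cache_.find(desc);
        if (it == cache_.end()) {
            std::unique_ptr<StateObject> created(new StateObject{desc});
            it = cache_.emplace(desc, std::move(created)).first;
            ++created_;
        }
        assert(it->second != nullptr);
        return it->second.get();
    }
    std::size_t createdCount() const { return created_; }

private:
    std::unordered_map<BlendDesc, std::unique_ptr<StateObject>, BlendDescHash> cache_;
    std::size_t created_ = 0;
};
```

Requesting the same description twice returns the same pointer, so rendering code only ever binds objects that already exist. Dumping the set of keys after a play-through is one way a tools pipeline could save out the results for pre-creation at load time.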
Shader Linkage
D3D9 shader linkage was based off of semantics (POSITION, NORMAL, TEXCOORDN)
D3D10 linkage is based off of offsets and sizes
This means stricter linkage rules
This also means that the driver doesn’t have to link shaders together at every draw call!
Shader Linkage: No Holes Allowed!
Elements must be read in the order they are output from the previous stage
Cannot have “holes” between linkages

The same vertex shader output is paired with three different pixel shader inputs:

struct VS_OUTPUT
{
    float3 Norm : NORMAL;
    float2 Tex  : TEXCOORD0;
    float2 Tex2 : TEXCOORD1;
    float4 Pos  : SV_POSITION;
};

struct PS_INPUT   // first pairing
{
    float2 Tex  : TEXCOORD0;
    float3 Norm : NORMAL;
    float2 Tex2 : TEXCOORD1;
};

struct PS_INPUT   // second pairing
{
    float3 Norm : NORMAL;
    float2 Tex2 : TEXCOORD1;
};

struct PS_INPUT   // third pairing
{
    float3 Norm : NORMAL;
    float3 Tex  : TEXCOORD0;
    float2 Tex2 : TEXCOORD1;
};

Holes at the end are OK
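The ordering rule stated on this slide can be modelled with a small checker. This is a deliberately simplified sketch: a signature is reduced to its ordered list of semantic names, and types and component counts are ignored, so it says nothing about type mismatches. `Signature` and `linksWithoutHoles` are hypothetical names, not part of any D3D10 API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified model: a signature is just the ordered list of semantics.
using Signature = std::vector<std::string>;

// True when `psInput` reads the elements of `vsOutput` in the same order
// with no gaps. Trailing outputs may go unread ("holes at the end are OK").
bool linksWithoutHoles(const Signature& vsOutput, const Signature& psInput) {
    if (psInput.size() > vsOutput.size()) {
        return false;
    }
    for (std::size_t i = 0; i < psInput.size(); ++i) {
        if (psInput[i] != vsOutput[i]) {
            return false;
        }
    }
    return true;
}
```

Under this model, a pixel shader input of NORMAL, TEXCOORD0, TEXCOORD1 links against the VS_OUTPUT above because only the trailing SV_POSITION is skipped, while swapping the first two elements or dropping TEXCOORD0 from the middle does not.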
Shader Linkage: Input Assembler to Vertex Shader
Input Layouts define the signature of the vertex stream data
Input Layouts are similar to Vertex Declarations in D3D9
Strict linkage rules are a big difference
Creating Input Layouts on the fly is not recommended
CreateInputLayout requires a shader signature to validate against
Shader Linkage: Input Assembler to Vertex Shader
Solution 1 (Fastest)
Create an Input Layout for each unique Vertex Stream / Vertex Shader combination up front
Input Layouts are small
This assumes that the shader input signature is available when you call CreateInputLayout
Try to normalize Input Layouts across a level, or have them be art directed
Shader Linkage: Input Assembler to Vertex Shader
Solution 2 (Second Best)
If you load meshes and create input layouts before loading shaders, you might have a problem
You can use a similar hashing scheme as the one used for State Objects
When the Input Layout is needed, search the hash for an Input Layout that matches the Vertex Stream and Vertex Shader signature
Why not store this data to a file and pre-populate the Input Layouts after your content is tuned?
Shader Linkage: Aside, Instancing
Instancing is a first class citizen on D3D10!
Stream source frequency is now part of the Input Layout
Multiple frequencies will mean multiple Input Layouts
Resource Updates
Updating resources is different in D3D10
Create / Lock / Fill / Unlock paradigm is no longer necessary (although you can still do it)
Texture data can be passed into the texture at create time
Resource Updates: Resource Usage Types
D3D10_USAGE_DEFAULT
D3D10_USAGE_IMMUTABLE
D3D10_USAGE_DYNAMIC
D3D10_USAGE_STAGING
Resource Updates: D3D10_USAGE_DEFAULT
Use for resources that need fast GPU read and write access
Can only be updated using UpdateSubresource
Render targets are good candidates
Textures that are updated infrequently (less than once per frame) are good candidates
Resource Updates: D3D10_USAGE_IMMUTABLE
Use for resources that need fast GPU read access only
Once they are created, they cannot be updated... ever
Initial data must be passed in during the creation call
Resources that will never change (static textures, VBs / IBs) are good candidates
Don’t bend over backwards trying to make everything D3D10_USAGE_IMMUTABLE
Resource Updates: D3D10_USAGE_DYNAMIC
Use for resources that need fast CPU write access (at the expense of slower GPU read access)
No CPU read access
Can only be updated using Map with:
D3D10_MAP_WRITE_DISCARD
D3D10_MAP_WRITE_NO_OVERWRITE
Dynamic Vertex Buffers are good candidates
Dynamic (> once per frame) textures are good candidates
Resource Updates: D3D10_USAGE_STAGING
This is the only way to read data back from the GPU
Can only be updated using Map
Cannot map with D3D10_MAP_WRITE_DISCARD or D3D10_MAP_WRITE_NO_OVERWRITE
Might want to double buffer to keep from stalling GPU
The GPU cannot directly use these
Resource Updates: Summary
CPU updates the resource frequently (more than once per frame): use D3D10_USAGE_DYNAMIC
CPU updates the resource infrequently (once per frame or less): use D3D10_USAGE_DEFAULT
CPU doesn’t update the resource: use D3D10_USAGE_IMMUTABLE
CPU needs to read the resource: use D3D10_USAGE_STAGING
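The summary above amounts to a small decision function. The sketch below mirrors the four usage types with a local `Usage` enum so that it compiles without the D3D10 headers; the enum, `CpuWrites`, and `chooseUsage` are hypothetical names introduced only for this illustration.

```cpp
// Local mirror of the four D3D10_USAGE values named on the slides.
enum class Usage { Default, Immutable, Dynamic, Staging };

// How often the CPU writes to the resource.
enum class CpuWrites { Never, OncePerFrameOrLess, MoreThanOncePerFrame };

// Picks a usage type from the CPU access pattern, following the summary:
// CPU reads force staging; otherwise the write frequency decides.
constexpr Usage chooseUsage(CpuWrites writes, bool cpuReads) {
    return cpuReads                                   ? Usage::Staging
         : writes == CpuWrites::MoreThanOncePerFrame  ? Usage::Dynamic
         : writes == CpuWrites::OncePerFrameOrLess    ? Usage::Default
                                                      : Usage::Immutable;
}
```

The talk treats constant buffers as an exception to this table, so a helper like this would apply to textures and vertex or index buffers rather than to CBs.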
Resource Updates: Example, Vertex Buffer
The vertex buffer is touched by the CPU less than once per frame
Create it with D3D10_USAGE_DEFAULT
Update it with UpdateSubresource
The vertex buffer is used for dynamic geometry and the CPU needs to update it multiple times per frame
Create it with D3D10_USAGE_DYNAMIC
Update it with Map
Resource Updates: The Exception, Constant Buffers
CBs are always expected to be updated frequently
Select CB usage based upon which one causes the least amount of system memory to be transferred
Not just to the GPU, but system-to-system memory copies as well
Resource Updates: UpdateSubresource
UpdateSubresource requires a system memory buffer and incurs an extra copy
Use if you have system copies of your constant data already in one place
Resource Updates: Map
Map requires no extra system memory but may hit driver renaming limits if abused
Use if compositing values on the fly or collecting values from other places
Resource Updates: A note on overusing discard
Use D3D10_MAP_WRITE_DISCARD carefully with buffers!
D3D10_MAP_WRITE_DISCARD tells the driver to give us a new memory buffer if the current one is busy
There are a LIMITED set of temporary buffers
If these run out, then your app will stall until another buffer can be freed
This can happen if you do dynamic geometry using one VB and D3D10_MAP_WRITE_DISCARD
Dynamic Geometry
DrawIndexedPrimitiveUP is gone!
DrawPrimitiveUP is gone!
Your well-behaved D3D9 app isn’t using these anyway, right?
Dynamic Geometry: Solution, Same as in D3D9
Use one large buffer, and map it with D3D10_MAP_WRITE_NO_OVERWRITE
Advance the write position with every draw
Wrap to the beginning
Make sure your buffer is large enough that you’re not overwriting data that the GPU is reading
This is what happens under the covers for D3D9 when using DIPUP or DUP in Windows Vista
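The "advance the write position, wrap to the beginning" bookkeeping might be written as below. Only the offset arithmetic is modelled; the actual Map call and the vertex copy are left out, and `RingAllocator` is a hypothetical helper, not something the talk or D3D10 provides.

```cpp
#include <cassert>
#include <cstddef>

// Tracks the write position inside one large dynamic vertex buffer.
class RingAllocator {
public:
    struct Range {
        std::size_t offset;  // byte offset at which to write
        bool wrapped;        // true when this allocation restarted at 0
    };

    explicit RingAllocator(std::size_t capacity)
        : capacity_(capacity), head_(0) {}

    // Reserves `bytes` and advances the write position, wrapping to the
    // beginning when the remaining space is too small.
    Range allocate(std::size_t bytes) {
        assert(bytes <= capacity_);
        bool wrapped = false;
        if (head_ + bytes > capacity_) {
            head_ = 0;
            wrapped = true;
        }
        Range range{head_, wrapped};
        head_ += bytes;
        return range;
    }

private:
    std::size_t capacity_;
    std::size_t head_;
};
```

Each draw would write its vertices at the returned offset. The `wrapped` flag marks the moment the slide warns about: after a wrap, the region being reused must no longer be in flight on the GPU, which is why the buffer has to be sized generously.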
Porting Tips
StretchRect is Gone
Work around using render-to-texture
A8R8G8B8 formats have been replaced with R8G8B8A8 formats
Swizzle on texture load or swizzle in the shader
Fixed Function AlphaTest is Gone
Add logic to the shader and call discard
Fixed Function Fog is Gone
Add it to the shader
Porting Tips: Continued
User Clip Planes usage has changed
They’ve moved to the shader
Experiment with the SV_ClipDistance semantic vs discard in the PS to determine which is faster for your shader
Query data sizes might have changed
Occlusion queries are UINT64 vs DWORD
No Triangle Fan Support
Work around in content pipeline or on load
SetCursorProperties, ShowCursor are gone
Use Win32 APIs to handle cursors now
Porting Tips: Continued
No offsets on Map calls
This was basically API clutter in D3D9
Calculate the offset from the returned pointer
Clears are no longer bound to pipeline state
If you want a clear call to respect scissor, stencil, or other state, draw a full-screen quad
This is closer to the HW
The Driver/HW has been doing this for you for years
OMSetBlendState
Never set the SampleMask to 0 in OMSetBlendState
Porting Tips: Continued
Input Layout conversions tightened up
D3DDECLTYPE_UBYTE4 in the vertex stream could be converted to a float4 in the VS in D3D9
i.e., 255u in the stream would show up as 255.0 in the VS
In D3D10 you either get a normalized [0..1] value or a 255 (u)int
Register keyword
It doesn’t mean the same thing in D3D10
Use register to determine which CB slot a CB binds to
Use packoffset to place a variable inside a CB
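For the Input Layout conversion point on this slide, the two D3D10 outcomes for a byte in the vertex stream can be written out as plain arithmetic. The function names are hypothetical; the sketch assumes the byte is declared either with a normalized (UNORM) element format or with an unsigned integer (UINT) element format.

```cpp
#include <cstdint>

// Value seen by the vertex shader when the byte is declared as normalized:
// the range 0..255 maps onto 0.0..1.0.
constexpr float asUnorm(std::uint8_t v) {
    return static_cast<float>(v) / 255.0f;
}

// Value seen when the same byte is declared as an unsigned integer.
constexpr std::uint32_t asUint(std::uint8_t v) {
    return v;
}

// The D3D9-style 255.0 is no longer produced by the input assembler; it
// can be recovered by converting the integer to float in the shader.
constexpr float d3d9StyleFloat(std::uint8_t v) {
    return static_cast<float>(asUint(v));
}
```

Shaders ported from D3D9 that relied on receiving 255.0 therefore need either an explicit conversion like the last function or a multiply by 255 on the normalized value.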
Porting Tips: Continued
Sampler and Texture bindings
Samplers can be bound independently of textures
This is very flexible!
Sampler and Texture slots are not always the same
Register Packing
In D3D9 all variables took up at least one float4 register (even if you only used a single float!)
In D3D10 variables are packed together
This saves a lot of space
Make sure your engine doesn’t do everything based upon register offsets or your variables might alias
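The difference between the two layouts can be seen by computing byte offsets for a few variables. This sketch covers only float scalars and vectors of one to four components, and relies on the HLSL packing rule that such a variable may share a 16-byte register with its neighbours but may not straddle a register boundary. Matrices, arrays, and structs follow additional rules that are not modelled, and the function names are hypothetical.

```cpp
#include <cstddef>

// D3D9-style: every variable starts its own 16-byte float4 register.
constexpr std::size_t d3d9Offset(std::size_t variableIndex) {
    return variableIndex * 16;
}

// D3D10-style: returns the byte offset at which a variable of `floats`
// components (1 to 4) is placed when the running offset is `cursor`.
// The variable moves to the next register if it would cross a boundary.
constexpr std::size_t d3d10Place(std::size_t cursor, std::size_t floats) {
    return (cursor % 16 + floats * 4 > 16) ? (cursor / 16 + 1) * 16 : cursor;
}
```

For the sequence `float a; float3 b; float2 c; float4 d;`, the D3D9-style offsets are 0, 16, 32, 48, while the packed offsets are 0, 4, 16, 32. An engine that writes `b` at offset 16 because that is where register 1 used to start would overwrite `c` instead, which is the aliasing the slide warns about.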
Porting Tips: Continued
D3DSAMP_SRGBTEXTURE
This sampler state setting does not exist on D3D10
Instead it’s included in the texture format
This is more like the Xbox 360
Consider re-optimizing resource usage and upload for better D3D10 performance
But use D3D10_USAGE_DEFAULT resources and UpdateSubresource as a baseline
Summary
Use the debug runtime!
More draw calls usually means more constant updating and state changing calls
Be frugal with constant updates
Avoid resubmitting redundant data!
Create as much state and input layout information up front as possible
Select D3D10_USAGE for resources based upon the CPU access patterns needed
Use D3D10_MAP_WRITE_NO_OVERWRITE and a big buffer as a replacement for DIPUP and DUP
Call to Action
Actually exploit D3D10!
This talk tells you how to get performance gains from a straight port
You can get a whole lot more by using D3D10’s advanced features!
StreamOut to minimize skinning costs
First class instancing support
Store some vertex data in textures
Move some systems to the GPU (Particles?)
Aggressive use of Constant Buffers
© 2007 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
http://www.xna.com