DIRECTX CONSTANTS OPTIMIZATIONS FOR INTEL®
INTEGRATED GRAPHICS
Katen Shah
Luis Gimenez
Arzhange Safdarzadeh
December 2008
Intel® Corporation
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL® ASSUMES NO LIABILITY WHATSOEVER, AND INTEL® DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL® PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel® products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications.
Intel® may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel® reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.
This white paper, as well as the software described in it, is furnished under license and may only be used or copied in accordance with the terms of the license. The information in this document is furnished for informational use only, is subject to change without notice, and should not be construed as a commitment by Intel® Corporation. Intel® Corporation assumes no responsibility or liability for any errors or inaccuracies that may appear in this document or any software that may be provided in association with this document.
Intel® processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See www.Intel.com/products/processor_number for details.
The Intel® processor/chipset families may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Recipient is not obligated to provide Intel with comments or suggestions regarding this document. However,
should Recipient provide Intel with comments or suggestions for the modification, correction, improvement
or enhancement of: (a) this document; or (b) Intel products which may embody this document, Recipient
grants to Intel a non-exclusive, irrevocable, worldwide, royalty-free license, with the right to sublicense Intel’s
licensees and customers, under Recipient intellectual property rights, to use and disclose such comments and suggestions in any manner Intel chooses and to display, perform, copy, make, have made, use, sell, and
otherwise dispose of Intel's and its sublicensee’s products embodying such comments and suggestions in any
manner and via any media Intel chooses, without reference to the source.
Copies of documents that have an order number and are referenced in this document, or other Intel® literature, may be obtained by calling 1-800-548-4725 or by visiting Intel's Web Site.
Intel® and the Intel® Logo are trademarks of Intel® Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2008, Intel® Corporation. All rights reserved
TABLE OF CONTENTS
Purpose
Introduction
Constants in DirectX 9 and DirectX 10
Tips for Direct3D 9 Constants in IIG
D3D10 Constants Management
Multiple cbuffers Performance Impact
Example 1: Ocean Fog D3D10 Demo
Example 2: Skinning10
Note 2: These measurements were taken on an HP Pavilion with Mobile Intel® 4 Series Express Chipset Family
Optimizing DirectX 10
Summary
Web Site and Engineering Support
References
Appendix
Tables
PURPOSE
The goal of this paper is to describe the behavior of constants in DirectX 9 and DirectX 10 to help developers optimize the performance of their applications on Intel Integrated Graphics.
Intel Integrated Graphics refers to the Intel® Graphics Media Accelerator used in the Intel® 4 Series Chipsets (the Intel® 4500, X4500, and X4500HD GMAs). These chipsets are used in desktop G41, G43, and G45 and mobile GM45 and GM47 systems. In general, the core of the graphics media accelerators is broken into generations; these generations are known as Gen3, Gen4, Gen5, etc., and are commonly referred to as “GenX”. Each year, new integrated graphics cores provide more capabilities and better performance. Intel Integrated Graphics has the largest market share for new PC shipments (source: Mercury Research, Q1/09). Therefore, it makes sense to target your 3D applications at this market segment and optimize the experience for the largest number of people.
INTRODUCTION
Constants are external variables passed as parameters to the shaders; their values remain “constant” during each invocation of the shader program. Despite their name, constants are among the most frequently changing values in a DirectX application. A shader program can initialize a constant variable statically, to a value in the shader file, or at runtime through the application. Most of the recommendations described here are not completely new and may have been described elsewhere. However, they are still very much applicable to integrated graphics, and they are presented here in a cohesive manner. Finally, care needs to be taken when porting from DirectX 9 to DirectX 10 to maintain performance.
CONSTANTS IN DIRECTX 9 AND DIRECTX 10
In DirectX 9, constant data is specified in constant registers, while in DirectX 10 external variables residing in constant buffers are passed as parameters to the shader program. Depending on their use and declaration in the shader program, constants can be immediate, immediate indexed, or dynamic indexed. Table 1 shows code samples with examples of each case.
TABLE 1 IMMEDIATE, IMMEDIATE INDEXED, DYNAMIC INDEXED CONSTANTS
DirectX 9
// Generated by Microsoft (R) HLSL Shader Compiler 9.24.950.2656
// Parameters:
//
// float4x4 mViewProj;
// float4x3 mWorldMatrixArray[26];
//
// Registers:
//
// Name Reg Size
// ----------------- ----- ----
// mWorldMatrixArray c0 78
// mViewProj c78 4
vs_2_0
mova a0.w, r1.x
dp3 r2.x, v3, c0[a0.w] // dynamic indexed constants
mova a0.w, r1.x
dp3 r2.y, v3, c1[a0.w]
mova a0.w, r1.x
dp3 r2.z, v3, c2[a0.w]
…..
dp4 r5.x, v0, c0[a0.w]
mova a0.w, r1.y
dp4 oPos.x, r2, c78 // immediate constants
dp4 oPos.y, r2, c79
dp4 oPos.z, r2, c80
dp4 oPos.w, r2, c81
DirectX 10
VertexShader = asm {
// Generated by Microsoft (R) HLSL Shader Compiler 9.24.949.2307
// Buffer Definitions:
// cbuffer cbAnimMrtx
// {
// float4x4 g_mConstBoneWrld[255]; // Offset: 0 Size: 16320
// }
//
// cbuffer cbDynamic
// {
// float4x4 g_mWorld; // Offset: 32 Size: 64
// float4x4 g_mWorldViewProjection; // Offset: 96 Size: 64
// }
. . . . . .
vs_4_0
dcl_constantbuffer cb0[1020], dynamicIndexed // dynamic indexed
dcl_constantbuffer cb1[8], immediateIndexed // immediate indexed
dp4 o0.y, v0.xyzw, cb1[7].xyzw
dp4 o0.z, v0.xyzw, cb1[8].xyzw
dp4 o0.w, v0.xyzw, cb1[9].xyzw
...........
dp4 r2.x, r0.xyzw, cb0[r1.y + 0].xyzw // dynamic indexed constants
dp4 r2.y, r0.xyzw, cb0[r1.y + 1].xyzw
dp4 r2.z, r0.xyzw, cb0[r1.y + 2].xyzw
DirectX 10 also supports immediate constants as a source operand per instruction, using a 32-bit immediate scalar or a 32-bit immediate 4-component vector. This is equivalent to the def instruction used in Direct3D 9 to define immediate constants. The immediate constant values are used during the life of the shader. They occur as a result of literal values used in HLSL code. See the example in Table 2.
TABLE 2 DIRECTX 10 IMMEDIATE CONSTANT
BasicHLSL vertex shader code:
…
// Animation the vertex based on time and the vertex's object space position
if (bAnimate)
vAnimatedPos += float4(vNormal, 0) * (sin(g_fTime + 5.5) + 0.5) * 5;
…
D3D9 code:
// float g_fTime;
// g_fTime c12 1
vs_2_0
def c13, 5.5, 0.159154937, 0.5, 5
def c14, 6.28318548, -3.14159274, 0, 1
def c15, -1.55009923e-006, -2.170389e-005, 0.0026041667, 0.00026041668
def c16, -0.020833334, -0.125, 1, 0.5
...
// r0.x = 5.5
mov r0.x, c13.x
// g_fTime + 5.5
add r0.x, r0.x, c12.x
...
D3D10 code:
…
// cbuffer $Globals
// {
// …
// float g_fTime; // Offset: 160 Size: 4
// …
// }
vs_4_0
dcl_constantbuffer cb0[19], immediateIndexed
…
// g_fTime + 5.5
add r0.x, cb0[10].x, l(5.500000)
// sin(g_fTime + 5.5)
sincos r0.x, null, r0.x
// sin(g_fTime + 5.5) + 0.5
add r0.x, r0.x, l(0.500000)
...
TIPS FOR DIRECT3D 9 CONSTANTS IN IIG
DirectX 9 auto-allocates shader constants, assigning each a set of float4 registers. Table 3 outlines the type, size and use of the registers that DirectX 9 requires Shader Model 3 implementations to support.
TABLE 3 DIRECTX 9 SHADER MODEL 3
Register  Name                       Count           R/W  # Read ports  # Reads/inst  Size  Rel Addr *       Defaults
c#        Constant Float Register    VS 256, PS 224  R    1             Unlimited     4     VS a0/aL, PS No  (0, 0, 0, 0)
a0        Address Register           1               R/W  1             Unlimited     4     No               None
b#        Constant Boolean Register  16              R    1             1             1     No               FALSE
i#        Constant Integer Register  16              R    1             1             4     No               (0, 0, 0, 0)
aL        Loop Counter Register      1               R    1             Unlimited     1     No               None
* Only the vertex shader allows relative addressing, and only floating-point constant registers can be indexed.
In DirectX 9, local constants always take precedence over global constants, and the scope of local constants is restricted to the shader they are defined in. As mentioned in the previous section, the shader program can define a constant as immediate (stored in a constant register) or as a constant array (stored as indexed constants). Constant arrays can be indexed with either an immediate index (such as int i=0) or a dynamic index. The Intel Integrated Graphics driver treats immediate constants and immediate indexed constants the same way; for example, C78[0] is C78 and C78[1] is C79.
Dynamically indexed constants are only available in vertex shaders. Dynamic indexing allows the shader to access a register based on the value stored in the address register a0 or the loop register aL.
The Intel Integrated Graphics driver optimizes access to the most frequently used immediate constants by loading them into a hardware constant buffer. This buffer is the most efficient way to access shader constants on Intel Integrated Graphics. In addition, local variables perform better than global variables because the driver is able to optimize them and convert them into immediate values.
TABLE 4 SAMPLE CODE FROM MICROSOFT DIRECT3D SDK SKINNED MESH
HLSL
// Skinned Mesh Effect file
// Copyright (c) 2000-2002 Microsoft Corporation. All rights reserved.
//
float4 lhtDir = {0.0f, 0.0f, -1.0f, 1.0f}; // light direction
……….
float4 MaterialAmbient : MATERIALAMBIENT = {0.1f, 0.1f, 0.1f, 1.0f};
// Matrix Palette
static const int MAX_MATRICES = 26;
float4x3 mWorldMatrixArray[MAX_MATRICES] : WORLDMATRIXARRAY;
float4x4 mViewProj : VIEWPROJECTION; // immediate constants
///////////////////////////////////////////////////////
VS_OUTPUT VShade(VS_INPUT i, uniform int NumBones)
{
VS_OUTPUT o;
float3 Pos = 0.0f;
float3 Normal = 0.0f;
float LastWeight = 0.0f;
……………………………………….
// calculate the pos/normal using the "normal" weights
// and accumulate the weights to calculate the last weight
for (int iBone = 0; iBone < NumBones-1; iBone++)
{
LastWeight = LastWeight + BlendWeightsArray[iBone];
Pos += mul(i.Pos, mWorldMatrixArray[IndexArray[iBone]])
* BlendWeightsArray[iBone];
Normal += mul(i.Normal,
mWorldMatrixArray[IndexArray[iBone]]) *
BlendWeightsArray[iBone];
}
………………………………
ASSEMBLY
// Generated by Microsoft (R) HLSL Shader Compiler 9.24.950.2656
//
// Parameters:
//…………….
// float4 MaterialAmbient;
// float4 lhtDir;
// float4x4 mViewProj;
// float4x3 mWorldMatrixArray[26];
//
// Registers:
// Name Reg Size
// ----------------- ----- ----
// mWorldMatrixArray c0 78
// mViewProj c78 4
// lhtDir c82 1
// MaterialAmbient c83 1
// ………
vs_2_0
…….
mova a0.w, r1.x
dp3 r2.x, v3, c0[a0.w] // dynamic indexed constants
mova a0.w, r1.x
dp3 r2.y, v3, c1[a0.w]
mova a0.w, r1.x
dp3 r2.z, v3, c2[a0.w]
mul r0.xyz, r2, v1.x
………
dp4 oPos.x, r2, c78 // immediate constants
dp4 oPos.y, r2, c79
dp4 oPos.z, r2, c80
dp4 oPos.w, r2, c81
……..
In Table 4, mWorldMatrixArray is dynamically indexed (HLSL: mWorldMatrixArray[IndexArray[iBone]]), with its values initialized by the application at runtime. The dynamic index is referenced in the assembly through the address register (a0). The IIG driver optimizes the use of the immediate constants C78-C83 by pushing them into the hardware constant buffer. Since the constants Cx[a0] are dynamically indexed, the driver does not include them in the optimization algorithm. Constants that have static values are compiled into the shader as immediate values, as shown in Table 5. Such constants should be declared as static const, because doing so improves shader performance.
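The register footprint listed for the skinned mesh sample can be re-derived with a little arithmetic. The C++ sketch below is only an illustration: RegisterRange, allocate and dynamicRegister are hypothetical names, not part of Direct3D. It assumes that each float4x3 occupies three float4 registers (consistent with 26 matrices occupying 78 registers in the listing) and that the address register holds the first register of the selected matrix, which the listing suggests but does not state.

```cpp
#include <cassert>

// A contiguous run of float4 constant registers: c[first] .. c[last()].
struct RegisterRange {
    int first;
    int count;
    int last() const { return first + count - 1; }
};

// Reserve registers for `elementCount` elements that each need
// `registersPerElement` float4 registers, starting at register `first`.
RegisterRange allocate(int first, int registersPerElement, int elementCount) {
    assert(first >= 0 && registersPerElement > 0 && elementCount > 0);
    return RegisterRange{first, registersPerElement * elementCount};
}

// Register touched by an access such as c<row>[a0], ASSUMING a0 holds the
// first register of matrix `boneIndex` (boneIndex * registersPerElement).
int dynamicRegister(const RegisterRange& palette, int registersPerElement,
                    int boneIndex, int row) {
    assert(boneIndex >= 0);
    assert(row >= 0 && row < registersPerElement);
    const int reg = palette.first + boneIndex * registersPerElement + row;
    assert(reg <= palette.last());
    return reg;
}
```

Under these assumptions, allocate(0, 3, 26) covers c0 to c77 and the next free register is c78, which matches the position of mViewProj in the listing. Because the bone index is only known at run time, any of the 78 palette registers may be read, which is consistent with the driver keeping the whole range addressable instead of pushing it into the hardware constant buffer.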
TABLE 5 SAMPLE CODE FROM MICROSOFT DIRECT3D SDK REFLECTIVE LIGHTING MODEL
HLSL
// Reflective Lighting Model
// Copyright (c) Microsoft Corporation. All rights reserved.
//--------------------------------------------------------------------------------------
…
// light direction (world space)
float3 lightDir = {0.577, -0.577, -0.577};
// Transformation Matrices
matrix matView : VIEW;
matrix matProj : PROJECTION;
matrix matWorld : WORLD;
….
--------------------------------------------------------------------------------------
HLSL
// Reflective Lighting Model
// Copyright (c) Microsoft Corporation. All rights reserved.
//--------------------------------------------------------------------------------------
// light direction (world space)
static const float3 lightDir = {0.577, -0.577, -0.577};
// lightDir will not use a constant register
// Transformation Matrices
matrix matView : VIEW;
matrix matProj : PROJECTION;
matrix matWorld : WORLD;
….
--------------------------------------------------------------------------------------
ASSEMBLY
vertexshader =
asm {
// Generated by Microsoft (R) HLSL Shader Compiler 9.24.950.2656
// ……
// Registers:
//
// Name Reg Size
// ------------ ----- ----
// matWorld c0 4
// matView c4 3
// …..
//
preshader
mul r0, c4.x, c0
……
mul r0, c6.w, c3
add c3, r1, r0
dot r0.xyz, (0.577, -0.577, -0.577), c4.xyz
dot r0.yzw, (0.577, -0.577, -0.577), c5.xyz
dot r0.zwx, (0.577, -0.577, -0.577), c6.xyz
dot r1.xyz, r0.xyz, r0.xyz
rsq r0.w, r1.x
mul c0.xyz, r0.w, r0.xyz
mul c5.xyz, c7.xyz, (0.358824, 0.311765, 0.059804)
mul c4.xyz, c8.xyz, (0.358824, 0.311765, 0.059804)
mul c6.xyz, (0.9, 0.9, 0.9), c9.xyz
// approximately 30 instructions used
ASSEMBLY
vertexshader =
asm {
// Generated by Microsoft (R) HLSL Shader Compiler 9.24.950.2656
//………
// Registers:
//
// Name Reg Size
// ------------ ----- ----
// matWorld c0 4
// matView c4 3
//
preshader
mul r0, c4.x, c0
……..
mul r0, c6.w, c3
add c7, r1, r0
dot r0.xyz, (0.577, -0.577, -0.577), c4.xyz
//static values
dot r0.yzw, (0.577, -0.577, -0.577), c5.xyz
dot r0.zwx, (0.577, -0.577, -0.577), c6.xyz
dot r1.xyz, r0.xyz, r0.xyz
rsq r0.w, r1.x
mul c4.xyz, r0.w, r0.xyz
// approximately 27 instructions used
Although the immediate constants and immediate indexed constants “pushed” into the hardware register perform better, constant buffers created by the driver are still required to support the number of constant registers specified for Shader Model 3.0 on the Intel 4 Series Chipset Family (224+16+16=256 for the pixel shader and 256+16+16=288 for the vertex shader).
OPTIMIZING DirectX 9
Local constants perform better than global constants, and immediate constants perform better than dynamically indexed constants. With dynamically indexed constants the driver cannot determine the index value a priori, so it must create a full-size constant buffer in memory instead of using the hardware constant buffer. To take advantage of the optimization, limit the use of global constants and of dynamically indexed constants (Cx[a0]), as these bypass the IIG optimization algorithm within the Intel driver.
D3D10 CONSTANTS MANAGEMENT
Direct3D10 places all shader constants in one or more buffer resources in memory and allows
managing this like any other resource. This is in contrast to Direct3D9 where each shader stage
had a limited constant register file and required frequent CPU access for changing or resetting
the values using SetXXXXXShaderConstantX.
The new method in Direct3D10 minimizes bandwidth as well as the overhead associated with
setting of shader constants. However, the D3D9 driver could optimize the constant delivery to
the hardware whereas on D3D10 more of this burden has shifted to the software developer.
In D3D10, constant buffers are managed like vertex or texture data buffers. They are updated via
Map (D3D10_MAP_WRITE_DISCARD) or by calling UpdateSubresource, which performs a CPU
copy of data from memory to the buffer.
Constants are organized into two constant buffer types: cbuffer and tbuffer. cbuffers are optimized for uniformly indexed data and sequential access, whereas tbuffers are optimized for arbitrarily indexed data and more random access, like a texture.
The size limits for these buffers are:
• cbuffer <= 4096 entries of 4 x 32-bit components (64 KB); although a large number of buffers can be created, D3D10 limits the number of cbuffers bound simultaneously to 14, plus 1 immediate constant buffer
• tbuffer <= 128 MB
As noted by the layout of the cbuffer, they are packed with a float4 granularity. As an example
two float2 values can be packed together whereas two float3 values would be stored as separate
entries. By default, the D3D10 compiler packs as many variables as possible per entry.
However, a keyword (packoffset) can be used to arrange constants in specific ways. This is
described in detail in the Microsoft SDK.
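The float4 packing rule can be made concrete with a small offset calculator. The C++ sketch below is a simplified model, not the compiler's implementation: Variable and packOffsets are hypothetical names, and the model assumes that a variable never straddles a 16-byte entry, that arrays and anything at least one entry wide start on a new entry, and that every array element except the last occupies whole entries. These assumptions were chosen because they reproduce the offsets shown in the compiler listings of this paper (for example the $Globals buffer of BasicHLSL10).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One HLSL variable: size of a single element in bytes (4 for float,
// 12 for float3, 64 for float4x4) and number of array elements
// (1 when the variable is not an array).
struct Variable {
    std::size_t elementBytes;
    std::size_t arrayCount;
};

constexpr std::size_t kRegisterBytes = 16;  // one float4 entry

static std::size_t roundUpToRegister(std::size_t offset) {
    return (offset + kRegisterBytes - 1) / kRegisterBytes * kRegisterBytes;
}

// Returns the byte offset of every variable, in declaration order.
std::vector<std::size_t> packOffsets(const std::vector<Variable>& vars) {
    std::vector<std::size_t> offsets;
    std::size_t cursor = 0;
    for (const Variable& v : vars) {
        assert(v.elementBytes > 0 && v.arrayCount > 0);
        const bool isArray = v.arrayCount > 1;
        const std::size_t usedInRegister = cursor % kRegisterBytes;
        // Start a new float4 entry for arrays, for values at least one entry
        // wide, and for values that would otherwise straddle two entries.
        if (isArray || v.elementBytes >= kRegisterBytes ||
            usedInRegister + v.elementBytes > kRegisterBytes) {
            cursor = roundUpToRegister(cursor);
        }
        offsets.push_back(cursor);
        // Every element but the last is padded to whole entries.
        const std::size_t stride = roundUpToRegister(v.elementBytes);
        cursor += stride * (v.arrayCount - 1) + v.elementBytes;
    }
    return offsets;
}
```

With this model, two float2 values land at offsets 0 and 8 (same entry), while two float3 values land at offsets 0 and 16 (separate entries), matching the example in the text. The packoffset keyword exists precisely to override this default placement.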
A constant buffer is bound to a shader stage using the API for that stage: VSSetConstantBuffers, GSSetConstantBuffers, or PSSetConstantBuffers.
Example: The Microsoft SDK Skinning10 sample demonstrates different methods of indexing bone transformation matrices for skinning on the GPU, along with Stream Out. Table 6 shows the use of cbuffers and tbuffers in skinning:
TABLE 6 SKINNING10 EXAMPLE
// Buffer Definitions:
// cbuffer cbAnimMatrices
// {
// float4x4 g_mConstBoneWorld[255]; // Offset: 0 Size: 16320
// }
// Buffer Definitions:
// tbuffer tbAnimMatrices
// {
// float4x4 g_mTexBoneWorld[255]; // Offset: 0 Size: 16320
// }
When Stream Out (SO) is disabled, the performance of cbuffers is much higher than that of tbuffers. As expected, with SO enabled there is minimal performance difference. The disassembly of the sample shows that the tbuffer version has more instructions, including texture loads. We recommend using cbuffers where possible, especially for smaller payloads. Usage of tbuffers is observed to be minimal in today's games.
In addition, all constants that are not placed in constant buffers are grouped under a global cbuffer, $Globals, as shown for the BasicHLSL10 sample in Table 7.
TABLE 7 BASICHLSL10
//
// Buffer Definitions:
// cbuffer $Globals
// {
//
// float4 g_MaterialAmbientColor; // Offset: 0 Size: 16
// float4 g_MaterialDiffuseColor; // Offset: 16 Size: 16
// int g_nNumLights; // Offset: 32 Size: 4 [unused]
// float3 g_LightDir[3]; // Offset: 48 Size: 44
// float4 g_LightDiffuse[3]; // Offset: 96 Size: 48
// float4 g_LightAmbient; // Offset: 144 Size: 16
// float g_fTime; // Offset: 160 Size: 4
// float4x4 g_mWorld; // Offset: 176 Size: 64
// float4x4 g_mWorldViewProjection; // Offset: 240 Size: 64
// }
It is tempting to create an uber constant buffer that houses all of the constants, especially when porting from DX9, which can result in a large global buffer. However, constant buffers are typically characterized by frequent updates from the CPU, and changing any constant value results in reloading the whole buffer to the GPU. This can have a significant performance impact.
For optimal constant buffer management, it is recommended that constants be partitioned into a set of separate buffers based on the frequency of updates and on the access pattern within a buffer.
Example: Constants are grouped according to whether they are updated once per level, once per frame, once per batch, once per Draw(), and so on; that is, based on how often they are updated.
TABLE 8 CONSTANT UPDATE
CBUFFER GLOBAL$ { VFOGCOLOR, … }
CBUFFER CBPERLEVELDATA { VSUNPOSITION, … }
CBUFFER CBPERFRAMEDATA { VAPPTIME, … }
CBUFFER CBPERPASSDATA { MATVIEWPROJ, VRENDERTARGETSIZE, … }
CBUFFER CBPEROBJECTDYNAMIC { VBONES, … }
CBUFFER CBPEROBJECTSTATIC { MATWORLD, … }
CBUFFER CBPERMATERIALA { VSPECPOWER, VBDRFCOEFFICIENT, … }
Carsten Wenzel describes the benefits in his Siggraph 2007 presentation, “Porting Game Engines to Direct3D 10: Crysis/Cryengine2”. According to Wenzel, a simple port from DirectX 9 showed ~7000 updates per frame in the in-game profiler; after optimization that figure dropped to ~5000, which was equivalent to the number of draw calls. Cryengine2 groups constants by frequency of update: per-frame, per-batch, per-instance, per-material, and per-light.
Microsoft has shown an example of the difference in the number of bytes updated when using an uber buffer vs. splitting the constants into multiple buffers. The benefit is outlined in Table 9 below:
TABLE 9 BYTES UPDATED: UBER BUFFER VS. MULTIPLE BUFFERS
100 SKINNED MESHES (100 MATERIALS), 900 STATIC MESHES (400 MATERIALS), 2 PASSES PER FRAME
UBER BUFFER
CBUFFER UBERCB
{
MATRIX VIEWPROJ;
MATRIX BONES[100];
MATRIX WORLD;
FLOAT SPECPOWER;
FLOAT4 BDRFCOEFFICIENTS;
FLOAT APPTIME; // SIZE: 4 BYTES
UINT2 RENDERTARGETSIZE;
}
BEGIN FRAME
SHADOW PASS
UPDATE UBERCB: 6560 X 100 = 656,000 BYTES
UPDATE UBERCB: 6560 X 900 = 5,904,000 BYTES
LIGHT PASS
UPDATE UBERCB: 6560 X 100 = 656,000 BYTES
UPDATE UBERCB: 6560 X 900 = 5,904,000 BYTES
END FRAME
TOTAL = 13,120,000 BYTES = 13 MB/FRAME
MULTIPLE BUFFERS
CBUFFER VSGLOBALPERFRAMECB
{
FLOAT APPTIME; // SIZE: 4 BYTES
};
CBUFFER VSPERSKINNEDCB
{
MATRIX BONES[100]; // SIZE: 6400 BYTES
};
CBUFFER VSPERSTATICCB
{
MATRIX WORLD; // SIZE: 64 BYTES
};
CBUFFER VSPERPASSCB
{
MATRIX VIEWPROJ; // SIZE: 64 BYTES
UINT2 RENDERTARGETSIZE; // SIZE: 8 BYTES
};
CBUFFER VSPERMATERIALCB
{
FLOAT SPECPOWER; // SIZE: 4 BYTES
FLOAT4 BDRFCOEFFICIENTS; // SIZE: 16 BYTES
};
BEGIN FRAME
UPDATE VSGLOBALPERFRAMECB: 4 X 1 = 4 BYTES
UPDATE VSPERSKINNEDCB: 6400 X 100 = 640,000 BYTES
UPDATE VSPERSTATICCB: 64 X 900 = 57,600 BYTES
SHADOW PASS
UPDATE VSPERPASSCB: 72 X 1 = 72 BYTES
LIGHT PASS
UPDATE VSPERPASSCB: 72 X 1 = 72 BYTES
UPDATE VSPERMATERIALCB: 500 X 20 = 10,000 BYTES
END FRAME
TOTAL = 707,748 BYTES = 708 KB/FRAME
BETTER THAN 18X LESS DATA UPDATED EVERY FRAME
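The totals in Table 9 follow directly from the buffer sizes and update counts. The C++ sketch below redoes that arithmetic as a check; Scene, uberBytesPerFrame, splitBytesPerFrame and reductionFactor are hypothetical names introduced for illustration, and the update pattern (which buffer is refreshed per mesh, per pass or per material) is taken from the table as given.

```cpp
#include <cassert>
#include <cstdint>

// Workload from Table 9.
struct Scene {
    std::uint64_t skinnedMeshes;
    std::uint64_t staticMeshes;
    std::uint64_t materials;
    std::uint64_t passes;
};

// Member sizes in bytes, as annotated in Table 9.
constexpr std::uint64_t kViewProjBytes = 64;
constexpr std::uint64_t kBonesBytes = 6400;   // MATRIX BONES[100]
constexpr std::uint64_t kWorldBytes = 64;
constexpr std::uint64_t kSpecPowerBytes = 4;
constexpr std::uint64_t kBdrfBytes = 16;
constexpr std::uint64_t kAppTimeBytes = 4;
constexpr std::uint64_t kRenderTargetSizeBytes = 8;

constexpr std::uint64_t kUberBytes =
    kViewProjBytes + kBonesBytes + kWorldBytes + kSpecPowerBytes +
    kBdrfBytes + kAppTimeBytes + kRenderTargetSizeBytes;
static_assert(kUberBytes == 6560, "matches the 6560-byte figure in Table 9");

// Uber buffer: the whole buffer is re-sent for every mesh in every pass.
std::uint64_t uberBytesPerFrame(const Scene& s) {
    return s.passes * (s.skinnedMeshes + s.staticMeshes) * kUberBytes;
}

// Split buffers: each buffer is re-sent only as often as its contents change.
std::uint64_t splitBytesPerFrame(const Scene& s) {
    const std::uint64_t perFrame = kAppTimeBytes;
    const std::uint64_t perSkinned = s.skinnedMeshes * kBonesBytes;
    const std::uint64_t perStatic = s.staticMeshes * kWorldBytes;
    const std::uint64_t perPass =
        s.passes * (kViewProjBytes + kRenderTargetSizeBytes);
    const std::uint64_t perMaterial =
        s.materials * (kSpecPowerBytes + kBdrfBytes);
    return perFrame + perSkinned + perStatic + perPass + perMaterial;
}

double reductionFactor(const Scene& s) {
    const std::uint64_t split = splitBytesPerFrame(s);
    assert(split > 0);
    return static_cast<double>(uberBytesPerFrame(s)) /
           static_cast<double>(split);
}
```

For the Table 9 workload (100 skinned meshes, 900 static meshes, 500 materials, 2 passes) this gives 13,120,000 bytes for the uber buffer and 707,748 bytes for the split layout, a reduction of roughly 18.5x. Almost all of the remaining traffic is the bone palette, which is the one buffer that genuinely changes per skinned mesh.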
It is generally preferred to have a larger number of small constant buffers. Additionally, it is better, where possible, to share constant buffers between different shaders. Table 10 shows an example where cbuffer cbConstant is shared between the vertex shader and the pixel shader. The pixel shader uses only the float3 watercolour; the rest are unused. This is generally observed in D3D10 code. Another optimization to keep in mind is that if there are constants that are unused by most of the shaders, moving them to the bottom allows a smaller buffer to be bound to those shaders. In the Table 10 example below, both float sun_shininess and float sun_strength could be moved to the bottom, since neither shader uses them.
TABLE 10 CONSTANT ORDER
//
//GENERATED BY MICROSOFT (R) HLSL SHADER COMPILER 9.24.949.2307
// BUFFER DEFINITIONS:
//
// CBUFFER CBCONSTANT
// {
// FLOAT3 WATERCOLOUR; // OFFSET: 0 SIZE: 12 [UNUSED]
// FLOAT SUN_SHININESS; // OFFSET: 12 SIZE: 4 [UNUSED]
// FLOAT SUN_STRENGTH; // OFFSET: 16 SIZE: 4 [UNUSED]
// FLOAT3 SUN_VEC; // OFFSET: 20 SIZE: 12
// }
//
// CBUFFER CBDYNAMIC
// {
// FLOAT4X4 MWORLD; // OFFSET: 0 SIZE: 64
// FLOAT4X4 MWORLDVIEWPROJ; // OFFSET: 64 SIZE: 64
// FLOAT4 CLIPPLANE; // OFFSET: 128 SIZE: 16
// }
// RESOURCE BINDINGS:
//
// NAME TYPE FORMAT DIM SLOT ELEMENTS
// ---------------- ---------- ------- ----------- ---- --------
// CBCONSTANT CBUFFER NA NA 0 1
// CBDYNAMIC CBUFFER NA NA 1 1
//
// INPUT SIGNATURE:
//
VS_4_0
DCL_INPUT V0.XYZ
DCL_INPUT V1.XYZ
DCL_INPUT V2.XY
DCL_OUTPUT_SIV O0.XYZW , POSITION
DCL_OUTPUT_SIV O1.X , CLIP_DISTANCE
DCL_OUTPUT O2.XY
DCL_OUTPUT O2.Z
DCL_OUTPUT O3.XYZW
DCL_CONSTANTBUFFER CB0[2], IMMEDIATEINDEXED
DCL_CONSTANTBUFFER CB1[9], IMMEDIATEINDEXED
DCL_TEMPS 2
MOV R0.XYZ, V0.XYZX
MOV R0.W, L(1.000000)
DP4 O0.X, R0.XYZW, CB1[4].XYZW
…………
MOV_SAT O3.XYZW, R0.XXXX
RET
//
//GENERATED BY MICROSOFT (R) HLSL SHADER COMPILER 9.24.949.2307
// BUFFER DEFINITIONS:
//
// CBUFFER CBCONSTANT
// {
// FLOAT3 WATERCOLOUR; // OFFSET: 0 SIZE: 12
// FLOAT SUN_SHININESS; // OFFSET: 12 SIZE: 4 [UNUSED]
// FLOAT SUN_STRENGTH; // OFFSET: 16 SIZE: 4 [UNUSED]
// FLOAT3 SUN_VEC; // OFFSET: 20 SIZE: 12 [UNUSED]
// }
// RESOURCE BINDINGS:
// NAME TYPE FORMAT DIM SLOT ELEMENTS
// ---------------- ---------- ------- ----------- ---- --------
// SDIFFUSE SAMPLER NA NA 0 1
// G_MESHTEXTURE TEXTURE FLOAT4 2D 0 1
// CBCONSTANT CBUFFER NA NA 0 1
// …………
PS_4_0
DCL_INPUT_PS LINEAR V2.XY
DCL_INPUT_PS LINEAR V2.Z
DCL_INPUT_PS LINEAR V3.XYZW
DCL_OUTPUT O0.XYZW
DCL_CONSTANTBUFFER CB0[1], IMMEDIATEINDEXED
DCL_SAMPLER S0, MODE_DEFAULT
DCL_RESOURCE_TEXTURE2D ( FLOAT , FLOAT , FLOAT , FLOAT ) T0
DCL_TEMPS 3
SAMPLE R0.XYZW, V2.XYXX, T0.XYZW, S0
MUL R1.XYZW, R0.XYZW, V3.XYZW
MAD R2.XYZ, -V3.XYZX, R0.XYZX, CB0[0].XYZX
MAD R2.W, -V3.W, R0.W, L(1.000000)
MAD O0.XYZW, V2.ZZZZ, R2.XYZW, R1.XYZW
RET
The assembly code generated by the HLSL compiler has two main declarations for constant buffers: dcl_constantBuffer and dcl_immediateConstantBuffer.
• A shader constant buffer is declared using dcl_constantBuffer cbN[size], where N is the constant buffer register number and size is the number of elements it has. The declaration also includes the access type of the buffer. There are two types: immediateIndexed, where the index used is a literal value, and dynamicIndexed, where the index is a computed value. This is applicable to VS, GS and PS.
• A shader immediate-constant buffer can also be declared using dcl_immediateConstantBuffer {values}, where values is an array of four-component elements. The buffer must contain at least one but less than 4096 values. Only one immediate constant buffer can be used with a shader. It is accessed like a constant buffer with dynamic indexing. This is also applicable to VS, GS and PS. In general, it is not observed to be used much in games.
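The size in a dcl_constantBuffer declaration appears to track the highest float4 element a shader actually reads rather than the full buffer: in Table 10 the vertex shader declares CB0[2] while the pixel shader, which reads only WATERCOLOUR, declares CB0[1] for the same cbuffer. The C++ sketch below models that observation; Member and declaredElements are hypothetical names, and the rule itself is inferred from the listings in this paper, not taken from a specification.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One cbuffer member as reported in the compiler listing.
struct Member {
    std::size_t offsetBytes;
    std::size_t sizeBytes;
    bool usedByShader;  // false when the listing marks the member [unused]
};

constexpr std::size_t kRegisterBytes = 16;  // one float4 element

// Number of float4 elements the shader must declare: the index of the last
// element touched by any used member, plus one. Returns 0 if nothing is used.
std::size_t declaredElements(const std::vector<Member>& members) {
    std::size_t elements = 0;
    for (const Member& m : members) {
        assert(m.sizeBytes > 0);
        if (!m.usedByShader) {
            continue;
        }
        const std::size_t lastByte = m.offsetBytes + m.sizeBytes - 1;
        const std::size_t needed = lastByte / kRegisterBytes + 1;
        if (needed > elements) {
            elements = needed;
        }
    }
    return elements;
}
```

Fed with the cbConstant layout of Table 10, the model yields 2 elements for the vertex shader (SUN_VEC ends in the second element) and 1 for the pixel shader, matching the two declarations. This is why member order matters: a rarely used constant placed early in the buffer forces every shader that reads anything after it to declare, and bind, a larger range.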
Table 11 below shows the Skinning10 example.
TABLE 11 SKINNING 10 EXAMPLE
//
// GENERATED BY MICROSOFT (R) HLSL SHADER COMPILER 9.23.949.2378
// BUFFER DEFINITIONS:
//
// CBUFFER $PARAMS
// {
// UINT IFETCHTYPE; // OFFSET: 0 SIZE: 4
//
// }
//
// CBUFFER CB0
// {
// FLOAT4X4 G_MWORLDVIEWPROJ; // OFFSET: 0 SIZE: 64
// FLOAT4X4 G_MWORLD; // OFFSET: 64 SIZE: 64
// }
//
// CBUFFER CBANIMMATRICES
// {
// FLOAT4X4 G_MCONSTBONEWORLD[255]; // OFFSET: 0 SIZE: 16320
// }
//
// TBUFFER TBANIMMATRICES
// {
// FLOAT4X4 G_MTEXBONEWORLD[255]; // OFFSET: 0 SIZE: 16320
// }
VS_4_0
DCL_INPUT V0.XYZ
DCL_INPUT V1.XYZW
DCL_INPUT V2.XYZW
DCL_INPUT V3.XYZ
DCL_INPUT V4.XY
DCL_INPUT V5.XYZ
DCL_OUTPUT_SIV O0.XYZW , POSITION
DCL_OUTPUT O1.XYZ
DCL_OUTPUT O2.XYZ
DCL_OUTPUT O3.XY
DCL_OUTPUT O4.XYZ
DCL_CONSTANTBUFFER CB0[1], IMMEDIATEINDEXED
DCL_CONSTANTBUFFER CB1[7], IMMEDIATEINDEXED
DCL_CONSTANTBUFFER CB2[1020], DYNAMICINDEXED
…
MOV O3.XY, V4.XYXX
RET
MULTIPLE CBUFFERS PERFORMANCE IMPACT
EXAMPLE 1: OCEAN FOG D3D10 DEMO
The Ocean Fog demo (Figure 1) is a good example of how to scale code for Intel Integrated Graphics. It combines a blur based on the Perlin noise algorithm with a Gaussian blur to give a smooth effect. Fog is projected onto mesh surfaces on the GPU. The Ocean Fog demo has 22 shaders in 9 effect files, and it uses 1.3 KB of constants. The demo allocates all the constants in cbuffers; there are no tbuffers. It does not use dynamicIndexed constants, which enables the Intel Integrated Graphics driver to optimize constant accesses using a lower-latency path.
Figure 1
Figure 2 shows the impact of different cbuffer arrangements on the time taken to update resources while running the Ocean Fog demo for 100 seconds.
Figure 2
Note 1: These measurements were taken on a Lenovo X301 with the Mobile Intel® 4 Series Express Chipset Family
Note 2: Shorter time implies better performance
Figure 2 data (chart title: OCEAN FOG; x-axis: cbuffers; y-axis: seconds)
cbuffer arrangement                         RESOURCECOPYREGION_D3D10  RESOURCEUPDATESUBRESOURCEUP_D310
1 global cbuffer                            7.177                     20.931
9 cbuffers (1 local per fx file)            6.601                     19.587
18 cbuffers (optimized per freq of update)  0.000                     8.694
The chart above shows the time taken by the COPYREGION_D3D10 and UPDATESUBRESOURCEUP_D310 API calls. Moving from most constants in one global buffer located in an include file to one local cbuffer per fx file gives a small improvement of about 7%. The third set of measurements shows a significant improvement of about 70% when using 18 buffers organized by frequency of constant update (2 cbuffers per effect file). In the optimized version, cbuffer cbConstant groups the constants that do not change during the shader invocation, and cbuffer cbDynamic groups the constants that change per frame, as shown in Table 12 for the effect file “fogmesh”. Table 13 shows the ASM including a global cbuffer and Table 13 shows the ASM for un-optimized local cbuffers (one per fx file).
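The percentages quoted for Figure 2 can be re-derived from the values plotted in the chart. The C++ sketch below does so under one assumption that the text does not spell out: that each improvement is measured on the combined time of both API calls, relative to the single global cbuffer. Measurement and improvementPercent are hypothetical names used only for this illustration.

```cpp
#include <cassert>

// Seconds spent in the two API calls plotted in Figure 2.
struct Measurement {
    double copyRegionSeconds;
    double updateSubresourceSeconds;
    double total() const { return copyRegionSeconds + updateSubresourceSeconds; }
};

// Relative reduction in combined time versus the baseline, in percent.
double improvementPercent(const Measurement& baseline,
                          const Measurement& candidate) {
    assert(baseline.total() > 0.0);
    return (baseline.total() - candidate.total()) / baseline.total() * 100.0;
}
```

With the chart values (7.177 + 20.931 seconds for one global cbuffer, 6.601 + 19.587 for nine local cbuffers, 0.000 + 8.694 for eighteen cbuffers), the combined time drops by about 6.8% and about 69.1% respectively, in line with the rounded 7% and 70% in the text.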
TABLE 12 OCEAN FOG CBUFFERS PER SHADER AND PER FRAME
//
// FOGMESH
// FX Version: fx_4_0
// Child effect (requires effect pool): false
//
// 2 local buffer(s)
//
cbuffer cbConstant
{
float4 g_MaterialAmbientColor; // Offset: 0, size: 16
float4 g_LightDiffuse; // Offset: 16, size: 16
float g_fNormalMapFactor; // Offset: 32, size: 4
float3 g_vSpotPos; // Offset: 36, size: 12
float g_fSpotFrustum; // Offset: 48, size: 4
float4 g_vSpotDiffuse; // Offset: 64, size: 16
float4 vWaterColor = { 0.0199999996, 0.0250000004, 0.0350000001, 1 }; // Offset: 80, size: 16
float3 vFogDirection = { 0, 0, -1 }; // Offset: 96, size: 12
float4 vFogColor = { 0.600000024, 0.600000024, 0.600000024, 1 }; // Offset: 112, size: 16
}
cbuffer cbDynamic
{
float3 g_LightDir; // Offset: 0, size: 12
float3 g_CameraPos; // Offset: 16, size: 12
float3 g_CameraForward; // Offset: 32, size: 12
float g_fTime; // Offset: 44, size: 4
float4x4 g_mWorld; // Offset: 48, size: 64
float4x4 g_mWorldViewProjection; // Offset: 112, size: 64
float3 g_vSpotDir; // Offset: 176, size: 12
float g_fSpotIntensity; // Offset: 188, size: 4
float4x4 g_mSpotWorld; // Offset: 192, size: 64
bool underwater; // Offset: 256, size: 4
float g_FogDensity; // Offset: 260, size: 4
float4 ClipPlane; // Offset: 272, size: 16
}
TABLE 13 OCEAN FOG GLOBAL CONSTANT BUFFER
// Fogmesh
// FX Version: fx_4_0
// Child effect (requires effect pool): false
//
// 2 local buffer(s)
//
cbuffer cbGlobal
{
float4x4 g_mWorld; // Offset: 0, size: 64
float4x4 g_mWorldViewProjection; // Offset: 64, size: 64
float4 ClipPlane; // Offset: 128, size: 16
float4 g_vCloudColor; // Offset: 144, size: 16
float g_fHeight; // Offset: 160, size: 4
float3 g_sunvec; // Offset: 164, size: 12
float4 g_sundiffuse; // Offset: 176, size: 16
float3 g_vCameraPos; // Offset: 192, size: 12
float3 g_vSpriteCenter; // Offset: 208, size: 12
float g_fCloudDensity; // Offset: 220, size: 4
float4 g_MaterialAmbientColor; // Offset: 224, size: 16
float4 g_LightDiffuse; // Offset: 240, size: 16
float g_fNormalMapFactor; // Offset: 256, size: 4
float3 g_vSpotPos; // Offset: 260, size: 12
float g_fSpotFrustum; // Offset: 272, size: 4
float4 g_vSpotDiffuse; // Offset: 288, size: 16
float3 g_LightDir; // Offset: 304, size: 12
float3 g_CameraPos; // Offset: 320, size: 12
float3 g_CameraForward; // Offset: 336, size: 12
float g_fTime; // Offset: 348, size: 4
float3 g_vSpotDir; // Offset: 352, size: 12
float g_fSpotIntensity; // Offset: 364, size: 4
float4x4 g_mSpotWorld; // Offset: 368, size: 64
float g_FogDensity; // Offset: 432, size: 4
float g_SunAlpha; // Offset: 436, size: 4
float g_SunTheta; // Offset: 440, size: 4
float g_SunShininess; // Offset: 444, size: 4
float g_SunStrength; // Offset: 448, size: 4
float4 g_mViewProjection; // Offset: 464, size: 16
float g_fFogDensity; // Offset: 480, size: 4
float4 g_fSpotDiffuse; // Offset: 496, size: 16
float3 g_vSpotCenter; // Offset: 512, size: 12
bool underwater; // Offset: 524, size: 4
float3 watercolour; // Offset: 528, size: 12
float sun_shininess; // Offset: 540, size: 4
float sun_strength; // Offset: 544, size: 4
float3 sun_vec; // Offset: 548, size: 12
float4x4 mWorld; // Offset: 560, size: 64
float4x4 mWorldViewProj; // Offset: 624, size: 64
float scale; // Offset: 688, size: 4
float inv_mapsize_x; // Offset: 692, size: 4
float inv_mapsize_y; // Offset: 696, size: 4
float4 corner00; // Offset: 704, size: 16
float4 corner01; // Offset: 720, size: 16
float4 corner10; // Offset: 736, size: 16
float4 corner11; // Offset: 752, size: 16
float amplitude; // Offset: 768, size: 4
}
cbuffer cbConstant
{
float4 vWaterColor = { 0.0199999996, 0.0250000004, 0.0350000001, 1 };// Offset: 0, size: 16
float3 vFogDirection = { 0, 0, -1 };// Offset: 16, size: 12
float4 vFogColor = { 0.600000024, 0.600000024, 0.600000024, 1 }; // Offset: 32, size: 16
}
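The offsets printed in these listings follow from the way HLSL packs cbuffer members into 16-byte registers: a member may not straddle a register boundary, and matrices start on a fresh register. The sketch below reproduces that rule in plain C++ so the listings can be checked by hand. It is a simplified model, not the compiler's implementation; `Member` and `packOffsets` are hypothetical names introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One cbuffer member: its size in bytes and whether it must begin on a new
// 16-byte register (assumed true for matrices, arrays and structs).
struct Member {
    std::size_t size;
    bool startsNewRegister;
};

// Returns the byte offset of every member under the simplified packing rule:
// pack tightly, but never let a member cross a 16-byte register boundary.
std::vector<std::size_t> packOffsets(const std::vector<Member>& members) {
    constexpr std::size_t kRegister = 16;
    std::vector<std::size_t> offsets;
    std::size_t cursor = 0;
    for (const Member& m : members) {
        assert(m.size > 0 && m.size % 4 == 0);  // HLSL scalars are 4 bytes wide
        const std::size_t used = cursor % kRegister;
        const bool straddles = used + m.size > kRegister;
        if (used != 0 && (m.startsNewRegister || straddles)) {
            cursor += kRegister - used;  // pad up to the next register
        }
        offsets.push_back(cursor);
        cursor += m.size;
    }
    return offsets;
}
```

Feeding it the member sizes of cbConstant from Table 12 (16, 16, 4, 12, 4, 16, 16, 12, 16) yields the offsets 0, 16, 32, 36, 48, 64, 80, 96, 112 shown in the listing. The gap between g_fSpotFrustum (offset 48) and g_vSpotDiffuse (offset 64) is padding, because a float4 cannot start in the middle of a register.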
TABLE 14 NOT OPTIMIZED ONE CONSTANT BUFFER PER FX FILE
// fogmesh
// FX Version: fx_4_0
// Child effect (requires effect pool): false
//
// 1 local buffer(s)
//
cbuffer cbConstant
{
float4 g_MaterialAmbientColor; // Offset: 0, size: 16
float4 g_LightDiffuse; // Offset: 16, size: 16
float g_fNormalMapFactor; // Offset: 32, size: 4
float3 g_vSpotPos; // Offset: 36, size: 12
float g_fSpotFrustum; // Offset: 48, size: 4
float4 g_vSpotDiffuse; // Offset: 64, size: 16
float4 vWaterColor = { 0.0199999996, 0.0250000004, 0.0350000001, 1 };// Offset:80, size: 16
float3 vFogDirection = { 0, 0, -1 };// Offset: 96, size: 12
float4 vFogColor = { 0.600000024, 0.600000024, 0.600000024, 1 };// Offset: 112, size: 16
float4x4 g_mWorld; // Offset: 128, size: 64
float4x4 g_mWorldViewProjection; // Offset: 192, size: 64
float4 ClipPlane; // Offset: 256, size: 16
float3 g_LightDir; // Offset: 272, size: 12
float g_fSpotIntensity; // Offset: 284, size: 4
float3 g_vSpotDir; // Offset: 288, size: 12
float4x4 g_mSpotWorld; // Offset: 304, size: 64
float g_FogDensity; // Offset: 368, size: 4
float g_fTime; // Offset: 372, size: 4
float3 g_CameraPos; // Offset: 384, size: 12
float3 g_CameraForward; // Offset: 400, size: 12
bool underwater; // Offset: 412, size: 4
}
//
// 6 local object(s)
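To put rough numbers on the three layouts, the sketch below derives each buffer's size from the last offset and size printed in its listing and counts the bytes that would be re-sent when the per-frame fogmesh constants change. It assumes that any change re-sends the whole buffer holding the changed values and that buffer sizes are rounded up to a multiple of 16 bytes. It does not model driver behaviour or timing, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Round a byte count up to the next multiple of 16.
constexpr std::size_t roundUp16(std::size_t bytes) {
    return (bytes + 15) / 16 * 16;
}

// Buffer size from the offset and size of the last member in a listing.
constexpr std::size_t bufferSize(std::size_t lastOffset, std::size_t lastSize) {
    return roundUp16(lastOffset + lastSize);
}

// Sizes read off the listings above.
constexpr std::size_t kGlobalBytes  = bufferSize(768, 4);   // cbGlobal, Table 13
constexpr std::size_t kPerFxBytes   = bufferSize(412, 4);   // cbConstant, Table 14
constexpr std::size_t kDynamicBytes = bufferSize(272, 16);  // cbDynamic, Table 12

// Bytes re-sent over a number of frames if the buffer is updated every frame.
constexpr std::size_t bytesUploaded(std::size_t bufferBytes, std::size_t frames) {
    assert(frames > 0);
    return bufferBytes * frames;
}
```

Under these assumptions a per-frame change re-sends 784 bytes with the global buffer, 416 bytes with one buffer per fx file, and 288 bytes when only cbDynamic has to be updated.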
EXAMPLE 2: SKINNING10
Figure 3 uses the Skinning10 SDK sample to show the impact of using a single buffer vs. multiple
buffers. This sample renders the skinned model multiple times (the soldiers counted in Figure 4)
and, when Stream Out is not used, skins it each time it is rendered.
FIGURE 3
The sample was run and measured on two different platforms. Figure 4 below shows a frame rate
impact from using multiple constant buffers that ranges from 5-8% in the minimum case to
15-20%.
FIGURE 4
NOTE 3: These measurements were taken on an HP Pavilion with the Mobile Intel® 4
Series Express Chipset Family.
OPTIMIZING DIRECTX 10
From a hardware perspective, pushing immediate constants has the highest performance, whereas
indexed constants normally incur a high-latency path. Among indexed constants, constant
buffers with literal indices perform better than those with computed indices. Finally,
indexed constant buffers with a computed scalar index (one that is independent of
pixel/vertex position) perform better than those with a computed vector index. The higher
access latency can also be amortized over the number of instructions in the shader that use the
constants. Another optimization is to use immediate constant buffers
(dcl_immediateConstantBuffer) where possible.
In general, building smaller, packed constant buffers grouped by frequency of update and access
pattern is ideal for higher performance. As an example: organize Per Frame/Per Pass/Per
Instance constant buffers first, which tend to be smaller in size and have a low update rate,
followed by Per Draw/Per Material constant buffers, which may also be small but have a higher
update rate. Finally, define large constant buffers such as skinning constants.
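The grouping described above can be sketched on the host side as three plain C++ structures, one per update frequency. This is a minimal illustration only: the member names, the 64-bone palette size and the byte-counting helpers are assumptions made for the example, and the structures mirror HLSL types without calling any Direct3D API.

```cpp
#include <cassert>
#include <cstddef>

// Plain C++ mirrors of HLSL types. Every member is a multiple of 16 bytes,
// so no packing padding is involved.
struct Float4   { float x, y, z, w; };
struct Float4x4 { Float4 rows[4]; };

// Hypothetical buffers, declared in the suggested order.
struct PerFrameConstants {            // small, low update rate
    Float4x4 viewProjection;
    Float4   cameraPos;
    Float4   lightDir;
};
struct PerDrawConstants {             // small, higher update rate
    Float4x4 world;
    Float4   materialColor;
};
constexpr std::size_t kMaxBones = 64; // assumed palette size
struct SkinningConstants {            // large, updated per skinned character
    Float4x4 bones[kMaxBones];
};

static_assert(sizeof(PerFrameConstants) % 16 == 0, "size must be a multiple of 16");
static_assert(sizeof(PerDrawConstants) % 16 == 0, "size must be a multiple of 16");
static_assert(sizeof(SkinningConstants) % 16 == 0, "size must be a multiple of 16");

// Bytes sent in one frame when each buffer is updated at its own rate.
inline std::size_t bytesGrouped(std::size_t draws, std::size_t skinnedDraws) {
    assert(skinnedDraws <= draws);
    return sizeof(PerFrameConstants)
         + draws * sizeof(PerDrawConstants)
         + skinnedDraws * sizeof(SkinningConstants);
}

// Bytes sent in one frame when everything lives in a single buffer that is
// re-sent in full for every draw.
inline std::size_t bytesSingleBuffer(std::size_t draws) {
    return draws * (sizeof(PerFrameConstants) + sizeof(PerDrawConstants)
                    + sizeof(SkinningConstants));
}
```

With 10 draws of which 2 are skinned, the grouped layout sends 9088 bytes in this model, against 42720 bytes for the single buffer. The absolute figures depend entirely on the assumed structures; the point is that the large skinning data is only sent when it is needed.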
[Figure 4 chart: "Performance Improvement with Cb" on GM45; x-axis: # of Soldiers (1, 5, 10, 20, 30); y-axis from 1.00 to 1.25.]
Another optimization is to break up constant buffers based on features that are
optional in games (e.g. shadows, post-processing effects, etc.). Because of performance
constraints on integrated platforms, some of these features are going to be either disabled or
run at a lower setting. Given this, it would be beneficial to break up constants into separate
buffers and then disable the updates to these constant buffers based on the settings selected
by the gamer/user.
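One way to picture the feature-based split is an update loop that skips buffers belonging to disabled features. The sketch below counts bytes instead of calling the graphics API; `Feature`, `FeatureBuffer`, `Settings` and `updateBuffers` are hypothetical names, and the buffer sizes used with it are arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Feature { Core, Shadows, PostProcess };

// A constant buffer tagged with the optional feature that consumes it.
struct FeatureBuffer {
    Feature     feature;
    std::size_t sizeBytes;
    bool        dirty;      // CPU-side copy changed since the last upload
};

// Quality settings selected by the user.
struct Settings {
    bool shadows;
    bool postProcess;
};

inline bool isEnabled(Feature f, const Settings& s) {
    switch (f) {
        case Feature::Shadows:     return s.shadows;
        case Feature::PostProcess: return s.postProcess;
        case Feature::Core:        return true;
    }
    return true;
}

// Uploads only dirty buffers whose feature is enabled and returns the number
// of bytes that would have been sent. A real application would issue the
// buffer update call where the byte count is accumulated.
inline std::size_t updateBuffers(std::vector<FeatureBuffer>& buffers,
                                 const Settings& settings) {
    std::size_t uploaded = 0;
    for (FeatureBuffer& b : buffers) {
        if (!b.dirty || !isEnabled(b.feature, settings)) {
            continue;  // disabled buffers stay dirty until re-enabled
        }
        uploaded += b.sizeBytes;
        b.dirty = false;
    }
    return uploaded;
}
```

Leaving a skipped buffer marked dirty is a deliberate choice in this sketch: if the user turns the feature back on, the next call uploads the pending data without any extra bookkeeping.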
For indexed constant buffers, it is recommended to keep the buffer size tailored to actual needs.
For example, if the shader iterates over only 5 elements, declare a 5-element constant buffer for
this shader rather than a general-purpose 50-element constant buffer shared among shaders.
This allows the driver to optimize placement so that access incurs a low-latency path.
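As a rough illustration of the size difference, the sketch below mirrors the two declarations on the host side, assuming each element is a float4 (16 bytes). The element type and the names `IndexedConstants`, `TailoredBuffer` and `GeneralBuffer` are assumptions for the example; the source only gives the element counts.

```cpp
#include <cassert>
#include <cstddef>

struct Float4 { float x, y, z, w; };  // assumed element type, 16 bytes

// Host-side mirror of an indexed constant buffer with N elements.
template <std::size_t N>
struct IndexedConstants {
    Float4 elements[N];
};

using TailoredBuffer = IndexedConstants<5>;   // sized for the shader's loop
using GeneralBuffer  = IndexedConstants<50>;  // general purpose, shared

static_assert(sizeof(TailoredBuffer) % 16 == 0, "size must be a multiple of 16");
static_assert(sizeof(GeneralBuffer) % 16 == 0, "size must be a multiple of 16");
```

Under this assumption the tailored buffer occupies 80 bytes and the general-purpose one 800 bytes, a tenfold difference in the data the driver has to place.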
SUMMARY
Using the above tips and tricks to optimize your application for Intel Integrated Graphics will
help to ensure that your application will run well on the largest volume graphics platforms. For
any issues in implementing these tips, please visit the links below. We welcome feedback and
ways to enhance this guide with more information. See the Legal Information section of this
document (page 2) regarding any feedback provided to Intel.
WEB SITE AND ENGINEERING SUPPORT
Software developers can go to the forum at
http://software.intel.com/en-us/forums/user-community-for-intel-graphics-technology/
and post questions/comments about the complete line of Intel®’s Integrated Graphics chipset
solutions. If you are a game developer, many useful documents, including topics from
multithreading to audio, are available at http://www.intel.com/software/games.
REFERENCES
Intel® Graphics Media Accelerator Developer Guide http://software.intel.com/en-us/articles/intel-graphics-media-accelerator-developers-guide
Microsoft DirectX SDK http://msdn.microsoft.com/en-us/directx/default.aspx
Wenzel, C.: Porting Game Engines to Direct3D 10: Crysis / CryEngine™ 2. SIGGRAPH 2007.
Oceanfog Demo http://software.intel.com/en-us/articles/ocean-fog-using-direct3d-10/
APPENDIX
TABLES
Table 1 Immediate, immediate indexed, dynamic indexed constants
Table 2 DirectX 10 immediate constant
Table 3 DirectX 9 Shader Model 3
Table 5 Sample code from Microsoft Direct3D SDK skinned mesh
Table 6 Sample code from Microsoft Direct3D SDK reflective lighting model
Table 7 Skinning10 example
Table 8 BasicHLSL10
Table 9 Constant update
Table 10 Bytes updated uber buffer vs multiple buffers
Table 11 Constant order
Table 12 Skinning10 example
Table 14 Ocean Fog cbuffers per shader and per frame
Table 15 Ocean Fog global constant buffer
Table 16 Not optimized one constant buffer per fx file