Optimizing Direct X On Multi Core Architectures
1
Game Developers Conference 2008
Optimizing DirectX on Multi-core architectures
Leigh Davies, Senior Application Engineer, Intel
February 2008
Contributions from:
David Potages, GRIN*
Jeff Andrews, Intel®
Rita Turkowski, Intel®
Kev Gee, Microsoft*
*Other names and brands may be claimed as the property of others
3
Agenda
Graphics and the CPU
Profiling Graphics and Drivers
Threading the render thread
Case Study GRIN*
Summary
4
Graphics is CPU Intensive
Charts: per-frame CPU time split into Application, D3D Runtime, Driver and Other for World in Conflict*, Bionic Commando*, the Crysis* CPU benchmark and the Crysis* GPU benchmark.
D3D Runtime and Driver account for 25-40% of CPU cycles per frame**
**Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Data taken on Intel® QX6700 processor at 2.67 GHz, NVIDIA 8800GTX GPU, 2 GB memory.
5
Designing the Rendering Pipeline.
• Analyze the whole program
  - Your application
  - DirectX API usage and overheads
  - Video card driver
• Have defined performance goals
  - Use key gameplay-targeted scenarios for perf analysis
  - Build benchmarks / test levels
Pipeline diagram: Application (render functions) -> Direct3D* Runtime -> Command Buffer -> Software Driver -> Video Card
Screenshot: World in Conflict*

DX9 API Call**            Cycle count
SetVertexShader           3000-12100
SetPixelShaderConstant    1500-9000
SetTexture                2500-3100
DrawPrimitive             1050-1150
ZFUNC                     510-700
**Timings taken from msdn2.microsoft.com/en-us/library/bb172234(VS.85).aspx
6
Balancing Future Workloads
Intel® roadmap graphic (tick-tock):
- 65nm (2 years)
  - Tick, compaction/derivative: Intel Core™ Duo, Pentium-D
  - Tock, Intel Core™ microarchitecture: Intel Core™2 Duo, DC Intel Xeon® 5100
- 45nm (2 years)
  - Tick, compaction/derivative: PENRYN
  - Tock, new microarchitecture: NEHALEM
Callouts:
- Scalable & configurable cache, interconnects & memory controllers
- Scalable performance: 1 to 8 threads & 1 to 4 cores
7
Time is Money
- Be realistic, rendering costs CPU time
- The rendering thread is a potential bottleneck for N-core scaling
- Rendering costs are likely to increase as you add more physics, effects or even AI objects
- Runtime and driver costs are significantly higher on the PC than on the consoles
- Use performance analysis results to focus development efforts
- Analyze regularly and catch regressions early
Optimise the graphics thread. Offload as much as possible.
8
Agenda
Graphics and the CPU
Profiling Graphics and Drivers
Threading the render thread
Case Study GRIN
Summary
9
Overview of Graphics Driver Models
Windows* XP Display Model (XPDM) - DX9
- The kernel-mode driver controls threading
Windows Vista* Display Driver Model (WDDM) - DX9
- The D3D9 runtime manages creation of threads
- One is created specifically for the User Mode Driver (UMD)
Windows Vista Display Driver Model (WDDM) - DX10
- The driver is responsible for creating threads
- Currently released drivers don't thread
- Could change in the near future
The graphics driver can have a major impact on performance and multi-core scaling.
10
Profiling Tools
Need to use a variety of tools:
- Use a repeatable workload
CPU tools:
- VTune™ Performance Analyzer
- Intel® Thread Profiler
- PIX for Windows*
- AMD CodeAnalyst™
GPU tools:
- PIX for Windows with vendor plugins
- NVIDIA* PerfHUD
- ATI* PerfStudio
11
Profiling Graphics with VTune™ Analyzer
Select Counter Monitor for a quick overview:
- Not necessary to launch the app
- Disable display of counter data unless running windowed
Profile across a selection of configurations:
- Identify different bottlenecks based on h/w limitations
- "Works great on my machine" isn't good enough
12
VTune™ Performance Analyzer - Sampling
• Calibration isn't needed for games
• Delay sampling allows alt-tab or bypassing loading
• Tracking core usage needs to be added
• Privileged time shows time inside the kernel
13
VTune™ Analyzer Views
• Processor Usage
• Memory Usage
• Context Switching
• CPU Frequency
VTune™ Analyzer allows you to add your own counters.
14
Sampling - Display Model XPDM
Diagram of the modules that show up when sampling:
- User mode: Application, D3D Runtime
- Kernel mode: Win32k & Dxg, Display Driver, Videoport, Miniport Driver
- Session space is marked within kernel mode
15
Sampling - Display Model WDDM
Diagram of the modules that show up when sampling:
- User mode, application process: Application, D3D Runtime, User Mode Driver
- User mode, DWM process: DWM
- Kernel mode: Win32k, CDD, Dxgkrnl, Kernel Driver
- Session space is marked within kernel mode
16
Associating Symbols in VTune™ Analyzer
- Configure -> Options -> Directories -> Symbol Repository
- View Symbol Repository -> Delete unassociated modules
- In Tuning Browser select "Results" -> "Module Associations..."
- Edit symbol associations
17
Symbol Information for DX10Core.dll
Symbols taken while profiling the SoftParticle sample in the SDK
18
PIX for Windows
Screenshot: PIX timeline with CPU and GPU rows
- Gathering GPU events requires Windows Vista
- Cross over between PIX and VTune™ counters
- Easy to see CPU/GPU headroom
19
Intel® PIX Plug-in: Beta Available Now
- Provides access to Intel® counters in PIX
- Rollout now to support IIG profiling

# | Metric Name | Description
1 | Frame Time | Instantaneous frame time in milliseconds.
2 | Frames per Second | Instantaneous frame rate normalized to seconds (inverted frame time).
3 | Driver Time | The amount of time spent in the display driver, normalized to milliseconds.
4 | Driver Time Stalled | The amount of time spent in the display driver either busy stalled or in a sleep state, normalized to milliseconds.
5 | Graphics Memory Used - MB | The amount of graphics memory currently utilized, normalized to MB.
6 | Graphics Memory Used - bytes | The amount of graphics memory currently utilized, normalized to bytes.
7 | Texture Memory Used | The amount of texture memory currently utilized, normalized to MB.
8 | GPU Busy | The percent utilization of the front end of the GPU. This metric shall describe the incoming command stream and does NOT describe the utilization of the array of execution units (cores).
9 | Cores Busy | The percentage of time that any core in the array is either actively executing instructions or stalled.
10 | Cores Active | The percentage of time that the core array is actively executing instructions.
11 | Vertex Count | The number of vertices that entered the pipeline.
12 | Triangle Count | The number of triangles that flowed through the pipeline prior to any clipping or culling.
13 | Texel Count | The number of texels that were fetched by the pipeline.
14 | Pixels Drawn | The number of pixels that were actually written to the render target.
15 | Mathbox Utilization | The aggregated percentage of time that the mathbox was actively executing instructions.
16 | Texture Unit(s) Utilization | The aggregated percentage of time that the texture units were actively processing texels.
20
Agenda
Graphics and the CPU
Profiling Graphics and Drivers
Threading the render thread
Case Study GRIN
Summary
21
Starting Points
Common issues:
- Naive ports to Windows* from console models
- Excessive context switching/synchronization overhead
- Work starvation due to thread sync dependencies
General rules:
- Use only 1 heavyweight thread per core on Windows
- Manage job distribution
- The OS scheduler knows best
- Consider memory bandwidth
Multi-core and D3D usage:
- Avoid use of the D3DCREATE_MULTITHREADED flag
- You CAN manage synch costs better
- Design around a single-threaded D3D device access model
- Lock resources from the main thread, manually protect access
22
Making the Drivers Work for You!
Diagram: App -> D3D Runtime -> D3D Driver, with producer & consumer threads dispatching commands to the GPU
- Potential 20%+ speed gain.
- Can be disabled by application behaviour.
Pack your DrawPrimitive2 calls together
Frequently creating & destroying shaders, VB, IB, and surfaces will impact performance
Avoid allocating too many system memory resources
DrawPrimitiveUP or DrawIndexedPrimitiveUP
23
Making the Drivers Work for You!
- Avoid any calls that return GPU state information; they require a CPU thread synchronization
- Driver queries are OK (calls are asynchronous)
- Do not lock threads to a specific CPU!
- Group all resource updates (texture and vertex) together: once per frame at the beginning or end is fine, just don't scatter them among drawing calls
- Minimize use of any locks/unlocks
- System memory vertex buffers
  - D3DUSAGE_DYNAMIC, use with D3DUSAGE_WRITEONLY
  - Lock with D3DLOCK_DISCARD or D3DLOCK_NOOVERWRITE
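One common way to combine the two lock flags is to append into a dynamic vertex buffer without overwriting data the GPU may still be reading, and to discard the buffer only when it is full. The sketch below shows just that bookkeeping in portable C++. `DynamicBufferCursor` and `LockDecision` are hypothetical names, not part of Direct3D, and the actual `Lock` call appears only in a comment.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: decides where to write next in a dynamic vertex
// buffer and which lock flag would go with that write.
struct LockDecision {
    std::size_t offset;  // byte offset to lock at
    bool discard;        // true: D3DLOCK_DISCARD, false: D3DLOCK_NOOVERWRITE
};

class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacity) : capacity_(capacity) {}

    LockDecision reserve(std::size_t bytes) {
        assert(bytes <= capacity_);
        if (next_ + bytes > capacity_) {
            // Buffer is full: restart at offset 0 and ask for fresh memory,
            // so data the GPU may still be reading is left alone.
            next_ = bytes;
            return {0, true};
        }
        // Enough room left: append after the data already written.
        LockDecision decision{next_, false};
        next_ += bytes;
        return decision;
    }

private:
    std::size_t capacity_;
    std::size_t next_ = 0;
};

// With a real IDirect3DVertexBuffer9* vb the decision would be used as:
//   vb->Lock(d.offset, bytes, &ptr,
//            d.discard ? D3DLOCK_DISCARD : D3DLOCK_NOOVERWRITE);
```

Appending with NOOVERWRITE and discarding only on wrap-around keeps lock calls from stalling on the GPU, in line with the advice to minimize synchronization.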
24
Threading Issues
- Race conditions between threads
  - Object updates
  - Creation/deletion of objects
- False sharing of data between threads
- Accessing hardware resources
Timeline diagram:
- Main thread (frame n): Move Object X, Delete Object Y
- Render thread (frame n-1): Render Object X, Render Object Y
25
Threading Options
Pipeline diagram: the front-end (logic) thread feeds a command queue that a consumer thread, the back-end (render), executes; each frame ends at EOF
• Avoiding the issues
  • Use an update queue, lightweight (lock-free?)
  • Make duplicate objects / double-buffer them
  • Reference count objects
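A minimal sketch of the update-queue idea, assuming a plain mutex rather than the lock-free queue the slide hints at. `CommandQueue` is a hypothetical name. The front-end pushes closures, and the back-end swaps the whole batch out under the lock and runs it outside the lock, so the two threads contend only for the duration of a push or a swap.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical update queue between the logic thread and the render thread.
class CommandQueue {
public:
    // Front-end (logic) thread: record work for the renderer.
    void push(std::function<void()> cmd) {
        assert(cmd != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(cmd));
    }

    // Back-end (render) thread: take everything queued so far and run it
    // in submission order. Returns the number of commands executed.
    std::size_t drain() {
        std::vector<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            batch.swap(pending_);
        }
        for (auto& cmd : batch) cmd();
        return batch.size();
    }

private:
    std::mutex mutex_;
    std::vector<std::function<void()>> pending_;
};
```

Because object moves and deletions travel through the queue, the renderer applies them at a point of its own choosing instead of racing with the main thread.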
26
Buffering Dynamic Data
- Partially buffered locks consume more video memory.
- Fully buffered locks consume more system memory and have an associated CPU cost for memory copying.
Partially buffered locks (diagram):
- Main thread (frame n) modifies Vertex Buffer0 while the render thread (frame n-1) renders the object from Vertex Buffer1
- Main thread (frame n+1) modifies Vertex Buffer1 while the render thread (frame n) renders the object from Vertex Buffer0
Fully buffered locks (diagram):
- Main thread: Lock Buffer, Modify Buffer, Unlock Buffer on a local buffer, which is placed in Data Queue0 / Data Queue1
- Render thread: Lock Buffer, Copy Data, Unlock Buffer on the video buffer
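The partially buffered scheme can be sketched as two slots whose roles swap at the end of each frame. `DoubleBuffered` and `VertexData` are hypothetical names, and plain `std::vector<float>` stands in for a vertex buffer. The sketch assumes `flip()` is called only at a point where neither thread is touching the slots.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical two-slot container: the main thread writes one slot while
// the render thread reads the other.
template <typename T>
class DoubleBuffered {
public:
    T& write() { return slots_[write_index_]; }            // main thread, frame n
    const T& read() const { return slots_[1 - write_index_]; }  // render thread, frame n-1
    void flip() { write_index_ = 1 - write_index_; }        // end of frame only

private:
    std::array<T, 2> slots_{};
    int write_index_ = 0;
};

using VertexData = std::vector<float>;  // stand-in for a real vertex buffer

// Usage sketch: two simulated frames on a single thread.
inline void demo_two_frames() {
    DoubleBuffered<VertexData> vb;
    vb.write() = {0.0f, 1.0f};      // frame n: main thread fills buffer 0
    vb.flip();                      // end of frame: roles swap
    assert(vb.read().size() == 2);  // frame n+1: renderer sees frame n data
}
```

The cost is a second copy of the buffer, which matches the slide's note that this variant consumes more video memory.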
27
Sub Threading Options
Diagram: the front-end (logic) and the back-end (render) each hand jobs to a job queue; each frame ends at EOF
• Job queue offloads
  • Software visibility culling
  • Particle generation
  • Character skinning
  • Procedural updates
• Reduces path size through both front and back ends
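As a rough illustration of a job queue, the sketch below lets a set of worker threads, plus the calling thread, pull independent jobs from a shared list through an atomic index. `run_jobs` is a hypothetical helper. A production engine would keep its workers alive across frames rather than spawning them per call.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: runs every job exactly once, spread over
// worker_count extra threads plus the calling thread.
void run_jobs(const std::vector<std::function<void()>>& jobs,
              unsigned worker_count) {
    std::atomic<std::size_t> next{0};
    auto worker = [&jobs, &next] {
        for (;;) {
            const std::size_t i = next.fetch_add(1);  // claim the next job
            if (i >= jobs.size()) return;
            assert(jobs[i] != nullptr);
            jobs[i]();
        }
    };
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < worker_count; ++w) workers.emplace_back(worker);
    worker();  // the calling thread helps instead of idling
    for (auto& t : workers) t.join();
}
```

Jobs such as culling, skinning or particle generation fit this shape when each job writes only to its own output range.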
28
Threading the DX API
Diagram: DX9 Render System -> D3D9 Wrapper, D3DDevice9 Wrapper, D3DVertexBuffer9 Wrapper -> D3D9, D3DDevice9, D3DVertexBuffer9 -> Graphics Driver -> Graphics Device

CPU time breakdown (values as shown on the slide):
DX9, before:
- Main Thread 46.46 (15.82% in DX9)
- NVIDIA driver 23.02
- Physics 10.91
- Other threads 19.35
DX9, with a DX API thread (16% increase*):
- Main Thread 39.08
- DX API Thread 7.38
- NVIDIA driver 23.02
- Physics 10.91
- Other threads 19.35
DX10, before:
- Main Thread 63.84 (28.39% in DX10+Driver)
- Physics 13.95
- Other threads 21.88
DX10, with a DX API thread (39% increase*):
- Main Thread 45.72
- DX API Thread 18.12
- Physics 13.95
- Other threads 21.88

- Similar to DX9 threading in the runtime
  - Potentially repeating the same work
- Potential to move simple API code out of the main thread, i.e. state management
- DX10 has lower runtime costs
* Theoretical increase based on amount of API work offloaded; does not include threading overhead.
**Data taken on Intel® QX6700 processor at 2.67 GHz, NVIDIA 8800GTX GPU, 2 GB memory.
29
Agenda
Graphics and the CPU
Profiling Graphics and Drivers
Threading the render thread
Case Study GRIN
Summary
30
Case study: Grin’s engine*
David Potages, Senior Engine Architect, GRIN
February [email protected]
*Performance figures discussed in this case study refer to a pre-release version of the game. They are subject to change before release and are for illustration only.
31
Quick Engine Overview
- 3rd generation of threaded engine
- 2nd generation of threaded renderer
- Used in several games
32
Quick Engine Overview
- Not game specific: game code in Lua scripts
  - Allows hot-reload, no link time, custom debugger
  - But single threaded, a lot of memory allocations
- Deferred rendering
  - DX9 (DX10 being implemented)
- Libraries:
  - PhysX™
  - OpenAL
  - Bink*
All the technology choices have a great impact on the possible parallelization!
33
Why multi-threading?
- Poor CPU usage
  - Can go down to 30%
- A lot of time spent in D3D/driver
  - 35-45%*
- But a lot of the application time is dedicated to rendering
  - Up to 37%*
  - Grand total of 53%* of frame with D3D/driver
Pie chart (legend: Application, D3D Runtime, Driver, Other) with slices labelled 46%, 17% and 29%
*Data taken on Intel® QX9650 processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2 GB memory, Windows Vista™ Ultimate.
34
Why multi-threading the renderer?
Simplified pipeline (ST version), stages shown: Script update (Lua*), World update (PhysX™), Sound (OpenAL*), Network, Culling, Particles batch optimizations, Rendering
- Some systems or the drivers they use can take advantage of multi-cores
- Rendering has low dependencies with other systems, but big data dependencies
Rendering is an easy target for multithreading: low system dependencies, 53% of frame time
But easier said than done!
35
Implementation Details
Main thread
- Entity/World updates, Animations, Input, Network, Lua, SoundSystem, Physics (main)
Renderer thread
- Culling (including software occlusion queries)
- Particle effects batch optimizations
- RenderDevice (D3D)
- Win32 messaging
Other
- File streaming
- PhysX™ threads
- Driver threads
36
Implementation Details
Messages sent to the renderer
- Non-blocking:
  - render_scene
  - render_frame
  - update_window
  - etc.
- Blocking:
  - flush_pipe
flush_pipe forces the renderer to execute all the queued jobs => synchronization point
- Used between frames on the main thread
- Can be used to ensure that data (e.g. textures) is ready
Timeline diagram: the front-end (logic) and the back-end (render) run in parallel and meet at each flush/sync point; whichever side finishes first idles until then
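The split between non-blocking messages and a blocking flush can be sketched with a mutex and two condition variables. This is an illustration under stated assumptions, not GRIN's implementation. `RenderThread` and `post` are hypothetical names, and only `flush_pipe` is taken from the slide.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical renderer thread that executes posted messages in order.
class RenderThread {
public:
    RenderThread() : worker_([this] { run(); }) {}

    ~RenderThread() {
        { std::lock_guard<std::mutex> lock(mutex_); quit_ = true; }
        wake_.notify_all();
        worker_.join();  // remaining messages are executed before exit
    }

    // Non-blocking: queue a message (render_scene, update_window, ...).
    void post(std::function<void()> msg) {
        { std::lock_guard<std::mutex> lock(mutex_); queue_.push_back(std::move(msg)); }
        wake_.notify_one();
    }

    // Blocking: returns once every queued message has been executed.
    void flush_pipe() {
        assert(std::this_thread::get_id() != worker_.get_id());  // would deadlock
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return queue_.empty() && !busy_; });
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            wake_.wait(lock, [this] { return quit_ || !queue_.empty(); });
            if (queue_.empty()) return;  // quit requested and nothing left
            std::function<void()> msg = std::move(queue_.front());
            queue_.pop_front();
            busy_ = true;
            lock.unlock();
            msg();  // run the message without holding the lock
            lock.lock();
            busy_ = false;
            if (queue_.empty()) idle_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::condition_variable idle_;
    std::deque<std::function<void()>> queue_;
    bool busy_ = false;
    bool quit_ = false;
    std::thread worker_;  // declared last so it starts after the other members exist
};
```

Calling `flush_pipe()` between frames gives the main thread a known point at which the renderer is idle, which is also where mirrored data can be safely exchanged.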
37
Implementation Details
- States need to be mirrored
- State changes are queued, and updated in the freeze
- The proper state is returned depending on the calling thread
This will avoid contention when data is accessed in the renderer, but mirror only what is required
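A minimal sketch of state mirroring, assuming the "freeze" is a sync point where neither thread touches the state. `MirroredState` is a hypothetical name. For brevity it exposes one getter per thread rather than selecting the copy from the calling thread's identity as the slide describes.

```cpp
#include <cassert>

// Hypothetical mirrored value: the main thread edits its own copy, and the
// renderer keeps seeing the old value until sync() runs at the freeze.
template <typename T>
class MirroredState {
public:
    explicit MirroredState(T initial)
        : main_copy_(initial), render_copy_(initial) {}

    void set(const T& value) {  // main thread only
        main_copy_ = value;
        dirty_ = true;
    }
    const T& get_main() const { return main_copy_; }      // main thread only
    const T& get_render() const { return render_copy_; }  // render thread only

    // Call only during the freeze, when both threads are stopped.
    void sync() {
        if (dirty_) {
            render_copy_ = main_copy_;
            dirty_ = false;
        }
    }

private:
    T main_copy_;
    T render_copy_;
    bool dirty_ = false;
};

// Usage sketch: a camera field of view changed mid-frame.
inline void demo_mirrored_fov() {
    MirroredState<int> fov(60);
    fov.set(90);
    assert(fov.get_render() == 60);  // renderer still uses last frame's value
    fov.sync();
    assert(fov.get_render() == 90);
}
```

Every mirrored value costs a copy at the freeze, which is one reason to mirror only what is required.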
38
Results
- Better CPU usage
  - 40-60%*
- Better thread workload
*Data taken on Intel® QX6700 processor at 2.67 GHz, NVIDIA 8800GTX GPU, 2 GB memory.
39
Results: Rendering Performance
- Better FPS
  - 4C MT is 1.88x faster than 1C*
  - 4C MT is 1.20x faster than 4C ST*
- Analysis
  - Remember that the drivers are partially threaded: we save up to 17% plus the share of D3D/driver time that is not threaded
  - Close to 1.20x: if D3D/driver were completely threaded, the new frame time would be 1 - 0.17 = 83% of the old one, and the scale-up is
    fps_new / fps_old = time_old / time_new = time_old / (time_old * 0.83) = 1.20
  - Maximum scale-up vs. 1C is 2.12x
  - Context switches, cache misses and contention slow us down.
  - Render-thread bound
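The scale-up arithmetic can be written generally. If a fraction f of the frame time is moved off the critical thread and fully overlapped, the frame-rate ratio is as follows. The reading of 2.12x as the f = 0.53 case is an assumption, based on the 53% rendering total quoted earlier in the case study.

```latex
\[
S(f) \;=\; \frac{\mathrm{fps}_{new}}{\mathrm{fps}_{old}}
     \;=\; \frac{t_{old}}{t_{new}}
     \;=\; \frac{t_{old}}{(1-f)\,t_{old}}
     \;=\; \frac{1}{1-f}
\]
\[
S(0.17) = \frac{1}{0.83} \approx 1.20,
\qquad
S(0.53) = \frac{1}{0.47} \approx 2.13 \quad \text{(the slide quotes 2.12x)}
\]
```

Both values are upper bounds, since they ignore the context switches, cache misses and contention listed above.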
Bar chart: CPU usage and FPS (axis 0-100) for 1C, 2C ST, 2C MT, 4C ST and 4C MT
• Effect on a low physics/gameplay workload
*Data taken on Intel® QX9650 processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2 GB memory, Windows Vista™ Ultimate.
40
Improvements
- Threading some parts of the render thread
  - E.g.: culling (~9-25%* of the render thread)
- Reducing contention
  - Mainly memory
- Batch more
  - E.g.: effects
- Triple buffering?
*Data taken on Intel® QX9650 processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2 GB memory, Windows Vista™ Ultimate.
41
Scalability
- While we are render-thread bound, we can push more physics/effects, for instance, or more AI
- But it is hard to find the right balance between CPU and GPU workload!
- Example: falling cars, aka pushing more physics
42
Scalability
- ~256 cars falling and bouncing
- 4C MT is 1.42x* faster than 4C ST, and 3.23x* faster than 1C
- PhysX™ helped us a lot to spread the workload, but occupies the other cores quite heavily, thus preventing D3D/drivers from taking advantage of them.
- Rendering overhead was not that big with the additional units since they batch well.
Bar chart: FPS (axis 0-50) for 1C, 4C ST and 4C MT
*Data taken on Intel® QX9650 processor at 2.33 GHz, NVIDIA 8800GTX GPU, 2 GB memory, Windows Vista™ Ultimate.
43
Issues
- A proper benchmark system is required
  - A fly-through benchmark is not enough!
  - The CPU & GPU workloads vary a lot on different maps
- Easy to forget a piece of data that needs to be mirrored
- Lock-free algorithms are nice, but must be used with care
- Memory contention + cache misses + false sharing
- Behaviour of drivers varies quite a lot…
44
Agenda
Graphics and the CPU
Profiling Graphics and Drivers
Threading the render thread
Case Study GRIN
Summary
45
Summary/Conclusion
- The graphics pipeline is still very CPU intensive
- Future CPUs will have an increasing number of logical processors
- It is worth threading your renderer as much as possible if you want to be able to push more things in your game
- Hard to balance the workloads though; need to profile the whole system
- Making the most of the graphics driver is essential
46
References:
- Accurately Profiling Direct3D API Calls: msdn2.microsoft.com/en-us/library/bb172234(VS.85).aspx
- Debugging Tools and Symbols: Getting Started: www.microsoft.com/whdc/devtools/debugging/debugstart.mspx
- Threading the OGRE3D Render System: www.intel.com/cd/ids/developer/asmo-na/eng/dc/games/331359.htm
47