Post on 17-Jan-2017
Video Game Optimization Workshop
Amir H. Fassihi Fanafzar Game Studio
Aug 2012
Fanafzar Game Studio
System Design Requirements
• Functional
• Non-Functional
Non-Functional Requirements
• Maintainability
• Extensibility
• Security
• Scalability
• Intellectual Manageability
• Availability
• Portability
• Usability
• Performance
Performance
The amount of work accomplished by a computer system compared to the time and resources used.
• Short response time
• High throughput
• Low utilization of computer resources
• High availability of applications
• Fast data compression and decompression
• High bandwidth / short data transmission time
Video Games
• Most x-abilities are important
  – Even more so for game engines. (As in enterprise applications)
• Performance is REALLY important!
  – For any game or game engine.
System Design
• Solution for Functional Requirements
• Solution for Non-Functional Requirements
  – Bulk of the technical efforts
  – Conflicts in Design!
  – Performance as the bad boy in the group
  – Performance as the cream of the crop
  – Performance being directly experienced by the end user
Can you make this?
Optimization
• “The process of modifying a software system to make some aspects of it work more efficiently or use fewer resources.”
Optimization Lifecycle
1. Benchmark
2. Detect (Hotspots and Bottlenecks)
3. Solve
4. Check
5. Goto 1
Levels of Optimization
• System Level
• Algorithmic Level
• Micro Level
  – Branch prediction
  – Instruction throughput
  – Latency
Project Lifecycle and Optimization
• Pre-production
• Production
• Post-production
Optimization from High Level to Low Level
Quake Story: High-level architectural optimization came before the low-level triangle draw function (Carmack and Abrash)
http://www.bluesnews.com/abrash/
Measuring Performance in Games
1. Set Specification
   1. Performance Goal (FPS, time)
   2. Hardware Specification
2. Define Line Items
   1. CPU time, RAM, GPU time, Video Mem
   2. Rendering, Physics, Sound, Gameplay, Misc.
Memory Management (God of War)
Memory map (32 Meg memory): 16 Meg for Levels (split into 2), 4 * 1 Meg for Enemies, 1.5 Meg Exe, Run Time Data, Perm Data
• Establish Hard Rules.
  – 16 Meg for Level Data (Split into 2 Levels)
  – 4 * 1 Meg for Enemies
• Maintain 60fps
From: Tim Moss 2006 GDC Talk
Tools
• Profilers (Intel VTune, VS Profiler, …)
  – Total time
  – Self time
  – Calls
• System Monitors (Nvidia PerfHud, MS PIX, …)
• System Adjusters (Intel GPA, …)
Holistic Optimization
• Optimization Process
• CPU Bound
• GPU Bound
CPU Bound, Memory
• Prefetching Memory
• Memory Cache
Memory Optimization
• Cache Miss
  – Instruction Cache
  – Data Cache
Memory Hierarchy
Source: Memory Optimization, Christer Ericson, GDC 2003
Data Access Patterns
• Linear Access Forward
for (i = 0; i < numData; ++i)
    memArray[i];
• Linear Access Backward
Data Access Patterns Ctd.
• Periodic Access
struct vertex {
    float pos[3];
    float norm[3];
    float textCoord[3];
};
for (i = 0; i < num; ++i)
    vertexArray[i].pos;  // touches only pos, skipping norm and textCoord each step
• Random Access
AOS vs. SOA
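A minimal sketch of the two layouts, assuming a vertex with position, normal and texture coordinates. The type and function names (`VertexAoS`, `VerticesSoA`, `sumPosXAoS`, `sumPosXSoA`) are hypothetical and only illustrate the idea. Both functions compute the same sum; the difference is how much of each fetched cache line the loop actually uses.

```cpp
#include <vector>

// Array of Structures (AoS): all attributes of one vertex sit together.
struct VertexAoS {
    float pos[3];
    float norm[3];
    float texCoord[2];
};

// Structure of Arrays (SoA): each attribute gets its own contiguous array.
struct VerticesSoA {
    std::vector<float> posX;
    std::vector<float> posY;
    std::vector<float> posZ;
};

// Reads one float per vertex but steps over a whole VertexAoS each time,
// so most of every fetched cache line is unused by this loop.
float sumPosXAoS(const std::vector<VertexAoS>& vertices) {
    float sum = 0.0f;
    for (const VertexAoS& v : vertices) {
        sum += v.pos[0];
    }
    return sum;
}

// Reads a densely packed array, so every float fetched is used.
float sumPosXSoA(const VerticesSoA& vertices) {
    float sum = 0.0f;
    for (float x : vertices.posX) {
        sum += x;
    }
    return sum;
}
```

Which layout wins depends on the access pattern: code that always needs every attribute of a vertex together may be just as well served by AoS, so measure before converting.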
Critical Stride
• Stride size in memory read can cause cache thrashing
Strip Mining
for {
    access pos;
}
for {
    access norm;
}
------------------------------------------------------
for {
    access pos;
    access norm;
}
Memory
• Stack
  – Temporal coherence, spatial locality
• Global
  – No fragmentation, freed at end
• Heap
  – new, delete, malloc, free
  – No spatial locality, no temporal coherence, fragmentation
Load-Hit-Store
• Write data to address x and then read the data from address x -> Large stall
• Writing data all the way to the main memory through all caches -> 40 to 80 CPU cycle delay
• http://assemblyrequired.crashworks.org/2008/07/08/load-hit-stores-and-the-__restrict-keyword/
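A small sketch of one common way the pattern arises and how it is often avoided. The function names are hypothetical. In the first version the running total is written through a pointer and read back on every iteration; in the second it stays in a local variable and is stored once. Whether the first version actually stalls depends on the CPU and on what the compiler can prove about aliasing, so treat this as an illustration rather than a measured result.

```cpp
#include <cstddef>

// Accumulates directly through the output pointer. Because total could alias
// values, the compiler generally has to store to *total and load it back on
// every iteration: a write to an address followed by a read of that address.
void sumThroughPointer(const int* values, std::size_t count, int* total) {
    *total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        *total += values[i];
    }
}

// Accumulates in a local variable, which can live in a register,
// and writes to memory only once at the end.
void sumThroughLocal(const int* values, std::size_t count, int* total) {
    int local = 0;
    for (std::size_t i = 0; i < count; ++i) {
        local += values[i];
    }
    *total = local;
}
```

The linked article covers the complementary fix of promising the compiler that pointers do not alias via the compiler-specific `__restrict` keyword.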
Memory Solutions
• Don’t allocate
• Linearize allocations
  – Use arrays
• Memory pools
  – Coherent
  – No fragmentation
  – No construction/destruction
• Don’t construct or destruct
  – Plain Old Structures (POS)
Memory Solutions
• Time scoped pools
  – Frame allocator
  – Pool for one level content, discarded at the end
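A minimal sketch of a frame allocator, assuming a fixed-size buffer that is handed out linearly and reset once per frame. The class name `FrameAllocator` and its interface are hypothetical, not taken from any particular engine. It never runs constructors or destructors, so it suits plain data only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class FrameAllocator {
public:
    explicit FrameAllocator(std::size_t capacity)
        : buffer_(capacity), offset_(0) {}

    // Bumps a pointer through the buffer. alignment must be a power of two.
    // Returns nullptr when the request does not fit in the remaining space.
    void* allocate(std::size_t size,
                   std::size_t alignment = alignof(std::max_align_t)) {
        const std::uintptr_t base =
            reinterpret_cast<std::uintptr_t>(buffer_.data());
        const std::uintptr_t current = base + offset_;
        const std::uintptr_t mask =
            static_cast<std::uintptr_t>(alignment) - 1;
        const std::uintptr_t aligned = (current + mask) & ~mask;
        const std::size_t newOffset =
            static_cast<std::size_t>(aligned - base) + size;
        if (newOffset > buffer_.size()) {
            return nullptr;
        }
        offset_ = newOffset;
        return reinterpret_cast<void*>(aligned);
    }

    // Discards everything allocated this frame in constant time.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    std::vector<unsigned char> buffer_;
    std::size_t offset_;
};
```

Typical use would be to call `allocate` for scratch data during a frame and `reset` at the end of it; a per-level pool follows the same shape with `reset` called on level unload.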
Memory Manager
“If you don’t have a custom memory manager in your game, you’re a fool (or a PC game developer)” Christer Ericson, Director of Tools and Technology, Sony Santa Monica
Memory Related Solutions
• Reducing memory footprint at compile time and runtime
• Algorithms that reduce memory fetching
• Reduce cache misses
  – Spatial Locality
  – Proper Stride
  – Correct Alignment
• Increase Temporal Coherence
• Utilize Pre-fetching
• Avoid worst-case access patterns that break caching
Pitfalls of Object Oriented Programming
Summary of study (Tony Albrecht, 2009)
• Case study for CPU side rendering code
• Just re-organizing data locations was a win
• + pre-fetching is more win
• Can you decouple data from objects?
• Be aware of what the compiler and hardware are doing, watch the generated assembly!
Pitfalls of OOP
• Optimize for data first, then code
  – Memory access is going to be your biggest bottleneck
• Simplify Systems
  – KISS
  – Easier to optimize, Easier to parallelize
• Keep code and data homogeneous
• Not everything needs to be an object
Pitfalls of OOP
• You are writing a game
  – You have control over the input data
  – Don’t be afraid to pre-format it if needed
• Design for specifics, not generics
Data Oriented Design
• Better performance
• Better realization of code optimization
• Often simpler code
• More parallelizable code
CPU Bound: Compute
• Dominated by arithmetic operations rather than loads and stores
CPU Compute: Solutions
• Compiler flags (float: precise/fast)
• Time against Space
  – Use of lookup tables
• Memoization
• Function Inlining
• Branch prediction, out of order execution
  – Branch mis-prediction is much less costly than cache miss
• Make branches more predictable
CPU Compute: Solutions
• Remove Branches
  – if (a) z = c; else z = d;
  – z = a * c + (1 - a) * d   (a is 0 or 1)
• Profile Guided Optimization
• Loop unrolling
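The branch-removal bullet can be sketched as follows, assuming 32-bit integer operands. The function names are hypothetical. The arithmetic form matches the formula on the slide; the mask form is a common variant added here for comparison. Compilers frequently turn the plain `if` into a conditional move on their own, so it is worth checking the generated assembly and measuring before rewriting code this way.

```cpp
#include <cstdint>

// Branching form: if (a) z = c; else z = d;
inline std::int32_t selectBranch(bool a, std::int32_t c, std::int32_t d) {
    if (a) {
        return c;
    }
    return d;
}

// Arithmetic form: z = a * c + (1 - a) * d, where a is 0 or 1.
inline std::int32_t selectArithmetic(bool a, std::int32_t c, std::int32_t d) {
    const std::int32_t flag = static_cast<std::int32_t>(a);
    return flag * c + (1 - flag) * d;
}

// Mask form: 0 - 1 wraps to all ones in unsigned arithmetic, so the mask is
// either all ones (pick c) or all zeros (pick d).
inline std::int32_t selectMask(bool a, std::int32_t c, std::int32_t d) {
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(a);
    const std::uint32_t bits = (static_cast<std::uint32_t>(c) & mask) |
                               (static_cast<std::uint32_t>(d) & ~mask);
    // Converting back assumes two's complement, which holds on mainstream
    // compilers and is required from C++20 onward.
    return static_cast<std::int32_t>(bits);
}
```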
Loop Unrolling
for (i = 0; i < 100; ++i)
    sum += intArray[i];
------------------------------------------------------
// sum1..sum4 start at zero; 100 is a multiple of 4, so no remainder loop is needed
for (i = 0; i < 100; i += 4) {
    sum1 += intArray[i];
    sum2 += intArray[i+1];
    sum3 += intArray[i+2];
    sum4 += intArray[i+3];
}
sum = sum1 + sum2 + sum3 + sum4;
Virtual Functions
• How slow are virtual functions really? http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/
• 1000 iterations over 1024 vectors
• 12,288,000 function calls
• Virtual: 159.856 ms
• Direct: 67.962 ms
• Inline: 8.040 ms
Slow Virtual Functions
• Problem is not the cost of looking up the indirect function pointer from vtable.
• The issue lies in “branch prediction” and the way marshalling parameters for the calling convention can get in the way of good instruction scheduling.
Micro Optimization
• Bit Tricks
  – Bitwise Swap
    • X ^= Y; Y ^= X; X ^= Y;
  – Bitmasks
    • isFlagSet = someInt & MY_FLAG; someInt |= Flag2;
    • Example use: Collisions in Physics
  – Fast Modulo
    • X % Y = X & (Y - 1) iff Y is a power of 2
  – Even and Odd
    • (X & 1) == 0; // same as X % 2 == 0
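The tricks above can be written as small helpers, assuming unsigned 32-bit values. The flag names and function names are hypothetical. Two caveats are encoded in the code: the XOR swap zeroes the value if both references name the same variable, and the fast modulo is only valid when the divisor is a power of two.

```cpp
#include <cstdint>

// Hypothetical collision-layer flags, one bit each.
constexpr std::uint32_t COLLIDES_WITH_PLAYER = 1u << 0;
constexpr std::uint32_t COLLIDES_WITH_ENEMY  = 1u << 1;

inline bool isFlagSet(std::uint32_t value, std::uint32_t flag) {
    return (value & flag) != 0;
}

inline std::uint32_t setFlag(std::uint32_t value, std::uint32_t flag) {
    return value | flag;
}

inline bool isPowerOfTwo(std::uint32_t y) {
    return y != 0 && (y & (y - 1)) == 0;
}

// Equals x % y only when y is a power of two; the caller must ensure that.
inline std::uint32_t fastModulo(std::uint32_t x, std::uint32_t y) {
    return x & (y - 1);
}

inline bool isEven(std::uint32_t x) {
    return (x & 1u) == 0;
}

// XOR swap. Swapping a variable with itself would zero it, hence the guard.
inline void xorSwap(std::uint32_t& x, std::uint32_t& y) {
    if (&x != &y) {
        x ^= y;
        y ^= x;
        x ^= y;
    }
}
```

Modern compilers usually perform the modulo and parity rewrites themselves for constant operands, so these forms matter most when the divisor is only known at run time.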
Book on Bit Tricks
• Hacker’s Delight (Henry S. Warren, Addison Wesley, 2003)
Other Micro Optimization
• Data type conversion
• SSE Instructions
• Removing loop invariant code
• Loop unrolling
• Cross-.obj optimization
  – Whole program optimization
• Hardware Specific Optimizations
Vector vs. List
• Random data insertion and deletion into a c++ vector and list compared
• Data kept sorted in the containers
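A sketch of the insertion half of such a comparison, assuming integer elements. The overloaded `insertSorted` helpers are hypothetical and are not the benchmark code behind the slides. The vector version can binary-search and then shifts a contiguous block; the list version has to walk node by node to find the position, which means chasing pointers through memory.

```cpp
#include <algorithm>
#include <list>
#include <vector>

// Vector: binary search for the position, then shift the tail by one slot.
void insertSorted(std::vector<int>& container, int value) {
    container.insert(
        std::upper_bound(container.begin(), container.end(), value), value);
}

// List: linear walk to the first larger element, then a constant-time link.
void insertSorted(std::list<int>& container, int value) {
    auto position = std::find_if(container.begin(), container.end(),
                                 [value](int x) { return x > value; });
    container.insert(position, value);
}
```

Timing these two on the target hardware with realistic element counts is the only reliable way to compare them; the cost of the list walk is dominated by cache behavior rather than by the number of comparisons.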
Vector vs. List Results
Vector vs. List Ctd.
STL iterator debugging
STL Iterator Debugging and Secure SCL http://channel9.msdn.com/Shows/Going+Deep/STL-Iterator-Debugging-and-Secure-SCL
Copy vs. Move
• Vector of strings with 4 dimensions
• 100 x 100 x 100 x 500
• Construction: 564 ms
• Copy Construction: 537 ms
• Move Construction: 0.001 ms
• Empty Destruction: 0.001 ms
• Destruction: 285 ms
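A reduced sketch of why the move is so much cheaper, assuming a two-dimensional grid of strings instead of the four dimensions measured above. The `Grid` alias and the helper functions are hypothetical. Copying duplicates every inner vector and string; moving only takes over the outer vector's buffer, so its cost does not depend on how much data the grid holds.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Grid = std::vector<std::vector<std::string>>;

Grid makeGrid(std::size_t rows, std::size_t cols) {
    return Grid(rows, std::vector<std::string>(cols, "payload"));
}

// Deep copy: allocates and copies every row and every string.
Grid copyGrid(const Grid& source) {
    return source;
}

// Move: the result takes over source's storage. source is left in a valid
// but unspecified state and should not be read afterwards.
Grid moveGrid(Grid& source) {
    return std::move(source);
}
```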
GPU Bound
• GPU related issues
  – Synchronization
  – Capabilities Management
  – Resource Management
  – Global Ordering
    • Reflections/Shadows before scene
    • Opaque front to back/Translucent back to front
    • Sort by material or texture to reduce state changes
  – Instrumentation
  – Debugging
GPU Optimization Tricks
• State Changes
• Draw Call (Most common issue)
• Instancing and Batching
  – Shader Instancing
  – Hardware Instancing
• Video RAM
  – Device Resets
  – Resource uploads/locks
    • Minimize Copies
    • Minimize Locks
    • Double Buffer
GPU Optimization Ctd.
• Fragmentation
  – Power of 2 allocations help
• Lock culling
  – Debug visualization for those culled
• Texture debugging
  – Different texture for each mip level
GPU Bound?
• Spend a long time in API calls (Draw calls or swap/present frame buffer)
• Front End / Back End
  – Triangles/Geometry
  – Pixels/Shaders
  – Vary each workload and measure performance
Back End
• Fill Rate (ex. 1000 MP/sec)
  – FPS, Overdraw, resolution
  – Fill Rate / FPS = overdraw * resolution
  – Render Target Format (16 / 32 bit)
  – Blending
    • Transparency instead of translucency
  – Shading
    • Pixel shaders
  – Texture Sampling
    • Format, Filter Mode, Count (DXT1)
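The fill-rate relation can be turned into a quick budget check. The sketch below rearranges Fill Rate / FPS = overdraw * resolution to solve for overdraw. The function name and the 1920 x 1080 at 60 FPS target in the comment are illustrative assumptions, and real hardware rarely sustains its theoretical fill rate, so the result is an upper bound.

```cpp
// Returns how many times each pixel can be drawn per frame, on average,
// before the given fill rate is exhausted.
// fillRatePixelsPerSecond: e.g. 1000 MP/sec = 1000e6
double maxOverdraw(double fillRatePixelsPerSecond, double framesPerSecond,
                   double width, double height) {
    const double pixelsPerFrame = fillRatePixelsPerSecond / framesPerSecond;
    return pixelsPerFrame / (width * height);
}

// Example: 1000 MP/sec at 60 FPS leaves about 16.7 MP per frame.
// At 1920 x 1080 (about 2.07 MP) that allows an overdraw of roughly 8.
```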
Front End
• Bottlenecks
  – Vertex Transformation
    • Lighting calculations, skinning, …
  – Vertex Fetching and caching
    • Vertex format, indexes (16/32 bit)
  – Tessellation
Other GPU factors
• Multi-sample antialiasing (MSAA)
  – Downsample from high-res render
  – Can significantly affect fill-rate
• Lights and Shadows
  – CPU, vertex processing, pixel processing
Forward VS. Deferred
• Multiple render targets needed for deferred
• A lot of fill-rate needed for deferred
• Performance is flattened
Shaders
• Memory
• Inter-shader communication
• Texture sampling (biggest problem with memory)
• Computation
Other shader notes
• Shader compilation
• Shader count
  – Penalty for many shaders in one scene
  – Limits on GPU for shader execution
• Effect framework
  – CgFX, ColladaFX (by tools like Nvidia FX Composer)
  – Oriented more towards ease of use than performance
  – Engines have their own (Unreal 3, Unity, Source, Torque, Gamebryo)
Networking
• Throughput
• Latency
• Reliability
  – Out of order packets
  – Corrupted
  – Truncated
  – Lost
Reliability
• User Datagram Protocol (UDP)
• Transmission Control Protocol (TCP)
Game Networking Data
• Events
  – Guaranteed, Ordered
• State data
  – Unordered, Not Guaranteed (opportunities for optimization)
  – Unless using lock step simulation
Bandwidth
• Bitstreams and Bit packing
  – Flag -> one bit
  – Health -> 7 bits
• Encoding on streams
[Diagram, layers from bottom to top: TCP/UDP; BitStream; Decimation, LZW, Huffman; Most Recent State, Events]
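A minimal sketch of bit packing, assuming bits are written least significant first into a growing byte buffer. `BitWriter` and `BitReader` are hypothetical helper classes, not part of any networking library. With them, a flag and a health value in the range 0 to 127 fit into a single byte instead of two or more.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

class BitWriter {
public:
    // Appends the low bitCount bits of value (bitCount <= 32), LSB first.
    void write(std::uint32_t value, unsigned bitCount) {
        for (unsigned i = 0; i < bitCount; ++i) {
            if (bitPos_ % 8 == 0) {
                bytes_.push_back(0);  // start a fresh byte
            }
            if ((value >> i) & 1u) {
                bytes_.back() |= static_cast<std::uint8_t>(1u << (bitPos_ % 8));
            }
            ++bitPos_;
        }
    }

    const std::vector<std::uint8_t>& bytes() const { return bytes_; }

private:
    std::vector<std::uint8_t> bytes_;
    std::size_t bitPos_ = 0;
};

class BitReader {
public:
    explicit BitReader(std::vector<std::uint8_t> bytes)
        : bytes_(std::move(bytes)) {}

    // Reads bitCount bits in the order they were written.
    // No bounds checking: the caller must not read past the written data.
    std::uint32_t read(unsigned bitCount) {
        std::uint32_t value = 0;
        for (unsigned i = 0; i < bitCount; ++i) {
            const std::uint32_t bit =
                (bytes_[bitPos_ / 8] >> (bitPos_ % 8)) & 1u;
            value |= bit << i;
            ++bitPos_;
        }
        return value;
    }

private:
    std::vector<std::uint8_t> bytes_;
    std::size_t bitPos_ = 0;
};
```

A production bitstream would write whole words at a time and validate reads against the packet length; the bit-by-bit loop here is chosen for clarity.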
Prioritizing Data
• Fill packet with most important data first
• Heuristic for most recent data (ex. how close to player)
• Only send what you must
  – ex. Cull enemy behind the wall
Packets
• Smaller than 1400 bytes
• Send packets regularly (Routers allocate bandwidth to those who use it)
Smooth Experience
• Interpolation
• Extrapolation
  – Client Side Prediction
  – Dead Reckoning
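Both techniques can be sketched in one dimension, assuming each network update carries a timestamp, a position and a velocity. The `Snapshot` struct and both functions are hypothetical. Interpolation renders slightly in the past between two known states; dead reckoning projects the last known state forward and needs correcting when the next update arrives.

```cpp
struct Snapshot {
    double time;      // seconds
    double position;  // one axis, for brevity
    double velocity;  // units per second
};

// Interpolation: blend linearly between two received snapshots.
// Expects a.time <= t <= b.time.
double interpolate(const Snapshot& a, const Snapshot& b, double t) {
    const double span = b.time - a.time;
    if (span <= 0.0) {
        return b.position;
    }
    const double alpha = (t - a.time) / span;
    return a.position + (b.position - a.position) * alpha;
}

// Dead reckoning: extrapolate from the last snapshot, assuming the
// velocity has stayed constant since it was sent.
double deadReckon(const Snapshot& last, double t) {
    return last.position + last.velocity * (t - last.time);
}
```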
Profiling Networking
• Make sure networking code is efficient
  – Measure compute and memory
• Expose what the networking layer is doing
  – Number of packets
  – Bandwidth for each packet
• Be aware of situations where client and server get out of sync.
Mass Storage
• Hard Drives
• CD, DVD
• Blu-Ray
• Flash Drives
Performance Issues
• Seek Time
• Transfer Rate (ex. 75MB/sec)
• Worst Case
  – 8ms delay between blocks on disk
  – 4KB blocks
  – Loading 1MB -> (1024/4) * 8 = 2048 ms = 2 secs
  – Loading 1GB -> 34 min
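The worst-case arithmetic can be checked with a few lines of code. The function names are hypothetical, and the 1 GB case treats 1 GB as 1000 MB, which is an assumption that reproduces the 34 minute figure. The sequential estimate uses the 75 MB/sec transfer rate quoted above to show how far apart the two extremes are.

```cpp
// Worst case: every block costs one full seek delay.
// Returns the total time in milliseconds.
double worstCaseLoadMs(double sizeKB, double blockKB, double delayPerBlockMs) {
    return (sizeKB / blockKB) * delayPerBlockMs;
}

// Best case: one sequential read at the sustained transfer rate.
// Returns the total time in milliseconds.
double sequentialLoadMs(double sizeMB, double transferMBPerSecond) {
    return (sizeMB / transferMBPerSecond) * 1000.0;
}

// 1 MB    -> worstCaseLoadMs(1024, 4, 8)        = 2048 ms
// 1000 MB -> worstCaseLoadMs(1000 * 1024, 4, 8) = 2,048,000 ms, about 34 min
// 1000 MB sequential at 75 MB/sec               = about 13.3 seconds
```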
Rule
• No disk IO in the inner loops
IO Profiling is hard
• File systems optimize themselves based on access patterns
• Disk will rebalance data based on load and sector failure
• Disk, disk controller, file system and OS will cache and reorder requests
• User software may intercept the disk access for virus scanning
• Good idea to test on fresh machines from time to time
Disk IO performance tips
• Limit disk access
• Minimize reads and writes
  – Read larger chunks
• Asynchronous Access
• Optimize file order
• Optimize data for fast loading
  – Space on disk vs. Time to load (ex. decompressing a JPG file)
Disk IO Tips
• Support development and runtime formats
• Support dynamic reloading
• Automate resource processing
• Centralize resource loading
  – Resource Managers
• Preload when appropriate
• Stream
  – First second of sound in memory
  – Small texture mip levels in memory
  – Small mesh LODs in memory
Concurrent Programming
• Data Parallelism
  – Scatter Phase
  – Gather Phase
• Task Parallelism
Threading Performance Problems
• Scalability
• Contention
• Balancing
Scalability
• Achievable speedup is limited by the parallelizable portion of an algorithm
• Amdahl’s Law
  – S(N) = 1 / ((1 - P) + P/N)
  – N: Processors, P: Parallelizable Portion
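Amdahl's Law is easy to evaluate directly. The function below is a direct transcription of S(N) = 1 / ((1 - P) + P/N); the name `amdahlSpeedup` is hypothetical. The comment values show how quickly the serial portion starts to dominate as cores are added.

```cpp
// parallelPortion: fraction of the work that can run in parallel (0 to 1).
// processors: number of processors N (at least 1).
double amdahlSpeedup(double parallelPortion, double processors) {
    return 1.0 / ((1.0 - parallelPortion) + parallelPortion / processors);
}

// With P = 0.9:
//   N = 4    -> about 3.08x
//   N = 16   -> 6.4x
//   N -> inf -> approaches 10x, the 1 / (1 - P) ceiling
```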
Contention
• More than one thread accessing the same resource
• Some solutions
  – Thread Safety (Mutex)
  – Redundant Data
  – Efficient Synchronization (Locks, Atomic Operations, …)
Balancing
• Ensure all cores are busy
• Eliminate starving
False Sharing
False Sharing Ctd.
struct vertex {
    float xyz[3];   // data 1
    float tutuv[2]; // data 2
};
vertex triList[N];
------------------------------------------------------------
struct vertices {
    float xyz[3][N];
    float tutuv[2][N];
};
vertices triList;
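Another common remedy, sketched here as an addition to the layout change above, is to pad per-thread data so that each item sits on its own cache line. The 64-byte line size is an assumption that holds for many desktop CPUs but not all, and the struct names are hypothetical. The intended use is one thread updating `a` while another updates `b`.

```cpp
#include <atomic>
#include <cstddef>

// Assumed cache line size; check the target hardware.
constexpr std::size_t kCacheLineSize = 64;

// Both counters are adjacent and will usually share one cache line, so two
// threads updating them keep invalidating each other's copy of that line.
struct PackedCounters {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Each counter is aligned to its own cache line, so updates to a and b
// no longer contend for the same line.
struct PaddedCounters {
    alignas(kCacheLineSize) std::atomic<long> a{0};
    alignas(kCacheLineSize) std::atomic<long> b{0};
};
```

The cost is memory: the padded struct is at least two cache lines in size, so this is best reserved for data that is known to be written by different threads at high frequency.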
Multi-threaded Profiling
• Look for time spent on synchronization primitives
• Look out for Heisenbugs!
• Assess Amdahl’s Law
• Use multi-threaded profilers
No Synchronization is best
• Lock-free algorithms are great.
• Wait-free algorithms are even better!!
Mike Acton notes on wait free coding: http://cellperformance.beyond3d.com/articles/2009/08/roundup-recent-sketches-on-concurrency-data-design-and-performance.html
Managed Languages
• Execute on a runtime
• C#, Java, JavaScript, Lua, Python, PHP, ActionScript
Concerns for Profiling
• Garbage Collector
• Just in Time compiler
• No high accuracy timers
• Allocation can be costly, usually no stack
Managed/Unmanaged
• Gameplay code is usually not performance critical
• Bottlenecks can be replaced with native code
Dealing with GC
• Memory pressure causes GC to run frequently and cause sudden hitches
• Memory pressure causes big memory footprint and hurts cache efficiency
• Big total working set needs the GC to check all the pointers
• Incremental GC behavior is helpful but high pressure can force GC to collect all
Strategies for dealing with GC
• Less data on heap
• Your own memory management
• Memory pooling
• Reusing temporary objects held as class members instead of creating them as local variables
Dealing with JIT
• JIT activation time is important for performance (startup, after a few function calls, …)
• Constructors usually left out (Heavy initialization code needs to be in a helper function)
• JIT might not be available on all platforms
Optimizing Animation
• Channel Omission
• Quantization
• Sample Frequency and Key Omission
• Curve Based Compression
• Selective Loading and Streaming
• Hardware Skinning
Misc. Optimization Related Topics
• Mesh LOD
• Animation LOD
• AI LOD
• Collision Detection Spatial Partitioning
• Physics Optimizations (GPU, Sleeps, …)
PIX Test Case
• PIX (Performance Investigator for Xbox)
• Part of DirectX SDK
• Used for DirectX based applications
• Used for analyzing Garshasp 1 and Garshasp: Temple of the Dragon (Expansion)
Using PIX to Analyze Garshasp
Selecting Measurement Attributes
In-Game HUD
PIX Report
Garshasp Performance Post-Mortem
• Animation skinning (Intel VTune)
  – Switched to Hardware Skinning
• Asset Loading
  – Used background thread
• Draw Calls
  – Dynamic Far-Clip distance
• High RAM consumption
  – Reduced particle quotas
  – Reduced Area arrangement (changes in camera system needed)
  – Reduced Texture size
  – Better strategies for audio loading/unloading
Garshasp Ctd.
• Large Video memory usage
  – Changed mesh geometry
  – Better seamlessness strategy
• Frame rate drops
  – Better use of particles
  – Modifications to camera angles and seamlessness strategy
  – Smaller areas for more even distribution of resource loading.
Some un-resolved issues
• Un-optimized animation system
• Overdraw
• Slow Game Object update loop
• No static batching
  – Use of vertex color for baked color
• Huge game save data
• Inefficient texture size usage
• No sound/video streaming
• + many more!
Biggest Optimization Related Problem
No internal resource consciousness!
Unity Editor Profiler
Profiler Views
CPU
Deep Calls
Rendering Information
Memory
CPU vs. GPU
References
• Video Game Optimization, Ben Garney and Eric Preisz
• How the left and right brain learned to love one another, Tim Moss, http://timmoss.blogspot.com/2007/02/it-seems-reasonable-that-my-very-first.html
• Optimization is a Full time job, Maciej Sinilo, http://msinilo.pl/blog/?p=483
• Memory Optimization, Christer Ericson, http://www.research.scea.com/research/pdfs/GDC2003_Memory_Optimization_18Mar03.pdf
• A pragmatic approach to optimization, Niklas Frykholm, http://bitsquid.blogspot.com/2011/12/pragmatic-approach-to-performance.html
• Hacker’s Delight, Henry S. Warren, Addison Wesley, 2003
• Advanced Bit Manipulation-fu, Christer Ericson, http://realtimecollisiondetection.net/blog/?p=78
• Networking for Programmers, Glenn Fiedler, http://gafferongames.com/networking-for-game-programmers/
• Source Multiplayer Networking, Valve Software, https://developer.valvesoftware.com/wiki/Source_Multiplayer_Networking
• False sharing and its effect on memory performance, William J. Bolosky, http://static.usenix.org/publications/library/proceedings/sedms4/full_papers/bolosky.txt
• Concurrency, Data Design and Performance, Mike Acton, http://cellperformance.beyond3d.com/articles/2009/08/roundup-recent-sketches-on-concurrency-data-design-and-performance.html
• Diving down the concurrency rabbit hole, Mike Acton, http://www.insomniacgames.com/tech/articles/0809/files/concurrency_rabit_hole.pdf
• Scalar Quantization, Jonathan Blow, http://number-none.com/product/Scalar%20Quantization/index.html
• Are we out of memory, Christian Gyrling, http://www.swedishcoding.com/2008/08/31/are-we-out-of-memory/
• Practical Efficient Memory Management, Jesus De Santos, http://entland.homelinux.com/blog/2008/08/19/practical-efficient-memory-management/
• Load Hit Store and the restrict keyword, Elan Ruskin, http://assemblyrequired.crashworks.org/2008/07/08/load-hit-stores-and-the-__restrict-keyword/
• How slow are virtual functions really, Elan Ruskin, http://assemblyrequired.crashworks.org/2009/01/19/how-slow-are-virtual-functions-really/
• Current Generation Parallelism in Games, Jon Olick, http://s08.idav.ucdavis.edu/olick-current-and-next-generation-parallelism-in-games.pdf
• Real Life Performance Pitfalls, Alan Murphy, http://www.microsoft.com/en-us/download/confirmation.aspx?id=3539
• Graphics Programming Black Book, Michael Abrash
• Zen of Code Optimization, Michael Abrash
• The Free Lunch is Over, Herb Sutter, http://www.gotw.ca/publications/concurrency-ddj.htm
• Intel Software Optimization Cookbook, http://www.intel.com/intelpress/sum_swcb2.htm
• Pitfalls of Object Oriented Programming, Tony Albrecht, http://www.reddit.com/r/programming/comments/ag43j/pitfalls_of_object_oriented_programming_pdf/
• Microsoft PIX, http://msdn.microsoft.com/en-us/library/ee663275(v=vs.85).aspx
• Top 10 Myths of Video Game Optimization, http://www.gamasutra.com/view/feature/130296/the_top_10_myths_of_video_game_.php?print=1
Questions?
fassihi@fanafzar.com