Advanced Linux Game Programming
-
Upload
leszek-godlewski -
Category
Software
-
view
74.002 -
download
9
Transcript of Advanced Linux Game Programming
Advanced Linux Game Programming
Advanced Linux
Game Programming
Leszek Godlewski
Programmer, Nordic Games
Hi everyone, my name is Leszek Godlewski and I'm here today to talk about some of the more advanced aspects of Linux game programming.
Or, porting your game engine to Linux, to be more precise.
These notes will only supplement the slides, they are not a transcript.
Nordic Games GmbH
Started in 2011 as a sister company to Nordic Games Publishing (We Sing)
Base IP acquired from JoWooD and DreamCatcher (SpellForce, The Guild, Aquanox, Painkiller)
Initially focusing on smaller, niche games
Acquired THQ IPs in 2013 (Darksiders, Titan Quest, Red Faction, MX vs. ATV)
Now shifting towards being a production company with internal devs
Since fall 2013: internal studio in Munich, Germany (Grimlore Games)
Leszek Godlewski
Programmer, Nordic Games
PortsPainkiller Hell & Damnation (The Farm 51)
Deadfall Adventures (The Farm 51)
Darksiders (Nordic Games)
Formerly generalist programmer on PKHD & DA at TF51
I'm a programmer with Nordic Games. A bit unusual role with the company since it's mostly a publisher, but I'm the resident Linux guy and generalist programmer where necessary.
I've got three ports under my belt: Painkiller Hell & Damnation and Deadfall Adventures for The Farm 51, and Darksiders for Nordic.
I was also a generalist programmer at TF51 on the original teams for those two games.
Objective of this talk
Your game engine on Linux, before porting:
Missing!
Objective of this talk (cont.)
Your first working Linux port:
Oops. Bat-Signal!
Objective of this talk (cont.)
Where I want to try helping you get to:
In other words, from this:
To this:
And that's mostly debugging
All sorts of debuggers!
apitrace
Who can guess what these are, apart from the obvious ones? :)
Answers, clockwise from top left: GDB, Valgrind, apitrace (doh), Nsight (doh!), GPU PerfStudio, VOGL (doh...).
Demo code available
is.gd/GDCE14Linux
Intended takeaway
Build system improvements
Signal handlers
Memory debugging with Valgrind
OpenGL debugging techniques
The focus of this talk lies in these 4 areas. I'd like to share my findings about optimizing native code builds, signal (crash) handling, jump on to memory debugging stuff using the Valgrind tool, and finish off with OpenGL debugging techniques. There are some bonus slides if there is still time, too.
Intended takeaway Agenda
Build system improvements
Signal handlers
Memory debugging with Valgrind
OpenGL debugging techniques
Actually, let's adopt this as the agenda. :) The upcoming element will be in bold.
Build systems
What I had initially with UE3:Copy/paste of the Mac OS X toolchain
It worked, but...Slow
Huge binaries because of debug symbols
Problematic linking of circular dependencies
This is based on the two UE3 titles, but also applied to Darksiders to some extent (there was no Mac toolchain, had to make one from scratch). It displayed the same problems.
Build systems (cont.)
32-bit binaries required for feature/hardware parity with Windows
Original solution: a chroot jail with an entire 32-bit Ubuntu system just for building
As much as I would love 32-bit to die, I had to support it to achieve full parity with Windows.
A chroot is a bit like a VM a separate, sandboxed view of the entire system environment (i.e. file system), but running on the same kernel.
Host system is Debian amd64.
Cross-compiling for 32/64-bit
gcc -m32/-m64 is not enough!Only sets target code generation
Not headers & libraries (CRT, OpenMP, libgcc etc.)
Fixed by installing gcc-multilibDependency package for non-default architectures (i.e. i386 on an amd64 system and vice versa)
Fortunately, I could get rid of that chroot jail thanks to gcc-multilib.
Clang (ad nauseam)
Clang is fastergcc:3m47s
Clang:3m05s
More benchmarks at Phoronix [LARABEL13]
Clang has different diagnostics than gcc
Did you mean x? and colours for the win.
Clang (cont.)
Preprocessor macro compatibilityDeclares __GNUC__ etc.
Command line compatibilityEasily switch back & forth between Clang & gcc
Clang caveats
C++ object files may be incompatible with gcc & fail to link (need full rebuilds)
Clang is not as mature as gccOccasionally has generated faulty code for me (YMMV)
C object files are unaffected.
Clang seems to have slightly different C++ symbol name generation rules (pre-mangling).
Clang caveats (cont.)
Slight inconsistencies in C++ standard strictnessTemplates
Anonymous structs/unions
May need to add this-> in some places
May need to name some anonymous types
So: Clang or gcc?
Both:Clang quick iterations during development
gcc final shipping binaries
Linking GNU ld
Default linker on Linux
Ancient
Single-threaded
Requires specification of libraries in the order of reverse dependency...
We are not doomed to use it!
Linking GNU gold
Multi-threaded linker for ELF binariesld:18s
gold:5s
Developed at Google, now officially part of GNU binutils
Officially in binutils ships as an optional part of default toolchain on all modern distros.
Linking GNU gold (cont.)
Drop-in replacement for ldMay need an additional parameter or toolchain setupclang++ -B/usr/lib/gold-ld ...
g++ -fuse-ld=gold ...
Still needs libs in the order of reverse dependency...
Linking reverse dependency
Major headache/game-breaker with circular dependencies
Proper fix: re-specify the same libraries over and over againgcc app.o -lA -lB -lA
And not only circular sometimes it's tricky to even get regular dependencies in the right order.
Circular actually happened to me with the PhysX 2.8.4 loader. That was even worse because even though it seemed to have linked, the dynamic linker at runtime still failed to resolve everything...
Linking reverse dep. (cont.)
app
A
B
Here's how this reverse dependency thing works: you first declare the dependent object, then its dependency. Starting the construction of a building from the roof, if you will.
First, an app object file is linked. It has unsatisfied imports for symbols from library A.
Linking reverse dep. (cont.)
app
A
B
Library A's exports satisfy the imports of app, but they import symbols from B.
Linking reverse dep. (cont.)
app
A
B
Library B's exports match library A's imports, but B now has unsatisfied imports of A's symbols.
Linking reverse dep. (cont.)
app
A
B
A
Just the missing symbols
By re-specifying A, the linker satisfies the A imports in library B.
Linking library groups
Declare library groups insteadWrap library list with --start-group, --end-groupShorthand: -(, -)
g++ foo.obj -Wl,-\( -lA -lB -Wl,-\)
Results in exhaustive search for symbols
Linking library groups (cont.)
Actually used for non-library objects (TUs)
Caveat: the exhaustive search!Manual warns of possible performance hit
Not observed here, but keep that in mind!
Running the binary in debugger
inequation@spearhead:~/projects/largebinary$ gdb -silent largebinaryReading symbols from /home/inequation/projects/largebinary/largebinary...[zzz... several minutes later...]done.(gdb)
This is your typical GDB session start at the beginning of porting.
Caching the gdb-index
Large codebases generate heavy debug symbols (hundreds of MBs)
GDB does symbol indexing at every single startup Massive waste of time!
Caching the gdb-index (cont.)
Solution: fold indexing into the build process
Old linkers: as described in [GNU01]
New linkers (i.e. gold): --gdb-indexMay need to forward from compiler driver: -Wl,--gdb-index
Old linkers need some shell script magic run GDB in batch mode etc.
Here's what I had been using before discovering --gdb-index:
@echo Generating gdb-index in $(OUTPUT_PATH)/gdb-index/$(notdir $@).gdb-index@mkdir -p $(OUTPUT_PATH)/gdb-index@$(GDB) -batch -ex "save gdb-index $(OUTPUT_PATH)/gdb-index" $@@echo Merging gdb-index into $@@$(OBJCOPY) --add-section .gdb_index=$(OUTPUT_PATH)/gdb-index/$(notdir $@).gdb-index --set-section-flags .gdb_index=readonly $@ $@
Agenda
Build system improvements
Signal handlers
Memory debugging with Valgrind
OpenGL debugging techniques
Signal handlers
Unix signals are async notifications
Sources can be:the process itself
another process
user
kernel
Signal handlers (cont.)
A lot like interruptsJump to handler upon first non-atomic op
Normal programflowNormal flowresumesSignalhandlerexecution
SIGNAL
A program can possibly resume normal execution after a signal.
Signal handlers (cont.)
System installs a default handlerUsually terminates and/or dumps coreCore minidump in Windows parlance, but entire mapped address range is dumped (truncated to RLIMIT_CORE bytes)
See signal(7) for default actions
Signal handler can do different things by default: terminate, dump core or stop (suspend).
I'll talk about RLIMIT_CORE a bit later.
It's actually preferable to perform a clean exit (instead of aborting) in response to SIGTERM and SIGQUIT.
Signal handlers (cont.)
Can (should!) specify custom handlers
Get/set handlers via sigaction(2)void handler(int, siginfo_t *, void *);
Needs SA_SIGINFO flag in sigaction() call
Extensively covered in [BENYOSSEF08]
Interesting siginfo_t fields
si_code reason for sending the signalExamples: signal source, FP over/underflow, memory permissions, unmapped address
si_addr memory location (if relevant)SIGILL, SIGFPE, SIGSEGV, SIGBUS and SIGTRAP
These fields are a good reason for using the extended information handlers (with siginfo_t), as opposed to the traditional signal(2) calls with void sighandler(int) handlers.
si_code is an enumeration, not a bit field. Includes reasons general (sent by user, kernel, async I/O operation etc.) and specific to the signal (e.g. whether SIGILL was caused by invalid opcode, operand or addressing mode etc.).
Interesting signals
Worth catchingSIGSEGV, SIGILL, SIGHUP, SIGQUIT, SIGTRAP, SIGIOT, SIGBUS, SIGFPE, SIGTERM, SIGINT
Worth ignoringsignal(signum, SIG_IGN);
SIGCHLD, SIGPIPE
Not all signals are interesting.
SIGTRAP is a programmatical breakpoint (you can also raise(SIGTRAP) from code to have the debugger stop). It's up to you to decide how to handle that.
SIGCHLD is only of note if you plan on spawning sub-processes. Similarly for SIGPIPE, if you're interested in notifications for the other end breaking a FIFO pipe.
Signal handling caveats
Prone to race conditionsSignals may be nested
NormalflowSignalhandler
SIGNAL
SignalhandlerSIGNAL
SIGNAL
In reality, this is slightly more complex than that because of masking and deferring of signals. See sigaction(2) for details. Worth noting that by default, further signalling of the same signal type will be deferred until the return of the handler, unless you specify SA_NODEFER or explicitly unmask that signal.
Ben-Yossef recommends using a spinlock, but warns of deadlocking if contending against a thread with higher priority. In that case it's better to just try locking and sleep with pselect(), but that's a overkill in most cases.
Signal handling caveats (cont.)
Prone to race conditionsCan't share locks with the main program
LockmutexLock mutexDeadlock Signalhandler
SIGNAL
Normalflow
What if your code is interrupted right after grabbing the lock?
Unless your mutex is re-entrant and the lock happened on the signalled thread, of course.
Signal handling caveats (cont.)
Prone to race conditionsCan't call async-unsafe/non-reentrant functionsSee signal(7) for a list of safe ones
Notable functions not on the list:printf() and friends (formatted output)
malloc() and free()
Speaking of re-entrant
Signal handling caveats (cont.)
Not safe to allocate or free heap memory
STOMPED?STOMPED?STOMPED?STOMPED?STOMPED?STOMPED?STOMPED?STOMPED?STOMPED?STOMPED?STOMPED?Source: [LEA01]
Allocator's data structures may have been damaged in the fault.
Signal handling caveats (cont.)
Custom handlers do not dump coreAt handler installation time:Raise RLIMIT_CORE to desired core size
Inside handler, after custom logging:Restore default handler using signal(2) or sigaction(2)
raise(signum);
When using sigaction(2) to install your handler, you will need a stash of struct sigaction in static memory and populate it when installing your handler.
You might not be able to bump RLIMIT_CORE due to permissions. You may also want to limit it, as core dumps are often huge because they include a dump of all the mapped memory tranges. However, smaller core size truncation not all structures/ranges can be dumped dump may actually be useless (stack not in the dumped memory range). Important to get a stack trace in the handler anyway.
The raise(3) call may not work unless you unmask the signal. This is demonstrated in the code example.
Also worth noting that apps are not really in control of core dump filename (/proc/sys/kernel/core_pattern or sysctl).
Safe stack walking
glibc provides backtrace(3) and friends
Symbols are read from the dynamic symbol tablePass -rdynamic at compile-time to populate
The backtrace() function is easy to use, just pass it a static array of void pointers and its size.
Safe stack walking (cont.)
backtrace_symbols() internally calls malloc()Not safe...
Still, can get away with it most of the time (YMMV)
Personally, I haven't had any problems with just calling backtrace_symbols() and calling it a day, but I imagine it may be a bad idea in memory-constrained situations or when nested signalling happens.
Example proper solution
Fork a watchdog process in main()Communicate over a FIFO pipe
In signal handler:Collect & send information down the pipe
backtrace_symbols_fd() down the pipe
Demo code: is.gd/GDCE14Linux
Pipe communication protocol up to the developer.
Agenda
Build system improvements
Signal handlers
Memory debugging with Valgrind
OpenGL debugging techniques
Is this even related to porting?
Yes! Portability bugs easily overlooked
Hardcoded struct sizes/offsets
OpenGL buffers
Incorrect binary packing/unpacking
How did we/they manage to ship that?!
One such case was not setting the right OpenGL pixel packing/unpacking rules when uploading textures, resulting in reading beyond the buffer.
What is Valgrind?
Framework for dynamic, runtime analysis
Dynamic recompilationmachine code IR tool machine code
Performance typically at 25-20% of unmodified codeWorse if heavily threaded execution is serialized
What is Valgrind? (cont.)
Many tools in it:Memory error detectors (Memcheck, SGcheck)
Cache profilers (Cachegrind, Callgrind)
Thread error detectors (Helgrind, DRD)
Heap profilers (Massif, DHAT)
Memcheck doesn't watch the stack or static variables, only the heap. SGcheck the reverse.
Memcheck basics
Basic usage extremely simpleas long as you use the vanilla libc malloc()
valgrind ./app
Will probably report a ton of errors on the first run!Again: How did they manage to ship that?!
Memcheck basics (cont.)
Many false positives, esp. in 3rd partiesXlib, NVIDIA driver
Can suppress them via suppress filesCall Valgrind with --gen-suppressions=yes to generate suppression definitions
Be careful with that! Can let OpenGL bugs slip!
Valgrind will also watch out for allocations made by the driver.
Contrived example
#include int main(int argc, char *argv[]) { int foo, *ptr1 = &foo; int *ptr2 = malloc(sizeof(int)); if (*ptr1) ptr2[1] = 0xabad1dea; else ptr2[1] = 0x15bad700; ptr2[0] = ptr2[2]; return *ptr1;}
Demo code:
is.gd/GDCE14Linux
This very bad code makes several mistakes:Branch on uninitialized variable
Overwrite memory past malloced area
Read past malloced area
Make a syscall with uninitialized memory as a parameter (return code)
Leak memory allocated to ptr2.
Actual demo code in repo also has an example for use-after-free.
Valgrind output for such
==8925== Conditional jump or move depends on uninitialised value(s)==8925== Invalid write of size 4==8925== Invalid read of size 4==8925== Syscall param exit_group(status) contains uninitialised byte(s)==8925== LEAK SUMMARY:==8925== definitely lost: 4 bytes in 1 blocks
Memcheck easily identifies all the bugs. Actual output includes call stacks, both for when the errors occur and for the allocation/freeing of the relevant memory chunk.
Running with --leak-check=full also pinpoints the source of the leak.
What about custom allocators?
Custom memory pool & allocation algo
Valgrind only sees mmap()/munmap() of multiples of entire memory pages
All access within those pages now valid!
How to track errors?
Valgrind has no way of knowing about individual allocations and frees from the pool. Thus it won't protect control structures etc.
Worth noting that mmap() calls for mapping large address ranges will cause Valgrind to issue a warning (not sure about the threshold value, but a 512MB heap does trigger it).
Client requests
Allow annotation of custom allocators
~20 C macros defined in valgrind.hCommon and per-tool requests exist
Can be cut out with -DNVALGRIND
Detailed description in [VALGRIND01]
Client requests have negligible performance impact and are no-ops if Valgrind is not present.
Example: Instrumenting dlmalloc
2.8.4 instrumentation from [CRYSTAL01]
Demo code: is.gd/GDCE14LinuxCompile the sample with -DDLMALLOC
Similar results to libc malloc()
This instrumentation could be improved by using the MEMPOOL family of client requests instead, but it's still quite verbose and useful.
Other uses of client requests
Pointer validationIs address mapped? Is it defined?
Mid-session leak checksLevel transitions
Pointer validation could be used in garbage collectors, for instance.
Other uses of client req. (cont.)
Poisoning memory regionsEnsuring signal handlers don't touch the heap
Ensuring geometry buffers aren't read on CPU
Source: [LEA01]
I'm sure you'll find a ton of other uses for making sure some data stays untouched.
Debugging inside Valgrind
A gdbserver for remote debugging
SIGTRAP (breakpoint) on every error
Unlimited memory watchpoints!Data breakpoints in Visual Studio parlance
Cf. 4 single-word hardware debug registers on x86
Debugging inside Valgrind (cont.)
Terminal A:valgrind --vgdb=yes --vgdb-error=0 ./MyGame
Terminal B:gdb ./MyGame
target remote | vgdb
Agenda
Build system improvements
Signal handlers
Memory debugging with Valgrind
OpenGL debugging techniques
I will deliberately not speak of IHV tools because they are complex enough to warrant separate talks. I'll leave that to the IHV's dev relations teams. :)
Ye Olde Way
Call glGetError() after each OpenGL call
Get 1 of 8 (sic!) error codes
Look up the call in the manual
See what this particular error means in this particular context
Ye Olde Way (cont.)
Then check what was actually the case6 possible reasons for GL_INVALID_VALUE in glTexImage*() alone! See [OPENGL01]
Usually: attach a debugger, replay the scenario
This sucks!
Ye Olde Way (cont.)
Then check what was actually the case6 possible reasons for GL_INVALID_VALUE in glTexImage*() alone! See [OPENGL01]
Usually: attach a debugger, replay the scenario
This sucks! used to suck
No longer does, because we no longer need to do that.
Debug callback
Never call glGetError() again!
Much more detailed informationIncl. performance tips from the driver
Good to check what different drivers say
May not work without a debug OpenGL context (GLX_CONTEXT_DEBUG_BIT_ARB)
The extension spec says that it's only suggested that driver implementers make it available only in debug contexts. It might just work in a regular one with your driver, but better stay on the safe side.
Debug callback (cont.)
Provided by either of (ABI-compatible): GL_KHR_debug [OPENGL02], GL_ARB_debug_output [OPENGL03]
OpenGLOpenGL ESNVIDIA (official)AMD (official)Intel (Mesa)AMD (Mesa)
ARB_debug_output
KHR_debug
State of Mesa at the time of writing this, on my ThinkPad T500 with Debian amd64 testing.
Debug callback (cont.)
void callback(GLenum source, GLenum type, GLuint id, GLenum severity, GLsizei length, const GLchar* message, const void* userParam);
Filter by source, type, severity or individual messages
You don't get just the message, but also some metadata and a user pointer.
IDs are unregulated, so drivers may use them as they see fit. They will definitely vary from vendor to vendor.
Debug callback (cont.)
Verbosity can be controlled (filtering)glDebugMessageControl[ARB]()
[OPENGL02][OPENGL03]
Turn to 11 (GL_DONT_CARE) for valuable perf information!Memory type for buffers, unused mip levels
You can also filter the messages programatically with ifs.
Intel is pretty verbose by default; NVIDIA needs setting all channels to GL_DONT_CARE to spew out more performance hints.
API call tracing
Record a trace of the run of the application
Replay and review the traceLook up OpenGL state at a particular call
Inspect state variables, resources and objects: textures, shaders, buffers...
apitrace or VOGL
Well, this is not helpful...
It's a flat list. Good luck finding a faulty call in a multi-pass renderer in it.
Much better!
Does not a tree view improve things?
KHR_debugEXT_debug_markerEXT_debug_labelGREMEDY_string_markerGREMEDY_frame_terminator
One-off messages
Call grouping
Object labels
Frame terminators
SupportGoodLimitedLimitedLimitedLimited
Annotating the call stream
Annotating the call stream (cont.)
All aforementioned extensions supported by apitrace regardless of driver
Recommended: GL_KHR_debug
Annotating the call stream (cont.)
Call groupingglPushDebugGroup()/glPopDebugGroup()
One-off messagesglDebugMessageInsert[ARB]()
glStringMarkerGREMEDY()
Object labelling
glObjectLabel(), glGetObjectLabel()Buffer, shader, program, vertex array, query, program pipeline, transform feedback, sampler, texture, render buffer, frame buffer, display list
glObjectPtrLabel(), glGetObjectPtrLabel()Sync objects
Annotation caveats
Multi-threaded grouping may break hierarchy
glDebugMessageInsert() calls the debug callback, polluting error streamsWorkaround: drop if source == GL_DEBUG_SOURCE_APPLICATION
In multi-threaded apps, look at the thread ID from which you're trying to push or pop a debug group.
Callback filtering can be achieved via glDebugMessageControl[ARB]() or programmatically. Just remember to always set source to GL_DEBUG_SOURCE_APPLICATION in your glDebugMessageInsert[ARB]() calls.
Example 1: PIX events emulation
#define D3DPERF_BeginEvent(colour, name) \ if (GLEW_KHR_debug && threadOwnsDevice()) \ glPushDebugGroup(GL_DEBUG_SOURCE_APPLICATION,\ (GLuint)colour, -1, name)#define D3DPERF_EndEvent()\ if (GLEW_KHR_debug && threadOwnsDevice())\ glPopDebugGroup()
There's a little trick here since there are no set rules about message IDs, I'm casting the D3DCOLOR (32-bit pixel) to GLuint and using that as an ID.
This was extremely cool to do, because I could instantly take advantage of the existing instrumentation on the whole engine.
Example 2: Game tech demo
University assignment from 2009
Annotated OpenGL 1.4
Demo code: is.gd/GDCE14Linux
The reason this demo was written for such an old version of OpenGL is that it needed to run on an old mobile Intel GPU (i915 I think).
Still works nicely with KHR_debug, though.
Takeaway
gcc-multilib is the prerequisite for 32/64-bit cross-compilation
Switching back and forth between Clang and gcc is easy and useful
Link times can be greatly improved by using gold
Takeaway (cont.)
Caching the gdb-index improves debugging experience
Crash handling is easy to do, tricky to get right
Takeaway (cont.)
Valgrind is an enormous aid in memory debugging
Even when employing custom allocators
OpenGL debugging experience can be vastly improved using some extensions
Questions?
@[email protected]@TheIneQuationKinequation.org
Thank you!
Further Nordic Games information:Kwww.nordicgames.atDevelopment information:Kwww.grimloregames.com
References
LARABEL13 Larabel, M. Clang 3.4 Performance Very Strong Against GCC 4.9 [link]
GNU01 Index Files Speed Up GDB [link]
GNU02 Options for Debugging Your Program or GCC [link]
BENYOSSEF08 Ben-Yossef, G. Crash N' Burn: Writing Linux application fault handlers [link]
LEA01 Lea, D. A Memory Allocator [link]
VALGRIND01 The Client Request mechanism [link]
CRYSTAL01 Crystal Space 3D SDK [link]
OPENGL01 glTexImage2D [link]
OPENGL02 ARB_debug_output [link]
OPENGL03 KHR_debug [link]
XDG01 XDG Base Directory Specification [link]
(), e.g. sigaction(2) Linux Programmer's Manual; to view, type man into a terminal or a web search engine
Special thanks
Fabian GiesenKatarzyna GriksaDamian KobiakaJetro LauhaEric LengyelKrzysztof NarkowiczReinhard PolliceBartomiej WroskiKacper Zber
Shout out to people who helped me (in various ways) to prepare this talk!
Bonus slides!
OpenGL resource leak checking
Intel i965 driver vs stack
Locating user data according to FreeDesktop.org guidelines
Thread priorities in Linux
Additional/new debug features
FREE!!!
OpenGL resource leak checking
Courtesy of Eric Lengyel & Fabian Giesenstatic void check_for_leaks(){ GLuint max_id = 10000; // better idea would be to keep track of assigned names. GLuint id; // if brute force doesn't work, you're not applying it hard enough for ( id = 1 ; id