Optimizing Game Architectures with Task-based Parallelism


Transcript of Optimizing Game Architectures with Task-based Parallelism

Page 1: Optimizing Game Architectures with Task-based Parallelism

1

Optimizing Game Architectures with Task-based Parallelism

Brad Werth
Intel Senior Software Engineer

Page 2: Optimizing Game Architectures with Task-based Parallelism

2

Parallelism in games is no longer optional

The unending quest for realism in games is causing game content and gameplay to become increasingly complex.

More complicated scenes + more complicated behavior = increased computation.

CPUs and GPUs are no longer competing on clock speed, but on degree of parallelism.

High-end games require threading.

You can't go home again.

Page 3: Optimizing Game Architectures with Task-based Parallelism

3

Threaded architectures for games are challenging to design

Techniques for threading individual computations/systems are well-known, but...
the techniques often have inefficient interactions.
games rely on middleware to provide some functionality – more potential conflict.
the moment-to-moment workload can change dramatically.
the variety of CPU topologies complicates tuning.

Task-based parallelism is a viable way out of this mess. But first, let's gaze into the abyss...

Page 4: Optimizing Game Architectures with Task-based Parallelism

4

A threaded game architecture – full of pain and oversubscription

Particles array could be partitioned. One-off jobs run on "job threads". Physics threads are created by middleware. Sound mixing is on a dedicated thread. Bones/skinning is a Directed Acyclic Graph.

[Diagram: boxes for Particles, Jobs, Bones, Physics, and Sound, each with its own threads, interacting unpredictably.]

Page 5: Optimizing Game Architectures with Task-based Parallelism

5

Tasks are an efficient method for handling all of this parallel work

With a task scheduler, and with all of this work decomposed into tasks, then...

one thread pool can process all work.
oversubscription will be avoided by using the same threads for all parallel work.
the game will scale well to different core topologies without painful tuning.

Tasks can do it!

Page 6: Optimizing Game Architectures with Task-based Parallelism

6

Task-based parallelism is agile threading

A thread is...
run on the OS.
able to be pre-empted.
expected to wait on events.
most efficient with some oversubscription.
optimized for a specific core topology.

A task is...
run on a thread pool.
run to completion.
heavily penalized for blocking.
efficient by avoiding oversubscription.
able to adapt to any number of threads/cores.

Page 7: Optimizing Game Architectures with Task-based Parallelism

7

Tasks are the uninterrupted portions of threaded work

[Diagram: a thread's work timeline – Setup, Processing, Texture Lookup, Data Parallelism – with each uninterrupted span forming a task.]

Page 8: Optimizing Game Architectures with Task-based Parallelism

8

Tasks can be arranged in a dependency graph

[Diagram: the same tasks – Setup, Processing, Texture Lookup, Data Parallelism – rearranged as a dependency graph.]

Page 9: Optimizing Game Architectures with Task-based Parallelism

9

Dependency graph can be mapped to a thread pool

Lots of work means lots of tasks which fill in the gaps in the thread pool.

The decomposition of tasks and mapping to threads is the job of the task scheduler.

Page 10: Optimizing Game Architectures with Task-based Parallelism

10

Task schedulers have similar ingredients but different flavors

Cilk scheduler has been extremely influential.
Most have task queues per thread to avoid contention (often multiple queues per thread).
Cache-aware distribution of work is a key performance feature.
Most prevent direct manipulation of queues.

The APIs vary in some ways:
Constructive schedulers define tasks a priori.
Reductive schedulers subdivide tasks in flight.
Event-driven schedulers trigger off of I/O.
Computation schedulers are triggered manually.

Page 11: Optimizing Game Architectures with Task-based Parallelism

11

Threading Building Blocks is Intel's Open Source task-based scheduler

TBB is a reductive, computation scheduler designed to...
be cross-platform (Windows*, OS X*, Linux, XBox360*).
simplify data parallelism coding.
provide scalability and high performance.

TBB has a high-level API for easy parallelism, low-level API for control.

API is not so low-level that it exposes threads or queues.

*Other names and brands may be claimed as the property of others.

Page 12: Optimizing Game Architectures with Task-based Parallelism

12

Enough! Let's look at code

This talk shows code solutions to threaded game architecture problems.

Common threading patterns in games are decomposed into tasks, using the TBB API.

The code is available:

http://software.intel.com/file/14997

Page 13: Optimizing Game Architectures with Task-based Parallelism

13

Start with the easy stuff – turn independent loops into tasks

The TBB high-level API provides parallel_for().

Behold, the humble for loop:

for(int i = 0; i < ELEMENT_MAX; ++i)
{
    doSomethingWith(element[i]);
}

Page 14: Optimizing Game Architectures with Task-based Parallelism

14

Using parallel_for() is a 2-step process; step 1 is objectify the loop

class DoSomethingContext
{
public:
    void operator()(const tbb::blocked_range<int> &range) const
    {
        for(int i = range.begin(); i != range.end(); ++i)
        {
            doSomethingWith(element[i]);
        }
    }
};

Page 15: Optimizing Game Architectures with Task-based Parallelism

15

parallel_for() step 2: invoke the objectified loop with a range

tbb::parallel_for(tbb::blocked_range<int>(0, ELEMENT_MAX), *pDoSomethingContext);
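Putting both steps together, here is a minimal self-contained sketch. The element array, ELEMENT_MAX, doSomethingWith(), and the doubling "work" are stand-ins for whatever the real loop touches, and the tbb::task_scheduler_init is what TBB versions of this era required to start the worker pool:

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

const int ELEMENT_MAX = 10000;
float element[ELEMENT_MAX];

// Stand-in for the per-element work the loop performs.
void doSomethingWith(float &value) { value *= 2.0f; }

// Step 1: objectify the loop body.
class DoSomethingContext
{
public:
    void operator()(const tbb::blocked_range<int> &range) const
    {
        for(int i = range.begin(); i != range.end(); ++i)
        {
            doSomethingWith(element[i]);
        }
    }
};

// Step 2: invoke the objectified loop over the full range.
int main()
{
    tbb::task_scheduler_init tInit;   // starts the thread pool
    DoSomethingContext tContext;
    tbb::parallel_for(tbb::blocked_range<int>(0, ELEMENT_MAX), tContext);
    return 0;
}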


For more general task decomposition problems, we need a low-level API...

Page 16: Optimizing Game Architectures with Task-based Parallelism

16

TBB low-level API: work trees with blocking calls and/or continuations

[Diagram: two work trees of Root, Task, and More nodes – one using Spawn & Wait, the other using Spawn with a Callback and a separate Wait. Blocking calls go down the tree; continuations go up.]
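To make the two styles concrete, here is a rough sketch, not code from the talk, using hypothetical ChildTask and ContinuationTask classes: a blocking parent spawns its children and waits for them, while a continuation-passing parent hands the follow-up work to a continuation task and returns without waiting.

#include "tbb/task.h"

class ChildTask : public tbb::task
{
public:
    tbb::task *execute() { /* leaf work goes here */ return NULL; }
};

class ContinuationTask : public tbb::task
{
public:
    tbb::task *execute() { /* combine the children's results here */ return NULL; }
};

// Blocking style: the wait travels down the tree.
class BlockingParentTask : public tbb::task
{
public:
    tbb::task *execute()
    {
        tbb::task &tA = *new( allocate_child() ) ChildTask();
        tbb::task &tB = *new( allocate_child() ) ChildTask();
        set_ref_count(3);              // two children + one for the wait
        spawn(tB);
        spawn_and_wait_for_all(tA);    // this worker blocks here until both children finish
        return NULL;
    }
};

// Continuation style: the follow-up work travels up to a continuation task.
class ContinuingParentTask : public tbb::task
{
public:
    tbb::task *execute()
    {
        ContinuationTask &tC = *new( allocate_continuation() ) ContinuationTask();
        tbb::task &tA = *new( tC.allocate_child() ) ChildTask();
        tbb::task &tB = *new( tC.allocate_child() ) ChildTask();
        tC.set_ref_count(2);           // continuation runs once both children finish
        spawn(tB);
        spawn(tA);
        return NULL;                   // no blocking; this worker is free immediately
    }
};

Either parent could be allocated with tbb::task::allocate_root() and run with tbb::task::spawn_root_and_wait().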

Page 17: Optimizing Game Architectures with Task-based Parallelism

17

Work trees can implement common game threading patterns

The TBB low-level API creates and processes trees of work – each node is a task.

Work trees of tasks can be made to process:
Callbacks
Promises
Synchronized callbacks
Long, low priority operations
Directed acyclic graph

We'll look at how these patterns can be decomposed into tasks using the TBB low-level API.

Page 18: Optimizing Game Architectures with Task-based Parallelism

18

Callbacks – send it off and never wait

Callbacks are function pointers executed on another thread.

Execution begins immediately.
No waiting on individual callbacks - can wait in aggregate.

void doCallback(FunctionPointer fFunc, void *pParam);

Page 19: Optimizing Game Architectures with Task-based Parallelism

19

void doCallback(FunctionPointer fCallback, void *pParam)
{
    // allocation with "placement new" syntax
    CallbackTask *pCallbackTask = new( s_pCallbackRoot->allocate_additional_child_of( *s_pCallbackRoot ) ) CallbackTask(fCallback, pParam);
    s_pCallbackRoot->spawn(*pCallbackTask);
}

Code and tree: Callback
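The doCallback() snippet leans on a CallbackTask class and a long-lived root task that aren't shown on the slide. A minimal sketch of those supporting pieces, which may differ from the downloadable sample, could look like this:

#include "tbb/task.h"

typedef void (*FunctionPointer)(void *);

// A task that simply invokes the callback with its parameter.
class CallbackTask : public tbb::task
{
public:
    CallbackTask(FunctionPointer fCallback, void *pParam)
        : m_fCallback(fCallback), m_pParam(pParam) {}
    tbb::task *execute()
    {
        m_fCallback(m_pParam);
        return NULL;
    }
private:
    FunctionPointer m_fCallback;
    void *m_pParam;
};

// Long-lived root that every callback task hangs off of; created once at startup.
static tbb::task *s_pCallbackRoot = NULL;

void initCallbacks()
{
    s_pCallbackRoot = new( tbb::task::allocate_root() ) tbb::empty_task;
    s_pCallbackRoot->set_ref_count(1);   // the extra reference lets us wait "in aggregate" later
}

Waiting in aggregate is then a matter of calling s_pCallbackRoot->wait_for_all(), for example at shutdown.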

Page 20: Optimizing Game Architectures with Task-based Parallelism

20

Callbacks are simple and powerful, but have limits

No waiting! Callbacks are run on demand.
No waiting? Callback has to report its own completion.
No waiting?! Need special case code to run on 1-core system.

If this is a deal-breaker, there are other options...

Page 21: Optimizing Game Architectures with Task-based Parallelism

21

Promises – come back for it later

Promises are an evolution of Callbacks.

Like Callbacks:
Promises are function pointers executed on another thread.
Execution begins immediately.

Unlike Callbacks:
Promises provide a method for efficient waiting.

Promise *doPromise(FunctionPointer fFunc, void *pParam);

Page 22: Optimizing Game Architectures with Task-based Parallelism

22

void doPromise(FunctionPointer fCallback, void *pParam, Promise *pPromise)
{
    // allocation with "placement new" syntax
    tbb::task *pParentTask = new( tbb::task::allocate_root() ) tbb::empty_task();
    pPromise->setRoot(pParentTask);
    PromiseTask *pPromiseTask = new( pParentTask->allocate_child() ) PromiseTask(fCallback, pParam, pPromise);
    pParentTask->set_ref_count(2);
    pParentTask->spawn(*pPromiseTask);
}

Code and tree: Promise setup

Page 23: Optimizing Game Architectures with Task-based Parallelism

23

Code and tree: Promise execution

void Promise::waitUntilDone()
{
    if(m_pRoot != NULL)
    {
        tbb::spin_mutex::scoped_lock lock(m_tMutex);   // named lock object; an unnamed temporary would release immediately
        if(m_pRoot != NULL)
        {
            m_pRoot->wait_for_all();
            m_pRoot->destroy(*m_pRoot);
            m_pRoot = NULL;
        }
    }
}
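The Promise and PromiseTask classes that the two snippets above rely on are not shown on the slides. A sketch consistent with them, which again may differ from the downloadable sample, might be:

#include "tbb/task.h"
#include "tbb/spin_mutex.h"

typedef void (*FunctionPointer)(void *);

class Promise
{
public:
    Promise() : m_pRoot(NULL) {}
    void setRoot(tbb::task *pRoot) { m_pRoot = pRoot; }
    void waitUntilDone();                 // defined as shown above
private:
    tbb::task *m_pRoot;                   // parent task whose ref count reaches 1 when the work is done
    tbb::spin_mutex m_tMutex;             // guards the one-time cleanup in waitUntilDone()
};

// The task that actually runs the promised function.
class PromiseTask : public tbb::task
{
public:
    PromiseTask(FunctionPointer fCallback, void *pParam, Promise *pPromise)
        : m_fCallback(fCallback), m_pParam(pParam), m_pPromise(pPromise) {}
    tbb::task *execute()
    {
        m_fCallback(m_pParam);            // completing this task drops the parent's ref count,
        return NULL;                      // which is what waitUntilDone()'s wait_for_all() watches
    }
private:
    FunctionPointer m_fCallback;
    void *m_pParam;
    Promise *m_pPromise;                  // could be used to publish a result or progress
};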

Page 24: Optimizing Game Architectures with Task-based Parallelism

24

Promises seem almost too good to be true

Blocking wait only if result is not available when requested.

If wait blocks, the current thread actively contributes to completion.

2 files, 3 classes, ~150 lines of code.

Robust Promise systems can also:
Cancel jobs in progress
Get partial progress updates

Page 25: Optimizing Game Architectures with Task-based Parallelism

25

Synchronized Call – wait until all threads call it exactly once

Synchronized Calls can be useful for:
Initialization of thread-specific data
Coordination with some middleware
Instrumentation and profiling

Trivial if you have direct access to threads, but trickier with a task-based system.

void doSynchronizedCallback( FunctionPointer fFunc, void *pParam);

Page 26: Optimizing Game Architectures with Task-based Parallelism

26

Code and tree: Synchronized Call setup

void doSynchronizedCallback(FunctionPointer fCallback, void *pParam, int iThreads)
{
    tbb::atomic<int> tAtomicCount;
    tAtomicCount = iThreads;
    tbb::task *pRootTask = new( tbb::task::allocate_root() ) tbb::empty_task;
    tbb::task_list tList;
    for(int i = 0; i < iThreads; i++)
    {
        tbb::task *pSynchronizeTask = new( pRootTask->allocate_child() ) SynchronizeTask(fCallback, pParam, &tAtomicCount);
        tList.push_back(*pSynchronizeTask);
    }
    pRootTask->set_ref_count(iThreads + 1);
    pRootTask->spawn_and_wait_for_all(tList);
    pRootTask->destroy(*pRootTask);
}

Page 27: Optimizing Game Architectures with Task-based Parallelism

27

Code and tree: Synchronized Call execution

tbb::task *SynchronizeTask::execute()
{
    m_fCallback(m_pParam);
    m_pAtomicCount->fetch_and_decrement();
    while(*m_pAtomicCount > 0)
    {
        // yield while waiting
        tbb::this_tbb_thread::yield();
    }
    return NULL;
}
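The SynchronizeTask class itself is only implied by the two snippets; a declaration consistent with them would look roughly like this:

#include "tbb/task.h"
#include "tbb/atomic.h"

typedef void (*FunctionPointer)(void *);

// One of these is spawned per worker thread; each worker runs the callback once,
// then yields until every worker has done so.
class SynchronizeTask : public tbb::task
{
public:
    SynchronizeTask(FunctionPointer fCallback, void *pParam, tbb::atomic<int> *pAtomicCount)
        : m_fCallback(fCallback), m_pParam(pParam), m_pAtomicCount(pAtomicCount) {}
    tbb::task *execute();                 // defined as shown above
private:
    FunctionPointer m_fCallback;
    void *m_pParam;
    tbb::atomic<int> *m_pAtomicCount;     // counts workers that have not yet run the callback
};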

Page 28: Optimizing Game Architectures with Task-based Parallelism

28

Synchronized Calls are useful, but not efficient

Don't make Synchronized Calls in the middle of other work.

Performance penalty is negated if work queue is empty.

Page 29: Optimizing Game Architectures with Task-based Parallelism

29

Long, low priority operation – hide some time-slicing

Many games have long operations that run in parallel to the main computation:
Asset loading/decompression
Sound mixing
Texture tweaking
AI pathfinding

It's not necessary to create a new thread to handle these operations!

Use the time-honored technique of time-slicing.

Page 30: Optimizing Game Architectures with Task-based Parallelism

30

Code and tree: Long, low priority operation

tbb::task *BaseTask::execute()
{
    if(s_tLowPriorityTaskFlag.compare_and_swap(false, true) == true)
    {
        // allocation with "placement new" syntax
        tbb::task *pLowPriorityTask = new( this->allocate_additional_child_of( *s_pLowPriorityRoot ) ) LowPriorityTask();
        spawn(*pLowPriorityTask);
    }
    // spawn other children...
}
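The LowPriorityTask body is not shown on the slide. One plausible sketch, with hypothetical workIsFinished() and doOneChunkOfWork() helpers and an arbitrary ~1 ms time budget, does a bounded slice of work and then re-arms the flag so a later BaseTask spawns the next slice, avoiding the naive self-rescheduling trap described on the next slide:

#include "tbb/task.h"
#include "tbb/atomic.h"
#include "tbb/tick_count.h"

static tbb::atomic<bool> s_tLowPriorityTaskFlag;   // set to true at startup to arm the first slice
static tbb::task *s_pLowPriorityRoot = NULL;       // long-lived root the BaseTask snippet attaches slices to

// Hypothetical stand-ins for the long operation (mixing, decompression, pathfinding, ...).
static int s_iChunksRemaining = 1000;
static bool workIsFinished() { return s_iChunksRemaining <= 0; }
static void doOneChunkOfWork() { --s_iChunksRemaining; }

class LowPriorityTask : public tbb::task
{
public:
    tbb::task *execute()
    {
        tbb::tick_count tStart = tbb::tick_count::now();
        // Work in small chunks until this slice's time budget is spent.
        while(!workIsFinished() && (tbb::tick_count::now() - tStart).seconds() < 0.001)
        {
            doOneChunkOfWork();
        }
        if(!workIsFinished())
        {
            s_tLowPriorityTaskFlag = true;         // re-arm; the next BaseTask will spawn another slice
        }
        return NULL;
    }
};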

Page 31: Optimizing Game Architectures with Task-based Parallelism

31

Long, low priority operations are tricky to get right

Task-based schedulers won’t swap out a task that runs a long time.

A low-priority task can’t reschedule itself naively, or it will create an infinite loop.

Even if the scheduler is designed with priority in mind, priority only matters when a thread runs dry.

This approach doesn’t guarantee any minimum frequency of execution.

Page 32: Optimizing Game Architectures with Task-based Parallelism

32

Directed Acyclic Graph – everyone's favorite paradigm

Directed Acyclic Graphs are popular for executing workflows and kernels in games.

Interfaces vary, but generally you construct a DAG and then execute it and wait.

How can work trees represent a DAG?

Page 33: Optimizing Game Architectures with Task-based Parallelism

33

Tree: Directed Acyclic Graph

[Diagram: a directed acyclic graph built from several Root/More work trees, with Spawn edges linking one tree's completion to the next tree's start.]
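As an illustration of one possible mapping, not the talk's sample code, a hypothetical DagNodeTask can reuse each task's reference count as a count of unfinished predecessors: a node runs only when that count reaches zero, and on completion it decrements and possibly spawns each successor. Hanging the sink node off an empty root task lets the caller wait on the whole graph.

#include <cstddef>
#include <vector>
#include "tbb/task.h"

// A DAG node: its ref count is (re)used as the number of unfinished predecessors.
class DagNodeTask : public tbb::task
{
public:
    std::vector<DagNodeTask *> m_tSuccessors;
    virtual void doNodeWork() {}                  // hypothetical hook for the node's real work
    tbb::task *execute()
    {
        doNodeWork();
        for(std::size_t i = 0; i < m_tSuccessors.size(); ++i)
        {
            if(m_tSuccessors[i]->decrement_ref_count() == 0)
            {
                spawn(*m_tSuccessors[i]);         // last predecessor releases the successor
            }
        }
        return NULL;
    }
};

// Example wiring: two source nodes A and B both feed a single sink node.
void runSmallDag()
{
    tbb::task *pRoot = new( tbb::task::allocate_root() ) tbb::empty_task;
    pRoot->set_ref_count(2);                      // the sink + one for wait_for_all()

    DagNodeTask *pSink = new( pRoot->allocate_child() ) DagNodeTask;
    pSink->set_ref_count(2);                      // two predecessors

    DagNodeTask *pA = new( tbb::task::allocate_root() ) DagNodeTask;
    DagNodeTask *pB = new( tbb::task::allocate_root() ) DagNodeTask;
    pA->m_tSuccessors.push_back(pSink);
    pB->m_tSuccessors.push_back(pSink);

    pRoot->spawn(*pA);                            // sources have no predecessors, start them now
    pRoot->spawn(*pB);
    pRoot->wait_for_all();                        // returns once the sink has run
    pRoot->destroy(*pRoot);
}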

Page 34: Optimizing Game Architectures with Task-based Parallelism

34

Directed Acyclic Graph gets the job done

The DAGs created by this approach are destroyed by waiting on them.

Persistent DAGs are possible, for re-use across several frames.

A scheduler could be DAG-based to begin with, making this trivial.

Remember, get the code from: http://software.intel.com/file/14997

Page 35: Optimizing Game Architectures with Task-based Parallelism

35

Soon, rendering may also be decomposable into tasks

DirectX* 11 is designed for use on multi-core CPUs.
Multiple threads can draw to local DirectX contexts ("devices"), and those draw calls are aggregated once per frame.
All those draw calls can be done as tasks!
All the threads can be initialized with a DirectX context using Synchronized Callbacks!

This is an extremely positive development; Intel will produce lots of samples to help promote it to the industry.

*Other names and brands may be claimed as the property of others.

Page 36: Optimizing Game Architectures with Task-based Parallelism

36

Our sample architecture can be handled by tasks top-to-bottom

Particles partitioning handled by parallel_for(). One-off jobs using Callbacks or Promises. Physics uses job threads via Synchronized Calls. Sound mixing is time-sliced as Low-Priority job. Bones/skinning DAG uses the job threads, too.

[Diagram: the same five systems – Particles, Jobs, Bones, Physics, and Sound – now all checked off as task-based.]

Page 37: Optimizing Game Architectures with Task-based Parallelism

37

TBB has other helpful features we didn't cover

Beyond the high-level and low-level threading APIs, TBB has:
Atomic variables
Scalable memory allocators
Efficient thread-safe containers (vector, hash, etc.)
High-precision time intervals
Core count detection
Tunable thread pool sizes
Hardware thread abstraction

Page 38: Optimizing Game Architectures with Task-based Parallelism

38

Using task parallelism will ensure continued game performance

Task-based parallelism scales performance on varying architectures.

Break loops into tasks for the maximum performance benefit.

Use tasks to implement a game's preferred threading paradigms.

Page 39: Optimizing Game Architectures with Task-based Parallelism

39

Want more? Here's more

http://www.threadingbuildingblocks.org

[email protected]