Intel Threading Building Blocks (TBB)


http://www.threadingbuildingblocks.org

Overview

• TBB enables you to specify tasks instead of threads

•TBB targets threading for performance

•TBB is compatible with other threading packages

•TBB emphasizes scalable, data-parallel programming

•TBB relies on generic programming

Example 1: Compute Average of 3 numbers

Serial

// Note: The input must be padded such that input[-1] and input[n] exist.
void SerialAverage( float* output, float* input, size_t n ) {
    for( size_t i=0; i<n; i++ )
        output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f);
}

Parallel

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/task_scheduler_init.h"

using namespace tbb;

// Loop body: averages each element of a subrange with its two neighbors.
class Average {
public:
    float* input;
    float* output;
    void operator()( const blocked_range<int>& range ) const {
        for( int i=range.begin(); i!=range.end(); ++i )
            output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f);
    }
};

// Note: The input must be padded such that input[-1] and input[n]
// can be used to calculate the first and last output values.
void ParallelAverage( float* output, float* input, size_t n ) {
    Average avg;
    avg.input = input;
    avg.output = output;
    // Grain size of 1000 iterations per subrange.
    parallel_for( blocked_range<int>( 0, static_cast<int>(n), 1000 ), avg );
}

//! Problem size
const int N = 100000;

int main( int argc, char* argv[] ) {
    float output[N];
    float raw_input[N+2];
    raw_input[0] = 0;
    raw_input[N+1] = 0;
    float* padded_input = raw_input+1;

    task_scheduler_init init;

    // ............ (code elided in the original slide)
    ParallelAverage(output, padded_input, N);
    return 0;
}

Notes on Grain Size

Effect of Grain Size on A[i]=B[i]*c Computation (one million indices)

Auto Partitioner

Reduction

Serial

Parallel

Class for use by Parallel Reduce

Split-Join Sequence

Parallel_scan (parallel prefix)

Parallel Scan

Parallel Scan with Partitioner

• parallel_scan breaks a range into subranges and computes a partial result in each subrange in parallel.

• Then, the partial result for subrange k is used to update the information in subrange k+1, starting from k=0 and proceeding sequentially up to the last subrange.

• Finally, each subrange uses its updated information to compute its final result in parallel with all the other subranges.
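The three phases described above can be sketched with plain standard C++ threads. This is only an illustration of the idea for an inclusive prefix sum, not TBB's parallel_scan interface or implementation; the function name three_phase_scan and the chunks parameter are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Inclusive prefix sum computed in three phases over `chunks` subranges.
// Hypothetical helper for illustration; this is not TBB's parallel_scan.
std::vector<long> three_phase_scan(const std::vector<long>& in, std::size_t chunks) {
    assert(chunks > 0);
    const std::size_t n = in.size();
    std::vector<long> out(n);
    std::vector<long> partial(chunks, 0);
    // Subrange k covers indices [first(k), first(k + 1)).
    auto first = [n, chunks](std::size_t k) { return k * n / chunks; };

    // Phase 1 (parallel): each subrange computes its own partial sum.
    std::vector<std::thread> pool;
    for (std::size_t k = 0; k < chunks; ++k) {
        pool.emplace_back([&, k] {
            long sum = 0;
            for (std::size_t i = first(k); i < first(k + 1); ++i) sum += in[i];
            partial[k] = sum;
        });
    }
    for (auto& t : pool) t.join();
    pool.clear();

    // Phase 2 (sequential): carry the running total from subrange k into k+1.
    std::vector<long> carry(chunks, 0);
    for (std::size_t k = 1; k < chunks; ++k) carry[k] = carry[k - 1] + partial[k - 1];

    // Phase 3 (parallel): each subrange produces its final results.
    for (std::size_t k = 0; k < chunks; ++k) {
        pool.emplace_back([&, k] {
            long running = carry[k];
            for (std::size_t i = first(k); i < first(k + 1); ++i) {
                running += in[i];
                out[i] = running;
            }
        });
    }
    for (auto& t : pool) t.join();
    return out;
}
```

Calling three_phase_scan on {1, 2, 3, 4} with two subranges should give {1, 3, 6, 10}. Note that every element is read twice (phases 1 and 3), which is the price paid for making the two outer phases parallel.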

Parallel Scan Requirements

Parallel Scan

Parallel While and Pipeline

Linked List Example

• If Foo takes at least a few thousand instructions to run, parallelizing the loop can yield a speedup

Parallel_while

• Requires two user-defined objects:

1. Object that defines the stream of objects
 - must have pop_if_present
 - pop_if_present need not be thread safe
 - nonscalable, since fetching is serialized
 - possible to get useful speedup

2. Object that defines the loop body, i.e. the operator()
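The split between a serialized item stream and a parallel loop body can be sketched with standard C++ threads. This is a minimal sketch under assumptions, not TBB's parallel_while: the names Node, ItemStream, and parallel_while_sketch are hypothetical, and a mutex stands in for whatever TBB does internally to serialize fetching.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical linked-list node.
struct Node { int value; Node* next; };

// Stream object: hands out list items one at a time.
class ItemStream {
    Node* current_;
public:
    explicit ItemStream(Node* head) : current_(head) {}
    // Not thread safe on its own; the caller serializes access.
    bool pop_if_present(Node*& item) {
        if (!current_) return false;
        item = current_;
        current_ = current_->next;
        return true;
    }
};

// Applies body(item) to every item in the stream using `workers` threads.
template <typename Body>
void parallel_while_sketch(ItemStream& stream, const Body& body, unsigned workers) {
    assert(workers > 0);
    std::mutex fetch_mutex;  // fetching is serialized, which limits scaling
    auto worker = [&] {
        for (;;) {
            Node* item = nullptr;
            {
                std::lock_guard<std::mutex> lock(fetch_mutex);
                if (!stream.pop_if_present(item)) return;
            }
            body(item);  // the body runs outside the lock, in parallel
        }
    };
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
}
```

The sketch makes the scaling limit visible: only the body runs concurrently, so a speedup is possible only when the body costs much more than one fetch.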

Stream of Objects

Loop Body (Operator)

Parallelized Foo Acting on Linked List

• Note: the body of parallel_while can add more work by calling w.add(item)

Notes on parallel_while scaling

Pipelining

•Single pipeline

Pipelining

•Parallel pipeline

Pipelining Example

Character Buffer

Top Level Code for Building and Running the Pipeline

Non-Linear Pipes

Trick solution: Use Topologically Sorted Pipeline

•Note that the latency is increased

parallel_sort

CONTAINERS

concurrent_queue<T>

•The template class concurrent_queue<T> implements a concurrent queue with values of type T.

•Multiple threads may simultaneously push and pop elements from the queue.

•In a single-threaded program, a queue is a first-in first-out structure.

•But if multiple threads are pushing and popping concurrently, the definition of first is uncertain.

•The only guarantee of ordering offered by concurrent_queue is that if a thread pushes multiple values, and another thread pops those same values, they will be popped in the same order that they were pushed.

•Pushing is provided by the push method. There are blocking and nonblocking flavors of pop:

•pop_if_present : This method is nonblocking: it attempts to pop a value, and if it cannot because the queue is empty, it returns anyway.

•pop : This method blocks until it pops a value. If a thread must wait for an item to become available and it has nothing else to do, it should use pop(item) and not while(!pop_if_present(item)) continue; because pop uses processor resources more efficiently than the loop.


concurrent_queue<T>

•Unlike STL, TBB containers are not templated with an allocator argument. The library retains control over memory allocation.

•Unlike most STL containers, concurrent_queue::size_type is a signed integral type, not unsigned. This is because concurrent_queue::size( ) is defined as the number of push operations started minus the number of pop operations started.

•By default, a concurrent_queue<T> is unbounded. It may hold any number of values until memory runs out. It can be bounded by setting the queue capacity with the set_capacity method. Setting the capacity causes push to block until there is room in the queue.
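The difference between blocking pop, nonblocking pop_if_present, and a bounded push can be illustrated with a small queue built from the C++ standard library. This is a simplified stand-in, not tbb::concurrent_queue: the class BoundedQueue and the helper produce_and_consume are hypothetical, and a single mutex is far less scalable than the real container.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical bounded queue; a simplified stand-in, not tbb::concurrent_queue.
template <typename T>
class BoundedQueue {
    std::queue<T> items_;
    std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable not_empty_, not_full_;
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }
    // Blocks while the queue is at capacity.
    void push(const T& value) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(value);
        not_empty_.notify_one();
    }
    // Blocks (sleeping rather than spinning) until a value is available.
    void pop(T& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        out = items_.front();
        items_.pop();
        not_full_.notify_one();
    }
    // Nonblocking: returns false immediately if the queue is empty.
    bool pop_if_present(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop();
        not_full_.notify_one();
        return true;
    }
};

// One producer pushes 1..count through a queue of capacity 2 while the
// calling thread pops; with a single producer the order is preserved.
std::vector<int> produce_and_consume(int count) {
    BoundedQueue<int> queue(2);
    std::thread producer([&queue, count] {
        for (int i = 1; i <= count; ++i) queue.push(i);
    });
    std::vector<int> received;
    for (int i = 0; i < count; ++i) {
        int value = 0;
        queue.pop(value);
        received.push_back(value);
    }
    producer.join();
    return received;
}
```

Because the producer can never get more than two items ahead, the bounded capacity also acts as simple flow control between the two threads.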

concurrent_queue Example

concurrent_vector<T>

•A concurrent_vector<T> is a dynamically growable array of items of type T for which it is safe to simultaneously access elements in the vector while growing it.

•However, be careful not to let another task access an element that is under construction or is otherwise being modified.

•A concurrent_vector<T> never moves an element until the array is cleared, which can be an advantage over the STL std::vector (which can move elements to resize the vector), even for single-threaded code.
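The practical meaning of "never moves an element" can be seen, single-threaded, by comparing two standard containers. The sketch below uses std::deque as an analogy only, because its push_back also leaves references to existing elements valid; it says nothing about how concurrent_vector is implemented. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// std::deque::push_back keeps references to existing elements valid,
// so the address of the first element does not change while growing.
bool deque_keeps_elements_in_place(std::size_t extra) {
    assert(extra > 0);
    std::deque<int> d(1, 42);
    const int* first = &d.front();
    for (std::size_t i = 0; i < extra; ++i) d.push_back(static_cast<int>(i));
    return first == &d.front() && *first == 42;
}

// std::vector must reallocate when it outgrows its capacity; a changed
// capacity means the elements were moved to new storage.
bool vector_reallocates(std::size_t extra) {
    assert(extra > 0);
    std::vector<int> v(1, 42);
    const std::size_t old_capacity = v.capacity();
    for (std::size_t i = 0; i < extra; ++i) v.push_back(static_cast<int>(i));
    return v.capacity() != old_capacity;
}
```

Any pointer or reference taken into the std::vector before it grew would be dangling afterwards, which is exactly the hazard that stable element addresses avoid.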

concurrent_vector<T>

concurrent_vector Example

concurrent_hash_map<Key,T,HashCompare>


•A concurrent_hash_map acts as a container of elements of type std::pair<const Key,T>.

•Typically, when accessing a container element, you are interested in either updating it or reading it.

•The template class concurrent_hash_map supports these two operations with the accessor and const_accessor classes.

•An accessor represents update (write) access. As long as it points to an element, all other attempts to look up that key in the table block until the accessor is done.

•const_accessor is similar, except that it represents read-only access. Therefore, multiple const_accessors can point to the same element at the same time.


• find and insert methods take an accessor or const_accessor as an argument.

•The choice tells concurrent_hash_map whether you are asking for update or read-only access, respectively.

•Once the method returns, the access lasts until the accessor or const_accessor is destroyed.

•Because having access to an element can block other threads, try to shorten the lifetime of the accessor or const_accessor. To do so, declare it in the innermost block possible.

•To release access even sooner than the end of the block, use the release method.

•The method remove(key) can also operate concurrently. It implicitly requests write access. Therefore, before removing the key, it waits on any other accesses on the key.
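The accessor idea, where holding the object is what holds the lock, can be sketched with the C++17 standard library. This is a toy model under assumptions, not concurrent_hash_map: LockedTable is hypothetical, it stores int values only, it locks the whole table briefly for every lookup, and it supports no removal.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical table with one reader-writer lock per element.
class LockedTable {
    struct Slot {
        std::shared_mutex lock;
        int value = 0;
    };
    std::mutex table_mutex_;             // protects the map structure only
    std::map<std::string, Slot> slots_;  // std::map never relocates its nodes
public:
    // Write access: exclusive lock on one element for the accessor's lifetime.
    class accessor {
        friend class LockedTable;
        std::unique_lock<std::shared_mutex> lock_;
        int* value_ = nullptr;
    public:
        int& operator*() const { assert(value_ != nullptr); return *value_; }
        // Gives up access before the accessor goes out of scope.
        void release() { lock_ = std::unique_lock<std::shared_mutex>(); value_ = nullptr; }
    };
    // Read-only access: shared lock, so several readers may coexist.
    class const_accessor {
        friend class LockedTable;
        std::shared_lock<std::shared_mutex> lock_;
        const int* value_ = nullptr;
    public:
        const int& operator*() const { assert(value_ != nullptr); return *value_; }
        void release() { lock_ = std::shared_lock<std::shared_mutex>(); value_ = nullptr; }
    };

    // Finds or creates the element and grants write access to it.
    void insert(accessor& a, const std::string& key) {
        a.release();
        Slot* slot = nullptr;
        {
            std::lock_guard<std::mutex> guard(table_mutex_);
            slot = &slots_[key];
        }
        a.lock_ = std::unique_lock<std::shared_mutex>(slot->lock);
        a.value_ = &slot->value;
    }
    // Grants read access if the key exists; returns false otherwise.
    bool find(const_accessor& a, const std::string& key) {
        a.release();
        Slot* slot = nullptr;
        {
            std::lock_guard<std::mutex> guard(table_mutex_);
            auto it = slots_.find(key);
            if (it == slots_.end()) return false;
            slot = &it->second;
        }
        a.lock_ = std::shared_lock<std::shared_mutex>(slot->lock);
        a.value_ = &slot->value;
        return true;
    }
};
```

In this sketch a thread must call release() on its accessor (or let it go out of scope) before asking for another access to the same key, which mirrors the advice above to keep accessor lifetimes short.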

Use of release method

MUTUAL EXCLUSION in TBB

•In TBB, you program in terms of tasks, not threads. Therefore, you will probably think of mutual exclusion of tasks.

•TBB offers two kinds of mutual exclusion:

- Mutexes: These will be familiar to anyone who has used locks in other environments, and they include common variants such as reader-writer locks.

- Atomic operations: These are based on atomic operations offered by hardware processors, and they provide a solution that is simpler and faster than mutexes in a limited set of situations.

MUTEXES

•In TBB, mutual exclusion is implemented by classes known as mutexes and locks.

•A mutex is an object on which a task can acquire a lock. Only one task at a time can have a lock on a mutex; other tasks have to wait their turn.

MUTEX EXAMPLE

•With the object-oriented interface, destruction of the scoped_lock object causes the lock to be released, no matter whether the protected region was exited by normal control flow or an exception.
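The same release-on-destruction behavior exists in the C++ standard library, which makes for a compact illustration. The sketch below uses std::mutex and std::lock_guard as an analogue of a TBB mutex and its scoped_lock; the names counter_mutex, counter, and increment_or_throw are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <stdexcept>

std::mutex counter_mutex;
int counter = 0;  // shared state protected by counter_mutex

// Increments the counter, or throws once `limit` has been reached.
// Either way the lock_guard destructor releases the mutex.
void increment_or_throw(int limit) {
    assert(limit > 0);
    std::lock_guard<std::mutex> lock(counter_mutex);
    if (counter >= limit) throw std::runtime_error("limit reached");
    ++counter;
}
```

If the exception path leaked the lock, the next call to increment_or_throw would block forever; because the guard is destroyed during stack unwinding, later calls proceed normally.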

MUTEX EXAMPLE

FLAVORS of MUTEXES

•The simplest mutex is the spin_mutex. A task trying to acquire a lock on a busy spin_mutex waits until it can acquire the lock. A spin_mutex is appropriate when the lock is held for only a few instructions.

•Some mutexes are called scalable. In a strict sense, this is not an accurate name because a mutex limits execution to one task at a time and is therefore necessarily a drag on scalability.

•A scalable mutex is rather one that does no worse than forcing single-threaded performance.

•A mutex actually can do worse than serialize execution if the waiting tasks consume excessive processor cycles and memory bandwidth, reducing the speed of tasks trying to do real work.

•Scalable mutexes are often slower than nonscalable mutexes under light contention, so a nonscalable mutex may be better. When in doubt, use a scalable mutex.

SCALABLE MUTEX

•Mutexes can be fair or unfair.

•A fair mutex lets tasks through in the order they arrive.

•Fair mutexes avoid starving tasks. Each task gets its turn.

•However, unfair mutexes can be faster because they let tasks that are running go through first, instead of the task that is next in line, which may be sleeping because of an interrupt.

FAIR MUTEX

•Mutexes can be reentrant or nonreentrant.

•A reentrant mutex allows a task that is already holding a lock on the mutex to acquire another lock on it.

•This is useful in some recursive algorithms, but it typically adds overhead to the lock implementation.
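A recursive function that takes the lock at every level shows why reentrancy matters. The sketch uses std::recursive_mutex from the standard library as an analogue of a reentrant mutex; the names stats_mutex, calls, and locked_factorial are hypothetical.

```cpp
#include <cassert>
#include <mutex>

std::recursive_mutex stats_mutex;
long calls = 0;  // shared state protected by stats_mutex

// Each level of recursion re-acquires the lock its caller already holds.
// With a nonreentrant mutex the second acquisition would deadlock.
long locked_factorial(int n) {
    assert(n >= 0);
    std::lock_guard<std::recursive_mutex> lock(stats_mutex);
    ++calls;
    return n <= 1 ? 1 : n * locked_factorial(n - 1);
}
```

The mutex has to track its owner and a lock count to make this work, which is the overhead mentioned above.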

REENTRANT MUTEX

SLEEP or SPIN MUTEX

•Mutexes can cause a task to spin in user space or sleep while it is waiting.

•For short waits, spinning in user space is fastest because putting a task to sleep takes cycles.

•For long waits, sleeping is better because it causes the task to give up its processor to some task that needs it. Spinning is also undesirable in processors with multiple-task support in a single core, such as Intel processors with hyperthreading technology.
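The two waiting strategies can be compared side by side with standard C++. The sketch below defines a hypothetical SpinLock that busy-waits in user space and uses std::mutex as the lock that typically lets a waiting thread sleep; neither is a TBB class, and count_with is an illustrative helper.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical user-space spin lock: a waiter burns cycles until it wins.
class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // busy-wait; no system call, so very short waits are cheap
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }
};

// Increments a shared total under `lock` from several threads.
// Works with any type that provides lock() and unlock().
template <typename Lock>
long count_with(Lock& lock, unsigned threads, long per_thread) {
    assert(threads > 0);
    long total = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &total, per_thread] {
            for (long j = 0; j < per_thread; ++j) {
                std::lock_guard<Lock> guard(lock);
                ++total;  // critical section of only a few instructions
            }
        });
    }
    for (auto& th : pool) th.join();
    return total;
}
```

Both locks give the same result; which one is faster depends on how long the lock is held and how many threads contend for it, so any timing comparison should be measured on the target machine.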

MUTEX TYPES

•A spin_mutex is nonscalable, unfair, nonreentrant, and spins in user space. It would seem to be the worst of all possible worlds, except that it is very fast in lightly contended situations. If you can design your program so that contention is somehow spread out among many spin mutexes, you can improve performance over other kinds of mutexes. If a mutex is heavily contended, your algorithm will not scale anyway. Consider redesigning the algorithm instead of looking for a more efficient lock.

•A queuing_mutex is scalable, fair, nonreentrant, and spins in user space. Use it when scalability and fairness are important.

• A spin_rw_mutex and a queuing_rw_mutex are similar to spin_mutex and queuing_mutex, but they additionally support reader locks.

•A mutex is a wrapper around the system’s native mutual exclusion mechanism. On Windows systems, it is implemented on top of a CRITICAL_SECTION. On Linux systems, it is implemented on top of a pthread mutex.

•Requests for a reader lock are distinguished from requests for a writer lock via an extra Boolean parameter in the constructor for scoped_lock. The parameter is false to request a reader lock and true to request a writer lock. It defaults to true when it is omitted.

•It is also possible to upgrade a reader lock to a writer lock by using the method upgrade_to_writer.

Reader/Writer, Upgrade/Downgrade

Reader Writer Mutex Example

Note: upgrade_to_writer returns true if the upgrade happened without re-acquiring the lock, and false if the lock had to be released and re-acquired.
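Why the return value matters can be shown with the C++17 standard library, whose std::shared_mutex has no upgrade operation at all. The sketch below always releases the reader lock before taking the writer lock, which corresponds to the case where an upgrade reports that the lock was re-acquired: anything read earlier may be stale and has to be checked again. SquareCache is a hypothetical example class.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <shared_mutex>

// Hypothetical cache of squares guarded by a reader-writer lock.
class SquareCache {
    std::shared_mutex mutex_;
    std::map<int, long> cache_;
    int computations_ = 0;
public:
    long get(int x) {
        {
            // Common case: many readers may look up values concurrently.
            std::shared_lock<std::shared_mutex> reader(mutex_);
            auto it = cache_.find(x);
            if (it != cache_.end()) return it->second;
        }  // reader lock released here
        std::unique_lock<std::shared_mutex> writer(mutex_);
        // The lock was dropped in between, so another thread may have
        // inserted x already: re-check before computing.
        auto it = cache_.find(x);
        if (it == cache_.end()) {
            ++computations_;
            it = cache_.emplace(x, static_cast<long>(x) * x).first;
        }
        assert(it != cache_.end());
        return it->second;
    }
    int computations() {
        std::shared_lock<std::shared_mutex> reader(mutex_);
        return computations_;
    }
};
```

When an upgrade succeeds without releasing the lock, the re-check can be skipped; when it does not, code shaped like the second half of get is needed.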

ATOMIC OPERATIONS

•Atomic operations are a fast and relatively easy alternative to mutexes.

•They do not suffer from the deadlock and convoying problems that locks can cause.

•The main limitation of atomic operations is that they are limited in current computer systems to fairly small data sizes: the largest is usually the size of the largest scalar, often a double-precision floating-point number.

•Atomic operations are also limited to a small set of operations supported by the underlying hardware processor.

•The class atomic<T> implements atomic operations with C++ style.

Fundamental Operations on an atomic variable
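Two of the most common operations, an atomic add and a compare-and-swap loop, can be sketched with std::atomic from the C++ standard library. TBB's atomic<T> offers operations in the same spirit (for example fetch_and_add and compare_and_swap), but the code below deliberately uses only the standard spelling; hits, high_water, raise_high_water, and record_events are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<long> hits{0};        // event counter
std::atomic<long> high_water{0};  // largest value seen so far

// Atomically raises high_water to at least `candidate`.
void raise_high_water(long candidate) {
    long seen = high_water.load();
    while (seen < candidate &&
           !high_water.compare_exchange_weak(seen, candidate)) {
        // On failure `seen` now holds the current value; the loop retries
        // only while the candidate is still larger.
    }
}

// Several threads update both variables without taking any lock.
void record_events(unsigned threads, long per_thread) {
    assert(threads > 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([t, per_thread] {
            for (long j = 1; j <= per_thread; ++j) {
                hits.fetch_add(1);  // atomic read-modify-write
                raise_high_water(static_cast<long>(t) * per_thread + j);
            }
        });
    }
    for (auto& th : pool) th.join();
}
```

The compare-and-swap loop is the general pattern for any update that the hardware does not provide directly: read, compute, and publish only if nobody changed the value in the meantime.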

SCALABLE MEMORY ALLOCATION

•The scalable memory allocator is cleanly separate from the rest of TBB so that your choice of memory allocator for concurrent usage is independent of your choice of parallel algorithm and container templates.

•When ordinary, nonthreaded allocators are used, memory allocation becomes a serious bottleneck in a multithreaded program because each thread competes for a global lock for each allocation and deallocation of memory from a single global heap.

•TBB scalable allocator is built for scalability and speed. In some situations, this comes at a cost of wasted virtual space. Specifically, it wastes a lot of space when allocating blocks in the 9K to 12K range. It is also not yet terribly sophisticated about paging issues.

FALSE SHARING

•False sharing occurs when multiple threads use memory locations that are close together, even if they are not actually using the same memory locations.

•Because processor cores fetch and hold memory in chunks called cache lines, any memory accesses within the same cache line should be done only by the same thread.

•Otherwise, accesses to memory on the same cache line will cause unnecessary contention and swapping of cache lines back and forth, resulting in slowdowns which can easily be a hundred times worse for the affected memory accesses.

•Example: float A[1000]; float B[1000];
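The usual remedy is to give each thread's data its own cache line. The sketch below contrasts a packed layout with a padded one using standard C++; the 64-byte line size is an assumption that varies by processor, and PackedCounter, PaddedCounter, and count_in_parallel are hypothetical names. The code demonstrates the layouts only and does not measure the slowdown.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kAssumedCacheLine = 64;  // assumption; check your CPU

// Adjacent counters share a cache line: prone to false sharing.
struct PackedCounter { long value = 0; };

// Each counter is aligned and padded to occupy a cache line of its own.
struct alignas(kAssumedCacheLine) PaddedCounter { long value = 0; };

// Every thread increments only its own counter, so there is no data race
// in either layout; only the memory layout (and thus the speed) differs.
template <typename Counter>
long count_in_parallel(unsigned threads, long per_thread) {
    assert(threads > 0);
    std::vector<Counter> counters(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counters, t, per_thread] {
            for (long j = 0; j < per_thread; ++j) ++counters[t].value;
        });
    }
    for (auto& th : pool) th.join();
    long total = 0;
    for (const Counter& c : counters) total += c.value;
    return total;
}
```

Both layouts produce the same totals. The padded version trades memory (one cache line per counter) for the guarantee that no two threads write to the same line.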

Memory Allocators

• TBB scalable memory allocator utilizes a memory management algorithm divided on a per-thread basis to minimize contention associated with allocation from a single global heap.

•TBB offers two choices, both similar to the STL template class std::allocator:

- scalable_allocator: This template offers just scalability, but it does not completely protect against false sharing. Memory is returned to each thread from a separate pool, which helps protect against false sharing if the memory is not shared with other threads.

- cache_aligned_allocator: This template offers both scalability and protection against false sharing. It addresses false sharing by making sure each allocation is done on a separate cache line.

•Note that protection against false sharing between two objects is guaranteed only if both are allocated with cache_aligned_allocator.

Memory Allocators

•The functionality of cache_aligned_allocator comes at some cost in space because it allocates in multiples of cache-line-size memory chunks, even for a small object.

• The padding is typically 128 bytes. Hence, allocating many small objects with cache_aligned_allocator may increase memory usage.

Library to Link

•Both the debug and release versions for Threading Building Blocks are divided into two dynamic shared libraries, one with general support and the other with scalable memory allocator.

•The latter is distinguished by malloc in its name (although it does not define a routine actually called malloc).

•For example, the release versions for Windows are tbb.dll and tbbmalloc.dll, respectively.

Using the Allocator Argument to C++ STL Template Classes

•The following code shows how to declare an STL vector that uses cache_aligned_allocator for allocation:

std::vector< int, cache_aligned_allocator<int> > v;
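What the allocator argument does can be illustrated without TBB by writing a minimal allocator of one's own. The sketch below is not cache_aligned_allocator: LineAlignedAllocator is a hypothetical C++17 allocator that only aligns the start of each allocation to an assumed 64-byte line and makes no attempt at scalability.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

constexpr std::size_t kLineBytes = 64;  // assumed cache-line size

// Hypothetical minimal allocator: every allocation starts on a line boundary.
template <typename T>
struct LineAlignedAllocator {
    using value_type = T;
    LineAlignedAllocator() = default;
    template <typename U>
    LineAlignedAllocator(const LineAlignedAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        // C++17 aligned operator new
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t(kLineBytes)));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t(kLineBytes));
    }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U>
bool operator==(const LineAlignedAllocator<T>&, const LineAlignedAllocator<U>&) {
    return true;
}
template <typename T, typename U>
bool operator!=(const LineAlignedAllocator<T>&, const LineAlignedAllocator<U>&) {
    return false;
}

// Helper: does the address fall on an assumed cache-line boundary?
bool starts_on_line_boundary(const void* p) {
    assert(p != nullptr);
    return reinterpret_cast<std::uintptr_t>(p) % kLineBytes == 0;
}
```

A container is then declared as std::vector<int, LineAlignedAllocator<int>>, exactly as with the TBB allocator in the declaration above; only the second template argument changes.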

MALLOC/FREE/REALLOC/CALLOC
