Intel Threading Building Blocks TBB
-
Upload
candace-myers -
Category
Documents
-
view
66 -
download
5
description
Transcript of Intel Threading Building Blocks TBB
![Page 1: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/1.jpg)
Intel Threading Building BlocksTBB
http://www.threadingbuildingblocks.org
![Page 2: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/2.jpg)
Overview
• TBB enables you to specify tasks instead of threads
•TBB targets threading for performance
•TBB is compatible with other threading packages
•TBB emphasize scalable, data parallel programming
•TBB relies on generic programming
![Page 3: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/3.jpg)
![Page 4: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/4.jpg)
![Page 5: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/5.jpg)
![Page 6: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/6.jpg)
![Page 7: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/7.jpg)
void SerialAverage(float *output,float *input, size_t){ for(size_t i=0 ; I < n ; i++) { output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f);
}}
Example 1: Compute Average of 3 numbers
![Page 8: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/8.jpg)
#include "tbb/parallel_for.h"#include "tbb/blocked_range.h"#include "tbb/task_scheduler_init.h"
using namespace tbb;
class Average {public: float* input; float* output; void operator()( const blocked_range<int>& range ) const { for( int i=range.begin(); i!=range.end(); ++i ) output[i] = (input[i-1]+input[i]+input[i+1])*(1/3.0f); }};
// Note: The input must be padded such that input[-1] and input[n] // can be used to calculate the first and last output values.void ParallelAverage( float* output, float* input, size_t n ) { Average avg; avg.input = input; avg.output = output; parallel_for( blocked_range<int>( 0, n, 1000 ), avg );}
![Page 9: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/9.jpg)
//! Problem sizeconst int N = 100000;
int main( int argc, char* argv[] ) {
float output[N]; float raw_input[N+2]; raw_input[0] = 0; raw_input[N+1] = 0; float* padded_input = raw_input+1;
task_scheduler_init ;
............ ............ ParallelAverage(output, padded_input, N);
}
![Page 10: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/10.jpg)
Serial
Parallel
![Page 11: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/11.jpg)
![Page 12: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/12.jpg)
Notes on Grain Size
![Page 13: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/13.jpg)
Effect of Grain Size on A[i]=B[i]*c Computation (one million indices)
![Page 14: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/14.jpg)
Auto Partitioner
![Page 15: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/15.jpg)
Reduction
Serial
Parallel
![Page 16: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/16.jpg)
Class for use by Parallel Reduce
![Page 17: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/17.jpg)
Split-Join Sequence
![Page 18: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/18.jpg)
![Page 19: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/19.jpg)
![Page 20: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/20.jpg)
![Page 21: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/21.jpg)
![Page 22: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/22.jpg)
![Page 23: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/23.jpg)
![Page 24: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/24.jpg)
![Page 25: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/25.jpg)
![Page 26: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/26.jpg)
![Page 27: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/27.jpg)
Parallel_scan (parallel prefix)
![Page 28: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/28.jpg)
Parallel Scan
![Page 29: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/29.jpg)
Parallel Scan with Partitioner
• Parallel_scan, breaks a range into subranges and computes a partial result in each subrange in parallel.
• Then, the partial result for subrange k is used to update the information in subrange k+1, starting from k=0 and proceeding sequentially up to the last subrange.
• Finally, each subrange uses its updated information to compute its final result in parallel with all the other subranges.
![Page 30: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/30.jpg)
Parallel Scan Requirements
![Page 31: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/31.jpg)
Parallel Scan
![Page 32: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/32.jpg)
![Page 33: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/33.jpg)
Parallel While and Pipeline
![Page 34: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/34.jpg)
Linked List Example
• Assume Foo takes at least a few thousand instructions to run, then it is possible to get speedup by parallelizing
![Page 35: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/35.jpg)
Parallel_while
• Requires two user-defined objects1. Object that defines the stream of objects - must have pop_if_present - pop_if_present need not be thread safe - nonscalable, since fetching is serialized - possible to get useful speedup 2. Object that defines the loop body i.e. the operator()
![Page 36: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/36.jpg)
Stream of Objects
![Page 37: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/37.jpg)
Loop Body (Operator)
![Page 38: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/38.jpg)
Parallelized Foo Acting on Linked List
• Note: the body of parallel_while can add more work by calling
w.add(item)
![Page 39: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/39.jpg)
Notes on parallel_while scaling
![Page 40: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/40.jpg)
Pipelining
•Single pipeline
![Page 41: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/41.jpg)
Pipelining
•Parallel pipeline
![Page 42: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/42.jpg)
Pipelining Example
![Page 43: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/43.jpg)
Character Buffer
![Page 44: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/44.jpg)
Top Level Code for for Building and Running the Pipeline
![Page 45: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/45.jpg)
![Page 46: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/46.jpg)
![Page 47: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/47.jpg)
![Page 48: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/48.jpg)
Non-Linear Pipes
![Page 49: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/49.jpg)
Trick soln: Use Topologically Sorted Pipeline
•Note that the latency is increased
![Page 50: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/50.jpg)
parallel_sort
![Page 51: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/51.jpg)
CONTAINERS
![Page 52: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/52.jpg)
concurrent_queue<T>
•The template class concurrent_queue<T> implements a concurrent queue with values of type T.
•Multiple threads may simultaneously push and pop elements from the queue.
•In a single-threaded program, a queue is a first-in first-out structure.
•But if multiple threads are pushing and popping concurrently, the definition of first is uncertain.
•The only guarantee of ordering offered by concurrent_queue is that if a thread pushes multiple values, and another thread pops those same values, they will be popped in the same order that they were pushed.
![Page 53: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/53.jpg)
•Pushing is provided by the push method. There are blocking and nonblocking flavors of pop:
•pop_if_present : This method is nonblocking: it attempts to pop a value, and if it cannot because the queue is empty, it returns anyway.
•pop : This method blocks until it pops a value. If a thread must wait for an item to become available and it has nothing else to do, it should use pop(item) and notwhile(!pop_if_present(item)) continue; because pop uses processor resources more efficiently than the loop.
concurrent_queue<T>
![Page 54: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/54.jpg)
concurrent_queue<T>
•Unlike STL, TBB containers are not templated with an allocator argument. The library retains control over memory allocation.
•Unlike most STL containers, concurrent_queue::size_type is a signed integral type, not unsigned. This is because concurrent_queue::size( ) is defined as the number of push operations started minus the number of pop operations started.
•By default, a concurrent_queue<T> is unbounded. It may hold any number of values until memory runs out. It can be bounded by setting the queue capacity with the set_capacity method. Setting the capacity causes push to block until there is room in the queue.
![Page 55: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/55.jpg)
concurrent_queue Example
![Page 56: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/56.jpg)
concurrent_vector<T>
•A concurrent_vector<T> is a dynamically growable array of items of type T for which it is safe to simultaneously access elements in the vector while growing it.
•However, be careful not to let another task access an element that is under construction or is otherwise being modified.
•A concurrent_vector<T> never moves an element until the array is cleared, which can be an advantage over the STL std::vector (which can move elements to resize the vector), even for single-threaded code.
![Page 57: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/57.jpg)
concurrent_vector<T>
![Page 58: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/58.jpg)
concurrent_vector<T>
![Page 59: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/59.jpg)
concurrent_vector Example
![Page 60: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/60.jpg)
concurrent_hash_map<Key,T,HashCompare>
![Page 61: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/61.jpg)
![Page 62: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/62.jpg)
![Page 63: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/63.jpg)
concurrent_hash_map<Key,T,HashCompare>
•A concurrent_hash_map acts as a container of elements of type std::pair<const Key,T>.
•Typically, when accessing a container element, you are interested in either updating it or reading it.
•The template class concurrent_hash_map supports these two operations with the accessor and const_accessor classes
•An accessor represents update (write) access. As long as it points to an element, all other attempts to look up that key in the table block until the accessor is done.
•const_accessor is similar, except that it represents read-only access. Therefore, multiple const_accessors can point to the same element at the same time.
![Page 64: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/64.jpg)
concurrent_hash_map<Key,T,HashCompare>
• find and insert methods take an accessor or const_accessor as an argument.
•The choice tells concurrent_hash_map whether you are asking for update or read-only access, respectively.
•Once the method returns, the access lasts until the accessor or const_accessor is destroyed.
•Because having access to an element can block other threads, try to shorten the lifetime of the accessor or const_accessor. To do so, declare it in the innermost block possible.
•To release access even sooner than the end of the block, use the release method.
•The method remove(key) can also operate concurrently. It implicitly requests write access. Therefore, before removing the key, it waits on any other accesses on the key.
![Page 65: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/65.jpg)
Use of release method
![Page 66: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/66.jpg)
MUTUAL EXCLUSION in TBB
•In TBB, you program in terms of tasks, not threads, therefore, you will probably think of mutual exclusion of tasks.
•TBB offers two kinds of mutual exclusion:-Mutexes :
These will be familiar to anyone who has used locks in other environments, and they include common variants such as reader-writer locks. - Atomic operations These are based on atomic operations offered by hardware processors, and they provide a solution that is simpler and faster than mutexes in a limited set of situations.
![Page 67: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/67.jpg)
MUTEXES
•In TBB, mutual exclusion is implemented by classes known as mutexes and locks.
•A mutex is an object on which a task can acquire a lock. Only one task at a time can have a lock on a mutex; other tasks have to wait their turn.
![Page 68: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/68.jpg)
MUTEX EXAMPLE
•With the object-oriented interface, destruction of the scoped_lock object causes the lock to be released, no matter whether the protected region was exited by normal control flow or an exception.
![Page 69: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/69.jpg)
MUTEX EXAMPLE
![Page 70: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/70.jpg)
FLAVORS of MUTEXES
•The simplest mutex is the spin_mutex. A task trying to acquire a lock on a busy spin_mutex waits until it can acquire the lock. A spin_mutex is appropriate when the lock is held for only a few instructions.
![Page 71: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/71.jpg)
•Some mutexes are called scalable. In a strict sense, this is not an accurate name because a mutex limits execution to one task at a time and is therefore necessarily a drag on scalability.
•A scalable mutex is rather one that does no worse than forcing single- threaded performance.
•A mutex actually can do worse than serialize execution if the waiting tasks consume excessive processor cycles and memory bandwidth, reducing the speed of tasks trying to do real work.
•Scalable mutexes are often slower than nonscalable mutexes under light contention, so a nonscalable mutex may be better. When in doubt, use a scalable mutex.
SCALABLE MUTEX
![Page 72: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/72.jpg)
•Mutexes can be fair or unfair.
•A fair mutex lets tasks through in the order they arrive.
•Fair mutexes avoid starving tasks. Each task gets its turn.
•However, unfair mutexes can be faster because they let tasks that are running go through first, instead of the task that is next in line, which may be sleeping because of an interrupt.
FAIR MUTEX
![Page 73: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/73.jpg)
•Mutexes can be reentrant or nonreentrant.
•A reentrant mutex allows a task that is already holding a lock on the mutex to acquire another lock on it.
•This is useful in some recursive algorithms, but it typically adds overhead to the lock implementation.
REENTRANT MUTEX
![Page 74: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/74.jpg)
SLEEP or SPIN MUTEX
•Mutexes can cause a task to spin in user space or sleep while it is waiting.
•For short waits, spinning in user space is fastest because putting a task to sleep takes cycles.
•For long waits, sleeping is better because it causes the task to give up its processor to some task that needs it. Spinning is also undesirable in processors with multiple-task support in a single core, such as Intel processors with hyperthreading technology.
![Page 75: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/75.jpg)
MUTEX TYPES
•A spin_mutex is nonscalable, unfair, nonreentrant, and spins in user space. It would seem to be the worst of all possible worlds, except that it is very fast in lightly contended situations. If you can design your program so that contention is somehow spread out among many spin mutexes, you can improve performance over other kinds of mutexes. If a mutex is heavily contended, your algorithm will not scale anyway. Consider redesigning the algorithm instead of looking for a more efficient lock.
•A queuing_mutex is scalable, fair, nonreentrant, and spins in user space. Use it when scalability and fairness are important.
• A spin_rw_mutex and a queuing_rw_mutex are similar to spin_mutex and queuing_mutex, but they additionally support reader locks.
•A mutex is a wrapper around the system’s native mutual exclusion mechanism. On Windows systems, it is implemented on top of a CRITICAL_SECTION. On Linux systems, it is implemented on top of a pthread mutex.
![Page 76: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/76.jpg)
•Requests for a reader lock are distinguished from requests for a writer lock via an extra Boolean parameter in the constructor for scoped_lock. The parameter is false to request a reader lock and true to request a writer lock. It defaults to true when it is omitted
•It is also possible to upgrade a reader lock to a writer lock by using the method upgrade_to_writer.
Reader/Writer, Upgrade/Downgrade
![Page 77: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/77.jpg)
Reader Writer Mutex Example
Note: upgrade_to_writer returns true if the upgrade happened withoutre-acquiring the lock and false if opposite
![Page 78: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/78.jpg)
ATOMIC OPERATIONS
•Atomic operations are a fast and relatively easy alternative to mutexes.
•They do not suffer from the deadlock and convoying problems
•The main limitation of atomic operations is that they are limited in current computer systems to fairly small data sizes: the largest is usually the size of the largest scalar, often a double-precision floating-point number.
•Atomic operations are also limited to a small set of operations supported by the underlying hardware processor.
•The class atomic<T> implements atomic operations with C++ style.
![Page 79: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/79.jpg)
Fundamental Operations on an atomic variable
![Page 80: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/80.jpg)
SCALABLE MEMORY ALLOCATION
•The scalable memory allocator is cleanly separate from the rest of TBB so that your choice of memory allocator for concurrent usage is independent of your choice of parallel algorithm and container templates.
•When ordinary, nonthreaded allocators are used, memory allocation becomes a serious bottleneck in a multithreaded program because each thread competes for a global lock for each allocation and deallocation of memory from a single global heap. •TBB scalable allocator is built for scalability and speed. In some situations, this comes at a cost of wasted virtual space. Specifically, it wastes a lot of space when allocating blocks in the 9K to 12K range. It is also not yet terribly sophisticated about paging issues.
![Page 81: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/81.jpg)
FALSE SHARING
•False sharing occurs when multiple threads use memory locations that are close together, even if they are not actually using the same memory locations.
•Because processor cores fetch and hold memory in chunks called cache lines, any memory accesses within the same cache line should be done only by the same thread.
•Otherwise, accesses to memory on the same cache line will cause unnecessary contention and swapping of cache lines back and forth, resulting in slowdowns which can easily be a hundred times worse for the affected memory accesses.
•Example: float A[1000] float B[1000]
![Page 82: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/82.jpg)
Memory Allocators
• TBB scalable memory allocator utilizes a memory management algorithm divided on a per-thread basis to minimize contention associated with allocation from a single global heap.
•TBB offers two choices, both similar to the STL template class, std::allocator: - scalable_allocator This template offers just scalability, but it does not completely protect against false sharing. Memory is returned to each thread from a separate pool, which helps protect against false sharing if the memory is not shared with other threads.
-cache_aligned_allocator This template offers both scalability and protection against false sharing. It addresses false sharing by making sure each allocation is done on a cache line.
•Note that protection against false sharing between two objects is guaranteed only if both are allocated with cache_aligned_allocator.
![Page 83: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/83.jpg)
Memory Allocators
•The functionality of cache_aligned_allocator comes at some cost in space because it allocates in multiples of cache-line-size memory chunks, even for a small object.
• The padding is typically 128 bytes. Hence, allocating many small objects with cache_ aligned_allocator may increase memory usage.
![Page 84: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/84.jpg)
Library to Link
•Both the debug and release versions for Threading Building Blocks are divided into two dynamic shared libraries, one with general support and the other with scalable memory allocator.
•The latter is distinguished by malloc in its name (although it does not define a routine actually called malloc).
•For example, the release versions for Windows are tbb.dll and tbbmalloc.dll, respectively.
![Page 85: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/85.jpg)
Using the Allocator Argument to C++ STL Template Classes
•The following code shows how to declare an STL vector that uses cache_aligned_allocator for allocation:
std::vector< int, cache_aligned_allocator<int> >;
![Page 86: Intel Threading Building Blocks TBB](https://reader035.fdocuments.us/reader035/viewer/2022062304/56812e45550346895d93d47f/html5/thumbnails/86.jpg)
MALLOC/FREE/REALLOC/CALLOC