Introduction Platform Support Algorithms Conclusion 1/49
MCSTL: Multi-Core Standard Template Library
Practical Implementation of Parallel Algorithms for Shared-Memory Systems
Peter Sanders, Johannes Singler
Institute for Theoretical Computer Science, University of Karlsruhe
April 29th, 2008
Peter Sanders, Johannes Singler MCSTL - Practical Parallelism
Lecture Contents
Introduction
Platform Support
Algorithms
Conclusion
Model: Theory vs. Practice
- machine model → concrete machine(s)
- pseudo-code → existing C++ library
Communication Network → Shared Memory
- implicit communication
- cache hierarchy, NUMA, bandwidth sharing
Synchronous PRAM → Asynchronous PEs
- synchronization a problem itself
- n = p → n ≫ p
- core allocation not static, other processes interfere
Memory Models Refined
Programming Multicores
- automatic parallelization? only for simple loops
- explicitly parallel? too complicated for everyday use
- libraries of parallelized algorithms!
  natural starting point: standard libraries of programming languages
Basic Approach
Make using parallel algorithms "as easy as winking".
Functionality of the C++ Standard Template Library
Why STL?
- many efficient and useful algorithms included
- interface very well-known among developers
- template mechanism is known to allow low-overhead algorithm libraries
- recompilation of existing programs may suffice
- C++ is an accepted and efficient language
Goals
- parallelize all time-consuming STL algorithms
- speedup already for small inputs → scale down
- high speedup for medium/large inputs
Special Requirements for a Library
Generality
- genericity (templates)
- only few assumptions about input data types
- good scalability in terms of use cases
Compatibility with
- existing libraries
- platforms
Layers
[Figure: layer diagram of the MCSTL. Labels: Applications; STL Interface; Serial STL Algorithms; Parallel STL Algorithms; Extensions; OpenMP; Atomic Ops; OS Thread Scheduling; Multi-Core Hardware.]
Implemented Algorithms
- embarrassingly parallel (for_each, transform, ...)
- find, find_if, mismatch, ...
- partial_sum (prefix sum)
- partition
- nth_element / partial_sort
- merge
- sort, stable_sort
- random_shuffle
- (multi)set/map::insert
Extension to STL
- multiway_merge
Dependency Graph of Lecture Contents
[Figure: dependency graph with three groups (Platform Support, Sequential Helper Algorithms, Parallel Algorithms). Labels: for_each etc.; accumulate; partial_sum; sort (quick); sort (mwms); random_shuffle; (multiway_)merge; multi-sequence selection; exact splitting; atomic operations; lock-free DS; partition; partial_sort; nth_element; load balancing; fine-grained communication; consistency; placement; init; initial splits; tournament tree; find etc.; mismatch; equal, lex_comp.]
Shared-Memory Hardware
- cache coherency protocol makes the memory view consistent, introduces implicit communication
- cores invalidate entries in their cache when another core writes (snooping)
- overhead only for actual transfer of data
- granularity is one cache line: avoid false sharing!
Threading Support
- OpenMP: basic primitives
  - example:
      #pragma omp parallel num_threads(p)
      {
        num_threads = omp_get_num_threads();
        iam = omp_get_thread_num();
        ...
        #pragma omp barrier   // or: single / master
        ...
      }
  - quite elegant
  - no permanent separation possible (asynchrony)
  - still works when compiler ignores pragmas
  - good compiler support (GCC, Sun, Intel, MS)
- atomic operations
  - fetch-and-add, compare-and-swap
Atomic Operations
A few operations are executed atomically, i.e. without any chance of interference.
- fetch_and_add(x, i)
  - t := x; r := x; r := r + i; x := r; return t;
  - allows concurrent iteration over a sequence
- compare_and_swap(x, c, r)
  - if (x = c) { x := r; return c [true]; } else return r [false];
  - secure state transition, can emulate fetch_and_add and others by using it in a loop
- slower than usual operations, in particular when concurrent
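A minimal sketch of both primitives, assuming C++11 `std::atomic` (which postdates these slides). The helper names `fetch_and_add_via_cas` and `claim_next` are made up for this illustration. The first emulates fetch-and-add by retrying compare-and-swap in a loop; the second shows how fetch-and-add lets several threads iterate over one sequence.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical helper: emulate fetch-and-add with a compare-and-swap loop.
// Returns the value x held immediately before the addition took effect.
inline int fetch_and_add_via_cas(std::atomic<int>& x, int i) {
    int expected = x.load();
    // compare_exchange_weak stores expected + i only if x still equals
    // expected; on failure it reloads expected with the current value of x.
    while (!x.compare_exchange_weak(expected, expected + i)) {
        // another thread changed x in between (or a spurious failure): retry
    }
    return expected;
}

// Hypothetical helper: claim the next unprocessed index of a sequence of
// length n. Returns n once every index has been handed out.
inline std::size_t claim_next(std::atomic<std::size_t>& next, std::size_t n) {
    assert(n > 0);
    const std::size_t i = next.fetch_add(1);  // native fetch-and-add
    return i < n ? i : n;
}
```

Each thread would call `claim_next` in a loop until it returns n, so no index is processed twice.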
find, find_if, mismatch, ...
Find the first position in a sequence satisfying a predicate.
Analysis
- O(n) sequential time if the first hit is at position n (unknown)
- naive parallel algorithm needs Ω(m/p) for a sequence of length m
- parallelization not worthwhile for small n
[Figure: sequence of length m with the first hit at position n]
find: Algorithm
- start sequentially up to position m0
- dynamic load balancing using fetch-and-add
- scale up and down using geometrically growing block sizes
- first successful thread grabs remaining work
[Figure: sequence split at m0 into a sequential prefix and a parallel part processed block-wise by p0, p1, p2, p3]
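The idea can be sketched roughly as follows. This is not the MCSTL code: the function name `find_first`, the default values of `m0` and `block`, and the use of a fixed block size (instead of geometrically growing blocks) are simplifying assumptions. Blocks are claimed with an OpenMP atomic capture, which acts as fetch-and-add; if the compiler ignores the pragmas, the function still runs correctly in sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: index of the first element equal to `value`, or v.size() if none.
// m0 = length of the sequential prefix, block = fixed block size (both assumed).
inline std::size_t find_first(const std::vector<int>& v, int value,
                              std::size_t m0 = 1000, std::size_t block = 4096) {
    assert(block > 0);
    const std::size_t n = v.size();
    // Phase 1: plain sequential scan of the prefix [0, m0).
    const std::size_t prefix = m0 < n ? m0 : n;
    for (std::size_t i = 0; i < prefix; ++i)
        if (v[i] == value) return i;

    std::size_t next = prefix;  // start of the next unclaimed block
    std::size_t best = n;       // smallest hit position found so far

    // Phase 2: threads repeatedly claim blocks with an atomic fetch-and-add.
    #pragma omp parallel shared(next, best)
    {
        for (;;) {
            std::size_t begin;
            #pragma omp atomic capture
            { begin = next; next += block; }

            std::size_t limit;
            #pragma omp atomic read
            limit = best;
            // Blocks are handed out in increasing order, so a block starting
            // at or beyond the best hit cannot improve the result.
            if (begin >= limit) break;

            const std::size_t end = begin + block < n ? begin + block : n;
            for (std::size_t i = begin; i < end; ++i) {
                if (v[i] == value) {
                    #pragma omp critical
                    {
                        if (i < best) {
                            #pragma omp atomic write
                            best = i;
                        }
                    }
                    break;  // later positions in this block cannot be earlier
                }
            }
        }
    }
    return best;
}
```

Taking the minimum over all hits keeps the result identical to the sequential algorithm even when a later block finishes first.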
[Figure: speedup (0 to 4) versus position of the found element (1 to 10^8). Find n in the sequence [1, ..., 10^8] of integers on a 4-way Opteron. Curves: sequential; MCSTL find with 2, 3, 4 threads; naive parallel with 2, 3, 4 threads.]
partial_sum
Problem
For a sequence S, compute the prefix sums $A_i = \sum_{j=1}^{i} S_j$.
Differences from Theoretical PRAM Algorithms
- each PE has its own data
- shared-memory advantage: can split data arbitrarily
- assume that p is relatively small
Practical Algorithm for Shared Memory
- divide input into p + 1 pieces
- reason: double calculation for the first part can be avoided
partial_sum: Algorithm
Processor i ∈ {0, ..., p − 1}
1. i = 0: compute partial sums of part 0, S[0] := last one
   i > 0: compute S[i] := sum of part i
2. i = 0: compute partial sums of the S[i] sequentially
3. i ≥ 0: compute partial sums of part i + 1 using S[i]
Analysis
- only 3 synchronizations (constant)
- time complexity O(n/p + p)
- speedup (p + 1)/2 for n ≫ p
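The three steps can be sketched as below. This is an illustration under assumptions, not the MCSTL implementation: the function name `prefix_sum_p_plus_1`, the element type `long`, and the use of OpenMP `parallel for` loops with one iteration per worker are choices made for this example. Without OpenMP the pragmas are ignored and the code runs sequentially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch: in-place inclusive prefix sums of `a` using p workers and p + 1 pieces.
inline void prefix_sum_p_plus_1(std::vector<long>& a, int p) {
    assert(p >= 1);
    const std::size_t n = a.size();
    // Piece j covers [bound(j), bound(j + 1)), for j = 0 .. p.
    auto bound = [n, p](int j) { return n * static_cast<std::size_t>(j) / (p + 1); };
    std::vector<long> s(p, 0);  // s[i]: sum of piece i, later sum of pieces 0..i

    // Step 1: worker 0 computes the prefix sums of piece 0 directly;
    // workers i > 0 only add up piece i. Piece p is not touched yet.
    #pragma omp parallel for num_threads(p)
    for (int i = 0; i < p; ++i) {
        auto first = a.begin() + bound(i), last = a.begin() + bound(i + 1);
        if (i == 0) {
            std::partial_sum(first, last, first);
            s[0] = (first == last) ? 0 : *(last - 1);
        } else {
            s[i] = std::accumulate(first, last, 0L);
        }
    }

    // Step 2: sequential prefix sums over the p piece sums.
    std::partial_sum(s.begin(), s.end(), s.begin());

    // Step 3: worker i computes the prefix sums of piece i + 1, offset by s[i].
    #pragma omp parallel for num_threads(p)
    for (int i = 0; i < p; ++i) {
        long running = s[i];
        for (std::size_t k = bound(i + 1); k < bound(i + 2); ++k) {
            running += a[k];
            a[k] = running;
        }
    }
}
```

Each worker touches two pieces of about n/(p + 1) elements, so the work per worker is roughly 2n/(p + 1) compared with n sequentially, which matches the stated speedup of (p + 1)/2.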
partial_sum: Scheme
[Figure: the input split into pieces; step 1 is executed by threads t0, t1, t2, step 2 by t0 alone, step 3 by t0, t1, t2]
partial_sum: Results
[Figure: speedup (0 to 10) versus n (10^4 to 10^8) for the prefix sum of integers on Sun T1. Curves: sequential; 1, 2, 3, 4, 8, 16, 32 threads.]
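partition
[Figure: sequence rearranged into elements < pivot followed by elements > pivot]
Sequential Algorithm
- scan from both ends
- swap a pair of elements into the desired order when both are on the wrong side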
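A minimal sketch of the sequential two-ended scan, assuming `int` elements and a pivot passed by value; the function name `partition_two_ended` is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch: move elements < pivot to the front; return the size of that front part.
inline std::size_t partition_two_ended(std::vector<int>& a, int pivot) {
    std::size_t left = 0, right = a.size();
    for (;;) {
        // Skip elements on the left that are already on the correct side.
        while (left < right && a[left] < pivot) ++left;
        // Skip elements on the right that are already on the correct side.
        while (left < right && !(a[right - 1] < pivot)) --right;
        if (left == right) return left;
        // a[left] >= pivot and a[right - 1] < pivot: both are misplaced, swap.
        std::swap(a[left], a[right - 1]);
        ++left;
        --right;
        assert(left <= right);  // the two scan positions never cross
    }
}
```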
Parallel Partitioning [Tsigas, Zhang 2003]
1. scan blocks of size B from both ends
   1.1 claim new blocks when running out of data
2. swap the unfinished blocks to the "middle"
3. recurse on the middle
[Figure: input scanned by p0, p1, p2; unfinished blocks swapped in parallel; rest handled recursively or sequentially]
- time complexity O(n/p + B log p)
partition: Example
[Figure: step-by-step example with 3 processors (p0, p1, p2), B = 3, pivot 50, no special cases]
[Figure: speedup (0 to 16) versus n (10^4 to 10^8) for partitioning of 32-bit integers on Sun T1. Curves: sequential; 1, 2, 3, 4, 8, 16, 32 threads.]
nth_element, partial_sort, quicksort
[Figure: ranks covered by nth_element (rank n) and partial_sort (ranks < n)]
Algorithms
- nth_element: quickselect, i.e. linear recursion using partition
- partial_sort: nth_element, then sort
- quicksort: recursion using partition
parallel implementations profit from each other
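To illustrate how nth_element reduces to partition, here is a sequential quickselect sketch built on `std::partition`. The function name `quickselect`, the middle-element pivot, and the three-way split are assumptions of this example; the linear recursion is written as a loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: rearrange `a` so that a[n] is the element a full sort would put there.
inline void quickselect(std::vector<int>& a, std::size_t n) {
    assert(n < a.size());
    std::size_t lo = 0, hi = a.size();  // the answer lies in [lo, hi)
    while (hi - lo > 1) {
        const int pivot = a[lo + (hi - lo) / 2];
        // Three-way split of [lo, hi): [< pivot | == pivot | > pivot]
        auto mid1 = std::partition(a.begin() + lo, a.begin() + hi,
                                   [pivot](int x) { return x < pivot; });
        auto mid2 = std::partition(mid1, a.begin() + hi,
                                   [pivot](int x) { return x == pivot; });
        const std::size_t m1 = mid1 - a.begin(), m2 = mid2 - a.begin();
        if (n < m1)      hi = m1;   // continue in the "< pivot" part
        else if (n < m2) return;    // a[n] == pivot is already in place
        else             lo = m2;   // continue in the "> pivot" part
    }
}
```

Replacing `std::partition` with a parallel partition would parallelize the selection as well, which is the sense in which the implementations profit from each other.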
Multi-Sequence Selection
Problem Definition
Find the element with global rank r in k sorted sequences S_i, and the corresponding splitters.
Usage
Split at the elements with global rank n/p, 2n/p, 3n/p, ..., (p − 1)n/p and redistribute the elements → sequences of the same length (±1) on each PE
- guaranteed even for many equal elements
Solution
[Varman et al. 1991], used as black box here
Sequential multiway_merge
Problem Definition
Merge k sorted sequences into one sorted sequence.
Solution
Use a tournament tree, usually implemented as a loser tree.
- binary tree in array
- efficient computation of indices
- optimal O(log k) running time per merge step
- downside: tricky without sentinels and/or when k is not a power of 2
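For illustration only, the sketch below merges k sorted sequences with a binary heap (`std::priority_queue`) standing in for the loser tree. It has the same O(log k) cost per merge step but is a different data structure from the one named above; the function name `multiway_merge` and the `int` element type are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Sketch: merge k sorted sequences; a binary heap stands in for the loser tree.
inline std::vector<int> multiway_merge(const std::vector<std::vector<int>>& seqs) {
    typedef std::pair<int, std::size_t> Entry;  // (current value, sequence index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heads;
    std::vector<std::size_t> pos(seqs.size(), 0);  // next unread position per sequence
    std::size_t total = 0;
    for (std::size_t s = 0; s < seqs.size(); ++s) {
        total += seqs[s].size();
        if (!seqs[s].empty()) heads.push(Entry(seqs[s][0], s));
    }
    std::vector<int> out;
    out.reserve(total);
    while (!heads.empty()) {
        const Entry e = heads.top();  // deleteMin: smallest head of all sequences
        heads.pop();
        out.push_back(e.first);
        const std::size_t s = e.second;
        if (++pos[s] < seqs[s].size())  // insertNext: successor from the same sequence
            heads.push(Entry(seqs[s][pos[s]], s));
    }
    assert(out.size() == total);
    return out;
}
```

Because no sentinels are used and empty sequences are simply skipped, k does not have to be a power of 2 here.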
Loser Tree
[Figure: loser tree before and after a deleteMin+insertNext step]
Parallel (multiway_)merge
How to divide the problem?
- find slabs, i.e. consistent sets of sections from the sequences
- exact splitting into parts of equal size (using multi-sequence selection)
[Figure: k sequences cut into slabs, one slab each for threads t0, t1, t2, t3]
Parallel (multiway_)merge: Analysis
- time complexity $O\left(\frac{n}{p} \log k + k \log k \cdot \log \max_j |S_j|\right)$
- one multi-sequence partition per PE
- no full linear speedup
- good in practice
Parallel (multiway_)merge: Results
[Figure: speedup (0 to 25) versus n/k (100 to 10^7) for 16-way merging of pairs of 64-bit integers on Sun T1. Curves: sequential; 1, 2, 3, 4, 8, 16, 32 threads.]
Parallel Multiway Mergesort
Procedure
1. divide sequence into p parts of equal size
2. in parallel, sort the parts locally
3. use parallel p-way merging to compute the final sequence
4. copy result back to original position
[Figure: four parts handled by threads t0, t1, t2, t3]
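A rough sketch of steps 1 and 2, assuming OpenMP and `int` elements; the function name `sort_parts_then_merge` is hypothetical. Step 3 is deliberately replaced by a sequential stand-in (`std::inplace_merge` applied part by part), so this shows the structure of the procedure rather than the parallel p-way merge itself, and step 4 is not needed in this simplified form.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: sort p parts in parallel, then combine them with a sequential stand-in.
inline void sort_parts_then_merge(std::vector<int>& a, int p) {
    assert(p >= 1);
    const std::size_t n = a.size();
    // Part j covers [bound(j), bound(j + 1)); part sizes differ by at most 1.
    auto bound = [n, p](int j) { return n * static_cast<std::size_t>(j) / p; };

    // Steps 1 and 2: p parts of (almost) equal size, each sorted independently.
    #pragma omp parallel for num_threads(p)
    for (int i = 0; i < p; ++i)
        std::sort(a.begin() + bound(i), a.begin() + bound(i + 1));

    // Stand-in for step 3: fold the sorted parts together one by one.
    // (The procedure above uses a parallel p-way merge instead.)
    for (int i = 1; i < p; ++i)
        std::inplace_merge(a.begin(), a.begin() + bound(i), a.begin() + bound(i + 1));
}
```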
sort, stable_sort
Parallel Multiway Mergesort
+ few, cache-efficient local memory accesses
+ stable variant easy
- needs twice the space temporarily
Quicksort
+ in-place
± dynamic load-balancing due to unequal splitting
- more global memory access
- not stable
both variants implemented in the MCSTL
Parallel Multiway Mergesort: Analysis
Running Time
- time complexity $O\left(\frac{n \log n}{p} + p \log p \cdot \log \frac{n}{p}\right)$
- one multi-sequence partition per PE
Parallel Multiway Mergesort: Practical Issues
- copy to temporary memory first? or merge to temporary memory and copy back later?
- compute starting positions sequentially
Parallel Multiway Mergesort: Results on T1
[Figure: speedup (0 to 25) versus number of elements (100 to 10^7). Curves: sequential; 1, 2, 3, 4, 8, 16, 32 threads.]
Random Permutation (random_shuffle)
Standard Sequential Algorithm (e.g. STL)
for 0 ≤ i < n: swap(a[i], a[rand(i, n − 1)])
Cache-efficient (parallel) algorithm
1. distribute randomly to (local) buckets
   1b. (copy local buckets to global buckets)
2. permute buckets
[Figure: elements distributed into buckets, buckets permuted]
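A minimal sketch of the standard sequential shuffle (Fisher-Yates), assuming `std::mt19937` as the random source; the function name `fisher_yates` is made up for this example. For a uniformly random permutation the swap partner must be drawn from the closed range [i, n − 1], so that an element may also stay in place.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Sketch: uniformly random permutation of `a`, one swap per position.
inline void fisher_yates(std::vector<int>& a, std::mt19937& rng) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i + 1 < n; ++i) {
        // j is uniform in [i, n - 1]; j == i (element stays put) must be allowed.
        std::uniform_int_distribution<std::size_t> pick(i, n - 1);
        const std::size_t j = pick(rng);
        assert(i <= j && j < n);
        std::swap(a[i], a[j]);
    }
}
```

The swap partner is spread over the whole remaining array, so for large n almost every step is a cache miss; this is the motivation for the bucket-based variant.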
Random Permutation (random_shuffle)
- time complexity O(n/p + p), global communication volume n
- cache efficiency very important (factor 2)
[Figure: speedup (0 to 8) versus n (10^6 to 10^8) for cache-aware random shuffling of integers on a 4-way Opteron. Curves: sequential; 1, 2, 3, 4 threads.]
Embarrassingly Parallel Computation
- semantics
  - process a set of elements completely independently
  - atomic units called jobs, running time unknown
- parallelization
  - easy in principle (uniform workload) → static load-balancing
  - interesting for non-uniform workload → dynamic load-balancing
- possible solutions
  - equal splitting: perfect for uniform workload
  - master-worker: possibly considerable overhead (communication in each step)
  - dynamic load-balancing: more general problem, see upcoming lectures
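The equal-splitting option can be sketched as follows, assuming OpenMP and a callable `f` applied to each element; the function name `for_each_equal_split` is hypothetical. Each of the p workers gets one contiguous chunk, which is only a good fit when all jobs take about the same time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: apply f to every element, splitting the index range into p equal chunks.
template <typename T, typename F>
void for_each_equal_split(std::vector<T>& v, int p, F f) {
    assert(p >= 1);
    const std::size_t n = v.size();
    #pragma omp parallel for num_threads(p) schedule(static, 1)
    for (int t = 0; t < p; ++t) {
        // Chunk t covers [n*t/p, n*(t+1)/p); chunk lengths differ by at most 1.
        const std::size_t begin = n * static_cast<std::size_t>(t) / p;
        const std::size_t end = n * static_cast<std::size_t>(t + 1) / p;
        for (std::size_t i = begin; i < end; ++i) f(v[i]);
    }
}
```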
Conclusion
- MCSTL provides a very easy way to incorporate parallelism into programs on an algorithmic level
- performance is excellent for large inputs
- basic algorithms known, but detailed design and performance engineering nontrivial
- successfully integrated into STXXL (external memory)
- integrated into GCC 4.3 (parallel mode)
Code Sample
    #include <algorithm>
    int a[10000000];
    int main() { std::sort(a, a + 10000000); }
Compile with:
    g++-4.3.0 -D_GLIBCXX_PARALLEL -fopenmp sort.cpp
Future Work
- complete STL functionality
- better automatic algorithm and parameter selection
- machine model adequate for design and analysis of multithreaded algorithms
- beyond STL
Algorithms & DS to be Implemented
- priority queues
- some embarrassingly parallel functions (e.g. valarray)
- memory transfer operations (reverse, copy)?
More About All That
- MCSTL website: http://algo2.iti.uni-karlsruhe.de/singler/mcstl/
- libstdc++ parallel mode: http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html