MCSTL: Multi-Core Standard Template Library

Practical Implementation of Parallel Algorithms for Shared-Memory Systems

Peter Sanders, Johannes Singler

Institute for Theoretical Computer Science, University of Karlsruhe

April 29th, 2008

Peter Sanders, Johannes Singler MCSTL - Practical Parallelism

Lecture Contents

Introduction

Platform Support

Algorithms

Conclusion

Model

Theory → Practice
- machine model → concrete machine(s)
- pseudo-code → existing C++ library

Communication network → shared memory
- implicit communication
- cache hierarchy, NUMA, bandwidth sharing

Synchronous PRAM → asynchronous PEs
- synchronization a problem itself
- n = p → n ≫ p
- core allocation not static, other processes interfere

Memory Models Refined

Programming Multicores

- automatic parallelization? only for simple loops
- explicitly parallel? too complicated for everyday use
- libraries of parallelized algorithms!

Natural starting point: standard libraries of programming languages

Basic Approach

Make using parallel algorithms “as easy as winking”.
Functionality of the C++ Standard Template Library

Why STL?
- many efficient and useful algorithms included
- interface very well-known among developers
- template mechanism is known to allow low-overhead algorithm libraries
- recompilation of existing programs may suffice
- C++ is an accepted and efficient language
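To illustrate the "recompilation may suffice" point, here is a minimal sketch of ordinary sequential STL code. The helper name `make_sorted` is made up for this example. The assumption, not stated on the slides, is a build with GCC's libstdc++ parallel mode (a descendant of the MCSTL), enabled with `-fopenmp -D_GLIBCXX_PARALLEL`; without those flags the same source simply runs sequentially.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: builds a descending sequence n, n-1, ..., 1
// and sorts it with the plain STL call. No parallel API appears here.
// Assumed build line for a parallel run (GCC):
//   g++ -O2 -fopenmp -D_GLIBCXX_PARALLEL example.cpp
std::vector<int> make_sorted(std::size_t n) {
    std::vector<int> v(n);
    for (std::size_t i = 0; i < n; ++i) v[i] = static_cast<int>(n - i);
    std::sort(v.begin(), v.end());  // may be dispatched to a parallel sort
    return v;
}
```

The source code is identical in both builds; only the compiler flags decide which implementation of `std::sort` is used.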

Goals

- parallelize all time-consuming STL algorithms
- speedup already for small inputs (scale down)
- high speedup for medium/large inputs

Special Requirements for a Library

Generality
- genericity (templates)
- only few assumptions about input data types
- good scalability in terms of use cases

Compatibility with
- existing libraries
- platforms

Layers

[Figure: layer diagram of the MCSTL. Labels, roughly from top to bottom: Applications; STL Interface; Extensions; Parallel STL Algorithms (MCSTL); Serial STL Algorithms; OpenMP; Atomic Ops; OS Thread Scheduling; Multi-Core Hardware.]

Implemented Algorithms
- embarrassingly parallel (for_each, transform, ...)
- find, find_if, mismatch, ...
- partial_sum (prefix sum)
- partition
- nth_element/partial_sort
- merge
- sort, stable_sort
- random_shuffle
- (multi)set/map::insert

Extension to STL
- multiway_merge

Dependency Graph of Lecture Contents

[Figure: dependency graph in three layers labeled Platform Support, Sequential Helper Algorithms, and Parallel Algorithms. Node and edge labels: atomic operations, lock-free DS, load balancing, fine-grained communication, consistency, placement, init, initial splits, tournament tree, multi-sequence selection, exact splitting, for_each etc., accumulate, partial_sum, find etc., mismatch, equal, lex_comp, partition, partial_sort, nth_element, sort (quick), sort (mwms), random_shuffle, (multiway_)merge.]

Shared-Memory Hardware

- cache coherency protocol makes memory view consistent, introduces implicit communication
- cores invalidate entries in cache when other core writes (snooping)
- overhead only for actual transfer of data
- granularity is one cache line: avoid false sharing!
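The false-sharing advice can be made concrete with a small sketch. Everything named here (`kCacheLine`, `PaddedCounter`, `count_even`) is hypothetical, and the 64-byte line size is an assumption that holds on many but not all machines. Each chunk writes only its own counter, and the padding keeps neighbouring counters on different cache lines. The OpenMP pragma is optional: if the compiler ignores it, the loop runs sequentially with the same result.

```cpp
#include <cstddef>
#include <vector>

// Assumed cache-line size in bytes (typical, not guaranteed).
constexpr std::size_t kCacheLine = 64;

// One counter per chunk; alignas pads the struct to a full cache line,
// so consecutive counters are kCacheLine bytes apart.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};

// Counts even numbers. Chunk t reads its slice of data and writes
// only counters[t], so no two threads write to the same cache line.
long count_even(const std::vector<int>& data, std::size_t chunks) {
    if (chunks == 0) chunks = 1;
    std::vector<PaddedCounter> counters(chunks);
    const std::size_t n = data.size();
    const long num_chunks = static_cast<long>(chunks);

    #pragma omp parallel for
    for (long t = 0; t < num_chunks; ++t) {
        const std::size_t begin = n * static_cast<std::size_t>(t) / chunks;
        const std::size_t end = n * static_cast<std::size_t>(t + 1) / chunks;
        for (std::size_t i = begin; i < end; ++i)
            if (data[i] % 2 == 0) ++counters[static_cast<std::size_t>(t)].value;
    }

    long total = 0;
    for (const PaddedCounter& c : counters) total += c.value;
    return total;
}
```

With a plain `std::vector<long>` instead of the padded struct, several counters would share one line and every increment could invalidate that line in the other cores' caches.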

Threading Support
- OpenMP: basic primitives
  - example

        #pragma omp parallel num_threads(p)
        {
            num_threads = omp_get_num_threads();  // threads actually running
            iam = omp_get_thread_num();           // id of this thread
            ...
            #pragma omp barrier   // or: single, master
            ...
        }

  - quite elegant
  - no permanent separation possible (asynchrony)
  - still works when compiler ignores pragmas
  - good compiler support (GCC, Sun, Intel, MS)
- atomic operations
  - fetch-and-add, compare-and-swap
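A small sketch of the "still works when compiler ignores pragmas" property. The function names are invented for this example. The `_OPENMP` macro is defined by compilers when OpenMP is enabled, so the `omp.h` call is guarded, while the pragma itself needs no guard.

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: number of threads the runtime may use,
// or 1 when the program was built without OpenMP.
int available_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Sums a vector. With OpenMP the iterations are split among threads and
// the partial sums are combined by the reduction clause; without OpenMP
// the pragma is ignored and this is an ordinary sequential loop.
long long parallel_sum(const std::vector<int>& v) {
    long long sum = 0;
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) sum += v[static_cast<std::size_t>(i)];
    return sum;
}
```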

Atomic Operations

A few operations are executed without any chance of interference, i.e. atomically.

- fetch_and_add(x, i)
  - t := x; r := x; r := r + i; x := r; return t;
  - allows concurrent iteration over sequence
- compare_and_swap(x, c, r)
  - if (x = c) { x := r; return c [true]; } else return x [false];
  - secure state transition, can emulate fetch_and_add and others by using it in a loop
- slower than usual operations, in particular when concurrent
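The remark that compare-and-swap can emulate fetch-and-add in a loop can be sketched with the standard `std::atomic` interface. This uses C++11 atomics as a stand-in, not the primitives the MCSTL itself relies on, and the function name is made up for the example.

```cpp
#include <atomic>

// Hypothetical helper: fetch-and-add built from compare-and-swap.
// Returns the value x held immediately before the addition.
long fetch_and_add_via_cas(std::atomic<long>& x, long i) {
    long expected = x.load();
    // On failure compare_exchange_weak stores the current value of x
    // into `expected`, so every retry starts from a fresh snapshot.
    while (!x.compare_exchange_weak(expected, expected + i)) {
    }
    return expected;
}
```

For a shared index into a sequence, each caller receives a distinct old value, which is what allows several threads to iterate over the sequence concurrently without processing an element twice.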

find, find_if, mismatch, ...

Find the first position in a sequence satisfying a predicate.

Analysis
- O(n) sequential time if first hit is at position n (unknown)
- naïve parallel algorithm needs Ω(m/p), where m is the sequence length
- parallelization not worthwhile for small n

[Figure: sequence of length m with the first hit at position n]

find: Algorithm

- start sequentially up to position m_0
- dynamic load balancing using fetch-and-add
- scale up and down using geometrically growing block sizes
- first successful thread grabs remaining work

[Figure: sequence with a sequential part up to m_0 followed by a parallel part, processed in growing blocks by p0, p1, p2, p3]
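The scheme above can be sketched as follows. This is a simplified illustration under stated assumptions, not the MCSTL code: the name `blocked_find_if`, the defaults for `m0` and `first_block`, and the doubling factor are all invented here, and `std::atomic` is used inside an OpenMP region, which works with common compilers but is an assumption. Instead of letting the successful thread grab all remaining work, the other threads simply stop once their next block starts behind the best hit.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

// Returns the index of the first element satisfying pred, or v.size().
template <typename T, typename Pred>
std::size_t blocked_find_if(const std::vector<T>& v, Pred pred,
                            std::size_t m0 = 1024,
                            std::size_t first_block = 256) {
    const std::size_t n = v.size();
    if (first_block == 0) first_block = 1;

    // Sequential start: no threading overhead for hits near the front.
    const std::size_t seq_end = std::min(m0, n);
    for (std::size_t i = 0; i < seq_end; ++i)
        if (pred(v[i])) return i;

    std::atomic<std::size_t> next_block(0);  // next unclaimed block number
    std::atomic<std::size_t> result(n);      // smallest hit found so far

    #pragma omp parallel
    {
        for (;;) {
            // Claim block k with one fetch-and-add.
            const std::size_t k = next_block.fetch_add(1);
            // Block sizes double: first_block, 2*first_block, ...
            std::size_t begin = seq_end, size = first_block;
            for (std::size_t j = 0; j < k && begin < n; ++j) {
                begin += size;
                size *= 2;
            }
            if (begin >= n || begin >= result.load()) break;
            const std::size_t end = std::min(n, begin + size);
            for (std::size_t i = begin; i < end; ++i) {
                if (pred(v[i])) {
                    // Keep the minimum hit position via a CAS loop.
                    std::size_t cur = result.load();
                    while (i < cur && !result.compare_exchange_weak(cur, i)) {
                    }
                    break;
                }
            }
        }
    }
    return result.load();
}
```

Because blocks are claimed in increasing order, every block that starts before the final result has been scanned by some thread, so the smallest hit is the one returned.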

[Figure: speedup (0 to 4) versus position of the found element (1 to 10^8, logarithmic axis). Find n in the sequence [1,...,10^8] of integers on 4-way Opteron. Curves: sequential; MCSTL find with 2, 3, and 4 threads; naive parallel with 2, 3, and 4 threads.]

partial_sum

Problem
For a sequence S, compute the prefix sums $A_i = \sum_{j=1}^{i} S_j$.

Distinction from Theoretical PRAM Algorithms
- each PE has its data
- shared-memory advantage: can split data arbitrarily
- assume that p is relatively small

Practical Algorithm for Shared Memory
- divide input into p + 1 pieces
- reason: double calculation for first part can be avoided

partial_sum: Algorithm

Processor i ∈ {0, ..., p − 1}

1. i = 0: compute partial sums of part 0, S[0] := last one
   i > 0: compute S[i] := sum of part i
2. i = 0: compute partial sums of S[i] sequentially
3. i ≥ 0: compute partial sums of part i + 1 using S[i]

Analysis
- only 3 synchronizations (constant)
- time complexity O(n/p + p)
- speedup (p + 1)/2 for n ≫ p
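The three steps can be sketched as below. The function name and the use of `long long` elements are assumptions made for the example, and the array `sums` plays the role of S[i] in the steps above. The two parallel loops and the sequential step between them correspond to the synchronization points; without OpenMP the pragmas are ignored and the code runs sequentially with the same output.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a prefix sum over p + 1 parts for p threads.
std::vector<long long> blocked_partial_sum(const std::vector<long long>& in,
                                           std::size_t p) {
    const std::size_t n = in.size();
    std::vector<long long> out(n);
    if (n == 0) return out;
    if (p == 0) p = 1;

    // Part k covers the index range [bounds[k], bounds[k + 1]).
    std::vector<std::size_t> bounds(p + 2);
    for (std::size_t k = 0; k <= p + 1; ++k) bounds[k] = n * k / (p + 1);

    std::vector<long long> sums(p, 0);
    const long threads = static_cast<long>(p);

    // Step 1: thread 0 writes the prefix sums of part 0,
    // every other thread i only computes the sum of part i.
    #pragma omp parallel for
    for (long i = 0; i < threads; ++i) {
        const std::size_t t = static_cast<std::size_t>(i);
        long long acc = 0;
        for (std::size_t j = bounds[t]; j < bounds[t + 1]; ++j) {
            acc += in[j];
            if (t == 0) out[j] = acc;
        }
        sums[t] = acc;
    }

    // Step 2: sequential prefix sum over the p part sums.
    for (std::size_t t = 1; t < p; ++t) sums[t] += sums[t - 1];

    // Step 3: thread i writes the prefix sums of part i + 1,
    // starting from the total of parts 0..i.
    #pragma omp parallel for
    for (long i = 0; i < threads; ++i) {
        const std::size_t t = static_cast<std::size_t>(i);
        long long acc = sums[t];
        for (std::size_t j = bounds[t + 1]; j < bounds[t + 2]; ++j) {
            acc += in[j];
            out[j] = acc;
        }
    }
    return out;
}
```

Each thread touches two parts of about n/(p + 1) elements, which is where the (p + 1)/2 speedup bound comes from.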

partial_sum: Scheme

[Figure: scheme of the three steps on the input: step 1 by threads t0, t1, t2; step 2 by t0 alone; step 3 by t0, t1, t2]

partial_sum: Results

[Figure: speedup (0 to 10) versus n (10000 to 10^8, logarithmic axis). Prefix sum of integers on Sun T1. Curves: sequential; 1, 2, 3, 4, 8, 16, and 32 threads.]

partition

[Figure: sequence divided around the pivot into a "<" part and a ">" part]

Sequential Algorithm
- scan from both ends
- swap to desired order when contrary
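A minimal sketch of the sequential two-ended scan, with a made-up function name and `int` elements assumed for brevity. It returns the split position, so that everything before it is smaller than the pivot.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: rearranges a so that all elements < pivot come
// first, and returns the index of the first element that is >= pivot.
std::size_t partition_two_ends(std::vector<int>& a, int pivot) {
    std::size_t lo = 0, hi = a.size();
    for (;;) {
        // Scan from the left past elements that are already in place.
        while (lo < hi && a[lo] < pivot) ++lo;
        // Scan from the right past elements that are already in place.
        while (lo < hi && !(a[hi - 1] < pivot)) --hi;
        if (lo >= hi) return lo;
        // a[lo] and a[hi - 1] are both on the wrong side: swap them.
        std::swap(a[lo], a[hi - 1]);
        ++lo;
        --hi;
    }
}
```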

Parallel Partitioning [Tsigas Zhang 2003]

1. scan blocks of size B from both ends
   1.1 claim new blocks when running out of data
2. swap the unfinished blocks to the “middle”
3. recurse on the middle

[Figure: input scanned from both ends by p0, p1, p2; unfinished blocks are swapped in parallel; the rest is handled recursively or sequentially]

- time complexity O(n/p + B log p)

partition: Example

[Figure: step-by-step run of parallel partitioning on a sequence of integers with 3 processors (p0, p1, p2), B=3, pivot 50, no special cases]

[Figure: speedup (0 to 16) versus n (10000 to 10^8, logarithmic axis). Partitioning of 32-bit integers on Sun T1. Curves: sequential; 1, 2, 3, 4, 8, 16, and 32 threads.]

nth_element, partial_sort, quicksort

[Figure: effect of nth_element and partial_sort on a sequence ordered by rank, with the ranks below n marked]

Algorithms
- nth_element: quickselect, i.e. linear recursion using partition
- partial_sort: nth_element, then sort
- quicksort: recursion using partition

Parallel implementations profit from each other.
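How the three algorithms build on partition can be sketched sequentially. Both function names are invented for this example, and `std::partition` stands in for the parallel partition; the pivot choice (middle element) is an arbitrary assumption.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical quickselect: afterwards *nth is the element that would be
// there in sorted order, and nothing before it is larger than it.
template <typename It>
void quickselect(It first, It nth, It last) {
    typedef typename std::iterator_traits<It>::value_type T;
    while (last - first > 1) {
        const T pivot = *(first + (last - first) / 2);
        // Three-way split: [first, lt) < pivot, [lt, gt) == pivot, rest > pivot.
        It lt = std::partition(first, last,
                               [&pivot](const T& x) { return x < pivot; });
        It gt = std::partition(lt, last,
                               [&pivot](const T& x) { return !(pivot < x); });
        if (nth < lt) {
            last = lt;    // continue in the left part only
        } else if (nth < gt) {
            return;       // nth lies in the block equal to the pivot
        } else {
            first = gt;   // continue in the right part only
        }
    }
}

// Hypothetical partial sort: select the boundary, then sort the front.
template <typename It>
void partial_sort_via_select(It first, It middle, It last) {
    if (middle != last) quickselect(first, middle, last);
    std::sort(first, middle);
}
```

Quickselect recurses into one side only (linear recursion), whereas quicksort would recurse into both sides of the same partition step.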

Multi-Sequence Selection

Problem Definition
Find the element with global rank r in k sorted sequences S_i, and the corresponding splitters.

Usage
Split at the elements with global rank n/p, 2n/p, 3n/p, ..., (p − 1)n/p and redistribute the elements → sequences of the same length (±1) on each PE
- guaranteed even for many equal elements

Solution
[Varman et al. 1991], used as black box here

Sequential multiway merge

Problem Definition
Merge k sorted sequences into one sorted sequence.

Solution
Use a tournament tree, usually implemented as loser tree.
- binary tree in array
- efficient computation of indices
- optimal O(log k) running time per merge step
- downside: tricky without sentinels and/or with k not being a power of 2
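As a rough illustration of the O(log k) cost per merge step, the sketch below merges k sorted runs with a binary min-heap (`std::priority_queue`). This is a deliberately swapped-in, simpler structure, not the loser tree named above; the function name is hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical helper: merges sorted runs into one sorted vector.
std::vector<int> multiway_merge_heap(const std::vector<std::vector<int>>& runs) {
    typedef std::pair<int, std::size_t> Entry;  // (value, index of its run)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> pos(runs.size(), 0);  // read position per run

    std::size_t total = 0;
    for (std::size_t r = 0; r < runs.size(); ++r) {
        total += runs[r].size();
        if (!runs[r].empty()) heap.push(Entry(runs[r][0], r));
    }

    std::vector<int> out;
    out.reserve(total);
    while (!heap.empty()) {
        const Entry e = heap.top();  // current overall minimum
        heap.pop();
        out.push_back(e.first);
        const std::size_t r = e.second;
        // Refill from the run the minimum came from, if it has more.
        if (++pos[r] < runs[r].size()) heap.push(Entry(runs[r][pos[r]], r));
    }
    return out;
}
```

Ties are broken by run index, so equal values leave in run order. The heap needs no sentinels and no power-of-two k, at the price of a separate pop and push per output element.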

Loser Tree

[Figure: a loser tree before and after one deleteMin+insertNext step: the minimum 1 is removed, the next element 3 is inserted, and 2 becomes the new winner]

Parallel (multiway_)merge

How to divide the problem?
- find slabs, i.e. consistent sets of sections from the sequences
- exact splitting into parts of equal size (using multi-sequence selection)

[Figure: k sequences cut into slabs assigned to threads t0, t1, t2, t3]

Parallel (multiway_)merge: Analysis

- time complexity $O\left(\frac{n}{p} \log k + k \log k \cdot \log \max_j |S_j|\right)$
- one multi-sequence partition per PE
- no full linear speedup
- good in practice

Parallel (multiway_)merge: Results

[Figure: speedup (0 to 25) versus n/k (100 to 10^7, logarithmic axis). 16-way merging of pairs of 64-bit integers on Sun T1. Curves: sequential; 1, 2, 3, 4, 8, 16, and 32 threads.]

Parallel Multiway Mergesort

Procedure
1. divide sequence into p parts of equal size
2. in parallel sort the parts locally
3. use parallel p-way merging to compute the final sequence
4. copy result back to original position

[Figure: the four steps shown for threads t0, t1, t2, t3]
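A reduced sketch of the procedure, under clearly stated simplifications: steps 1 and 2 follow the list above (the local sorts run in parallel when OpenMP is enabled), but step 3 is replaced by repeated sequential `std::inplace_merge` calls as a stand-in for the parallel p-way merge, which also makes the copy-back of step 4 unnecessary. The function name is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: sorts a by sorting p parts and merging them.
void mergesort_by_parts(std::vector<int>& a, std::size_t p) {
    const std::size_t n = a.size();
    if (p == 0) p = 1;

    // Step 1: part t covers the index range [bounds[t], bounds[t + 1]).
    std::vector<std::size_t> bounds(p + 1);
    for (std::size_t k = 0; k <= p; ++k) bounds[k] = n * k / p;
    auto at = [&a, &bounds](std::size_t k) {
        return a.begin() + static_cast<std::ptrdiff_t>(bounds[k]);
    };

    // Step 2: sort the parts independently of each other.
    const long parts = static_cast<long>(p);
    #pragma omp parallel for
    for (long t = 0; t < parts; ++t)
        std::sort(at(static_cast<std::size_t>(t)),
                  at(static_cast<std::size_t>(t) + 1));

    // Stand-in for step 3: fold the sorted parts together one by one.
    for (std::size_t t = 1; t < p; ++t)
        std::inplace_merge(a.begin(), at(t), at(t + 1));
}
```

The stand-in merge is sequential and does about p passes over the data, so it shows the structure of the algorithm only; the speedups reported in this lecture depend on the exact splitting and the parallel multiway merge.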

sort, stable_sort

Parallel Multiway Mergesort
+ few, cache-efficient local memory accesses
+ stable variant easy
– needs twice the space temporarily

Quicksort
+ in-place
± dynamic load-balancing due to unequal splitting
– more global memory access
– not stable

Both variants are implemented in the MCSTL.

Parallel Multiway Mergesort: Analysis

Running Time
- time complexity $O\left(\frac{n \log n}{p} + p \log p \cdot \log \frac{n}{p}\right)$
- one multi-sequence partition per PE

Parallel Multiway Mergesort: Practical Issues

- copy to temporary memory first? or merge to temporary memory and copy back later?
- compute starting positions sequentially

Parallel Multiway Mergesort: Results on T1

[Figure: speedup (0 to 25) versus number of elements (100 to 10^7, logarithmic axis). Curves: sequential; 1, 2, 3, 4, 8, 16, and 32 threads.]

Random Permutation (random_shuffle)

Standard Sequential Algorithm (e.g. STL)
for 0 ≤ i < n: swap(a[i], a[rand(i, n − 1)])

Cache-efficient (parallel) algorithm
1. distribute randomly to (local) buckets
1b. (copy local buckets to global buckets)
2. permute buckets
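The bucket scheme can be sketched sequentially as follows. The function name, the `int` element type, the explicit seed, and the choice of `std::mt19937` are assumptions for this example. In a real setting the number of buckets would be chosen so that one bucket fits into cache, and step 1 would run in parallel with local buckets per thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: returns a random permutation of in.
std::vector<int> bucket_shuffle(const std::vector<int>& in,
                                std::size_t num_buckets, unsigned seed) {
    if (num_buckets == 0) num_buckets = 1;
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, num_buckets - 1);

    // Step 1: send every element to a uniformly random bucket.
    std::vector<std::vector<int>> buckets(num_buckets);
    for (int x : in) buckets[pick(rng)].push_back(x);

    // Step 2: permute each bucket on its own and concatenate the buckets.
    std::vector<int> out;
    out.reserve(in.size());
    for (std::vector<int>& b : buckets) {
        std::shuffle(b.begin(), b.end(), rng);
        out.insert(out.end(), b.begin(), b.end());
    }
    return out;
}
```

The random accesses of step 2 stay inside one bucket at a time, which is the source of the cache efficiency, while step 1 only appends to the ends of the buckets.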

Random Permutation (random shuffle)

I time complexity O(n/p + p), global communication volume n

I cache efficiency very important (factor 2)

[Figure: Cache-aware random shuffling of integers on 4-way Opteron. Speedup (0 to 8) versus n (10^6 to 10^8); curves for sequential and for 1, 2, 3, and 4 threads]


Embarrassingly Parallel Computation

I semantics
  I process a set of elements completely independently
  I atomic units called jobs, running time unknown
I parallelization
  I easy in principle (uniform workload) → static load-balancing
  I interesting for non-uniform workload → dynamic load-balancing
I possible solutions
  I equal splitting: perfect for uniform workload
  I master-worker: possibly considerable overhead (communication in each step)
  I dynamic load-balancing: more general problem, see upcoming lectures
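The difference between equal splitting and job-by-job assignment can be sketched with OpenMP loop schedules. This is a hedged illustration rather than the MCSTL mechanism: `run_job`, `run_equal_split`, and `run_dynamic` are hypothetical names, and OpenMP's `schedule(dynamic)` stands in for the master-worker scheme, since each idle thread fetches the next job from a shared counter. The pragmas take effect only when compiling with `-fopenmp`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical job whose running time grows with its input, so a list of
// jobs with different inputs forms a non-uniform workload.
long long run_job(int work) {
    assert(work >= 0);
    long long acc = 0;
    for (int k = 0; k < work; ++k) acc += k % 7;
    return acc;
}

// Equal splitting: every thread receives one contiguous block of jobs,
// fixed before the loop starts.
std::vector<long long> run_equal_split(const std::vector<int>& jobs) {
    const int n = static_cast<int>(jobs.size());
    std::vector<long long> result(jobs.size());
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) result[i] = run_job(jobs[i]);
    return result;
}

// Dynamic assignment: a thread that finishes a job fetches the next one,
// at the price of one synchronized fetch per job.
std::vector<long long> run_dynamic(const std::vector<int>& jobs) {
    const int n = static_cast<int>(jobs.size());
    std::vector<long long> result(jobs.size());
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i) result[i] = run_job(jobs[i]);
    return result;
}
```

Both functions compute the same results; they differ only in how jobs are assigned to threads. With uniform jobs the static version avoids the per-job fetch, while with skewed job sizes the dynamic version keeps threads from idling behind one overloaded block.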




Conclusion

I MCSTL provides a very easy way to incorporate parallelism into programs on an algorithmic level
I performance is excellent for large inputs
I basic algorithms known, but detailed design and performance engineering nontrivial
I successfully integrated into STXXL (external memory)
I integrated into GCC 4.3 (parallel mode)


Code Sample

#include <algorithm>

int a[10000000];

int main() {
    // Compiled with -D_GLIBCXX_PARALLEL, this call runs the parallel sort.
    std::sort(a, a + 10000000);
}

g++-4.3.0 -D_GLIBCXX_PARALLEL -fopenmp sort.cpp
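Because parallel mode is selected by a compile-time macro, the same source can be built either way. The following sketch shows one way a program might report which mode it was built in; the helper names `parallel_mode_enabled` and `sorted_copy` are hypothetical and not part of libstdc++.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// True if this translation unit was compiled with -D_GLIBCXX_PARALLEL.
bool parallel_mode_enabled() {
#ifdef _GLIBCXX_PARALLEL
    return true;
#else
    return false;
#endif
}

// The call site is identical in both modes; only the compiler flags decide
// whether std::sort resolves to the sequential or the parallel version.
std::vector<int> sorted_copy(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    assert(std::is_sorted(v.begin(), v.end()));
    return v;
}
```

Built without the flags, `parallel_mode_enabled()` returns false and `sorted_copy` uses the ordinary sequential sort, which makes it easy to compare both builds of one program.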


Future Work

I complete STL functionality
I better automatic algorithm and parameter selection
I machine model adequate for design and analysis of multithreaded algorithms
I beyond STL


Algorithms & DS to be Implemented

I priority queues
I some embarrassingly parallel functions (e.g. valarray)
I memory transfer operations (reverse, copy)?


More About All That

I MCSTL website: http://algo2.iti.uni-karlsruhe.de/singler/mcstl/
I libstdc++ parallel mode: http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html
