Parallelism in the Standard C++: What to Expect in C++ 17

Artur Laksberg

Microsoft Corp.

May 8th, 2014


Agenda

Fundamentals: Task regions

Parallel Algorithms: Parallelization, Vectorization

Part 1: The Fundamentals

OpenMP, TBB, PPL

MPI, OpenCL, OpenACC

CUDA, C++ AMP

Renderscript

Cilk Plus, GCD

Parallelism in C++11/14

Fundamentals: Memory model, Atomics

Basics: thread, mutex, condition_variable, async, future

Quicksort: Serial

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}

Quicksort: Use Threads

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        std::thread t1([&] { quicksort(v, start, pivot - 1); });
        std::thread t2([&] { quicksort(v, pivot + 1, end); });
        t1.join();
        t2.join();
    }
}

Problem 1: expensive

Problem 2: Fork-join not enforced

Problem 3: Exceptions??

Quicksort: Fork-Join Parallelism

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}

[Diagram: each of the two recursive quicksort calls is labeled "task"; together they are labeled "parallel region".]

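Quicksort: Using Task Regions (N3832)

void quicksort(int *v, int start, int end) {
    if (start < end) {
        // Computed outside the region so that it outlives the child tasks,
        // which capture it by reference.
        int pivot = partition(v, start, end);
        task_region([&] (auto& r) {
            r.run([&] { quicksort(v, start, pivot - 1); });
            r.run([&] { quicksort(v, pivot + 1, end); });
        });
    }
}

[Diagram: each r.run call is labeled "task"; the task_region call is labeled "parallel region".]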
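task_region comes from the N3832 proposal and is not part of standard C++. To experiment with the shape of the API, a rough stand-in can be built on std::async. Everything named *_sketch below is hypothetical, one thread per task is used instead of a work-stealing scheduler, and the size cutoff is an assumption.

```cpp
#include <algorithm>  // std::sort
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for the region handle; not the type from N3832.
class region_handle_sketch {
public:
    // Fork: start f (a callable returning void) as a child of this region.
    template <class F>
    void run(F&& f) {
        children_.push_back(std::async(std::launch::async, std::forward<F>(f)));
    }
    // Join: wait for every child; get() rethrows a child's exception.
    void wait_all() {
        for (auto& child : children_) child.get();
    }
private:
    std::vector<std::future<void>> children_;
};

// Runs body(handle) and returns only after all forked tasks have finished.
template <class Body>
void task_region_sketch(Body&& body) {
    region_handle_sketch handle;
    body(handle);
    handle.wait_all();
}

// Assumed Lomuto partition over the inclusive range [start, end].
int lomuto_partition(int* v, int start, int end) {
    int pivot_value = v[end];
    int store = start;
    for (int i = start; i < end; ++i) {
        if (v[i] < pivot_value) std::swap(v[i], v[store++]);
    }
    std::swap(v[store], v[end]);
    return store;
}

void quicksort_region(int* v, int start, int end) {
    if (start >= end) return;
    if (end - start < 4096) {  // assumption: small ranges stay serial
        std::sort(v + start, v + end + 1);
        return;
    }
    int pivot = lomuto_partition(v, start, end);  // lives until the join
    task_region_sketch([&](auto& r) {
        r.run([&] { quicksort_region(v, start, pivot - 1); });
        r.run([&] { quicksort_region(v, pivot + 1, end); });
    });
}
```

The join is structural here: there is no way to leave task_region_sketch while a child is still running, which is the property the raw std::thread version does not enforce.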

Under The Hood…

Work Stealing Scheduling

[Figure, four animation frames: processors proc 1, proc 2, proc 3 and proc 4; proc 1 holds a queue of work items whose ends are labeled "Old items" and "New items"; in the last frame a "Thief" label appears.]

Fork-Join Parallelism and Work Stealing

e();
task_region([] (auto& r) {
    r.run(f);
    g();
});
h();

[Diagram: e() runs first, then f() and g() side by side, then h().]

Q1: What thread runs f?

Q2: What thread runs g?

Q3: What thread runs h?

Work Stealing Design Choices

What Thread Executes After a Spawn?
    Child Stealing
    Continuation (parent) Stealing

What Thread Executes After a Join?
    Stalling: initiating thread waits
    Greedy: the last thread to reach join continues

task_region([&] (auto& r) {
    for (int i = 0; i < n; ++i)
        r.run(f);
});
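For comparison, plain std::async fixes the answers to Q1 through Q3 without any stealing: the child f gets its own thread, and the initiating thread runs g, stalls at the join, and then runs h. The sketch below records the thread ids to make that visible; who_ran and run_fork_join are illustrative names, and a work-stealing runtime may legitimately give different answers.

```cpp
#include <future>
#include <thread>

// Thread ids observed for f, g and h (illustrative helper type).
struct who_ran {
    std::thread::id f;
    std::thread::id g;
    std::thread::id h;
};

// Fork f with std::async, run g inline, join, then run h.
who_ran run_fork_join() {
    who_ran ids;
    {
        // f: forked as a child on a new thread.
        auto child = std::async(std::launch::async,
                                [&] { ids.f = std::this_thread::get_id(); });
        // g: continues on the initiating thread.
        ids.g = std::this_thread::get_id();
        // Join: the initiating thread stalls here until f is done.
        child.get();
    }
    // h: runs after the join, again on the initiating thread.
    ids.h = std::this_thread::get_id();
    return ids;
}
```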

Part 2: The Algorithms

Alex Stepanov: Start With The Algorithms

Inspiration

Performing Parallel Operations On Containers

Intel Threading Building Blocks

Microsoft Parallel Patterns Library, C++ AMP

Nvidia Thrust

Parallel STL

Just like STL, only parallel…

Can be faster, if you know what you’re doing

Two Execution Policies: std::par, std::vec

Parallelization: What’s a Big Deal?

Why not already parallel?

std::sort(begin, end, [](int a, int b) { return a < b; });

User-provided closures must be thread safe:

int comparisons = 0;
std::sort(begin, end, [&](int a, int b) { comparisons++; return a < b; });

But also special-member functions, std::swap etc.
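The counting comparator above becomes a data race as soon as two threads call it. A minimal sketch of one fix, assuming an atomic counter is acceptable for the use case; sort_halves_counting is an illustrative helper that sorts the two halves of a vector concurrently and shares one comparator between them:

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Sorts each half of v on its own thread and returns the total number of
// comparisons made. The halves are sorted independently, not merged.
long sort_halves_counting(std::vector<int>& v) {
    std::atomic<long> comparisons{0};
    // Safe to call from several threads: the only shared state is atomic.
    auto counting_less = [&comparisons](int a, int b) {
        comparisons.fetch_add(1, std::memory_order_relaxed);
        return a < b;
    };
    auto mid = v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2);
    std::thread left([&] { std::sort(v.begin(), mid, counting_less); });
    std::sort(mid, v.end(), counting_less);
    left.join();
    return comparisons.load();
}
```

With a plain int counter the same code would have undefined behavior, even though each std::sort call touches a disjoint part of the vector.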

It’s a Contract

What the user can do

What the implementer can do

Asymptotic Guarantees:
    std::sort: O(n*log(n))
    std::stable_sort: O(n*log^2(n))
    what about parallel sort?

What is a valid implementation? (see next slide)

Chaos Sort

template<typename Iterator, typename Compare>
void chaos_sort( Iterator first, Iterator last, Compare comp ) {
    auto n = last-first;
    std::vector<char> c(n);
    for(;;) {
        bool flag = false;
        for( size_t i=1; i<n; ++i ) {
            c[i] = comp(first[i],first[i-1]);
            flag |= c[i];
        }
        if( !flag ) break;
        for( size_t i=1; i<n; ++i )
            if( c[i] )
                std::swap( first[i-1], first[i] );
    }
}

Execution Policies

Built-in Execution Policies:

extern const sequential_execution_policy seq;
extern const parallel_execution_policy par;
extern const vector_execution_policy vec;

Dynamic Execution Policy:

class execution_policy
{
public:
    // ...
    const type_info& target_type() const;
    template<class T> T *target();
    template<class T> const T *target() const;
};

Using Execution Policy To Write Parallel Code

std::vector<int> vec = ...

// standard sequential sort
std::sort(vec.begin(), vec.end());

using namespace std::experimental::parallel;

// explicitly sequential sort
sort(seq, vec.begin(), vec.end());

// permitting parallel execution
sort(par, vec.begin(), vec.end());

// permitting vectorization as well
// (policy name qualified because the local vector is also named vec)
sort(std::experimental::parallel::vec, vec.begin(), vec.end());

Picking Execution Policy Dynamically

size_t threshold = ...

execution_policy exec = seq;

if (vec.size() > threshold) {
    exec = par;
}

sort(exec, vec.begin(), vec.end());
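The same static versus dynamic split can be imitated in portable C++14 with tag dispatch. The sketch below is not the TS implementation: seq_tag, par_tag, policy_kind, sort_with and sort_adaptive are hypothetical names, and the "parallel" path is a simple two-way split followed by a merge.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical policy tags, standing in for seq and par.
struct seq_tag {};
struct par_tag {};

// Static dispatch: the policy type selects the overload at compile time.
template <class It>
void sort_with(seq_tag, It first, It last) {
    std::sort(first, last);
}

template <class It>
void sort_with(par_tag, It first, It last) {
    It mid = first + (last - first) / 2;
    // Sort the left half on another thread, the right half on this one.
    auto left = std::async(std::launch::async, [=] { std::sort(first, mid); });
    std::sort(mid, last);
    left.get();
    std::inplace_merge(first, mid, last);
}

// Dynamic dispatch: a runtime value picks one of the static overloads.
enum class policy_kind { seq, par };

template <class It>
void sort_with(policy_kind kind, It first, It last) {
    if (kind == policy_kind::par) {
        sort_with(par_tag{}, first, last);
    } else {
        sort_with(seq_tag{}, first, last);
    }
}

// Mirrors the slide: choose the policy from the input size.
void sort_adaptive(std::vector<int>& data, std::size_t threshold) {
    policy_kind exec = policy_kind::seq;
    if (data.size() > threshold) {
        exec = policy_kind::par;
    }
    sort_with(exec, data.begin(), data.end());
}
```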

Exception Handling

In C++ philosophy, no exception is silently ignored

Exception list: container of exception_ptr objects

try {
    // the initial value precedes the two function objects, as in serial std::inner_product
    r = std::inner_product(std::par, a.begin(), a.end(), b.begin(), 0, func1, func2);
}
catch (const exception_list& list) {
    for (auto& exptr : list) {
        // process exception pointer exptr
    }
}
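The idea of gathering every exception, rather than reporting one and dropping the rest, can be tried out with std::exception_ptr and std::async. This is a sketch of the concept only; run_collecting is an illustrative helper and a plain vector stands in for the exception_list type.

```cpp
#include <exception>
#include <future>
#include <stdexcept>
#include <vector>

// Runs func(i) for i in [0, n), one task per index, and returns every
// exception that was thrown instead of silently dropping any of them.
template <class Func>
std::vector<std::exception_ptr> run_collecting(int n, Func func) {
    std::vector<std::future<void>> tasks;
    for (int i = 0; i < n; ++i) {
        tasks.push_back(std::async(std::launch::async, [func, i] { func(i); }));
    }
    std::vector<std::exception_ptr> errors;
    for (auto& task : tasks) {
        try {
            task.get();  // rethrows the exception stored by the task, if any
        } catch (...) {
            errors.push_back(std::current_exception());
        }
    }
    return errors;
}
```

Each collected pointer can later be inspected with std::rethrow_exception inside a try block, much like the loop over the exception list on the slide.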

Vectorization: A Tale From Agriculture


Idea: Fewer Tractors, Wider Plows

Vectorization: What’s a Big Deal?

int a[n] = ...;
int b[n] = ...;
for (int i = 0; i < n; ++i) {
    a[i] = b[i] + c;
}

movdqu  xmm1, XMMWORD PTR _b$[esp+eax+132]
movdqu  xmm0, XMMWORD PTR _a$[esp+eax+132]
paddd   xmm1, xmm2
paddd   xmm1, xmm0
movdqu  XMMWORD PTR _a$[esp+eax+132], xmm1

a[i:i+3] = b[i:i+3] + c;
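Written out by hand, the a[i:i+3] notation corresponds to processing the loop in blocks of four plus a scalar remainder. In the sketch below the inner lane loop stands in for a single SIMD instruction; add_scalar and add_blocked are illustrative names, and a and b are assumed not to overlap.

```cpp
#include <cstddef>

// Scalar reference loop from the slide.
void add_scalar(int* a, const int* b, int c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = b[i] + c;
    }
}

// Same computation in blocks of four elements.
// Assumption: a and b do not alias.
void add_blocked(int* a, const int* b, int c, std::size_t n) {
    constexpr std::size_t width = 4;
    std::size_t i = 0;
    for (; i + width <= n; i += width) {
        // One "vector" step: a[i:i+3] = b[i:i+3] + c
        for (std::size_t lane = 0; lane < width; ++lane) {
            a[i + lane] = b[i + lane] + c;
        }
    }
    // Remainder: the last n % 4 elements are handled one at a time.
    for (; i < n; ++i) {
        a[i] = b[i] + c;
    }
}
```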

Vector Lane is not a Thread!

Taking locks
    Thread with thread_id x takes a lock…
    Then another “thread” with the same thread_id enters the lock…
    Deadlock!!!

Exceptions
    Can we unwind 1/4th of the stack?

Vectorization: Not So Easy Any More…

void f(int* a, int* b) {
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c;
        func();
    }
}

mov  ecx, DWORD PTR _b$[esp+esi+140]
add  ecx, edi
add  DWORD PTR _a$[esp+esi+140], ecx
call func

Aliasing?

Side effects?

Dependence?

Exceptions?

Vectorization Hazard: Locks

for (int i = 0; i < n; ++i) {
    lock.enter();
    a[i] = b[i] + c;
    lock.release();
}

for (int i = 0; i < n; i += 4) {
    for (int j = 0; j < 4; ++j) lock.enter();

    a[i:i+3] = b[i:i+3] + c;

    for (int j = 0; j < 4; ++j) lock.release();
}

This transformation is not safe!

Consider: f takes a lock, g releases the lock:

?

How Do We Get This?

void f(int* a, int* b) {
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c;
        func();
    }
}

for (int i = 0; i < n; i += 4) {
    a[i:i+3] = b[i:i+3] + c;
    for (int j = 0; j < 4; ++j) func();
}

Need a helping hand from the programmer, because…
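The compiler cannot make this change on its own because it reorders observable events: in the scalar loop every store is followed directly by a call to func, while in the blocked loop four stores come before four calls. A small sketch that logs both orders; trace_scalar and trace_blocked are illustrative, and n is assumed to be a multiple of 4 in the blocked version.

```cpp
#include <string>
#include <vector>

// Event order of the original loop: store, call, store, call, ...
std::vector<std::string> trace_scalar(int n) {
    std::vector<std::string> log;
    for (int i = 0; i < n; ++i) {
        log.push_back("store " + std::to_string(i));
        log.push_back("func");
    }
    return log;
}

// Event order after the transformation: four stores, then four calls.
// Assumption: n is a multiple of 4.
std::vector<std::string> trace_blocked(int n) {
    std::vector<std::string> log;
    for (int i = 0; i < n; i += 4) {
        for (int j = 0; j < 4; ++j) {
            log.push_back("store " + std::to_string(i + j));
        }
        for (int j = 0; j < 4; ++j) {
            log.push_back("func");
        }
    }
    return log;
}
```

Both traces contain the same events, but in a different order, so the two loops are equivalent only if func neither reads a nor has side effects that depend on that order. Only the programmer can promise that.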

Vector Loop with Parallel STL

void f(int* a, int* b) {
    integer_iterator begin {0};
    integer_iterator end {n};

    std::for_each(std::vec, begin, end, [&](int i) {
        a[i] = b[i] + c;
        func();
    });
}
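The slide does not show integer_iterator. One possible shape, written here as a hypothetical integer_iterator_sketch, is an iterator whose "element" is simply its current index. The example runs it through the serial std::for_each, since the std::vec policy is only a proposal.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Hypothetical counting iterator: dereferencing yields the current index.
class integer_iterator_sketch {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = int;
    using difference_type = std::ptrdiff_t;
    using pointer = const int*;
    using reference = int;  // values are returned by copy

    explicit integer_iterator_sketch(int value = 0) : value_(value) {}

    int operator*() const { return value_; }
    integer_iterator_sketch& operator++() { ++value_; return *this; }
    integer_iterator_sketch operator++(int) {
        integer_iterator_sketch old = *this;
        ++value_;
        return old;
    }
    friend bool operator==(const integer_iterator_sketch& x,
                           const integer_iterator_sketch& y) {
        return x.value_ == y.value_;
    }
    friend bool operator!=(const integer_iterator_sketch& x,
                           const integer_iterator_sketch& y) {
        return !(x == y);
    }

private:
    int value_;
};

// Index-based loop body expressed as an algorithm call over [0, n).
void add_constant(int* a, const int* b, int c, int n) {
    integer_iterator_sketch begin{0};
    integer_iterator_sketch end{n};
    std::for_each(begin, end, [&](int i) { a[i] = b[i] + c; });
}
```

Expressing the loop as an algorithm over indices is what gives an execution policy a place to attach.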

Parallelization vs. Vectorization

Parallelization
    Threads
    Stack
    Good for divergent code
    Relatively heavy-weight

Vectorization
    Vector Lanes
    No stack
    Lock-step execution
    Very light-weight

When To Vectorize

std::par
    No race conditions
    No aliasing

std::vec
    Same as std::par, plus:
    No Exceptions
    No Locks
    No/Little Divergence

References

N3832: Task Region

N3872: A Primer on Scheduling Fork-Join Parallelism with Work Stealing

N3724: A Parallel Algorithms Library

N3850: Working Draft, Technical Specification for C++ Extensions for Parallelism

parallelstl.codeplex.com