Choosing the Right Parallel Model
Stephen Blair-Chappell
Parallel Processors Need Parallel Applications
Intel® Parallel Building Blocks provides the tools you need to
build parallel applications
What is a good parallel programming model?
• Easy to use
Agenda
Software and Services Group
Different Kinds of Programmer
• "Stupid compiler!! The inc sets the Z flag, so what's the compare doing here? Wasted cycle! Yuck!"
• "Hey honey! I found this piece of code, made two simple changes and now it works!! So what are you doing tonight?"
• "What CPU does it run on? Huh? What's the difference? I just need to reformat the output here and we can run that experiment again."
All programmers are not equal.
Family of Parallel Models
• Libraries
• Different levels of abstraction
Intel Parallel Programming Model
Tasking fundamental concepts:
– Large teams develop components independently
– Calling into libraries
• Utilize HW resources
Family of Parallel Models
Intel® Cilk Plus
• Easy to learn; both C and C++
• Serial semantics
• 3 keywords
• Tasks, not threads
• Array operations
• Guaranteed vector implementation by compiler (Pragma SIMD)
Anatomy of a spawn

void f() {
    cilk_spawn g();   // g() may run in parallel with the rest of f()
    work;
    cilk_sync;        // wait here for spawned children to finish
    work;
}

Work stealing: when another worker is available, it can steal the continuation of f() while g() runs.
cilk_for and reducer

#include <cilk/reducer_opadd.h>

cilk::reducer_opadd<int> gInt;

void f() {
    cilk_for (int i = 0; i < 8; i++) {
        gInt++;
    }
}

The Cilk scheduler automatically merges each worker's private view of the reducer into the final value, avoiding a data race on gInt.
Array notations for C/C++
Data parallel operations on array sections; vectorization is always semantically correct.

<array base>[<lower bound>:<length>[:<stride>]]+

B[2:6]    // Elements 2 to 7 of vector B
C[:][5]   // Column 5 of matrix C
D[0:3:2]  // Elements 0, 2, 4 of vector D
A[:] = B[:]  // guaranteed vector copy
Elemental Functions
• Use scalar syntax to describe an operation on a single element
• Apply the operation to arrays in parallel
• Utilize both vector parallelism and core parallelism

__declspec(vector)
double my_ef(double d1) {
    double d2 = d1 - (sigma * time_sqrt);
    ...
}

a[j] = my_ef(b[j]);   // invoked across the array in parallel

[Concurrency not yet …]
Pragma SIMD
• Write a standard C/C++/Fortran loop; add a pragma to get the compiler to vectorize it
• The compiler does not prove equivalence to the sequential loop and applies no performance heuristics
• The programmer may need to provide additional clauses for correct code generation
  – private, reduction, scalar
• Elemental functions can be called from the loop

#pragma simd
for (int j = 0; j < N; j++)
    a[j] = my_ef(b[j]);
Family of Parallel Models
Intel® Threading Building Blocks (TBB)
• C++ library based on generic programming
• Tasks, not threads
• Common parallel patterns
• Scalable memory allocator
Threads
Serial code:

const int M = 100000;

for (int i = 0; i < M; i++) {
    array[i] *= 2;
}
An Example using parallel_for
• Include and initialize the library

#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
• Use the parallel_for pattern (blue = original code, green = provided by TBB, red = boilerplate for the library)

Serial loop:

for (int i = 0; i < M; i++) {
    array[i] *= 2;
}

Parallel version:

class ChangeArrayBody {
    float *array;
public:
    ChangeArrayBody(float *a) : array(a) {}
    void operator()(const blocked_range<int> &r) const {
        for (int i = r.begin(); i != r.end(); i++)
            array[i] *= 2;
    }
};

parallel_for(blocked_range<int>(0, M, IdealGrainSize),
             ChangeArrayBody(array));
Spawned tasks are made available to thieves (work stealing).

float Example() {
    // ... loop body supplied via functor::operator() ...
    return sum;
}
Lambda Syntax
[capture](parameters) -> return-type { body }
• The return type can be omitted when it is void or the body is "return expr;"
• [&] ⇒ capture by reference
• [=] ⇒ capture by value

[]{ return rand(); }                                  // no parameters; return type deduced
[](int x, int y) { if (x < y) return x; return y; }   // explicit parameters
#include <tbb/tbb.h>
#include <vector>

void RunWhileLoop()
{
    // ...
}
Family of Parallel Models
Libraries
OpenMP
What is OpenMP™?
Portable, shared-memory multi-processing API
– Fortran 77, Fortran 90, C, and C++
– Multi-vendor support, for both Unix and Windows
• Standardizes loop-level parallelism
• Supports coarse-grained parallelism
• Combines serial and parallel code in a single source – no need for a separate source-code revision
• See www.openmp.org for standard documents, tutorials, sample code
• Intel is a premier member of the OpenMP Architecture Review Board
Parallel APIs: OpenMP*

#pragma omp critical
C$OMP PARALLEL REDUCTION (+: A, B)
call OMP_INIT_LOCK (ilok)
C$OMP THREADPRIVATE(/ABC/)

OpenMP: An API for Writing Multithreaded Applications
• A set of compiler directives and library routines for parallel application programmers
• Makes it easy to create multithreaded (MT) programs in Fortran, C and C++
OpenMP Architecture
• Fork-Join Model
• Worksharing constructs
• Synchronization constructs
• Directive/pragma-based parallelism
OpenMP Programming Model
Fork-Join Parallelism: the master thread spawns a team of threads as needed.
Parallelism is added incrementally until performance goals are met; i.e. the sequential program evolves into a parallel program.
[Figure: parallel regions with the master thread in red, including a nested parallel region]
Hello World
• This program runs on three threads and prints this:

Hello World
Hello World
Hello World
Iter: 1
Iter: 2
Iter: 3
Iter: 4
Goodbye World
Goodbye World
Goodbye World

void main()
The Private Clause
• Private variables are un-initialized; a C++ object is default-constructed
• Any value the variable held before the parallel region is undefined inside (and after) it

void* work(float* c, int N) {
    float x, y; int i;
    #pragma omp parallel for private(x,y)
    for (i = 0; i < N; i++) {
        x = a[i]; y = b[i];
        c[i] = x + y;
    }
}
OpenMP* Critical Construct

float R1, R2;
#pragma omp parallel
{
    float A, B;
    #pragma omp for
    for (int i = 0; i < N; i++) {
        B = big_job(i);
        #pragma omp critical (R1_lock)
        consum(B, &R1);
        A = bigger_job(i);
        #pragma omp critical (R2_lock)
        consum(A, &R2);
    }
}

Threads wait their turn: at any one time, only one thread calls consum(), thereby protecting R1 and R2 from race conditions. Naming the critical constructs (R1_lock, R2_lock) lets the two unrelated critical sections proceed independently, which can improve performance.
Parallel Sections
• Independent sections of code can execute concurrently
The OpenMP Task
Mixing-and-Matching
Array Building Blocks
Why Mix?
• Using third-party libraries, or code developed by other developers
• Supplementing one parallel model with bits 'borrowed' from another model
Different Parallel Constructs

gInt = 0;
cilk_for (int i = 0; i < 8; i++) {
    Hello(i + 1);
    gInt++;
}
scalable_free(pA);   // TBB scalable allocator used alongside Cilk
OpenMP
• Good for monolithic applications
• But a SW architect needs to:
  – break the application work into chunks,
  – determine which thread does what,
  – make threads do an equal amount of work.
• Good performance when it works, but some applications are too complex to design with a global view.
• Hard to use when the application is composed of libraries, or of independently developed modules.
Making the right choice
Parallel Model: number of words
• Use word-bucket counting to approximate the amount of editing needed
Factors influencing your choice
1. Language
   • C/C++
     – Cilk Plus
   • Fortran
     – OpenMP
     – Coarrays
2. Operating System
   – OpenMP (supported by GCC)
3. How many developers? Exclusive control of the machine?
   • Multiple developers / third-party libraries
4. Type of Parallelism / Type of Work
5. What Compiler?
   • MS: no Array Notation
6. Standards
   • Emerging standards
   • Open source
7. Productised?
8. CPU
   – OpenMP
9. Open Source?
What is a good parallel programming model?
• Easy to use
Optimization Notice
Intel® compilers, associated libraries and associated development
tools may include or utilize options that optimize for instruction
sets that are available in both Intel® and non-Intel
microprocessors (for example SIMD instruction sets), but do not
optimize equally for non-Intel microprocessors. In addition,
certain compiler options for Intel compilers, including some that
are not specific to Intel micro-architecture, are reserved for
Intel microprocessors. For a detailed description of Intel compiler
options, including the instruction sets and specific
microprocessors they implicate, please refer to the “Intel®
Compiler User and Reference Guides” under “Compiler Options." Many
library routines that are part of Intel® compiler products are more
highly optimized for Intel microprocessors than for other
microprocessors. While the compilers and libraries in Intel®
compiler products offer optimizations for both Intel and
Intel-compatible microprocessors, depending on the options you
select, your code and other factors, you likely will get extra
performance on Intel microprocessors.
Intel® compilers, associated libraries and associated development
tools may or may not optimize to the same degree for non-Intel
microprocessors for optimizations that are not unique to Intel
microprocessors. These optimizations include Intel®
Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD
Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD
Extensions 3 (Intel® SSSE3) instruction sets and other
optimizations. Intel does not guarantee the availability,
functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent
optimizations in this product are intended for use with Intel
microprocessors.
While Intel believes our compilers and libraries are excellent
choices to assist in obtaining the best performance on Intel® and
non-Intel microprocessors, Intel recommends that you evaluate other
compilers and libraries to determine which best meet your
requirements. We hope to win your business by striving to offer the
best performance of any compiler or library; please let us know if
you find we do not.
Notice revision #20101101
Backup
Model comparison (table partially recovered; some column headers and rows were lost in extraction):

Model            Languages1   Learning    ?    ?          ?
Intel Cilk Plus  C++, C       Easy        Yes  Very Good  Stay Tuned / Yes4
Intel TBB        C++          Medium      Yes  Very Good  Yes (Open Source) / Yes
Posix            C            Easy/Hard2  No   Difficult  Yes (Many Vendors)

[Remaining fragments from lost rows: "Distributed", "Very Good (Many Vendors)", "Linux, Windows, Apple", "Good", "Intel Parallel Building Blocks", "Other Standards"]
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

{
    cilk_spawn []{
        Hello(1);
        Hello(2);
        Hello(3);
        Hello(4);
    }();
    Hello(5);
    Hello(6);
    Hello(7);
    Hello(8);
    cilk_sync;
}