An introduction to parallel programming
Definitions
› Parallel computing
– Partition computation across different compute engines
› Distributed computing
– Partition computation across different machines
– Same principle, more general
Outline
› Introduction to "traditional" programming
– Writing code
– Operating systems
– …
› Why do we need parallel programming?
– Focus on programming shared memory
› Different ways of parallel programming
– PThreads
– OpenMP
– MPI?
– GPU/accelerator programming
As an aside…

› A bit of computer architecture
– We will understand why…
– Focus on shared-memory systems
› A bit of algorithms
– We will understand why…
› A bit of performance analysis
– Which is our ultimate goal!
– Being able to identify bottlenecks
Programming basics
Take-aways
› Programming basics
– Variables
– Functions
– Loops
› Programming stacks
– BSP
– Operating systems
– Runtimes
› Computer architectures
– Computing domains
– Single processor/multiple processors
– From single- to multi- to many-core
Why do we need parallel computing?
Increase the performance of our machines

› Scale-up
– Solve a "bigger" problem in the same time
› Scale-out
– Solve the same problem in less time
Yes, but…

› Why (highly) parallel machines…
› …and not faster single-core machines?
The answer #1 – Money
The answer #2 – the "hot" one

Moore's law
› "The number of transistors that we can pack in a given die area doubles every 18 months"

Dennard scaling
› "Performance per watt of computing is growing exponentially at roughly the same rate"
The answer #2 – the "hot" one

[Figure: four decades of microprocessor trend data – transistors (thousands), clock frequency (MHz), power (W), and performance per clock (ILP)]

› SoC design paradigm
› Gordon Moore
– His law is still valid, but…
› "The free lunch is over" – Herb Sutter, 2005
› Performance no longer comes from frequency; it comes from parallelism
In other words…
[Figure: CPU power density from 1970 to 2020, climbing past "hot plate" and "summer temperature" toward "nuclear reactor", "rocket nozzle", and "surface of the sun" levels; timeline annotated with the first PCs, the explosion of the web, and modern computers]
Instead of going faster…

› …(or rather: go faster, but through) parallelism!

Problem #1
› New computer architectures
› At least three architectural templates

Problem #2
› Need to program them efficiently
› HPC already has this problem!

The problem
› Programmers must know a bit about the architecture!
› To make parallelization effective
› "Let's run this on a GPU. It certainly goes faster" (cit.)
The Big problem

› Effectively programming in parallel is difficult

"Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?"
– Brian Kernighan
I am *really* sorry guys…

› I will give you code…
› …but first I need to give you some maths…
› …and then, some architectural principles
Amdahl's Law
Amdahl's law

› A sequential program that takes 100 seconds to execute
› Only 95% of it can run in parallel (and that's a lot)
› And… you are an extremely good programmer, and you have a machine with one billion cores, so the parallel part takes ~0 seconds
› So:

$$T_{par} = 100\,s - 95\,s = 5\,s$$

$$\mathrm{Speedup} = \frac{100\,s}{5\,s} = 20\times$$

…20×, on one billion cores!!!
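In general, Amdahl's law gives the speedup for a parallel fraction $p$ on $n$ cores as $S(n) = \frac{1}{(1-p) + p/n}$. A minimal sketch in C (0.95 and one billion are the values from this example):

#include <stdio.h>

/* Amdahl's law: speedup for parallel fraction p on n cores */
static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    /* 95% parallel fraction, one billion cores */
    printf("Speedup: %.2fx\n", amdahl_speedup(0.95, 1e9));  /* ~20.00x */
    return 0;
}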
Computer architecture
Step-by-step
1. "Traditional" multi-cores– Typically, shared-memory
– Max 8-16 cores
– This laptop
2. Many-cores– GPUs but not only
– Heterogeneous architectures
3. More advanced stuff– Field-programmable Gate Arrays
– Neural Networks
19
Symmetric multi-processing

› Memory: centralized with bus interconnect, I/O
› Typically multi-core (sub)systems
– Examples: Sun Enterprise 6000, SGI Challenge, Intel (this laptop)

[Diagram: CPU 0–3, each with one or more cache levels, connected to main memory and the I/O system; the interconnect can be 1 bus, N busses, or any network]
Asymmetric multi-processing

› Memory: centralized with uniform access time (UMA) and bus interconnect, I/O
› Typically multi-core (sub)systems
– Examples: ARM big.LITTLE, NVIDIA Tegra X2 (Drive PX)

[Diagram: two different CPU clusters (CPU 0–1 and CPU A–B), each core with one or more cache levels, connected to main memory and the I/O system; the interconnect can be 1 bus, N busses, or any network]
SMP – distributed shared memory

› Non-Uniform Memory Access (NUMA)
› Scalable interconnect
– Typically many cores
– Examples: embedded accelerators, GPUs

[Diagram: CPU 0–3, each with a cache ($) and a scratchpad memory (SPM), connected to main memory and the I/O system through a scalable interconnection]
Go complex: NVIDIA's Tegra
› Complex heterogeneous system
– 3 ISAs
– 2 subdomains
– Shared memory between the Big.SUPER host and the GP-GPU
UMA vs. NUMA

› Shared memory: every thread can access every memory item
– (Not considering security issues…)
› Uniform Memory Access (UMA) vs. Non-Uniform Memory Access (NUMA)
– Different access times for different memory spaces

[Diagram, UMA: CPU 0–3 all attached to one main memory. Diagram, NUMA: sixteen CPUs (CPU 0–15) in four clusters of four, each cluster with its own scratchpad memory (SPM0–SPM3)]
Example NUMA access latencies, in clock cycles:

           MEM0  MEM1  MEM2  MEM3
CPU 0…3       0    10    20    10
CPU 4…7      10     0    10    20
CPU 8…11     20    10     0    10
CPU 12…15    10    20    10     0
Some definitions
What is…
› …a core?
– An electronic circuit that executes instructions (=> programs)
› …a program?
– The implementation of an algorithm
› …a process?
– A program that is executing
› …a thread?
– A unit of execution (of a process)
› …a task?
– A unit of work (of a program)
[Diagram: a process P consists of code (code.c), DATA, and one or more threads T in shared memory; the threads execute on CORE 0–3, which access MEM]
What is a task?
› Operating System task
› Real-time task
› OpenMP task
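As a concrete example of the last one, a minimal sketch of OpenMP tasks (the loop bound is illustrative; compile with e.g. gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < 4; i++) {
        /* Each task is a unit of work the runtime may
           execute on any thread of the team */
        #pragma omp task firstprivate(i)
        printf("task %d ran on thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}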
Symmetric multi-processing

› Memory: centralized with bus interconnect, I/O
› Typically multi-core (sub)systems
– Examples: Sun Enterprise 6000, SGI Challenge, Intel (this laptop)

[Diagram: the same four-CPU SMP system, now with two processes (P0, P1) whose threads (T) are spread across CPU 0–3]
…start simple…
Something you're used to…

› Multiple processes
› That communicate via shared data

[Diagram: process P0 and process P1, each with one thread T, both reading and writing a shared DATUM through some as-yet-unspecified mechanism (????)]
Howto #1 – MPI

› Multiple processes
› That communicate via shared data

[Diagram: the thread of process P0 passes the DATUM to process P1 via MPI_Send / MPI_Recv]
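A minimal sketch of this pattern, assuming exactly two ranks (the integer payload is illustrative; launch with something like mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, datum = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* P0 sends the datum to P1 */
        MPI_Send(&datum, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* P1 receives it */
        MPI_Recv(&datum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("P1 received %d\n", datum);
    }

    MPI_Finalize();
    return 0;
}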
Howto #2 – UNIX pipes

› Multiple processes
› That communicate via shared data

[Diagram: process P0 writes the DATUM into a pipe; process P1 reads it from the other end]

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    char string[] = "Hello, world!\n";
    char readbuffer[80];

    /* The pipe is created once, then shared with the child via fork():
       the child writes the datum, the parent reads it */
    pipe(fd);

    if (fork() == 0) {
        /* Send "string" through the output side of the pipe */
        close(fd[0]);
        write(fd[1], string, strlen(string) + 1);
        return 0;
    }

    /* Receive "string" from the input side of the pipe */
    close(fd[1]);
    ssize_t nbytes = read(fd[0], readbuffer, sizeof(readbuffer));
    printf("Received %zd bytes: %s", nbytes, readbuffer);
    return 0;
}
Howto #3 – Files

› Multiple processes
› That communicate via shared data

[Diagram: process P0 writes the DATUM into File.txt; process P1 reads it back]
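A minimal sketch, assuming both processes agree on the name File.txt and ignoring synchronization (in reality, P1 must somehow know when P0 has finished writing):

#include <stdio.h>

/* P0: write the datum into the shared file */
static void producer(void)
{
    FILE *f = fopen("File.txt", "w");
    if (!f) return;
    fputs("Hello, world!\n", f);
    fclose(f);
}

/* P1: read the datum back from the same file */
static void consumer(void)
{
    char buf[80];
    FILE *f = fopen("File.txt", "r");
    if (!f) return;
    if (fgets(buf, sizeof(buf), f))
        printf("Received: %s", buf);
    fclose(f);
}

int main(void)
{
    producer();   /* called in sequence here; in reality P0 and P1 are separate processes */
    consumer();
    return 0;
}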
Shared memory

› Coherence problem
– Memory consistency issues
– Data races
› Can share data pointers
– Ease of use

[Diagram: a single process P0 with multiple threads T, all reading and writing a shared DATUM directly in shared memory]
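Shared-memory threading is where the PThreads model from the outline fits. A minimal sketch, assuming two threads of one process updating a shared datum, with a mutex guarding against the data race noted above (compile with -pthread; names are illustrative):

#include <pthread.h>
#include <stdio.h>

static int datum = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread performs a read-modify-write on the shared datum;
   the mutex makes the update atomic */
static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    datum++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, NULL);
    pthread_create(&t1, NULL, worker, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("datum = %d\n", datum);  /* always 2, thanks to the mutex */
    return 0;
}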
References

› "Calcolo parallelo" (Parallel computing) course website
– http://hipert.unimore.it/people/paolob/pub/PhD/index.html
› My contacts
– [email protected]
– http://hipert.mat.unimore.it/people/paolob/
› Useful links
– A "small blog": http://www.google.com