Parallel Programming Models
HPC course, Florida State University (engelen/courses/HPC/Models.pdf), 2/7/17
Overview
- Basic concepts
- Programming models
- Multiprogramming
- Shared address space model
  - UMA versus NUMA
  - Distributed shared memory
  - Task parallel
  - Data parallel, vector and SIMD
- Message passing model
- Hybrid systems
- BSP model
Parallel Programming: Basic Concepts

- Control
  - How is parallelism created, implicitly (hardwired) or explicitly?
  - What orderings exist between operations?
  - How do different threads of control synchronize?
- Naming data
  - What data is private and what is shared?
  - How is logically shared data accessed or communicated?
- Operations on data
  - What are the basic operations on shared data?
  - Which operations are considered atomic?
- Cost
  - How do we account for the cost of each of the above to achieve parallelism (man hours spent, software/hardware cost)?
Parallel Programming Models
- A programming model is a conceptualization of the machine that a programmer uses for developing applications
  - Multiprogramming model
    - A set of independent tasks with no communication or synchronization at the program level, e.g. a web server sending pages to browsers
  - Shared address space (shared memory) programming
    - Tasks operate on and communicate via shared data, like bulletin boards
  - Message passing programming
    - Explicit point-to-point communication, like phone calls (connection-oriented) or email (connectionless, mailbox posts)
Flynn’s Taxonomy
- Single instruction stream, single data stream (SISD)
  - Traditional PC system
- Single instruction stream, multiple data stream (SIMD)
  - Similar to MMX/SSE/AltiVec multimedia instruction sets
  - MASPAR
- Multiple instruction stream, multiple data stream (MIMD)
  - Single program, multiple data (SPMD) programming: each processor executes a copy of the program
[Figure: Flynn's taxonomy as a grid of instruction stream (single/multiple) versus data stream (single/multiple), with the SISD, SIMD, and MIMD quadrants labeled]
MIMD versus SIMD

- Task parallelism, MIMD
  - Fork-join model with thread-level parallelism and shared memory
  - Message passing model with (distributed processing) processes
- Data parallelism, SIMD
  - Multiple processors (or units) operate on a segmented data set
  - SIMD model with vector and pipeline machines
  - SIMD-like multimedia extensions, e.g. MMX/SSE/AltiVec
[Figure: src1 holds X3 X2 X1 X0, src2 holds Y3 Y2 Y1 Y0, dest receives X3 ⊕ Y3, X2 ⊕ Y2, X1 ⊕ Y1, X0 ⊕ Y0]

Vector operation X[0:3] ⊕ Y[0:3] with an SSE instruction on the Pentium 4
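A minimal C sketch of the packed add pictured above, using SSE intrinsics on x86 (the lane values are arbitrary illustrations; _mm_add_ps adds all four single-precision lanes with one instruction):

    #include <xmmintrin.h>   /* SSE: __m128, _mm_set_ps, _mm_add_ps, _mm_storeu_ps */
    #include <stdio.h>

    int main(void) {
        __m128 X = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);     /* X3 X2 X1 X0 */
        __m128 Y = _mm_set_ps(30.0f, 20.0f, 10.0f, 5.0f);  /* Y3 Y2 Y1 Y0 */
        __m128 dest = _mm_add_ps(X, Y);    /* one instruction adds all four lanes */

        float out[4];
        _mm_storeu_ps(out, dest);
        printf("%g %g %g %g\n", out[3], out[2], out[1], out[0]);  /* X3+Y3 ... X0+Y0 */
        return 0;
    }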
Task versus Data Parallel
- Task parallel (maps to the high-level MIMD machine model)
  - Task differentiation, like restaurant cook, waiter, and receptionist
  - Communication via shared address space or message passing
  - Synchronization is explicit (via locks and barriers)
  - Underscores operations on private data; explicit constructs for communication of shared data and for synchronization
- Data parallel (maps to the high-level SIMD machine model)
  - Global actions on data by tasks that execute the same code
  - Communication via shared memory or a logically shared address space with underlying message passing
  - Synchronization is implicit (lock-step execution)
  - Underscores operations on shared data; private data must be defined explicitly or is simply mapped onto the shared data space
A Running Example: compute the global sum A = f(a[1]) + f(a[2]) + ... + f(a[N]) of a function f applied to the elements of an array a[1..N]

- Parallel decomposition (a serial C sketch of this decomposition follows below)
  - Assign N/P elements to each of the P processors
  - Each processor computes its partial sum
  - One processor collects the partial sums
- Determine the data placement:
  - Logically shared: the array a, the global sum A
  - Logically private: the function evaluations f(a[i])
  - Either logically shared or private: the partial sums Aj
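A minimal serial C sketch of this example and its block decomposition (f, N, and P are illustrative placeholders; the slide does not specify them):

    #include <stdio.h>

    #define N 1000
    #define P 4                      /* number of processors; assumed to divide N */

    static double f(double x) { return x * x + 1.0; }   /* placeholder function */

    /* Serial reference: A = f(a[0]) + ... + f(a[N-1]) */
    static double serial_sum(const double a[N]) {
        double A = 0.0;
        for (int i = 0; i < N; i++)
            A += f(a[i]);
        return A;
    }

    /* Partial sum of processor j over its block of N/P contiguous elements */
    static double partial_sum(const double a[N], int j) {
        double Aj = 0.0;
        for (int i = j * (N / P); i < (j + 1) * (N / P); i++)
            Aj += f(a[i]);
        return Aj;
    }

    int main(void) {
        static double a[N];
        for (int i = 0; i < N; i++) a[i] = (double)i;

        double A = 0.0;
        for (int j = 0; j < P; j++)
            A += partial_sum(a, j);       /* "one processor collects the partial sums" */

        printf("serial %g, collected %g\n", serial_sum(a), A);
        return 0;
    }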
Programming Model 1
- Shared address space (shared memory) programming
- Task parallel, thread-based MIMD
  - Program is a collection of threads of control
- Threads collectively operate on a set of shared data items
  - Global static variables, Fortran common blocks, shared heap
- Each thread has private variables
  - Thread state data, local variables on the runtime stack
- Threads coordinate explicitly through synchronization operations on shared variables (a pthreads sketch follows this list), which involves
  - Thread creation and join
  - Reading and writing flags
  - Using locks and semaphores (e.g. to enforce mutual exclusion)
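A minimal POSIX threads sketch of this model (the sizes and f are illustrative): two threads share the input array and one slot each of a shared result array, keep their loop index and running sum private, and the main thread joins them before combining the results.

    #include <pthread.h>
    #include <stdio.h>

    #define N 20
    static double a[N];        /* shared input data */
    static double A[2];        /* shared partial sums, one slot per thread */

    static double f(double x) { return 2.0 * x; }

    static void *worker(void *arg) {
        int j = *(int *)arg;   /* private: which half this thread owns */
        double sum = 0.0;      /* private: lives on this thread's stack */
        for (int i = j * N / 2; i < (j + 1) * N / 2; i++)
            sum += f(a[i]);
        A[j] = sum;            /* each thread writes only its own shared slot */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int id[2] = {0, 1};
        for (int i = 0; i < N; i++) a[i] = (double)i;

        for (int j = 0; j < 2; j++) pthread_create(&t[j], NULL, worker, &id[j]);
        for (int j = 0; j < 2; j++) pthread_join(t[j], NULL);   /* explicit join */

        printf("A = %g\n", A[0] + A[1]);   /* combine after both threads finish */
        return 0;
    }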
Programming Model 1
- Uniform memory access (UMA) shared memory machine
  - Each processor has uniform access to memory
  - Symmetric multiprocessors (SMP)
- No local/private memory; private variables are placed in shared memory
- Cache makes access to private variables seem "local"

[Figure: programming model (shared and private data) versus UMA machine model]
Programming Model 1
- Nonuniform memory access (NUMA) shared memory machine
  - Memory access time depends on the location of the data relative to the processor
  - Local access is faster
- No local/private memory; private variables are placed in shared memory

[Figure: programming model (shared and private data) versus NUMA machine model]
Programming Model 1
- Distributed shared memory (DSM) machine
- Logically shared address space
  - Remote memory access is more expensive (NUMA)
  - Remote memory access requires communication, performed automatically either in hardware or by a software layer

[Figure: programming model (shared and private data) versus DSM machine model]
Programming Model 1
Thread 1:

    shared A
    shared A[1..2]
    private i

    A[1] := 0
    for i = 1..N/2
        A[1] := A[1] + f(a[i])

    A := A[1] + A[2]

Thread 2:

    shared A
    shared A[1..2]
    private i

    A[2] := 0
    for i = N/2+1..N
        A[2] := A[2] + f(a[i])

What could go wrong?
Programming Model 1
Thread 1:

    A[1] := A[1] + f(a[0])
    A[1] := A[1] + f(a[1])
    A[1] := A[1] + f(a[2])
    ...
    A[1] := A[1] + f(a[9])
    A := A[1] + A[2]

Thread 2:

    ...
    A[2] := A[2] + f(a[10])
    A[2] := A[2] + f(a[11])
    A[2] := A[2] + f(a[12])
    ...
    A[2] := A[2] + f(a[19])

Thread 2 has not completed in time
Programming Model 1
Thread 1:

    shared A
    shared A[1..2]
    private i

    A := 0
    A[1] := 0
    for i = 1..N/2
        A[1] := A[1] + f(a[i])

    A := A + A[1]

Thread 2:

    shared A
    shared A[1..2]
    private i

    A := 0
    A[2] := 0
    for i = N/2+1..N
        A[2] := A[2] + f(a[i])

    A := A + A[2]

What could go wrong?
Programming Model 1
Thread 1:

    A[1] := A[1] + f(a[0])
    A[1] := A[1] + f(a[1])
    A[1] := A[1] + f(a[2])
    ...
    A := A + A[1]

Thread 2:

    A[2] := A[2] + f(a[10])
    A[2] := A[2] + f(a[11])
    A[2] := A[2] + f(a[12])
    ...
    A := A + A[2]

Race condition: each update A := A + A[j] is compiled into a load-add-store sequence

Thread 1:

    reg1 = A
    reg2 = A[1]
    reg1 = reg1 + reg2
    A = reg1

Thread 2:

    reg1 = A
    reg2 = A[2]
    reg1 = reg1 + reg2
    A = reg1

Instructions from different threads can be interleaved arbitrarily: the resulting value of A can be A[1], A[2], or A[1]+A[2]
Programming Model 1
Thread 1:

    shared A
    shared A[1..2]
    private i

    A[1] := 0
    for i = 1..N/2
        A[1] := A[1] + f(a[i])

    atomic A := A + A[1]

Thread 2:

    shared A
    shared A[1..2]
    private i

    A[2] := 0
    for i = N/2+1..N
        A[2] := A[2] + f(a[i])

    atomic A := A + A[2]

Solution with atomic operations to prevent the race condition
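In C, one way to obtain the "atomic A := A + Aj" update is OpenMP's atomic construct; a minimal sketch (f, N, and the data are placeholders, and OpenMP is only one possible realization):

    #include <stdio.h>

    #define N 1000
    static double f(double x) { return x * x; }

    int main(void) {
        static double a[N];
        double A = 0.0;                      /* shared global sum */
        for (int i = 0; i < N; i++) a[i] = 1.0;

        #pragma omp parallel
        {
            double Aj = 0.0;                 /* private partial sum */
            #pragma omp for
            for (int i = 0; i < N; i++)
                Aj += f(a[i]);
            #pragma omp atomic               /* the update below is performed atomically */
            A += Aj;
        }
        printf("A = %g\n", A);
        return 0;
    }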
Programming Model 1
Thread 1:

    shared A
    shared A[1..2]
    private i

    A[1] := 0
    for i = 1..N/2
        A[1] := A[1] + f(a[i])

    lock                -- critical section
    A := A + A[1]
    unlock

Thread 2:

    shared A
    shared A[1..2]
    private i

    A[2] := 0
    for i = N/2+1..N
        A[2] := A[2] + f(a[i])

    lock                -- critical section
    A := A + A[2]
    unlock

Solution with locks to ensure mutual exclusion

(But this can still go wrong when an FP add exception is raised, jumping to an exception handler without unlocking)
Programming Model 1
Thread 1:

    shared A
    private Aj
    private i

    Aj := 0
    for i = 1..N/2
        Aj := Aj + f(a[i])

    lock                -- critical section
    A := A + Aj
    unlock

Thread 2:

    shared A
    private Aj
    private i

    Aj := 0
    for i = N/2+1..N
        Aj := Aj + f(a[i])

    lock                -- critical section
    A := A + Aj
    unlock

Note that A[1] and A[2] are only used locally, so make them private (Aj)
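A pthreads sketch of this version: each thread accumulates a private Aj and enters a mutex-protected critical section only for the shared update (the array contents and f are placeholders):

    #include <pthread.h>
    #include <stdio.h>

    #define N 20
    static double a[N];                            /* shared input data */
    static double A = 0.0;                         /* shared global sum */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static double f(double x) { return x + 1.0; }

    static void *worker(void *arg) {
        int j = *(int *)arg;
        double Aj = 0.0;                           /* private partial sum */
        for (int i = j * N / 2; i < (j + 1) * N / 2; i++)
            Aj += f(a[i]);
        pthread_mutex_lock(&lock);                 /* enter critical section */
        A += Aj;
        pthread_mutex_unlock(&lock);               /* leave critical section */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int id[2] = {0, 1};
        for (int i = 0; i < N; i++) a[i] = (double)i;
        for (int j = 0; j < 2; j++) pthread_create(&t[j], NULL, worker, &id[j]);
        for (int j = 0; j < 2; j++) pthread_join(t[j], NULL);
        printf("A = %g\n", A);
        return 0;
    }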
Programming Model 1
Thread 1:

    shared A
    private Aj
    private i

    Aj := 0
    for i = 1..N/2
        Aj := Aj + f(a[i])

    lock                -- critical section
    A := A + Aj
    unlock
    ... := A

Thread 2:

    shared A
    private Aj
    private i

    Aj := 0
    for i = N/2+1..N
        Aj := Aj + f(a[i])

    lock                -- critical section
    A := A + Aj
    unlock
    ... := A

What could go wrong?
Programming Model 1
Thread 1:

    shared A
    private Aj
    private i

    Aj := 0
    for i = 1..N/2
        Aj := Aj + f(a[i])

    lock
    A := A + Aj
    unlock
    barrier             -- all processors synchronize
    ... := A

Thread 2:

    shared A
    private Aj
    private i

    Aj := 0
    for i = N/2+1..N
        Aj := Aj + f(a[i])

    lock
    A := A + Aj
    unlock
    barrier             -- all processors synchronize
    ... := A

With locks, private Aj, and barrier synchronization
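The same pattern written with OpenMP, as a minimal sketch: private partial sums, a critical section for the shared update, and an explicit barrier so that no thread reads A before every thread has contributed (f, N, and the data are placeholders):

    #include <stdio.h>

    #define N 1000
    static double f(double x) { return 0.5 * x; }

    int main(void) {
        static double a[N];
        double A = 0.0;                    /* shared */
        for (int i = 0; i < N; i++) a[i] = 1.0;

        #pragma omp parallel
        {
            double Aj = 0.0;               /* private */
            #pragma omp for nowait
            for (int i = 0; i < N; i++)
                Aj += f(a[i]);
            #pragma omp critical           /* mutual exclusion for the shared update */
            A += Aj;
            #pragma omp barrier            /* all threads synchronize here */
            double mine = A;               /* every thread now reads the final A */
            #pragma omp single
            printf("A = %g\n", mine);
        }
        return 0;
    }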
Programming Model 2
- Shared address space (shared memory) programming
- Data parallel programming
  - Single thread of control consisting of parallel operations
  - Parallel operations are applied to (a specific segment of) a data structure, such as an array
- Communication is implicit
- Synchronization is implicit

    shared array a, x
    shared A

    a := array of input data
    x := f(a)
    A := sum(x)
Programming Model 2
- E.g. data parallel programming with a vector machine
- One instruction executes across multiple data elements, typically in a pipelined fashion

    shared array a, x
    shared A

    a := array of input data
    x := f(a)
    A := sum(x)

[Figure: data parallel programming model versus vector machine model]
Programming Model 2
- Data parallel programming with a SIMD machine
- Large number of (relatively) simple processors
  - Like multimedia extensions (MMX/SSE/AltiVec) on uniprocessors, but with scalable processor grids
- A control processor issues instructions to the simple processors
  - Each processor executes the same instruction (in lock-step)
  - Processors are selectively turned off to implement control flow in the program

Lock-step execution by an array of processors, with some processors temporarily turned off:

    REAL, DIMENSION(6) :: a, b
    ...
    WHERE (b /= 0.0)
      a = a/b
    ENDWHERE

Fortran 90 / HPF (High-Performance Fortran)
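A rough SSE rendering of the same masked update in C: every lane performs the divide, and a comparison mask selects the quotient where b is nonzero while keeping the old a elsewhere (the data values are arbitrary):

    #include <xmmintrin.h>
    #include <stdio.h>

    int main(void) {
        __m128 a = _mm_set_ps(8.0f, 6.0f, 4.0f, 2.0f);
        __m128 b = _mm_set_ps(2.0f, 0.0f, 4.0f, 0.0f);

        __m128 mask = _mm_cmpneq_ps(b, _mm_setzero_ps());  /* lanes where b != 0 */
        __m128 q    = _mm_div_ps(a, b);                    /* divide all lanes    */
        /* select: quotient where the mask is set, original a elsewhere */
        a = _mm_or_ps(_mm_and_ps(mask, q), _mm_andnot_ps(mask, a));

        float out[4];
        _mm_storeu_ps(out, a);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }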
Programming Model 3
- Message passing programming
- Program is a set of named processes
  - A process has a thread of control and local memory with a local address space
- Processes communicate via explicit data transfers
  - Messages between source and destination, where source and destination are named processors P0...Pn (or compute nodes)
  - Logically shared data is explicitly partitioned over the local memories
  - Communication with send/recv via standard message passing libraries, such as MPI and PVM (a minimal MPI example follows this list)
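A minimal MPI point-to-point sketch in C (compile with mpicc and run with at least two processes; the message contents are arbitrary):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            double x = 3.14;
            MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);      /* to rank 0 */
        } else if (rank == 0) {
            double x;
            MPI_Recv(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                             /* from rank 1 */
            printf("received %g\n", x);
        }
        MPI_Finalize();
        return 0;
    }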
Programming Model 3
- Message passing programming
- Each node has a network interface
  - Communication and synchronization via the network
  - Message latency and bandwidth depend on the network topology and the routing algorithms

[Figure: message passing programming model versus distributed memory machine model]
[Figure: the same model with a mesh network as the machine model: message passing over a mesh]
[Figure: the same model with a hypercube network as the machine model: message passing over a hypercube]
Programming Model 3
- Message passing programming
- On a shared memory machine
  - Communication and synchronization via shared memory
  - The message passing library copies data (messages) in memory, which is less efficient (MPI call overhead) but portable

[Figure: message passing on a shared memory machine model; the library copies the data through shared memory]
Programming Model 3
Processor 1:

    A1 := 0
    for i = 1..N/2
        A1 := A1 + f(al[i])

    receive A2 from P2
    A := A1 + A2
    send A to P2

Processor 2:

    A2 := 0
    for i = 1..N/2
        A2 := A2 + f(al[i])

    send A2 to P1
    receive A from P1

Solution with message passing, where the global a[1..N] is distributed such that each processor has a local array al[1..N/2]
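An MPI sketch of this two-process version, with rank 0 and rank 1 playing the roles of P1 and P2 (f, the block size, and the initialization are placeholders; run with exactly two processes):

    #include <mpi.h>
    #include <stdio.h>

    #define NLOC 10                              /* N/2 elements per process */
    static double f(double x) { return x; }

    int main(int argc, char **argv) {
        int rank;
        double al[NLOC], Aj = 0.0, A = 0.0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < NLOC; i++) al[i] = rank * NLOC + i;  /* local block of a */
        for (int i = 0; i < NLOC; i++) Aj += f(al[i]);           /* local partial sum */

        if (rank == 0) {                         /* P1: collect, sum, send back */
            double A2;
            MPI_Recv(&A2, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            A = Aj + A2;
            MPI_Send(&A, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            printf("A = %g\n", A);
        } else if (rank == 1) {                  /* P2: send partial sum, get total */
            MPI_Send(&Aj, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            MPI_Recv(&A, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }

With more than two processes, a collective such as MPI_Reduce or MPI_Allreduce expresses the same computation more directly.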
Programming Model 3
Processor 1:

    A1 := 0
    for i = 1..N/2
        A1 := A1 + f(al[i])

    send A1 to P2
    receive A2 from P2
    A := A1 + A2

Processor 2:

    A2 := 0
    for i = 1..N/2
        A2 := A2 + f(al[i])

    send A2 to P1
    receive A1 from P1
    A := A1 + A2

Alternative solution with message passing, where the global a[1..N] is distributed such that each processor has a local array al[1..N/2]

What could go wrong?
Programming Model 3
Processor 1:

    A1 := 0
    for i = 1..N/2
        A1 := A1 + f(al[i])

    send A1 to P2        -- synchronous blocking send
    receive A2 from P2
    A := A1 + A2

Processor 2:

    A2 := 0
    for i = 1..N/2
        A2 := A2 + f(al[i])

    send A2 to P1        -- synchronous blocking send
    receive A1 from P1
    A := A1 + A2

Blocking and non-blocking versions of the send/recv operations are available in message passing libraries: compare connection-oriented with rendezvous (telephone) to connectionless (mailbox)

Deadlock with synchronous blocking send operations: both processors wait to send data to a receiver that is not yet ready to accept the message
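One common way out of this deadlock is to make the send non-blocking, so both processes can reach their receive; a sketch with MPI_Isend/MPI_Recv/MPI_Wait (run with exactly two processes; the values are arbitrary):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, other;
        double mine, theirs;
        MPI_Request req;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        other = 1 - rank;                     /* the partner process */
        mine = rank + 1.0;

        /* The non-blocking send returns immediately, so both ranks reach the recv. */
        MPI_Isend(&mine, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &req);
        MPI_Recv(&theirs, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);    /* send buffer may be reused after this */

        printf("rank %d: sum = %g\n", rank, mine + theirs);
        MPI_Finalize();
        return 0;
    }

Reversing the send/receive order on one of the two processes, or using MPI_Sendrecv, also avoids the deadlock with blocking calls.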
Programming Model 4
- Hybrid systems: clusters of SMPs
- Shared memory within an SMP, message passing outside
- Programming model with three choices:
  - Treat it as a "flat" system: always use message passing, even within an SMP
    - Advantage: ease of programming and portability
    - Disadvantage: ignores the SMP memory hierarchy and the advantage of a UMA shared address space
  - Program in two layers: shared memory programming and message passing (see the sketch after this list)
    - Advantage: better performance (use UMA/NUMA intelligently)
    - Disadvantage: harder (and ugly!) to program
  - Program in three layers: SIMD (e.g. SSE instructions) per core, shared memory programming between cores on an SMP node, and message passing between nodes
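A minimal sketch of the two-layer style in C, with OpenMP threads reducing within a node and MPI combining across nodes (N, f, and the data are placeholders; assumes the number of MPI processes divides N):

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000
    static double f(double x) { return x; }

    int main(int argc, char **argv) {
        int rank, nprocs;
        static double a[N];
        double local = 0.0, A = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int chunk = N / nprocs;                  /* this process's block of the array */
        for (int i = 0; i < N; i++) a[i] = 1.0;

        /* Shared memory layer: OpenMP threads on one SMP node reduce the local block. */
        #pragma omp parallel for reduction(+:local)
        for (int i = rank * chunk; i < (rank + 1) * chunk; i++)
            local += f(a[i]);

        /* Message passing layer: combine the per-node sums across nodes. */
        MPI_Reduce(&local, &A, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("A = %g\n", A);
        MPI_Finalize();
        return 0;
    }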
Programming Model 4

[Figure: Node 1, Node 2, Node 3, and Node 4 connected by an interconnect]

Per-node, per-processor code (numnodes nodes, numprocs processors per node):

    shared a[1..N/numnodes]                    -- shared part (per node)
    private n = N/numnodes/numprocs            -- processor-local part
    private x[1..n]
    private lo = (pid-1)*n
    private hi = lo+n
    x[1..n] = f(a[lo..hi])                     -- vector (SIMD) part
    A[pid] := sum(x[1..n])
    send A[pid] to node1

Extra code for node 1, processor 1:

    A := 0
    if node = 1 and pid = 1
        for j = 1..numnodes
            for i = 1..numprocs
                receive Aj from node(j)
                A := A + Aj
Programming Model 5
- Bulk synchronous parallel (BSP) processing
- A BSP superstep consists of three phases (sketched below):
  1. Compute phase: processes operate on local data (also read access to shared memory on an SMP)
  2. Communication phase: all processes cooperate in the exchange of data or the reduction of global data
  3. Barrier synchronization
- A parallel program is composed of supersteps
  - Ensures that the computation and communication phases are completed before the next superstep
- Simplicity of data parallel programming, without the restrictions
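A rough C/MPI rendering of the superstep structure, under the assumption that the communication phase is a global reduction (the explicit barrier mirrors phase 3, although the reduction itself already synchronizes):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        double local, global;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int step = 0; step < 3; step++) {        /* a program is a sequence of supersteps */
            local = rank + step;                      /* 1. compute phase on local data        */
            MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                          MPI_SUM, MPI_COMM_WORLD);   /* 2. communication phase (reduction)    */
            MPI_Barrier(MPI_COMM_WORLD);              /* 3. barrier synchronization            */
            if (rank == 0) printf("superstep %d: %g\n", step, global);
        }
        MPI_Finalize();
        return 0;
    }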
Programming Model 5
- The cost of a BSP superstep s is composed of three parts:
  - w_s, the local computation cost of superstep s
  - h_s, the number of messages sent in superstep s
  - l, the barrier cost
- The total cost of a program with S supersteps is

    cost = sum over s = 1..S of (w_s + g*h_s + l)

  where g is the communication cost, such that it takes g*h time to send h messages
Summary
- The goal is to distinguish the programming model from the underlying hardware
- Message passing, data parallel, BSP
  - Objective is portable, correct code
- Hybrid
  - Tuning for the architecture
  - Objective is portable, fast code
  - Algorithm design challenge (less uniformity)
  - Implementation challenge at all levels (fine to coarse grain)
- Blocking at the loop and data level (compiler and programmer)
- SIMD vectorization at the loop level (compiler and programmer)
- Shared memory programming for each node (OpenMP)
- Message passing between nodes (MPI)