Parallel Programming
Transcript of Parallel Programming
Sathish S. Vadhiyar
Motivations of Parallel Computing
Parallel Machine: a computer system with more than one processor
Motivations
• Faster execution time due to non-dependencies between regions of code
• Presents a level of modularity
• Resource constraints, e.g. large databases
• Certain classes of algorithms lend themselves naturally to parallelism
• Aggregate bandwidth to memory/disk; increase in data throughput
• Clock rate improvement in the past decade – 40%
• Memory access time improvement in the past decade – 10%
Parallel Programming and Challenges
Recall the advantages and motivation of parallelism
But parallel programs incur overheads not seen in sequential programs
  Communication delay
  Idling
  Synchronization
Challenges
[Figure: timelines of two processes P0 and P1, showing computation, communication, synchronization, and idle time]
How do we evaluate a parallel program?
Execution time, Tp
Speedup, S
  S(p, n) = T(1, n) / T(p, n)
  Usually, S(p, n) < p
  Sometimes S(p, n) > p (superlinear speedup)
Efficiency, E
  E(p, n) = S(p, n) / p
  Usually, E(p, n) < 1
  Sometimes, greater than 1
Scalability – limitations in parallel computing, relation to n and p
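As a small illustration of these metrics, a minimal C sketch with purely hypothetical timings (T(1, n) and T(p, n) are assumed values, not measurements from any real program):

```c
#include <stdio.h>

int main(void) {
    double t1 = 120.0;   /* T(1, n): assumed sequential time in seconds */
    double tp = 18.0;    /* T(p, n): assumed time on p processors       */
    int p = 8;

    double speedup    = t1 / tp;        /* S(p, n) = T(1, n) / T(p, n) */
    double efficiency = speedup / p;    /* E(p, n) = S(p, n) / p       */
    printf("S = %.2f, E = %.2f\n", speedup, efficiency);
    return 0;
}
```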
Speedups and efficiency
[Figure: speedup S vs. number of processors p (ideal linear vs. practical), and efficiency E vs. p]
Limitations on speedup – Amdahl's law
Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
Overall speedup expressed in terms of the fractions of computation time with and without the enhancement, and the speedup of the enhanced portion.
Places a limit on the speedup due to parallelism.
Speedup = 1 / (fs + (fp / P))
(fs: serial fraction, fp: parallel fraction, fs + fp = 1, P: number of processors)
Amdahl’s law Illustration
Courtesy:
http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html
http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
Efficiency
S = 1 / (s + (1-s)/p)
[Figure: efficiency plotted against the number of processors (0–15) for the speedup above]
Amdahl’s law analysis
f P=1 P=4 P=8 P=16 P=32
1.00 1.0 4.00 8.00 16.00 32.00
0.99 1.0 3.88 7.48 13.91 24.43
0.98 1.0 3.77 7.02 12.31 19.75
0.96 1.0 3.57 6.25 10.00 14.29
• For the same parallel fraction f, the speedup falls farther behind the processor count as P grows.
• Thus Amdahl's law is a bit depressing for parallel programming.
• In practice, the number of parallel portions of work has to be large enough to match a given number of processors.
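The table above can be reproduced from the formula S = 1 / ((1 - f) + f/P); a minimal C sketch (assuming f is the parallelizable fraction):

```c
#include <stdio.h>

/* Amdahl's law: speedup with parallelizable fraction f on P processors. */
static double amdahl(double f, int P) {
    return 1.0 / ((1.0 - f) + f / P);
}

int main(void) {
    const double fracs[] = {1.00, 0.99, 0.98, 0.96};
    const int    procs[] = {1, 4, 8, 16, 32};
    printf("   f    P=1    P=4    P=8   P=16   P=32\n");
    for (int i = 0; i < 4; i++) {
        printf("%4.2f", fracs[i]);
        for (int j = 0; j < 5; j++)
            printf(" %6.2f", amdahl(fracs[i], procs[j]));
        printf("\n");
    }
    return 0;
}
```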
Gustafson's Law
Amdahl's law – keep the parallel work fixed
Gustafson's law – keep the computation time on the parallel processors fixed, and change the problem size (the fraction of parallel/sequential work) to match that computation time
For a particular number of processors, find the problem size for which the parallel time equals the constant time
For that problem size, find the sequential time and the corresponding speedup
The resulting speedup is therefore scaled, i.e. scaled speedup; also called weak speedup
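The slide does not write out the resulting formula; in its standard form (a known result, with s taken as the serial fraction of the time measured on the P processors), the scaled speedup is
Scaled speedup = s + (1 - s) * P
so it grows roughly linearly with P instead of saturating at 1/s as in Amdahl's law.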
Metrics (Contd..)
N P=1 P=4 P=8 P=16 P=32
64 1.0 0.80 0.57 0.33
192 1.0 0.92 0.80 0.60
512 1.0 0.97 0.91 0.80
Table 5.1: Efficiency as a function of n and p.
Scalability
Efficiency decreases with increasing P; increases with increasing N
How effectively the parallel algorithm can use an increasing number of processors
How the amount of computation performed must scale with P to keep E constant
  This function of computation in terms of P is called the isoefficiency function
  An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable
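A standard worked example (from the parallel computing literature, e.g. Grama et al.; not spelled out on the slide): adding n numbers on p processors takes about n/p local additions plus roughly 2 log p steps for the final combining, so the total overhead summed over all processors is about 2p log p. Keeping E constant therefore requires the work n to grow as Θ(p log p), i.e. the isoefficiency function is O(p log p), which is considered quite scalable.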
Parallel Program Models
Single Program Multiple Data (SPMD)
Multiple Program Multiple Data (MPMD)
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Programming Paradigms
Shared memory model – Threads, OpenMP, CUDA
Message passing model – MPI
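As a minimal illustration of the message passing model in the SPMD style, a standard MPI hello-world sketch (not taken from the slides):

```c
#include <mpi.h>
#include <stdio.h>

/* SPMD: every process runs the same program; behaviour is selected by rank. */
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```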
PARALLELIZATION
Parallelizing a Program
Given a sequential program/algorithm, how to go about producing a parallel version
Four steps in program parallelization
1. Decomposition – identifying parallel tasks with a large extent of possible concurrent activity; splitting the problem into tasks
2. Assignment – grouping the tasks into processes with the best load balancing
3. Orchestration – reducing synchronization and communication costs
4. Mapping – mapping of processes to processors (if possible)
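A minimal sketch of these four steps applied to a trivial loop, using OpenMP (which the slides list under the shared memory model); the array contents and sizes are illustrative only:

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;
    for (int i = 0; i < N; i++) a[i] = 1.0;

    /* Decomposition: each iteration is an independent task.
     * Assignment:    schedule(static) groups iterations into balanced chunks.
     * Orchestration: reduction(+:sum) replaces explicit synchronization on sum.
     * Mapping:       left to the OpenMP runtime/OS (thread-to-core placement). */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}
```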
Steps in Creating a Parallel Program
[Figure: a sequential computation is decomposed into tasks, the tasks are assigned to processes p0–p3 (partitioning), orchestration turns them into the parallel program, and mapping places the processes on processors P0–P3]
Orchestration
Goals
  Structuring communication
  Synchronization
Challenges
  Organizing data structures – packing
  Small or large messages?
  How to organize communication and synchronization?
Orchestration
Maximizing data locality
  Minimizing volume of data exchange – not communicating intermediate results, e.g. dot product
  Minimizing frequency of interactions – packing
Minimizing contention and hot spots
  Do not use the same communication pattern with the other processes in all the processes
Overlapping computations with interactions (see the sketch after this list)
  Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)
  Initiate communication for type 1; during the communication, perform type 2
Replicating data or computations
  Balancing the extra computation or storage cost against the gain due to less communication
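A minimal sketch of the overlap guideline using non-blocking MPI calls (the ring neighbours, array sizes, and the two computation phases are illustrative assumptions, not from the slides):

```c
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    double local[N], halo[N], partial = 0.0;
    MPI_Request reqs[2];
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int left  = (rank - 1 + size) % size;   /* ring neighbours (illustrative) */
    int right = (rank + 1) % size;
    for (int i = 0; i < N; i++) local[i] = rank;

    /* Initiate the communication needed by the dependent phase (type 1) ... */
    MPI_Irecv(halo,  N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(local, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... and overlap it with work that needs no remote data (type 2). */
    for (int i = 0; i < N; i++) partial += local[i];

    /* Wait for the data, then do the computation that depends on it (type 1). */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    for (int i = 0; i < N; i++) partial += halo[i];

    printf("rank %d: partial = %f\n", rank, partial);
    MPI_Finalize();
    return 0;
}
```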
Mapping
Which process runs on which particular processor?
Can depend on the network topology and the communication pattern of the processes
On processor speeds in the case of heterogeneous systems
Mapping
Static mapping
  Mapping based on data partitioning
    Applicable to dense matrix computations – block distribution, block-cyclic distribution (see the sketch below)
    [Figure: block distribution of 9 array elements over 3 processes (0 0 0 1 1 1 2 2 2) vs. a block-cyclic distribution]
  Graph partitioning based mapping
    Applicable for sparse matrix computations
  Mapping based on task partitioning
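A minimal sketch of how an element index maps to a process under the two data partitionings; the helper functions below are illustrative, not from the slides:

```c
#include <stdio.h>

/* Owner of element i under a block distribution of n elements over p
 * processes (the first n % p processes get one extra element). */
static int block_owner(int i, int n, int p) {
    int base = n / p, rem = n % p;
    int cut = rem * (base + 1);          /* elements covered by the larger blocks */
    return (i < cut) ? i / (base + 1) : rem + (i - cut) / base;
}

/* Owner of element i under a block-cyclic distribution with block size b. */
static int block_cyclic_owner(int i, int b, int p) {
    return (i / b) % p;
}

int main(void) {
    int n = 9, p = 3;
    for (int i = 0; i < n; i++) printf("%d ", block_owner(i, n, p));
    printf(" (block)\n");
    for (int i = 0; i < n; i++) printf("%d ", block_cyclic_owner(i, 1, p));
    printf(" (block-cyclic, block size 1)\n");
    return 0;
}
```

With n = 9 and p = 3 this prints the block pattern 0 0 0 1 1 1 2 2 2 and the cyclic pattern 0 1 2 0 1 2 0 1 2.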
Based on Task Partitioning
Based on the task dependency graph
In general the problem is NP-complete
[Figure: binary-tree task dependency graph mapped onto 8 processes – root on process 0; next level on 0 and 4; then 0, 2, 4, 6; leaves on 0 1 2 3 4 5 6 7]
Mapping
Dynamic mapping
  A process/global memory can hold a set of tasks
  Distribute some tasks to all processes
  Once a process completes its tasks, it asks the coordinator process for more tasks
  Referred to as self-scheduling, work-stealing
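Self-scheduling can be illustrated in shared memory with OpenMP's dynamic schedule, where an idle thread grabs the next chunk of iterations instead of receiving a fixed assignment; the task cost function below is an artificial stand-in:

```c
#include <stdio.h>
#include <omp.h>

#define NTASKS 100

/* Artificial tasks with very uneven costs. */
static double task_cost(int t) {
    double x = 0.0;
    for (long i = 0; i < (long)(t + 1) * 10000L; i++)
        x += 1.0 / (double)(i + 1);
    return x;
}

int main(void) {
    double total = 0.0;
    /* schedule(dynamic, 1): threads that finish early ask for more tasks,
     * which is the self-scheduling idea described above. */
    #pragma omp parallel for schedule(dynamic, 1) reduction(+:total)
    for (int t = 0; t < NTASKS; t++)
        total += task_cost(t);
    printf("threads = %d, total = %f\n", omp_get_max_threads(), total);
    return 0;
}
```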
High-level Goals
Table 2.1: Steps in the Parallelization Process and Their Goals

Step           Architecture-Dependent?   Major Performance Goals
Decomposition  Mostly no                 Expose enough concurrency, but not too much
Assignment     Mostly no                 Balance workload; reduce communication volume
Orchestration  Yes                       Reduce noninherent communication via data locality;
                                         reduce communication and synchronization cost as seen by the processor;
                                         reduce serialization at shared resources;
                                         schedule tasks to satisfy dependences early
Mapping        Yes                       Put related processes on the same processor if necessary;
                                         exploit locality in network topology
PARALLEL ARCHITECTURE
Classification of Architectures – Flynn’s classification
In terms of parallelism in instruction and data stream
Single Instruction Single Data (SISD): Serial Computers
Single Instruction Multiple Data (SIMD)
- Vector processors and processor arrays
- Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Flynn's classification
Multiple Instruction Single Data (MISD): Not popular
Multiple Instruction Multiple Data (MIMD)
- Most popular - IBM SP and most other supercomputers, clusters, computational Grids, etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Based on Memory
Shared memory – 2 types: UMA and NUMA
[Figure: UMA and NUMA shared memory organizations]
Examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification 2: Shared Memory vs Message Passing
Shared memory machine: the n processors share a physical address space
Communication can be done through this shared memory
The alternative is sometimes referred to as a message passing machine or a distributed memory machine
[Figure: shared memory organization (processors connected through an interconnect to a common main memory) vs. distributed memory organization (each processor with its own memory module, connected by an interconnect)]
Shared Memory Machines
The shared memory could itself be distributed among the processor nodes
Each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time
Terms: NUMA vs UMA architecture
  NUMA – Non-Uniform Memory Access
  UMA – Uniform Memory Access
Classification of Architectures – Based on Memory
Distributed memory
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Recently, multi-cores
Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids
Parallel Architecture: Interconnection Networks
An interconnection network is defined by switches, links and interfaces
  Switches – provide mapping between input and output ports, buffering, routing, etc.
  Interfaces – connect nodes with the network
Parallel Architecture: Interconnections
Indirect interconnects: nodes are connected to an interconnection medium, not directly to each other
  Shared bus, multiple bus, crossbar, MIN
Direct interconnects: nodes are connected directly to each other
  Topology: linear, ring, star, mesh, torus, hypercube
  Routing techniques: how the route taken by the message from source to destination is decided
Network topologies
  Static – point-to-point communication links among processing nodes
  Dynamic – communication links are formed dynamically by switches
Interconnection Networks
Static
  Bus – SGI Challenge
  Completely connected
  Star
  Linear array, Ring (1-D torus)
  Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
  k-d mesh: d dimensions with k nodes in each dimension
  Hypercubes – 2-logp mesh – e.g. many MIMD machines
  Trees – our campus network
Dynamic – communication links are formed dynamically by switches
  Crossbar – Cray X series – non-blocking network
  Multistage – SP2 – blocking network
For more details and an evaluation of topologies, refer to the book by Grama et al.
Indirect Interconnects
[Figure: shared bus, multiple bus, crossbar switch, and a multistage interconnection network built from 2x2 crossbars]
Direct Interconnect Topologies
[Figure: linear array, ring, star, 2-D mesh, torus, and hypercube (binary n-cube, n=2 and n=3) topologies]
Evaluating Interconnection Topologies
Diameter – maximum distance between any two processing nodes
  Fully connected – 1
  Star – 2
  Ring – p/2
  Hypercube – log P
Connectivity – multiplicity of paths between 2 nodes; minimum number of arcs to be removed from the network to break it into two disconnected networks
  Linear array – 1
  Ring – 2
  2-D mesh – 2
  2-D mesh with wraparound – 4
  d-dimensional hypercube – d
Evaluating Interconnection Topologies
Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves
  Ring – 2
  P-node 2-D mesh – sqrt(P)
  Tree – 1
  Star – 1
  Completely connected – P^2/4
  Hypercube – P/2
Evaluating Interconnection Topologies
Channel width – number of bits that can be simultaneously communicated over a link, i.e. number of physical wires between 2 nodes
Channel rate – performance of a single physical wire
Channel bandwidth – channel rate times channel width
Bisection bandwidth – maximum volume of communication between the two halves of the network, i.e. bisection width times channel bandwidth
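A small worked example with hypothetical numbers (not from the slide): with a channel width of 32 wires and a channel rate of 1 Gbit/s per wire, the channel bandwidth is 32 Gbit/s; a 64-node hypercube has bisection width P/2 = 32 links, so its bisection bandwidth is 32 × 32 Gbit/s = 1024 Gbit/s.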
Shared Memory Architecture: Caches
[Figure: processors P1 and P2 both read X = 0 into their caches; P1 then writes X = 1 (updating its cache and memory), but P2's next read of X hits its stale cached copy – cache hit: wrong data!]
Cache Coherence Problem
If each processor in a shared memory multiprocessor machine has a data cache
  Potential data consistency problem: the cache coherence problem
  Shared variable modification, private cache
Objective: processes shouldn't read `stale' data
Solutions
  Hardware: cache coherence mechanisms
Cache Coherence Protocols
Write update – propagate cache line to other processors on every write to a processor
Write invalidate – each processor gets the updated cache line whenever it reads stale data
Which is better?
Invalidation Based Cache Coherence
[Figure: P1 and P2 both read and cache X = 0; when P1 writes X = 1, P2's cached copy is invalidated, so P2's next read of X fetches the updated value X = 1]
Cache Coherence Using Invalidate Protocols
3 states associated with data items
  Shared – a variable shared by 2 caches
  Invalid – another processor (say P0) has updated the data item
  Dirty – state of the data item in P0
Implementations
  Snoopy
    For bus-based architectures: a shared bus interconnect where all cache controllers monitor all bus activity
    There is only one operation through the bus at a time; cache controllers can be built to take corrective action and enforce coherence in caches
    Memory operations are propagated over the bus and snooped
  Directory-based
    Instead of broadcasting memory operations to all processors, propagate coherence operations to the relevant processors
    A central directory maintains states of cache blocks and the associated processors
    Implemented with presence bits
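A minimal sketch of a directory entry with presence bits (an illustrative layout, not any real machine's format), showing how a write by one processor yields the set of sharers whose copies must be invalidated:

```c
#include <stdint.h>
#include <stdio.h>

/* One directory entry per memory block: a coherence state plus a presence
 * bit for every processor that may hold a cached copy of the block. */
enum block_state { INVALID, SHARED, DIRTY };

struct directory_entry {
    enum block_state state;
    uint64_t presence;            /* bit i set => processor i has a copy */
};

/* Processor p obtains a shared (read) copy of the block. */
static void add_sharer(struct directory_entry *e, int p) {
    e->presence |= (uint64_t)1 << p;
    e->state = SHARED;
}

/* Processor p writes the block: all other sharers must be invalidated,
 * after which p holds the block exclusively (dirty). */
static uint64_t handle_write(struct directory_entry *e, int p) {
    uint64_t others = e->presence & ~((uint64_t)1 << p);
    e->presence = (uint64_t)1 << p;
    e->state = DIRTY;
    return others;                /* bitmask of processors to invalidate */
}

int main(void) {
    struct directory_entry e = { INVALID, 0 };
    add_sharer(&e, 1);
    add_sharer(&e, 5);
    uint64_t inv = handle_write(&e, 5);   /* processor 5 writes the block */
    printf("invalidate mask = 0x%llx, state = %d\n",
           (unsigned long long)inv, (int)e.state);
    return 0;
}
```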
END