Parallel Numerical Algorithms
Chapter 1 – Parallel Computing
Michael T. Heath and Edgar Solomonik
Department of Computer Science
University of Illinois at Urbana-Champaign
CS 554 / CSE 512
Outline
1 Motivation
2 Architectures
   Taxonomy
   Memory Organization
3 Networks
   Network Topologies
   Graph Embedding
   Topology-Awareness in Algorithms
4 Communication
   Message Routing
   Communication Concurrency
   Collective Communication
Limits on Processor Speed
Computation speed is limited by physical laws
Speed of conventional processors is limited by
- line delays: signal transmission time between gates
- gate delays: settling time before state can be reliably read
Both can be improved by reducing device size, but this is in turn ultimately limited by
- heat dissipation
- thermal noise (degradation of signal-to-noise ratio)
- quantum uncertainty at small scales
- granularity of matter at atomic scale
Heat dissipation is the current binding constraint on processor speed
Moore’s Law
- Loosely: complexity (or capability) of microprocessors doubles every two years
- More precisely: number of transistors that can be fit into given area of silicon doubles every two years
- More precisely still: number of transistors per chip that yields minimum cost per transistor increases by factor of two every two years
- Does not say that microprocessor performance or clock speed doubles every two years
- Nevertheless, clock speed did in fact double every two years from roughly 1975 to 2005, but has now flattened at about 3 GHz due to limitations on power (heat) dissipation
[Figure: transistor counts per chip over time, illustrating Moore's law]
The End of Dennard Scaling
Dennard scaling: power usage scales with area, so Moore's law enables higher frequency with little increase in power
- current leakage caused Dennard scaling to cease in 2005
- so can no longer increase frequency without increasing power; must add cores or other functionality
Consequences of Moore’s Law
For given clock speed, increasing performance depends on producing more results per cycle, which can be achieved by exploiting various forms of parallelism
- Pipelined functional units
- Superscalar architecture (multiple instructions per cycle)
- Out-of-order execution of instructions
- SIMD instructions (multiple sets of operands per instruction)
- Memory hierarchy (larger caches and deeper hierarchy)
- Multicore and multithreaded processors
Consequently, almost all processors today are parallel
High Performance Parallel Supercomputers
Processors in today's cell phones and automobiles are more powerful than supercomputers of twenty years ago
Nevertheless, to attain extreme levels of performance (petaflops and beyond) necessary for large-scale simulations in science and engineering, many processors (often thousands to hundreds of thousands) must work together in concert
This course is about how to design and analyze efficient numerical algorithms for such architectures and applications
Flynn’s Taxonomy
Flynn's taxonomy: classification of computer systems by numbers of instruction streams and data streams
- SISD: single instruction stream, single data stream (conventional serial computers)
- SIMD: single instruction stream, multiple data streams (special purpose, "data parallel" computers)
- MISD: multiple instruction streams, single data stream (not particularly useful, except perhaps in "pipelining")
- MIMD: multiple instruction streams, multiple data streams (general purpose parallel computers)
SPMD Programming Style
SPMD (single program, multiple data): all processors execute same program, but each operates on different portion of problem data
- Easier to program than true MIMD, but more flexible than SIMD
- Although most parallel computers today are MIMD architecturally, they are usually programmed in SPMD style (see the sketch below)
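A minimal SPMD sketch in MPI (an illustrative addition, not from the original slides): every process runs the same program and uses its rank to select its portion of the data.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id, 0..p-1 */
    MPI_Comm_size(MPI_COMM_WORLD, &p);     /* total number of processes */

    /* same program everywhere; rank selects this process's slice of n items */
    int n = 1000;                          /* hypothetical problem size */
    int lo = rank * n / p, hi = (rank + 1) * n / p;
    printf("process %d of %d handles items [%d,%d)\n", rank, p, lo, hi);

    MPI_Finalize();
    return 0;
}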
Architectural Issues
Major architectural issues for parallel computer systems include
- processor coordination: synchronous or asynchronous?
- memory organization: distributed or shared?
- address space: local or global?
- memory access: uniform or nonuniform?
- granularity: coarse or fine?
- scalability: additional processors used efficiently?
- interconnection network: topology, switching, routing?
Distributed-Memory and Shared-Memory Systems
[Figure: shared-memory multiprocessor (processors P0...PN access memories M0...MN through a network) vs. distributed-memory multicomputer (each processor Pi has its own local memory Mi; nodes communicate through a network)]
Distributed Memory vs. Shared Memory
                               distributed memory   shared memory
  scalability                  easier               harder
  data mapping                 harder               easier
  data integrity               easier               harder
  performance optimization     easier               harder
  incremental parallelization  harder               easier
  automatic parallelization    harder               easier

Hybrid systems are common, with memory shared locally within SMP (symmetric multiprocessor) nodes but distributed globally across nodes
Network Topologies
Access to remote data requires communication
Direct connections would require O(p^2) wires and communication ports, which is infeasible for large p
Limited connectivity necessitates routing data through intermediate processors or switches
Some Common Network Topologies
[Figure: bus, star, crossbar, 1-D mesh, 1-D torus (ring), 2-D mesh, 2-D torus]
Some Common Network Topologies
[Figure: butterfly, binary tree, and hypercubes of dimension 0 through 4]
Graph Terminology
Graph: pair (V, E), where V is set of vertices or nodes connected by set E of edges
Complete graph: graph in which any two nodes are connected by an edge
Path: sequence of contiguous edges in graph
Connected graph: graph in which any two nodes are connected by a path
Cycle: path of length greater than one that connects a node to itself
Tree: connected graph containing no cycles
Spanning tree: subgraph that includes all nodes of given graph and is also a tree
Graph Models
Graph model of network: nodes are processors (or switches or memory units), edges are communication links
Graph model of computation: nodes are tasks, edges are data dependences between tasks
Mapping task graph of computation to network graph of target computer is instance of graph embedding
Distance between two nodes: number of edges (hops) in shortest path between them
Network Properties
Some network properties affecting a network's physical realization and potential performance:
- degree: maximum number of edges incident on any node; determines number of communication ports per processor
- diameter: maximum distance between any pair of nodes; determines maximum communication delay between processors
- bisection bandwidth: smallest number of edges (balanced min cut) whose removal splits graph into two subgraphs of equal size; determines ability to support simultaneous global communication
- edge length: maximum physical length of any wire; may be constant or variable as number of processors varies
Network Properties
  Network      Nodes        Degree  Diameter  Bisection W.  Edge L.
  bus/star     k + 1        k       2         1             var
  crossbar     k^2 + 2k     4       2(k + 1)  k             var
  1-D mesh     k            2       k − 1     1             const
  2-D mesh     k^2          4       2(k − 1)  k             const
  3-D mesh     k^3          6       3(k − 1)  k^2           const
  n-D mesh     k^n          2n      n(k − 1)  k^(n−1)       var
  1-D torus    k            2       k/2       2             const
  2-D torus    k^2          4       k         2k            const
  3-D torus    k^3          6       3k/2      2k^2          const
  n-D torus    k^n          2n      nk/2      2k^(n−1)      var
  binary tree  2^k − 1      3       2(k − 1)  1             var
  hypercube    2^k          k       k         2^(k−1)       var
  butterfly    (k + 1)2^k   4       2k        2^k           var
Graph Embedding
Graph embedding: φ : Vs → Vt maps nodes in source graph Gs = (Vs, Es) to nodes in target graph Gt = (Vt, Et); edges in Gs are mapped to paths in Gt
- load: maximum number of nodes in Vs mapped to same node in Vt
- congestion: maximum number of edges in Es mapped to paths containing same edge in Et
- dilation: maximum distance between any two nodes φ(u), φ(v) ∈ Vt such that (u, v) ∈ Es
Graph Embedding
Uniform load helps balance work across processors
Minimizing congestion optimizes use of available bandwidth of network links
Minimizing dilation keeps nearest-neighbor communications in source graph as short as possible in target graph
Perfect embedding has load, congestion, and dilation 1, but not always possible
Optimal embedding difficult to determine (NP-complete, in general), so heuristics used to determine good embedding
Examples: Graph Embedding
For some important cases, good or optimal embeddings are known, for example

[Figure: ring in 2-D mesh (dilation 1); binary tree in 2-D mesh (dilation ⌈(k − 1)/2⌉); ring in hypercube via Gray code numbering (dilation 1)]
Gray Code
Gray code: ordering of integers 0 to 2^k − 1 such that consecutive members differ in exactly one bit position
Example: binary reflected Gray code of length 16
0000 = 0     0001 = 1     0011 = 3     0010 = 2
0110 = 6     0111 = 7     0101 = 5     0100 = 4
1100 = 12    1101 = 13    1111 = 15    1110 = 14
1010 = 10    1011 = 11    1001 = 9     1000 = 8
Computing Binary Reflected Gray Code
/* Gray code: XOR each bit with the bit above it */
int gray(int i) {
    return (i >> 1) ^ i;
}

/* inverse Gray code: XOR in all higher-order bits */
int inv_gray(int i) {
    int k = i;
    while (k > 0) {
        k >>= 1;
        i ^= k;
    }
    return i;
}
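A quick check of these routines (an illustrative addition, not from the slides): inv_gray inverts gray, and consecutive codes, including the wraparound from 15 to 0, differ in exactly one bit.

#include <stdio.h>
#include <assert.h>

int main(void) {
    for (int i = 0; i < 16; i++) {
        assert(inv_gray(gray(i)) == i);                /* inverse property */
        int diff = gray(i) ^ gray((i + 1) % 16);       /* consecutive codes */
        assert(diff != 0 && (diff & (diff - 1)) == 0); /* exactly one bit set */
        printf("%2d -> %2d\n", i, gray(i));
    }
    return 0;
}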
Hypercubes
Hypercube of dimension k, or k-cube, is graph with 2^k nodes numbered 0, ..., 2^k − 1, and edges between all pairs of nodes whose binary numbers differ in one bit position
Hypercube of dimension k can be created recursively by replicating hypercube of dimension k − 1 and connecting their corresponding nodes
Visiting nodes of hypercube in Gray code order gives Hamiltonian cycle, embedding ring in hypercube
For mesh or torus of higher dimension, concatenating Gray codes for each dimension gives embedding in hypercube (see the sketch below)
Hypercubes provide elegant paradigm for low-diameter target network in designing parallel algorithms
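A small sketch of that construction (an illustrative addition, assuming the gray routine above and side lengths that are powers of two): node (i, j) of a 2^a × 2^b torus maps to the (a+b)-cube node whose bits concatenate the Gray codes of i and j, so torus neighbors map to hypercube neighbors.

/* embed node (i,j) of a (1<<a) x (1<<b) torus into an (a+b)-cube;
   neighboring torus nodes (even with wraparound) map to hypercube
   nodes whose numbers differ in exactly one bit */
int torus_to_hypercube(int i, int j, int b) {
    return (gray(i) << b) | gray(j);
}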
Optimality in Network Topology Design
Hypercubes are near-optimal networks, in the sense that they can execute any communication pattern with O(log p) slowdown via randomizing the data layout
A more refined notion of optimality considers the physical space necessary to build the network
Fat trees (switched binary trees that assign more bandwidth to links nearer the root) are optimal in this sense within polylogarithmic factors
As the number of processors increases, their bisection bandwidth scales as O(p^(2/3)), as opposed to O(1) for binary trees
Low diameter networks
The Cray Dragonfly network has diameter 3
- define densely connected groups (cliques) of nodes
- a single pair of nodes connects each pair of groups

Given a target diameter r, the Moore bound provides a lower bound on the degree d:

    p ≤ 1 + d · ∑_{i=0}^{r−1} (d − 1)^i    (asymptotically, p = O(d^r))

Slim Fly nearly attains this bound for diameter 2
- Slim Fly arranges processors into two 2-D grids, with each processor connecting to some nodes in its column and some nodes in the other grid
- the Slim Fly network yields degree of roughly √p
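As a worked illustration (not from the slides), the bound is easy to evaluate: for degree d = 32 and diameter r = 2 it gives p ≤ 1 + 32 + 32·31 = 1025, which is why diameter-2 networks like Slim Fly need degree near √p.

/* Moore bound: maximum p for a network of degree d and diameter r,
   p <= 1 + d * sum_{i=0}^{r-1} (d-1)^i */
long moore_bound(int d, int r) {
    long total = 1, layer = 1;  /* BFS layer i holds d*(d-1)^(i-1) nodes */
    for (int i = 0; i < r; i++) {
        layer *= (i == 0) ? d : (d - 1);
        total += layer;
    }
    return total;
}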
Topology-Awareness in Algorithms
Topology-aware algorithms aim to execute effectively on specific network topologies
If mapped ideally to a network topology, applications and algorithms often see significant performance gains
However, real applications are executed on a subset of nodes of a distributed machine, which may not have the same connectivity structure as the overall machine
Moreover, network-topology-specific optimizations are typically not performance-portable
Nevertheless, topologies provide a convenient visual model for design of parallel algorithms
Topology-Obliviousness in Algorithms
An algorithm designed for a sparsely-connected network is typically as efficient on more densely-connected ones
An algorithm designed for a densely-connected network typically incurs a bounded amount of overhead on more sparsely-connected ones
Ideally, parallel algorithms should be topology-oblivious, i.e., perform well on any reasonable network topology
A good parallel algorithm design methodology is to
1 try to obtain cost-optimality for a fully-connected network
2 organize it so it achieves the same cost on some network topology that is as sparsely-connected as possible
Message Passing
Simple model for time required to send message (move data) between adjacent nodes:

    T_msg = α + β·s

- α = startup time = latency (i.e., time to send message of length zero)
- β = incremental transfer time per word (1/β = bandwidth in words per unit time)
- s = length of message in words

For real parallel systems α ≫ β, so we often simplify α + β ≈ α
Algorithmic Communication Cost
Let p processors each send a message of size s in a ring
- p·s is the communication volume (total amount of data sent)
- however, the execution time depends on whether we send the messages concurrently or in sequence
- the communication time models execution time in terms of per-message costs

If the messages are sent simultaneously,

    T_sim-ring(s) = T_msg(s) = α + s·β

If the messages are sent in sequence,

    T_seq-ring(s, p) = p · T_msg(s) = p · (α + s·β)
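To make the model concrete, a tiny cost calculator (illustrative only; the α and β values are made-up placeholders, not measurements of any real machine):

#include <stdio.h>

/* per-message cost model: T_msg = alpha + beta*s, s in words */
double t_msg(double alpha, double beta, double s) {
    return alpha + beta * s;
}

int main(void) {
    double alpha = 1e-6, beta = 1e-9;  /* hypothetical machine parameters */
    double s = 1e6;                    /* message of one million words */
    int p = 64;                        /* ring of 64 processors */
    printf("simultaneous ring: %g s\n", t_msg(alpha, beta, s));
    printf("sequential ring:   %g s\n", p * t_msg(alpha, beta, s));
    return 0;
}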
Message Routing
Messages sent between nodes that are not directly connected must be routed through intermediate nodes
Message routing algorithms can be
- minimal or nonminimal, depending on whether shortest path is always taken
- static or dynamic, depending on whether same path is always taken
- deterministic or randomized, depending on whether path is chosen systematically or randomly
- circuit switched or packet switched, depending on whether entire message goes along reserved path or is transferred in segments that may not all take same path
Message Routing
Most regular network topologies admit simple routing schemes that are static, deterministic, and minimal

[Figure: source-to-destination routes in a 2-D mesh and a hypercube]
Store-and-Forward vs. Cut-Through Routing
Store-and-forward routing: entire message is received and stored at each node before being forwarded to next node on path, so

    T_route = (α + β·s) · D, where D = distance in hops

Cut-through (or wormhole) routing: message broken into segments that are pipelined through network, with each segment forwarded as soon as it is received, so

    T_route = α + β·s + t_h·D, where t_h = incremental time per hop

Generally t_h ≤ α, so we can treat both as network latency:

    T_route = α·D + β·s
Store-and-Forward vs. Wormhole Routing
[Figure: message timelines on processors P0...P3 under store-and-forward vs. cut-through routing]

Cut-through (wormhole) routing greatly reduces distance effect, but aggregate bandwidth may still be significant constraint
Communication Concurrency
For given communication system, it may or may not be possible for each node to
- send message while receiving another simultaneously on same communication link
- send message on one link while receiving simultaneously on different link
- send or receive, or both, simultaneously on multiple links

We will generally assume a processor can send or receive only one message at a time (but can send one and receive one simultaneously)
Collective Communication
Collective communication: multiple nodes communicating simultaneously in systematic pattern, which we can classify as
- One-to-All: Broadcast, Scatter
- All-to-One: Reduce, Gather
- All-to-One + One-to-All: Allreduce (Reduce + Broadcast), Allgather (Gather + Broadcast), Reduce-Scatter (Reduce + Scatter), Scan
- All-to-All: All-to-all

The distinction between the last two types is made due to their different cost characteristics

MPI (Message-Passing Interface) provides all of these as well as variable-size versions (e.g., (All)Gatherv, All-to-allv)
Collective Communication
[Figure: data movement for broadcast, scatter/gather, allgather, and complete exchange (all-to-all) across six processes]
Broadcast
Broadcast: source node sends same message of size s to each of p − 1 other nodes
Binary or binomial trees are often used for one-to-all collectives like broadcast, but any spanning tree will do (see the sketch below)
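A minimal binomial-tree broadcast in MPI (an illustrative sketch, assuming root 0 and p a power of two; in practice one would simply call MPI_Bcast):

#include <mpi.h>

/* log2(p) rounds; in the round with spacing dist, every rank that
   already holds the data sends it to the rank dist ahead of it */
void tree_bcast(double *buf, int count, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    for (int dist = p / 2; dist >= 1; dist /= 2) {
        if (rank % (2 * dist) == 0)
            MPI_Send(buf, count, MPI_DOUBLE, rank + dist, 0, comm);
        else if (rank % (2 * dist) == dist)
            MPI_Recv(buf, count, MPI_DOUBLE, rank - dist, 0, comm,
                     MPI_STATUS_IGNORE);
    }
}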
Broadcast
[Figure: spanning-tree broadcast on a 2-D mesh and a hypercube, with edges labeled by the step at which they carry the message]
Broadcast
Cost of broadcast depends on network, for example
- 1-D mesh: T = (p − 1)(α + β·s)
- 2-D mesh: T = 2(√p − 1)(α + β·s)
- hypercube: T = log(p) · (α + β·s)

For long messages, bandwidth utilization may be enhanced by breaking message into segments and either
- pipelining segments along single spanning tree, or
- sending each segment along different spanning tree having same root

For example, hypercube with 2^k nodes has k edge-disjoint spanning trees for any given root node
Butterfly Protocols
All collective communication can be done near-optimally with butterfly protocols, which use all links of a hypercube network
Butterfly Allgather (Recursive Doubling)
Allgather: each of p nodes sends its message to all other nodes
Cost of Butterfly Allgather
The butterfly has log2(p) levels. The size of the message doubles at each level until all s elements are gathered, so the total cost is

    T_allgather(s, p) = 0                                     if p = 1
    T_allgather(s, p) = T_allgather(s/2, p/2) + α + β·s/2     if p > 1

                      ≈ α·log2(p) + ∑_{i=1}^{log2(p)} β·s/2^i

                      ≈ α·log2(p) + β·s

The geometric summation in the cost analysis is typical for butterfly protocols for one-to-all and all-to-one collectives (see the sketch below)
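A sketch of recursive-doubling allgather in MPI (illustrative, assuming p a power of two and each process contributing block doubles; MPI_Allgather typically uses this protocol internally for short messages):

#include <mpi.h>
#include <string.h>

/* at step i, exchange the 2^i blocks currently held with the partner
   whose rank differs in bit i, doubling the data held each step */
void rd_allgather(const double *mine, double *all, int block, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    memcpy(all + (size_t)rank * block, mine, block * sizeof(double));
    for (int dist = 1; dist < p; dist <<= 1) {
        int partner = rank ^ dist;
        int mystart = (rank / dist) * dist;     /* first block I hold */
        int theirs  = (partner / dist) * dist;  /* first block partner holds */
        MPI_Sendrecv(all + (size_t)mystart * block, dist * block, MPI_DOUBLE,
                     partner, 0,
                     all + (size_t)theirs * block, dist * block, MPI_DOUBLE,
                     partner, 0, comm, MPI_STATUS_IGNORE);
    }
}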
Butterfly Scatter
Scatter: source node sends message of size s/p to each of p − 1 other nodes
Note that the messages are forwarded down a binomial, not a binary, spanning tree of nodes
Butterfly Broadcast
    T_broadcast = T_scatter + T_allgather = 2·T_allgather
Reduction
Reduction: data from all p nodes are combined by applying specified associative operation ⊕ (e.g., sum, product, max, min, logical OR, logical AND) to produce overall result

Generally, we can turn any broadcast algorithm into a reduction algorithm by reversing the flow of information, so we see
- Broadcast done effectively by Scatter + Allgather
- Reduction done effectively by Reduce-Scatter + Gather
- Allreduce done effectively by Reduce-Scatter + Allgather

These one-to-all and all-to-one collectives have butterfly protocols with equivalent cost (see the sketch below)
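That composition can be written directly with MPI collectives (an illustrative sketch, assuming n is divisible by p; MPI_Allreduce itself typically uses this decomposition for long messages):

#include <mpi.h>

/* allreduce over n doubles as reduce-scatter + allgather */
void composed_allreduce(const double *in, double *out, int n, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int block = n / p;
    /* reduce-scatter: each process gets the reduced values for its block,
       placed where its contribution belongs in the final array */
    MPI_Reduce_scatter_block(in, out + (size_t)rank * block, block,
                             MPI_DOUBLE, MPI_SUM, comm);
    /* allgather replicates all reduced blocks everywhere (in place) */
    MPI_Allgather(MPI_IN_PLACE, block, MPI_DOUBLE,
                  out, block, MPI_DOUBLE, comm);
}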
Butterfly Reduce-Scatter (Recursive Halving)
Reduce-scatter: a reduction with the result distributed over all p nodes
Butterfly Allreduce
Allreduce: a reduction with the result replicated on all p nodes
    T_allreduce = T_reduce-scatter + T_allgather
Butterfly Allreduce: note recursive structure of butterfly
Scan or Prefix
Scan or prefix: given data values x_0, x_1, ..., x_(p−1), one per node, along with associative operation ⊕, compute sequence of partial results y_0, y_1, ..., y_(p−1), where

    y_k = x_0 ⊕ x_1 ⊕ · · · ⊕ x_k

and y_k is to reside on node k, k = 0, ..., p − 1

Scan can be implemented via a butterfly protocol similar to Allreduce, except intermediate results must be stored while doing recursive halving, to be recombined when doing recursive doubling
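MPI provides this collective directly; a minimal usage sketch (illustrative) computing inclusive prefix sums of one value per node:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    double x = rank + 1.0, y;   /* node k contributes x_k = k + 1 */
    /* inclusive scan with + as the operation: y_k = x_0 + ... + x_k */
    MPI_Scan(&x, &y, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    printf("node %d: prefix sum = %g\n", rank, y);
    MPI_Finalize();
    return 0;
}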
Butterfly All-to-All
The size of the message stays the same at each level, so

    T_all-to-all(s, p) = α·log2(p) + β·s·log2(p)/2

It's possible to do All-to-All with less bandwidth cost (as low as β·s, by sending directly to targets) at the cost of more messages (as high as α·p if sending directly; see the sketch below)
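A sketch of the direct variant in MPI (illustrative; send holds one block of block doubles destined for each process, and MPI_Alltoall offers the same operation as a single call):

#include <mpi.h>
#include <string.h>

/* direct all-to-all: p-1 pairwise exchanges of one block each,
   bandwidth cost ~ beta*s but message cost ~ alpha*p */
void direct_alltoall(const double *send, double *recv, int block, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    memcpy(recv + (size_t)rank * block, send + (size_t)rank * block,
           block * sizeof(double));
    for (int i = 1; i < p; i++) {
        int dst = (rank + i) % p;       /* where my block for dst goes */
        int src = (rank - i + p) % p;   /* who sends me a block this step */
        MPI_Sendrecv(send + (size_t)dst * block, block, MPI_DOUBLE, dst, 0,
                     recv + (size_t)src * block, block, MPI_DOUBLE, src, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}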
Collectives on Mesh and Torus Networks
Butterfly protocols cannot be mapped to tori without dilation
- bandwidth-efficient collectives can be achieved by instead pipelining along spanning trees
- if height of spanning tree is H (e.g., H ≈ 2√p for 2-D mesh), then cost of one-to-all and all-to-one collectives is

    T_one-to-all(s, p, H) = Θ(α·H + β·s)

- hypercube (general) cost is recovered with H = log2(p)
- use of more than one disjoint spanning tree (rectangular collectives) is beneficial if processors can send and receive messages along multiple links concurrently
- all-to-all cost generally depends on the bisection bandwidth of the network (proportional to p^((d−1)/d) for d-dimensional torus/mesh)
References – Moore’s Law
M. T. Heath, A tale of two laws, International Journal of High Performance Computing Applications, 29(3):320-330, 2015
C. A. Mack, Fifty years of Moore's law, IEEE Transactions on Semiconductor Manufacturing, 24(2):202-207, 2011
References – Parallel Computing
G. S. Almasi and A. Gottlieb, Highly Parallel Computing, 2nd ed., Benjamin/Cummings, 1994
J. Dongarra, et al., eds., Sourcebook of Parallel Computing, Morgan Kaufmann, 2003
A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, 2nd ed., Addison-Wesley, 2003
G. Hager and G. Wellein, Introduction to High Performance Computing for Scientists and Engineers, Chapman & Hall, 2011
K. Hwang and Z. Xu, Scalable Parallel Computing, McGraw-Hill, 1998
A. Y. Zomaya, ed., Parallel and Distributed Computing Handbook, McGraw-Hill, 1996
References – Parallel Architectures
W. C. Athas and C. L. Seitz, Multicomputers: message-passing concurrent computers, IEEE Computer 21(8):9-24, 1988
D. E. Culler, J. P. Singh, and A. Gupta, Parallel Computer Architecture, Morgan Kaufmann, 1998
M. Dubois, M. Annavaram, and P. Stenström, Parallel Computer Organization and Design, Cambridge University Press, 2012
R. Duncan, A survey of parallel computer architectures, IEEE Computer 23(2):5-16, 1990
F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann, 1992
References – Interconnection Networks

L. N. Bhuyan, Q. Yang, and D. P. Agarwal, Performance of multiprocessor interconnection networks, IEEE Computer 22(2):25-37, 1989
W. J. Dally and B. P. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, 2004
T. Y. Feng, A survey of interconnection networks, IEEE Computer 14(12):12-27, 1981
I. D. Scherson and A. S. Youssef, eds., Interconnection Networks for High-Performance Parallel Computers, IEEE Computer Society Press, 1994
H. J. Siegel, Interconnection Networks for Large-Scale Parallel Processing, D. C. Heath, 1985
C.-L. Wu and T.-Y. Feng, eds., Interconnection Networks for Parallel and Distributed Processing, IEEE Computer Society Press, 1984
References – Hypercubes

D. P. Bertsekas et al., Optimal communication algorithms for hypercubes, J. Parallel Distrib. Comput. 11:263-275, 1991
S. L. Johnsson and C.-T. Ho, Optimum broadcasting and personalized communication in hypercubes, IEEE Trans. Comput. 38:1249-1268, 1989
O. McBryan and E. F. Van de Velde, Hypercube algorithms and implementations, SIAM J. Sci. Stat. Comput. 8:s227-s287, 1987
S. Ranka, Y. Won, and S. Sahni, Programming a hypercube multicomputer, IEEE Software 69-77, September 1988
Y. Saad and M. H. Schultz, Topological properties of hypercubes, IEEE Trans. Comput. 37:867-872, 1988
Y. Saad and M. H. Schultz, Data communication in hypercubes, J. Parallel Distrib. Comput. 6:115-135, 1989
C. L. Seitz, The cosmic cube, Comm. ACM 28:22-33, 1985