Collective Operations for Wide-Area Message Passing
Systems Using Adaptive Spanning Trees
Hideo Saito, Kenjiro Taura and Takashi Chikayama (Univ. of Tokyo)
Grid 2005 (Nov. 13, 2005)
2
Message Passing in WANs
- Increase in the bandwidth of wide-area networks
  - More opportunities to perform parallel computation using multiple clusters connected by a WAN
- Demands to use message passing for parallel computation in WANs
  - Existing programs are written using message passing
  - Familiar programming model
3
Collective Operations
- Communication operations in which all processors participate
  - E.g., broadcast and reduction
- Importance (Gorlatch et al. 2004)
  - Easier to program using collective operations than with just point-to-point operations (i.e., send/receive)
  - Libraries can provide a faster implementation than with user-level send/recv
  - Libraries can take advantage of knowledge of the underlying hardware

[Figure: a broadcast from a root processor]
4
Coll. Ops. in LANs and WANs
- Collective operations for LANs
  - Optimized under the assumption that all links have the same latency/bandwidth
- Collective operations for WANs
  - Wide-area links are much slower than local-area links
  - Collective operations need to avoid links with high latency or low bandwidth
  - Existing methods use static Grid-aware trees constructed using manually-supplied information
5
Problems of Existing Methods
- Large-scale, long-lasting applications bring more situations in which…
  - Different computational resources are used upon each invocation
  - Computational resources are automatically allocated by middleware
  - Processors are added/removed after application startup
- Difficult to manually supply topology information
- The static trees used in existing methods won't work
6
Contribution of Our Work
- A method to perform collective operations that:
  - Is efficient in clustered wide-area systems
  - Doesn't require manual configuration
  - Adapts to new topologies when processes are added/removed
- Implementation for the Phoenix Message Passing System
- Implementation for MPI (work in progress)
7
Outline
1. Introduction
2. Related Work
3. Phoenix
4. Our Proposal
5. Experimental Results
6. Conclusion and Future Work
8
MPICH
- Thakur et al. 2005
- Assumes that the latency and bandwidth of all links are the same
- Short messages: latency-aware algorithm (binomial tree; see the sketch below)
- Long messages: bandwidth-aware algorithm (scatter, then ring all-gather)

[Figure: a binomial tree rooted at one process for short messages; a scatter followed by a ring all-gather for long messages]
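For reference, a minimal sketch of the binomial-tree schedule used for short messages; the function name and interface are ours, not MPICH's API:

```python
def binomial_tree(rank: int, size: int, root: int = 0):
    """Parent and children of `rank` in a binomial broadcast tree:
    each round, every process holding the message forwards it to the
    process `mask` ranks away, doubling coverage per round."""
    rel = (rank - root) % size          # rank relative to the root
    if rel == 0:
        parent = None                   # the root receives from nobody
        mask = 1
        while mask < size:              # one past the highest round's bit
            mask <<= 1
    else:
        mask = rel & -rel               # lowest set bit: the round we receive in
        parent = (rank - mask) % size
    children = []
    m = mask >> 1
    while m:
        if rel + m < size:              # only forward to ranks that exist
            children.append((rank + m) % size)
        m >>= 1
    return parent, children

# With 6 processes rooted at 0: 0 sends to 4, 2, 1; 2 sends to 3; 4 sends to 5.
assert binomial_tree(0, 6) == (None, [4, 2, 1])
assert binomial_tree(4, 6) == (0, [5])
```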
9
MagPIe
- Kielmann et al. '99
- Separate static trees for wide-area and local-area communication
- Broadcast (sketched below):
  - The root sends to the "coordinator node" of each cluster
  - Coordinator nodes perform an MPICH-like broadcast within each cluster

[Figure: the root reaching a coordinator node in each LAN, which then broadcasts locally]
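A minimal sketch of this two-level structure, assuming cluster membership is supplied statically; the names are ours, and the local fan-out is flattened here where MagPIe really runs an MPICH-like tree inside each cluster:

```python
def two_level_tree(clusters, root):
    """Map each rank to the ranks it forwards a broadcast to."""
    children = {r: [] for members in clusters for r in members}
    for members in clusters:
        # The root coordinates its own cluster; otherwise pick a member.
        coord = root if root in members else members[0]
        if coord != root:
            children[root].append(coord)   # one wide-area hop per cluster
        for r in members:
            if r != coord and r != root:
                children[coord].append(r)  # local delivery inside the LAN
    return children

# Three clusters, root 0: one WAN edge to 3 and one to 5, the rest local.
print(two_level_tree([[0, 1, 2], [3, 4], [5, 6, 7]], root=0))
# {0: [1, 2, 3, 5], 1: [], 2: [], 3: [4], 4: [], 5: [6, 7], 6: [], 7: []}
```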
10
Other Works on Coll. Ops. for WANs
- Other works that rely on manually-supplied information:
  - Network Performance-aware Collective Communication for Clustered Wide Area Systems (Kielmann et al. 2001)
  - MPICH-G2 (Karonis et al. 2003)
11
Content Delivery Networks
- Application-level multicast mechanisms using topology-aware overlay networks
  - Overcast (Jannotti et al. 2000)
  - SelectCast (Bozdog et al. 2003)
  - Banerjee et al. 2002
- Designed for content delivery; don't work for message passing
  - Data loss
  - Single source
  - Only 1-to-N operations
12
Outline
1. Introduction
2. Related Work
3. Phoenix
4. Our Proposal
5. Experimental Results
6. Conclusion and Future Work
13
Phoenix
- Taura et al. (PPoPP 2003)
- Phoenix Programming Model
  - Message passing model: programs are written using send/receive
  - Messages are addressed to virtual nodes
  - Strong support for addition/removal of processes during execution
- Phoenix Message Passing Library
  - Message passing library based on the Phoenix Programming Model
  - Basis of our implementation
14
Addition/Removal of Processes
- Virtual node namespace
  - Messages are addressed to virtual nodes instead of to processes
- An API to "migrate" virtual nodes supports addition/removal of processes during execution (modeled in the sketch below)

[Figure: send(30) reaches the process owning virtual nodes 20-39; after a new process JOINs and virtual nodes 30-39 migrate to it, send(30) reaches the new process]
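An illustrative model of this addressing scheme; the class and function names are ours, not the Phoenix API:

```python
class Process:
    """A process owning a slice of the virtual-node namespace."""
    def __init__(self, name, vnodes):
        self.name = name
        self.vnodes = set(vnodes)          # virtual nodes this process owns

    def migrate(self, vnodes, target):
        """Hand a set of virtual nodes over to another process."""
        moved = self.vnodes & set(vnodes)
        self.vnodes -= moved
        target.vnodes |= moved

def deliver(processes, vnode):
    """Route a message addressed to a virtual node to its current owner."""
    return next(p for p in processes if vnode in p.vnodes)

# As on the slide: P1 owns 20-39; a joining process P2 takes over 30-39,
# after which send(30) reaches P2 instead of P1.
p0, p1 = Process("P0", range(0, 20)), Process("P1", range(20, 40))
p2 = Process("P2", [])                     # newly joined process
p1.migrate(range(30, 40), p2)
assert deliver([p0, p1, p2], 30) is p2
```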
15
Broadcast in Phoenix
[Figure: in MPI, a broadcast delivers one message per process from the root; in Phoenix, a broadcast targets virtual nodes 0-4, so messages addressed to virtual nodes residing on the same process (e.g., after migration) may be delivered together]
16
Reduction in Phoenix
[Figure: in MPI, a reduction combines one value per process toward the root; in Phoenix, values are combined per virtual node (0-4), with the operation (Op) applied as partial results meet on the way to the root]
17
Outline
1. Introduction
2. Related Work
3. Phoenix
4. Our Proposal
5. Experimental Results
6. Conclusion and Future Work
18
Overview of Our Proposal
- Create topology-aware spanning trees at run-time
  - Latency-aware trees (for short messages)
  - Bandwidth-aware trees (for long messages)
- Perform broadcasts/reductions along the generated trees
- Update the trees when processes are added/removed
19
Spanning Tree Creation
- Create a spanning tree for each process, with that process at the root
- Each process autonomously (see the simulation sketch below):
  - Measures the RTT (or bandwidth) between itself and randomly selected other processes
  - Searches for a suitable parent in each spanning tree

[Figure: a process measuring RTTs to randomly selected peers]
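Our reading of this per-process control loop, as a small simulation; `measure_rtt` and `consider_parent` stand in for the library's probe and the selection tests on the next slides:

```python
import random

def probe_round(self_id, peers, roots, measure_rtt, consider_parent):
    """One autonomous maintenance round at process `self_id`: probe a
    random peer, then offer it as a parent candidate in every tree."""
    peer = random.choice([p for p in peers if p != self_id])
    rtt = measure_rtt(self_id, peer)
    for root in roots:                  # one spanning tree per root process
        consider_parent(root, peer, rtt)

# Simulated demo: constant 5 ms RTTs; we simply log the offers made.
offers = []
probe_round(0, peers=[0, 1, 2, 3], roots=[1, 2],
            measure_rtt=lambda a, b: 0.005,
            consider_parent=lambda root, cand, rtt: offers.append((root, cand, rtt)))
print(offers)                           # two offers, one per tree
```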
20
Latency-Aware Trees
- Goal
  - Few edges between clusters
  - Moderate fan-out and depth within clusters
- Parent selection: process p switches to candidate cand when both conditions hold (see the sketch below):
  - RTT(p, cand) < RTT(p, parent)
  - dist(cand, root) < dist(p, root)

[Figures (build sequence): p compares RTT(p, parent) with RTT(p, cand) and dist(p, root) with dist(cand, root); successive ParentChange messages leave few wide-area edges and moderate trees within each LAN]
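A per-tree state sketch of this test; we assume dist(·, root) is the accumulated RTT along the tree path to the root, which the slide leaves open:

```python
class LatencyTreeNode:
    """Per-process state for one latency-aware spanning tree (units: ms)."""
    def __init__(self, parent, rtt_to_parent, dist_to_root):
        self.parent = parent
        self.rtt_to_parent = rtt_to_parent
        self.dist_to_root = dist_to_root

    def consider(self, cand, rtt_to_cand, cand_dist_to_root):
        """Adopt `cand` only if both slide conditions hold:
        RTT(p, cand) < RTT(p, parent)    -- cand is closer to us, and
        dist(cand, root) < dist(p, root) -- cand is nearer the root,
        which rules out cycles while pulling the tree toward local links."""
        if (rtt_to_cand < self.rtt_to_parent
                and cand_dist_to_root < self.dist_to_root):
            self.parent = cand
            self.rtt_to_parent = rtt_to_cand
            self.dist_to_root = cand_dist_to_root + rtt_to_cand

# A node 100 ms from its wide-area parent adopts a 2 ms local peer.
node = LatencyTreeNode(parent="far", rtt_to_parent=100, dist_to_root=100)
node.consider("near", rtt_to_cand=2, cand_dist_to_root=50)
assert node.parent == "near" and node.dist_to_root == 52
```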
24
Bandwidth-Aware Trees
- Goal: efficient use of bandwidth during a broadcast/reduction
- Bandwidth estimation: the bandwidth p would obtain by attaching to candidate cand is
  est_{p,cand} = min(est_cand, bw_p2p / (n_children + 1))
  where est_cand is cand's own estimated bandwidth, bw_p2p is the measured point-to-point bandwidth, and n_children is cand's current number of children (worked through in the sketch below)
- Parent selection: switch to cand when est_{p,cand} > est_{p,parent}

[Figure: p evaluating cand against cand's parent, using bw_p2p, n_children, and est_cand]
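The estimate written out; the symbol meanings above are our gloss of the figure labels:

```python
def estimate(est_cand: float, bw_p2p: float, n_children: int) -> float:
    """Bandwidth p can expect under cand: limited both by what cand
    itself receives and by cand's point-to-point bandwidth split among
    its existing children plus the newcomer."""
    return min(est_cand, bw_p2p / (n_children + 1))

def better_bandwidth_parent(est_via_cand: float, est_via_parent: float) -> bool:
    """The slide's parent-selection test: est_{p,cand} > est_{p,parent}."""
    return est_via_cand > est_via_parent

# A cand receiving at 100 MB/s that can push 80 MB/s point-to-point and
# already feeds 3 children offers a new child 80/4 = 20 MB/s.
assert estimate(100.0, 80.0, 3) == 20.0
assert better_bandwidth_parent(20.0, 12.5)
```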
25
Broadcast
- Short messages (< 128 KB): forward along a latency-aware tree
- Long messages (≥ 128 KB): pipeline along a bandwidth-aware tree
- Each header includes the set of virtual nodes to be forwarded via the receiving process (see the sketch below)

[Figure: root 0 sends the sets {1,3,4} and {2,5} to its children; the sets shrink to {3}, {4}, {5} down the tree, so a broadcast still covers virtual node 5 after it migrates]
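A sketch of the forwarding rule; the tree layout and subtree map below are illustrative:

```python
def forward(header_vnodes, children, subtree):
    """Partition the destination set in the header among the children,
    according to which subtree each virtual node currently lives in."""
    out = {}
    for child in children:
        subset = header_vnodes & subtree[child]
        if subset:                       # skip children with nothing to reach
            out[child] = subset
    return out

# Mirrors the slide: root 0 forwards {1,3,4} toward 1 and {2,5} toward 2.
subtrees = {1: {1, 3, 4}, 2: {2, 5}}
assert forward({1, 2, 3, 4, 5}, [1, 2], subtrees) == {1: {1, 3, 4}, 2: {2, 5}}
```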
26
Reduction
- Each processor:
  - Waits for a message from all of its children
  - Performs the specified operation
  - Forwards the result to its parent
- Timeout mechanism (sketched below): avoids waiting forever for a child that has already sent its message to another process after a ParentChange

[Figure: p times out on a child that switched from its old parent to a new one]
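A sketch of the timeout rule; the interface (`try_recv`) and the default deadline are our assumptions:

```python
import time

def reduce_from_children(children, try_recv, op, my_value, timeout=1.0):
    """Combine `my_value` with the partial results the children deliver
    before the deadline; a child that re-parented may never report, so
    we stop waiting rather than block the whole reduction."""
    acc = my_value
    pending = set(children)
    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        msg = try_recv()                 # non-blocking: (child, value) or None
        if msg is None:
            time.sleep(0.001)            # back off instead of spinning
            continue
        child, value = msg
        if child in pending:
            pending.discard(child)
            acc = op(acc, value)
    return acc                           # then forwarded to our parent

# Child "c2" re-parented and never reports; we still finish with c1's value.
inbox = [("c1", 7)]
result = reduce_from_children(["c1", "c2"],
                              try_recv=lambda: inbox.pop() if inbox else None,
                              op=lambda a, b: a + b, my_value=1, timeout=0.05)
assert result == 8
```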
27
Outline
1. Introduction
2. Related Work
3. Phoenix
4. Our Proposal
5. Experimental Results
6. Conclusion and Future Work
28
Broadcast (1-byte)
- MPI-like broadcast: mapped 1 virtual node to each process
- 201 processes in 3 clusters
- Compared: our implementation, a MagPIe-like (static Grid-aware) implementation, and an MPICH-like (Grid-unaware) implementation

[Figures: completion time (ms, 0-40) vs. process number (0-200), one plot per implementation]
29
Broadcast (Long)
- MPI-like broadcast: mapped 1 virtual node to each process
- 137 processes in 4 clusters

[Figure: bandwidth (MB/sec, 0-800) vs. message size (1e4-1e8 bytes) for Dynamic (our impl.), MPICH-like, MagPIe-like, and List]
30
Reduction
- MPI-like reduction: mapped 1 virtual node to each process
- 128 processes in 3 clusters

[Figure: completion time (microsecs, 1e1-1e4) vs. integers summed (1e1-1e7) for Dynamic (our impl.), MagPIe-like, and MPICH-like]
31
Addition/Removal of Processes
- Repeated 4-MB broadcasts; 160 processes in 4 clusters
- Added/removed processes while broadcasting:
  - t = 0 s: 1 virtual node per process
  - t = 60 s: remove half of the processes
  - t = 90 s: re-add the removed processes

[Figure: bandwidth (MB/sec, 0-600) vs. elapsed time (0-120 secs), with the removal and re-addition points marked]
32
Outline
1. Introduction
2. Related Work
3. Phoenix
4. Our Proposal
5. Experimental Results
6. Conclusion and Future Work
33
Conclusion
- Presented a method to perform broadcasts and reductions in WANs without manual configuration
- Experiments
  - Stable state: 1-byte broadcast 3+ times faster than MPICH, within a factor of 2 of MagPIe
  - Addition/removal of processes: effective execution resumed 8 seconds after adding/removing processes
34
Future Work
- Optimize broadcast/reduction: reduce the gap between our method and static Grid-enabled methods
- Other collective operations: all-to-all, barrier
35
Thank you!