Collective Operations for Wide-Area Message Passing
Systems Using Adaptive Spanning Trees
Hideo Saito, Kenjiro Taura and Takashi Chikayama (Univ. of Tokyo)
Grid 2005 (Nov. 13, 2005)
2
Message Passing in WANs
- Increase in the bandwidth of wide-area networks
  - More opportunities to perform parallel computation using multiple clusters connected by a WAN
- Demands to use message passing for parallel computation in WANs
  - Existing programs are written using message passing
  - Familiar programming model
3
Collective Operations
- Communication operations in which all processors participate
  - E.g., broadcast and reduction
- Importance (Gorlatch et al. 2004)
  - Easier to program using collective operations than with just point-to-point operations (i.e., send/receive)
  - Libraries can provide a faster implementation than with user-level send/recv
  - Libraries can take advantage of knowledge of the underlying hardware

[Figure: a broadcast from a root processor]
4
Coll. Ops. in LANs and WANs
- Collective operations for LANs
  - Optimized under the assumption that all links have the same latency/bandwidth
- Collective operations for WANs
  - Wide-area links are much slower than local-area links
  - Collective operations need to avoid links with high latency or low bandwidth
  - Existing methods use static Grid-aware trees constructed using manually-supplied information
5
Problems of Existing Methods
- Large-scale, long-lasting applications bring more situations in which…
  - Different computational resources are used upon each invocation
  - Computational resources are automatically allocated by middleware
  - Processors are added/removed after application startup
- Difficult to manually supply topology information
- The static trees used in existing methods won't work
6
Contribution of Our Work
- A method to perform collective operations that:
  - Is efficient in clustered wide-area systems
  - Doesn't require manual configuration
  - Adapts to new topologies when processes are added/removed
- Implementation for the Phoenix Message Passing System
- Implementation for MPI (work in progress)
7
Outline
1. Introduction
2. Related Work
3. Phoenix
4. Our Proposal
5. Experimental Results
6. Conclusion and Future Work
8
MPICH
- Thakur et al. 2005
- Assumes that the latency and bandwidth of all links are the same
- Short messages: latency-aware algorithm (binomial tree; see the sketch below)
- Long messages: bandwidth-aware algorithm (scatter, then ring all-gather)

[Figure: a binomial tree rooted at one process for short messages; a scatter followed by a ring all-gather for long messages]
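For reference, a minimal sketch of the binomial-tree schedule used for short messages; the function name and interface are ours, not MPICH's API:

```python
def binomial_tree(rank: int, size: int, root: int = 0):
    """Parent and children of `rank` in a binomial broadcast tree:
    each round, every process holding the message forwards it to the
    process `mask` ranks away, doubling coverage per round."""
    rel = (rank - root) % size          # rank relative to the root
    if rel == 0:
        parent = None                   # the root receives from nobody
        mask = 1
        while mask < size:              # one past the highest round's bit
            mask <<= 1
    else:
        mask = rel & -rel               # lowest set bit: the round we receive in
        parent = (rank - mask) % size
    children = []
    m = mask >> 1
    while m:
        if rel + m < size:              # only forward to ranks that exist
            children.append((rank + m) % size)
        m >>= 1
    return parent, children

# With 6 processes rooted at 0: 0 sends to 4, 2, 1; 2 sends to 3; 4 sends to 5.
assert binomial_tree(0, 6) == (None, [4, 2, 1])
assert binomial_tree(4, 6) == (0, [5])
```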
9
MagPIe
- Kielmann et al. '99
- Separate static trees for wide-area and local-area communication
- Broadcast (sketched below):
  - The root sends to the "coordinator node" of each cluster
  - Coordinator nodes perform an MPICH-like broadcast within each cluster

[Figure: the root reaching a coordinator node in each LAN, which then broadcasts locally]
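A minimal sketch of this two-level structure, assuming cluster membership is supplied statically; the names are ours, and the local fan-out is flattened here where MagPIe really runs an MPICH-like tree inside each cluster:

```python
def two_level_tree(clusters, root):
    """Map each rank to the ranks it forwards a broadcast to."""
    children = {r: [] for members in clusters for r in members}
    for members in clusters:
        # The root coordinates its own cluster; otherwise pick a member.
        coord = root if root in members else members[0]
        if coord != root:
            children[root].append(coord)   # one wide-area hop per cluster
        for r in members:
            if r != coord and r != root:
                children[coord].append(r)  # local delivery inside the LAN
    return children

# Three clusters, root 0: one WAN edge to 3 and one to 5, the rest local.
print(two_level_tree([[0, 1, 2], [3, 4], [5, 6, 7]], root=0))
# {0: [1, 2, 3, 5], 1: [], 2: [], 3: [4], 4: [], 5: [6, 7], 6: [], 7: []}
```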
10
Other Works on Coll. Ops. for WANs
- Other works that rely on manually-supplied information:
  - Network Performance-aware Collective Communication for Clustered Wide Area Systems (Kielmann et al. 2001)
  - MPICH-G2 (Karonis et al. 2003)
11
Content Delivery Networks
- Application-level multicast mechanisms using topology-aware overlay networks
  - Overcast (Jannotti et al. 2000)
  - SelectCast (Bozdog et al. 2003)
  - Banerjee et al. 2002
- Designed for content delivery; don't work for message passing
  - Data loss
  - Single source
  - Only 1-to-N operations
12
Outline
1. Introduction
2. Related Work
3. Phoenix
4. Our Proposal
5. Experimental Results
6. Conclusion and Future Work
13
Phoenix
- Taura et al. (PPoPP 2003)
- Phoenix Programming Model
  - Message passing model: programs are written using send/receive
  - Messages are addressed to virtual nodes
  - Strong support for addition/removal of processes during execution
- Phoenix Message Passing Library
  - Message passing library based on the Phoenix Programming Model
  - Basis of our implementation
14
Addition/Removal of Processes
- Virtual node namespace
  - Messages are addressed to virtual nodes instead of to processes
- An API to "migrate" virtual nodes supports addition/removal of processes during execution (modeled in the sketch below)

[Figure: send(30) reaches the process owning virtual nodes 20-39; after a new process JOINs and virtual nodes 30-39 migrate to it, send(30) reaches the new process]
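An illustrative model of this addressing scheme; the class and function names are ours, not the Phoenix API:

```python
class Process:
    """A process owning a slice of the virtual-node namespace."""
    def __init__(self, name, vnodes):
        self.name = name
        self.vnodes = set(vnodes)          # virtual nodes this process owns

    def migrate(self, vnodes, target):
        """Hand a set of virtual nodes over to another process."""
        moved = self.vnodes & set(vnodes)
        self.vnodes -= moved
        target.vnodes |= moved

def deliver(processes, vnode):
    """Route a message addressed to a virtual node to its current owner."""
    return next(p for p in processes if vnode in p.vnodes)

# As on the slide: P1 owns 20-39; a joining process P2 takes over 30-39,
# after which send(30) reaches P2 instead of P1.
p0, p1 = Process("P0", range(0, 20)), Process("P1", range(20, 40))
p2 = Process("P2", [])                     # newly joined process
p1.migrate(range(30, 40), p2)
assert deliver([p0, p1, p2], 30) is p2
```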
15
Broadcast in Phoenix
[Figure: in MPI, a broadcast delivers one message per process from the root; in Phoenix, a broadcast targets virtual nodes 0-4, so messages addressed to virtual nodes residing on the same process (e.g., after migration) may be delivered together]
16
Reduction in Phoenix
[Figure: in MPI, a reduction combines one value per process toward the root; in Phoenix, values are combined per virtual node (0-4), with the operation (Op) applied as partial results meet on the way to the root]
17
Outline
1. Introduction
2. Related Work
3. Phoenix
4. Our Proposal
5. Experimental Results
6. Conclusion and Future Work
18
Overview of Our Proposal
- Create topology-aware spanning trees at run-time
  - Latency-aware trees (for short messages)
  - Bandwidth-aware trees (for long messages)
- Perform broadcasts/reductions along the generated trees
- Update the trees when processes are added/removed
19
Spanning Tree Creation
- Create a spanning tree for each process, with that process at the root
- Each process autonomously (see the simulation sketch below):
  - Measures the RTT (or bandwidth) between itself and randomly selected other processes
  - Searches for a suitable parent in each spanning tree

[Figure: a process measuring RTTs to randomly selected peers]
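Our reading of this per-process control loop, as a small simulation; `measure_rtt` and `consider_parent` stand in for the library's probe and the selection tests on the next slides:

```python
import random

def probe_round(self_id, peers, roots, measure_rtt, consider_parent):
    """One autonomous maintenance round at process `self_id`: probe a
    random peer, then offer it as a parent candidate in every tree."""
    peer = random.choice([p for p in peers if p != self_id])
    rtt = measure_rtt(self_id, peer)
    for root in roots:                  # one spanning tree per root process
        consider_parent(root, peer, rtt)

# Simulated demo: constant 5 ms RTTs; we simply log the offers made.
offers = []
probe_round(0, peers=[0, 1, 2, 3], roots=[1, 2],
            measure_rtt=lambda a, b: 0.005,
            consider_parent=lambda root, cand, rtt: offers.append((root, cand, rtt)))
print(offers)                           # two offers, one per tree
```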
20
Latency-Aware Trees
- Goal
  - Few edges between clusters
  - Moderate fan-out and depth within clusters
- Parent selection: process p switches to candidate cand when both conditions hold (see the sketch below):
  - RTT(p, cand) < RTT(p, parent)
  - dist(cand, root) < dist(p, root)

[Figures (build sequence): p compares RTT(p, parent) with RTT(p, cand) and dist(p, root) with dist(cand, root); successive ParentChange messages leave few wide-area edges and moderate trees within each LAN]
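A per-tree state sketch of this test; we assume dist(·, root) is the accumulated RTT along the tree path to the root, which the slide leaves open:

```python
class LatencyTreeNode:
    """Per-process state for one latency-aware spanning tree (units: ms)."""
    def __init__(self, parent, rtt_to_parent, dist_to_root):
        self.parent = parent
        self.rtt_to_parent = rtt_to_parent
        self.dist_to_root = dist_to_root

    def consider(self, cand, rtt_to_cand, cand_dist_to_root):
        """Adopt `cand` only if both slide conditions hold:
        RTT(p, cand) < RTT(p, parent)    -- cand is closer to us, and
        dist(cand, root) < dist(p, root) -- cand is nearer the root,
        which rules out cycles while pulling the tree toward local links."""
        if (rtt_to_cand < self.rtt_to_parent
                and cand_dist_to_root < self.dist_to_root):
            self.parent = cand
            self.rtt_to_parent = rtt_to_cand
            self.dist_to_root = cand_dist_to_root + rtt_to_cand

# A node 100 ms from its wide-area parent adopts a 2 ms local peer.
node = LatencyTreeNode(parent="far", rtt_to_parent=100, dist_to_root=100)
node.consider("near", rtt_to_cand=2, cand_dist_to_root=50)
assert node.parent == "near" and node.dist_to_root == 52
```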
24
Bandwidth-Aware Trees
- Goal: efficient use of bandwidth during a broadcast/reduction
- Bandwidth estimation: the bandwidth p would obtain by attaching to candidate cand is
  est_{p,cand} = min(est_cand, bw_p2p / (n_children + 1))
  where est_cand is cand's own estimated bandwidth, bw_p2p is the measured point-to-point bandwidth, and n_children is cand's current number of children (worked through in the sketch below)
- Parent selection: switch to cand when est_{p,cand} > est_{p,parent}

[Figure: p evaluating cand against cand's parent, using bw_p2p, n_children, and est_cand]
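The estimate written out; the symbol meanings above are our gloss of the figure labels:

```python
def estimate(est_cand: float, bw_p2p: float, n_children: int) -> float:
    """Bandwidth p can expect under cand: limited both by what cand
    itself receives and by cand's point-to-point bandwidth split among
    its existing children plus the newcomer."""
    return min(est_cand, bw_p2p / (n_children + 1))

def better_bandwidth_parent(est_via_cand: float, est_via_parent: float) -> bool:
    """The slide's parent-selection test: est_{p,cand} > est_{p,parent}."""
    return est_via_cand > est_via_parent

# A cand receiving at 100 MB/s that can push 80 MB/s point-to-point and
# already feeds 3 children offers a new child 80/4 = 20 MB/s.
assert estimate(100.0, 80.0, 3) == 20.0
assert better_bandwidth_parent(20.0, 12.5)
```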
25
Broadcast
- Short messages (< 128 KB): forward along a latency-aware tree
- Long messages (≥ 128 KB): pipeline along a bandwidth-aware tree
- Each header includes the set of virtual nodes to be forwarded via the receiving process (see the sketch below)

[Figure: root 0 sends the sets {1,3,4} and {2,5} to its children; the sets shrink to {3}, {4}, {5} down the tree, so a broadcast still covers virtual node 5 after it migrates]
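A sketch of the forwarding rule; the tree layout and subtree map below are illustrative:

```python
def forward(header_vnodes, children, subtree):
    """Partition the destination set in the header among the children,
    according to which subtree each virtual node currently lives in."""
    out = {}
    for child in children:
        subset = header_vnodes & subtree[child]
        if subset:                       # skip children with nothing to reach
            out[child] = subset
    return out

# Mirrors the slide: root 0 forwards {1,3,4} toward 1 and {2,5} toward 2.
subtrees = {1: {1, 3, 4}, 2: {2, 5}}
assert forward({1, 2, 3, 4, 5}, [1, 2], subtrees) == {1: {1, 3, 4}, 2: {2, 5}}
```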
26
Reduction
- Each processor:
  - Waits for a message from all of its children
  - Performs the specified operation
  - Forwards the result to its parent
- Timeout mechanism (sketched below): avoids waiting forever for a child that has already sent its message to another process after a ParentChange

[Figure: p times out on a child that switched from its old parent to a new one]
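A sketch of the timeout rule; the interface (`try_recv`) and the default deadline are our assumptions:

```python
import time

def reduce_from_children(children, try_recv, op, my_value, timeout=1.0):
    """Combine `my_value` with the partial results the children deliver
    before the deadline; a child that re-parented may never report, so
    we stop waiting rather than block the whole reduction."""
    acc = my_value
    pending = set(children)
    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        msg = try_recv()                 # non-blocking: (child, value) or None
        if msg is None:
            time.sleep(0.001)            # back off instead of spinning
            continue
        child, value = msg
        if child in pending:
            pending.discard(child)
            acc = op(acc, value)
    return acc                           # then forwarded to our parent

# Child "c2" re-parented and never reports; we still finish with c1's value.
inbox = [("c1", 7)]
result = reduce_from_children(["c1", "c2"],
                              try_recv=lambda: inbox.pop() if inbox else None,
                              op=lambda a, b: a + b, my_value=1, timeout=0.05)
assert result == 8
```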
27
Outline
1. Introduction
2. Related Work
3. Phoenix
4. Our Proposal
5. Experimental Results
6. Conclusion and Future Work
28
Broadcast (1-byte)
- MPI-like broadcast: mapped 1 virtual node to each process
- 201 processes in 3 clusters
- Compared: our implementation, a MagPIe-like (static Grid-aware) implementation, and an MPICH-like (Grid-unaware) implementation

[Figures: completion time (ms, 0-40) vs. process number (0-200), one plot per implementation]
29
Broadcast (Long)
- MPI-like broadcast: mapped 1 virtual node to each process
- 137 processes in 4 clusters

[Figure: bandwidth (MB/sec, 0-800) vs. message size (1e4-1e8 bytes) for Dynamic (our impl.), MPICH-like, MagPIe-like, and List]
30
Reduction
- MPI-like reduction: mapped 1 virtual node to each process
- 128 processes in 3 clusters

[Figure: completion time (microsecs, 1e1-1e4) vs. integers summed (1e1-1e7) for Dynamic (our impl.), MagPIe-like, and MPICH-like]
31
Addition/Removal of Processes
- Repeated 4-MB broadcasts; 160 processes in 4 clusters
- Added/removed processes while broadcasting:
  - t = 0 s: 1 virtual node per process
  - t = 60 s: remove half of the processes
  - t = 90 s: re-add the removed processes

[Figure: bandwidth (MB/sec, 0-600) vs. elapsed time (0-120 secs), with the removal and re-addition points marked]
32
Outline
1. Introduction
2. Related Work
3. Phoenix
4. Our Proposal
5. Experimental Results
6. Conclusion and Future Work
33
Conclusion
- Presented a method to perform broadcasts and reductions in WANs without manual configuration
- Experiments
  - Stable state: 1-byte broadcast 3+ times faster than MPICH, within a factor of 2 of MagPIe
  - Addition/removal of processes: effective execution resumed 8 seconds after adding/removing processes
34
Future Work
- Optimize broadcast/reduction: reduce the gap between our method and static Grid-enabled methods
- Other collective operations: all-to-all, barrier
35
Thank you!