Collective Operations for Wide-Area Message Passing Systems Using Adaptive Spanning Trees


Hideo Saito, Kenjiro Taura and Takashi Chikayama (Univ. of Tokyo)

Grid 2005 (Nov. 13, 2005)

2

Message Passing in WANs

Increase in bandwidth of wide-area networks
More opportunities to perform parallel computation using multiple clusters connected by a WAN
Demand to use message passing for parallel computation in WANs
Existing programs are written using message passing
Familiar programming model

3

Collective Operations

Communication operations in which all processors participate, e.g., broadcast and reduction
Importance (Gorlatch et al. 2004)
Easier to program with collective operations than with just point-to-point operations (i.e., send/receive)
Libraries can provide a faster implementation than user-level send/recv
Libraries can take advantage of knowledge of the underlying hardware
[Figure: a broadcast from the root processor]

4

Coll. Ops. in LANs and WANs

Collective operations for LANs
Optimized under the assumption that all links have the same latency/bandwidth
Collective operations for WANs
Wide-area links are much slower than local-area links
Collective operations need to avoid links with high latency or low bandwidth
Existing methods use static Grid-aware trees constructed using manually-supplied information

5

Problems of Existing Methods

Large-scale, long-lasting applications
More situations in which…

Different computational resources are used upon each invocation

Computational resources are automatically allocated by middleware

Processors are added/removed after application startup
Difficult to manually supply topology information
Static trees used in existing methods won’t work

6

Contribution of Our Work

A method to perform collective operations that is
Efficient in clustered wide-area systems
Doesn’t require manual configuration
Adapts to new topologies when processes are added/removed
An implementation for the Phoenix Message Passing System
An implementation for MPI (work in progress)

7

Outline

1. Introduction

2. Related Work

3. Phoenix

4. Our Proposal

5. Experimental Results

6. Conclusion and Future Work

8

MPICH

Thakur et al. 2005
Assumes that the latency and bandwidth of all links are the same
Short messages: latency-aware algorithm (binomial-tree broadcast from the root; see the sketch below)
Long messages: bandwidth-aware algorithm (scatter from the root followed by a ring all-gather)
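
As a rough illustration of the short-message case, the sketch below (not MPICH's actual code) computes the ranks a process forwards to in one common binomial-tree labeling; `binomial_children` is a hypothetical helper introduced only for this example.

```python
# A minimal sketch of a binomial-tree broadcast order: in round k, every rank
# that already holds the message forwards it to rank + 2**k (relative to the
# root), which reaches all 'size' ranks in ceil(log2(size)) rounds.
def binomial_children(rank, size, root=0):
    rel = (rank - root) % size          # rank relative to the broadcast root
    children = []
    step = 1
    while step < size:
        if rel < step and rel + step < size:
            children.append((rel + step + root) % size)
        step *= 2
    return children

print(binomial_children(0, 8))          # [1, 2, 4]
print(binomial_children(2, 8))          # [6]
```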

9

MagPIe

Kielmann et al. ’99
Separate static trees for wide-area and local-area communication
Broadcast (sketch below)
The root sends to the “coordinator node” of each cluster
Coordinator nodes perform an MPICH-like broadcast within each cluster
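
The sketch below illustrates that two-phase schedule under some simplifying assumptions: `cluster_of`/`members_of` describe the process-to-cluster mapping, each cluster's first member acts as its coordinator, and the local phase is shown as a flat fan-out rather than MagPIe's MPICH-like tree.

```python
# A minimal sketch of MagPIe-style two-level broadcast scheduling; the data
# layout and the coordinator choice are illustrative, not MagPIe's actual API.
def wide_area_targets(me, root, cluster_of, members_of):
    """Processes 'me' forwards the message to in the wide-area phase."""
    if me != root:
        return []
    return [members_of[c][0] for c in members_of if c != cluster_of[root]]

def local_area_targets(me, root, cluster_of, members_of):
    """Processes 'me' forwards to inside its own cluster (flat fan-out here;
    MagPIe uses an MPICH-like tree for this phase)."""
    c = cluster_of[me]
    is_coordinator = (me == members_of[c][0]) and c != cluster_of[root]
    if me != root and not is_coordinator:
        return []
    return [p for p in members_of[c] if p != me]

# Example: three clusters, root is process "a0" in cluster A.
members_of = {"A": ["a0", "a1"], "B": ["b0", "b1"], "C": ["c0", "c1"]}
cluster_of = {p: c for c, ps in members_of.items() for p in ps}
print(wide_area_targets("a0", "a0", cluster_of, members_of))   # ['b0', 'c0']
print(local_area_targets("b0", "a0", cluster_of, members_of))  # ['b1']
```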

10

Other Works on Coll. Ops. for WANs

Other works that rely on manually-supplied information:
Network Performance-aware Collective Communication for Clustered Wide Area Systems (Kielmann et al. 2001)

MPICH-G2 (Karonis et al. 2003)

11

Content Delivery Networks

Application-level multicast mechanisms using topology-aware overlay networks
Overcast (Jannotti et al. 2000)
SelectCast (Bozdog et al. 2003)
Banerjee et al. 2002
Designed for content delivery; don’t work for message passing
Data loss
Single source
Only 1-to-N operations

12

Outline

1. Introduction

2. Related Work

3. Phoenix

4. Our Proposal

5. Experimental Results

6. Conclusion and Future Work

13

Phoenix

Taura et al. (PPoPP 2003)
Phoenix Programming Model
Message passing model: programs are written using send/receive
Messages are addressed to virtual nodes
Strong support for addition/removal of processes during execution
Phoenix Message Passing Library
Message passing library based on the Phoenix Programming Model
Basis of our implementation

14

Addition/Removal of Processes

Virtual node namespace
Messages are addressed to virtual nodes instead of to processes
An API to “migrate” virtual nodes supports addition/removal of processes during execution (sketch below)
[Figure: one process holds virtual nodes 0-19 and another holds 20-39; when a new process joins, the second process keeps 20-29, virtual nodes 30-39 migrate to the new process, and a subsequent send(30) is delivered to it]
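
A minimal sketch of how a virtual-node namespace keeps send() valid across joins; the class and method names here are illustrative stand-ins, not the Phoenix library's API.

```python
# Each process keeps a map from virtual nodes to their current owner;
# migration only updates the map, so senders keep addressing virtual nodes.
class VirtualNodeMap:
    def __init__(self):
        self.owner = {}                      # virtual node -> process id

    def assume(self, vnodes, proc):
        """Assign (or migrate) a set of virtual nodes to a process."""
        for vn in vnodes:
            self.owner[vn] = proc

    def route(self, vn):
        """Destination process for send(vn)."""
        return self.owner[vn]

vmap = VirtualNodeMap()
vmap.assume(range(0, 20), "P0")
vmap.assume(range(20, 40), "P1")
assert vmap.route(30) == "P1"                # before the join
vmap.assume(range(30, 40), "P2")             # P2 joins; 30-39 migrate to it
assert vmap.route(30) == "P2"                # the same send(30) now reaches P2
```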

15

Broadcast in Phoenix

[Figure: in an MPI broadcast the root delivers one message to each process; in a Phoenix broadcast the message is addressed to virtual nodes 0-4, so messages for virtual nodes hosted by the same process may be delivered together, even after a virtual node migrates]

16

Reduction in Phoenix

[Figure: in an MPI reduction each process contributes one value toward the root; in a Phoenix reduction values are contributed per virtual node (0-4) and combined with the reduction operation along the way]

17

Outline

1. Introduction

2. Related Work

3. Phoenix

4. Our Proposal

5. Experimental Results

6. Conclusion and Future Work

18

Overview of Our Proposal

Create topology-aware spanning trees at run-time
Latency-aware trees (for short messages)
Bandwidth-aware trees (for long messages)

Perform broadcasts/reductions along the generated trees

Update the trees when processes are added/removed

19

Spanning Tree Creation

Create a spanning tree for each process w/ that process at the root

Each process autonomously (see the sketch below)
Measures the RTT (or bandwidth) between itself and randomly selected other processes
Searches for a suitable parent for each spanning tree
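
A minimal sketch of that per-process loop; `measure_rtt` and `try_parent_change` are illustrative stand-ins for the library's probing and tree-maintenance internals, and the probing interval is arbitrary.

```python
# A minimal sketch, not the paper's implementation: each process repeatedly
# probes a randomly chosen peer and then checks whether that peer would be a
# better parent in any of the spanning trees it participates in.
import random
import time

def maintain_trees(me, peers, trees, measure_rtt, try_parent_change,
                   probe_interval=1.0, rounds=100):
    rtt = {}                                   # measured RTTs to probed peers
    for _ in range(rounds):
        peer = random.choice([p for p in peers if p != me])
        rtt[peer] = measure_rtt(peer)          # or a bandwidth probe
        for root in trees:                     # one spanning tree per root
            # Apply the latency- or bandwidth-aware rule (next slides) to
            # decide whether 'peer' should replace the current parent.
            try_parent_change(root, candidate=peer, rtt=rtt[peer])
        time.sleep(probe_interval)
```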

20

Latency-Aware Trees

Goal
Few edges between clusters
Moderate fan-out and depth within clusters
Parent selection: process p switches to candidate cand as its parent if (sketch below)
RTT_{p,cand} < RTT_{p,parent}
dist_{cand,root} < dist_{p,root}
[Figure: dist_{p,root} denotes p's distance to the root of the tree]
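
The parent-selection test itself reduces to two comparisons; the sketch below assumes the caller already knows the relevant RTTs and each node's distance to the root.

```python
# A minimal sketch of the latency-aware parent-change rule: adopt 'cand' only
# if it is closer to us than our current parent AND closer to the root than
# we are, which keeps depth moderate and avoids attaching under a descendant.
def should_adopt_latency_parent(rtt_p_cand, rtt_p_parent,
                                dist_cand_root, dist_p_root):
    return rtt_p_cand < rtt_p_parent and dist_cand_root < dist_p_root

print(should_adopt_latency_parent(2.0, 15.0, 1.0, 3.0))   # True: nearby and shallower
print(should_adopt_latency_parent(2.0, 15.0, 4.0, 3.0))   # False: would deepen the tree
```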

21

Latency-Aware Trees

(Goal and parent-selection rule as on the previous slide)
[Figure: three LANs; after parent changes converge, only a few tree edges cross between clusters]

22

Latency-Aware Trees

(Goal and parent-selection rule as above)
[Figure: ParentChange requests re-attach processes under the within-cluster root rather than across wide-area links]

23

Latency-Aware Trees

(Goal and parent-selection rule as above)
[Figure: process p compares RTT_{p,parent} with RTT_{p,cand} and dist_{p,root} with dist_{cand,root} when deciding whether to adopt cand as its new parent]

24

Bandwidth-Aware Trees

Goal
Efficient use of bandwidth during a broadcast/reduction
Bandwidth estimation: the bandwidth p would receive if it attached under candidate cand is
est_{p,cand} = min(est_{cand}, bw_{p2p} / (n_children + 1))
where est_{cand} is the bandwidth cand itself receives, bw_{p2p} is the point-to-point bandwidth between p and cand, and n_children is cand's current number of children
Parent selection (sketch below)
est_{p,cand} > est_{p,parent}
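
A minimal sketch of the estimate and the switch rule, with illustrative argument names matching the symbols above.

```python
# est_cand:   bandwidth the candidate itself receives through the tree
# bw_p2p:     measured point-to-point bandwidth between p and the candidate
# n_children: the candidate's current number of children (p would be one more)
def estimated_bandwidth(est_cand, bw_p2p, n_children):
    return min(est_cand, bw_p2p / (n_children + 1))

def should_adopt_bandwidth_parent(est_p_cand, est_p_parent):
    return est_p_cand > est_p_parent

# Example: a candidate that receives 90 MB/s and already has 2 children,
# reachable at 100 MB/s point-to-point, offers about 33 MB/s; switch only if
# the current parent offers less.
est = estimated_bandwidth(90.0, 100.0, 2)
print(est, should_adopt_bandwidth_parent(est, 20.0))   # 33.33... True
```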

25

Broadcast

Short messages (< 128 KB)
Forward along a latency-aware tree
Long messages (> 128 KB)
Pipeline along a bandwidth-aware tree
Include in the header the set of virtual nodes to be forwarded via the receiving process (sketch below)
[Figure: root 0 sends to its children with the virtual-node sets {1,3,4} and {2,5} in the headers; they forward {3}, {4}, and {5} in turn, so virtual node 5 is still reached after it migrates to another process]
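
A minimal sketch of the header-based forwarding for one hop of the tree; `subtree_vnodes` and `send` are illustrative stand-ins for the library's internals, and the example mirrors the virtual-node sets in the figure above.

```python
def forward_broadcast(payload, vnode_set, local_vnodes, children,
                      subtree_vnodes, send):
    """Deliver to the virtual nodes hosted here, and tell each child (via the
    message header) which virtual nodes it must cover in its own subtree."""
    delivered_here = vnode_set & local_vnodes
    for child in children:
        subset = vnode_set & subtree_vnodes[child]
        if subset:
            send(child, payload, subset)       # 'subset' travels in the header
    return delivered_here

# Example mirroring the figure: the root hosts virtual node 0 and splits
# {1..5} between two children whose subtrees hold {1,3,4} and {2,5}.
sent = []
local = forward_broadcast(b"msg", set(range(6)), {0}, ["c1", "c2"],
                          {"c1": {1, 3, 4}, "c2": {2, 5}},
                          lambda child, msg, hdr: sent.append((child, hdr)))
print(local, sent)   # {0} [('c1', {1, 3, 4}), ('c2', {2, 5})]
```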

26

Reduction

Each processor
Waits for a message from all of its children
Performs the specified operation
Forwards the result to its parent
Timeout mechanism (sketch below)
To avoid waiting forever for a child that has already sent its message to another process
[Figure: a child issues a ParentChange and sends its contribution to its new parent; the old parent times out rather than waiting for it]
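
A minimal sketch of one node's part of the reduction; `recv_from` (returning None on timeout) and `send` are illustrative stand-ins for the library's point-to-point calls, and the timeout value is arbitrary.

```python
# Wait for each child's partial result, but give up after 'timeout' seconds:
# a child that changed parents mid-reduction has sent its value elsewhere,
# so its contribution reaches the root through its new parent instead.
def reduce_at_node(local_value, children, parent, op, recv_from, send,
                   timeout=1.0):
    partial = local_value
    for child in children:
        contribution = recv_from(child, timeout=timeout)
        if contribution is None:             # timed out: skip this child
            continue
        partial = op(partial, contribution)
    if parent is not None:
        send(parent, partial)                # forward the partial result up
    return partial                           # at the root, this is the answer
```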

27

Outline

1. Introduction

2. Related Work

3. Phoenix

4. Our Proposal

5. Experimental Results

6. Conclusion and Future Work

28

Broadcast (1-byte)

MPI-like broadcast
Mapped 1 virtual node to each process
201 processes in 3 clusters
Compared: our implementation, a MagPIe-like (static Grid-aware) implementation, and an MPICH-like (Grid-unaware) implementation
[Graphs: completion time (ms, 0-40) at each process (0-200) for the three implementations]

29

Broadcast (Long)

MPI-like broadcast
Mapped 1 virtual node to each process
137 processes in 4 clusters
[Graph: bandwidth (MB/sec, 0-800) vs. message size (10^4 to 10^8 bytes) for Dynamic (our implementation), MPICH-like, MagPIe-like, and List]

30

Reduction

MPI-like reduction
Mapped 1 virtual node to each process
128 processes in 3 clusters
[Graph: completion time (microseconds, 10^1 to 10^4) vs. number of integers summed (10^1 to 10^7) for Dynamic (our implementation), MagPIe-like, and MPICH-like]

31

Addition/Removal of Processes

Repeated 4-MB broadcasts
160 processes in 4 clusters
Added/removed processes while broadcasting
t = 0 s: start with 1 virtual node per process
t = 60 s: remove half of the processes
t = 90 s: re-add the removed processes
[Graph: aggregate bandwidth (MB/sec, 0-600) vs. elapsed time (0-120 s), with the removal and re-addition points marked]

32

Outline

1. Introduction

2. Related Work

3. Phoenix

4. Our Proposal

5. Experimental Results

6. Conclusion and Future Work

33

Conclusion

Presented a method to perform broadcasts and reductions in WANs w/out manual configuration

Experiments
Stable-state broadcast/reduction: 1-byte broadcast 3+ times faster than MPICH, within a factor of 2 of MagPIe
Addition/removal of processes: effective execution resumed 8 seconds after adding/removing processes

34

Future Work

Optimize broadcast/reduction
Reduce the gap between our method and static Grid-enabled methods
Other collective operations
All-to-all
Barrier

35

Thank you!