Collective Operations for Wide-Area Message Passing Systems Using Adaptive Spanning Trees


Hideo Saito, Kenjiro Taura and Takashi Chikayama (Univ. of Tokyo)

Grid 2005 (Nov. 13, 2005)

2

Message Passing in WANs

Increase in bandwidth of wide-area networks
More opportunities to perform parallel computation using multiple clusters connected by a WAN
Demand to use message passing for parallel computation in WANs
Existing programs are written using message passing
Familiar programming model

3

Collective Operations

Communication operations in which all processors participate, e.g., broadcast and reduction
Importance (Gorlatch et al. 2004)
Easier to program with collective operations than with just point-to-point operations (i.e., send/receive)
Libraries can provide a faster implementation than user-level send/recv
Libraries can take advantage of knowledge of the underlying hardware
[Figure: a broadcast from the root processor]

4

Coll. Ops. in LANs and WANs

Collective operations for LANs
Optimized under the assumption that all links have the same latency/bandwidth
Collective operations for WANs
Wide-area links are much slower than local-area links
Collective operations need to avoid links with high latency or low bandwidth
Existing methods use static Grid-aware trees constructed using manually-supplied information

5

Problems of Existing Methods

Large-scale, long-lasting applications
More situations in which…

Different computational resources are used upon each invocation

Computational resources are automatically allocated by middleware

Processors are added/removed after application startup
Difficult to manually supply topology information
Static trees used in existing methods won’t work

6

Contribution of Our Work

A method to perform collective operations that is
Efficient in clustered wide-area systems
Doesn’t require manual configuration
Adapts to new topologies when processes are added/removed
An implementation for the Phoenix Message Passing System
An implementation for MPI (work in progress)

7

Outline

1. Introduction

2. Related Work

3. Phoenix

4. Our Proposal

5. Experimental Results

6. Conclusion and Future Work

8

MPICH

Thakur et al. 2005
Assumes that the latency and bandwidth of all links are the same
Short messages: latency-aware algorithm (binomial-tree broadcast from the root; see the sketch below)
Long messages: bandwidth-aware algorithm (scatter from the root followed by a ring all-gather)
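
As a rough illustration of the short-message case, the sketch below (not MPICH's actual code) computes the ranks a process forwards to in one common binomial-tree labeling; `binomial_children` is a hypothetical helper introduced only for this example.

```python
# A minimal sketch of a binomial-tree broadcast order: in round k, every rank
# that already holds the message forwards it to rank + 2**k (relative to the
# root), which reaches all 'size' ranks in ceil(log2(size)) rounds.
def binomial_children(rank, size, root=0):
    rel = (rank - root) % size          # rank relative to the broadcast root
    children = []
    step = 1
    while step < size:
        if rel < step and rel + step < size:
            children.append((rel + step + root) % size)
        step *= 2
    return children

print(binomial_children(0, 8))          # [1, 2, 4]
print(binomial_children(2, 8))          # [6]
```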

9

MagPIe

Kielmann et al. ’99
Separate static trees for wide-area and local-area communication
Broadcast (sketch below)
The root sends to the “coordinator node” of each cluster
Coordinator nodes perform an MPICH-like broadcast within each cluster
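
The sketch below illustrates that two-phase schedule under some simplifying assumptions: `cluster_of`/`members_of` describe the process-to-cluster mapping, each cluster's first member acts as its coordinator, and the local phase is shown as a flat fan-out rather than MagPIe's MPICH-like tree.

```python
# A minimal sketch of MagPIe-style two-level broadcast scheduling; the data
# layout and the coordinator choice are illustrative, not MagPIe's actual API.
def wide_area_targets(me, root, cluster_of, members_of):
    """Processes 'me' forwards the message to in the wide-area phase."""
    if me != root:
        return []
    return [members_of[c][0] for c in members_of if c != cluster_of[root]]

def local_area_targets(me, root, cluster_of, members_of):
    """Processes 'me' forwards to inside its own cluster (flat fan-out here;
    MagPIe uses an MPICH-like tree for this phase)."""
    c = cluster_of[me]
    is_coordinator = (me == members_of[c][0]) and c != cluster_of[root]
    if me != root and not is_coordinator:
        return []
    return [p for p in members_of[c] if p != me]

# Example: three clusters, root is process "a0" in cluster A.
members_of = {"A": ["a0", "a1"], "B": ["b0", "b1"], "C": ["c0", "c1"]}
cluster_of = {p: c for c, ps in members_of.items() for p in ps}
print(wide_area_targets("a0", "a0", cluster_of, members_of))   # ['b0', 'c0']
print(local_area_targets("b0", "a0", cluster_of, members_of))  # ['b1']
```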

10

Other Works on Coll. Ops. for WANs

Other works that rely on manually-supplied information:
Network Performance-aware Collective Communication for Clustered Wide Area Systems (Kielmann et al. 2001)

MPICH-G2 (Karonis et al. 2003)

11

Content Delivery Networks

Application-level multicast mechanisms using topology-aware overlay networks
Overcast (Jannotti et al. 2000)
SelectCast (Bozdog et al. 2003)
Banerjee et al. 2002
Designed for content delivery; don’t work for message passing
Data loss
Single source
Only 1-to-N operations

12

Outline

1. Introduction

2. Related Work

3. Phoenix

4. Our Proposal

5. Experimental Results

6. Conclusion and Future Work

13

Phoenix

Taura et al. (PPoPP 2003)
Phoenix Programming Model
Message passing model: programs are written using send/receive
Messages are addressed to virtual nodes
Strong support for addition/removal of processes during execution
Phoenix Message Passing Library
Message passing library based on the Phoenix Programming Model
Basis of our implementation

14

Addition/Removal of Processes

Virtual node namespace
Messages are addressed to virtual nodes instead of to processes
An API to “migrate” virtual nodes supports addition/removal of processes during execution (sketch below)
[Figure: one process holds virtual nodes 0-19 and another holds 20-39; when a new process joins, the second process keeps 20-29, virtual nodes 30-39 migrate to the new process, and a subsequent send(30) is delivered to it]
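
A minimal sketch of how a virtual-node namespace keeps send() valid across joins; the class and method names here are illustrative stand-ins, not the Phoenix library's API.

```python
# Each process keeps a map from virtual nodes to their current owner;
# migration only updates the map, so senders keep addressing virtual nodes.
class VirtualNodeMap:
    def __init__(self):
        self.owner = {}                      # virtual node -> process id

    def assume(self, vnodes, proc):
        """Assign (or migrate) a set of virtual nodes to a process."""
        for vn in vnodes:
            self.owner[vn] = proc

    def route(self, vn):
        """Destination process for send(vn)."""
        return self.owner[vn]

vmap = VirtualNodeMap()
vmap.assume(range(0, 20), "P0")
vmap.assume(range(20, 40), "P1")
assert vmap.route(30) == "P1"                # before the join
vmap.assume(range(30, 40), "P2")             # P2 joins; 30-39 migrate to it
assert vmap.route(30) == "P2"                # the same send(30) now reaches P2
```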

15

Broadcast in Phoenix

[Figure: in an MPI broadcast the root delivers one message to each process; in a Phoenix broadcast the message is addressed to virtual nodes 0-4, so messages for virtual nodes hosted by the same process may be delivered together, even after a virtual node migrates]

16

Reduction in Phoenix

[Figure: in an MPI reduction each process contributes one value toward the root; in a Phoenix reduction values are contributed per virtual node (0-4) and combined with the reduction operation along the way]

17

Outline

1. Introduction

2. Related Work

3. Phoenix

4. Our Proposal

5. Experimental Results

6. Conclusion and Future Work

18

Overview of Our Proposal

Create topology-aware spanning trees at run-time
Latency-aware trees (for short messages)
Bandwidth-aware trees (for long messages)

Perform broadcasts/reductions along the generated trees

Update the trees when processes are added/removed

19

Spanning Tree Creation

Create a spanning tree for each process w/ that process at the root

Each process autonomously (see the sketch below)
Measures the RTT (or bandwidth) between itself and randomly selected other processes
Searches for a suitable parent for each spanning tree
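
A minimal sketch of that per-process loop; `measure_rtt` and `try_parent_change` are illustrative stand-ins for the library's probing and tree-maintenance internals, and the probing interval is arbitrary.

```python
# A minimal sketch, not the paper's implementation: each process repeatedly
# probes a randomly chosen peer and then checks whether that peer would be a
# better parent in any of the spanning trees it participates in.
import random
import time

def maintain_trees(me, peers, trees, measure_rtt, try_parent_change,
                   probe_interval=1.0, rounds=100):
    rtt = {}                                   # measured RTTs to probed peers
    for _ in range(rounds):
        peer = random.choice([p for p in peers if p != me])
        rtt[peer] = measure_rtt(peer)          # or a bandwidth probe
        for root in trees:                     # one spanning tree per root
            # Apply the latency- or bandwidth-aware rule (next slides) to
            # decide whether 'peer' should replace the current parent.
            try_parent_change(root, candidate=peer, rtt=rtt[peer])
        time.sleep(probe_interval)
```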

20

Latency-Aware Trees

Goal
Few edges between clusters
Moderate fan-out and depth within clusters
Parent selection: process p switches to candidate cand as its parent if (sketch below)
RTT_{p,cand} < RTT_{p,parent}
dist_{cand,root} < dist_{p,root}
[Figure: dist_{p,root} denotes p's distance to the root of the tree]
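
The parent-selection test itself reduces to two comparisons; the sketch below assumes the caller already knows the relevant RTTs and each node's distance to the root.

```python
# A minimal sketch of the latency-aware parent-change rule: adopt 'cand' only
# if it is closer to us than our current parent AND closer to the root than
# we are, which keeps depth moderate and avoids attaching under a descendant.
def should_adopt_latency_parent(rtt_p_cand, rtt_p_parent,
                                dist_cand_root, dist_p_root):
    return rtt_p_cand < rtt_p_parent and dist_cand_root < dist_p_root

print(should_adopt_latency_parent(2.0, 15.0, 1.0, 3.0))   # True: nearby and shallower
print(should_adopt_latency_parent(2.0, 15.0, 4.0, 3.0))   # False: would deepen the tree
```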

21

Latency-Aware Trees

(Goal and parent-selection rule as on the previous slide)
[Figure: three LANs; after parent changes converge, only a few tree edges cross between clusters]

22

Latency-Aware Trees

(Goal and parent-selection rule as above)
[Figure: ParentChange requests re-attach processes under the within-cluster root rather than across wide-area links]

23

Latency-Aware Trees

(Goal and parent-selection rule as above)
[Figure: process p compares RTT_{p,parent} with RTT_{p,cand} and dist_{p,root} with dist_{cand,root} when deciding whether to adopt cand as its new parent]

24

Bandwidth-Aware Trees

Goal
Efficient use of bandwidth during a broadcast/reduction
Bandwidth estimation: the bandwidth p would receive if it attached under candidate cand is
est_{p,cand} = min(est_{cand}, bw_{p2p} / (n_children + 1))
where est_{cand} is the bandwidth cand itself receives, bw_{p2p} is the point-to-point bandwidth between p and cand, and n_children is cand's current number of children
Parent selection (sketch below)
est_{p,cand} > est_{p,parent}
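
A minimal sketch of the estimate and the switch rule, with illustrative argument names matching the symbols above.

```python
# est_cand:   bandwidth the candidate itself receives through the tree
# bw_p2p:     measured point-to-point bandwidth between p and the candidate
# n_children: the candidate's current number of children (p would be one more)
def estimated_bandwidth(est_cand, bw_p2p, n_children):
    return min(est_cand, bw_p2p / (n_children + 1))

def should_adopt_bandwidth_parent(est_p_cand, est_p_parent):
    return est_p_cand > est_p_parent

# Example: a candidate that receives 90 MB/s and already has 2 children,
# reachable at 100 MB/s point-to-point, offers about 33 MB/s; switch only if
# the current parent offers less.
est = estimated_bandwidth(90.0, 100.0, 2)
print(est, should_adopt_bandwidth_parent(est, 20.0))   # 33.33... True
```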

25

Broadcast

Short messages (< 128 KB)
Forward along a latency-aware tree
Long messages (> 128 KB)
Pipeline along a bandwidth-aware tree
Include in the header the set of virtual nodes to be forwarded via the receiving process (sketch below)
[Figure: root 0 sends to its children with the virtual-node sets {1,3,4} and {2,5} in the headers; they forward {3}, {4}, and {5} in turn, so virtual node 5 is still reached after it migrates to another process]
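
A minimal sketch of the header-based forwarding for one hop of the tree; `subtree_vnodes` and `send` are illustrative stand-ins for the library's internals, and the example mirrors the virtual-node sets in the figure above.

```python
def forward_broadcast(payload, vnode_set, local_vnodes, children,
                      subtree_vnodes, send):
    """Deliver to the virtual nodes hosted here, and tell each child (via the
    message header) which virtual nodes it must cover in its own subtree."""
    delivered_here = vnode_set & local_vnodes
    for child in children:
        subset = vnode_set & subtree_vnodes[child]
        if subset:
            send(child, payload, subset)       # 'subset' travels in the header
    return delivered_here

# Example mirroring the figure: the root hosts virtual node 0 and splits
# {1..5} between two children whose subtrees hold {1,3,4} and {2,5}.
sent = []
local = forward_broadcast(b"msg", set(range(6)), {0}, ["c1", "c2"],
                          {"c1": {1, 3, 4}, "c2": {2, 5}},
                          lambda child, msg, hdr: sent.append((child, hdr)))
print(local, sent)   # {0} [('c1', {1, 3, 4}), ('c2', {2, 5})]
```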

26

Reduction

Each processor
Waits for a message from all of its children
Performs the specified operation
Forwards the result to its parent
Timeout mechanism (sketch below)
To avoid waiting forever for a child that has already sent its message to another process
[Figure: a child issues a ParentChange and sends its contribution to its new parent; the old parent times out rather than waiting for it]
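
A minimal sketch of one node's part of the reduction; `recv_from` (returning None on timeout) and `send` are illustrative stand-ins for the library's point-to-point calls, and the timeout value is arbitrary.

```python
# Wait for each child's partial result, but give up after 'timeout' seconds:
# a child that changed parents mid-reduction has sent its value elsewhere,
# so its contribution reaches the root through its new parent instead.
def reduce_at_node(local_value, children, parent, op, recv_from, send,
                   timeout=1.0):
    partial = local_value
    for child in children:
        contribution = recv_from(child, timeout=timeout)
        if contribution is None:             # timed out: skip this child
            continue
        partial = op(partial, contribution)
    if parent is not None:
        send(parent, partial)                # forward the partial result up
    return partial                           # at the root, this is the answer
```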

27

Outline

1. Introduction

2. Related Work

3. Phoenix

4. Our Proposal

5. Experimental Results

6. Conclusion and Future Work

28

Broadcast (1-byte)

MPI-like broadcast
Mapped 1 virtual node to each process
201 processes in 3 clusters
Compared: our implementation, a MagPIe-like (static Grid-aware) implementation, and an MPICH-like (Grid-unaware) implementation
[Graphs: completion time (ms, 0-40) at each process (0-200) for the three implementations]

29

Broadcast (Long)

MPI-like broadcast
Mapped 1 virtual node to each process
137 processes in 4 clusters
[Graph: bandwidth (MB/sec, 0-800) vs. message size (10^4 to 10^8 bytes) for Dynamic (our implementation), MPICH-like, MagPIe-like, and List]

30

Reduction

MPI-like reduction
Mapped 1 virtual node to each process
128 processes in 3 clusters
[Graph: completion time (microseconds, 10^1 to 10^4) vs. number of integers summed (10^1 to 10^7) for Dynamic (our implementation), MagPIe-like, and MPICH-like]

31

Addition/Removal of Processes

Repeated 4-MB broadcasts
160 processes in 4 clusters
Added/removed processes while broadcasting
t = 0 s: start with 1 virtual node per process
t = 60 s: remove half of the processes
t = 90 s: re-add the removed processes
[Graph: aggregate bandwidth (MB/sec, 0-600) vs. elapsed time (0-120 s), with the removal and re-addition points marked]

32

Outline

1. Introduction

2. Related Work

3. Phoenix

4. Our Proposal

5. Experimental Results

6. Conclusion and Future Work

33

Conclusion

Presented a method to perform broadcasts and reductions in WANs w/out manual configuration

Experiments
Stable-state broadcast/reduction: 1-byte broadcast 3+ times faster than MPICH, within a factor of 2 of MagPIe
Addition/removal of processes: effective execution resumed 8 seconds after adding/removing processes

34

Future Work

Optimize broadcast/reduction
Reduce the gap between our method and static Grid-enabled methods
Other collective operations
All-to-all
Barrier

35

Thank you!