
h-RELATION PERSONALIZED

COMMUNICATION STRATEGY FOR

HP-ADAPTIVE COMPUTATIONS

M. Paszyński(1),(2), K. Milfeld(3)

(1) Institute for Computational Engineering and Sciences (ICES)

The University of Texas at Austin

e-mail: [email protected]

(2) AGH University of Science and Technology,
Department of Computer Methods in Metallurgy, Cracow, Poland

(3) Texas Advanced Computing Center (TACC)

Abstract

In an h-relation personalized communication, each processor can have a variable number of data sets of non-fixed size that must be communicated with a subset of other processors, such that each processor is the origin or destination of at most h messages. For the case where nothing can be assumed about the regularity of the communication pattern, the h-relation personalized communication strategy problem has been solved using two transpose primitives.

In HP-adaptive finite element methods, the domain is frequently repartitioned between processors representing adjacent sub-domains in a geometrical sense. We show that all the communications in this case can be described as h-relation personalized communications, with h equal to the maximum number of neighboring sub-domains. We also demonstrate that the communications problem can be solved by scheduling over bipartite graphs.

Specifically, we show that when the communication pattern arises from the partition of a 2D geometric area, the planar graph representation of the computational domain can be partitioned into not more than two bipartite graphs and a third graph with maximum vertex valency 2, by means of a simple algorithm. In the general case (e.g., for computations over partitioned 3D geometric areas) our simple algorithm finds h − 1 or fewer bipartite graphs. After the graphs are constructed, the task of message scheduling can be reduced to a set of independent scheduling problems over the bipartite graphs.

The algorithm is supported by a theoretical discussion on its correctness and efficiency, together with performance measurements. We also present measurements for the concurrent point-to-point communication used.

Keywords: Parallel Algorithms, Routing h-Relation Personalized Communication, Concurrent Point-to-Point Communication, HP-Adaptive Finite Element Method.


Acknowledgment

The computations reported in this work were done through the National Science Foundation’s

National Partnership for Advanced Computational Infrastructure. The authors are greatly indebted

to Anna Paszynska, Leszek Demkowicz, Robert A. van de Geijn, and Greg Plaxton for numerous

discussions on the subject. We are grateful to the Texas Advanced Computing Center and the

University of Texas at Austin for the use of their resources.

1 Introduction

Main results of the paper. In the h-relation personalized communication problem, each pro-

cessor has possibly different amounts of data to share with some subset of the other processors,

such that each processor is the origin or destination of at most h messages [2]. In the general case,

where nothing can be assumed about the regularity of the communication pattern, the h-relation

personalized communication strategy problem can be solved by using two transpose primitives

[2] (all-to-all communications where each processor has to send a unique block of data to every

processor, and all the blocks are of the same size).

The paper presents an alternative strategy for h-relation personalized communications (h-rpc),

by employing bipartite graphs derived from the communication graph. Through this reformulation,

the task of message scheduling is reduced to a set of independent problems of scheduling over the

bipartite graphs.

We present a simple algorithm for partitioning a planar graph into two bipartite graphs

and another graph with maximum vertex valency 2. In the general case, when the communication

pattern is a non-planar graph, our algorithm partitions the graph into at most h − 1 bipartite graphs.

We also analyze the communication patterns that appear in parallel H- and HP -adaptive finite

element method applications [28]. We show that an efficient implementation of message passing

can be represented as an h-rpc strategy with h equal to the maximum number of neighboring

sub-domains over the graph representation of the partitioned computational domain.

We include a theoretical discussion of the algorithm’s correctness, an analysis of its complexities,

and performance measurements of our proposed communication strategy.

Routing h-relation. The h-rpc problem can be solved using a simple fixed-pattern algorithm

[35]. However, if there is a large variation in the message size, the algorithm becomes inefficient.

A scalable two-phase algorithm for the h-rpc was proposed by Bader, Helman, and JaJa in [2] to

mitigate this inefficiency. The authors also noted that the problem can be solved by using two

transpose primitives.

Some basic work with MPI primitive implementations was done by Johnsson [27], [26], [19],

[20]. Efficient implementation of the MPI primitives is described in papers of Bader [1], Sundar


et al. [33]; and the approach presented in [2] is claimed to be the optimal strategy for routing h-relations.

The authors of [2] showed that the overall complexity of the algorithm is O(τ + h + (h + n/p + p²)σ),
where τ is the message initialization time, n is the total amount of data, assumed to be distributed in
n/p portions over the p processors, and σ is the time per byte at which a processor can send or receive
data through the network (1/σ is the bandwidth of the network). In their paper, Goldman and

Trystram [15] noticed that when all the communication patterns are known by every processor, a

global knowledge algorithm can be used. By finding an optimal solution through a permutation on

the communication matrix, their solution becomes potentially better; however, they used a greedy

algorithm whose results can be far from the optimal schedule.

In our approach, we use a global knowledge communication strategy based on scheduling over
bipartite graphs, with an overall complexity O(τh + (n/p)σ), which scales favorably when the
processor count is increased. Also, in H- and HP-adaptive finite element applications, the maximum
number h of neighboring sub-domains is usually constrained to a small value by domain decomposition
algorithms such as Zoltan [9]. This implies that our algorithm will scale better than other
proposed h-rpc algorithms.

Partitions of the planar graph. A bipartite graph is a graph whose set of vertices can be

decomposed into two disjoint sets such that no two graph vertices within the same set are adjacent.

A forest is an acyclic graph (set of possibly disconnected trees). Every forest is a bipartite graph,

but not every bipartite graph is a forest. A vertex valency d(x) is the number of edges incident to

it. The maximum vertex valency in a graph G is denoted by ∆(G).
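For experimentation with these definitions, a minimal Python sketch (assuming an adjacency-dictionary representation of the graph; the helper names are ours, not the paper's) tests bipartiteness by 2-coloring the vertices and computes the maximum vertex valency ∆(G):

    from collections import deque

    def is_bipartite(adj):
        """2-colour the vertices by BFS; succeed iff no edge joins two vertices of one colour."""
        colour = {}
        for s in adj:
            if s in colour:
                continue
            colour[s] = 0
            queue = deque([s])
            while queue:
                u = queue.popleft()
                for v in adj[u]:
                    if v not in colour:
                        colour[v] = 1 - colour[u]
                        queue.append(v)
                    elif colour[v] == colour[u]:
                        return False          # same-colour neighbours: an odd cycle exists
        return True

    def max_valency(adj):
        """Maximum vertex valency, denoted Delta(G) in the text."""
        return max(len(neighbours) for neighbours in adj.values())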

The problem of partitioning planar graphs has been extensively investigated in graph theory,

using forests. An effective algorithm for finding 3 forest partitions of every planar graph is presented

in Chen et al. [37]. However, there are no theoretical results on the upper bound for the valency

of forests.

The most recent theoretical results are presented in He, Hou et al. [36], and can be summarized

as follows. Every planar graph G has an edge-partition into two forests T1 and T2 and another graph

H with ∆(H) ≤ 8. An exact evaluation of the bound on ∆(H) is still an open problem, according to He, Hou et

al. [36]. There are examples of planar graphs whose graphs H have the property ∆(H) = 2, which

implies that 2 ≤ ∆(H) ≤ 8. The theoretical results presented in He, Hou et al. [36] do not imply

any simple graph partition algorithm, since their proofs are based on the mathematical induction

on the number of vertices and edges in G.

We present a simple linear time algorithm for partitioning a planar graph into not more than

three bipartite graphs {Bi}i=1,2,3 with ∆(B1) ≤ h, ∆(B2) ≤ h−1 and ∆(B3) ≤ 2. We use bipartite

graphs that may not be forests, because we do not need to map partitions into a set of forests; we

only need partitions consisting of bipartite graphs.


Scheduling over bipartite graphs. The usefulness of edge coloring for scheduling data transfers

in large parallel architectures has been demonstrated by several authors, and a comprehensive

survey of applications in various fields has been compiled by Jain et al. [18]. The application

of simple edge-coloring algorithms to scheduling problems and evaluation of their experimental

performance are presented by Marathe et al. [25], [30]. In general, when nothing can be assumed

about the regularity of a graph, the problem of edge coloring is NP-complete (see [3], [5]). However,

the problem of edge coloring of bipartite graphs can be solved by a polynomial-time algorithm.

The fastest known bipartite graph coloring algorithm was developed by Rizzi et al. [31], [21]. The

computational complexity of the algorithm is O(n logn log v + e log v) where n is the number of

nodes, e is the number of edges, and v is the maximum vertex degree in the graph.

Managing the communication problem in HP -adaptive applications. The problem of

managing the communication in parallel H- and HP -adaptive finite element simulations was inves-

tigated by several researchers. Crutchfield et al. [6] developed an object oriented library to support

adaptive mesh refinement for hyperbolic gas dynamic simulations running on vector supercomput-

ers. Simple geometric abstractions for expressing complicated communication patterns in dynamics

simulations were developed within the LPARX system [22] and used by the KeLP programming

system [13].

Patra et al. [23] enumerated data according to a Space Filling Curve (SFC) algorithm to simplify

communication patterns. In particular, load balancing was achieved in this case by periodically

adjusting the key space and migrating node- and element-objects to neighboring sub-domains.

Sandia National Laboratories has developed a finite element infrastructure package called SIERRA

([10], [12], [11], [34]) that supports parallel architectures. The SIERRA package uses the Interpro-

cessor Collective Communication Library (iCC) [17]. The collective communications in the library

are based on gather operations followed by broadcast within a subset of processors.

In H- and HP -adaptive applications, the communication pattern changes throughout the cal-

culation after each adaptation. It is impossible to predict the pattern at compile time. The concept

of creating an inspector and an executor part of the computational loop was introduced by Saltz et

al. [7]. The inspector produces sets of schedules and the executor gathers data using the schedules

and performs the calculations on them. Both inspector and executor are implemented using a set

of primitives within the PARTI library [7].

The CHAOS system subsumes the previous PARTI library and supports a larger class of ap-

plications, as described by Hwang et al. [16]. CHAOS can also be used by compilers to parallelize

irregular applications automatically. When data access patterns are known only at runtime, com-

piler support for irregular loops can create pre-processed code for loops that generates appropriate

communication calls at runtime and places off-processor data in a pre-determined order [16], [8].

Such compiler techniques transform the irregular loops into inspector / executor parts, and embed

CHAOS remap procedures that redistribute the irregularly accessed data.


The actual communication is generated inside the inspector part by a Communication Pattern

Generator (CPG) from the communication graph, as described by Saltz et al. [8]. The commu-

nication pattern is optimized in the following ways: (1) simple software caching and incremental

scheduling eliminate duplicate off-processor data retrieval using a hash table, (2) communication

coalescing packs the maximum possible amount of data into a single message, (3) schedule merging

uses the same schedule for gather and scatter operations. In the CHAOS library there are no

global knowledge optimizers that use the geometrical properties of the planar graph to represent

communication patterns.

Our approach differs from the methods used in the above applications in the following ways:

• All essential communication appearing in HP -adaptive computations is reduced to an h-rpc.

• The h-relation algorithm does not rely on expensive, collective (global) all-to-all primitives,

which implies better scalability with increasing numbers of processors.

• A global knowledge optimization is applied (from a global understanding of the communica-

tion pattern at each processor).

• The communication optimizer synchronizes all local communication patterns by applying

graph partitions and edge-coloring over bipartite graphs.

2 Communication problems in HP -adaptive applications

This section describes the parallel, fully automatic HP -adaptive finite element method (FEM)

package, as detailed in Demkowicz et al. [28]. We show that all communication problems that

arise in H- and HP -adaptive FEM computations can be described as h-rpc, with h equal to the

maximum number of neighboring sub-domains.

Parallel H- and HP -Adaptive Finite Element Methods. In the H version of the FEM, ele-

ments may have a locally varying element size, H. In the more flexible HP version, the polynomial

order of approximation P is allowed to vary. High-order elements are usually necessary to minimize

the dispersion error, while small elements are necessary to capture real world geometrical details.

Only HP elements can economically combine small elements of lower order with large elements of

higher order in the same mesh.

A parallel version of the fully automatic HP -adaptive 2D FEM was developed at ICES [28].

Given an initial mesh, the code automatically produces a sequence of optimal meshes delivering

exponential convergence rates, using a general HP -refinement strategy based on the minimization

of a projection-based interpolation error. The parallelized components in each stage of the HP -

adaptivity algorithm are: the decomposition of the computational domain, load balancing and


data redistribution, a parallel frontal solver, and algorithms for parallel mesh refinement and mesh

reconciliation.

Flowchart for parallel execution. A flow chart of the iterations for the HP -adaptive method

is presented in Fig. 1. The input data are the initial mesh, and the equation defined over the

computational domain. The initial mesh is the coarse mesh for the first iteration. Load balancing

using the Zoltan library [9] is performed at the beginning of each iteration step, on the level of

initial mesh elements. The coarse mesh is globally refined in both element size H (for 2D each

rectangular element is divided into 4) and order of approximation P (raised uniformly by one) to

obtain a fine mesh, as illustrated in Fig.2. The problem is solved twice, once on the current coarse

mesh and once on a fine mesh. The automatic HP strategy requires estimation of the global error

obtained by comparison of the coarse grid and the fine grid solutions. The optimal mesh (in Fig.2)

in the current iteration is obtained on the basis of error estimations over each finite element. The

refinement trees grow vertically within the initial mesh elements. The optimal mesh obtained in

the current step constitutes the coarse mesh for the next step in the algorithm. The iteration loop is

terminated when all element errors reach a (small) limit.

Communication tasks.

1. Data migration after load balancing.

A domain decomposition approach is the most natural way to parallelize a FEM application.

In this approach the computational domain is divided (partitioned) into sub-domains, and

each sub-domain is assigned to a processor.

In H- or HP -adaptive methods, the finite element mesh changes from one iteration to the

next, and the domain must be repartitioned to balance the load uniformly over all processors.

This is usually done by a load-balancing library like Zoltan [9]. The library uses informa-

tion about computational load estimation over each initial mesh element together with its

refinement tree, and it returns a repartition of the initial mesh elements. To balance the

calculations, some initial mesh elements, together with their refinement trees, must be sent

to other processors; likewise, elements must be received from some other processors. The

communication pattern for creating the new partition of the mesh is mainly between pro-

cessors representing adjacent sub-domains in the current partition of the mesh. Hence, the

migration of the initial mesh elements (together with refinement trees) is usually to adjacent

sub-domains.

Moreover, when an SFC load balancing algorithm is used, the load balance can be achieved

by periodically migrating node and element data to just neighboring processors, as was shown

by Patra et al. [23].

2. Parallel solver.

Both the coarse mesh and fine mesh problems are solved using a parallel frontal solver [28].


The frontal solver is a direct solver, an extension of the Gaussian elimination algorithm, where

assembly and elimination are performed together on the so-called frontal sub-matrix of the

global matrix. The use of direct solvers in H- and HP -adaptive applications is motivated by

the high accuracy required by the error-estimation routines. In the domain-decomposition ap-

proach the domain is partitioned into sub-domains, each sub-domain is assigned to a separate

processor, and all degrees of freedom (dof) within the computational mesh are enumerated in

the following way:

• All internal dof from 1-st sub-domain

• All internal dof from 2-nd sub-domain

• ...

• All internal dof from last sub-domain

• All dof from the global interface

By the interface, we mean all common edges of neighboring finite elements located in adjacent

sub-domains. The following steps are performed:

(a) a forward elimination is executed over each sub-domain in parallel

(b) a wire-frame problem involving all dof from the interface is formulated and solved

(c) a solution of the wire-frame problem is broadcast

(d) a backward substitution is executed over each sub-domain in parallel.

The greatest communication cost is incurred in the formulation of the interface problem

and the broadcast of its solution. We will show how the data exchanges can be reduced to

communications between neighboring sub-domains only.

Parallel direct solvers, found in libraries such as PLAPACK [29], are used to solve the interface

problem in an optimal way. The interface problem matrix is assumed to be distributed by

rows, where each processor stores some consecutive rows of the matrix. The rows are assigned

to particular processors by following a simple algorithm, implemented in [28]:

(a) Each interface degree of freedom may be kept by many processors, since the interface

dof are shared between neighboring sub-domains. The list of processors assigned to the

dof is obtained from the Zoltan library after the load balancing step.

(b) Each interface degree of freedom is assigned to the processor with the smallest number in the list.

It is assumed that the row of the interface problem matrix associated with the dof will

be stored on that processor.

An example of the interface node assignment is presented in Fig. 3.
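As a concrete illustration of rule (b), here is a minimal Python sketch (our own illustration, not the implementation of [28]; the input format is an assumption) that assigns each interface degree of freedom, given the list of processors sharing it, to the processor with the smallest number:

    def assign_interface_rows(sharing_processors):
        """sharing_processors: dict mapping an interface dof id to the list of processor
        ranks that share it (as reported after the load-balancing step).
        Returns: dof id -> rank of the processor that will store the corresponding row
        of the interface problem matrix (rule (b): the smallest rank in the list)."""
        return {dof: min(procs) for dof, procs in sharing_processors.items()}

    # Example: dof 17 is shared by sub-domains owned by processors 2 and 5 -> stored on processor 2.
    owners = assign_interface_rows({17: [5, 2], 18: [2, 3, 4], 19: [7]})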


The forward elimination stage of the parallel frontal solver provides the partially assembled

rows of the interface dof [28]. The interface dof rows must be fully assembled by sending data

to the processor associated with the row.

It is clear that the communication is only between neighboring sub-domains, since the interface

dof are shared only between neighboring sub-domains. The communication pattern graph can

be assumed to be planar, if the adjacency is considered only through non-zero length edges

and not through vertices. In the planar graph case, the communication pattern has much

better properties, as shown in the following sections. Unfortunately the adjacency through

vertices must be taken into account during assembly of the interface problem matrix. However,

it has been shown that the load balancing can be optimally performed on the level of the

initial mesh elements only [28]. For the case of regular rectangular initial mesh elements

with curvilinear edges only on the boundary of the domain, the diagonal adjacency through

the vertex can be assumed to occur with only one sub-domain. In other words, there are

always no more than four sub-domains sharing a common vertex in 2D, and eight in 3D.

The amount of data to be exchanged between neighboring sub-domains is proportional to

the size of the edge shared by neighboring sub-domains. Thus, it is reasonable to split the

communication pattern into two planar graphs, one containing all adjacency through non-zero

length edges, and the other one containing all adjacency through vertices. Hence, we need

to schedule two communication patterns over two planar graphs. For the case of non-regular

initial mesh elements, the resulting graph may be non-planar, and the number of bipartite

graphs is bounded by h− 1.

3. Parallel mesh refinement and mesh reconciliation.

In the parallel mesh refinement step some finite elements are divided into multiple (new)

elements (each rectangular element is divided into 2 or 4 sons in 2D), new nodes are created,

and new orders of approximation are set (see [28] for more details). The parallel version of the

algorithm requires a global mesh-reconciliation step after each refinement has been performed

over each sub-domain. The mesh must be globally consistent, which involves

• The 1-irregularity rule: Edges of a given element can be broken only once, without break-

ing neighboring elements. This implies that refinement tree data describing newly created

nodes on the interface edges must be exchanged between neighboring sub-domains.

• When an interface edge is broken, then the polynomial order of newly created nodes

must be established. This is done by comparing coarse- and fine-grid solutions over

finite elements neighboring the edge. But some of the elements neighboring the interface

edge are present in adjacent sub-domains. We need to ask the neighboring sub-domain

for the order of new nodes.

The parallel mesh refinement and reconciliation algorithm has 3 stages:


(a) determination of the optimal refinements (essentially a parallel task)

(b) exchange of information about edge refinement trees and polynomial order along the

interface

(c) mesh reconciliation

Stages (b) and (c) are repeated if some of the interface edges are modified during the last

mesh reconciliation.

The communication problem in this case is just to exchange some data between neighboring

sub-domains. It should be noted that for the 2D case, data need only be exchanged between

sub-domains adjacent through nonzero length edges, and not through finite element vertices.

In the 3D case, data are only exchanged between sub-domains adjacent through non-empty-

area finite element faces and not through finite element edges and vertices, since the state of

edges is determined by the state of faces, and the state of vertices is determined by the state

of edges.

The amount of data to be exchanged between neighboring sub-domains is proportional to the

density of refinements over edges shared by adjacent sub-domains.

As noted, the communication strategy of 3D mesh refinement and reconciliation is more com-

plicated. Also, the h-rpc is solved over a non-planar graph in 3D, which implies a larger number of

bipartite graphs.

3 Graph partition algorithm

Graph representation of the computational domain. Let us consider a computational do-

main divided into sub-domains. A graph representation of the domain can be created,

G = (V,E) (3.1)

where graph vertices V denote sub-domains, and edges E = {{u, v} : u, v ∈ V } denote their

adjacency. A graph G is assumed to be undirected, since the sub-domains adjacency relation is

symmetric.

Partition of the graph. The layer-connection algorithm can be applied to the graph represen-

tation of the domain. The goal of this algorithm is to partition the graph into a set of subgraphs

called layers and connections (Figs. 4 and 5). Any vertex representing the sub-domain located on

the boundary of the domain is selected and assigned to the first layer. Then all neighbors of the

previously selected vertex are assigned to the second layer. The procedure is repeated by assigning

the sequence of neighboring vertices to consecutive layers until all vertices have been assigned to

some layer.


More formally, we select one vertex u0 in the graph and create an attributing function (3.2) that

assigns a layer index to each graph vertex, such that within a minimal length path connecting the

vertex with the initial vertex u0, all vertices belong to a different layer.

u0 ∈ V,  f : V → {1, ..., NL} :  ∀ v ∈ V  ∃ <v1, ..., vk> :  v1 = u0, vk = v, f(vi) = i,
k = min { l : <w1, ..., wl> is a path with w1 = u0, wl = v }    (3.2)

The path <v1, ..., vk> is defined here as a sequence of graph vertices v1, ..., vk ∈ V such that each

vertex is connected with its predecessor and successor, {vi−1, vi} ∈ E for all i = 2, ..., k. The path length k

is assumed to be minimal (it is the shortest path connecting vertices u0 and v). Such a path exists

for each vertex since the domain is assumed to be compact. NL denotes the resulting number of

layers.
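In code, the attributing function f of (3.2) is simply a breadth-first search from u0. A minimal Python sketch, assuming the graph G is stored as an adjacency dictionary (the helper name assign_layers is ours):

    from collections import deque

    def assign_layers(adj, u0):
        """Attributing function f of (3.2): f(v) is the number of vertices on a shortest
        path from the chosen boundary vertex u0 to v, i.e. 1 + the BFS distance."""
        f = {u0: 1}
        queue = deque([u0])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in f:              # not yet assigned to a layer
                    f[v] = f[u] + 1
                    queue.append(v)
        return f                            # the number of layers NL is max(f.values())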

Layers and connections. The i-th layer is defined as a subgraph containing all vertices attributed

(by the f attributing function) with value i, and all edges between these vertices present in the

original graph G.

L_i = (V^L_i, E^L_i) ⊂ G :   V^L_i = {u ∈ V : f(u) = i},   E^L_i = {{u, v} ∈ E : f(u) = f(v) = i},   i = 1, ..., NL    (3.3)

The i-th connection is defined as a subgraph containing all vertices attributed (by the f at-

tributing function) with i and i + 1 values, and all edges between vertices assigned to layer i and

i+ 1.

I_i = (V^I_i, E^I_i) ⊂ G :   V^I_i = {u ∈ V : f(u) ∈ {i, i+1}},   E^I_i = {{u, v} ∈ E : f(u) = i, f(v) = i+1},   i = 1, ..., NL − 1    (3.4)

Graph partition algorithm. A bipartite graph is a graph in which a set of vertices can be de-

composed into two disjoint sets such that no two graph vertices within the same set are adjacent.

A valency of a vertex is the number of edges incident to it. We denote by ∆(G) = h the maximum

vertex valency in the graph G representing the partitioned computational domain (i.e.,the maxi-

mum number of neighboring sub-domains). Here we describe the graph-partitioning algorithm for

partitioning the graph G into a set of bipartite subgraphs with disjoint edges. The input for the

algorithm is the graph G, and the output is the set of bipartite graphs {Bi}i=1,...,NB .


As shown in the algorithm description below, in a general case when the graph G is not planar,

the total number of bipartite graphs NB is no greater than h− 1, with

∆(BK) ≤ h−K + 1,K = 1, ..., h− 2; ∆(Bh−1) ≤ 2 (3.5)

Also, for a planar graph G representing a partitioned 2D area, the resulting number of bipartite

graphs is not greater than 3, with

∆(B1) ≤ h; ∆(B2) ≤ h− 1; ∆(B3) ≤ 2 (3.6)

An example of a planar graph that partitions into 2 bipartite graphs is illustrated in Fig. 4. A

planar graph with more complex partitioning, requiring sub-layering, is illustrated in Fig. 5.

Through the layer-connection algorithm, the task of message scheduling over the entire graph

G is reduced to scheduling over independent bipartite graphs, which are simple and can be solved

in polynomial time. The algorithm steps are:

• Step (1): Execute the layer-connection algorithm over the entire graph G. The algorithm results
in a set of connection subgraphs {Ii}, i = 1, ..., LN − 1, and disjoint layer subgraphs {Li}, i = 1, ..., LN.

• Step (2): Create the first bipartite graph as the union of all connection subgraphs,
B1 = ⋃_{k=1,...,LN−1} Ik.

• Step (3): Let iteration index K = 0.

• Step (4): If the maximum vertex valency in the layer subgraphs {Li}, i = 1, ..., LN, is less than or
equal to 2, Then Go To Step (9).

• Step (5): Let iteration index K = K + 1.

• Step (6): Execute recursively the layer-connection algorithm over each layer subgraph
Li, i = 1, ..., LN. The algorithm results in a set of K-th sub-connection subgraphs {I^{m,K}_i},
m = 1, ..., LN^K_i − 1, i = 1, ..., LN, and disjoint K-th sub-layer subgraphs {L^{m,K}_i},
m = 1, ..., LN^K_i, i = 1, ..., LN. Here LN^K_i is the number of K-th sub-layers resulting from
execution of the algorithm over the layer Li.

• Step (7): The (K+1)-th bipartite graph is defined as the union of all K-th sub-connection subgraphs,
BK+1 = ⋃_{i=1,...,LN} ⋃_{m=1,...,LN^K_i − 1} I^{m,K}_i, resulting from the recursive execution of the
layer-connection algorithm.

• Step (8): Denote by Li all K-th sub-layer subgraphs, and by Ii all K-th sub-connection
subgraphs. Denote by LN the total number of K-th sub-layer subgraphs. Go To Step (4).

• Step (9): The (K+2)-th bipartite graph is defined as the union of all layer subgraphs,
BK+2 = ⋃_{i=1,...,LN} Li.

Stop
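The steps above can be summarised in the following Python sketch. It is a minimal illustration under our own naming and data-structure assumptions: the graph is an adjacency dictionary, each returned bipartite graph is a set of edges, and within each connected component the start vertex is chosen arbitrarily rather than on the domain boundary (which preserves the valency bounds of Lemma 1, although not necessarily the planar-case bound of Lemma 6).

    from collections import deque

    def bfs_layers(adj):
        """Layer indices per connected component: 1 + BFS distance from a start vertex
        chosen per component (the paper starts from a boundary sub-domain)."""
        f = {}
        for s in adj:
            if s in f:
                continue
            f[s] = 1
            queue = deque([s])
            while queue:
                u = queue.popleft()
                for v in adj[u]:
                    if v not in f:
                        f[v] = f[u] + 1
                        queue.append(v)
        return f

    def partition_into_bipartite(adj):
        """Steps (1)-(9): repeatedly peel off the union of all connection subgraphs
        (bipartite by Lemma 3) until the remaining layer graph has valency <= 2."""
        bipartite = []
        remaining = {u: set(vs) for u, vs in adj.items()}
        while True:
            f = bfs_layers(remaining)
            connection = {frozenset((u, v)) for u in remaining for v in remaining[u]
                          if f[u] != f[v]}                    # edges between consecutive layers
            if connection:
                bipartite.append(connection)                  # next B_K
            remaining = {u: {v for v in vs if f[u] == f[v]}   # keep only intra-layer edges
                         for u, vs in remaining.items()}
            if max(len(vs) for vs in remaining.values()) <= 2:
                last = {frozenset((u, v)) for u in remaining for v in remaining[u]}
                if last:
                    bipartite.append(last)                    # final graph with valency <= 2
                return bipartite

    # Example: a 2 x 3 grid of sub-domains (a planar communication graph) splits into
    # one connection graph plus one residual graph with valency <= 2.
    grid = {0: {1, 3}, 1: {0, 2, 4}, 2: {1, 5}, 3: {0, 4}, 4: {1, 3, 5}, 5: {2, 4}}
    print(partition_into_bipartite(grid))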


The correctness and efficiency analysis of the algorithm are presented in the Correctness of

graph partition algorithm and Efficiency of graph partition algorithm sections.

4 Scheduling over bipartite graphs

We assume that

• Each processor can only send or receive one message at a time.

• Each processor sends different messages to different neighbors.

• The communication cost is not necessarily the same for each pair of communicating processors.

The cost of each communication can be bounded as tcomm = τ + Lσ where τ is the message ini-

tialization time, L is the size of the message in bytes, and σ is the time per byte at which a

processor can send or receive data through the network. In reality, σ may depend on the number

of processor pairs trying to send a message at the same time and the routes within the intercon-

nection. The measurements of the σ parameter on the TACC Cray-Dell Power-Edge Xeon Cluster

are presented in the section Performance measurements. The discussion on the architecture of

the cluster in the context of concurrent point-to-point communication is presented in the section

Concurrent point-to-point communication on clusters.

If we assume the size of each message is the same, the scheduling problem is equivalent

to the problem of coloring of edges of a graph representation of the computational domain, where

each vertex denotes a processor and each edge denotes a message to be sent.

A k-edge-coloring of graph G is an assignment of k colors to the edges of G in such a way that

any two edges meeting at a common vertex are assigned different colors. If G can be k-edge colored,

then G is said to be k-edge colorable. The chromatic index of G, denoted by χ′(G), is the smallest

k for which G is k-edge-colorable. From König's Theorem [32] it follows that if G is a bipartite

graph whose maximum vertex valency is d, then χ′(G) = d. The idea of the proof is mathematical

induction on the number of edges of G. König's theorem tells us that every bipartite graph with

maximum vertex valency d can be edge-colored with just d colors. All trees are bipartite graphs

[32].
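As an illustration of how König's theorem translates into a scheduling procedure, the following Python sketch (our own illustration; the function name and the list-of-pairs edge representation are assumptions, not the paper's code or the algorithm of [31]) colours the edges of a bipartite multigraph with at most ∆ colours using the alternating-path argument; each colour class then forms one round of concurrent point-to-point messages.

    def edge_colour_bipartite(edges, num_colours):
        """Colour the edges of a bipartite (multi)graph so that edges sharing a vertex get
        different colours, using at most num_colours colours (num_colours must be at least
        the maximum vertex valency). Follows the constructive proof of König's theorem."""
        colour_of = {}        # edge index -> colour
        at = {}               # (vertex, colour) -> index of the edge carrying that colour there

        def free_colour(x):
            return next(c for c in range(num_colours) if (x, c) not in at)

        for i, (u, v) in enumerate(edges):
            a, b = free_colour(u), free_colour(v)
            if a != b:
                # Walk the alternating path coloured a, b, a, ... starting at v,
                # then swap the two colours along it, so that a becomes free at v too.
                path, x, want = [], v, a
                while (x, want) in at:
                    j = at[(x, want)]
                    path.append(j)
                    p, q = edges[j]
                    x = q if p == x else p
                    want = b if want == a else a
                for j in path:                                  # remove old colour entries
                    for end in edges[j]:
                        del at[(end, colour_of[j])]
                for j in path:                                  # re-insert with swapped colours
                    colour_of[j] = b if colour_of[j] == a else a
                    for end in edges[j]:
                        at[(end, colour_of[j])] = j
            colour_of[i] = a
            at[(u, a)] = i
            at[(v, a)] = i
        return colour_of

    # Example: 4 processors on a path 0-1-2-3 with the 1-2 message split into two chunks.
    edges = [(0, 1), (1, 2), (1, 2), (2, 3)]
    rounds_needed = max(edge_colour_bipartite(edges, num_colours=3).values()) + 1   # -> 3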

When a graph is not a representation of a 2D mesh, then (3.5) implies that the entire graph G
can be colored by using not more than

∑_{K=1,...,h−2} (h − K + 1) + 2 = h²/2 + h/2 − 1 = O(h²)

colors, in a pessimistic case. The total communication cost is then bounded by

tcomm = (τ + Lσ)(h²/2 + h/2 − 1) = O(τh² + Lσh²)    (4.7)


When a graph to be partitioned is a representation of a 2D mesh, there are not more than

three bipartite subgraphs and equation (3.6) implies that the chromatic index (number of colors

required to color the entire graph G) is no greater than h + (h − 1) + 2 = 2h + 1 = O(h). The total

communication cost is then bounded by

tcomm = (τ + Lσ)(2h + 1) = O(τh + Lσh)    (4.8)

The presented evaluations consider a pessimistic case; in practice, the algorithm usually deals better with

the input graph. For example, on a regular 2D mesh with maximum vertex valency 6, presented

in Fig. 4, there are only two sets of messages to be sent within 6 time units, and this result does

not depend on the number of processors represented by the graph vertices.

In the best known bipartite graph edge coloring algorithms, the complexity is

O(n log n log h + e log h), where n is the number of vertices, e is the number of edges, and h is the maximum

vertex valency in the graph [31].

If we assume that the size of each message is not the same, the scheduling problem can

also be solved by the same strategy above. It is only necessary to select a minimum size message,

divide all large messages into a sequence of minimal-sized messages, and replicate the edges in the

bipartite graphs representing the larger messages. In the new graph, a single edge between a pair

of processors is a minimum size message, and multiple edges constitute a large-sized message.

5 h-relation personalized communication strategy

In the case when the communication pattern arises from a partition of the computational mesh and

different messages with different sizes must be exchanged between processors representing sub-

domains adjacent in a geometrical sense, the proposed communication strategy over a heterogeneous

interconnection network can be summarized in the following steps:

1. Create a graph G describing the partitioned mesh as presented in (3.1). The maximum vertex

valency in the graph is denoted by ∆(G) = h.

2. Execute the graph partitioning algorithm described in section Graph partition algorithm. The

algorithm results in a set of bipartite graphs {Bi}i=1,...,NB .

3. Schedule messages over each bipartite graph, as described in a section Scheduling over bipartite

graphs:

(a) Find the minimum and maximum size of messages represented by edges in the Bi graphs.

Split each message larger than the minimum into a sequence of messages with a length

equal to the minimum size message (note that the last one will usually be shorter). Replicate each edge by the number of messages in the sequence.


(b) Color the bipartite graph with multiple edges using an efficient algorithm, e.g., the one presented

in [31].

(c) Schedule messages according to the edge coloring.

In the case of a partitioned 2D mesh, the graph partition algorithm produces no more than three

bipartite graphs, and in the general case there are no more than h− 1 bipartite graphs, as shown

in the next section.
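To tie the steps together, here is a hedged end-to-end Python sketch of steps 3(a)-(c). It is our own illustration with hypothetical names; the greedy colouring is a simple stand-in that may use up to 2h − 1 colours instead of the optimal number produced by the algorithm of [31] or by the König-style sketch of the previous section.

    def split_and_replicate(messages):
        """Step 3(a): messages maps a processor pair (p, q) to its message length.
        Every message is split into chunks of the minimum message length; each chunk
        becomes one (replicated) edge of the bipartite multigraph."""
        chunk = min(messages.values())
        edges = []
        for (p, q), length in messages.items():
            n_chunks = -(-length // chunk)            # ceiling division; last chunk may be shorter
            edges.extend([(p, q)] * n_chunks)
        return edges, chunk

    def greedy_edge_colouring(edges):
        """Step 3(b), simplified: smallest colour free at both endpoints."""
        used = {}                                      # vertex -> set of colours already used there
        colour_of = {}
        for i, (p, q) in enumerate(edges):
            c = 0
            while c in used.get(p, set()) or c in used.get(q, set()):
                c += 1
            colour_of[i] = c
            used.setdefault(p, set()).add(c)
            used.setdefault(q, set()).add(c)
        return colour_of

    def communication_rounds(edges, colour_of):
        """Step 3(c): all chunks of one colour can be exchanged concurrently."""
        rounds = {}
        for i, c in colour_of.items():
            rounds.setdefault(c, []).append(edges[i])
        return [rounds[c] for c in sorted(rounds)]

    # Example: three sub-domain pairs with unequal message sizes (in bytes).
    edges, chunk = split_and_replicate({(0, 1): 200, (1, 2): 100, (0, 2): 100})
    rounds = communication_rounds(edges, greedy_edge_colouring(edges))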

6 Correctness of the graph partition algorithm

Algorithm Correctness: general case. The analysis in this section applies to a simple geomet-

rical consideration over any graph representation of a partitioned 3D domain, where vertices denote

sub-domains, and edges denote adjacency (neighbors that share a common border, a non-empty

2D area). The graph representation of a partitioned 2D domain is a planar graph. This section deals

with graph representations of partitioned 3D meshes, which in general are not planar graphs.

Lemma 1. The execution of the layer-connection algorithm over graph G with ∆(G) ≤ h

produces a connection subgraph I with ∆(I) ≤ h, and a layer sub-graph L with ∆(L) ≤ h− 1.

Proof: Execution of the algorithm results in the subgraphs {Li = (V^L_i, E^L_i)}, i = 1, ..., LN, and
{Ii = (V^I_i, E^I_i)}, i = 1, ..., LN − 1. The connection subgraph I and the layer subgraph L are defined
as the unions I = ⋃_{k=1,...,LN−1} Ik and L = ⋃_{i=1,...,LN} Li.

Each connection subgraph Ik, k = 1, ..., LN − 1 contains all vertices from layers Lk and Lk+1,

but does not contain edges between vertices within the same layer. If the vertex v = u0 ∈ V^L_1 has

valency h and all neighbors of v are in the next layer, then ∆(I1) = h. If there are neighbors of the

vertex v in graph G within the same layer Li as v, then ∆(Ii) < h. We conclude that ∆(I) ≤ h.

Any vertex v ∈ V^L_i has valency no greater than h − 1 in Li, since

• If i = 1, then v ∈ L1 is the initial vertex v = u0 and has no edges in L1, because all of its edges

are assigned to the connection subgraph I1. This implies that the valency of v = u0 is 0

in L1.

• If i > 1, then each vertex within Li has an edge to some vertex from the previous layer

Li−1, from the definition of the layer-connection algorithm. If v ∈ Li has h neighbors in G, then at

least one of its neighbors belongs to the previous layer Li−1. This implies that v may have a

maximum of h− 1 neighbors in Li; in other words, the maximum valency of v in Li is h− 1.

We conclude that ∆(L) ≤ h− 1.

Lemma 2. Each connection subgraph Ii is a bipartite graph.


Proof: There are two disjoint sets of vertices in each Ii graph, the first one contains all

vertices from layer Li and the second one contains vertices from layer Li+1. From the definition

of the connection subgraph construction there are no edges between vertices inside each set; hence

the connection subgraph Ii is a bipartite graph.

Lemma 3. The union of all connection subgraphs, I = ⋃_{k=1,...,LN−1} Ik, is a bipartite graph.

Proof: Each connection subgraph Ii contains all vertices from layers Li and Li+1. The vertices
of the union graph I = (V^I, E^I) can be split into two disjoint sets V^I = V^I_1 ∪ V^I_2, where V^I_1
is the union of the vertex sets of all odd layers, V^I_1 = ⋃_{k=0,...,⌊(LN−1)/2⌋} V^L_{2k+1}, and V^I_2 is the
union of the vertex sets of all even layers, V^I_2 = ⋃_{k=1,...,⌊LN/2⌋} V^L_{2k}.

There are no edges between vertices from the same layer. There are also no edges between

vertices from layer Li and Li+2 and beyond. This implies that the union graph I is a bipartite

graph.

Theorem 1. Any graph G is a union of h− 1 bipartite graphs {Bi}i=1,...,h−1, where ∆(Bi) ≤

h− i+ 1, i = 1, ..., h− 2 and ∆(Bh−1) ≤ 2.

Proof: By executing the layer-connection algorithm over G we obtain a connection sub-graph

I1 with ∆(I1) ≤ h and layer sub-graphs L with ∆(L) ≤ h − 1, according to Lemma 1. By

executing the algorithm recursively h−2 times over layer sub-graphs, we obtain a set of connection

subgraphs Bk = Ik, k = 1, ..., h− 2 and the layer subgraph Bh−1 = Lh−1 with ∆(Lh−1) ≤ 2. Each

connection sub-graph Ik, k = 1, ..., h − 2 is a bipartite graph by Lemma 3. The terminal layer

subgraph, Bh−1, has a maximum vertex valency 2, which implies that it is also a bipartite graph.

The procedure is illustrated in Fig. 6.

Theorem 1 establishes the correctness of the layer-connection algorithm in the general

case.

Algorithm correctness: graph representations of 2D meshes. The analysis in this sec-

tion applies to a simple geometrical consideration over planar graphs. Graph representations of

partitioned 2D meshes are planar graphs.

For clarity, we assume that the domain is convex without holes. However, the analysis can be

easily extended to the case of non-convex domains with holes. We make the following observations:

• All layers are created by the passing of a wave from the initial sub-domain vertex u0, located on

the boundary of the domain.

• Each layer represents a sequence of connected sub-domains. The first and the last sub-domains

are located on the boundary of the domain.

• Within a layer there exist paths between each pair of vertices. Each edge represents adjacency

between sub-domains.


• We define an internal path as one that passes through the entire layer Li, and call edges

within this path internal edges. An internal path contains all vertices in the layer. We order

all vertices representing sub-domains in Li from “up” to “down”, where by the “upper” sub-

domain we mean the one located on the upper boundary of the domain. Fig. 7 illustrates

the ordering.

We enumerate the vertices of Li according to this order and denote them by <v_1, ..., v_p>, where p is
the length of the path.

• In general, all non-internal path edges may be the result of adjacency on the side of the pre-

vious layer, or on the side of the next layer, since edges represent geometrical 2D connections

between sub-domains.

• We distinguish two sides of a layer for assigning a “side” to connections between sub-domains

(that are not in the internal path defined above): the previous layer side and the next layer

side.

• Within a layer, there are three kinds of edges: internal edges, previous side edges and next

side edges, according to their geometrical location, as illustrated in Figs. 7 and 8.

Lemma 4. There are no previous side edges.

Proof: Let us suppose there is a previous side edge {v_k, v_{k+l}} with l ≥ 2, see Fig. 8. From the
definition of the layer construction, each vertex v_k must be connected with some vertex from the
previous layer. Existence of a previous side edge {v_k, v_{k+l}} means that the sub-domain represented
by vertex v_k is adjacent to the sub-domain represented by vertex v_{k+l} through a common vertex
on the previous layer side. But previous-side adjacency of sub-domain v_k to sub-domain v_{k+l} for
l ≥ 2 means that sub-domains v_{k+1}, ..., v_{k+l−1} are "covered" from any possible connection to the
previous layer. Hence, the existence of previous side edges is a contradiction with the fact that the
covered vertices are adjacent to some vertex v_m of the previous layer Li−1.

Lemma 5. Next side edges do not intersect one another.

Proof: Let us suppose there is a next side edge {v_k, v_{k+m}} intersecting a next side
edge {v_{k+l}, v_{k+n}}, where m > 2, n − l > 2 and 1 < l < m < n. Adjacency of sub-domain v_k to
sub-domain v_{k+m} means that sub-domains v_{k+1}, ..., v_{k+l}, ..., v_{k+m−1} are covered on the next layer
side by the {v_k, v_{k+m}} connection. Hence, this contradicts the fact that sub-domain v_{k+l} can also
be connected to sub-domain v_{k+n} on the next side.

Lemma 6. The second execution of the layer-connection algorithm over all layers Li with

∆(Li) = h − 1 results in one bipartite graph SI with ∆(SI) ≤ h − 1 and one graph SL with

∆(SL) ≤ 2.

Proof: By executing the layer-connection algorithm over each layer Li, we obtain a set of
sub-layer subgraphs {L^m_i}, m = 1, ..., LNi, i = 1, ..., LN, and a set of bipartite sub-connection
subgraphs {I^m_i}, m = 1, ..., LNi − 1, i = 1, ..., LN,


where LNi denotes the number of sub-layers resulting from execution of the algorithm over the layer

Li.

The union of all sub-connection subgraphs, SI = ⋃_{i=1,...,LN} ⋃_{m=1,...,LNi−1} I^m_i, is a bipartite graph by
Lemma 3. Also, ∆(SI) ≤ h − 1, since ∆(I^m_i) ≤ h − 1 for each layer Li, i = 1, ..., LN, according to
Lemma 1.

It is necessary to show that ∆(SL) ≤ 2, where SL is the union of the sub-layers L^m_i resulting from
the execution of the algorithm over all layers Li. We can do this by proving the same property
for each sub-layer, ∆(L^m_i) ≤ 2, i = 1, ..., LN, m = 1, ..., LNi, since all sub-layers are disjoint. In
other words, it is only necessary to show that each vertex v_k in any sub-layer L^m_i has a valency
no greater than 2 in L^m_i.

We proceed with the proof by considering all adjacency cases of a vertex v_k of L^m_i, together with
observations about the cases (each sub-layer has all the properties of a layer):

• The internal path in the sub-layer is an ordered set of vertices along the path <v_1, ..., v_p>.

• We can assume that the initial vertex v_1 is the first vertex in the internal path, and is located
on the boundary of the domain.

• Each vertex v_k from sub-layer L^m_i can have only two kinds of edges: internal edges and next
side edges, since previous side edges are not possible, according to Lemma 4.

• Each vertex v_k in sub-layer L^m_i for m > 1 must have an edge to some vertex v_l of
the previous sub-layer L^{m−1}_i.

• Each vertex v_k in sub-layer L^m_i can have neighbors in the layer Li only from the sub-layers
L^{m−1}_i, L^m_i or L^{m+1}_i, from the properties of the layer-connection algorithm.

• We distinguish forward and backward directions along the internal path according to the order
of vertices on the internal path.

First, we consider the following observations for the possible adjacency cases in the backward

direction:

1. Each vertex v_k from sub-layer L^m_i can have only one internal edge, to the vertex v_{k−1}, and
many next side edges.

2. If there is a next side edge {v_k, v_{k+l}} between two vertices in sub-layer L^m_i, then between
those vertices there can only be vertices from further sub-layers L^{m+n}_i, n ≥ 1. Assume that between
them there is a vertex v_j, k < j < k + l, from the same sub-layer, see Fig. 9(a). The
vertex v_j must be connected with some vertex from the previous sub-layer L^{m−1}_i, by the
definition of a sub-layer. There is no possibility of connecting the internal vertices v_{k+1}, ..., v_{k+l−1}


with any external vertex, because all the vertices are covered by the next side edge, and edges

do not intersect.

3. If the vertex v_k from sub-layer L^m_i is connected by a next side edge with another vertex
v_l from the same sub-layer in the backward direction, then the neighbor v_j of the vertex
v_k through the internal edge in the backward direction is from the next sub-layer L^{m+1}_i, see
Fig. 9(b). Let us consider the execution of the layer-connection algorithm over the layer
Li. The m-th step assigns vertices v_k and v_l to sub-layer L^m_i. In the next step v_j will be
assigned to the next sub-layer L^{m+1}_i.

4. Each vertex v_k from sub-layer L^m_i can have only one neighbor v_l through a next side
edge from the same sub-layer in the backward direction. Suppose it has two such neighbors,
v_l and v_{l+1}, see Fig. 9(c). Then the neighbor v_{l+1} from sub-layer L^m_i is covered by the
edge {v_k, v_l}, which is a contradiction with observation 2.

5. Each vertex v_k from sub-layer L^m_i can have only one neighbor within the same sub-layer
in the backward direction. The reason is the following: the vertex v_k cannot have two
neighbors from the same sub-layer through next side edges in the backward direction,
from observation 4, see Fig. 9(c). If the vertex has only one neighbor from the same sub-layer
through a next side edge in the backward direction, then its internal edge neighbor is
not from the same sub-layer, from observation 3, see Fig. 9(b). If the vertex v_k has an
internal path neighbor from the same sub-layer in the backward direction, then the vertex
v_k cannot have another neighbor through a next side edge, compare Fig. 9(d).

We conclude that the vertex v_k may have only one neighbor, v_{k−1}, from the same sub-layer when
traversing the internal path in the backward direction. The same observations apply in the forward
direction. Since the vertex v_k from sub-layer L^m_i can have only two neighbors, v_{k−1} and v_{k+1}, in
the same sub-layer, as determined from a scan in both directions, ∆(L^m_i) ≤ 2.

Theorem 2. If the graph G is planar and ∆(G) = h, then G is a union of no more than 3
bipartite graphs; that is, G = ⋃_{i=1,...,3} Bi, with ∆(B1) ≤ h, ∆(B2) ≤ h − 1 and ∆(B3) ≤ 2.

Proof: The theorem is true as a consequence of Lemma 1 and Lemma 6. The recursive

partitioning procedure in Fig. 6 is truncated at the second recursion level as illustrated in Fig. 10,

because the sub-graph layers have a maximum vertex valency of 2.

The theorem establishes the correctness of the layer-connection algorithm in the case of a

planar graph representing a partition of a 2D mesh.


7 Complexity of the graph partition algorithm

Algorithm complexity for the general case. In the general case the graph partition algorithm

produces h− 1 bipartite graphs, where h is the maximum vertex valency in the graph.

We assume that the graph is stored as a data structure where graph vertices and edges are

kept in a variable-length list. The operation of accessing all edges of all vertices in the list has a

complexity O (nh), where n denotes the number of vertices in the graph. The complexity of each

step (See Graph partition algorithm.) in the algorithm is described below:

1. Step(1) is O (nh). The layer-connection algorithm visits each vertex, browses all of its

neighbors, and assigns layer-connection indexes to each vertex.

2. Step(2) is O (nh). This step browses all vertices and edges of the graph, to form a bipartite

graph B1 and layer subgraph L1. It is performed by attributing graph edges and vertices.

3. Step(3) is O (1). Initializes a layer counter.

4. Step(4) is O (nh). This step searches all vertices of the graph, and all edges of particular

layers, to find the maximum valency of the graph vertices. All edges of all graph vertices are

browsed, to determine their assignment to layers.

5. Step(5) is O (1). The layer counter is incremented.

6. Step(6) is O (nh). This step requires browsing all vertices of the graph and all edges of

those vertices assigned to particular layers. All edges of the graph vertices are checked to

determine their layer assignment.

7. Step(7) is O (nh). This step queries all vertices of the graph, and all edges of particular

layers, to group them into a bipartite graph Bk+1 and the sub-layers of subgraph Lk+1. This

is accomplished by attributing graph edges and vertices. All edges of all graph vertices are

searched for assignment to a sub-layer.

8. Step(8): This is a pseudo-code step that only defines a naming convention; it requires no graph traversal.

9. Step(9) is O (nh). This step requires browsing all vertices of the graph, and all edges of

particular sub-layers, for attributing them to the last bipartite graph BK+2. All edges

of the graph vertices are searched for assignment to a sub-layer.

In the general case, Step(4) - Step(8) are repeated h − 2 times. The total computational com-

plexity contribution from the above list is

O(nh + nh + 1 + (h − 2)(nh + 1 + nh + nh) + nh) = O(nh²)    (7.9)


Algorithm complexity for a planar graph. The case of a planar graph differs from the general
case only in the number of iterations. The Step(4) - Step(7) loop is performed at most once.
Hence, the total computational complexity of the algorithm for a planar graph is

O(nh + nh + 1 + nh + 1 + nh + nh + nh) = O(nh)    (7.10)

8 Concurrent point-to-point communication on clusters

In today’s HPC clusters the processors are interconnected by high performance switches that pro-

vide point-to-point bandwidths from 0.1 to 10 Gb/s (gigabits per second). Every processor is

connected to a system bus, which in turn is bridged to an I/O bus (usually a PCI bus). The I/O

bus, in turn, is connected to the network switch through a “network adapter”, as shown in Figure

11.

Peak performance is usually quoted for a single direction, as derived with a “ping-pong” program

that measures the round-trip MPI send/receive message speeds between two processors. Some

switched networks are bidirectional: that is, they can send messages between the same two points

simultaneously in both directions. But even when the quoted bidirectional speed is much higher

than the unidirectional speed, practical MPI communication performance measurements show only

a marginal or no performance improvement in bidirectional communication [24], although there are

exceptions.

The physical connection to the switch from the network adapter is called a “port” of the switch.

There are usually between 4 and 128 ports on a switch. Often switches with large port counts have

layers of “crossbars” to interconnect groups of ports. To extend a cluster’s node interconnection

beyond the number of ports available on a single switch, it is necessary to create a hierarchy of

switches. A Clos topology [4] is implemented to obtain full bisection bandwidth; that is, to achieve the peak aggregate performance when any half of the nodes simultaneously communicates

with any other half. In essence, a Clos topology allows concurrent independent communications to

occur throughout a hierarchical switch fabric.

For high-speed switches the performance of each simultaneous communication is nearly the

same as the communication performance of a single communication through the switch. Figure

12(a) depicts 8 single-processor nodes connected to a high speed switch. A large message can be

sent between any two processors at a nominal bandwidth. For example, on a high-speed switch,

processors 1−4 can send messages to processors 5−8 concurrently, each communicating at or near

the nominal bandwidth.
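As an illustration of this scenario, the following mpi4py sketch (an assumed setup written for this discussion, not the benchmark used in the paper) pairs ranks 0-3 with ranks 4-7 and lets all four pairs exchange 1 MB messages at the same time, reporting the bandwidth each pair observes.

```python
# Minimal mpi4py sketch of concurrent point-to-point exchanges between two
# halves of an 8-process job; run e.g. as: mpirun -np 8 python concurrent.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
half = size // 2
partner = rank + half if rank < half else rank - half   # pair ranks 0-3 with 4-7

send = np.full(1 << 20, rank, dtype=np.int8)   # 1 MB message
recv = np.empty_like(send)

comm.Barrier()                                 # start all pairs together
t0 = MPI.Wtime()
comm.Sendrecv(send, dest=partner, sendtag=0,
              recvbuf=recv, source=partner, recvtag=0)
t1 = MPI.Wtime()
print(f"rank {rank} <-> {partner}: {send.nbytes / (t1 - t0) / 1e6:.1f} MB/s")
```

On a full-bisection switch each pair should report close to the nominal single-message bandwidth.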

The peak bandwidth for switches is determined by measuring the “effective” bandwidth, for a

very large message size, using the formula


BE = L / tcomm    (8.11)

where BE is the effective bandwidth, L is the message size, and tcomm is the communication

time.
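A ping-pong measurement of BE along these lines could look as follows (a hedged mpi4py sketch; the message size, repetition count, and tags are arbitrary choices, not values from the paper):

```python
# Ping-pong estimate of the effective bandwidth B_E = L / t_comm for two ranks.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
L = 1 << 22                       # message size in bytes (4 MB here)
buf = np.zeros(L, dtype=np.int8)
reps = 20

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=1)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=1)
t_comm = (MPI.Wtime() - t0) / (2 * reps)   # one-way time per message
if rank == 0:
    print(f"L = {L} bytes, B_E = {L / t_comm / 1e6:.1f} MB/s")
```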

The effective bandwidth is often measured over a wide range of message sizes to produce a

profile of the communication performance. At very large message sizes an asymptotic (peak) value is

achieved. Developers use these profiles to quickly determine the time for sending a fixed-size

message. Fig. 13 shows the profiles for single and concurrent (explained below) communications.

(Note the logarithmic scale in the message size.)

The effective bandwidth does not account for the startup time, or latency, required to initialize

the communication, and hence only reflects the true bandwidth in the asymptotic region where

the latency has practically no contribution. The communication time is often modeled by a constant

bandwidth B and latency τ , with a linear dependence on the message size, by the formula

tcomm = τ + L/B    (8.12)

The bandwidth and latency in this formula correspond, respectively, to the asymptotic bandwidth and to the y-intercept time (message size divided by effective bandwidth for the smallest message size) in an effective bandwidth profile.

In HPC networks, the latency and bandwidth are about 10^-5 seconds and 100 MB/s (megabytes per second), respectively. For a message size whose transfer time equals the latency (Latency = Message Size / Bandwidth), the system reaches half the potential of the network. It is therefore necessary to send messages of about 1 KB (10^-5 s × 100 MB/s) to achieve the half-peak

performance. While the bandwidth, per se, doesn’t change in this model, the communication times

for short-message algorithms are significantly impacted by the latency.
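A short numerical check of this model, using the τ = 10^-5 s and 100 MB/s figures quoted above (the function names are ours, chosen for illustration):

```python
TAU = 1.0e-5      # latency in seconds, as quoted above
BW  = 100.0e6     # bandwidth in bytes per second (100 MB/s)

def t_comm(L, tau=TAU, bw=BW):
    """Linear communication-time model of equation (8.12)."""
    return tau + L / bw

def effective_bw(L, tau=TAU, bw=BW):
    """Effective bandwidth B_E = L / t_comm, equation (8.11)."""
    return L / t_comm(L, tau, bw)

L_half = TAU * BW                      # message size at which L/B equals tau
print(L_half)                          # 1000.0 bytes, i.e. about 1 KB
print(effective_bw(L_half) / 1e6)      # 50.0 MB/s: half of the 100 MB/s peak
```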

In Fig. 14 the communication time through a Myrinet 2000 switch for a range of message sizes

illustrates that the model is correct for a single communication path. The graph also shows the

linear behavior for multiple (30) simultaneous communications through the switch fabric on the

system. While a large number of simultaneous communications produces a degradation of

performance, the linear character implies that a constant bandwidth in equation 8.12 can be used

to model a set of simultaneous communications. (The degradation of the bandwidth is attributed

to non-dynamic routing through the switch fabric. This can be minimized by a dynamic routing

mechanism, soon to be released by Myricom [38].) Also, for messages below 10 KB there may be

different algorithms for handling normal and short messages. Nevertheless, a single linear equation

may be adequate to handle both cases.

Many of the newer HPC clusters have dual-processor nodes (e.g., Intel Xeon, Apple G5 X-

server, AMD MP/Opteron), and the IBM Power4 series has a 4-way platform (model P655) that


is used as a building block for clusters. Figures 12(b) and 12(c) show typical interconnects for

2-way and 4-way nodes, respectively. These systems complicate the communication analysis in

two ways. First, intra-node communication between processors on a node is usually much faster

than the inter-node communication between processors in different nodes. (This might not be

the case if the MPI libraries were not compiled with the correct configuration switches.) Second,

the switch bandwidth must be shared with the other processors on the node. For instance, in

Fig. 12(c), a single message between processors 1 and 4 might experience a performance of 250

MB/s, while groups of simultaneous messages between nodes (e.g., between processors 1 − 4 and

processors 5 − 8) would experience a performance of less than 62.5 MB/s, one fourth of the peak

network speed. The characteristics of intra-node and node-shared communications are illustrated

in Fig. 15. Concurrent communications between nodes “share” the bandwidth equally. Intra-node

communication performance is significantly higher than inter-node performance. Also, for small

intra-node messages, cached message performance is higher. (Inter-node communication is not affected by caching; results not shown.) Often the inter-node bandwidth is used in equation 8.12

because it is the bottleneck in the communications.
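The sharing effect is simply a division of the adapter bandwidth by the number of processors communicating off-node at once; a trivial sketch with the numbers used above (the per-link figure is the Fig. 12(c) example, not a measured constant):

```python
PEAK = 250.0   # assumed per-link bandwidth in MB/s (the Fig. 12(c) example)

def shared_bandwidth(peak, procs_per_node):
    """Inter-node bandwidth seen by each processor when all processors on a
    node communicate off-node at once and share one network adapter."""
    return peak / procs_per_node

print(shared_bandwidth(PEAK, 1))   # 250.0 MB/s: a single message
print(shared_bandwidth(PEAK, 4))   # 62.5 MB/s: four concurrent off-node messages
```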

9 Performance measurements

The performance of our h-rpc scheduling strategy based on edge-colored bipartite graphs was evaluated on the TACC

Linux Xeon cluster, using a planar graph representing a 2D rectangular domain partitioned into p

sub-domains. The domain was partitioned according to the pattern presented in Fig. 4, but for

different numbers of sub-domains p = 4, 8, 16, 32 and 64, equal to the number of processors. Here,

the maximum number of neighboring sub-domains is equal to ∆ (G) = h = 6. The graph-partition

algorithm produces 2 bipartite graphs (B1 and B2), where ∆(B1) = 4 and ∆(B2) = 2. The number

of colors required for edge-coloring is equal to 6. There are S = 6 sets of messages to be exchanged

concurrently. The number of messages, M , in the set is dependent on the number of processors and

is determined by the formula M = p^(1/2) + p. The theoretical communication cost of the algorithm in this case is equal to tcomm = (τh + Lσ(M)) · 2S, where L is the size of each message within a set,

S is the number of sets, and the factor 2 accounts for the fact that each message is sent in both

directions. The time required to send a byte, σ (M), depends on the number of messages in the set

M .
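In code, the theoretical cost model reads as follows (a sketch; σ(M) must be supplied from measured profiles such as Fig. 13, and the constant σ used in the example call is only a placeholder):

```python
import math

TAU, S, H = 1.0e-5, 6, 6   # latency [s], number of message sets, max neighbors

def messages_per_set(p):
    """M = p**(1/2) + p messages per set for the Fig. 4 partition pattern."""
    return math.sqrt(p) + p

def theoretical_time(L, p, sigma):
    """Cost model t_comm = (tau*h + L*sigma(M)) * 2S, with sigma(M) the
    per-byte send time (s/byte) for M concurrent messages (from Fig. 13)."""
    M = messages_per_set(p)
    return (TAU * H + L * sigma(M)) * 2 * S

# Example with a placeholder sigma of 1e-8 s/byte (i.e. a flat 100 MB/s):
print(theoretical_time(L=1_000_000, p=16, sigma=lambda M: 1.0e-8))  # ~0.12 s
```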

The measurements of the effective bandwidth BE(M) performed on the Xeon cluster are presented in Fig. 13. The effective bandwidth BE(M) is related to the real bandwidth B(M) by the relation BE(M) = L/(τ + L/B(M)); compare (8.11) and (8.12). This implies that

B(M) = L / (L/BE(M) − τ)    (9.13)


With L expressed in bytes,

σ(M) = 10^6/B(M) = 10^6/BE(M) − 10^6 τ/L    (9.14)

where τ = 10^-5 and the term 10^6 τ/L = 10/L is negligible for very large message sizes L. We can simply calculate values of σ(M) by using the profiles of BE(M) in Fig. 13.
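The conversion from a measured effective bandwidth to the quantities of the cost model can be sketched directly from (9.13); here we assume BE and B are expressed in bytes per second and L in bytes, and we compute σ(M) as seconds per byte (the paper's (9.14) expresses the same quantity up to its 10^6 scaling factor):

```python
TAU = 1.0e-5   # latency in seconds, as in Section 8

def real_bandwidth(BE, L, tau=TAU):
    """Equation (9.13): recover B(M) from the effective bandwidth BE(M)
    measured at message size L; BE and B in bytes per second, L in bytes."""
    return L / (L / BE - tau)

def sigma(BE, L, tau=TAU):
    """Per-byte send time sigma(M) = 1/B(M) = 1/BE(M) - tau/L (seconds/byte).
    Equation (9.14) is this expression up to the paper's 1e6 scaling factor."""
    return 1.0 / BE - tau / L

# For very large L the latency correction tau/L is negligible, so sigma(M) can
# be read directly off the asymptote of the BE(M) profile in Fig. 13.
print(sigma(BE=1.0e8, L=1.0e7))   # ~1e-8 s/byte for a 100 MB/s asymptote
```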

In our experiment the processor count was set and then the message size was varied. The

number of nodes in the graph G changes, as well as the number of messages (M = p^(1/2) + p in each of S = 6 sets), when the processor count is varied.

From the profiles in Fig. 13, (asymptotic) bandwidths were determined for each processor count and converted to σ(M) values via (9.14); the theoretical performance was then calculated from tcomm = (τh + Lσ(M)) · 2S (message size L, number of messages in a set M = p^(1/2) + p, and number of sets S = 6; see

Fig. 17). A comparison of the measured and theoretical results in Figs. 17 and 16 shows good

agreement.

The above measurements were performed on partitioned domains with a maximum number of

neighboring sub-domains equal to 6 (i.e., h = 6). In HP-adaptive computations, the maximum

number of neighboring sub-domains usually varies between 4 and 8. Hence our experimental case

represents a realistic test for HP -adaptive methods and predicts excellent scaling when a large

number of processors is required.

10 Conclusions

• We have shown that all communication problems arising in H- and HP -adaptive computa-

tions can be solved using an h-relation personalized communication (h-rpc), where h is the

maximum number of neighboring sub-domains.

• We propose an alternative h-rpc strategy for communication patterns arising from the partition

of the computational domain into sub-domains.

• The strategy uses an algorithm that partitions a graph representation of the computational

domain into a set of bipartite graphs, edge-colors the bipartite graphs, and schedules com-

munication according to the assigned color.

• We have shown the correctness of the proposed graph partitioning algorithm and analyzed its

computational complexity.

• We have derived the cost of the message scheduling obtained from our bipartite graph edge-coloring algorithm.

• For planar graphs representing partitioned 2D areas, the number of resulting bipartite graphs

is not more than 3, and the number of colors required for edge-coloring is not greater than 2h + 1.


• For the graph representation of partitioned 3D areas, the number of resulting bipartite graphs is not more than h − 1, and the number of colors required for edge-coloring is no greater than (1/2)h² + (1/2)h − 3.

• The performance of the proposed strategy was evaluated on a Linux cluster and agrees well

with the theoretical communication cost.

• The communication cost of the proposed strategy, O(τh + (n/p)σ), scales favorably when the processor count p is increased, as opposed to the two-phase algorithm considered the optimal solution for h-rpc.

References

[1] Bader D.A. “High-Performance Algorithms and Applications for SMP Clusters”. NASA High

Performance Computing and Communications Aerospace Workshop CAS 2000, NASA Ames

Research Center, Moffett Field, CA, February 15-17, (2000)

[2] Bader D.A., Helman D.R., JaJa J. “Practical Parallel Algorithms for Personalized Communica-

tion and Integer Sorting”. ACM Journal of Experimental Algorithmics, 1(3), (1996) pp.1-42

[3] Biggs N.L. Discrete Mathematics, Oxford University Press, (1985) pp.173

[4] Clos C. “A Study of Non-Blocking Switch Networks”, Interconnection Networks for Parallel and

Distributed Processing, Chuan-lin Wu and Tse-yun Feng (ed.), IEEE Computer Society Press,

(1984), pp.126-144

[5] Cormen T.H, Leiserson C.E, Rivest R.L. Introduction to Algorithms, MIT Press, (1999) pp.964

[6] Crutchfield W.Y., Welcome M.L. “Object oriented implementation of adaptive mesh refinement

algorithms”. Scientific Prog, 2, (1993) pp.145-156

[7] Das R., Hwang Y.-S., Uysal M., Saltz J., Sussman A. “Applying the CHAOS/PARTI Library

to Irregular Problems in Computational Chemistry and Computational Aerodynamics”. Pro-

ceedings of the Scalable Parallel Libraries Conference, Mississippi State University, Starkville,

(1993) pp.45-46

[8] Das R., Uysal M., Saltz J., Hwang Y.-S. “Communication Optimizations for Irregular Scien-

tific Computations on Distributed Memory Architectures”. UMIACS Technical Reports CS-TR-

3163, UMIACS-TR-93-109 University of Maryland, Department of Computer Science (1993)

[9] Devine K.D, Hendrickson B, Boman E.G, St.John M.M, Vaughan C. “Zoltan: A dynamic load-

balancing library for parallel applications - user’s guide”. Tech. Rep. SAND99-1377, Sandia

National Laboratories, (1999)


[10] Edwards H.C. “SIERRA Framework Version 3: Core Services Theory and Design”. SAND2002-

3616, Albuquerque, NM: Sandia National Laboratories, (2002)

[11] Edwards H.C, Stewart J.R. “SIERRA, A Software Environment for Developing Complex Mul-

tiphysics Applications”. Computational Fluid and Solid Mechanics Proc. First MIT Conf., Cam-

bridge MA, (2001)

[12] Edwards H.C, Stewart J.R, Zepper J.D. “Mathematical Abstractions of the SIERRA Compu-

tational Mechanics Framework”. Proceedings of the 5th World Congress Comp. Mech., Vienna

Austria, (2002)

[13] Fink S.J., Kohn S.R., Baden S.B. “Efficient run-time support for irregular block-structured

applications”. J. Parallel Distrib. Comput., 50(1,2), (1998) pp.61-82

[14] Foster I. “Designing and Building Parallel Programs”. Available at

http://www-unix.mcs.anl.gov/dbpp

[15] Goldman A., Trystram D. “Algorithms for the Message Exchange Problem”. Proceedings of

the International Conference on Parallel Computing in Electrical Engineering, Poland (1998)

pp.153-161

[16] Hwang Y.-S., Moon B., Sharma S.D. “Runtime and Language Support for Compile Adap-

tive Irregular Problems on Distributed Memory Machines”. Software-Practice and Experience,

25(6), (1995) pp.597-621

[17] iCC: Interprocessor Collective Communication Library. Available at

http://www.cs.utexas.edu/users/rvdg/intercom

[18] Jain R., Werth J., Browne J.C., Sasaki G. “A graph-theoretic model for scheduling problem

and its application to simultaneous resource scheduling”. Computer Science and Operations

Research: New Developments in Their Interfaces, Ed. by O. Balci, R. Schander, and S. Zerrick,

Penguin Press, (1992)

[19] Johnsson S.L., Ho C.-T. “Optimal All-to-All Personalized Communication with Minimum

Span on Boolean Cubes”. Technical Report 18-91, Center for Research in Computing Technol-

ogy, Harvard University (1991)

[20] Johnsson S.L., Ho C.-T. “Optimal Broadcasting and Personalized Communication in Hyper-

cubes”. IEEE Transactions on Computers, 38(9), (1989) pp.1249-1268

[21] Kapoor A., Rizzi R. “Edge Coloring Bipartite Graphs”. Technical Report DIT-02-040, Infor-

matica e Telecomunicazioni, University of Trento (1999)


[22] Kohn S.R., Baden S.B. “A robust parallel programming model for dynamic non-uniform sci-

entific computations”. Proceedings of SHPCC-94, IEEE Computer Society Press, Silver Spring,

MD, (1993) pp.509-597

[23] Laszloffy A., Long J., Patra A. “Simple data management, scheduling and solution strate-

gies for managing the irregularities in parallel adaptive hp finite element simulations”. Parallel

Computing, 26, (2000) pp.1765-1788

[24] Liu J., Chandrasekaran B., Wu J., Jiang W., Kini S., Yu W., Buntinas D., Wyckoff P.,

Panda D. K. “Performance Comparison of MPI Implementations over InfiniBand, Myrinet and

Quadrics”, SuperComputing 2003 Conference, Phoenix, AZ, November, (2003)

[25] Marathe M.V., Panconesi A., Risinger L.D. “An experimental study of a simple, distributed

edge coloring algorithm”. Proceedings of the twelfth annual ACM symposium on Parallel algo-

rithms and architectures, Bar Harbor, Maine (2000) pp.166-175

[26] Mathur K.K, Johnsson S.L. “All-to-All Communication Algorithms for Distributed BLAS”.

Technical Report 07-93, Center for Research in Computing Technology, Harvard University

(1993)

[27] Mathur K.K, Johnsson S.L. “Communication Primitives for Unstructured Finite Element Sim-

ulations on Data Parallel Architectures”. Technical Report 23-92, Center for Research in Com-

puting Technology, Harvard University (1992)

[28] Paszynski M., Kurtz J., Demkowicz L. “Parallel Fully Automatic HP -Adaptive 2D Finite

Element Package”. ICES Report 04-07, The University of Texas at Austin (2004)

[29] PLAPACK: Parallel Linear Algebra Package. Available at

http://www.cs.utexas.edu/users/plapack

[30] Risinger L.D., Marathe M.V., Panconesi A. “Edge Coloring Algorithms for Scheduling in

ASCI: Survey and Experimental Results”. LAUR-No-97-2341, Los Alamos National Laboratory,

(1997)

[31] Rizzi R. “Finding 1-Factors in Bipartite Regular Graphs and Edge-Coloring Bipartite Graphs”.

SIAM J. Discrete Math. 15(3) (2002) pp.283-288

[32] Skiena S. “Coloring Bipartite Graphs”, Implementing Discrete Mathematics: Combinatorics

and Graph Theory with Mathematica. Reading, MA: Addison-Wesley, (1990)

[33] Sundar N.S., Jayasimha D.N., Panda D.K., Sadayappan P. “Hybrid Algorithms for Complete

Exchange in 2D Meshes”. IEEE Transactions on Parallel and Distributed Systems, 12(12),

(2001) pp.1201-1218


[34] Stewart J.R, Edwards H.C. “SIERRA Framework Version 3: h-Adaptivity Design and Use”.

SAND2002-4016, Albuquerque, NM: Sandia National Laboratories, (2002)

[35] Wang J.-C., Lin T.-H., Ranka S. “Distributed Scheduling of Unstructured Collective Commu-

nication on the CM-5.” Parallel Processing Letters, special issue on Partitioning and Scheduling,

(1995)

[36] Wenjie He, Xiaoling Hou, Ko-Wei Lih, Jiating Shao, Weifan Wang, Xuding Zhu. “Edge-

Partitions of Planar Graphs and Their Game Coloring Numbers”. Journal of Graph Theory,

41(4), (2002) pp.307-317

[37] Zhi-Zhong Chen, Xin He. “Parallel complexity of partitioning a planar graph into vertex-

induced forests”. Discrete Applied Mathematics 69, (1996) pp.183-198

[38] Personal communication with Myricom, Myricom help #29246


Figure 1: Flow chart for the fully automatic, parallel HP-adaptive finite element algorithm (initial domain read and load balancing with the ZOLTAN library, coarse and fine grid solves with a frontal solver over each sub-domain, Schur complement and global interface solve, error estimation, and fully parallel optimal hp-refinements over each sub-domain).

L \ p          4        8        16       32       64
10          0.0001   0.0001   0.0001   0.0001   0.0001
100         0.0002   0.0002   0.0013   0.0005   0.0072
1000        0.0004   0.0002   0.0015   0.0014   0.0033
10000       0.0005   0.0007   0.0016   0.0014   0.0035
100000      0.0033   0.0036   0.0087   0.0097   0.0131
1000000     0.0342   0.0352   0.0868   0.0892   0.1311
10000000    0.3255   0.3421   0.7915   0.9243   1.2204

Table 1: Communication time (in seconds) for scheduling with the bipartite graph edge-coloring algorithm. p is the number of processors and L is the message size (in bytes).


Figure 2: Two iterations of the automatic hp-adaptive strategy: Figure (a): Initial mesh = coarse mesh. Figure (b): Fine mesh in the first iteration. Figure (c): Optimal mesh after the first iteration. Figure (d): Optimal mesh from the previous iteration = coarse mesh for the next iteration. Figure (e): Fine mesh in the second iteration. Figure (f): Optimal mesh after the second iteration. Figure (g): Different colors denote different orders of approximation over a finite element. The brighter colors denote higher orders of approximation.

Figure 3: Assignment of nodes from the global interface to processors.


Figure 4: Example of a nice message scheduling for a partitioned 2D domain: Figure (a): 2D domain partitioned into sub-domains. Figure (b): Planar graph representation of the partitioned domain. Figure (c): Layer and connection subgraphs. Figure (d): Partition of the planar graph into two bipartite graphs: union of all connection subgraphs B1; union of all layer subgraphs B2. Figure (e): Edge-coloring of bipartite graphs. Figure (f): Six sets of messages to be sent.


Figure 5: Example of the worst case of message scheduling for a partitioned 2D domain: Figure (a): 2D domain partitioned into sub-domains. Figure (b): Planar graph representation of the partitioned domain. Figure (c): Layer and connection subgraphs. Figure (d): Second execution of the layer-connection algorithm over layers. Figure (e): Sub-layer and sub-connection subgraphs. Figure (f): Partition of the planar graph into three bipartite graphs: union of all connection subgraphs B1; union of all sub-connection subgraphs B2; union of all sub-layer subgraphs B3. Figure (g): Edge-coloring of bipartite graphs. Figure (h): Nine messages to be sent.


Figure 6: Partition of any graph G with maximum vertex valency h into h− 1 bipartite graphs.

Figure 7: Example of internal path and next layer edges.


Figure 8: Covering of the sub-domain v_{k+1}^{L_i} by the previous side edge connections.

Figure 9: Adjacency cases for a vertex from a sub-layer: Figure (a): Vertices covered by the edge {v_k^{L_m^i}, v_l^{L_m^i}}. Figure (b): Internal edges are from the next sub-layer. Figure (c): Only one edge from the same sub-layer through the next sub-layer edge. Figure (d): Only one neighbor from the same sub-layer through the internal edge.


Figure 10: Partition of planar graph G with maximum vertex valency h into 3 bipartite graphs.

Figure 11: Cluster communication architecture: processors, buses, network adapter cards, and switch.

Figure 12: Network connectivity for single- and multi-processor nodes.


Figure 13: Effective bandwidth profiles for concurrent communication through a Myrinet 2000switch hierarchy.

Figure 14: Point-to-point communication times for message sizes up to 8 MB. Single (1) communication and concurrent (30 simultaneous messages) communication are measured in a switch hierarchy.


Figure 15: Performance profiles for dual-processor communications. Intra-processor: On Node (Non-Cached removes data from cache before timing). Inter-processor: Between Nodes (1 and 2 simultaneous communications).

Figure 16: Measured communication time for the scheduling obtained from the bipartite graph edge-coloring algorithm. Both processor count and message size are varied.


Figure 17: Theoretical communication time over various p processors (sub-domains, partitioned according to Fig. 4), as determined by tcomm = (τh + Lσ(M)) · 2S, where the σ(M) values are determined from the asymptotes in Fig. 13, M = p^(1/2) + p is the number of messages in each set, S = 6 is the number of sets, and h = 6 is the maximum number of neighboring sub-domains.
