A faster algorithm for solving linear algebraic equations on the star graph
J. Parallel Distrib. Comput. 63 (2003) 465–480
A faster algorithm for solving linear algebraic equations on the star graph
Ramesh Chandraᵃ and C. Siva Ram Murthyᵇ,*,1
ᵃDepartment of Computer Science, Stanford University, Stanford, CA 94305, USA
ᵇDepartment of Computer Science & Engineering, Indian Institute of Technology, Madras 600036, India
Received 24 April 2000; revised 30 January 2003; accepted 7 February 2003
Abstract
The problem of solving a linear system of equations is widely encountered in many fields of science and engineering. In this paper,
we present a parallel algorithm to solve the above problem on a star graph. The proposed solution (i) is based on a variant of the
Gaussian elimination algorithm (GE) called the successive Gaussian elimination algorithm (SGE) (IEE Proc. Comput. Digit. Tech.
143 (4) (1996)) and (ii) supports partial pivoting to provide numerical stability. We present efficient matrix distribution techniques
on the star graph. Our proposed parallel algorithm employs these techniques to reduce communication overhead during matrix
operations on the star graph. We estimate the performance of our parallel algorithm and demonstrate its effectiveness by comparing
it with a recent algorithm for the same problem on star graphs (IEEE Trans. Parallel Distrib. Systems 8 (8) (1997) 803).
© 2003 Elsevier Science (USA). All rights reserved.
Keywords: Linear system; Matrix decomposition; Gaussian elimination; Parallel algorithm; Star graph
1. Introduction
The problem of solving a system of N linear equations Ax = b (where A is a known N × N matrix, and b and x are, respectively, the known and unknown N × 1 vectors) is frequently encountered in many fields of science and engineering. Efficient numerical methods for solving this problem on uniprocessor systems have been developed, and reliable, high-quality codes are available for different classes of linear systems [21]. Recent advances in networking technology, coupled with increasing microprocessor speeds, have led to widespread interest in the use of multiprocessor systems for solving many practical problems.
Gaussian elimination (GE) is one of the most popular methods for solving the above-mentioned problem. The standard GE algorithm consists of two phases, namely, a triangularization phase and a back-substitution phase. In the triangularization phase, the linear system Ax = b is transformed into Ux = c, where U is an upper triangular matrix. The solution set is then obtained by performing the back-substitution phase. Since the GE algorithm on a uniprocessor requires O(N³) computational steps to solve a system of N linear equations, a lot of research has been directed toward developing parallel GE implementations on different multiprocessor systems. A survey of several of the important algorithms in parallel numerical algebra can be found in [9,11,13,17].
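The two phases just described can be illustrated with a short sequential sketch (ours, not one of the paper's listings), combining triangularization with the partial pivoting discussed later and the back-substitution phase:

```python
import numpy as np

def ge_solve(A, b):
    """Sequential Gaussian elimination with partial pivoting:
    triangularization (Ax = b -> Ux = c) followed by back substitution.
    A background sketch only; the paper parallelizes this scheme."""
    N = len(b)
    E = np.hstack([A.astype(float), b.reshape(-1, 1).astype(float)])
    # Triangularization phase: eliminate subdiagonal entries column by column.
    for k in range(N - 1):
        p = k + np.argmax(np.abs(E[k:, k]))   # partial pivoting
        E[[k, p]] = E[[p, k]]                 # bring the pivot row up
        for i in range(k + 1, N):
            E[i, k:] -= (E[i, k] / E[k, k]) * E[k, k:]
    # Back-substitution phase on the upper triangular system Ux = c.
    x = np.empty(N)
    for i in range(N - 1, -1, -1):
        x[i] = (E[i, N] - E[i, i + 1:N] @ x[i + 1:N]) / E[i, i]
    return x
```

The triangularization loop performs the O(N³) work; back substitution costs only O(N²).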
There is a growing interest in the star graph as a desirable alternative for massive parallel computing. It has been proposed in [1] as an attractive alternative to the hypercube for interconnecting processors in parallel computers. The star graph is superior to the hypercube in three key properties: it has a lower degree, a smaller diameter, and a smaller average diameter than a hypercube with a similar number of processors [1,8]. So, the star graph has fewer communication links and smaller communication delays compared to the hypercube. The other desirable properties of the star graph include its regularity, vertex and edge symmetry, hierarchical structure, fault tolerance, and strong resilience. The topological properties of the star have been analyzed in [8]. Algorithms developed for the star include: broadcasting [14,24], sorting [19,22], embedding
*Corresponding author. Fax: +91-44-22578352.
E-mail addresses: [email protected] (R. Chandra), [email protected] (C.S.R. Murthy).
1 This work was supported by the Department of Science and Technology, New Delhi, India.
0743-7315/03/$ - see front matter © 2003 Elsevier Science (USA). All rights reserved.
doi:10.1016/S0743-7315(03)00031-5
[4,7,15,16,20,23], fault tolerance [5,12], routing [8,22], computing FFT [10], and star graph variants [18]. The major disadvantage of the star is its poor scalability, since the number of processors increases as n!, where n is the dimension of the star. To circumvent this difficulty, a variant of the star called the incomplete star has been proposed in [18]. The incomplete star allows incremental scalability, which considerably improves the scalability of the network.
Recently, a star-graph-based parallel algorithm has been proposed in [3] to solve a linear system of equations. In this paper, we present an efficient parallel algorithm to solve the same problem on the star graph. Our algorithm is based on the successive Gaussian elimination algorithm (SGE) [6]. We compare our algorithm with the one presented in [3] and show that our algorithm is superior in performance.
The rest of the paper is organized as follows. In Section 2, we first briefly describe the star graph along with its properties and then give a brief overview of the existing solution to the above-mentioned problem on the star graph. In Section 3, we first briefly describe the SGE algorithm and present two efficient matrix distribution techniques for the star graph. Then we present our parallel algorithm for the above-mentioned problem. In Section 4, we compare our algorithm with the one presented in [3]. Finally, in Section 5, we present our conclusions.
2. Existing star-graph-based solution—AD
In this section, we give a brief overview of the existing algorithm to solve a linear system of equations on the star graph. This algorithm was presented in [3] by Al-Ayyoub and Day, and we refer to it as the AD algorithm in the remainder of this paper. Before we discuss the AD algorithm, we briefly describe the star graph and summarize some of its important properties.
2.1. The star graph
The star graph of dimension n, denoted by S_n, has a set of n! processors corresponding to all the n! permutations of n distinct symbols ⟨n⟩ = {1, 2, …, n}, and a set of n − 1 generators g_2, g_3, …, g_n, where g_i is the transposition of the symbol in the ith position with the symbol in the first position. The processor u corresponding to the permutation p_1p_2…p_n, where p_i ∈ ⟨n⟩, 1 ≤ i ≤ n, is denoted by u = (p_1p_2…p_n). A processor u = (p_1p_2…p_n) is connected to the processor u(i) = (p_ip_2…p_{i−1}p_1p_{i+1}…p_n), for 2 ≤ i ≤ n (i.e., the processor corresponding to the permutation p_1p_2…p_n is connected to all those processors whose corresponding permutations result from interchanging the first symbol in p_1p_2…p_n with any of the remaining n − 1 symbols). Processor u(i) is obtained from the processor u by applying the generator g_i on u, and hence the link connecting u and u(i) is said to be of type g_i. Consequently, we see that the degree of each processor in S_n is n − 1. S_n is regular and vertex and edge symmetric. It has a diameter of ⌊3(n − 1)/2⌋ and an average path length of n + ln n + O(1) [2]. The minimum distance between a pair of processors u and v in S_n is denoted by d(u, v). Fig. 1 shows S_3 and S_4, the star graphs with dimensions 3 and 4, respectively.
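For illustration, the processors and links of S_n can be generated directly from this definition; a small sketch of ours (the function names are not from the paper):

```python
from itertools import permutations

def neighbors(u):
    """Neighbors of processor u = (p_1, ..., p_n) in S_n: applying
    generator g_i swaps the first symbol with the ith symbol, i = 2..n."""
    return [(u[i],) + u[1:i] + (u[0],) + u[i + 1:] for i in range(1, len(u))]

def star_graph(n):
    """All n! processors of S_n with their adjacency lists."""
    return {u: neighbors(u) for u in permutations(range(1, n + 1))}
```

A breadth-first search over `star_graph(4)` confirms, for instance, the degree n − 1 = 3 and the diameter ⌊3(4 − 1)/2⌋ = 4 stated above.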
Because of its rich symmetry, S_n is easily extensible and can be partitioned in a number of ways. We describe below two partitioning methods used in the remainder of the paper. In the first partitioning scheme, the set of processors in S_n is decomposed into n disjoint subsets I_1, I_2, …, I_n, where I_k is the subset of processors that end
Fig. 1. (a) The 3-dimensional star graph, S_3. (b) The 4-dimensional star graph, S_4. Note that S_4 consists of four S_3's connected appropriately.
with symbol k [1]. Each of the subsets I_k is isomorphic to the (n − 1)-dimensional star S_{n−1}, as shown for the case n = 4 in Fig. 1. In the second scheme, the set of processors in S_n is partitioned into n disjoint subsets X_1, X_2, …, X_n, where X_k is the subset of processors that start with the symbol k. The subsets I_1, I_2, …, I_n and X_1, X_2, …, X_n have (n − 1)! processors each.
A number of processor ranking schemes have been proposed for the star graph. We describe below the one that we use in the remaining sections [19].
Definition 1. Let G_n be a one-to-one mapping from the set of permutations {(p_1p_2…p_n) | p_i ∈ ⟨n⟩, p_i ≠ p_j for any j ≠ i} onto the set of integers {1, 2, …, n!}. For any permutation p_1p_2…p_n, a unique integer can be generated using the following recursive function:

G_n(p_1p_2…p_n) = 1 if n = 1, and [(p_n − 1)(n − 1)!] + G_{n−1}(q_1q_2…q_{n−1}) otherwise,

where q_1q_2…q_{n−1} is obtained from p_1p_2…p_n after dropping p_n and renumbering the remaining symbols from 1 to n − 1. For any permutation u = (p_1p_2…p_n), we use G_{n−k}(u) to denote G_{n−k}(q_1q_2…q_{n−k}) for 1 ≤ k ≤ n − 1, where q_1q_2…q_{n−k} is obtained after dropping the last k symbols from u and renumbering the remaining symbols from 1 to n − k.
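Definition 1 translates directly into a short recursive routine; an illustrative sketch of ours (the function name `rank` is not from the paper):

```python
from math import factorial

def rank(perm):
    """G_n from Definition 1: maps a permutation of 1..n to a unique
    integer in 1..n! via the recursive formula above."""
    n = len(perm)
    if n == 1:
        return 1
    # Drop p_n and renumber the remaining symbols from 1 to n-1.
    rest = [p - 1 if p > perm[-1] else p for p in perm[:-1]]
    return (perm[-1] - 1) * factorial(n - 1) + rank(rest)
```

The last symbol selects one of n blocks of size (n − 1)!, and the renumbered prefix is ranked recursively within the block, which makes the mapping a bijection onto {1, …, n!}.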
We state below the propositions that we employ in the later sections. We borrow these propositions from [3], where they are proved.
Proposition 1. Any processor u in X_k is connected to exactly one processor v in I_k, by a link of type g_n in S_n.

Proposition 2. For any two processors u = (p_1p_2…p_{n−1}k) ∈ I_k and v = (q_1q_2…q_{n−1}(k+1)) ∈ I_{k+1} such that G_{n−1}(u) = G_{n−1}(v), we have d(u, v) ≤ 3.

Proposition 3. For any two processors u = (kp_2p_3…p_n) ∈ X_k and v = ((k+1)q_2q_3…q_n) ∈ X_{k+1} such that G_{n−1}(u(n)) = G_{n−1}(v(n)), we have d(u, v) = 1.

Proposition 4. For any two processors u = (kp_2p_3…p_n) ∈ X_k and v = (kq_2q_3…q_n) ∈ X_k, we have d(u, v) ≥ 3.
2.2. The AD algorithm
The existing solution to the problem of solving a linear system of equations on S_n, which we call the AD algorithm, is presented in [3] by Al-Ayyoub and Day.
The AD algorithm consists of two phases: the matrix decomposition (or triangularization) phase and the back-substitution phase. Only the matrix decomposition phase is considered in [3]. In this section, we briefly discuss the matrix decomposition phase of [3] and then present an efficient back-substitution phase to go along with it, so that the AD algorithm provides a complete solution to a linear system of equations.

Two cyclic matrix distribution techniques, namely the star cyclic matrix distribution (SCMD) and the linear array cyclic matrix distribution (LCMD), have been proposed in [3]. These matrix distribution techniques distribute the matrix over S_n in a cyclic fashion. The cyclic matrix distribution techniques offer better load balancing at the expense of increased communication cost. Since all the elements of any row of the matrix reside in a single (n − 1)-dimensional substar in SCMD, SCMD is used for row communication. Since all the elements of any column of the matrix are connected by a linear array in LCMD, LCMD is used for column communication. A matrix in SCMD can be converted to LCMD, and vice versa, in a single communication step.

We now briefly describe the two phases of the AD algorithm.
2.2.1. The matrix decomposition phase
The matrix decomposition phase of the AD algorithm, presented in [3], requires N − 1 steps to decompose a matrix of order N to an upper triangular form. Initially, the known vector b is appended to the N × N matrix A to obtain the N × (N + 1) augmented matrix A|b. The matrix A|b is then distributed on S_n using the cyclic matrix distribution techniques. At the kth step, the task sequence performed by the processors can be informally given as follows.
1. All the processors containing the pivot column elements perform partial pivoting. At the end of this step, all the pivot column processors contain the value of the pivot element.

2. The processors containing the pivot row elements concurrently broadcast the pivot row element values along the respective columns.

3. The processors containing the pivot column elements concurrently compute the multiplier values for each row and then concurrently broadcast these multiplier values along the respective rows.

4. All the processors which contain neither pivot row elements nor pivot column elements use the multiplier values and the pivot row element values to update the values of the elements present in them.
The above sequence of tasks is repeated N − 1 times, at the end of which the linear system Ax = b is transformed into Ux = c, where U is an upper triangular matrix. The solution vector x is obtained by
back substitution of vector c in the matrix U. A more formal discussion of the above matrix decomposition phase can be found in [3].
2.2.2. The back-substitution phase
The algorithm presented in [3] gives only the matrix decomposition phase of the AD algorithm, which transforms the linear system Ax = b into Ux = c, where U is an upper triangular matrix. Our algorithm, which we describe in a later section, provides a complete solution to a linear system of equations. Since we evaluate the performance of our algorithm against the AD algorithm, the AD algorithm should also provide a complete solution. This necessitates that a back substitution of the vector c in the matrix U follow the matrix decomposition phase in the AD algorithm. In this section, we develop an efficient star-based back-substitution phase to go along with the matrix decomposition phase of the AD algorithm. The tasks involved in the back-substitution phase of the AD algorithm are outlined below.
In the above task listing, we assume that the reduced upper triangular matrix U = {u_{i,j} | 1 ≤ i, j ≤ N}, the reduced known vector c = {c_i | 1 ≤ i ≤ N}, and the unknown vector x = {x_i | 1 ≤ i ≤ N}. We also assume that the results after the matrix decomposition phase are still in place in the respective processors. In the above task listing, after a new x_k ∈ x is calculated, its value is used to update the c vector so that the next variable x_{k−1} can be calculated.
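The column-oriented back substitution described above (compute x_k, then fold its value into c so that x_{k−1} can be computed next) can be rendered sequentially as follows; this is a sketch of ours, whereas in the AD algorithm the c-updates for a column of U are performed in parallel:

```python
import numpy as np

def back_substitute(U, c):
    """Column-oriented back substitution: once x_k is found, the value
    of the c vector is updated so that the next variable x_{k-1} can be
    calculated. U is upper triangular with nonzero diagonal."""
    N = len(c)
    c = c.astype(float).copy()
    x = np.empty(N)
    for k in range(N - 1, -1, -1):
        x[k] = c[k] / U[k, k]
        c[:k] -= U[:k, k] * x[k]   # fold x_k into the remaining right-hand side
    return x
```

Each iteration touches only one column of U, which is exactly the unit of work distributed over the processors in the star-based phase.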
After identifying the various tasks in the back-substitution phase, these tasks have to be distributed among the different processors of S_n. We give below the algorithm executed by the processor u = (p_1p_2…p_n) of S_n during the back-substitution phase. The matrix is initially distributed using the SCMD distribution. The description of the notation used in the algorithm below is as follows.
* The processor u in SCMD is denoted by P_{R,C}, where P_{R,C} is the processor at row R and column C in the n × (n − 1)! processor grid representation of S_n used in SCMD.
* The processor u in LCMD is denoted by P_{R̄,C̄}, where P_{R̄,C̄} is the processor at row R̄ and column C̄ in the n × (n − 1)! processor grid representation of S_n used in LCMD.
* l_R is the maximum integer such that R + l_R·n ≤ N.
* The procedure broadcast_linear_array performs broadcasting in a linear array in LCMD.
* The procedure exchange_submatrices is used to exchange submatrices between P_{R,C} and P_{R̄,C̄} (i.e., between SCMD and LCMD) and vice versa.
A detailed explanation of the above notation can be found in [3]. It is to be noted that the above notational description is valid for this section only. We redefine the above notation in later sections for use in our algorithm.
The above back-substitution algorithm consists of N steps. At the kth step, first the processors containing the (k + 1)th column of matrix U send in parallel the updated vector c to the processors containing the kth column of U. Then the processor containing u_{k,k} calculates x_k and broadcasts it to all the processors containing the kth column of U. These processors use the value of x_k to update the elements of vector c in parallel. At the end of the N steps of the back-substitution phase, the solution to the linear system is obtained.
With the addition of the back-substitution phase, the AD algorithm is completely in place. In the next section, we present our algorithm for the problem of solving a linear system of equations on S_n. We then demonstrate the superior performance of our algorithm by comparing it with the AD algorithm.
3. Our algorithm for the star graph
In this section, we present our star-graph-based parallel algorithm to solve the above-mentioned problem. Our algorithm is based on the SGE algorithm. We first briefly describe the SGE algorithm as given in [6]. Then we present efficient non-cyclic matrix distribution techniques for distributing a matrix on S_n. We then present our algorithm, which makes use of these matrix distribution techniques to reduce communication overhead in matrix operations, thus providing a faster solution to the problem considered here.
3.1. The SGE algorithm
It is well known that, in Ax = b, the value of x_i (i = 1, 2, …, N) depends on the value of x_j (j = 1, 2, …, N and j ≠ i), indicating the (N − 1)th level dependency. Hence, the influence of (N − 1) other unknowns has to be unraveled to find one solution. In the GE algorithm, the value of x_i (i = 1, 2, …, N) is found by eliminating its dependency on x_k (k < i) in the matrix decomposition (or triangularization) phase and on x_k (k > i) in the back-substitution phase.

Two linear systems are said to be equivalent provided that their solution sets are the same. Accordingly, the given set of N linear equations can be replaced by two sets of equivalent linearly independent equations with half the number of unknowns, by eliminating N/2 variables each in the forward (left to right) and backward (right to left) directions. This process of replacing the given set of linearly independent equations by two sets of equivalent equations with half the number of variables is applied successively in the SGE algorithm. The algorithm involves the following steps to find the solution vector x in the equation Ax = b [6].
1. First, we form an augmented matrix A|b (i.e., the b-vector is appended as the (N + 1)th column) of order N × (N + 1). This matrix is duplicated to form A0|b0 and b1|A1 (where the b-vector is appended to the left of the coefficient matrix A), with A0 and A1 being the same as the coefficient matrix A, and similarly, b0 and b1 being the same as the b-vector. Note that the b-vector is appended to the left or right of the coefficient matrix only for programming convenience.

2. Using the GE method, we triangularize A0|b0 in the forward direction to eliminate the subdiagonal elements in columns 1, 2, …, N/2, reducing its order to N/2 × (N/2 + 1) (ignoring the columns eliminated and the corresponding rows). Concurrently, we triangularize b1|A1 in the backward direction to eliminate the superdiagonal elements in columns N, N − 1, …, N/2 + 1, reducing the order of b1|A1 also to N/2 × (N/2 + 1) (again ignoring the columns eliminated and the corresponding rows). With this, A0|b0 may be treated as a new augmented matrix with columns and rows N/2 + 1, N/2 + 2, …, N and the modified b-vector appended to its right, and similarly b1|A1 as a new augmented matrix with columns and rows 1, 2, …, N/2 and the modified b-vector appended to its left.

3. We duplicate the reduced augmented matrices A0|b0 to form A00|b00 and b01|A01, and b1|A1 to form A10|b10 and b11|A11 (each of these duplicated matrices will be of the same order, N/2 × (N/2 + 1)). We now triangularize A00|b00 and A10|b10 in the forward direction and b01|A01 and b11|A11 in the backward direction through N/4 columns using the GE method, thus reducing the order of each of these matrices to half of its original size, i.e., N/4 × (N/4 + 1). Note that the above four augmented matrices are reduced in parallel.

4. We continue this process of halving the size of the submatrices using GE and doubling the number of submatrices log N times² so that we end up with N submatrices, each of order 1 × 2. The modified b-vector part, when divided by the modified A matrix part in parallel, gives the complete solution vector x.
The solution of Ax = b using the SGE algorithm is shown in Fig. 2 in the form of a binary tree for N = 8.
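The halving-and-duplication process of steps 1-4 can be sketched sequentially; this is our illustration of the SGE idea, not the paper's parallel listing, and pivoting is omitted for brevity, so a system whose pivots stay nonzero is assumed:

```python
import numpy as np

def sge_solve(A, b):
    """Sequential sketch of SGE: duplicate the augmented system and
    eliminate half of the unknowns in each copy (forward in one,
    backward in the other), repeating until every subsystem is of
    order 1 x 2, then divide rhs by coefficient."""
    N = len(b)
    x = np.empty(N)
    # Work items: (E, idx) with E an m x (m+1) augmented matrix over
    # the unknowns x[idx]; the last column is the right-hand side.
    E0 = np.hstack([A.astype(float), b.reshape(-1, 1).astype(float)])
    work = [(E0, list(range(N)))]
    while work:
        E, idx = work.pop()
        m = len(idx)
        if m == 1:
            x[idx[0]] = E[0, 1] / E[0, 0]  # order 1 x 2: divide rhs by A part
            continue
        h = m // 2
        Ef, Eb = E.copy(), E.copy()
        for k in range(h):                 # forward: eliminate idx[:h]
            for i in range(k + 1, m):
                Ef[i] -= (Ef[i, k] / Ef[k, k]) * Ef[k]
        for k in range(m - 1, h - 1, -1):  # backward: eliminate idx[h:]
            for i in range(k):
                Eb[i] -= (Eb[i, k] / Eb[k, k]) * Eb[k]
        # Keep the reduced halves, dropping eliminated rows and columns.
        work.append((np.hstack([Ef[h:, h:m], Ef[h:, m:]]), idx[h:]))
        work.append((np.hstack([Eb[:h, :h], Eb[:h, m:]]), idx[:h]))
    return x
```

Run sequentially this does more arithmetic than plain GE; the point of SGE is that the two halves at every level of the binary tree of Fig. 2 are independent and can be reduced in parallel.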
3.2. Non-cyclic matrix distribution on the star graph
The distribution of matrix elements among the processors of a processor configuration is a very important factor for effective parallel matrix computation. The matrix distribution techniques should allow effective communication during the various matrix operations. In particular, parallel broadcasting across rows and columns should be effectively supported by these techniques. Such techniques are easily achievable for the mesh and hypercube architectures. An N × N matrix can be distributed on an n × n mesh (n < N) by dividing the N rows (columns) of the matrix into n row groups (column groups) such that each row group (column group) contains N/n consecutive rows (columns). The submatrix formed by the intersection of the ith row group and the jth column group can be allotted to the processor at location (i, j) of the mesh. In a hypercube of dimension n, matrix distribution can be done by partitioning the n-bit binary addresses into 2^{n/2} disjoint subcubes, each of dimension n/2. Thus, matrix elements can be distributed over these disjoint subcubes such that elements of the same row (column) reside in the same subcube.
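The mesh block distribution just described reduces to simple index arithmetic; a sketch of ours assuming n divides N (the helper name `mesh_owner` is not from the paper):

```python
def mesh_owner(i, j, N, n):
    """Owner of matrix element (i, j) (1-based) under the block
    distribution on an n x n mesh: rows and columns are split into n
    groups of N/n consecutive indices, and the intersection of row
    group r and column group c lives on mesh processor (r, c)."""
    blk = N // n                      # assumes n divides N
    return ((i - 1) // blk + 1, (j - 1) // blk + 1)
```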
Matrix distribution on the star is not as obvious as on the mesh and the hypercube. Since it is apparently unachievable to define a single distribution that allows us to use star-based broadcasting in both the row and column directions, we define two non-cyclic matrix distribution techniques: one that allows efficient row broadcasting and another that allows efficient column broadcasting. Similar cyclic matrix distribution techniques were used in [3]. The proposed non-cyclic matrix
Fig. 2. The SGE algorithm.
2 In this paper, log is the logarithm to the base 2 and ln is the natural logarithm.
distribution techniques have lower communication overhead than the cyclic distribution techniques during matrix operations, and hence they are employed in our algorithm.
Let A = {a_{i,j} | 1 ≤ i ≤ N, 1 ≤ j ≤ M} be the set of elements of a general N × M matrix to be distributed on S_n, and V the set of processors in S_n. Let m_r = N − ⌊N/n⌋·n and m_c = M − ⌊M/(n−1)!⌋·(n−1)!. First we define the following useful functions:

f(i) = ⌈i / ⌈N/n⌉⌉ if i ≤ m_r·⌈N/n⌉, and f(i) = m_r + ⌈(i − m_r·⌈N/n⌉) / ⌊N/n⌋⌉ otherwise,

and

g(j) = ⌈j / ⌈M/(n−1)!⌉⌉ if j ≤ m_c·⌈M/(n−1)!⌉, and g(j) = m_c + ⌈(j − m_c·⌈M/(n−1)!⌉) / ⌊M/(n−1)!⌋⌉ otherwise.
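In words, the first m_r processor rows receive ⌈N/n⌉ matrix rows each and the remaining rows receive ⌊N/n⌋ each (g is the same formula with M and (n − 1)! in place of N and n). A sketch of f in code (the helper name `make_block_map` is ours):

```python
def make_block_map(N, n):
    """Build the row-mapping function f: 1-based matrix row -> processor
    row. The first m_r processor rows hold ceil(N/n) rows each; the
    remaining n - m_r hold floor(N/n) each, so all N rows are covered."""
    m_r = N - (N // n) * n              # N mod n
    hi, lo = -(-N // n), N // n         # ceil(N/n), floor(N/n)
    def f(i):
        if i <= m_r * hi:
            return -(-i // hi)          # ceil(i / hi)
        return m_r + (-(-(i - m_r * hi) // lo))
    return f
```

For the 6 × 6 matrix on S_4 of Fig. 3 (N = 6, n = 4), m_r = 2, so processor rows 1 and 2 hold two matrix rows each while rows 3 and 4 hold one each.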
Definition 2. The star matrix distribution is a function SMD: A → V given by SMD(a_{i,j}) = v, such that v ∈ I_R and G_{n−1}(v) = C, where R = f(i) and C = g(j). Such a processor is denoted by P_{R,C}.
The function SMD distributes the matrix A over the n × (n − 1)! processor grid formed by the set of n substars I_1, I_2, …, I_n, each containing (n − 1)! processors. Using SMD, a processor P_{R,C} is assigned l_R = ⌈N/n⌉ consecutive subrows if it belongs to any of the substars I_1, I_2, …, I_{m_r} (i.e., R ≤ m_r), and it is assigned l_R = ⌊N/n⌋ consecutive subrows if it belongs to any of the remaining substars I_{m_r+1}, …, I_n (i.e., R > m_r). Similarly, P_{R,C} is assigned m_C = ⌈M/(n−1)!⌉ consecutive subcolumns if it belongs to any of the first m_c processor columns (i.e., C ≤ m_c), and it is assigned m_C = ⌊M/(n−1)!⌋ consecutive subcolumns if it belongs to any of the remaining processor columns (i.e., C > m_c). So, P_{R,C} is assigned a submatrix formed by l_R consecutive subrows and m_C consecutive subcolumns. Fig. 3 shows an example of distributing a 6 × 6 matrix on an S_4 using the SMD function.
Definition 3. The linear array matrix distribution is a function LMD: A → V given by LMD(a_{i,j}) = v, such that v ∈ X_{R̄} and G_{n−1}(v(n)) = C̄, where R̄ = f(i) and C̄ = g(j). Such a processor is denoted by P_{R̄,C̄}.
The function LMD distributes the matrix A over the n × (n − 1)! processor grid formed by the sets X_1, X_2, …, X_n, with each X_k containing (n − 1)! processors. Using LMD, a processor P_{R̄,C̄} is assigned l_{R̄} = ⌈N/n⌉ consecutive subrows if it belongs to any of X_1, X_2, …, X_{m_r} (i.e., R̄ ≤ m_r), and it is assigned l_{R̄} = ⌊N/n⌋ consecutive subrows if it belongs to any of X_{m_r+1}, …, X_n (i.e., R̄ > m_r). Similarly, P_{R̄,C̄} is assigned m_{C̄} = ⌈M/(n−1)!⌉ consecutive subcolumns if it belongs to any of the first m_c processor columns (i.e., C̄ ≤ m_c), and it is assigned m_{C̄} = ⌊M/(n−1)!⌋ consecutive subcolumns if it belongs to any of the remaining processor columns (i.e., C̄ > m_c). So, P_{R̄,C̄} is assigned a submatrix formed by l_{R̄} consecutive subrows and m_{C̄} consecutive subcolumns.

Fig. 4 shows an example of distributing a 6 × 6 matrix on an S_4 using the LMD function. It can be observed that in LMD, each matrix column resides in a set of n processors connected in a linear array.

The following observations about SMD and LMD can be easily verified:
1. Using SMD, elements of the same row are stored in the same S_{n−1}. Therefore, the farthest distance between
Fig. 3. Matrix distribution using the SMD on S_4.
Fig. 4. Matrix distribution using the LMD on S_4.
any pair of processors holding the elements of the same row in SMD is ⌊3(n − 2)/2⌋.
2. Applying Proposition 3 to LMD, we observe that the column elements in LMD are connected in a linear array. So, the farthest distance between any two processors holding the elements of the same column is n − 1.
3. Using Proposition 1, we see that any processor u = (p_1p_2…p_n), denoted by P_{R,C} in SMD, is connected to the processor u(n), denoted by P_{R̄,C̄} in LMD, through a link of type g_n, where R = p_n, C = G_{n−1}(u), R̄ = p_1, and C̄ = G_{n−1}(u(n)).
4. In LMD, a processor u denoted by P_{R̄,C̄} is directly connected to a processor v denoted by P_{R̄+1,C̄} with a link of type g_{a(R̄+1)} in S_n, where a(i) is the position occupied by the symbol i in u. Similarly, a processor P_{R̄,C̄} is directly connected to a processor P_{R̄−1,C̄} with a link of type g_{a(R̄−1)} in S_n.
Since the function SMD distributes matrix rows over n disjoint (n − 1)-dimensional substars, simultaneous subcolumn broadcasts are possible in SMD. In LMD, the matrix columns are distributed over (n − 1)! disjoint sets of processors, where each set of processors forms a linear array. Therefore, subrow broadcasts can be performed simultaneously in LMD. So we see that LMD allows for efficient row broadcasts and SMD allows for efficient column broadcasts. Also, switching between SMD and LMD requires a single submatrix exchange step. A group of consecutive rows or consecutive columns resides in the same processor in SMD and LMD. This reduces the communication overhead involved in performing matrix operations on consecutive rows or columns, and hence we employ SMD and LMD in our algorithm.
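The grid indexing behind SMD (Definition 2) can be checked with a small enumeration of ours: placing each processor u of S_n at row R = p_n (its substar I_R) and column C = G_{n−1}(u) fills the n × (n − 1)! grid with exactly one processor per cell.

```python
from itertools import permutations
from math import factorial

def rank(perm):
    # Ranking function G_n from Definition 1 (1-based result).
    n = len(perm)
    if n == 1:
        return 1
    rest = [p - 1 if p > perm[-1] else p for p in perm[:-1]]
    return (perm[-1] - 1) * factorial(n - 1) + rank(rest)

def smd_grid(n):
    """The n x (n-1)! processor grid used by SMD: processor u sits at
    row R = p_n (the substar I_R it ends with) and column C = G_{n-1}(u)."""
    grid = {}
    for u in permutations(range(1, n + 1)):
        R = u[-1]
        rest = [p - 1 if p > u[-1] else p for p in u[:-1]]
        C = rank(rest)                  # G_{n-1}(u)
        assert (R, C) not in grid       # exactly one processor per cell
        grid[(R, C)] = u
    return grid
```

This bijection between processors and grid cells is what lets SMD (and, via u(n), LMD) treat S_n as an n × (n − 1)! grid for distribution purposes.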
3.3. Our algorithm
In this section, we present our algorithm for solving a linear system of equations on the star graph. Our algorithm is based on the SGE algorithm and provides greater concurrency than the AD algorithm. Furthermore, our algorithm employs the communication-efficient SMD and LMD distribution techniques. These two features of our algorithm greatly enhance its performance and make it an attractive method for the parallel solution of linear equations on the star graph.
Initially, the N × 1 known vector b of the linear system Ax = b is appended to the N × N coefficient matrix A to obtain the N × (N + 1) augmented matrix E = A|b = {a_{i,j} | 1 ≤ i ≤ N, 1 ≤ j ≤ N + 1}. As a first step toward developing our parallel algorithm on S_n, we need to identify the various tasks performed on E during the course of our algorithm. We give below the steps identifying the tasks performed during our algorithm.
As shown above, our algorithm requires log N stages to solve a system of N linear equations. The current stage of the algorithm is denoted by s, 1 ≤ s ≤ log N. At stage s, there are 2^{s−1} augmented submatrices, each of size N/2^{s−1} × (N/2^{s−1} + 1), which have to be reduced. In the above task listing, the submatrix currently being reduced in the forward direction is denoted by E^s_num and the submatrix currently being reduced in the backward direction is denoted by E′^s_num, where E^s_num = {a_{i,j} | b < i < e, b < j ≤ e} and E′^s_num = {a′_{i,j} | b < i < e, b < j ≤ e}. The variables 'b' and 'e' denote the beginning and ending matrix rows (and columns) of E^s_num (and E′^s_num), respectively. The reduction of E^s_num and E′^s_num during stage s of our algorithm requires N/2^s steps. The forward tasks, backward tasks, duplication tasks, and division tasks are represented by F, B, D, and P, respectively.
After identifying the tasks as above, we need to distribute them among the processors of S_n. The augmented matrix E is distributed on S_n using the non-cyclic matrix distribution techniques presented above. Our algorithm supports partial pivoting to ensure numerical stability. In partial pivoting, the set of processors holding the pivot subcolumns finds the maximum element in the subcolumn and exchanges subrows such that the maximum element becomes the pivot element.
At any step in stage s, a processor reducing E^s_num and E′^s_num can be in any one of the following states:

* broadcasting a pivot subrow or multiplier subcolumn,
* eliminating a submatrix,
* involved in determining the new pivot row,
* exchanging submatrices,
* waiting for a pivot subrow and/or a multiplier subcolumn, or
* idle, holding final matrix elements.

Since similar steps are executed for both forward and backward elimination, the states given above apply to both the backward and forward elimination steps; i.e., corresponding to each state mentioned above, we can have one forward elimination state and one backward elimination state.
Now, we specify the tasks executed by each processor of S_n during the course of our algorithm. The algorithm presented below outlines the sequence of operations performed by the processor u = (p_1p_2…p_n) of S_n. The matrix E is initially distributed using the LMD. In stage s, we denote the N/2^{s−1} × (N/2^{s−1} + 1) augmented submatrix which is used for forward elimination, and which contains the l_R × m_C submatrix allotted to u in SMD, by E^s_num, where 0 ≤ num < 2^{s−1}. Similarly, we denote the forward elimination N/2^{s−1} × (N/2^{s−1} + 1) augmented submatrix which contains the l_{R̄} × m_{C̄} submatrix allotted to u in LMD, by Ē^s_num, where 0 ≤ num < 2^{s−1}. The copies of E^s_num and Ē^s_num which are used for backward elimination are denoted by E′^s_num and Ē′^s_num, respectively. For the processor u, l_R (m_C) and l_{R̄} (m_{C̄}) denote the number of rows (columns) in SMD and LMD, respectively. Furthermore, b_R (b_C) and b_{R̄} (b_{C̄}), respectively, denote the starting row (column) in SMD and LMD of the submatrix assigned to the processor u. The procedures broadcast_linear_array and broadcast_star perform broadcasting in the linear array and in the star graph, respectively. The procedure exchange_submatrices is used to exchange submatrices between P_{R,C} and P_{R̄,C̄} (i.e., between SMD and LMD) and vice versa.
The Algorithm
It is to be noted that the number of stages in the above algorithm executed by each processor of S_n is ⌈1 + log(n − 1)!⌉ and not log N as given in the task listing. The reason for this is as follows: in both the non-cyclic matrix distribution techniques, S_n is viewed as a grid with n processor rows and (n − 1)! processor columns. So, at the end of ⌈1 + log(n − 1)!⌉ stages of our algorithm, the submatrices would become small enough that all computations on each of them would be performed sequentially on a single processor. Since this sequential execution would not exploit the inherent concurrency of our algorithm, we employ the sequential GE algorithm on these small submatrices during the remaining stages and obtain the final solution.
In the task listing, the forward partial pivoting task is denoted by ⟨F^s_{k,k}(num)⟩ and the backward partial pivoting task by ⟨B^s_{k,k}(num)⟩. Since both of these are executed concurrently, they are together denoted by a single compound task ⟨F^s_{k,k}(num) + B^s_{k,k}(num)⟩ in the above algorithm listed for the processor u. The task ⟨F^s_{k,k}(num) + B^s_{k,k}(num)⟩, which performs partial pivoting in our algorithm, is outlined below.
The set of processors holding the elements of the forward elimination pivot subcolumn performs the forward partial pivoting. These processors lie in a linear array (because of the LMD distribution) and find the row, $i_{\mathrm{max1}}$, of the maximum element of the subcolumn. The set of processors holding the pivot subrow and those holding the subrow $i_{\mathrm{max1}}$ then exchange the relevant subrows. In a similar fashion, the set of processors holding the elements of the backward elimination pivot subcolumn concurrently performs backward partial pivoting. The procedures swap_subrows and interchange_subrows are used to swap rows within the same processor and between two different processors in a linear array, respectively.
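Collapsed onto a single processor, the row-exchange logic of forward partial pivoting amounts to the sketch below. The function name is ours; the distributed subrow exchanges performed by swap_subrows and interchange_subrows are reduced here to a plain row swap:

```python
def forward_pivot(A, k):
    """Forward partial pivoting at step k: find the row imax1 holding the
    maximum-magnitude element of the pivot subcolumn (rows k..N-1 of
    column k) and exchange it with the pivot row k."""
    imax1 = max(range(k, len(A)), key=lambda i: abs(A[i][k]))
    if imax1 != k:
        A[k], A[imax1] = A[imax1], A[k]   # exchange the two rows

A = [[1.0, 2.0], [4.0, 3.0]]
forward_pivot(A, 0)
print(A[0][0])  # 4.0 -- the largest element of column 0 is now the pivot
```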
4. Performance analysis
In this section, we evaluate the performance of our star-graph-based algorithm presented above. We compare it with the AD algorithm using the time estimation model employed in [3]. In the AD algorithm, the matrix is distributed using the cyclic SCMD and LCMD distribution techniques [3], while in our algorithm, the matrix is distributed using the non-cyclic SMD and LMD distribution techniques. The matrix size is $N \times (N+1)$. In all of the above techniques, the $P$ processors in $S_n$ are arranged as an $R \times C$ grid, where $P = n!$, $R = n$, and $C = (n-1)!$.
In any GE-based algorithm using the broadcasting approach, communication and computation can be overlapped at two levels: intrastep and interstep. In both the AD algorithm and our algorithm, overlapping within each step does not reduce the overall execution time, whereas interstep overlapping does. The effect of interstep overlapping on each algorithm is described in its respective analysis below.
4.1. Analysis of the AD algorithm
As mentioned before, the AD algorithm consists of two phases, namely the matrix-decomposition phase and the back-substitution phase. The estimated execution time of the AD algorithm is the sum of the estimated execution times of the individual phases. The performance analysis of both phases is given below.
4.1.1. Matrix-decomposition phase
The matrix-decomposition phase of the AD algorithm requires $N - 1$ factorization steps to decompose a matrix $A$. In the $k$th step, the following tasks are performed [3].
* $\langle T_{k,k} \rangle$.
* $C$ simultaneous broadcasts in the $w_c$ sets ($1 \le c \le C$) with message length equal to $\frac{N+1}{C}$, where $w_c$ denotes the set of processors in the $R \times C$ grid along the $c$th processor column.
* $\frac{N}{R}$ sequential $\langle T_{i,k} \rangle$.
* $R$ simultaneous broadcasts in the $r_r$ sets ($1 \le r \le R$) with message length equal to $\frac{N}{R}$, where $r_r$ denotes the set of processors in the $R \times C$ grid along the $r$th processor row.
* $\frac{N(N+1)}{P}$ sequential $\langle T_{i,j} \rangle$.
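In sequential form, one factorization step combines these task types as in the sketch below (our own naming; we assume the standard GE roles, i.e., $\langle T_{k,k} \rangle$ prepares the pivot, $\langle T_{i,k} \rangle$ forms a multiplier, and $\langle T_{i,j} \rangle$ updates one trailing element, which may differ in detail from the exact task definitions of [3]):

```python
def ge_step(A, k):
    """One GE factorization step on an augmented matrix, in sequential
    form. Assumed task roles: <T_kk> prepares the pivot, <T_ik> forms the
    multiplier for row i, <T_ij> updates one trailing element."""
    pivot = A[k][k]                      # <T_kk>
    for i in range(k + 1, len(A)):
        m = A[i][k] / pivot              # <T_ik>: multiplier for row i
        A[i][k] = 0.0
        for j in range(k + 1, len(A[i])):
            A[i][j] -= m * A[k][j]       # <T_ij>: trailing update

A = [[2.0, 1.0, 3.0], [4.0, 5.0, 6.0]]
ge_step(A, 0)
print(A[1])  # [0.0, 3.0, 0.0]
```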
Thus, the maximum execution time of a single factorization step, $t_m$, is given by

$$t_m = f(T_{k,k}) + f(w_c) + \frac{N}{R} f(T_{i,k}) + f(r_r) + \frac{N(N+1)}{P} f(T_{i,j}),$$

where $f(T)$ indicates the time taken to execute the task $T$.
Since locating the $k$th pivot row and computing the $k$th multipliers column can start as soon as the $(k-1)$th multipliers column is received, a new step can start every

$$t_b = f(T_{k,k}) + f(w_c) + \frac{N}{R} f(T_{i,k}) + f(b_r) + \frac{N(N+1)}{P} f(T_{i,j}),$$

where $f(b_r)$ is the time that passes until the set of processors holding the next multipliers column receives the previous multipliers column, $0 \le f(b_r) \le f(r_r)$. So, the estimated execution time $t_{decomp}$ for the triangularization phase is

$$t_{decomp} = (N - 2) t_b + t_m.$$
4.1.2. Back-substitution phase
The back-substitution phase of the AD algorithm comprises $N$ steps. In the $k$th step, the following tasks are performed:

* $R$ simultaneous sends of $c$ vector elements from the processors containing the $(k+1)$th matrix column elements to the processors containing the $k$th matrix column elements, with message length equal to $\frac{N}{R}$. We denote this task by $r'$.
* $\langle T'_{k,k} \rangle$.
* A single broadcast of $x_k$ along the processor set containing the $k$th matrix column elements, with message length equal to 1. We denote this task by $w'$.
* $\frac{N}{R}$ sequential $\langle T'_{i,k} \rangle$.
So, the maximum execution time of a single step of the back-substitution phase, $t'_m$, is given by

$$t'_m = f(r') + f(T'_{k,k}) + f(w') + \frac{N}{R} f(T'_{i,k}).$$
Since computing the next variable $x_{N-k-1}$ can be started as soon as the present variable $x_{N-k}$'s value has updated $c_{N-k-1}$, a new step can start every

$$t'_b = f(r') + f(T'_{k,k}) + f(b') + \frac{N}{R} f(T'_{i,k}),$$

where $f(b')$ is the time taken for the value of $x_{N-k}$ to reach the processor containing $c_{N-k-1}$, $0 \le f(b') \le f(w')$. Hence, the estimated time $t_{back}$ of the back-substitution phase is

$$t_{back} = (N - 1) t'_b + t'_m.$$
Therefore, the total estimated execution time of the AD algorithm is

$$t_{AD} = t_{decomp} + t_{back}.$$
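On a single processor, the $N$ steps of the back-substitution phase reduce to the familiar loop below (a sequential sketch with our own function name; the vector $c$ plays the role of the running right-hand side that each step's sends keep up to date):

```python
def back_substitute(U, c):
    """Solve Ux = c for an upper-triangular U. Step k computes x_k from
    the current c_k, then each earlier row folds x_k into its c entry,
    mirroring the per-step sends and broadcasts of the parallel phase."""
    n = len(c)
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = c[k] / U[k][k]
        for i in range(k):
            c[i] -= U[i][k] * x[k]   # update the running right-hand side
    return x

U = [[2.0, 1.0], [0.0, 4.0]]
print(back_substitute(U, [5.0, 8.0]))  # [1.5, 2.0]
```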
4.2. Analysis of our algorithm
As mentioned before, our algorithm consists of $\lceil 1 + \log (n-1)! \rceil$ stages. In stage $s$, there are $\frac{N}{2^s}$ factorization steps. At step $k$ of stage $s$, the following tasks are performed:

* $\langle F^s_{k,k}(\mathit{num}) + B^s_{k,k}(\mathit{num}) \rangle$.
* Two $C$ simultaneous broadcasts, one in the forward direction in the $w_c$ sets ($1 \le c \le C$) and the other in the backward direction in the $w'_c$ sets ($1 \le c \le C$), with message length equal to $\frac{N+1}{C}$ in each, where $w_c$ and $w'_c$ denote the sets of processors along the $c$th column in the $R \times C$ processor grid.
* Concurrent execution of $\frac{N}{R}$ sequential $\langle F^s_{i,k}(\mathit{num}) \rangle$ and $\frac{N}{R}$ sequential $\langle B^s_{i,k}(\mathit{num}) \rangle$.
* Two $R$ simultaneous broadcasts, one in the forward direction in the $r_r$ sets ($1 \le r \le R$) and the other in the backward direction in the $r'_r$ sets ($1 \le r \le R$), with message length equal to $\frac{N}{R}$, where $r_r$ and $r'_r$ denote the sets of processors along the $r$th row in the $R \times C$ processor grid.
* Sequential execution of $\frac{N(N+1)}{P}$ sequential $\langle F^s_{i,j}(\mathit{num}) \rangle$ and $\frac{N(N+1)}{P}$ sequential $\langle B^s_{i,j}(\mathit{num}) \rangle$.
Hence, the maximum execution time for one step of stage $s$ in our algorithm is

$$t^s_m = f(F^s_{k,k}(\mathit{num}) + B^s_{k,k}(\mathit{num})) + f(w_c + w'_c) + \frac{N}{R} f(F^s_{i,k}(\mathit{num})) + f(r_r + r'_r) + \frac{2N(N+1)}{P} f(F^s_{i,j}(\mathit{num})).$$
Since locating the $k$th pivot row and computing the $k$th multipliers column can be started as soon as the $(k-1)$th multipliers are received, a new step can start every

$$t^s_b = f(F^s_{k,k}(\mathit{num}) + B^s_{k,k}(\mathit{num})) + f(w_c + w'_c) + \frac{N}{R} f(F^s_{i,k}(\mathit{num})) + f_k(b_r + b'_r) + \frac{2N(N+1)}{P} f(F^s_{i,j}(\mathit{num})),$$
where $f_k(b_r + b'_r)$ is the time that passes until the set of processors holding the next multiplier column receives the previous multiplier column values. Since in SMD and LMD consecutive columns can reside in the same processor, $f_k(b_r + b'_r)$ is zero in such cases; it is non-zero only when consecutive columns reside in different processors. Consequently, interstep overlap is much higher in our algorithm than in the AD algorithm. This reduces the communication overhead in our algorithm and hence improves its performance.
Also, at the end of the $\lceil 1 + \log (n-1)! \rceil$ stages, sequential GE is performed on the reduced $\frac{N}{C} \times \frac{N}{C}$ submatrices. Let the time taken by the sequential GE be $t_{seq}$. Hence, the total estimated time taken by our algorithm is

$$t_{OUR} = \sum_{s=1}^{\lceil 1 + \log (n-1)! \rceil} \left( t^s_m + \sum_{k=2}^{N/2^s} t^s_b \right) + t_{seq}.$$
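Given per-step time estimates, the total above can be evaluated mechanically (a sketch; `t_m` and `t_b` are placeholder callables of ours standing in for $t^s_m$ and $t^s_b$, and the logarithm is again assumed to be base 2):

```python
import math

def t_our(n, N, t_m, t_b, t_seq):
    """Total estimated time: for each of the ceil(1 + log2((n-1)!))
    stages, one maximal step t_m(s) plus N/2**s - 1 pipelined steps
    t_b(s), followed by the final sequential GE time t_seq."""
    stages = math.ceil(1 + math.log2(math.factorial(n - 1)))
    total = sum(t_m(s) + sum(t_b(s) for _ in range(2, N // 2**s + 1))
                for s in range(1, stages + 1))
    return total + t_seq

# Toy check with constant per-step times (4 stages for S_4, N = 32)
print(t_our(4, 32, lambda s: 1.0, lambda s: 0.5, 2.0))  # 19.0
```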
4.3. Comparison of the two algorithms
In estimating the execution time of our algorithm, we use the same communication model employed in [3]. This facilitates a ready and direct comparison of our algorithm with that of [3]. In this model, the time taken for communication of a message of length $M$ is $t_{comm}(M) = b + aM$, where $b$ is the message latency and $a$ is the unit transmission cost. Also, the time taken for broadcasting a message of length $M$ in a graph of diameter $d$ is $d(b + aM)$. The time taken for two simultaneous broadcasts, each with a message of length $M$, is taken to be $2d(b + aM)$. The time taken for each computation, $t_{comp}$, is taken to be 0.25 ns, and the parameters $a$ and $b$ are set to 8 ns and 30 μs, respectively.
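The cost model translates directly into code (a sketch with our own function names; we read the latency as 30 μs, since the unit is ambiguous in the extracted text, and the example uses the star graph diameter $\lfloor 3(n-1)/2 \rfloor$):

```python
def t_comm(M, a=8e-9, b=30e-6):
    """Time to communicate a message of length M: latency b plus a per unit."""
    return b + a * M

def t_broadcast(M, d, a=8e-9, b=30e-6):
    """Broadcast of a length-M message in a graph of diameter d."""
    return d * t_comm(M, a, b)

def t_two_broadcasts(M, d, a=8e-9, b=30e-6):
    """Two simultaneous broadcasts are charged twice a single broadcast."""
    return 2 * t_broadcast(M, d, a, b)

# Diameter of S_5 is floor(3 * (5 - 1) / 2) = 6
print(t_broadcast(1000, 6))   # 6 * (30e-6 + 8e-9 * 1000) seconds
```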
Using the above model, we plot the estimated execution times of the AD algorithm and our algorithm on $S_n$ in Figs. 5–8 for different values of $n$, the dimension of the star graph. The x-axis in the plots gives the problem size, i.e., the number of equations in the linear system, and the y-axis gives the estimated execution time in nanoseconds. The figures suggest that our algorithm performs better than the AD algorithm over a range of problem sizes. This improvement in performance is due to the higher concurrency of our algorithm as compared to the AD algorithm and to the lower communication overheads of our non-cyclic matrix distribution methods, SMD and LMD, as compared to the cyclic matrix distribution techniques, SCMD and LCMD.
Fig. 5. Estimated execution times for n = 5.
Fig. 6. Estimated execution times for n = 6.
Fig. 7. Estimated execution times for n = 7.
Fig. 8. Estimated execution times for n = 8.
5. Conclusions
In this paper, we presented a parallel algorithm for solving a system of linear equations on the star graph $S_n$. To this end, we developed SMD and LMD, non-cyclic matrix distribution techniques on the star graph. We evaluated our algorithm against the AD algorithm presented in [3] and demonstrated its superior performance. The proposed algorithm performs better for two main reasons, namely the increased concurrency of our algorithm and the lower communication overhead of the non-cyclic matrix distributions. However, the algorithm has a few shortcomings. It takes twice the memory of the AD algorithm, and it supports only partial pivoting. Since complete pivoting is rarely used, the latter shortcoming is not a problem in practice.
References
[1] S.B. Akers, D. Harel, B. Krishnamurthy, The star graph: an
attractive alternative to the n-cube, Proceedings of the International
Conference on Parallel Processing, 1987, pp. 393–400.
[2] S.B. Akers, B. Krishnamurthy, A group-theoretic model for
symmetric interconnection networks, IEEE Trans. Comput. 38 (4)
(1989) 555–566.
[3] A. Al-Ayyoub, K. Day, Matrix decomposition on the star graph,
IEEE Trans. Parallel Distrib. Systems 8 (8) (1997) 803–812.
[4] N. Bagherzadeh, M. Dowd, N. Nassif, Embedding an arbitrary
binary tree into the star graph, IEEE Trans. Comput. 45 (4)
(1996) 475–481.
[5] N. Bagherzadeh, N. Nassif, S. Latifi, A routing and broadcasting
scheme in faulty star graphs, IEEE Trans. Comput. 42 (11) (1993)
1398–1403.
[6] K.N.B. Balasubramanya Murthy, C. Siva Ram Murthy, Gaussian
elimination based algorithm for solving linear equations on mesh
connected processors, IEE Proc. Comp. Digit. Tech. 143 (4)
(1996) 407–412.
[7] S. Bettayeb, B. Cong, M. Girou, I.H. Sudborough, Embedding
star networks into hypercubes, IEEE Trans. Comput. 45 (2)
(1996) 186–194.
[8] K. Day, A. Tripathi, A comparative study of topological
properties of hypercubes and star graphs, IEEE Trans. Parallel
Distrib. Systems 5 (1) (1994) 31–38.
[9] J.J. Dongarra, F.G. Gustavson, A. Karp, Implementing linear
algebra algorithms for dense matrices on a vector pipeline
machine, SIAM Rev. 26 (1) (1984) 91–112.
[10] P. Fragopoulou, S.G. Akl, A parallel algorithm for computing
Fourier transforms on the star graph, IEEE Trans. Parallel
Distrib. Systems 5 (5) (1994) 525–531.
[11] K. Gallivan, R.J. Plemmons, A.H. Sameh, Parallel algorithms for
dense linear algebra computations, SIAM Rev. 32 (1) (1990)
54–135.
[12] L. Gargano, U. Vaccaro, A. Vozella, Fault tolerant routing in the
star and pancake interconnection networks, Inform. Process. Lett.
45 (6) (1993) 315–320.
[13] D. Heller, A survey of parallel algorithms in numerical linear
algebra, SIAM Rev. 20 (4) (1978) 740–777.
[14] S. Jang-Ping, L. Wen-Hwa, C. Tzung-Shi, A broadcasting
algorithm in the star graph interconnection networks, Inform.
Process. Lett. 48 (5) (1993) 237–241.
[15] I.L. Jung, J.H. Chang, Embedding complete binary trees
in star graphs, J. Korea Inform. Sci. Soc. 21 (2) (1994)
407–415.
[16] J.S. Jwo, S. Lakshmivarahan, S.K. Dhall, Embedding of cycles
and grids in star graphs, J. Circuits Systems Comput. 1 (1) (1991)
43–74.
[17] S. Lakshmivarahan, S.K. Dhall, Analysis and Design of Parallel
Algorithms—Arithmetic and Matrix Problems, McGraw-Hill,
New York, 1990.
[18] S. Latifi, N. Bagherzadeh, Incomplete star: an incrementally
scalable network based on the star graph, IEEE Trans. Parallel
Distrib. Systems 5 (1) (1994) 97–102.
[19] A. Menn, A.K. Somani, An efficient sorting algorithm for the star
graph interconnection network, Proceedings of International
Conference on Parallel Processing 1990, Urbana-Champaign,
IL, USA, August 1990, Vol. 3, pp. 1–8.
[20] S.T. Obenaus, T.H. Szymanski, Embedding of star graphs into
optical meshes without bends, J. Parallel Distrib. Comput. 44 (2)
(1997) 97–107.
[21] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery,
Numerical Recipes in C—The Art of Scientific Computing,
Cambridge University Press, Cambridge, 1992.
[22] S. Rajasekaran, D.S.L. Wei, Selection, routing and sorting
on the star graph, J. Parallel Distrib. Comput. 41 (1) (1997)
225–234.
[23] S. Ranka, J.C. Wang, N. Yeh, Embedding meshes on the
star graph, J. Parallel Distrib. Comput. 19 (2) (1993)
131–135.
[24] Y.C. Tseng, J.P. Sheu, Toward optimal broadcast in a star graph
using multiple spanning trees, IEEE Trans. Comput. 46 (5) (1997)
593–599.