
Sparse Cholesky Factorization on a Simulated Hypercube*

Venu M. Padakanti

Department of Computer Science

Old Dominion University, Norfolk, VA 23520-0162

t, ml , i@s, odlt. cdu

Abstract

Solutions to large systems of linear equations are used in a wide range of problems. Many of these systems can be represented by symmetric sparse positive definite matrices, and the Cholesky decomposition has been proved to be one of the best methods for solving symmetric positive definite matrices. Our work deals with the parallel Cholesky decomposition of such matrices and their implementation on a simulated hypercube. We decompose the solution into subtasks that have precedence relations among themselves. These precedence relations are expressed in the form of an elimination tree, where each node represents a subtask assigned to a processor for execution. The scheduling of these tasks is based on the Highest Level First policy, which has been proved to be an optimal one in the case of static and deterministic scheduling. A flexible hypercube simulator is developed in the C language under the UNIX environment. The simulation studies indicate that this algorithm is very efficient for large and well ordered matrices.

1 Introduction

Applications such as structural analysis, circuit simulation, aerodynamics, etc., often involve large systems of linear equations. These systems are usually formulated in terms of matrices, and their solutions are obtained by triangular decomposition techniques. Some of these matrices are symmetric and positive definite.

* Partially funded by National Science Foundation under grant #IRI-9108610


To obtain solutions for this subclass of matrices, the Cholesky decomposition technique has proven to be an optimal method with respect to time and space complexity [1, 10]. As solutions involving large matrices are well known to be computationally intensive, a sequential implementation becomes a bottleneck.

We exploit the fact that solutions to sparse matrices offer a large scope for parallelism. Our work deals with the design of a parallel Cholesky decomposition technique for sparse symmetric positive definite matrices and its implementation on a simulated hypercube. Some of the existing parallel algorithms make use of an elimination tree [2, 3], in which each node represents a subtask of the algorithm. The scheduling of these tasks is done dynamically, and the communication between the nodes is done by broadcasting. We aim at reducing the communication overhead of these methods by suitably choosing a deterministic static scheduling.

In our algorithm we determine the set of tasks to be executed and compute their approximate execution times. We obtain an elimination tree based on the precedence relations among the tasks [1]. We then assign the tasks in a bottom-up manner using the Highest Level First (HLF) policy; at each level the processor with the least sum of execution times of the tasks assigned so far is chosen for the next task. After the task assignment is done by the coordinating processor, the tasks are scheduled.

For the implementation of our algorithm we chose the hypercube architecture because of its versatility and commercial availability. This architecture in particular is suitable because of its isotropicity, scalability, symmetry and recursive nature [8].

We developed a hypercube simulator in the C language under the UNIX environment. UNIX system calls and the Inter Processor Communication (IPC) primitives were used to simulate the processors and the communication channels.

The paper is organized as follows: the triangular decomposition is introduced in Section 2.


The Cholesky decomposition and the sparse Cholesky decomposition are dealt with in Sections 3 and 4 respectively. The parallel Cholesky decomposition is described in Section 5. The hypercube simulator is described in Section 6. Conclusions and further extensions are given in Sections 7 and 8 respectively.

2 Triangular Matrix Decomposition

A system of linear equations in terms of matrix notation can be expressed as

AX = B          (1)

where A is an invertible matrix. We need to determine a solution for the vector X that satisfies equation (1). An important technique for finding solutions to systems of linear equations is the method of triangular decomposition. A matrix A of dimension n can be decomposed into a product of a lower triangular matrix L and an upper triangular matrix U, both of dimension n, as

A = LU.

Equation (1) can now be solved by using the following relations:

Y = UX          (2)

LY = B          (3)

The above equations can be efficiently solved for Y and X using the forward (backward) substitution technique.
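As an illustration of the substitution step, the following C sketch solves LY = B by forward substitution and then UX = Y by backward substitution for a dense system; the row-major, zero-based array layout and the function names are assumptions made for this sketch, not details from the paper.

#include <stddef.h>

/* Forward substitution: solve L*y = b, L lower triangular, n x n, row-major. */
void forward_subst(size_t n, const double *L, const double *b, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double s = b[i];
        for (size_t k = 0; k < i; k++)
            s -= L[i * n + k] * y[k];       /* subtract already-known terms */
        y[i] = s / L[i * n + i];
    }
}

/* Backward substitution: solve U*x = y, U upper triangular, n x n, row-major. */
void backward_subst(size_t n, const double *U, const double *y, double *x)
{
    for (size_t i = n; i-- > 0; ) {         /* last row first */
        double s = y[i];
        for (size_t k = i + 1; k < n; k++)
            s -= U[i * n + k] * x[k];
        x[i] = s / U[i * n + i];
    }
}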

Theorem 2.1 [6] An LU decomposition of a matrix is not unique. □

Defn 2.1 A decomposition of the form A = LDU, where L and U are unit triangular matrices, is called the LDU decomposition.

Theorem 2.2 [6] If A is invertible with an LU decomposition A = LU, then A has a unique LDU decomposition. □

Based on how the diagonal matrix D is viewed in the LDU decomposition, three interesting variants of the LDU decomposition arise [3, 6].

If A = (LD)U, and U is chosen to be unit upper triangular, then it is termed the Crout decomposition. If A = L(DU), and L is chosen to be unit lower triangular, then it is termed the Doolittle decomposition.

If A is symmetric and also a positive definite matrix (see Section 3 for the definition), then we can have a decomposition of the form

(LDU)^T = A^T = A = LDU

or

U^T D^T L^T = LDU,

so that U = L^T and A = LDL^T, which is the well-known Cholesky decomposition.

3 Cholesky decomposition

Defn 3.1 An n × n matrix A is said to be positive definite if, for every vector X, we have

X ≠ 0 ⇒ X^T A X > 0

Defn 3.2 An n × n matrix A is said to be positive semi-definite if, for every vector X, we have

X ≠ 0 ⇒ X^T A X ≥ 0

Positive definite matrices form an important class of matrices as they occur in various applications involving VLSI circuit simulations, aerodynamic applications and structural analysis. For such matrices the technique of Cholesky decomposition can be efficiently employed. The number of multiplications required in this decomposition, roughly n^3/6, is half of that required for either the Gaussian elimination or the Doolittle decomposition. This factorization technique can also be shown to be less subject to round-off errors [9]. This method has also been proved to be the best method for triangular decompositions [9].
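As a rough check on this count (a sketch assuming the dense column algorithm of subsection 4.1 and counting one multiplication per update of an element a_ij in the innermost loop), the number of multiplications is

    sum_{j=1..n} (j - 1)(n - j + 1) = n(n - 1)(n + 1)/6 ≈ n^3/6,

which is about half of the roughly n^3/3 multiplications required by Gaussian elimination.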

For the Cholesky decomposition we need to determine the decomposition

A = L'D(L')^T = (L'D^(1/2))(D^(1/2)(L')^T) = (L)(L)^T

where

L = (L')(D^(1/2)),

i.e., D^(1/2) denotes the diagonal matrix obtained by taking the square roots of the elements of the matrix D. To solve for the lower triangular matrix L, known as the Cholesky factor, the diagonal elements must be positive. It has been proved that this property holds [7] if the matrix A is positive definite. The Cholesky decomposition can also be constructively defined as follows [3].


for ....
    for ....
        for ....
            a_ij := a_ij - (a_ik * a_kj) / a_kk
        end
    end
end

This algorithm performs in-place operations starting with A, and at the termination of the algorithm we obtain the Cholesky factor. The loop indices of the algorithm can be chosen in any of the following orders [5]:

ijk, ikj, jik, jki, kij, kji.

Depending upon the order in which the indices are chosen, three main types of decompositions arise.

In the Row Cholesky the rows of the Cholesky factor L are computed one by one using the previously computed rows of L. The basic operation is that of solving triangular systems (also known as the bordering method).

In the Column Cholesky the columns of the Cholesky factor L are computed one by one using the previously computed columns of L. The basic operation is matrix-vector multiplication (also known as the inner product formulation of symmetric decomposition).

In the Submatrix Cholesky the submatrix modifications from columns of L are applied one by one to the remaining submatrix to be factored. The basic operation is symmetric rank-1 updating (also known as the outer product formulation of symmetric decomposition).

Work Load Patterns

Let TaskWork(t) denote the amount of work required to complete a task t. The work profile is then the graph of TaskWork(t) plotted against task t. In Cholesky factorization we assume that the amount of work for a task is proportional to the number of multiplication operations involved in the modification. This can be supported by the fact that almost all of the operations are multiplications, and the number of multiplications indicates the other types of operations as well (refer to the algorithm in subsection 4.1). It can be easily verified that [3]

TrowWork(i) = i * (i + 1) / 2
TcolWork(j) = j * (n - j + 1)
TsubWork(k) = (n - k + 1) * (n - k + 2) / 2

[Figure 1: Work Load Patterns for the various Cholesky decompositions. Three plots show Task Work against the task index for Row-Cholesky (task i), Column-Cholesky (task j) and Submatrix-Cholesky (task k); shading distinguishes the part of the matrix being modified from the part used for the modification.]

where TrowWork(i) denotes the amount of work required to complete the modification of the i-th row for the Row Cholesky method; TcolWork(j) and TsubWork(k) are correspondingly defined for the Column and Submatrix Cholesky decompositions. We observe that the areas under the three graphs in Fig. 1 are the same, but the differences in their shapes lead to different processor utilization characteristics. With Row Cholesky (Fig. 1), the relatively small tasks at the beginning enable all processors to be fully utilized in the initial stages of the execution of the algorithm. However, this leaves the larger tasks to the end, which is likely to cause a significant number of processors to become idle while other processors complete tasks involving the last few rows. Since these tasks require more time (proportional to n^2), the degradation in the overall performance is nontrivial. With Submatrix Cholesky (Fig. 1) the pattern is the opposite. Column Cholesky (Fig. 1) has the best properties of both: tasks increase and then decrease in a smooth manner, leading to good processor utilization. The Column Cholesky implementation is therefore best suited for medium-grain parallel implementation. The processor utilization is summarized in Table 1.

Table 1: Processor utilization for the decompositions

Row Cholesky          88.7%
Column Cholesky       95%
Submatrix Cholesky    88.7%

4 The Sparse Cholesky Decomposition

The process of solving large sparse positive definite systems typically involves the following steps:

1. Ordering: Find a good ordering P for A, i.e., a permutation matrix P (P^T P = I) so that PAP^T has a sparse Cholesky factor L.

2. Symbolic Factorization: Determine the structure of the Cholesky factor L of PAP^T and set up a data structure for the factor [1, 4].

3. Numerical Factorization: Place the elements of A into the data structure, as determined by the symbolic factorization, and then compute L.

4. Triangular Solution: Using the computed L, solve the triangular systems of equations LY = PB, L^T Z = Y, and then set X = P^T Z.

Of all the above stages the most computationally intensive step is that of the Numerical factorization. In this work we concentrate on this stage only.

4.1 The basic Algorithms

The Column Cholesky algorithm in the dense case is given by Algorithm 4.1 [1].

Algorithm 4.1

for j := 1 to n do
begin
    for k := 1 to (j - 1) do
        for i := j to n do
            a_ij := a_ij - (a_ik * a_kj)
    a_jj := sqrt(a_jj)
    for k := (j + 1) to n do
        a_kj := a_kj / a_jj
end
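A direct C transcription of Algorithm 4.1 might look as follows; the dense row-major storage, the use of the lower triangle only, and the error return for a non-positive pivot are assumptions of this sketch rather than details given in the paper.

#include <math.h>
#include <stddef.h>

/* Dense column Cholesky (Algorithm 4.1): overwrites the lower triangle of A
   with the Cholesky factor L.  A is n x n, row-major.  Returns 0 on success,
   -1 if a non-positive pivot is met (A not positive definite). */
int column_cholesky(size_t n, double *a)
{
    for (size_t j = 0; j < n; j++) {
        /* cmod: modify column j by every previously computed column k. */
        for (size_t k = 0; k < j; k++)
            for (size_t i = j; i < n; i++)
                a[i * n + j] -= a[i * n + k] * a[j * n + k];
        if (a[j * n + j] <= 0.0)
            return -1;
        a[j * n + j] = sqrt(a[j * n + j]);
        /* cdiv: scale the subdiagonal part of column j. */
        for (size_t k = j + 1; k < n; k++)
            a[k * n + j] /= a[j * n + j];
    }
    return 0;
}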

[Figure 2: Subtask precedence graph for Column Cholesky. The subtasks cmod(j, 1), cmod(j, 2), cmod(j, 3), ..., cmod(j, j-1) precede cdiv(j), which in turn precedes cmod(j+1, j), cmod(j+2, j), cmod(j+3, j), ..., cmod(n, j).]

Let Tcol(j) denote the task that computes the j-th column of the Cholesky factor. Each such task comprises the following subtasks.

1. cmod(j, k): Modification of column j by column k (k < j).

2. cdiv(j): Division of column j by a scalar (Algorithm 4.1). This is the step involving the division of the column elements by the term a_jj.

The basic algorithm in terms of these subtasks is shown in Algorithm 4.2.

Algorithm 4.2

for j := 1 to n do
begin
    for k := 1 to (j - 1) do
        cmod(j, k);
    cdiv(j);
end

The precedence graph for the subtasks is given in Fig. 2. The main difference between the sparse and dense versions of the algorithm stems from the fact that for a sparse matrix A, column j may no longer be modified by all columns k, k < j. Specifically, a column j is modified only by columns k for which a_jk ≠ 0, and after cdiv(j) is executed, column j is made available only to those tasks Tcol(r) for which a_rj ≠ 0. This can be easily derived from the algorithm, since in such cases the modification step of the inner for loop is no longer essential. We exploit this feature of sparse matrices to obtain an efficient parallel decomposition.
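The sparsity test just described can be folded into the cmod/cdiv formulation. The sketch below keeps the dense storage of the previous sketch and simply skips the trivial cmod's; a genuine sparse code would instead use compressed column structures, so this is only meant to illustrate the idea.

#include <math.h>
#include <stddef.h>

/* cmod(j,k): subtract the contribution of column k from column j. */
static void cmod(size_t n, double *a, size_t j, size_t k)
{
    for (size_t i = j; i < n; i++)
        a[i * n + j] -= a[i * n + k] * a[j * n + k];
}

/* cdiv(j): take the square root of the pivot and scale column j by it. */
static void cdiv(size_t n, double *a, size_t j)
{
    a[j * n + j] = sqrt(a[j * n + j]);
    for (size_t i = j + 1; i < n; i++)
        a[i * n + j] /= a[j * n + j];
}

/* Sparse-aware column Cholesky: column j is modified only by the columns k
   whose factor entry l_jk is nonzero; all other cmod's are skipped. */
void sparse_column_cholesky(size_t n, double *a)
{
    for (size_t j = 0; j < n; j++) {
        for (size_t k = 0; k < j; k++)
            if (a[j * n + k] != 0.0)     /* trivial cmod's contribute nothing */
                cmod(n, a, j, k);
        cdiv(n, a, j);
    }
}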


5 Parallel Cholesky Decomposition

The column dependencies for the matrix can be represented in the form of an elimination tree [8]. Here each node is identified by a number j which denotes the task of the j-th column modification. The elimination tree for a matrix, as illustrated in Fig. 3, gives us precise information about the column dependencies. Specifically, the column modification for a node j in the elimination tree cannot be initiated until all its dependent nodes have completed their modifications. For nodes at the same level, or nodes that have no ancestral relation amongst them, the tasks associated with the nodes can be executed in parallel. For tree structures that are short and wide, the associated orderings lend themselves to a good parallel implementation. From the elimination tree, which gives the task precedence relation, we can use optimal scheduling strategies like the HLF policy to obtain an efficient Cholesky decomposition.

The basic algorithm for the parallel sparse Cholesky factorization is described in Algorithm 5.1.

Algorithm 5.1

parbegin
    determine the parents of each column
    determine the approximate time necessary for computing each of the columns
parend
construct the elimination tree
determine the scheduling of the various tasks
parbegin
    while some more tasks in the queue do
        determine the next task in the queue for that processor
        await until that column and its computed children are available at the processor
        execute the task of column modification
        send the computed column to those processors onto which its parents and ancestors are to be scheduled
        send this column to the host
parend

In our implementation the host processor is responsible for the initiation of tasks, performing a deterministic scheduling of tasks to the node processors, and receiving columns from the node processors after their modification. A task refers to a column modification. The sequence of tasks to be executed at a particular node processor is determined by the scheduling done by the host processor.

Upon completion of a task the node processor communicates the results of the computation to only those processors onto which its parents and ancestors will be scheduled in the future. It also sends the computed results to the host.

In the following subsections we highlight the features of our algorithm.

5.1 Parent of a column

The parent of a column j denotes the first column i (i > j) which uses column j for its modification. It can be defined by

parent[j] = min{ i | a_ij ≠ 0, i > j }

i.e., parent[j] denotes the row subscript of the first off-diagonal nonzero in column j of the matrix A. It implies that cmod(parent[j], j) is a nontrivial event.
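A possible C routine for computing the parent array from the column structure is sketched below; for simplicity it scans a dense lower-triangular factor with explicit zeros, and the sentinel value n marks a root of the elimination tree. These representation choices are assumptions of the sketch.

#include <stddef.h>

/* parent[j] = min{ i | l_ij != 0, i > j }; parent[j] = n when column j has
   no off-diagonal nonzero, i.e. node j is a root of the elimination tree. */
void compute_parents(size_t n, const double *l, size_t *parent)
{
    for (size_t j = 0; j < n; j++) {
        parent[j] = n;                     /* sentinel: no parent yet */
        for (size_t i = j + 1; i < n; i++) {
            if (l[i * n + j] != 0.0) {
                parent[j] = i;             /* first off-diagonal nonzero */
                break;
            }
        }
    }
}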

5.2 Time of a Column

The time of a column denotes the approximate number of time units essential for the completion of the task. It is approximate in that the time denotes only the computation time involved in modifying a column, but not the communication time necessary to pass on the results.

The communication time can be computed only at runtime, because each node, besides receiving and sending the columns necessary for processing, must also assist in communicating the messages that will be routed through that node.

The time of a column is necessary to obtain the deterministic scheduling. The time units assigned are appropriate and depend upon the actual machine. (The constants defined for our implementation are proportional to the values of Intel's 80286 and 80287 processors.)
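One plausible way to estimate the time of a column is sketched below, assuming one multiply-add per nonzero for each cmod and one division per nonzero plus a square root for cdiv; the constants MUL_TIME, DIV_TIME and SQRT_TIME are stand-ins for machine-dependent values and are not the ones actually used in the implementation.

/* Illustrative per-operation costs in abstract time units (machine dependent). */
#define MUL_TIME   3
#define DIV_TIME   6
#define SQRT_TIME 20

/* Approximate computation time of task Tcol(j): nmods is the number of columns
   that modify column j, nnz the number of nonzeros in column j below the diagonal. */
long column_time(long nmods, long nnz)
{
    return nmods * nnz * MUL_TIME        /* cmod work */
         + SQRT_TIME + nnz * DIV_TIME;   /* cdiv work */
}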

5.3 Elimination tree

The elimination tree for a matrix (Fig. 3) determines the precedence relation among the set of tasks to be executed [1]. If two nodes lie on the same level in the elimination tree, then the corresponding tasks can be executed in parallel (note that two tasks can be executed in parallel if one is not the ancestor of the other). For a task to be initiated, all its dependent tasks must be completed. The data structure for each node includes pointers to its parent, first child, and right and left siblings. Since the tree in general is not a binary tree, we store one child of every node as its first child and all the other children are stored in a linked list.


[Figure 3: Elimination tree associated with a Cholesky factor. The figure shows the nonzero pattern of the factor (zero entries marked ·, nonzero entries marked ×) together with the corresponding elimination tree.]

The nodes are leveled in a bottom-up manner. The nodes also have information about their processor assignment.
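In C, the node record described above might be declared as follows; the field names are illustrative and not necessarily those used in the simulator.

/* One node of the elimination tree: the column (task) it represents plus links
   that encode an arbitrary number of children as a first-child / sibling list. */
struct etree_node {
    int column;                        /* column (task) this node represents */
    int level;                         /* level assigned bottom-up           */
    int processor;                     /* processor the task is scheduled on */
    long work;                         /* approximate computation time       */
    struct etree_node *parent;
    struct etree_node *first_child;
    struct etree_node *left_sibling;
    struct etree_node *right_sibling;
};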

5.4 Task scheduling

The scheduling of tasks to the processors is done using the HLF policy. This policy has been proved to be an optimal one in the case of static and deterministic scheduling. In this policy the tasks are assigned in a bottom-up manner. At each level, the processor with the least sum of execution times of the tasks assigned so far is chosen for the assignment of the next task.
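A sketch of the HLF assignment loop is given below, assuming the tree nodes have already been grouped level by level in bottom-up order and that load[p] accumulates the execution times assigned to processor p so far; the data layout is hypothetical.

#include <stddef.h>

/* Assign tasks level by level (bottom-up).  Within a level each task goes to
   the processor with the smallest accumulated load. */
void hlf_schedule(size_t nlevels, const size_t *level_size, long **level_work,
                  int **level_proc, size_t nproc, long *load)
{
    for (size_t p = 0; p < nproc; p++)
        load[p] = 0;
    for (size_t lev = 0; lev < nlevels; lev++) {          /* bottom-up levels  */
        for (size_t t = 0; t < level_size[lev]; t++) {
            size_t best = 0;
            for (size_t p = 1; p < nproc; p++)            /* least-loaded CPU  */
                if (load[p] < load[best])
                    best = p;
            level_proc[lev][t] = (int)best;               /* record assignment */
            load[best] += level_work[lev][t];
        }
    }
}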

5.5 Column modification

Each processor has a sequence of tasks to be executed. If the list is not empty, then it picks the next task and executes it. If the list is empty, then the processor prints out the time statistics and exits. If the j-th column modification needs the i-th column (i < j) and the i-th column elements are not available at that processor, then the processor waits until the required column is available. After the column modification, it sends the results to the host and to all the processors onto which its ancestors will be scheduled in the future. This is more efficient than broadcasting the column to all the processors, because some processors may never use this particular column.

5.6 Host process

The host processor sets up the elimination tree and determines the task scheduling. The host processor sends the columns to the individual processors for modification. It is also responsible for storing the final results obtained from the node processes. Thus the host is the kernel for the synchronization of the activities.

6 The Hypercube Simulator

Many important properties like isotropicity, symmetry, flexibility, ease of mapping of other architectures, etc., make the hypercube one of the most versatile parallel machines. A hypercube also provides several independent paths for communication between any pair of nodes. In principle, an n-dimensional hypercube provides n independent paths between any two node processors. The hypercube simulator has been developed in the C language under the UNIX environment. The simulator developed is a general-purpose one capable of simulating a hypercube of any dimension. Several features of the simulator are described below.

Simulation of a node processor: The process creation facility of UNIX (fork() system call) is used to simulate the node processors. A host processor iteratively calls the system call to generate the 2^n nodes of the n-dimensional hypercube to be simulated. Each node thus created has a unique identifier that is used to identify the node in the hypercube.

Communication Channels: These have been established using the IPC facilities in UNIX (msgget() system call). Each processor of the n-cube is connected to n other processors. Each communication link established between two node processors is identified by a unique key depending on the ids assigned to the two processors. In general, for two processors i and j the key was determined by the formula

key_id = max{i, j} * 1000 - min{i, j}
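A minimal sketch of how the node processes and communication channels might be created with the UNIX primitives named above (fork() and msgget()); error handling is omitted, the dimension is fixed for brevity, and the key computation follows the formula given in the text.

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>
#include <unistd.h>

#define DIM   3                         /* hypercube dimension n (assumed)  */
#define NODES (1 << DIM)                /* 2^n node processors              */

/* Key for the channel between processors i and j, as in the formula above. */
static key_t channel_key(int i, int j)
{
    int hi = i > j ? i : j, lo = i > j ? j : i;
    return (key_t)(hi * 1000 - lo);
}

int main(void)
{
    /* The host creates one message queue per hypercube edge ...            */
    for (int i = 0; i < NODES; i++)
        for (int d = 0; d < DIM; d++) {
            int j = i ^ (1 << d);       /* neighbour across dimension d     */
            if (j > i)
                (void)msgget(channel_key(i, j), IPC_CREAT | 0600);
        }
    /* ... and then forks the 2^n node processes.                           */
    for (int id = 0; id < NODES; id++)
        if (fork() == 0) {
            /* child: node processor 'id' would run its scheduled task list */
            _exit(0);
        }
    return 0;
}

Since each unordered pair of node ids below 1000 yields a distinct key, every edge of the cube gets its own message queue.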

Message Passing: The routing of messages is handled by incorporating the destination and the column identification into the message packet and placing it on the communication channel using the msgsnd() system call. Each node processor checks for messages on all the communication links adjacent to that node processor. A receiver, upon the receipt of a message packet (msgrcv() system call), takes an appropriate action depending upon


the packet destination. If the receiver is not the destination, it calls a nexthop() algorithm to determine the adjacent communication channel onto which the message has to be placed.

The message packet structure in the column Cholesky technique comprises columns of the matrix along with identifications for the column and the destination processor. The elements are first converted to an appropriate character representation and sent as message packets.
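The packet could be laid out as a System V message as sketched below; the field names and the fixed-size text buffer are assumptions for illustration only.

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define MAXTEXT 512

/* System V message: mtype must come first; the rest is the payload, here the
   destination node, the column number and the column entries as text. */
struct col_msg {
    long mtype;                        /* required by msgsnd()/msgrcv()     */
    int  dest;                         /* final destination node processor  */
    int  column;                       /* which column of the matrix        */
    char text[MAXTEXT];                /* column entries, character form    */
};

/* Place a packet on the queue for the next hop towards its destination. */
int send_column(int qid, const struct col_msg *m)
{
    return msgsnd(qid, m, sizeof(*m) - sizeof(long), 0);
}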

The nexthop algorithm compares the destination and source node processor ids and checks for the least significant bit in which they differ. It determines the immediate destination by toggling that particular bit in the source node id and returns it.

Process Synchronization: Each process performs the scheduled sequence of tasks determined by the static scheduling algorithm. When a process needs to wait because of the nonavailability of a particular column, it repeatedly calls a WAIT [8] routine until the required column is available. If the node processor completes all the assigned tasks, then it exits after reporting the duration for which the processor was active since its creation.

Host Synchronization: The host processor forms the kernel of the process synchronization. It creates the node processors and communicates to them the relevant task lists. During the program execution the host processor is also responsible for receiving the modified columns of the matrix from the processors and storing them. After all the columns are received, it clears the message queues, prints the statistics and exits.
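The bit-toggle routing step described above takes only a few lines of C; the argument names are illustrative, while the routine name nexthop() follows the text.

/* Return the neighbour of 'src' obtained by toggling the least significant
   bit in which 'src' and 'dest' differ; this is the next hop for the packet. */
int nexthop(int src, int dest)
{
    int diff = src ^ dest;
    if (diff == 0)
        return src;                    /* already at the destination   */
    return src ^ (diff & -diff);       /* toggle lowest differing bit  */
}

Because exactly one differing bit is corrected per hop, a packet reaches its destination in at most n hops on an n-dimensional hypercube.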

7 Conclusions

In this paper the design of a parallel Cholesky decomposition algorithm for sparse symmetric positive definite matrices and its implementation on a simulated hypercube are discussed. The conclusions obtained from the simulation studies are summarized below:

• Parallel execution time in the degenerate case is the time on a single-node machine. Due to the setup time overhead of the parallel algorithm, the total time to execute this algorithm on a single-node machine is greater than the serial execution time, and the difference is determined by the setup time.

• For the same matrix size, the speedup increases with the number of processors. However, the speedup is not linear because of the interdependencies among the various tasks (Table 2).

Table 2: Relationship between Speedup and Number of processors

Number of Processors    Speedup
1                       0.7647
2                       1.2466
4                       1.8239
8                       2.1913

Table 3: Relationship between Speedup and Matrix size for an 8-node hypercube

Matrix size    Speedup
4              0.357
8              1.295
12             1.406
16             2.1913

• For the same number of processors, the speedup increases with the matrix size. This is due to the fact that with an increase in the matrix size the number of tasks increases and, in general, the number of paths in the elimination tree increases. Consequently the scope for parallelism, and hence the speedup, increases (Table 3).

• Speedup depends upon the ordering of the matrix. The speedup is best in the case of elimination trees that are short and wide. In the case of the worst ordering, the parallel algorithm is no better than its serial version.

• Setup time for the parallel algorithm is independent of the ordering of the matrix and depends only on the matrix size. The overhead due to setup time for a typical machine is tabulated in Table 4. The table indicates that this overhead becomes relatively less significant as the matrix dimension increases.

Table 4: Relationship between Setup time and matrix size

Matrix size    Setup time (in microseconds)
8              160
16             384
32             1024
64             3072
128            10240

• Speedup increases with the total execution time as an increase in execution time makes the communication overhead less significant.

The simulation studies reflect a very good speedup compared to the existing algorithms based on the elimination tree. One of the main reasons for this better speedup is the reduction in the communication costs achieved by the deterministic static scheduling.

8 Further extensions

• The algorithm can be made more efficient by mapping subtrees of the elimination tree to a processor, instead of mapping individual tasks to processors, thereby reducing the communication overhead.

• A more realistic picture can be obtained by running the algorithm on a real machine.

• Since the ordering of matrices also plays an important role, it would be interesting to find better orderings for the matrix.

• The individual steps of the column modification could be parallelized.

• The triangular solution can be obtained using the computed Cholesky factor.

References

[1] A. George and Joseph W. H. Liu, Computer Solution of Large Sparse Positive Definite Systems, Prentice Hall, Englewood Cliffs, N.J., 1981.

[2] A. George, M. T. Heath, E. Ng and Joseph W. H. Liu, Sparse Cholesky factorization on a local-memory multiprocessor, SIAM J. Sci. Stat. Comput., 9, No. 2, 1988, pp 327-340.

[3] A. George, M. T. Heath and Joseph W. H. Liu, Parallel Cholesky Factorization on a Shared-Memory Multiprocessor, Linear Algebra and its Applications, 77, 1986, pp 165-187.

[4] A. George, M. T. Heath and Joseph W. H. Liu, Symbolic Cholesky Factorization on a Local Memory Multiprocessor, Parallel Computing, 5, 1987, pp 85-95.

[5] Joseph W. H. Liu, Computational models and task scheduling for parallel sparse Cholesky factorization, Parallel Computing, 3, 1986, pp 327-342.

[6] Stephen H. Friedberg and Arnold J. Insel, Introduction to Linear Algebra and its Applications, Prentice Hall, Englewood Cliffs, N.J., 1986.

[7] G. W. Stewart, Introduction to Matrix Computations, Academic Press, 1973.

[8] V. M. Padakanti, Sparse Cholesky Factorization on a Simulated Hypercube, Project Report, Dept. of Computer Science and Engineering, Indian Institute of Science, Bangalore, India, April 1990.

[9] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Oxford Univ. Press, 1965.

[10] R. P. Tewarson, Sparse Matrices, Academic Press, NY, 1973.
