
J. Cao, W. Nejdl, and M. Xu (Eds.): APPT 2005, LNCS 3756, pp. 111 – 120, 2005. © Springer-Verlag Berlin Heidelberg 2005

Experiments on Asynchronous Partial Gauss-Seidel Method

Hiroshi Nishida and Hairong Kuang

Computer Science Department, California State Polytechnic University, Pomona, 3801 West Temple Avenue, CA 91768, USA

{hnishida,hkuang}@csupomona.edu

Abstract. This paper presents the design and experimental results of a parallel linear equation solver based on the asynchronous partial Gauss-Seidel method. The basic idea of this method is derived from the asynchronous iterative method: newly computed values of unknowns are broadcast to all other processors and are incorporated into computing the next value immediately after they are received. However, since the asynchronous iterative method requires frequent data passing, it is difficult to achieve high performance on practical cluster computing systems due to its enormous communication overhead. To avoid this, the asynchronous partial Gauss-Seidel method reduces the frequency of broadcasting new values of unknowns by passing multiple values in a chunk. The experimental results show the advantage of the asynchronous partial Gauss-Seidel method.

1 Introduction

The most representative sequential algorithms for solving systems of linear equations are the Jacobi method and the Gauss-Seidel method, while the parallel Jacobi method and the asynchronous iterative method are the parallel algorithms cited most frequently [1, 2, 3].

The sequential Gauss-Seidel method generally converges in fewer iterations than the sequential Jacobi method, because it incorporates newly computed values of unknowns into the computation of the next value. However, the Gauss-Seidel method cannot be parallelized because of its inherent data dependencies. On the other hand, the Jacobi method is easily parallelizable: the input matrix is partitioned into blocks so that each processor is responsible for computing one of the blocks, and the values of unknowns are exchanged at the end of each iteration. Although the design of the parallel Jacobi method is simple, it requires barrier synchronization at the end of each iteration, which causes a significant degradation of performance. The asynchronous iterative method, which is based on the chaotic relaxation introduced by Chazan and Miranker in 1969, was proposed by Baudet in 1978 [1, 2]. It performs fast parallel computation by using older data received earlier in time and by removing the barrier synchronization inherent in the parallel Jacobi method. One of the sub-methods of the asynchronous iterative method, which passes newly computed values of unknowns one by one, is called the purely asynchronous method [2]. Baudet's experimental results show that the purely asynchronous method converges in fewer iterations than the parallel Jacobi method [2]. However, as far as the elapsed time is concerned, it can be disadvantageous on practical cluster computing systems due to its huge communication overhead.

In this paper, we introduce the asynchronous partial Gauss-Seidel method, which passes multiple values of unknowns in a chunk and thereby reduces communication overhead. The most important parameters that determine its performance are the frequency of data sending and the frequency of data receiving. Reducing these frequencies decreases communication overhead; however, it may increase the number of iterations needed to converge, since the algorithm then becomes closer to the asynchronous Gauss-Seidel's method [2].

In Section 2, we explain the details of the asynchronous partial Gauss-Seidel method. Section 3 presents and analyzes experimental results. A summary and a discussion of future work are given in Section 4.

2 Asynchronous Partial Gauss-Seidel Method

2.1 Basic Concept

A system of linear equations with a vector of unknowns x of size n can be represented in matrix form as follows:

Ax = b,

where A is an n-by-n matrix, and x and b are vectors with n elements.

The asynchronous iterative method is a parallel method for solving sparse systems of linear equations. The simplest way of allocating tasks to processors is to partition A and b equally by rows, as in the parallel Jacobi method. Each processor is responsible for solving a portion of the unknown vector x. When p processors exist, the matrix A and the vector b are divided into p tasks, each of which consists of n/p rows of A and b. Each processor is allocated one of the tasks and is in charge of computing x within the range of the given task. For example, processor k computes x_{nk/p}, ..., x_{n(k+1)/p-1}, using partition k of matrix A, consisting of rows nk/p, ..., n(k+1)/p-1, and partition k of b, consisting of elements b_{nk/p}, ..., b_{n(k+1)/p-1}.
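As a concrete illustration of this partitioning, the sketch below computes the row range owned by processor k. It assumes, as in the description above, that p divides n evenly; the function name task_range is ours.

```c
#include <stdio.h>

/* Row range owned by processor k when an n x n system is split
 * evenly across p processors (sketch; assumes p divides n). */
static void task_range(int n, int p, int k, int *first, int *last)
{
    int rows = n / p;            /* rows per task               */
    *first = k * rows;           /* first row owned by proc k   */
    *last  = (k + 1) * rows - 1; /* last row owned by proc k    */
}

int main(void)
{
    int first, last;
    task_range(24, 2, 1, &first, &last);
    printf("processor 1 owns x%d..x%d\n", first, last); /* x12..x23 */
    return 0;
}
```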

Baudet classifies the asynchronous iterative method into three different sub-methods, the asynchronous Jacobi's method, the asynchronous Gauss-Seidel's method and the purely asynchronous method, according to the timing of exchanging new values of unknowns and the choice of the values used [2]. The purely asynchronous method releases each new value immediately after its computation, while the asynchronous Jacobi's method and the asynchronous Gauss-Seidel's method exchange new values only at the end of each iteration. The only difference between the asynchronous Jacobi's method and the asynchronous Gauss-Seidel's method is the choice of the values of unknowns within each iteration. The asynchronous Gauss-Seidel's method uses new values of unknowns in its own subset for further computation in the same iteration as soon as they are computed, while the asynchronous Jacobi's method uses only the values of unknowns known at the beginning of an iteration.

Baudet's experimental results show that the purely asynchronous method converges in fewer iterations than the asynchronous Jacobi's method and the asynchronous Gauss-Seidel's method [2]. The results also show that the asynchronous Gauss-Seidel's method requires more iterations as the number of processors increases.


A drawback of the purely asynchronous method is that the communication overhead of exchanging new values one by one is huge on practical cluster computing systems. As the experimental results in Section 3 show, it is difficult to achieve desirable performance with it on modern cluster computing systems.

The asynchronous partial Gauss-Seidel method, introduced in this paper, reduces the frequency of data passing and thereby mitigates this drawback of the purely asynchronous method. It sends multiple new values of unknowns in a chunk, reducing the communication overhead. The choice of the values of unknowns used in the computation is the same as in the purely asynchronous method: the most recent available values are used. However, the asynchronous partial Gauss-Seidel method differs in the timing of releasing new values: it releases them once the number of unsent values reaches a certain fixed number. For instance, suppose we define the number of values of unknowns passed in a chunk as 50. Each processor computes 50 new values of unknowns using the available values, including the most recent values computed on the processor itself. After the computation of the 50 new values, the processor broadcasts them together to all other processors. Chunks of new values from other processors are received asynchronously; as soon as they are received, each processor incorporates them into its buffered x and makes them available to the next computation.

The most important parameter in the asynchronous partial Gauss-Seidel method is the frequency of sending new values of unknowns. A decrease in the frequency of data sending reduces communication overhead, but at the same time it may increase the number of iterations needed to converge. In Baudet's experiments, the asynchronous Gauss-Seidel's method requires more iterations as the number of processors increases [2]. Since the asynchronous partial Gauss-Seidel method becomes closer to the asynchronous Gauss-Seidel's method as the frequency of data passing decreases, the same phenomenon may occur here. Hence the tradeoff between the reduction of communication and the increase of iterations is a significant issue for this method. In Section 3, we discuss this tradeoff in light of the experimental results.

Another important parameter is the frequency of receiving new values from other processors. In order to avoid blocking while receiving new values, our programs periodically check whether new packets from other processors have arrived and been stored in the operating system's buffer. A processor calls the select() system call on UNIX, or equivalent system calls on other operating systems, each time it checks the network data buffered in the operating system. Calling a system call and waiting for its return takes a certain amount of time, so frequent receiving, that is, frequent checking for new values from other processors, increases runtime overhead. However, by immediately incorporating new values into its buffered x, the asynchronous partial Gauss-Seidel method may finish its computation faster, because the new values can be used to evaluate the next values sooner. The relationship between the frequency of data receiving and the practical speedup is not easy to predict. In Section 3, we show experimental results with different frequencies of data receiving.
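A minimal sketch of such a non-blocking check, using select() with a zero timeout, might look as follows; the per-peer socket array and the helper name poll_peers are our assumptions, not the paper's actual code.

```c
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>

/* Poll the sockets of the other processors once, without blocking.
 * On return, `ready` holds the descriptors that have data queued in
 * the operating system's buffer (sketch; error handling omitted). */
static int poll_peers(const int *socks, int nsocks, fd_set *ready)
{
    fd_set fds;
    struct timeval tv = {0, 0};   /* zero timeout: return immediately */
    int i, maxfd = -1;

    FD_ZERO(&fds);
    for (i = 0; i < nsocks; i++) {
        FD_SET(socks[i], &fds);
        if (socks[i] > maxfd)
            maxfd = socks[i];
    }
    int n = select(maxfd + 1, &fds, NULL, NULL, &tv);
    if (ready)
        memcpy(ready, &fds, sizeof(fds));
    return n;   /* > 0 if at least one chunk of new values arrived */
}
```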

2.2 Design and Implementation

The basic algorithm to compute new values of unknowns is the same as in the other asynchronous sub-methods and is expressed as follows:

x_i = (b_i - Σ_{j≠i} a_ij x_j) / a_ii,  0 ≤ i < n,

where the values x_j on the right-hand side are the most recent ones available to the processor.

The three asynchronous sub-methods, the asynchronous Jacobi's method, the asynchronous Gauss-Seidel's method and the purely asynchronous method, differ only in the choice of the values used in the computation. The asynchronous partial Gauss-Seidel method always uses the available x to compute a new value, as does the purely asynchronous method. The difference between the purely asynchronous method and the asynchronous partial Gauss-Seidel method is the frequency of exchanging new values of unknowns.
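A sketch of this shared relaxation step, reading whatever values are currently in the local buffer x (the function name update_unknown is ours):

```c
/* One relaxation step for unknown i, using whatever values are
 * currently in the local buffer x (the most recent available ones).
 * Sketch of the update shared by all asynchronous sub-methods. */
static double update_unknown(int n, int i, const double *A_row_i,
                             double b_i, const double *x)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        if (j != i)
            sum += A_row_i[j] * x[j];
    return (b_i - sum) / A_row_i[i];
}
```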

Suppose we have 24 unknowns: x0, x1, x2, ..., x23, and suppose 2 processors P0 and P1 are used for solving the system of linear equations, each in charge of computing 12 unknowns: P0 computes {x0, ..., x11} and P1 computes {x12, ..., x23}. In the purely asynchronous method, each new value of x is broadcast immediately after its computation. In the asynchronous partial Gauss-Seidel method, multiple values of x are broadcast in a chunk. For example, suppose 4 values of unknowns xk, xk+1, xk+2, xk+3 are broadcast together: they are bundled into a chunk and broadcast after the computation of these 4 values. Another parameter we must define is the frequency of data receiving. Here we assume that new values are received after every computation of 4 values. The execution and data exchange of this model are illustrated in Figures 1 and 2.
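Such a bundle can be represented as a small message carrying the index of its first unknown and the new values; this layout is our assumption for illustration, not the paper's wire format.

```c
#define CHUNK_MAX 100   /* largest sending frequency used in Section 3 */

/* One broadcast message (layout assumed for illustration):
 * values[0..count-1] are the new values of x[first..first+count-1]. */
typedef struct {
    int    first;             /* global index of the first unknown */
    int    count;             /* number of values in this chunk    */
    double values[CHUNK_MAX]; /* newly computed values             */
} chunk_msg;

/* Receiver side: merge an arrived chunk into the local buffer x,
 * making the new values available to the next computation. */
static void merge_chunk(const chunk_msg *m, double *x)
{
    for (int t = 0; t < m->count; t++)
        x[m->first + t] = m->values[t];
}
```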

Figure 1 shows a sequence of computation and data exchanges. On processor 0, after computing x0 through x3, the new values are broadcast to the other processors; in this case they are sent only to processor 1. Afterwards, processor 0 checks for values of unknowns sent from other processors. If any values are stored in the operating system's buffer, processor 0 incorporates them into its x buffer. Figure 2 illustrates a phase of iterations.

Fig. 1. Execution of the asynchronous partial Gauss-Seidel method 1. (Processor 0 computes x0 through x3, broadcasts x0-x3, receives x, then computes x4, x5, and so on; processor 1 proceeds identically with x12 through x15 and onward.)

Fig. 2. Execution of the asynchronous partial Gauss-Seidel method 2. Shaded areas represent time spent for communication consisting of broadcasting and receiving x. (Over the 1st and 2nd iterations, processor 0 computes the chunks x0-x3, x4-x7, x8-x11 and processor 1 the chunks x12-x15, x16-x19, x20-x23, with communication between chunks.)

As described in Section 2.1, there are two important parameters in the asynchronous partial Gauss-Seidel method: the frequency of data sending and the frequency of data receiving. In the example described above, after every computation of 4 values the new values are broadcast and, at the same time, values from other processors are checked. Figure 3 shows another model, in which values from other processors are checked after every computation of 2 values.

Fig. 3. Execution of the asynchronous partial Gauss-Seidel method with a different data receiving frequency

If new values are broadcast and checked one by one, the algorithm is the same as the purely asynchronous method. If new values are broadcast at the end of each iteration and values from other processors are checked after the computation of every value, it becomes the asynchronous Gauss-Seidel's method. If only one processor is used, the algorithms of the purely asynchronous method and the asynchronous partial Gauss-Seidel method are equal to that of the sequential Gauss-Seidel method.
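This spectrum can be expressed as one loop parameterized by the two frequencies. The sketch below is ours, with hypothetical helpers broadcast_chunk() and receive_values(): setting send_freq = recv_freq = 1 yields the purely asynchronous behavior, while sending rarely and receiving after every value moves toward the asynchronous Gauss-Seidel's method.

```c
/* Hypothetical helpers, assumed for this sketch (not from the paper): */
void broadcast_chunk(const double *x, int from, int to); /* send x[from..to] */
void receive_values(double *x); /* poll the OS buffer, merge arrived chunks */
double update_unknown(int n, int i, const double *row, double b_i,
                      const double *x);

/* Core of one iteration on the owning processor (sketch); first..last
 * is the row range owned by this processor, and send_freq/recv_freq
 * are the data sending and receiving frequencies of Section 2.1. */
void apgs_iteration(int first, int last, int send_freq, int recv_freq,
                    int n, double *const *A, const double *b, double *x)
{
    int computed = 0;
    for (int i = first; i <= last; i++) {
        x[i] = update_unknown(n, i, A[i], b[i], x);
        computed++;
        if (computed % send_freq == 0)
            broadcast_chunk(x, i - send_freq + 1, i); /* unsent chunk */
        if (computed % recv_freq == 0)
            receive_values(x); /* incorporate buffered peer values */
    }
}
```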

2.3 Convergence Detection

One of the biggest issues in the asynchronous iterative method is the methodology of convergence detection. Chaotic relaxation [1] states the convergence condition as follows:

there must be a fixed positive integer s such that, in carrying out the evaluation of the ith iterate, a process cannot make use of any value of the components of the jth iterate if j < i - s [2, 3].

Though we feel the necessity of further research on convergence detection methodology, we use a simple detection technique in our experiments. While computing new values of unknowns, each processor computes the difference between the new value and the old value of each unknown. If every difference is within a given error tolerance, the processor sets its convergence flag true; otherwise the flag is set false. These flags are cleared at the beginning of each iteration. Processor 0 collects the values of these flags from all processors and terminates the computation if all flags are true. Theoretically, it would be ideal to detect convergence using time stamps or periodic synchronization. However, we focus only on practical usage of the asynchronous iterative method, and we assume the following condition:

The number of unknowns in the system of linear equations is large enough that the time spent for data transmission among processors is much shorter than the time spent for an iteration of computation. In other words, no values older than the previous iteration are used for evaluation on any processor.
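A sketch of the per-processor flag computation described above; the names are ours, and the gathering of flags by processor 0 is left as a hypothetical gather_flags() call.

```c
#include <math.h>
#include <stdbool.h>

/* Local convergence test of Section 2.3 (sketch): the flag is true
 * only if every unknown in this processor's range moved by less
 * than the error tolerance during the current iteration. */
static bool local_converged(const double *x_new, const double *x_old,
                            int first, int last, double tol)
{
    for (int i = first; i <= last; i++)
        if (fabs(x_new[i] - x_old[i]) >= tol)
            return false;
    return true;
}
/* Processor 0 would then collect these flags from all processors,
 * e.g. via a hypothetical gather_flags(), and terminate the
 * computation once all of them are true. */
```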


Our experiments show that, in our system, delivering a new value of an unknown takes approximately the same time as computing 30 values of unknowns. This can be considered small enough compared to the time taken for an iteration of computation in large systems of linear equations.

3 Experimental Results

3.1 Experiments

The experiments have been carried out on 8 machines with 34 different systems of linear equations. Each measurement is repeated 10 times. The following 11 algorithms have been used in the experiments: the asynchronous partial Gauss-Seidel method with 9 different combinations of data sending-receiving frequencies, the purely asynchronous method and the parallel Jacobi method.

The specification of a machine is as follows:

Model: Sun Blade 2500, CPU: UltraSPARC IIIi 1.6GHz, LAN: 100Mbps, Memory: 2GB, OS: Solaris 9

Input matrices A and vectors b are generated by a random generator with different random seeds [8]. The approximate density of a matrix A is 38%. The generated matrices are compressed into a zero-skipping format; the compression rate is approximately 40%. This compression technique helps reduce not only the initial task assignment time but also the computation time. The input matrices A and vectors b are equally partitioned and statically assigned to all machines. The number of unknowns is 3360.
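The paper does not specify the zero-skipping format; one plausible layout, offered purely as an assumption, stores only the nonzero entries of each row together with their column indices.

```c
/* One possible zero-skipping row layout (our assumption; the paper
 * does not describe the actual format). Only the nonzero entries of
 * a row are stored, each paired with its column index. */
typedef struct {
    int     nnz;   /* number of nonzero entries in this row */
    int    *col;   /* column index of each stored entry     */
    double *val;   /* value of each stored entry            */
} sparse_row;

/* Relaxation step over a compressed row; diag is a_ii. Skipping the
 * stored zeros is what saves computation time, as noted above. */
static double update_sparse(const sparse_row *r, int i, double b_i,
                            double diag, const double *x)
{
    double sum = 0.0;
    for (int t = 0; t < r->nnz; t++)
        if (r->col[t] != i)
            sum += r->val[t] * x[r->col[t]];
    return (b_i - sum) / diag;
}
```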

The 9 combinations of data sending-receiving frequencies in the asynchronous partial Gauss-Seidel method are as follows:

Table 1. Frequencies of data sending and receiving

Frequency of sending   Frequency of receiving
        10                       1
        10                       5
        10                      10
        50                      10
        50                      25
        50                      50
       100                      10
       100                      50
       100                     100


A combination of the sending frequency '50' and the receiving frequency '25' means that the new 50 values are broadcast after every computation of 50 values, and that values from other machines are checked after every computation of 25 values. This is expressed as "APGS 50-25" in Section 3.2.

Table 2 (a). Elapsed time compared to the parallel Jacobi method (%)

# of    PA     APGS   APGS   APGS   APGS   APGS   APGS   APGS    APGS    APGS
proc.          10-1   10-5   10-10  50-10  50-25  50-50  100-10  100-50  100-100
1       60.1   60.3   60.1   60.3   60.4   60.3   60.4   60.2    60.1    60.2
2       63.4   55.1   51.1   53.2   49.5   49.1   54.1   52.4    53.4    54.3
3       65.1   64.4   54.2   58.4   52.2   52.5   52.2   60.6    64.1    59.4
4       79.2   55.1   52.5   54.3   61.9   57.9   61.1   56.7    57.8    67.0
5       83.1   62.4   58.4   57.6   62.2   61.4   59.4   64.2    61.0    63.3
6       97.2   59.1   58.3   58.1   56.6   59.9   59.2   63.0    63.8    66.3
7       100.9  65.7   60.3   61.9   58.7   59.3   63.1   62.0    65.8    69.7
8       111.2  62.3   58.6   62.5   61.2   61.9   64.0   62.9    65.4    73.0

Table 2 (b). Number of iterations compared to the parallel Jacobi method (%)

# of    PA     APGS   APGS   APGS   APGS   APGS   APGS   APGS    APGS    APGS
proc.          10-1   10-5   10-10  50-10  50-25  50-50  100-10  100-50  100-100
1       59.5   59.5   59.5   59.5   59.5   59.5   59.5   59.5    59.5    59.5
2       52.2   52.5   49.0   51.0   48.0   47.7   52.9   51.5    52.5    53.5
3       45.9   59.6   50.4   54.5   49.4   49.9   49.7   59.3    63.0    58.0
4       50.0   48.3   46.9   48.6   58.9   55.0   58.4   54.3    55.4    65.4
5       46.7   54.0   51.3   50.2   58.5   58.0   56.0   61.9    58.5    61.2
6       48.5   49.1   50.0   49.9   51.6   55.4   55.0   60.0    61.1    64.3
7       46.2   53.9   50.6   52.1   53.4   54.3   57.6   58.6    63.1    68.0
8       46.7   48.9   47.4   51.5   55.5   56.8   58.9   59.1    62.2    71.7


Fig. 4. A sample experimental result 1. (a) Elapsed time (ms) versus the number of processors (N = 3360); (b) number of iterations versus the number of processors (N = 3360). Methods plotted: Jacobi, PA, APGS 10-1, APGS 10-5, APGS 10-10, APGS 50-10, APGS 50-25, APGS 50-50, APGS 100-10, APGS 100-50 and APGS 100-100.

Fig. 5. A sample experimental result 2. (a) Elapsed time (ms) and (b) number of iterations versus the number of processors (N = 3360), for the same methods as in Figure 4.

3.2 Results

The parallel Jacobi method converges with 26 out of the 34 systems of linear equations. On the other hand, the purely asynchronous method and all the asynchronous partial Gauss-Seidel methods converge with 32 systems. The sequential Gauss-Seidel method converges with all of the systems.

The comparisons of the elapsed time and the number of iterations between the parallel Jacobi method and the other methods are shown in Table 2. The results are calculated using the 26 systems with which the parallel Jacobi method converges. PA stands for the purely asynchronous method. APGS denotes the asynchronous partial Gauss-Seidel method, and the numbers that follow give the combination of data sending-receiving frequencies (see Section 3.1).

Table 2 (a) shows the average ratios of the elapsed time taken by the PA method or the APGS methods to the elapsed time taken by the parallel Jacobi method with the same number of processors. The measured elapsed time includes the time spent for network communication: the initial task assignment and the exchanges of values of unknowns. Among all the algorithms, the APGS 10-5 method consistently records a short elapsed time; on the whole, its elapsed time is 40-50% shorter than that of the parallel Jacobi method. The other APGS methods also show fairly good results. Even the APGS 100-100 method, which gives probably the worst results among the APGS methods, is still 28-46% faster than the parallel Jacobi method. On the other hand, the purely asynchronous method performs slowly in this experiment and becomes less efficient as the number of processors increases; with more than 6 processors, it becomes slower than the parallel Jacobi method. As a whole, the APGS methods tend to become slower as the data sending-receiving frequencies decrease.

Table 2 (b) shows the average ratios of the number of iterations. Clearly, both a decrease in the frequency of data sending and a decrease in the frequency of data receiving cause an increase in the number of iterations. The exception is the APGS 10-1 method: in most cases, it takes more iterations than the APGS 10-5 method. Further investigation of this phenomenon is needed.

Sample experimental results are shown in Figure 4 and Figure 5. The horizontal axes in the figures represent the number of processors (machines). The vertical axes in Figure 4 (a) and Figure 5 (a) represent the elapsed time. The vertical axes in Figure 4 (b) and Figure 5 (b) represent the number of iterations.

4 Conclusions and Future Work

In the practical usage of parallel iterative algorithms for solving systems of linear equations, the reduction of communication overhead and the reduction of the number of iterations are the most important factors that determine the computation speed. In this paper, we describe the design and experimental results of the asynchronous partial Gauss-Seidel method, which requires less communication overhead than the purely asynchronous method and, at the same time, fewer iterations than the parallel Jacobi method. The experimental results show the advantage of the asynchronous partial Gauss-Seidel method. However, the asynchronous partial Gauss-Seidel method has the disadvantage that finding the best combination of data sending-receiving frequencies is difficult. Further research is needed on this issue.

In our experiments, we use broadcast to send values of unknowns. It would be interesting to try other message passing methods in order to further reduce communication overhead. Also, our experiments are limited to computation on a small cluster computing system; experiments on larger cluster computing systems are needed.


References

1. D. Chazan and W. Miranker, Chaotic Relaxation, Linear Algebra and its Applications, Vol. 2, pp. 199-222, 1969.

2. G. M. Baudet, Asynchronous Iterative Methods for Multiprocessors, Journal of the Association for Computing Machinery, Vol. 25, No. 2, pp. 226-244, 1978.

3. B. Wilkinson and M. Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Second Edition, Ch. 6 and Ch. 11, 2004.

4. K. Blathras, D. B. Szyld and Y. Shi, Timing Models and Local Stopping Criteria for Asynchronous Iterative Algorithms, Journal of Parallel and Distributed Computing, Vol. 58, pp. 446-465, 1999.

5. E. J. Lu, M. G. Hilgers and B. McMillin, Asynchronous Parallel Schemes: A Survey, Technical Report, Computer Science Department, University of Missouri - Rolla, 1993.

6. J. C. Strikwerda, A Convergence Theorem for Chaotic Asynchronous Relaxation, Linear Algebra and its Applications, Vol. 253, pp. 15-24, 1997.

7. P. Christen, A Parallel Iterative Linear System Solver with Dynamic Load Balancing, Proceedings of the 12th International Conference on Supercomputing, Melbourne, Australia, pp. 7-12, 1998.

8. M. Matsumoto and T. Nishimura, Mersenne Twister: A 623-Dimensionally Equidistributed Uniform Pseudorandom Number Generator, ACM Transactions on Modeling and Computer Simulation (TOMACS), Vol. 8, No. 1, pp. 3-30, January 1998.