High-Performance Heterogeneous Computing || Performance Analysis of Heterogeneous Algorithms




CHAPTER 6

Performance Analysis of Heterogeneous Algorithms

6.1 EFFICIENCY ANALYSIS OF HETEROGENEOUS ALGORITHMS

The methods for performance analysis of homogeneous parallel algorithms are well studied. They are based on models of parallel computers that assume a parallel computer to be a homogeneous multiprocessor. The theoretical analysis of a homogeneous parallel algorithm is normally accompanied by a relatively small number of experiments on a homogeneous parallel computer system. The purpose of these experiments is to show that the analysis is correct and that the analyzed algorithm really is faster than its counterparts.

Performance analysis of heterogeneous parallel algorithms is a much more difficult task that is wide open for research. Very few techniques have been proposed, none of which is accepted as a fully satisfactory solution. In this chapter, we briefly outline the proposed techniques.

One approach to performance analysis of heterogeneous parallel algorithms is based on the fact that the design of such algorithms is typically reduced to the problem of optimal data partitioning of one or another mathematical object such as a set, a rectangle, and so on. As soon as the corresponding mathematical optimization problem is formulated, the quality of its solution is assessed rather than the quality of the solution of the original problem. As the optimization problem is typically NP-hard, some suboptimal solutions are proposed and analyzed. The analysis is mostly statistical: The suboptimal solutions for a large number of generated inputs are compared with each other and with the optimal one. This approach is used in many papers (Crandall and Quinn, 1995; Kaddoura, Ranka, and Wang, 1996; Beaumont et al., 2001b,c). It estimates the heterogeneous parallel algorithm indirectly, and additional experiments are still needed to assess its efficiency in real heterogeneous environments.

High-Performance Heterogeneous Computing, by Alexey L. Lastovetsky and Jack J. Dongarra. Copyright © 2009 John Wiley & Sons, Inc.

Another approach is to experimentally compare the execution time of the heterogeneous algorithm with that of its homogeneous prototype or heterogeneous competitor. A particular heterogeneous network is used for such experiments. In particular, this approach is used in Kalinov and Lastovetsky (1999a), Kalinov and Lastovetsky (2001), Dovolnov, Kalinov, and Klimov (2003), and Ohtaki et al. (2004). This approach directly estimates the efficiency of heterogeneous parallel algorithms in some real heterogeneous environment but still leaves open the question of their efficiency in general heterogeneous environments. Another problem with this approach is that real-life heterogeneous platforms are often shared by multiple users, which makes their performance characteristics, and hence experimental performance results, difficult to reproduce.

One possible approach to the problems of diversity of heterogeneous platforms and reproducibility of their performance characteristics could be the use of simulators, similar to those used for simulation of grid environments (see Sulistio, Yeo, and Buyya (2004) for an overview of this topic). Although the accuracy of such simulation can be open to question, it seems to be the only realistic solution for the experimental analysis of algorithms for large-scale distributed environments.

In the case of heterogeneous computational clusters, a more natural and reliable solution would be the use of a reconfigurable and fully controlled environment for the reproducible experiments. This environment might include both reconfigurable hardware and software components, altogether providing reproducible performance characteristics of processors and communication links. One simple, practical, and easy-to-implement design of such an environment is proposed in Canon and Jeannot (2006). The idea is to take a dedicated homogeneous cluster and degrade the performance of its computing nodes and communication links independently by means of software in order to build a “heterogeneous” cluster. Then, any application can be run on this new cluster without modifications. The corresponding framework, called Wrekavoc, targets the degradation of the following characteristics:

• CPU power,
• network bandwidth,
• network latency, and
• memory (not implemented yet).

For degradation of CPU performance, Wrekavoc implements three software-based methods: managing CPU frequency through a kernel interface; burning a constant portion of the CPU by a CPU burner; and suspending processes, with a CPU limiter, when they have used more than the required fraction of the CPU.
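The third of these methods, the CPU limiter, can be illustrated in a few lines. The sketch below is not Wrekavoc's actual code; it is a minimal, hypothetical duty-cycle limiter that alternates SIGSTOP/SIGCONT so that a process receives only a chosen fraction of the CPU (the function names and the 0.1-second control period are illustrative assumptions):

```python
import os
import signal
import time

def split_period(fraction, period):
    """Split one control period into (run, stop) durations for a duty cycle."""
    return (period * fraction, period * (1.0 - fraction))

def limit_cpu(pid, fraction, period=0.1, cycles=None):
    """Alternate SIGCONT/SIGSTOP so that process `pid` runs for roughly
    `fraction` of each control period, crudely capping its CPU share."""
    run, stop = split_period(fraction, period)
    done = 0
    while cycles is None or done < cycles:
        os.kill(pid, signal.SIGCONT)   # let the process run
        time.sleep(run)
        os.kill(pid, signal.SIGSTOP)   # suspend it for the rest of the period
        time.sleep(stop)
        done += 1
```

A production limiter of the kind Wrekavoc uses would additionally measure the CPU time actually consumed by the process instead of relying on a fixed duty cycle.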

Limiting latency and bandwidth is done with modern software tools allowing advanced IP routing. The tools allow the programmer to control both incoming and outgoing traffic, control the latency of the network interface, and alter the traffic using numerous and complicated rules based on IP addresses, ports, and so on.
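The chapter does not name the specific tools, but on Linux this kind of control is commonly done with the `tc` (traffic control) utility and its `tbf` and `netem` queueing disciplines. The helper below is a hypothetical sketch that only builds the command strings (running them requires root privileges); the concrete burst and queue-latency values are illustrative assumptions:

```python
def tc_shaping_commands(iface, rate_mbit, delay_ms):
    """Build Linux `tc` commands that cap outgoing bandwidth with a token
    bucket filter (tbf) and add constant latency with netem.
    The burst and latency parameters here are illustrative, not prescriptive."""
    return [
        # Root qdisc: limit the interface to `rate_mbit` Mbit/s.
        f"tc qdisc add dev {iface} root handle 1: tbf "
        f"rate {rate_mbit}mbit burst 32kbit latency 400ms",
        # Child qdisc: delay every packet by `delay_ms` milliseconds.
        f"tc qdisc add dev {iface} parent 1:1 handle 10: netem delay {delay_ms}ms",
    ]
```

Analogous rules on the ingress side, plus per-address filters, give the kind of rule-based traffic alteration described above.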


The framework itself is implemented using the client–server model. A server, with administrator privileges, is deployed on each node of the instrumental homogeneous cluster and runs as a daemon. The client reads a configuration file, which specifies the required configuration of the target heterogeneous cluster, and sends orders to each node in the configuration. The client can also order the nodes to recover their original state.

Another related approach to the performance analysis of heterogeneous parallel algorithms is based on a performance model of heterogeneous environments. This model-based approach is aimed at the prediction of the execution time of the algorithms without their real execution in heterogeneous environments. This approach is proposed and implemented in the framework of the mpC programming system (Lastovetsky, 2002). The algorithm designer can describe the analyzed algorithm in a dedicated specification language. This description is typically parameterized and includes the following information:

• The number of parallel processes executing the algorithm
• The absolute volume of computations performed by each of the processes, measured in some fixed computational units
• The absolute volume of data transferred between each pair of processes
• The scenario of interaction between the parallel processes during the algorithm execution

The heterogeneous environment is modeled by a multilevel hierarchy of interconnected sets of heterogeneous multiprocessors. The hierarchy reflects the heterogeneity of communication links and is represented in the form of an attributed tree. Each internal node of the tree represents a homogeneous communication space of the heterogeneous network. Attributes associated with the node allow one to predict the execution time of communication operations. Each terminal node in the tree represents an individual (homogeneous) multiprocessor computer that is characterized by the following:

• The time of execution of one computational unit by a processor of the computer; the computational unit is supposed to be the same as the one used in the description of the analyzed algorithm

• The number of physical processors
• The attributes of the communication layer provided by the computer

The description of the algorithm is compiled to produce a program that uses the performance model of the heterogeneous environment to predict the execution time of the algorithm for each particular mapping of its processes onto the computers of the environment.
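As a rough illustration of what such a model-based prediction looks like, the sketch below implements a deliberately simplified, hypothetical cost model (not the actual mpC model): each process's computation time is its volume of computational units multiplied by the unit time of the computer it is mapped onto, point-to-point communications are serialized, and computation and communication do not overlap:

```python
def predict_time(comp_units, unit_time, comm_volume, comm_speed, mapping):
    """Predict the execution time of a parallel algorithm on a heterogeneous
    environment under a simplified cost model.

    comp_units[i]       - computation volume of process i, in fixed units
    unit_time[c]        - time computer c needs for one computational unit
    comm_volume[(i, j)] - data volume sent from process i to process j
    comm_speed          - seconds per data unit for point-to-point transfers
    mapping[i]          - computer that process i is mapped onto
    """
    # Each process computes independently; the slowest one dominates.
    compute = max(units * unit_time[mapping[i]]
                  for i, units in enumerate(comp_units))
    # Serialized point-to-point communications (a pessimistic assumption).
    communicate = sum(v * comm_speed for v in comm_volume.values())
    return compute + communicate
```

Evaluating such a function for every candidate mapping of processes onto computers is exactly how the compiled description selects between mappings.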

The approach proposed in Lastovetsky and Reddy (2004b) is to carefully design a relatively small number of experiments in a natural or engineered heterogeneous environment in order to experimentally compare the efficiency of the heterogeneous parallel algorithm with some experimentally obtained ideal efficiency (namely, the efficiency of its homogeneous prototype in an equally powerful homogeneous environment). Thus, this approach compares the heterogeneous algorithm with its homogeneous prototype and assesses the heterogeneous modification rather than analyzing this algorithm as an isolated entity. It directly estimates the efficiency of heterogeneous parallel algorithms, providing relatively high confidence in the results of such an experimental estimation.

The basic postulate of this approach is that the heterogeneous algorithm cannot be more efficient than its homogeneous prototype. This means that the heterogeneous algorithm cannot be executed on the heterogeneous network faster than its homogeneous prototype on the equivalent homogeneous network. A homogeneous network of computers is equivalent to the heterogeneous network if

• its aggregate communication characteristics are the same as that of the heterogeneous network,
• it has the same number of processors, and
• the speed of each processor is equal to the average speed of the processors of the heterogeneous network.

The heterogeneous algorithm is considered optimal if its efficiency is the same as that of its homogeneous prototype.

This approach is relatively easy to apply if the target architecture of the heterogeneous algorithm is a set of heterogeneous processors interconnected via a homogeneous communication network. In this case, all that is needed is to find a (homogeneous) segment in the instrumental LAN and select two sets of processors in this segment so that

• both sets consist of the same number of processors,
• all processors comprising the first set are identical,
• the second set includes processors of different speeds, and
• the aggregate performance of the first set of processors is the same as that of the second one.

The first set of interconnected processors represents a homogeneous network of computers, which is equivalent to the heterogeneous network of computers represented by the second set of processors just by design. Indeed, these two networks of computers share the same homogeneous communication segment and, therefore, have the same aggregate communication characteristics. More reliable results are obtained if the intersection of the two sets of processors is not empty. This allows us to better control the accuracy of experiments by checking that the same processor has the same speed in the heterogeneous network running the heterogeneous algorithm and in the homogeneous network running its homogeneous prototype. Higher confidence in the experimental assessment can be achieved by experimenting with several different pairs of processor sets from different segments. This approach is used in Lastovetsky and Reddy (2004b), Plaza, Plaza, and Valencia (2006, 2007), and Plaza (2007).
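Checking whether two candidate processor sets satisfy the conditions above is mechanical. The helper below is a hypothetical sketch (not from the cited papers) that takes the measured speeds of the two sets and verifies the cardinality, homogeneity, diversity, and aggregate-performance conditions:

```python
def is_equivalent_pair(homog_speeds, hetero_speeds, rel_tol=1e-9):
    """Check the design conditions for an equivalent pair of processor sets:
    same number of processors, first set homogeneous, second set of differing
    speeds, and equal aggregate performance."""
    if len(homog_speeds) != len(hetero_speeds):
        return False                       # same cardinality required
    if len(set(homog_speeds)) != 1:
        return False                       # first set must be homogeneous
    if len(set(hetero_speeds)) == 1:
        return False                       # second set must have differing speeds
    total_h, total_x = sum(homog_speeds), sum(hetero_speeds)
    return abs(total_h - total_x) <= rel_tol * max(total_h, total_x)
```

Since both sets sit in one homogeneous segment, the equal-communication condition holds by construction and needs no check here.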

If the target architecture for the heterogeneous algorithm is a set of heterogeneous processors interconnected via a heterogeneous communication network, the design of the experiments becomes much more complicated. Although a comprehensive solution of this problem is still a subject for research, one simple but quite typical case has been analyzed (Lastovetsky and Reddy, 2004b). Let the communication network of the target heterogeneous architecture consist of a number of relatively fast homogeneous communication segments interconnected by slower communication links. Let parallel communications between different pairs of processors be enabled within each of the segments (e.g., by using a switch, the number of ports of which is no less than the number of computers in the segment). Let communication links between different segments only support serial communication. Further design depends on the analyzed heterogeneous algorithm. Assume that the communication cost of the algorithm comes mainly from relatively rare point-to-point communications separated by a significant amount of computations, so that it is highly unlikely for two such communication operations to be performed in parallel. Also, assume that each such communication operation consists in passing a relatively long message. These assumptions allow us to use a very simple linear communication model, where the time t_{A→B}(d) of transferring a data block of size d from processor A to processor B is calculated as t_{A→B}(d) = s_{A→B} × d, where s_{A→B} is the constant speed of communication between processors A and B and s_{A→B} = s_{B→A}. Thus, under all these assumptions, the only aggregate characteristic of the communication network that has an impact on the execution time of the algorithm is the average speed of point-to-point communications.

To design experiments on the instrumental LAN in this case, we need to select two sets of processors so that

• both sets consist of the same number of processors,
• all processors comprising the first set are identical and belong to the same homogeneous communication segment,
• the second set includes processors of different speeds that span several communication segments,
• the aggregate performance of the first set of processors is the same as that of the second one, and
• the average speed of point-to-point communications between the processors of the second set is the same as the speed of point-to-point communications between the processors of the first set.


The first set of interconnected processors will represent a homogeneous network of computers, equivalent to the heterogeneous network of computers represented by the second set.

In a mathematical form, this problem can be formulated as follows. Let n be the number of processors in the first set, v be their speed, and s be the communication speed of the corresponding segment. Let the second set of processors, P, span m communication segments S_1, S_2, …, S_m. Let s_i be the communication speed of segment S_i, n_i be the number of processors of set P belonging to S_i, and v_{ij} be the speed of the j-th processor belonging to segment S_i (i = 1, …, m; j = 1, …, n_i). Let s_{ij} be the speed of the communication link between segments S_i and S_j (i, j = 1, …, m). Then,

\[
\frac{\sum_{i=1}^{m} s_i \times \frac{n_i \times (n_i - 1)}{2} + \sum_{i=1}^{m} \sum_{j=i+1}^{m} n_i \times n_j \times s_{ij}}{\frac{n \times (n - 1)}{2}} = s,
\tag{6.1}
\]

\[
\sum_{i=1}^{m} n_i = n,
\tag{6.2}
\]

\[
\sum_{i=1}^{m} \sum_{j=1}^{n_i} v_{ij} = n \times v.
\tag{6.3}
\]

Equation (6.1) states that the average speed of point-to-point communications between the processors of the second set should be equal to the speed of point-to-point communications between the processors of the first set. Equation (6.2) states that the total number of processors in the second set should be equal to the number of processors in the first set. Equation (6.3) states that the aggregate performance of the processors in the second set should be equal to the aggregate performance of the processors in the first set.
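Conditions (6.1)-(6.3) can be checked numerically for a candidate configuration. The sketch below is a hypothetical helper that assumes, for simplicity, one common speed for all inter-segment links (the full formulation allows a distinct s_ij per pair of segments):

```python
def check_equivalence(n, v, s, segments, inter_speed):
    """Numerically check conditions (6.1)-(6.3) for a candidate second set.

    segments    - list of (s_i, [v_i1, ..., v_in_i]) per communication segment
    inter_speed - common inter-segment link speed (a simplifying assumption)
    Returns the triple of booleans (eq_6_1, eq_6_2, eq_6_3).
    """
    m = len(segments)
    counts = [len(vs) for _, vs in segments]
    # (6.2): total processor count matches.
    eq2 = sum(counts) == n
    # (6.3): aggregate performance matches n * v.
    eq3 = abs(sum(sum(vs) for _, vs in segments) - n * v) < 1e-9
    # (6.1): average speed over all processor pairs equals s.
    intra = sum(s_i * c * (c - 1) / 2 for (s_i, _), c in zip(segments, counts))
    inter = sum(counts[i] * counts[j] * inter_speed
                for i in range(m) for j in range(i + 1, m))
    eq1 = abs((intra + inter) / (n * (n - 1) / 2) - s) < 1e-9
    return eq1, eq2, eq3
```

For instance, two two-processor segments with intra-segment speed 100, joined by links of speed 50, give an average pair speed of 400/6, so the homogeneous segment must have s = 400/6 for the pair of sets to match.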

6.2 SCALABILITY ANALYSIS OF HETEROGENEOUS ALGORITHMS

The methods presented in the previous section are mainly aimed at the comparative analysis of the efficiency of different homogeneous and heterogeneous algorithms in given heterogeneous environments. Some researchers consider scalability at least as important a property of heterogeneous parallel algorithms as efficiency. Scalability refers to the ability of an algorithm to increase the performance of the executing parallel system with incremental addition of processors to the system. The analysis of scalability is particularly important if the algorithm is designed for large-scale parallel systems.

Scalability of a parallel application has two distinct flavors:

• Strong scalability — where the problem size is fixed and the number of processors increases, and our goal is to minimize the time to solution. Here, scalability means that speedup is roughly proportional to the number of processors used. For example, if you double the number of processors but keep the problem size constant, then the problem takes half as long to complete (i.e., the speed doubles).

• Weak scalability — where the problem size and the number of processors expand, and our goal is to achieve constant time to solution for larger problems. In this case, scalability means the ability to maintain a fixed time to solution for solving larger problems on larger computers. For example, if you double the number of processors and double the problem size, then the problem takes the same amount of time to complete (i.e., the speed doubles).

Weak scaling is easier to achieve, at least relatively speaking. Weak scalability means, basically, that as you scale up the problem to a larger size, each processor does the same amount of computing. For example, if you double the size of a three-dimensional mesh in each dimension, you need eight times more processors.

Strong scaling, on the other hand, may not be as easy to achieve. Strong scalability means that for a given problem, as you scale up, say, from 100 to 800 processors, you apply this greater number of processors to the same mesh, so that each processor is now doing one-eighth as much work as before. You would like the job to run eight times faster, and achieving that may require restructuring how the program divides the work among processors, which increases the communication between them.
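These two notions are commonly quantified by Amdahl's law (strong scaling) and Gustafson's law (weak scaling). The sketch below is a standard textbook model rather than anything specific to this chapter; `serial_fraction` is the fraction of the work that cannot be parallelized:

```python
def strong_scaling_speedup(serial_fraction, p):
    """Amdahl's law: speedup at fixed problem size on p processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

def weak_scaling_speedup(serial_fraction, p):
    """Gustafson's law: scaled speedup when the problem grows with p."""
    return serial_fraction + (1.0 - serial_fraction) * p
```

With even a 5% serial fraction, strong-scaling speedup on 800 processors stays below 20, which is why restructuring the division of work matters so much in the strong-scaling regime.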

An example of an implementation demonstrating strong scalability comes from a molecular dynamics application called NAMD (Kumar et al., 2006). NAMD is a C++-based parallel program that uses object-based decomposition and measurement-based dynamic load balancing to achieve its high performance, and a combination of spatial decomposition and force decomposition to generate a high degree of parallelism. As a result, NAMD is one of the fastest and most scalable programs for biomolecular simulations. It shows strong scalability up to 8192 processors. For NAMD, the overlapping of communication with computation is critical to achieving such scalability.

Weak scalability of homogeneous algorithms designed for execution on homogeneous parallel systems has been intensively studied (Gustafson, Montry, and Benner, 1988; Zorbas, Reble, and VanKooten, 1989; Karp and Platt, 1990; Grama, Gupta, and Kumar, 1993; Zhang, Yan, and He, 1994; Dongarra, van de Geijn, and Walker, 1994; Chetverushkin et al., 1998). Nonetheless, the studies did not result in a unified method of scalability analysis that would be adopted by the research community. Among others, the isoefficiency approach (Grama, Gupta, and Kumar, 1993) is probably the most widely used. It is based on the notion of parallel efficiency, which is defined as the ratio of the speedup achieved by the algorithm on the parallel system to the number of processors involved in the execution of the algorithm:

\[
E(n, p) = \frac{A(n, p)}{p},
\]

where A(n, p) denotes the speedup.

It is important that the parallel efficiency is defined as a function of the size of the task solved by the algorithm, n, and the number of processors, p. The parallel efficiency characterizes the level of utilization of the processors by the algorithm. While the parallel efficiency can sometimes be greater than 1 due to superlinear speedup effects, it typically ranges from 0 to 1. Moreover, if the task size is fixed, the parallel efficiency of the algorithm will typically decrease with the increase of the number of processors involved in its execution. This is due to the increase of the overhead part in the execution time. On the other hand, the efficiency of the use of the same number of processors by the parallel algorithm usually increases with the increase of the task size.

Intuitively, the algorithm is scalable if it can use, with the same efficiency, an increasing number of processors for the solution of tasks of increasing size. Mathematically, the level of scalability is characterized by a so-called isoefficiency function, n = I(p), that determines how the task size should increase with the increase of the number of processors in order to ensure constant efficiency. The isoefficiency function is found from the isoefficiency condition, E(n, p) = k, where k denotes a positive constant. The isoefficiency function of a scalable algorithm should monotonically increase. The slower the function increases, the more scalable the algorithm is.
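As a worked toy example (an illustrative assumption, not an algorithm from the cited literature), suppose T_seq = n and T_par = n/p + c·p, so that E(n, p) = n/(n + c·p²). Solving E = k for n gives the isoefficiency function I(p) = k·c·p²/(1 − k), which grows quadratically:

```python
def efficiency(n, p, c=1.0):
    """Parallel efficiency for a toy algorithm with T_seq = n and
    T_par = n/p + c*p (a per-processor overhead growing linearly in p)."""
    t_par = n / p + c * p
    return n / (p * t_par)          # E = T_seq / (p * T_par) = n / (n + c*p**2)

def isoefficiency(p, k, c=1.0):
    """Task size n = I(p) keeping E(n, p) = k for the toy model above:
    solving n / (n + c*p**2) = k gives n = k*c*p**2 / (1 - k)."""
    return k * c * p ** 2 / (1.0 - k)
```

Because I(p) grows as p², this toy algorithm is only moderately scalable: doubling the number of processors requires quadrupling the task size to hold the efficiency constant.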

Extension of the isoefficiency approach to the scalability analysis of algorithms for heterogeneous parallel systems needs a new, more general definition of parallel efficiency. Two definitions have been given so far (independently by several researchers). The first one defines the parallel efficiency as the ratio of the real and ideal speedups (Chetverushkin et al., 1998; Mazzeo, Mazzocca, and Villano, 1998; Bazterra et al., 2005). The real speedup is the one achieved by the parallel algorithm on the heterogeneous parallel system. The ideal speedup is defined as the sum of speeds of the processors of the executing heterogeneous parallel system divided by the speed of a base processor (Chetverushkin et al., 1998). All speedups are calculated relative to the serial execution of the algorithm on the base processor.

According to the second definition, the parallel efficiency is the ratio of the ideal and real execution times (Zhang and Yan, 1995; Chamberlain, Chace, and Patil, 1998; Pastor and Bosque, 2001; Kalinov, 2006). In particular, in Kalinov (2006), the parallel efficiency of the algorithm solving a task of size n on p heterogeneous processors of the speeds S = {s_1, s_2, …, s_p} is defined as

\[
E(n, p, S) = \frac{T_{ideal}(n, s_{seq}, p, S)}{T_{par}(n, p, S)},
\]

where s_{seq} is the speed of the base processor. The ideal parallel execution time is calculated under the assumptions that the communication and other overheads of the parallel execution are free and that the computational load of the processors is perfectly balanced:

\[
T_{ideal}(n, s_{seq}, p, S) = \frac{s_{seq}}{\sum_{i=1}^{p} s_i} \times T_{seq}(n, s_{seq}) = \frac{s_{seq}}{p \times s_{aver}} \times T_{seq}(n, s_{seq}),
\]

where s_{aver} is the average speed of the p processors and T_{seq}(n, s_{seq}) is the execution time of the sequential solution of this task (of size n) on the base processor. Thus, the parallel efficiency will be given by

\[
E(n, p, S) = \frac{s_{seq}}{s_{aver}} \times \frac{T_{seq}(n, s_{seq})}{p \times T_{par}(n, p, S)}.
\]

For a homogeneous system, s_{seq} = s_i = s_{aver} = const, and the formula takes its usual form

\[
E(n, p) = \frac{T_{seq}(n)}{p \times T_{par}(n, p)} = \frac{A(n, p)}{p}.
\]

For a heterogeneous system, the isoefficiency condition E(n, p, S) = k is used to find the isoefficiency function n = I(p, S). In Kalinov (2006), this approach is used for the scalability analysis of heterogeneous modifications of the SUMMA parallel matrix multiplication algorithm (van de Geijn and Watts, 1997) under the additional assumption that the minimal (s_min), maximal (s_max), and average (s_aver) processor speeds do not change with the increase in the number of processors (p) in the heterogeneous system. In this case, the formulae for the isoefficiency functions only include the integral characteristics of the heterogeneous parallel system, namely, p, s_min, s_max, and s_aver, making the isoefficiency functions functions of the number of processors only, not functions of the speeds of the individual processors. Moreover, for some algorithms, s_min and s_max appear in the formulae only in the form of their ratio, which characterizes the level of heterogeneity of the considered parallel systems.
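Kalinov's definition of parallel efficiency is straightforward to evaluate once the sequential and parallel execution times and the processor speeds are measured. The helper below is a direct transcription of the formula above (the variable names are ours, not Kalinov's):

```python
def hetero_efficiency(t_seq, t_par, s_seq, speeds):
    """Heterogeneous parallel efficiency per the definition above:
    E(n, p, S) = (s_seq / s_aver) * T_seq / (p * T_par),
    where s_aver is the average speed of the p processors in `speeds`."""
    p = len(speeds)
    s_aver = sum(speeds) / p
    return (s_seq / s_aver) * t_seq / (p * t_par)
```

When all speeds equal s_seq, the ratio s_seq/s_aver is 1 and the expression reduces to the classical T_seq/(p·T_par); note also that only the average of the speeds enters this definition, not their distribution.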

Another approach to scalability, the isospeed approach (Sun and Rover, 1994), is based on the notion of the speed of a computer system, defined as the work performed by the computer system divided by the execution time. This approach has been extended to heterogeneous systems (Sun, Chen, and Wu, 2005). The average speed of a processor is defined as the ratio of the speed achieved by the parallel system to the number of processors. The algorithm will be scalable if the average speed of a processor can be maintained at a constant level by increasing the size of the problem when the number of processors increases. A software tool for scalability testing and analysis based on the isospeed approach has also been developed (Chen and Sun, 2006).