Global FFT, Global EP-STREAM Triad, HPL
written in MC#
Vadim B. Guzev [email protected]
Russian Peoples' Friendship University, October 2006
Global FFT
In this submission we show how to implement Global FFT in the MC# programming language. We concentrate on the process of writing parallel distributed programs rather than on performance or line-count metrics. We are quite sure that the future belongs to very high-level programming languages, and that one day the productivity of programmers will become more important than the productivity of platforms. That is why MC# was born…
In object-oriented languages all programs are composed of objects and their interactions. Naturally, when a programmer starts thinking about a problem, he first wants to describe the object model before writing any logic. In the Global FFT program these classes are Complex (a structure) and GlobalFFT (the algorithm). We start writing our program by defining the Complex class:
public class Complex {
    public Complex() { }
    public Complex( double re, double im ) { Re = re; Im = im; }
    public double Re = 0;
    public double Im = 0;
}
The simple math behind the Global FFT problem is the following:

Z_k = \sum_{j=1}^{m} z_j \, e^{-\frac{2\pi i j k}{m}}
    = \sum_{j=1}^{m} z_j \left( \cos\frac{2\pi j k}{m} - i \sin\frac{2\pi j k}{m} \right)
    = \sum_{j=1}^{m} \left( \mathrm{Re}\, z_j + i \,\mathrm{Im}\, z_j \right) \left( \cos\frac{2\pi j k}{m} - i \sin\frac{2\pi j k}{m} \right)

    = \sum_{j=1}^{m} \left( \mathrm{Re}\, z_j \cos\frac{2\pi j k}{m} + \mathrm{Im}\, z_j \sin\frac{2\pi j k}{m} \right)
    + i \sum_{j=1}^{m} \left( \mathrm{Im}\, z_j \cos\frac{2\pi j k}{m} - \mathrm{Re}\, z_j \sin\frac{2\pi j k}{m} \right)
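To sanity-check this direct O(m²) summation, here is a small Python sketch (the submission's code is MC#; this illustration uses the standard 0-based DFT indexing, whereas the MC# kernel below uses (j+1)(k+1) offsets):

```python
import cmath

def dft(z):
    """Direct O(m^2) DFT: Z_k = sum_j z_j * exp(-2*pi*i*j*k/m)."""
    m = len(z)
    return [sum(z[j] * cmath.exp(-2j * cmath.pi * j * k / m) for j in range(m))
            for k in range(m)]

# For a constant input all the energy lands in Z_0: Z_0 = m, Z_k = 0 otherwise.
Z = dft([1.0] * 8)
print(abs(Z[0]), max(abs(v) for v in Z[1:]))
```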
The natural way to distribute this computation is to split the execution by the index k, and that is exactly what we will do. In MC#, if you want a method to be executed in a different thread/node/cluster, all you need to do is mark this method as movable (a distributed analogue of void, or of the async keyword of C# 3.0). Where exactly a movable method will be executed is determined by the Runtime system, and the call of a movable method returns on the caller's side almost immediately (i.e. the caller does not wait until the method execution is completed). In our case this movable method will receive as parameters: (a) the array of complex values z, (b) the current processor number, and (c) a special channel into which the result will be sent.
movable Calculate( Complex[] z, int processorNumber, Channel( Complex[] ) channelZ ) {
    int np = CommWorld.Size;
    if ( np > 1 ) np -= 1; // When the program runs in distributed mode the frontend is included in CommWorld.Size
    int partLength = z.Length / np;
    int shift = processorNumber * partLength;
    Complex[] partOfZ = new Complex [partLength];
    for ( int k = 0; k < partLength; k++ ) {
        partOfZ [k] = new Complex();
        for ( int j = 0; j < z.Length; j++ ) {
            double arg = 2 * Math.PI * (j + 1) * (k + shift + 1) / (double) z.Length;
            double cos = Math.Cos( arg );
            double sin = Math.Sin( arg );
            partOfZ [k].Re += z [j].Re * cos + z [j].Im * sin;
            partOfZ [k].Im += z [j].Im * cos - z [j].Re * sin;
        }
    }
    channelZ.Send( partOfZ, processorNumber );
}
As you can see, in MC# it is possible to use almost any type of parameter for movable methods. When distributed mode is enabled, these parameters are automatically serialized and sent to the remote node. The same applies to channels: values of any .NET type which supports serialization can be sent through them. To get results out of channels you have to connect them with synchronous methods; such a connection is known as a bound in languages like Polyphonic C#, C# 3.0 or MC#. More information about bounds is available on the MC# site.
void Get( ref Complex[] Z ) & Channel CZ( Complex[] partOfZ, int processorNumber ) {
    int shift = processorNumber * partOfZ.Length;
    for ( int i = 0; i < partOfZ.Length; i++ ) {
        Z [i + shift].Re = partOfZ [i].Re;
        Z [i + shift].Im = partOfZ [i].Im;
    }
}
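In plain Python, this Get & CZ bound amounts to copying each worker's slice into the result vector at offset processorNumber * partLength; a minimal sketch (names mirror the MC# code, but this is not MC#):

```python
def merge_part(Z, part, processor_number):
    """Mimic the Get & CZ bound: copy a worker's slice into Z at its offset."""
    shift = processor_number * len(part)
    for i, v in enumerate(part):
        Z[i + shift] = v

Z = [0.0] * 6
merge_part(Z, [1.0, 2.0], 1)   # worker 1 owns indices 2..3
merge_part(Z, [5.0, 6.0], 0)   # worker 0 owns indices 0..1
print(Z)
```

Note that, as in the MC# program, the parts may arrive in any order; the offset alone determines where each slice lands.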
And finally let’s write down the main method which will launch the computation:
public static void Main( string[] args ) {
    GlobalFFT fft = new GlobalFFT();
    // Getting m as a parameter of the program (m = 2^n)
    int m = Int32.Parse( args [0] );
    Random r = new Random();
    Complex[] z = new Complex [m];
    // Initializing vector z
    for ( int i = 0; i < m; i++ ) z [i] = new Complex( r.NextDouble(), r.NextDouble() );
    int np = CommWorld.Size;
    if ( np > 1 ) np -= 1;
    // Launching the processing functions
    for ( int k = 0; k < np; k++ ) fft.Calculate( z, k, fft.CZ );
    // Collecting the results - the result is saved to vector z
    for ( int k = 0; k < np; k++ ) fft.Get( ref z );
}
So, the first version of our Global FFT program is the following:

using System;

public class Complex {
    public Complex() { }
    public Complex( double re, double im ) { Re = re; Im = im; }
    public double Re = 0;
    public double Im = 0;
}

public class GlobalFFT {
    movable Calculate( Complex[] z, int processorNumber, Channel( Complex[] ) channelZ ) {
        int np = CommWorld.Size;
        if ( np > 1 ) np -= 1; // When the program runs in distributed mode the cluster's frontend is included in CommWorld.Size
        int partLength = z.Length / np;
        int shift = processorNumber * partLength;
        Complex[] partOfZ = new Complex [partLength];
        for ( int k = 0; k < partLength; k++ ) {
            partOfZ [k] = new Complex();
            for ( int j = 0; j < z.Length; j++ ) {
                double arg = 2 * Math.PI * (j + 1) * (k + shift + 1) / (double) z.Length;
                double cos = Math.Cos( arg );
                double sin = Math.Sin( arg );
                partOfZ [k].Re += z [j].Re * cos + z [j].Im * sin;
                partOfZ [k].Im += z [j].Im * cos - z [j].Re * sin;
            }
        }
        channelZ.Send( partOfZ, processorNumber );
    }

    void Get( ref Complex[] Z ) & Channel CZ( Complex[] partOfZ, int processorNumber ) {
        int shift = processorNumber * partOfZ.Length;
        for ( int i = 0; i < partOfZ.Length; i++ ) {
            Z [i + shift].Re = partOfZ [i].Re;
            Z [i + shift].Im = partOfZ [i].Im;
        }
    }

    public static void Main( string[] args ) {
        GlobalFFT fft = new GlobalFFT();
        int m = Int32.Parse( args [0] );
        Random r = new Random();
        Complex[] z = new Complex [m];
        for ( int i = 0; i < m; i++ ) z [i] = new Complex( r.NextDouble(), r.NextDouble() );
        int np = CommWorld.Size;
        if ( np > 1 ) np -= 1;
        for ( int k = 0; k < np; k++ ) fft.Calculate( z, k, fft.CZ );
        for ( int k = 0; k < np; k++ ) fft.Get( ref z );
    }
}
Parallel programs written in the MC# language can be executed either in local mode (i.e. as simple .exe files; in this mode all movable methods are executed in different threads) or in distributed mode, in which case all movable calls are distributed across the nodes of a Cluster/MetaCluster/GRID network (depending on the currently used Runtime). That means the programmer can write and debug his program locally (for example on his Windows machine), then copy it to a Windows-based or Linux-based cluster and run it in distributed mode. The user can even emulate a cluster environment on his home computer! MC# makes cluster computations accessible to every programmer, even to those who currently have no access to clusters. Let's try to run this program in local mode on a Windows machine:
R:\projects\MCSharp\hpcchallenge>GlobalFFT.exe 1024
________________________________________________
==MC# Statistics================================
Number of movable calls: 1
Number of channel messages: 1
Number of movable calls (across network): 0
Number of channel messages (across network): 0
Total size of movable calls (across network): 0 bytes
Total size of channel messages (across network): 0 bytes
Total time of movable calls serialization: 00:00:00.0156250
Total time of channel messages serialization: 00:00:00
Total size of transported messages: 0 bytes
Total time of transporting messages: 00:00:00
Session initialization time: 00:00:00.0312500 / 0.03125 sec. / 31.25 msec.
Total time: 00:00:00.4531250 / 0.453125 sec. / 453.125 msec.
________________________________________________
Or we can run this program in local mode on Linux machine:
[vadim@skif gfft]$ mono GlobalFFT.exe 1024
MC#.Runtime, v. 1.13.2437.555
________________________________________________
==MC# Statistics================================
Number of movable calls: 1
Number of channel messages: 1
Number of movable calls (across network): 0
Number of channel messages (across network): 0
Total size of movable calls (across network): 0 bytes
Total size of channel messages (across network): 0 bytes
Total time of movable calls serialization: 00:00:00.0869900
Total time of channel messages serialization: 00:00:00
Total size of transported messages: 0 bytes
Total time of transporting messages: 00:00:00
Session initialization time: 00:00:00.1355990 / 0.135599 sec. / 135.599 msec.
Total time: 00:00:00.5795040 / 0.579504 sec. / 579.504 msec.
________________________________________________
OK, it works. Now let's try to run this program on more serious hardware. We'll use a 16-node cluster with the following configuration:

[vadim@skif vadim]$ uname -a
Linux skif 2.4.27 #1 SMP Thu Apr 14 15:25:11 MSD 2005 i686 athlon i386 GNU/Linux
Let’s run our program on 16 processors:
[vadim@skif gfft]$ mono GlobalFFT.exe 16384 /np 16
MC#.Runtime, v. 1.13.2437.555
________________________________________________
==MC# Statistics================================
Number of movable calls: 16
Number of channel messages: 16
Number of movable calls (across network): 16
Number of channel messages (across network): 16
Total size of movable calls (across network): 7871648 bytes
Total size of channel messages (across network): 495824 bytes
Total time of movable calls serialization: 00:00:13.8864840
Total time of channel messages serialization: 00:00:01.0427460
Total size of transported messages: 8369192 bytes
Total time of transporting messages: 00:00:00.5808540
Session initialization time: 00:00:00.4839560 / 0.483956 sec. / 483.956 msec.
Total time: 00:00:30.1959760 / 30.195976 sec. / 30195.976 msec.
________________________________________________
Here are the results (execution time in seconds for different vector sizes m and numbers of processors):

GlobalFFT.mcs
np      m=65536   m=32768   m=16384
1       -         476.55    112.52
2       -         257.54    59.47
4       604.2     149.11    36.90
8       401.7     106.95    28.68
16      329.1     97.70     30.20

[Graph: execution time (sec) vs. number of processors (1-16) for m = 65536, 32768, 16384.]
Not bad, especially if we take into account that we wrote this program in a modern high-level object-oriented language without thinking about any optimization issues or about the physical structure of the computational platform. Now let's try to optimize it a little bit. The main problem of the first version of our program is that we need to move thousands of complex user-defined objects from the frontend to the cluster nodes and back. The serialization/deserialization of such objects takes a lot of resources and time. We can significantly reduce the execution time if we replace the "arrays of Complex" with "arrays of doubles". Here is the modified version of the program (lines 46, NCSL 43, TPtoks 490):
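The cost of serializing object graphs versus flat arrays is easy to see in any language; here is a hedged Python analogue using pickle (a stand-in for .NET serialization, purely to illustrate the size difference between a list of two-field objects and a flat array of doubles):

```python
import pickle
from array import array

class Complex:                 # object form, analogous to the first version
    def __init__(self, re, im):
        self.re, self.im = re, im

n = 1024
objs = [Complex(float(i), -float(i)) for i in range(n)]
flat = array('d', (x for i in range(n) for x in (float(i), -float(i))))

objs_size = len(pickle.dumps(objs))
flat_size = len(pickle.dumps(flat))
print(objs_size, flat_size)    # the object form carries per-object overhead
```

The flat representation stores only the raw 8-byte doubles plus a small header, while the object list pays per-instance bookkeeping, which is the same effect the MC# rewrite below exploits.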
using System;

public class GlobalFFT {
    movable FFT( double[] zRe, double[] zIm, int processorNumber, Channel( double[], double[], int ) channelZ ) {
        int np = CommWorld.Size;
        if ( np > 1 ) np -= 1; // When the program runs in distributed mode the cluster's frontend is included in CommWorld.Size
        int partLength = zRe.Length / np;
        int shift = processorNumber * partLength;
        double[] partOfZRe = new double [partLength];
        double[] partOfZIm = new double [partLength];
        double multiplier = 2 * Math.PI / (double) zRe.Length;
        for ( int k = 0; k < partLength; k++ ) {
            for ( int j = 0; j < zRe.Length; j++ ) {
                double arg = multiplier * (j + 1) * (k + shift + 1);
                double cos = Math.Cos( arg );
                double sin = Math.Sin( arg );
                partOfZRe [k] += zRe [j] * cos + zIm [j] * sin;
                partOfZIm [k] += zIm [j] * cos - zRe [j] * sin;
            }
        }
        channelZ.Send( partOfZRe, partOfZIm, processorNumber );
    }

    void Get( ref double[] ZRe, ref double[] ZIm ) & Channel CZ( double[] partOfZRe, double[] partOfZIm, int processorNumber ) {
        int shift = processorNumber * partOfZRe.Length;
        for ( int i = 0; i < partOfZRe.Length; i++ ) {
            ZRe [i + shift] = partOfZRe [i];
            ZIm [i + shift] = partOfZIm [i];
        }
    }

    public static void Main( string[] args ) {
        GlobalFFT fft = new GlobalFFT();
        int m = Int32.Parse( args [0] );
        Random r = new Random();
        double[] zRe = new double [m];
        double[] zIm = new double [m];
        for ( int i = 0; i < m; i++ ) {
            zRe [i] = r.NextDouble();
            zIm [i] = r.NextDouble();
        }
        int np = CommWorld.Size;
        if ( np > 1 ) np -= 1;
        for ( int k = 0; k < np; k++ ) fft.FFT( zRe, zIm, k, fft.CZ );
        for ( int k = 0; k < np; k++ ) fft.Get( ref zRe, ref zIm );
    }
}
Let's run this version of the program:

[vadim@skif gfft]$ mono GlobalFFT_arrays.exe 32768 /np 16
MC#.Runtime, v. 1.13.2437.555
________________________________________________
==MC# Statistics================================
Number of movable calls: 32
Number of channel messages: 32
Number of movable calls (across network): 32
Number of channel messages (across network): 32
Total size of movable calls (across network): 16792256 bytes
Total size of channel messages (across network): 1054848 bytes
Total time of movable calls serialization: 00:00:00.2735930
Total time of channel messages serialization: 00:00:00.3594260
Total size of transported messages: 17849757 bytes
Total time of transporting messages: 00:00:01.2621410
Session initialization time: 00:00:00.4820490 / 0.482049 sec. / 482.049 msec.
Total time: 00:00:28.8684640 / 28.868464 sec. / 28868.464 msec.
________________________________________________

Now we can get some performance numbers for the second version of our Global FFT program:
GlobalFFT_arrays.mcs
np      m=131072  m=65536   m=32768   m=16384
1       -         844.50    210.07    52.51
2       -         425.47    106.15    26.82
4       858.79    219.76    53.70     13.78
8       432.52    108.12    27.69     7.39
16      218.22    55.45     14.93     4.37

[Graph: execution time (sec) vs. number of processors (1-16) for m = 131072, 65536, 32768, 16384.]
Global EP-STREAM-Triad
There is one more task in the Class 2 Specification, entitled Global EP-STREAM-Triad. Although C# currently does not support kernel vector operations, we think it is still a good example to demonstrate the syntax of MC#. We'll write a simple program which performs the same calculations on several nodes simultaneously and then prints the average time taken over all nodes. There will be only one movable method in this program; it accepts a special Channel through which only objects of class TimeSpan can be sent:
movable fun( Channel( TimeSpan ) result ) {
    int m = 2000000;
    int n = 1000;
    Random r = new Random();
    double[] a = new double[m];
    double[] b = new double[m];
    double[] c = new double[m];
    for ( int i = 0; i < m; i++ ) {
        b [i] = r.NextDouble();
        c [i] = r.NextDouble();
    }
    double alpha = r.NextDouble();
    TimeSpan ts = new TimeSpan(0);
    for ( int i = 0; i < n; i++ ) {
        DateTime from = DateTime.Now;
        for ( int j = 0; j < m; j++ ) a [j] = b [j] + alpha * c[j];
        ts += DateTime.Now.Subtract( from );
    }
    result.Send( ts );
}
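The kernel inside this movable method is the classic STREAM Triad, a[j] = b[j] + alpha * c[j], timed over repeated passes. A minimal Python sketch of the same measurement (fixed inputs instead of random ones, smaller sizes, purely illustrative):

```python
import time

def stream_triad(m=100_000, n=5):
    """Run the Triad kernel a[j] = b[j] + alpha * c[j] n times; return (elapsed seconds, a)."""
    b = [0.5] * m
    c = [0.25] * m
    a = [0.0] * m
    alpha = 2.0
    start = time.perf_counter()
    for _ in range(n):
        for j in range(m):
            a[j] = b[j] + alpha * c[j]
    return time.perf_counter() - start, a

elapsed, a = stream_triad()
print(f"{elapsed:.3f} s, a[0] = {a[0]}")
```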
Movable methods cannot return values; channels must be used instead to pass information between nodes. To read from "semi-directional" channels, bounds (special syntax constructs which can synchronize multiple threads) must be used. In our case we need only one bound:

TimeSpan GetResult() & Channel result( TimeSpan ts ) { return ts; }

When thread A calls the method GetResult, the Runtime system checks whether any object has been delivered to the result channel and queued in the special channel queue. If no objects have been received, thread A is suspended until the result channel receives some object; when the object arrives, thread A is resumed and the reading from the channel occurs. Conversely, when an object is sent to the result channel and there are no waiting callers of GetResult, the object is put into the special channel queue; it will be read when GetResult is eventually called.
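These blocking semantics are the same as those of a thread-safe queue; a minimal Python sketch of the suspend-until-delivery behaviour (queue.Queue stands in for the MC# channel, a thread for the movable method):

```python
import threading
import queue
import datetime

result = queue.Queue()      # plays the role of the channel's special queue

def worker():
    # simulate a movable method sending its TimeSpan-like result
    result.put(datetime.timedelta(seconds=1))

t = threading.Thread(target=worker)
t.start()
ts = result.get()           # like GetResult: blocks until a value is sent
t.join()
print(ts)
```

If the worker sends before anyone reads, the value simply waits in the queue, mirroring the second case described above.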
Here is the final version of this EP-STREAM Triad program (lines 36, NCSL 33, TPtoks 311):

using System;

public class EPStream {
    movable fun( Channel( TimeSpan ) result ) {
        int m = 2000000;
        int n = 1000;
        Random r = new Random();
        double[] a = new double[m];
        double[] b = new double[m];
        double[] c = new double[m];
        for ( int i = 0; i < m; i++ ) {
            b [i] = r.NextDouble();
            c [i] = r.NextDouble();
        }
        double alpha = r.NextDouble();
        TimeSpan ts = new TimeSpan(0);
        for ( int i = 0; i < n; i++ ) {
            DateTime from = DateTime.Now;
            for ( int j = 0; j < m; j++ ) a [j] = b [j] + alpha * c[j];
            ts += DateTime.Now.Subtract( from );
        }
        result.Send( ts );
    }

    TimeSpan GetResult() & Channel result( TimeSpan ts ) { return ts; }

    public static void Main( string[] args ) {
        EPStream e = new EPStream();
        for ( int i = 0; i < CommWorld.Size; i++ ) e.fun( e.result );
        TimeSpan total = new TimeSpan(0);
        for ( int i = 0; i < CommWorld.Size; i++ ) total += e.GetResult();
        Console.WriteLine("Average: " + new TimeSpan( total.Ticks/CommWorld.Size ));
    }
}

Let's run this program:

[vadim@skif gfft]$ mono EPStream.exe /np 16
MC#.Runtime, v. 1.13.2437.555
Average: 00:01:13.4992476
________________________________________________
==MC# Statistics================================
Number of movable calls: 17
Number of channel messages: 17
Number of movable calls (across network): 17
Number of channel messages (across network): 17
Total size of movable calls (across network): 6647 bytes
Total size of channel messages (across network): 2958 bytes
Total time of movable calls serialization: 00:00:00.0483080
Total time of channel messages serialization: 00:00:00.3021390
Total size of transported messages: 11321 bytes
Total time of transporting messages: 00:00:00.0160980
Session initialization time: 00:00:00.4875540 / 0.487554 sec. / 487.554 msec.
Total time: 00:01:35.2171990 / 95.217199 sec. / 95217.199 msec.
________________________________________________
The kernel timed on every node is the STREAM Triad update

x_i = y_i + \alpha \cdot u_i, \quad i = 1, \ldots, n.

HPL

HPL solves a linear system of equations of order n, Ax = b, by first computing the factorization A = LU and then solving the equations Ly = b and Ux = y one by one. In this scheme L and U are triangular matrices, so solving these two systems is not a problem; the real problem is the calculation of the matrices L and U themselves.

The simple math behind this problem is the following (we take U with a unit diagonal, as in the code below):

a_{ij} = \sum_{k=1}^{\min(i,j)} l_{ik} u_{kj},

l_{ij} = a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj}, \quad i \ge j, \qquad
u_{ij} = \frac{1}{l_{ii}} \left( a_{ij} - \sum_{k=1}^{i-1} l_{ik} u_{kj} \right), \quad i < j, \qquad u_{ii} = 1,

y_i = \frac{1}{l_{ii}} \left( b_i - \sum_{k=1}^{i-1} l_{ik} y_k \right), \qquad
x_i = y_i - \sum_{k=i+1}^{n} u_{ik} x_k.

[Figure: order of the block computations of L and U; the number in parentheses next to each block (L11 (1), U11 (1), U12 (2), L21 (2), ...) is the step at which that block is computed.]
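As a concrete illustration of this scheme, here is a sequential Python sketch of the same Crout-style factorization with unit diagonal in U, followed by the two triangular solves (an unblocked toy version, not the distributed MC# algorithm; no pivoting, so it assumes nonzero l_ii):

```python
def lu_solve(A, b):
    """LU with unit diagonal in U, then solve Ly = b and Ux = y."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        U[j][j] = 1.0
        for i in range(j, n):          # column j of L
            L[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(j))
        for i in range(j + 1, n):      # row j of U
            U[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(j))) / L[j][j]
    y = [0.0] * n
    for i in range(n):                 # forward substitution: Ly = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):       # backward substitution: Ux = y (u_ii = 1)
        x[i] = y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))
    return x

x = lu_solve([[4.0, 3.0], [6.0, 3.0]], [10.0, 12.0])
print(x)
```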
And the calculation dependencies graph is the following:
[Figure: the calculation dependency graph for the blocks of L and U, annotated with the completion time of each block under the cost model described below.]
Actually, there are a lot of modifications of the HPL algorithm out there. One of the most important things in HPL is the way calculated panels are broadcast to the other nodes. For example, in the Increasing Ring algorithm process 0 sends two messages and process 1 only receives one message (0→1, 0→2, 2→3 and so on). This algorithm is almost always better, if not the best. Let's suppose that for a given matrix A the calculation in one node takes approximately ~1 minute, and the transfer of matrices between nodes takes ~1/10 minute. Using these primitive estimations we can calculate that the whole factorization takes 11.6 minutes in the Increasing Ring version vs. 36 minutes in the sequential version. In practice it is even better, because this algorithm saves bandwidth:
[Figure: timeline of the Increasing Ring broadcast across processes 0-3, with the completion times of the successive panel computations and transfers.]
So, we know that better algorithms do exist, but they are quite complex to understand, and the purpose of our submission is not to get the highest performance results but to show the principles of programming in the MC# language. For simplicity we will therefore use the simplest communication structure, where each process communicates directly with the top, left, bottom and right processes in the process grid. Each process is connected with its neighbours by bi-directional channels (BDChannel); using these channels, processes communicate by sending and receiving messages.
public static void Main( string[] args ) {
if ( args.Length < 3 ) {
Console.WriteLine( "Usage: HPL.exe n p q" );
Console.WriteLine( "Where n - size of matrix A, p - height of process grid, q - width of process grid" );
return;
}
int n = Int32.Parse( args [0] );
int p = Int32.Parse( args [1] );
int q = Int32.Parse( args [2] );
double[,] a = new double [n, n];
double[] b = new double [n];
int maxRandNum = 100;
// Generate matrix A - Expected mean must be equal to zero
Random rand = new Random();
for ( int i = 0; i < n; i++ )
for ( int j = 0; j < n; j++ ) a [i,j] = rand.NextDouble() * maxRandNum - maxRandNum/2;
// Generate vector b - Expected mean must be equal to zero
for ( int i = 0; i < n; i++ ) b[i] = rand.NextDouble() * maxRandNum - maxRandNum/2;
DateTime dt1, dt2;
TimeSpan dt;
dt1 = DateTime.Now;
HPLAlgorithm hpl = new HPLAlgorithm(); // Creating an instance of the HPL algorithm
double[] x = hpl.Solve( a, b, n, p, q ); // Launching the algorithm
dt2 = DateTime.Now;
dt = dt2.Subtract( dt1 );
Console.WriteLine( "\nElapsed time: " + dt.TotalSeconds + " sec.\n" );
double performance = ( 2.0 * n * n * n / 3.0 + 3.0 * n * n / 2.0 ) * 1.0e-9 / (double) dt.TotalSeconds; // 1e-9 converts flops to Gflops
Console.WriteLine( "Performance = " + performance + " Gflop/sec" );
bool correctness = Verify( a, b, x, n );
if ( correctness == true )
Console.WriteLine( "Solution is correct" );
else
Console.WriteLine( "Solution is incorrect" );
}
The Main method of our program is quite simple; actually it is written in pure C# (no MC#-specific syntax is used here). First of all we generate matrix A and vector b; after that we instantiate an HPLAlgorithm object and solve the system by calling the Solve method. Finally we verify the solution. See the comments in the code for a better understanding.
public static bool Verify( double[,] A, double[] b, double[] x, int n ) {
int i, j;
double tmp;
double eps = Math.Pow( 2, -52 ); // machine epsilon for IEEE double precision
double Ax_b_infin = 0.0; // || A x - b ||_infinity
for ( i = 0; i < n; i++ ) {
tmp = 0.0;
for ( j = 0; j < n; j++ ) tmp += A [i,j] * x [ j ];
tmp = Math.Abs( tmp - b [i] );
if ( tmp > Ax_b_infin ) Ax_b_infin = tmp;
}
double A_infin = 0.0; // || A ||_infinity
for ( i = 0; i < n; i++ ) {
tmp = 0.0;
for ( j = 0; j < n; j++ ) tmp += Math.Abs( A [i,j] );
if ( tmp > A_infin ) A_infin = tmp;
}
double A_1 = 0.0; // || A ||_1
for ( j = 0; j < n; j++ ) {
tmp = 0.0;
for ( i = 0; i < n; i++ ) tmp += Math.Abs( A [i,j] );
if ( tmp > A_1 ) A_1 = tmp;
}
double x_1 = 0.0; // || x ||_1
for ( i = 0; i < n; i++ ) x_1 += Math.Abs( x [i] );
double x_infin = 0.0; // || x ||_infinity
for ( i = 0; i < n; i++ )
if ( Math.Abs( x [ i ] ) > x_infin ) x_infin = Math.Abs( x [ i ] );
double r1 = Ax_b_infin / ( eps * A_1 * n );
double r2 = Ax_b_infin / ( eps * A_1 * x_1 );
double r3 = Ax_b_infin / ( eps * A_infin * x_infin * n );
double r;
if ( r1 > r2 ) r = r1; else r = r2;
if ( r3 > r ) r = r3;
Console.WriteLine( "r1 = " + r1 + "\n" + "r2 = " + r2 + "\n" + "r3 = " + r3 );
Console.WriteLine( "Max ri = " + r );
if ( r < 16 ) return true; else return false;
}
First of all, let's look at the accessory methods: Verify, GetSubMatrix and GetSubVector. The Verify method checks the solution using the criteria given in the HPC Challenge Awards: Class 2 Specification:
\|A\|_1 = \max_j \sum_i |a_{ij}|, \qquad \|A\|_\infty = \max_i \sum_j |a_{ij}|, \qquad \|x\|_1 = \sum_i |x_i|, \qquad \|x\|_\infty = \max_i |x_i|,

r_1 = \frac{\|Ax - b\|_\infty}{\varepsilon \, \|A\|_1 \, n}, \qquad
r_2 = \frac{\|Ax - b\|_\infty}{\varepsilon \, \|A\|_1 \, \|x\|_1}, \qquad
r_3 = \frac{\|Ax - b\|_\infty}{\varepsilon \, \|A\|_\infty \, \|x\|_\infty \, n},

\max( r_1, r_2, r_3 ) < 16.
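These residuals translate directly into a few lines of Python; a sketch mirroring the Verify method (pure lists, machine epsilon for eps):

```python
def residuals(A, b, x):
    """Compute the HPCC verification residuals r1, r2, r3 for Ax = b."""
    n = len(A)
    eps = 2.0 ** -52                                    # machine epsilon
    Ax_b = max(abs(sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
    A_inf = max(sum(abs(A[i][j]) for j in range(n)) for i in range(n))
    A_1 = max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))
    x_1 = sum(abs(v) for v in x)
    x_inf = max(abs(v) for v in x)
    r1 = Ax_b / (eps * A_1 * n)
    r2 = Ax_b / (eps * A_1 * x_1)
    r3 = Ax_b / (eps * A_inf * x_inf * n)
    return r1, r2, r3

# an exact solution gives zero residuals, so the r < 16 check passes
r = residuals([[4.0, 3.0], [6.0, 3.0]], [10.0, 12.0], [1.0, 2.0])
print(max(r) < 16)
```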
public static double[,] GetSubMatrix( double[,] a, int fromH, int toH, int fromW, int toW ) {
    int h = toH - fromH;
    int w = toW - fromW;
    double[,] b = new double[h, w];
    for ( int i = 0; i < h; i++ )
        for ( int j = 0; j < w; j++ ) b [i, j] = a [fromH+i, fromW+j];
    return b;
}

public static double[] GetSubVector( double[] b, int fromH, int toH ) {
    int h = toH - fromH;
    double[] c = new double[h];
    for ( int i = 0; i < h; i++ ) c [i] = b [fromH + i];
    return c;
}
public double[] Solve( double[,] a, double[] b, int n, int p, int q ) {
BDChannel[,] bdc = new BDChannel [p,q];
// Creating a p * q grid of bi-directional channels, one per process
for ( int i = 0; i < p; i++ )
for ( int j = 0; j < q; j++ ) bdc [i, j] = new BDChannel();
int diffP = n / p;
int diffQ = n / q;
for ( int i = 0; i < p; i++ ) {
int h = diffP;
if ( i == p - 1 ) h += n % p;
double[] partOfB = GetSubVector( b, i * diffP, i * diffP + h );
for ( int j = 0; j < q; j++ ) {
int w = diffQ;
if ( j == q - 1 ) w += n % q;
double[,] partOfA = GetSubMatrix( a, i * diffP, i * diffP + h, j * diffQ, j * diffQ + w );
BDChannel top = null, left = null, bottom = null, right = null;
if ( i != 0 ) top = bdc [i - 1, j];
if ( j != 0 ) left = bdc [i, j - 1];
if ( i < p - 1 ) bottom = bdc [i + 1, j];
if ( j < q - 1 ) right = bdc [i, j+1];
if ( j != 0 ) partOfB = null; // we need vector b only in the first processors column
hplDistributed( partOfA, partOfB, n, p, q, i * diffP, i * diffP + h, j * diffQ, j * diffQ + w,
top, left, bdc [i,j], bottom, right, xChannel );
}
}
double[] x = new double [n];
for ( int i = 0; i < p; i++ ) Get( ref x );
return x;
}
void Get( ref double[] x ) & Channel xChannel( double[] partOfX, int wFrom ) {
for ( int i = 0; i < partOfX.Length; i++ ) x [wFrom + i] = partOfX [i];
}
Now let's have a look at the main Solve method. In this method we create a p-by-q grid of bi-directional channels and then launch p * q movable methods, giving them as parameters the corresponding parts of matrix a (and, where necessary, the corresponding parts of vector b). All movable methods also receive the bi-directional channels pointing to the process's neighbours and to the current process itself, as well as the semi-directional channel through which the result of the computation is returned. Actually only p processes return values; these are the processes located on the diagonal of the p-by-q process grid. We also have one Get & xChannel bound, which receives parts of the calculated vector x from the running movable methods and "merges" these parts into the resulting vector.
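The block bounds that Solve hands to each movable method follow a simple rule: equal blocks of n/p by n/q, with the last row and column of the grid absorbing the remainders. A Python sketch of that partitioning (illustrative only; it reproduces the index arithmetic of Solve, not the channels):

```python
def block_bounds(n, p, q):
    """Partition an n x n matrix over a p x q grid; the last row/column absorbs n % p and n % q."""
    diff_p, diff_q = n // p, n // q
    blocks = []
    for i in range(p):
        h = diff_p + (n % p if i == p - 1 else 0)
        for j in range(q):
            w = diff_q + (n % q if j == q - 1 else 0)
            blocks.append((i * diff_p, i * diff_p + h, j * diff_q, j * diff_q + w))
    return blocks

bounds = block_bounds(10, 3, 2)   # (yStart, yEnd, xStart, xEnd) per process
print(bounds)
```

Every element of the matrix falls into exactly one block, and the final block stretches to row/column n even when p or q does not divide n.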
movable hplDistributed( double[,] a, double[] b, int n, int p, int q, int yStart, int yEnd, int xStart, int xEnd,
BDChannel top, BDChannel left, BDChannel current, BDChannel bottom, BDChannel right, Channel(double[], int) xChannel ) {
double[,] l = new double [yEnd - yStart, xEnd];
double[,] u = new double [yEnd, xEnd - xStart];
double[] y = new double [yEnd - yStart];
double[] ySum = new double [yEnd - yStart];
double[] x = new double [yEnd - yStart];
int i = 0, j = 0, k = 0; // counters
if ( b != null ) // if it is the first column in the row then copy part of vector b to ySum
for ( i = 0; i < b.Length; i++ ) ySum [i] = b [i];
// Phase 1: Calculate vector y
int nTimes = 0; // How many arrays we can receive from neighbour processes?
if ( left != null && top != null ) nTimes = 2;
else if ( left != null || top != null ) nTimes = 1;
for ( k = 0; k < nTimes; k++ ) {
// Receiving part of matrixes L or U from left or top processes
object[] o = current.Receive();
int t = (int) o [0]; int h = (int) o [2]; int w = (int) o [3];
double[,] prev = (double[,]) o [1];
if ( t == 0 ) { // Received part of matrix L from the left process
// Part of matrix L needed already has been calculated -
// pass it to right processor in the row with the calculated part of matrix y[k]
ySum = (double[]) o[4];
if ( xStart > yStart && xEnd > yEnd && right != null )
right.Send( 0, prev, h, w, ySum ); // "0" here means that value was sent from the left
// copying prev to l
for ( i = 0; i < h; i++ )
for ( j = 0; j < w; j++ ) l [i,j] = prev [i,j];
}
else if ( t == 1 ) { // Received part of matrix L from top process
y = (double[]) o[4];
// Part of matrix L needed already has been calculated - pass it to bottom processor in the column
if ( yStart > xStart && yEnd > xEnd && bottom != null )
bottom.Send( 1, prev, h, w, y ); // "1" here means that value was sent from the top process
// copying prev to u
for ( i = 0; i < h; i++ )
for ( j = 0; j < w; j++ ) u [i,j] = prev [i,j];
}
}
…
The code above is the first part of our movable method hplDistributed; it continues below:
// Calculate parts of matrixes L, U
for ( i = 0; i < yEnd - yStart; i++ ) {
int max = xEnd;
if ( max > n ) max = n;
for ( j = 0; j < max - xStart; j++ ) {
int iPos = i + yStart;
int jPos = j + xStart;
if ( iPos == jPos ) {
// Diagonal
u [iPos,j] = 1;
l [i,jPos] = a [i,j];
for ( k = 0; k < jPos; k++ ) l [i,jPos] -= l [i,k] * u [k,j];
}
else if ( iPos < jPos ) {
// Above diagonal
l [i,jPos] = 0;
u [iPos,j] = a [i,j];
for ( k = 0; k < iPos; k++ ) u [iPos,j] = u [iPos,j] - l [i,k] * u [k,j];
u [iPos,j] = u [iPos,j] / l [i,iPos];
}
else {
// Beyond diagonal
u [iPos,j] = 0;
l [i,jPos] = a [i,j];
for ( k = 0; k < jPos; k++ ) l [i,jPos] -= l [i,k] * u [k,j];
}
if ( xStart < yStart ) ySum [i] -= l [i, jPos] * y [j];
}
}
// Calculating y[i]
if ( xStart == yStart )
for ( i = 0; i < yEnd - yStart; i++ ) {
y[i] = ySum [i];
for ( j = 0; j < i; j++ ) y [i] -= y[j] * l [i,j+xStart];
y [i] = y[i] / l [i,i+xStart];
}
// Sending L matrix to right channel if it hasn't been sent before + partial sum ySum
if ( right != null && (xStart <= yStart || xEnd <= yEnd) ) {
right.Send( 0, l, yEnd - yStart, xEnd, ySum ); // "0" means that value was sent from the left process
}
// Sending U matrix to bottom channel if it hasn't been sent before + part of vector y
if ( bottom != null && yStart <= xStart )
bottom.Send( 1, u, yEnd, xEnd - xStart, y ); // "1" means that value was sent from the top process
…
// STEP 2: Backward substitution - vector x calculation
nTimes = 0;
if ( xStart == yStart && bottom != null && right != null ) nTimes = 1;
else if ( xStart > yStart && (bottom != null && right != null) ) nTimes = 2;
else if ( xStart > yStart && (bottom != null || right != null) ) nTimes = 1;
for ( i = 0; i < yEnd - yStart; i++ ) ySum [i] = 0;
for ( i = 0; i < nTimes; i++ ) {
object[] o = current.Receive();
int t = (int) o [0];
if ( t == 2 ) { // Received part of vector x from the bottom process
x = (double[]) o [1];
if ( top != null ) top.Send( 2, x );
}
else if ( t == 3 ) // Received partial sum of vector x from the right process
ySum = (double[]) o [1];
}
// Calculate x vector and pass it to the top channel if necessary
if ( xStart == yStart ) {
for ( i = yEnd - yStart - 1; i >= 0; i-- ) {
x [i] = y [i] - ySum [i];
for ( j = xEnd - xStart - 1; j > i; j-- ) x [i] -= x[j] * u[i + yStart, j];
x [i] = x [i] / u [i +yStart, i];
}
if ( top != null ) top.Send( 2, x ); // "2" means that the value was sent from the bottom process
}
else if ( xStart > yStart ) {
if ( left != null ) {
for ( i = yEnd - yStart - 1; i >= 0; i-- )
for ( j = xEnd - xStart - 1; j >= 0; j-- ) {
ySum [i] += u [i,j] * x [j];
}
left.Send( 3, ySum ); // "3" means that the value was sent from the right process
}
}
if ( xStart == yStart ) xChannel.Send( x, yStart );
}
The communication scheme is described more closely in the next slides.
[Slide animation, Step 1: calculating vector y. The diagrams show the NPxNQ process grid (processes 0..6 in the example), each process holding its block of matrix A, its part of vector b, its index ranges yStart..yEnd / xStart..xEnd, and channels to its top, bottom, left and right neighbours. Successive frames show "Calculate y[0]" through "Calculate y[5]", with fragments such as y[0], U00, L00, U01, L10, ySum and running sums like L10+11, U01+11, L20+21 being passed between neighbouring processes.]
The final distribution of the L and U matrix fragments across the process grid (the cell in row i, column j holds the listed running sums):

Row 0: L00,U00 | L00,U01 | L00,U02 | L00,U03 | L00,U04 | L00,U05
Row 1: L10,U00 | L10+11,U01+11 | L10+11,U02+12 | L10+11,U03+13 | L10+11,U04+14 | L10+11,U05+15
Row 2: L20,U00 | L20+21,U01+11 | L20+21+22,U02+12+22 | L20+21+22,U03+13+23 | L20+21+22,U04+14+24 | L20+21+22,U05+15+25
Row 3: L30,U00 | L30+31,U01+11 | L30+31+32,U02+12+22 | L30+31+32+33,U03+13+23+33 | L30+31+32+33,U04+14+24+34 | L30+31+32+33,U05+15+25+35
Row 4: L40,U00 | L40+41,U01+11 | L40+41+42,U02+12+22 | L40+41+42+43,U03+13+23+33 | L40+41+42+43+44,U04+14+24+34+44 | L40+41+42+43+44,U05+15+25+35+45
Row 5: L50,U00 | L50+51,U01+11 | L50+51+52,U02+12+22 | L50+51+52+53,U03+13+23+33 | L50+51+52+53+54,U04+14+24+34+44 | L50+51+52+53+54+55,U05+15+25+35+45+55
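For orientation, the Lij/Uij fragments above are pieces of an LU factorization of A. A minimal serial Doolittle sketch in Python (an illustration of what the fragments represent, not the distributed MC# algorithm, and without the pivoting a production HPL would use):

```python
# Serial Doolittle LU factorization without pivoting: a = l @ u,
# with l unit lower triangular and u upper triangular.
def lu_decompose(a):
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        l[i][i] = 1.0
        for j in range(i, n):        # row i of U
            u[i][j] = a[i][j] - sum(l[i][k] * u[k][j] for k in range(i))
        for j in range(i + 1, n):    # column i of L
            l[j][i] = (a[j][i] - sum(l[j][k] * u[k][i] for k in range(i))) / u[i][i]
    return l, u

a = [[4.0, 3.0], [6.0, 3.0]]
l, u = lu_decompose(a)
# l == [[1.0, 0.0], [1.5, 1.0]], u == [[4.0, 3.0], [0.0, -1.5]]
```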
[Slide animation, Step 2: calculating vector x on the same process grid. Successive frames show "Calculate x[5]", "Pass x[5] to the main method", then the same pair for x[4], x[3], x[2] and x[1], and finally "Calculate x[0] and pass it to the main method". Vector x can now be merged from its fragments on the main node!]
This implementation has the drawback that all communication goes through the cluster's frontend, because all bi-directional channels were initially created on the frontend machine. The execution time can be reduced by a mutual exchange of bi-directional channels between neighbouring processes (see the next slides for how this is done).
[Chart: execution time in seconds vs. matrix size N (100..4000) for hpl_notparallel.mcs and hpl_7.mcs.]
Here are the measurements for the algorithm described on the previous slides.
hpl_notparallel.mcs: the non-parallel version.
Cluster: SKIF, 16x2 nodes, http://skif.botik.ru/
[vadim@skif gfft]$ uname -a
Linux skif 2.4.27 #1 SMP Thu Apr 14 15:25:11 MSD 2005 i686 athlon i386 GNU/Linux

hpl_7.mcs: P=10, Q=10, NP=32 (mono hpl_7.exe N 10 10 /np 32).
Note: this version includes the time needed to generate matrix A and vector b.
[Slide animation, Step 0: exchange of bi-directional channels. Each process creates a bi-directional channel locally and passes it to its right neighbour; in the following frames each process passes its local bi-directional channel to its left, bottom, and top neighbours in turn.]
// STEP 0: Speed up the data exchange by a mutual exchange of the bi-directional channels;
// otherwise all network traffic goes through the cluster frontend...
BDChannel currentNew = new BDChannel();
int nTimes = 0; // Number of neighbours this process has
if ( right != null ) { right.Send( 1, currentNew ); nTimes++; }
if ( left != null ) { left.Send( 2, currentNew ); nTimes++; }
if ( bottom != null ) { bottom.Send( 3, currentNew ); nTimes++; }
if ( top != null ) { top.Send( 4, currentNew ); nTimes++; }
for ( i = 0; i < nTimes; i++ ) {
object[] o = current.Receive();
int t = (int) o[0];
if ( t == 1 ) left = (BDChannel) o[1]; // BDChannel came from the left process
else if ( t == 2 ) right = (BDChannel) o[1]; // BDChannel came from the right process
else if ( t == 3 ) top = (BDChannel) o[1]; // BDChannel came from the top process
else if ( t == 4 ) bottom = (BDChannel) o[1]; // BDChannel came from the bottom process
}
current = currentNew;
This is how the previous slides are expressed in the MC# language (this code should be inserted at the beginning of the method):
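The effect of this exchange can be simulated outside MC#. The Python sketch below stands in BDChannels with plain string labels and uses a hypothetical 2x3 grid; after the exchange, each process's left/right/top/bottom reference is its neighbour's freshly created local channel, so later traffic flows directly between neighbours:

```python
# Simulate Step 0: every process creates a fresh local channel and hands it
# to each existing neighbour, replacing the frontend-created channels.
P, Q = 2, 3  # illustrative process grid, 2 rows x 3 columns

# the new channel each process creates locally
local = {(r, c): f"chan@{r}{c}" for r in range(P) for c in range(Q)}

# (process, direction) -> channel that process will use after the exchange
links = {}

for r in range(P):
    for c in range(Q):
        # sending my channel to the right neighbour becomes its "left" link, etc.
        if c + 1 < Q:
            links[((r, c + 1), "left")] = local[(r, c)]
        if c - 1 >= 0:
            links[((r, c - 1), "right")] = local[(r, c)]
        if r + 1 < P:
            links[((r + 1, c), "top")] = local[(r, c)]
        if r - 1 >= 0:
            links[((r - 1, c), "bottom")] = local[(r, c)]

# Process (0,1) now talks to (0,0) over (0,0)'s own local channel:
print(links[((0, 1), "left")])  # -> chan@00
```

Edge processes simply have fewer links, which is what the nTimes counter in the MC# code accounts for.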
And this is the difference in execution time:
[Chart: execution time in seconds vs. N for hpl_7.mcs and hpl_8.mcs.]
If we compare these two implementations and look at the statistics provided by the MC# runtime, we see that in hpl_8.mcs, during the calculation of a 4000x4000 matrix, 974'032'636 bytes are transferred between nodes:
[skif gfft]$ mono hpl_8.exe 4000 10 10 /np 32
________________________________________________
==MC# Statistics================================
Number of movable calls: 100
Number of channel messages: 10
Number of movable calls (across network): 100
Number of channel messages (across network): 10
Total size of movable calls (across network): 128114740 bytes
Total size of channel messages (across network): 33890 bytes
Total time of movable calls serialization: 00:00:14.0937610
Total time of channel messages serialization: 00:00:00.0106820
Total size of transported messages: 974032636 bytes
Total time of transporting messages: 00:01:09.7663320
Session initialization time: 00:00:00.4988500 / 0.49885 sec. / 498.85 msec.
Total time: 00:20:32.4156470 / 1232.415647 sec. / 1232415.647 msec.
________________________________________________
while in hpl_7.mcs, the calculation of the same 4000x4000 matrix transfers 1'819'549'593 bytes between nodes:
[skif gfft]$ mono hpl_7.exe 4000 10 10 /np 32
________________________________________________
==MC# Statistics================================
Number of movable calls: 100
Number of channel messages: 10
Number of movable calls (across network): 100
Number of channel messages (across network): 10
Total size of movable calls (across network): 128114740 bytes
Total size of channel messages (across network): 33890 bytes
Total time of movable calls serialization: 00:00:20.9258120
Total time of channel messages serialization: 00:00:00.0719850
Total size of transported messages: 1819549593 bytes
Total time of transporting messages: 00:02:38.6718660
Session initialization time: 00:00:00.4900100 / 0.49001 sec. / 490.01 msec.
Total time: 00:21:12.4946390 / 1272.494639 sec. / 1272494.639 msec.
________________________________________________
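A quick check of the gain from the channel exchange, using the two "Total size of transported messages" figures from the statistics dumps above:

```python
# Bytes transported across the network in each version (from the runtime stats)
hpl_7_bytes = 1_819_549_593   # channels created on the frontend
hpl_8_bytes = 974_032_636     # channels exchanged between neighbours

reduction = hpl_7_bytes / hpl_8_bytes
print(f"traffic reduced by a factor of {reduction:.2f}")  # -> 1.87
```

So the neighbour-to-neighbour exchange roughly halves the inter-node traffic, which matches the drop in transport time (from 2:38.67 to 1:09.77).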
Explaining the figures / limitations of the implementation
1) The implemented HPL algorithm was selected on the principle "keep the final code as simple as possible to understand and read". Performance can be improved significantly by using more advanced panel-broadcast and update algorithms and look-ahead heuristics.
2) The MC# runtime system has not yet been optimized for really large numbers of processors; it works quite well for clusters with NP <= 16. Note that MC# is still a research project; a truly efficient runtime system is just a matter of time.
3) There are currently no broadcast operations in the MC# syntax. It looks like we will have to add this capability to the language in the future.
4) HPL makes intensive use of network bandwidth. A speedup is possible once the MC# runtime can use SCI network adapters (currently in development); in these measurements we used standard Ethernet adapters.
5) MC# uses the standard .NET binary serializer for transferring objects between nodes. This operation is quite memory-consuming; better performance can be achieved by writing custom serializers.
6) The Mono implementation of the .NET platform is not yet as fast as the implementation from Microsoft.
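Point 5 can be illustrated outside MC#: a generic binary serializer ships per-object metadata along with the data, while a custom packer for an array of doubles ships only the raw bytes. Here is a Python stand-in (pickle playing the role of the generic .NET serializer, struct the role of a custom one):

```python
import pickle
import struct

values = [float(i) for i in range(1000)]

generic = pickle.dumps(values)                     # generic serializer, with metadata
custom = struct.pack(f"{len(values)}d", *values)   # raw 8-byte doubles only

print(len(custom))                  # 8000 bytes: exactly 1000 doubles
print(len(generic) > len(custom))   # the generic form is always larger here
```

The same trade-off applies to the .NET binary serializer: for large numeric arrays a custom serializer saves both bytes on the wire and serialization time.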
Thanks for your time!
MC# homepage: http://u.pereslavl.ru/~vadim/MCSharp/ (the project site may be temporarily down in October/November due to hardware upgrade works)
Special thanks to:
Yury P. Serdyuk – for his great work on MC# project and help in preparing this document
Program Systems Institute / University of Pereslavl for hosting MC# project homepage