Global FFT, Global EP-STREAM Triad, HPL
written in MC#
Vadim B. Guzev [email protected]
Russian Peoples' Friendship University, October 2006
Global FFT
In this submission we show how to implement Global FFT in the MC# programming language. We concentrate on the process of writing parallel distributed programs rather than on performance or line-count metrics. We are quite sure that the future belongs to very high-level programming languages, and that one day the productivity of programmers will become more important than the productivity of platforms. That is why MC# was born…
In object-oriented languages all programs are composed of objects and their interactions. Naturally, when a programmer starts thinking about a problem, he first wants to describe the object model before writing any logic. In the Global FFT program these classes are Complex (a structure) and GlobalFFT (the algorithm). We start writing our program by defining the Complex class:
public class Complex {
    public Complex() { }
    public Complex( double re, double im ) { Re = re; Im = im; }
    public double Re = 0;
    public double Im = 0;
}
The simple math behind the Global FFT problem is the following:

Z_k = \sum_{j=1}^{m} z_j \, e^{-\frac{2\pi i j k}{m}}
    = \sum_{j=1}^{m} z_j \left( \cos\frac{2\pi j k}{m} - i \sin\frac{2\pi j k}{m} \right)
    = \sum_{j=1}^{m} \left( \mathrm{Re}\, z_j + i \,\mathrm{Im}\, z_j \right) \left( \cos\frac{2\pi j k}{m} - i \sin\frac{2\pi j k}{m} \right)

    = \sum_{j=1}^{m} \left( \mathrm{Re}\, z_j \cos\frac{2\pi j k}{m} + \mathrm{Im}\, z_j \sin\frac{2\pi j k}{m} \right)
    + i \sum_{j=1}^{m} \left( \mathrm{Im}\, z_j \cos\frac{2\pi j k}{m} - \mathrm{Re}\, z_j \sin\frac{2\pi j k}{m} \right)
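To sanity-check this direct O(m²) summation, here is a small Python sketch (the submission's code is MC#; this illustration uses the standard 0-based DFT indexing, whereas the MC# kernel below uses (j+1)(k+1) offsets):

```python
import cmath

def dft(z):
    """Direct O(m^2) DFT: Z_k = sum_j z_j * exp(-2*pi*i*j*k/m)."""
    m = len(z)
    return [sum(z[j] * cmath.exp(-2j * cmath.pi * j * k / m) for j in range(m))
            for k in range(m)]

# For a constant input all the energy lands in Z_0: Z_0 = m, Z_k = 0 otherwise.
Z = dft([1.0] * 8)
print(abs(Z[0]), max(abs(v) for v in Z[1:]))
```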
The natural way to distribute this computation is to split the execution by the index k, and that is exactly what we will do. In MC#, if you want a method to be executed in a different thread/node/cluster, all you need to do is mark this method as movable (a distributed analogue of void, or of the async keyword of C# 3.0). Where exactly a movable method will be executed is determined by the Runtime system, and the call of a movable method returns on the caller's side almost immediately (i.e. the caller does not wait until the method execution is completed). In our case this movable method will receive as parameters: (a) the array of complex values z, (b) the current processor number, and (c) a special channel into which the result will be sent.
movable Calculate( Complex[] z, int processorNumber, Channel( Complex[] ) channelZ ) {
    int np = CommWorld.Size;
    if ( np > 1 ) np -= 1; // When the program runs in distributed mode the frontend is included in CommWorld.Size
    int partLength = z.Length / np;
    int shift = processorNumber * partLength;
    Complex[] partOfZ = new Complex [partLength];
    for ( int k = 0; k < partLength; k++ ) {
        partOfZ [k] = new Complex();
        for ( int j = 0; j < z.Length; j++ ) {
            double arg = 2 * Math.PI * (j + 1) * (k + shift + 1) / (double) z.Length;
            double cos = Math.Cos( arg );
            double sin = Math.Sin( arg );
            partOfZ [k].Re += z [j].Re * cos + z [j].Im * sin;
            partOfZ [k].Im += z [j].Im * cos - z [j].Re * sin;
        }
    }
    channelZ.Send( partOfZ, processorNumber );
}
As you can see, in MC# it is possible to use almost any type of parameter for movable methods. When distributed mode is enabled, these parameters are automatically serialized and sent to the remote node. The same applies to channels: values of any .NET type which supports serialization can be sent through them. To get results out of channels you have to connect them with synchronous methods; such a connection is known as a bound in languages like Polyphonic C#, C# 3.0 or MC#. More information about bounds is available on the MC# site.
void Get( ref Complex[] Z ) & Channel CZ( Complex[] partOfZ, int processorNumber ) {
    int shift = processorNumber * partOfZ.Length;
    for ( int i = 0; i < partOfZ.Length; i++ ) {
        Z [i + shift].Re = partOfZ [i].Re;
        Z [i + shift].Im = partOfZ [i].Im;
    }
}
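In plain Python, this Get & CZ bound amounts to copying each worker's slice into the result vector at offset processorNumber * partLength; a minimal sketch (names mirror the MC# code, but this is not MC#):

```python
def merge_part(Z, part, processor_number):
    """Mimic the Get & CZ bound: copy a worker's slice into Z at its offset."""
    shift = processor_number * len(part)
    for i, v in enumerate(part):
        Z[i + shift] = v

Z = [0.0] * 6
merge_part(Z, [1.0, 2.0], 1)   # worker 1 owns indices 2..3
merge_part(Z, [5.0, 6.0], 0)   # worker 0 owns indices 0..1
print(Z)
```

Note that, as in the MC# program, the parts may arrive in any order; the offset alone determines where each slice lands.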
And finally let’s write down the main method which will launch the computation:
public static void Main( string[] args ) {
    GlobalFFT fft = new GlobalFFT();
    // Getting m as a parameter of the program (m = 2^n)
    int m = Int32.Parse( args [0] );
    Random r = new Random();
    Complex[] z = new Complex [m];
    // Initializing vector z
    for ( int i = 0; i < m; i++ ) z [i] = new Complex( r.NextDouble(), r.NextDouble() );
    int np = CommWorld.Size;
    if ( np > 1 ) np -= 1;
    // Launching the processing functions
    for ( int k = 0; k < np; k++ ) fft.Calculate( z, k, fft.CZ );
    // Collecting the results - the result is saved to vector z
    for ( int k = 0; k < np; k++ ) fft.Get( ref z );
}
So, the first version of our Global FFT program is the following:

using System;

public class Complex {
    public Complex() { }
    public Complex( double re, double im ) { Re = re; Im = im; }
    public double Re = 0;
    public double Im = 0;
}

public class GlobalFFT {
    movable Calculate( Complex[] z, int processorNumber, Channel( Complex[] ) channelZ ) {
        int np = CommWorld.Size;
        if ( np > 1 ) np -= 1; // When the program runs in distributed mode the cluster's frontend is included in CommWorld.Size
        int partLength = z.Length / np;
        int shift = processorNumber * partLength;
        Complex[] partOfZ = new Complex [partLength];
        for ( int k = 0; k < partLength; k++ ) {
            partOfZ [k] = new Complex();
            for ( int j = 0; j < z.Length; j++ ) {
                double arg = 2 * Math.PI * (j + 1) * (k + shift + 1) / (double) z.Length;
                double cos = Math.Cos( arg );
                double sin = Math.Sin( arg );
                partOfZ [k].Re += z [j].Re * cos + z [j].Im * sin;
                partOfZ [k].Im += z [j].Im * cos - z [j].Re * sin;
            }
        }
        channelZ.Send( partOfZ, processorNumber );
    }

    void Get( ref Complex[] Z ) & Channel CZ( Complex[] partOfZ, int processorNumber ) {
        int shift = processorNumber * partOfZ.Length;
        for ( int i = 0; i < partOfZ.Length; i++ ) {
            Z [i + shift].Re = partOfZ [i].Re;
            Z [i + shift].Im = partOfZ [i].Im;
        }
    }

    public static void Main( string[] args ) {
        GlobalFFT fft = new GlobalFFT();
        int m = Int32.Parse( args [0] );
        Random r = new Random();
        Complex[] z = new Complex [m];
        for ( int i = 0; i < m; i++ ) z [i] = new Complex( r.NextDouble(), r.NextDouble() );
        int np = CommWorld.Size;
        if ( np > 1 ) np -= 1;
        for ( int k = 0; k < np; k++ ) fft.Calculate( z, k, fft.CZ );
        for ( int k = 0; k < np; k++ ) fft.Get( ref z );
    }
}
Parallel programs written in the MC# language can be executed either in local mode (i.e. as simple .exe files; in this mode all movable methods are executed in different threads) or in distributed mode, in which case all movable calls are distributed across the nodes of a Cluster/MetaCluster/GRID network (depending on the currently used Runtime). That means the programmer can write and debug his program locally (for example on his Windows machine), then copy it to a Windows-based or Linux-based cluster and run it in distributed mode. The user can even emulate a cluster environment on his home computer! MC# makes cluster computations accessible to every programmer, even to those who currently have no access to clusters. Let's try to run this program in local mode on a Windows machine:
R:\projects\MCSharp\hpcchallenge>GlobalFFT.exe 1024
________________________________________________
==MC# Statistics================================
Number of movable calls: 1
Number of channel messages: 1
Number of movable calls (across network): 0
Number of channel messages (across network): 0
Total size of movable calls (across network): 0 bytes
Total size of channel messages (across network): 0 bytes
Total time of movable calls serialization: 00:00:00.0156250
Total time of channel messages serialization: 00:00:00
Total size of transported messages: 0 bytes
Total time of transporting messages: 00:00:00
Session initialization time: 00:00:00.0312500 / 0.03125 sec. / 31.25 msec.
Total time: 00:00:00.4531250 / 0.453125 sec. / 453.125 msec.
________________________________________________
Or we can run this program in local mode on Linux machine:
[vadim@skif gfft]$ mono GlobalFFT.exe 1024
MC#.Runtime, v. 1.13.2437.555
________________________________________________
==MC# Statistics================================
Number of movable calls: 1
Number of channel messages: 1
Number of movable calls (across network): 0
Number of channel messages (across network): 0
Total size of movable calls (across network): 0 bytes
Total size of channel messages (across network): 0 bytes
Total time of movable calls serialization: 00:00:00.0869900
Total time of channel messages serialization: 00:00:00
Total size of transported messages: 0 bytes
Total time of transporting messages: 00:00:00
Session initialization time: 00:00:00.1355990 / 0.135599 sec. / 135.599 msec.
Total time: 00:00:00.5795040 / 0.579504 sec. / 579.504 msec.
________________________________________________
OK, it works. Now let's try to run this program on more serious hardware. We'll use a 16-node cluster with the following configuration:

[vadim@skif vadim]$ uname -a
Linux skif 2.4.27 #1 SMP Thu Apr 14 15:25:11 MSD 2005 i686 athlon i386 GNU/Linux
Let’s run our program on 16 processors:
[vadim@skif gfft]$ mono GlobalFFT.exe 16384 /np 16
MC#.Runtime, v. 1.13.2437.555
________________________________________________
==MC# Statistics================================
Number of movable calls: 16
Number of channel messages: 16
Number of movable calls (across network): 16
Number of channel messages (across network): 16
Total size of movable calls (across network): 7871648 bytes
Total size of channel messages (across network): 495824 bytes
Total time of movable calls serialization: 00:00:13.8864840
Total time of channel messages serialization: 00:00:01.0427460
Total size of transported messages: 8369192 bytes
Total time of transporting messages: 00:00:00.5808540
Session initialization time: 00:00:00.4839560 / 0.483956 sec. / 483.956 msec.
Total time: 00:00:30.1959760 / 30.195976 sec. / 30195.976 msec.
________________________________________________
Here are the results (execution time in seconds for different vector sizes m and numbers of processors):

GlobalFFT.mcs
np      m=65536   m=32768   m=16384
1       -         476.55    112.52
2       -         257.54    59.47
4       604.2     149.11    36.90
8       401.7     106.95    28.68
16      329.1     97.70     30.20

[Graph: execution time (sec) vs. number of processors (1-16) for m = 65536, 32768, 16384.]
Not bad, especially if we take into account that we wrote this program in a modern high-level object-oriented language without thinking about any optimization issues or about the physical structure of the computational platform. Now let's try to optimize it a little bit. The main problem of the first version of our program is that we need to move thousands of complex user-defined objects from the frontend to the cluster nodes and back. The serialization/deserialization of such objects takes a lot of resources and time. We can significantly reduce the execution time if we replace the "arrays of Complex" with "arrays of doubles". Here is the modified version of the program (lines 46, NCSL 43, TPtoks 490):
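The cost of serializing object graphs versus flat arrays is easy to see in any language; here is a hedged Python analogue using pickle (a stand-in for .NET serialization, purely to illustrate the size difference between a list of two-field objects and a flat array of doubles):

```python
import pickle
from array import array

class Complex:                 # object form, analogous to the first version
    def __init__(self, re, im):
        self.re, self.im = re, im

n = 1024
objs = [Complex(float(i), -float(i)) for i in range(n)]
flat = array('d', (x for i in range(n) for x in (float(i), -float(i))))

objs_size = len(pickle.dumps(objs))
flat_size = len(pickle.dumps(flat))
print(objs_size, flat_size)    # the object form carries per-object overhead
```

The flat representation stores only the raw 8-byte doubles plus a small header, while the object list pays per-instance bookkeeping, which is the same effect the MC# rewrite below exploits.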
using System;

public class GlobalFFT {
    movable FFT( double[] zRe, double[] zIm, int processorNumber, Channel( double[], double[], int ) channelZ ) {
        int np = CommWorld.Size;
        if ( np > 1 ) np -= 1; // When the program runs in distributed mode the cluster's frontend is included in CommWorld.Size
        int partLength = zRe.Length / np;
        int shift = processorNumber * partLength;
        double[] partOfZRe = new double [partLength];
        double[] partOfZIm = new double [partLength];
        double multiplier = 2 * Math.PI / (double) zRe.Length;
        for ( int k = 0; k < partLength; k++ ) {
            for ( int j = 0; j < zRe.Length; j++ ) {
                double arg = multiplier * (j + 1) * (k + shift + 1);
                double cos = Math.Cos( arg );
                double sin = Math.Sin( arg );
                partOfZRe [k] += zRe [j] * cos + zIm [j] * sin;
                partOfZIm [k] += zIm [j] * cos - zRe [j] * sin;
            }
        }
        channelZ.Send( partOfZRe, partOfZIm, processorNumber );
    }

    void Get( ref double[] ZRe, ref double[] ZIm ) & Channel CZ( double[] partOfZRe, double[] partOfZIm, int processorNumber ) {
        int shift = processorNumber * partOfZRe.Length;
        for ( int i = 0; i < partOfZRe.Length; i++ ) {
            ZRe [i + shift] = partOfZRe [i];
            ZIm [i + shift] = partOfZIm [i];
        }
    }

    public static void Main( string[] args ) {
        GlobalFFT fft = new GlobalFFT();
        int m = Int32.Parse( args [0] );
        Random r = new Random();
        double[] zRe = new double [m];
        double[] zIm = new double [m];
        for ( int i = 0; i < m; i++ ) {
            zRe [i] = r.NextDouble();
            zIm [i] = r.NextDouble();
        }
        int np = CommWorld.Size;
        if ( np > 1 ) np -= 1;
        for ( int k = 0; k < np; k++ ) fft.FFT( zRe, zIm, k, fft.CZ );
        for ( int k = 0; k < np; k++ ) fft.Get( ref zRe, ref zIm );
    }
}
Let's run this version of the program:

[vadim@skif gfft]$ mono GlobalFFT_arrays.exe 32768 /np 16
MC#.Runtime, v. 1.13.2437.555
________________________________________________
==MC# Statistics================================
Number of movable calls: 32
Number of channel messages: 32
Number of movable calls (across network): 32
Number of channel messages (across network): 32
Total size of movable calls (across network): 16792256 bytes
Total size of channel messages (across network): 1054848 bytes
Total time of movable calls serialization: 00:00:00.2735930
Total time of channel messages serialization: 00:00:00.3594260
Total size of transported messages: 17849757 bytes
Total time of transporting messages: 00:00:01.2621410
Session initialization time: 00:00:00.4820490 / 0.482049 sec. / 482.049 msec.
Total time: 00:00:28.8684640 / 28.868464 sec. / 28868.464 msec.
________________________________________________

Now we can get some performance numbers for the second version of our Global FFT program:
GlobalFFT_arrays.mcs
np      m=131072  m=65536   m=32768   m=16384
1       -         844.50    210.07    52.51
2       -         425.47    106.15    26.82
4       858.79    219.76    53.70     13.78
8       432.52    108.12    27.69     7.39
16      218.22    55.45     14.93     4.37

[Graph: execution time (sec) vs. number of processors (1-16) for m = 131072, 65536, 32768, 16384.]
Global EP-STREAM-Triad
There is one more task in the Class 2 Specification, entitled Global EP-STREAM-Triad. Although C# currently does not support kernel vector operations, we think it is still a good example to demonstrate the syntax of MC#. We'll write a simple program which performs the same calculations on several nodes simultaneously and then prints the average time taken over all nodes. There will be only one movable method in this program; it accepts a special Channel through which only objects of class TimeSpan can be sent:
movable fun( Channel( TimeSpan ) result ) {
    int m = 2000000;
    int n = 1000;
    Random r = new Random();
    double[] a = new double[m];
    double[] b = new double[m];
    double[] c = new double[m];
    for ( int i = 0; i < m; i++ ) {
        b [i] = r.NextDouble();
        c [i] = r.NextDouble();
    }
    double alpha = r.NextDouble();
    TimeSpan ts = new TimeSpan(0);
    for ( int i = 0; i < n; i++ ) {
        DateTime from = DateTime.Now;
        for ( int j = 0; j < m; j++ ) a [j] = b [j] + alpha * c[j];
        ts += DateTime.Now.Subtract( from );
    }
    result.Send( ts );
}
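The kernel inside this movable method is the classic STREAM Triad, a[j] = b[j] + alpha * c[j], timed over repeated passes. A minimal Python sketch of the same measurement (fixed inputs instead of random ones, smaller sizes, purely illustrative):

```python
import time

def stream_triad(m=100_000, n=5):
    """Run the Triad kernel a[j] = b[j] + alpha * c[j] n times; return (elapsed seconds, a)."""
    b = [0.5] * m
    c = [0.25] * m
    a = [0.0] * m
    alpha = 2.0
    start = time.perf_counter()
    for _ in range(n):
        for j in range(m):
            a[j] = b[j] + alpha * c[j]
    return time.perf_counter() - start, a

elapsed, a = stream_triad()
print(f"{elapsed:.3f} s, a[0] = {a[0]}")
```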
Movable methods cannot return values; channels must be used instead to pass information between nodes. To read from "semi-directional" channels, bounds (special syntax constructs which can synchronize multiple threads) must be used. In our case we need only one bound:

TimeSpan GetResult() & Channel result( TimeSpan ts ) { return ts; }

When thread A calls the method GetResult, the Runtime system checks whether any object has been delivered to the result channel and queued in the special channel queue. If no objects have been received, thread A is suspended until the result channel receives some object; when the object arrives, thread A is resumed and the reading from the channel occurs. Conversely, when an object is sent to the result channel and there are no waiting callers of GetResult, the object is put into the special channel queue; it will be read when GetResult is eventually called.
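These blocking semantics are the same as those of a thread-safe queue; a minimal Python sketch of the suspend-until-delivery behaviour (queue.Queue stands in for the MC# channel, a thread for the movable method):

```python
import threading
import queue
import datetime

result = queue.Queue()      # plays the role of the channel's special queue

def worker():
    # simulate a movable method sending its TimeSpan-like result
    result.put(datetime.timedelta(seconds=1))

t = threading.Thread(target=worker)
t.start()
ts = result.get()           # like GetResult: blocks until a value is sent
t.join()
print(ts)
```

If the worker sends before anyone reads, the value simply waits in the queue, mirroring the second case described above.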
Here is the final version of this EP-STREAM Triad program (lines 36, NCSL 33, TPtoks 311):

using System;

public class EPStream {
    movable fun( Channel( TimeSpan ) result ) {
        int m = 2000000;
        int n = 1000;
        Random r = new Random();
        double[] a = new double[m];
        double[] b = new double[m];
        double[] c = new double[m];
        for ( int i = 0; i < m; i++ ) {
            b [i] = r.NextDouble();
            c [i] = r.NextDouble();
        }
        double alpha = r.NextDouble();
        TimeSpan ts = new TimeSpan(0);
        for ( int i = 0; i < n; i++ ) {
            DateTime from = DateTime.Now;
            for ( int j = 0; j < m; j++ ) a [j] = b [j] + alpha * c[j];
            ts += DateTime.Now.Subtract( from );
        }
        result.Send( ts );
    }

    TimeSpan GetResult() & Channel result( TimeSpan ts ) { return ts; }

    public static void Main( string[] args ) {
        EPStream e = new EPStream();
        for ( int i = 0; i < CommWorld.Size; i++ ) e.fun( e.result );
        TimeSpan total = new TimeSpan(0);
        for ( int i = 0; i < CommWorld.Size; i++ ) total += e.GetResult();
        Console.WriteLine("Average: " + new TimeSpan( total.Ticks/CommWorld.Size ));
    }
}

Let's run this program:

[vadim@skif gfft]$ mono EPStream.exe /np 16
MC#.Runtime, v. 1.13.2437.555
Average: 00:01:13.4992476
________________________________________________
==MC# Statistics================================
Number of movable calls: 17
Number of channel messages: 17
Number of movable calls (across network): 17
Number of channel messages (across network): 17
Total size of movable calls (across network): 6647 bytes
Total size of channel messages (across network): 2958 bytes
Total time of movable calls serialization: 00:00:00.0483080
Total time of channel messages serialization: 00:00:00.3021390
Total size of transported messages: 11321 bytes
Total time of transporting messages: 00:00:00.0160980
Session initialization time: 00:00:00.4875540 / 0.487554 sec. / 487.554 msec.
Total time: 00:01:35.2171990 / 95.217199 sec. / 95217.199 msec.
________________________________________________
The kernel timed on every node is the STREAM Triad update

x_i = y_i + \alpha \cdot u_i, \quad i = 1, \ldots, n.

HPL

HPL solves a linear system of equations of order n, Ax = b, by first computing the factorization A = LU and then solving the equations Ly = b and Ux = y one by one. In this scheme L and U are triangular matrices, so solving these two systems is not a problem; the real problem is the calculation of the matrices L and U themselves.

The simple math behind this problem is the following (we take U with a unit diagonal, as in the code below):

a_{ij} = \sum_{k=1}^{\min(i,j)} l_{ik} u_{kj},

l_{ij} = a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj}, \quad i \ge j, \qquad
u_{ij} = \frac{1}{l_{ii}} \left( a_{ij} - \sum_{k=1}^{i-1} l_{ik} u_{kj} \right), \quad i < j, \qquad u_{ii} = 1,

y_i = \frac{1}{l_{ii}} \left( b_i - \sum_{k=1}^{i-1} l_{ik} y_k \right), \qquad
x_i = y_i - \sum_{k=i+1}^{n} u_{ik} x_k.

[Figure: order of the block computations of L and U; the number in parentheses next to each block (L11 (1), U11 (1), U12 (2), L21 (2), ...) is the step at which that block is computed.]
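As a concrete illustration of this scheme, here is a sequential Python sketch of the same Crout-style factorization with unit diagonal in U, followed by the two triangular solves (an unblocked toy version, not the distributed MC# algorithm; no pivoting, so it assumes nonzero l_ii):

```python
def lu_solve(A, b):
    """LU with unit diagonal in U, then solve Ly = b and Ux = y."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        U[j][j] = 1.0
        for i in range(j, n):          # column j of L
            L[i][j] = A[i][j] - sum(L[i][k] * U[k][j] for k in range(j))
        for i in range(j + 1, n):      # row j of U
            U[j][i] = (A[j][i] - sum(L[j][k] * U[k][i] for k in range(j))) / L[j][j]
    y = [0.0] * n
    for i in range(n):                 # forward substitution: Ly = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):       # backward substitution: Ux = y (u_ii = 1)
        x[i] = y[i] - sum(U[i][k] * x[k] for k in range(i + 1, n))
    return x

x = lu_solve([[4.0, 3.0], [6.0, 3.0]], [10.0, 12.0])
print(x)
```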
And the calculation dependencies graph is the following:
[Figure: the calculation dependency graph for the blocks of L and U, annotated with the completion time of each block under the cost model described below.]
Actually, there are a lot of modifications of the HPL algorithm out there. One of the most important things in HPL is the way calculated panels are broadcast to the other nodes. For example, in the Increasing Ring algorithm process 0 sends two messages and process 1 only receives one message (0→1, 0→2, 2→3 and so on). This algorithm is almost always better, if not the best. Let's suppose that for a given matrix A the calculation in one node takes approximately ~1 minute, and the transfer of matrices between nodes takes ~1/10 minute. Using these primitive estimations we can calculate that the whole factorization takes 11.6 minutes in the Increasing Ring version vs. 36 minutes in the sequential version. In practice it is even better, because this algorithm saves bandwidth:
[Figure: timeline of the Increasing Ring broadcast across processes 0-3, with the completion times of the successive panel computations and transfers.]
So, we know that better algorithms do exist, but they are quite complex to understand, and the purpose of our submission is not to get the highest performance results but to show the principles of programming in the MC# language. For simplicity we will therefore use the simplest communication structure, where each process communicates directly with the top, left, bottom and right processes in the process grid. Each process is connected with its neighbours by bi-directional channels (BDChannel); using these channels, processes communicate by sending and receiving messages.
public static void Main( string[] args ) {
if ( args.Length < 3 ) {
Console.WriteLine( "Usage: HPL.exe n p q" );
Console.WriteLine( "Where n - size of matrix A, p - height of process grid, q - width of process grid" );
return;
}
int n = Int32.Parse( args [0] );
int p = Int32.Parse( args [1] );
int q = Int32.Parse( args [2] );
double[,] a = new double [n, n];
double[] b = new double [n];
int maxRandNum = 100;
// Generate matrix A - Expected mean must be equal to zero
Random rand = new Random();
for ( int i = 0; i < n; i++ )
for ( int j = 0; j < n; j++ ) a [i,j] = rand.NextDouble() * maxRandNum - maxRandNum/2;
// Generate vector b - Expected mean must be equal to zero
for ( int i = 0; i < n; i++ ) b[i] = rand.NextDouble() * maxRandNum - maxRandNum/2;
DateTime dt1, dt2;
TimeSpan dt;
dt1 = DateTime.Now;
HPLAlgorithm hpl = new HPLAlgorithm(); // Creating an instance of the HPL algorithm
double[] x = hpl.Solve( a, b, n, p, q ); // Launching the algorithm
dt2 = DateTime.Now;
dt = dt2.Subtract( dt1 );
Console.WriteLine( "\nElapsed time: " + dt.TotalSeconds + " sec.\n" );
double performance = ( 2.0 * n * n * n / 3.0 + 3.0 * n * n / 2.0 ) * 1.0e-9 / (double) dt.TotalSeconds; // 1e-9 converts flops to Gflops
Console.WriteLine( "Performance = " + performance + " Gflop/sec" );
bool correctness = Verify( a, b, x, n );
if ( correctness == true )
Console.WriteLine( "Solution is correct" );
else
Console.WriteLine( "Solution is incorrect" );
}
The Main method of our program is quite simple; actually it is written in pure C# (no MC#-specific syntax is used here). First of all we generate matrix A and vector b; after that we instantiate an HPLAlgorithm object and solve the system by calling the Solve method. Finally we verify the solution. See the comments in the code for a better understanding.
public static bool Verify( double[,] A, double[] b, double[] x, int n ) {
int i, j;
double tmp;
double eps = Math.Pow( 2, -52 ); // machine epsilon for IEEE double precision
double Ax_b_infin = 0.0; // || A x - b ||_infinity
for ( i = 0; i < n; i++ ) {
tmp = 0.0;
for ( j = 0; j < n; j++ ) tmp += A [i,j] * x [ j ];
tmp = Math.Abs( tmp - b [i] );
if ( tmp > Ax_b_infin ) Ax_b_infin = tmp;
}
double A_infin = 0.0; // || A ||_infinity
for ( i = 0; i < n; i++ ) {
tmp = 0.0;
for ( j = 0; j < n; j++ ) tmp += Math.Abs( A [i,j] );
if ( tmp > A_infin ) A_infin = tmp;
}
double A_1 = 0.0; // || A ||_1
for ( j = 0; j < n; j++ ) {
tmp = 0.0;
for ( i = 0; i < n; i++ ) tmp += Math.Abs( A [i,j] );
if ( tmp > A_1 ) A_1 = tmp;
}
double x_1 = 0.0; // || x ||_1
for ( i = 0; i < n; i++ ) x_1 += Math.Abs( x [i] );
double x_infin = 0.0; // || x ||_infinity
for ( i = 0; i < n; i++ )
if ( Math.Abs( x [ i ] ) > x_infin ) x_infin = Math.Abs( x [ i ] );
double r1 = Ax_b_infin / ( eps * A_1 * n );
double r2 = Ax_b_infin / ( eps * A_1 * x_1 );
double r3 = Ax_b_infin / ( eps * A_infin * x_infin * n );
double r;
if ( r1 > r2 ) r = r1; else r = r2;
if ( r3 > r ) r = r3;
Console.WriteLine( "r1 = " + r1 + "\n" + "r2 = " + r2 + "\n" + "r3 = " + r3 );
Console.WriteLine( "Max ri = " + r );
if ( r < 16 ) return true; else return false;
}
First of all, let's look at the accessory methods: Verify, GetSubMatrix and GetSubVector. The Verify method checks the solution using the criteria given in the HPC Challenge Awards: Class 2 Specification:
\|A\|_1 = \max_j \sum_i |a_{ij}|, \qquad \|A\|_\infty = \max_i \sum_j |a_{ij}|, \qquad \|x\|_1 = \sum_i |x_i|, \qquad \|x\|_\infty = \max_i |x_i|,

r_1 = \frac{\|Ax - b\|_\infty}{\varepsilon \, \|A\|_1 \, n}, \qquad
r_2 = \frac{\|Ax - b\|_\infty}{\varepsilon \, \|A\|_1 \, \|x\|_1}, \qquad
r_3 = \frac{\|Ax - b\|_\infty}{\varepsilon \, \|A\|_\infty \, \|x\|_\infty \, n},

\max( r_1, r_2, r_3 ) < 16.
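These residuals translate directly into a few lines of Python; a sketch mirroring the Verify method (pure lists, machine epsilon for eps):

```python
def residuals(A, b, x):
    """Compute the HPCC verification residuals r1, r2, r3 for Ax = b."""
    n = len(A)
    eps = 2.0 ** -52                                    # machine epsilon
    Ax_b = max(abs(sum(A[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
    A_inf = max(sum(abs(A[i][j]) for j in range(n)) for i in range(n))
    A_1 = max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))
    x_1 = sum(abs(v) for v in x)
    x_inf = max(abs(v) for v in x)
    r1 = Ax_b / (eps * A_1 * n)
    r2 = Ax_b / (eps * A_1 * x_1)
    r3 = Ax_b / (eps * A_inf * x_inf * n)
    return r1, r2, r3

# an exact solution gives zero residuals, so the r < 16 check passes
r = residuals([[4.0, 3.0], [6.0, 3.0]], [10.0, 12.0], [1.0, 2.0])
print(max(r) < 16)
```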
public static double[,] GetSubMatrix( double[,] a, int fromH, int toH, int fromW, int toW ) {
    int h = toH - fromH;
    int w = toW - fromW;
    double[,] b = new double[h, w];
    for ( int i = 0; i < h; i++ )
        for ( int j = 0; j < w; j++ ) b [i, j] = a [fromH+i, fromW+j];
    return b;
}

public static double[] GetSubVector( double[] b, int fromH, int toH ) {
    int h = toH - fromH;
    double[] c = new double[h];
    for ( int i = 0; i < h; i++ ) c [i] = b [fromH + i];
    return c;
}
public double[] Solve( double[,] a, double[] b, int n, int p, int q ) {
BDChannel[,] bdc = new BDChannel [p,q];
// Creating a p * q grid of bi-directional channels, one per process
for ( int i = 0; i < p; i++ )
for ( int j = 0; j < q; j++ ) bdc [i, j] = new BDChannel();
int diffP = n / p;
int diffQ = n / q;
for ( int i = 0; i < p; i++ ) {
int h = diffP;
if ( i == p - 1 ) h += n % p;
double[] partOfB = GetSubVector( b, i * diffP, i * diffP + h );
for ( int j = 0; j < q; j++ ) {
int w = diffQ;
if ( j == q - 1 ) w += n % q;
double[,] partOfA = GetSubMatrix( a, i * diffP, i * diffP + h, j * diffQ, j * diffQ + w );
BDChannel top = null, left = null, bottom = null, right = null;
if ( i != 0 ) top = bdc [i - 1, j];
if ( j != 0 ) left = bdc [i, j - 1];
if ( i < p - 1 ) bottom = bdc [i + 1, j];
if ( j < q - 1 ) right = bdc [i, j+1];
if ( j != 0 ) partOfB = null; // we need vector b only in the first processors column
hplDistributed( partOfA, partOfB, n, p, q, i * diffP, i * diffP + h, j * diffQ, j * diffQ + w,
top, left, bdc [i,j], bottom, right, xChannel );
}
}
double[] x = new double [n];
for ( int i = 0; i < p; i++ ) Get( ref x );
return x;
}
void Get( ref double[] x ) & Channel xChannel( double[] partOfX, int wFrom ) {
for ( int i = 0; i < partOfX.Length; i++ ) x [wFrom + i] = partOfX [i];
}
Now let's have a look at the main Solve method. In this method we create a p-by-q grid of bi-directional channels and then launch p * q movable methods, giving them as parameters the corresponding parts of matrix a (and, where necessary, the corresponding parts of vector b). All movable methods also receive the bi-directional channels pointing to the process's neighbours and to the current process itself, as well as the semi-directional channel through which the result of the computation is returned. Actually only p processes return values; these are the processes located on the diagonal of the p-by-q process grid. We also have one Get & xChannel bound, which receives parts of the calculated vector x from the running movable methods and "merges" these parts into the resulting vector.
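The block bounds that Solve hands to each movable method follow a simple rule: equal blocks of n/p by n/q, with the last row and column of the grid absorbing the remainders. A Python sketch of that partitioning (illustrative only; it reproduces the index arithmetic of Solve, not the channels):

```python
def block_bounds(n, p, q):
    """Partition an n x n matrix over a p x q grid; the last row/column absorbs n % p and n % q."""
    diff_p, diff_q = n // p, n // q
    blocks = []
    for i in range(p):
        h = diff_p + (n % p if i == p - 1 else 0)
        for j in range(q):
            w = diff_q + (n % q if j == q - 1 else 0)
            blocks.append((i * diff_p, i * diff_p + h, j * diff_q, j * diff_q + w))
    return blocks

bounds = block_bounds(10, 3, 2)   # (yStart, yEnd, xStart, xEnd) per process
print(bounds)
```

Every element of the matrix falls into exactly one block, and the final block stretches to row/column n even when p or q does not divide n.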
movable hplDistributed( double[,] a, double[] b, int n, int p, int q, int yStart, int yEnd, int xStart, int xEnd,
BDChannel top, BDChannel left, BDChannel current, BDChannel bottom, BDChannel right, Channel(double[], int) xChannel ) {
double[,] l = new double [yEnd - yStart, xEnd];
double[,] u = new double [yEnd, xEnd - xStart];
double[] y = new double [yEnd - yStart];
double[] ySum = new double [yEnd - yStart];
double[] x = new double [yEnd - yStart];
int i = 0, j = 0, k = 0; // counters
if ( b != null ) // if it is the first column in the row then copy part of vector b to ySum
for ( i = 0; i < b.Length; i++ ) ySum [i] = b [i];
// Phase 1: Calculate vector y
int nTimes = 0; // How many arrays we can receive from neighbour processes?
if ( left != null && top != null ) nTimes = 2;
else if ( left != null || top != null ) nTimes = 1;
for ( k = 0; k < nTimes; k++ ) {
// Receiving part of matrixes L or U from left or top processes
object[] o = current.Receive();
int t = (int) o [0]; int h = (int) o [2]; int w = (int) o [3];
double[,] prev = (double[,]) o [1];
if ( t == 0 ) { // Received part of matrix L from the left process
// Part of matrix L needed already has been calculated -
// pass it to right processor in the row with the calculated part of matrix y[k]
ySum = (double[]) o[4];
if ( xStart > yStart && xEnd > yEnd && right != null )
right.Send( 0, prev, h, w, ySum ); // "0" here means that value was sent from the left
// copying prev to l
for ( i = 0; i < h; i++ )
for ( j = 0; j < w; j++ ) l [i,j] = prev [i,j];
}
else if ( t == 1 ) { // Received part of matrix L from top process
y = (double[]) o[4];
// Part of matrix L needed already has been calculated - pass it to bottom processor in the column
if ( yStart > xStart && yEnd > xEnd && bottom != null )
bottom.Send( 1, prev, h, w, y ); // "1" here means that value was sent from the top process
// copying prev to u
for ( i = 0; i < h; i++ )
for ( j = 0; j < w; j++ ) u [i,j] = prev [i,j];
}
}
…
The code above is the first part of our movable method hplDistributed; it continues below:
// Calculate parts of matrixes L, U
for ( i = 0; i < yEnd - yStart; i++ ) {
int max = xEnd;
if ( max > n ) max = n;
for ( j = 0; j < max - xStart; j++ ) {
int iPos = i + yStart;
int jPos = j + xStart;
if ( iPos == jPos ) {
// Diagonal
u [iPos,j] = 1;
l [i,jPos] = a [i,j];
for ( k = 0; k < jPos; k++ ) l [i,jPos] -= l [i,k] * u [k,j];
}
else if ( iPos < jPos ) {
// Above diagonal
l [i,jPos] = 0;
u [iPos,j] = a [i,j];
for ( k = 0; k < iPos; k++ ) u [iPos,j] = u [iPos,j] - l [i,k] * u [k,j];
u [iPos,j] = u [iPos,j] / l [i,iPos];
}
else {
// Beyond diagonal
u [iPos,j] = 0;
l [i,jPos] = a [i,j];
for ( k = 0; k < jPos; k++ ) l [i,jPos] -= l [i,k] * u [k,j];
}
if ( xStart < yStart ) ySum [i] -= l [i, jPos] * y [j];
}
}
// Calculating y[i]
if ( xStart == yStart )
for ( i = 0; i < yEnd - yStart; i++ ) {
y[i] = ySum [i];
for ( j = 0; j < i; j++ ) y [i] -= y[j] * l [i,j+xStart];
y [i] = y[i] / l [i,i+xStart];
}
// Sending L matrix to right channel if it hasn't been sent before + partial sum ySum
if ( right != null && (xStart <= yStart || xEnd <= yEnd) ) {
right.Send( 0, l, yEnd - yStart, xEnd, ySum ); // "0" means that value was sent from the left process
}
// Sending U matrix to bottom channel if it hasn't been sent before + part of vector y
if ( bottom != null && yStart <= xStart )
bottom.Send( 1, u, yEnd, xEnd - xStart, y ); // "1" means that value was sent from the top process
…
// STEP 2: Backward substitution - vector x calculation
nTimes = 0;
if ( xStart == yStart && bottom != null && right != null ) nTimes = 1;
else if ( xStart > yStart && (bottom != null && right != null) ) nTimes = 2;
else if ( xStart > yStart && (bottom != null || right != null) ) nTimes = 1;
for ( i = 0; i < yEnd - yStart; i++ ) ySum [i] = 0;
for ( i = 0; i < nTimes; i++ ) {
object[] o = current.Receive();
int t = (int) o [0];
if ( t == 2 ) { // Received part of vector x from the bottom process
x = (double[]) o [1];
if ( top != null ) top.Send( 2, x );
}
else if ( t == 3 ) // Received partial sum of vector x from the right process
ySum = (double[]) o [1];
}
// Calculate x vector and pass it to the top channel if necessary
if ( xStart == yStart ) {
for ( i = yEnd - yStart - 1; i >= 0; i-- ) {
x [i] = y [i] - ySum [i];
for ( j = xEnd - xStart - 1; j > i; j-- ) x [i] -= x[j] * u[i + yStart, j];
x [i] = x [i] / u [i +yStart, i];
}
if ( top != null ) top.Send( 2, x ); // "2" means that the value was sent from the bottom process
}
else if ( xStart > yStart ) {
if ( left != null ) {
for ( i = yEnd - yStart - 1; i >= 0; i-- )
for ( j = xEnd - xStart - 1; j >= 0; j-- ) {
ySum [i] += u [i,j] * x [j];
}
left.Send( 3, ySum ); // "3" means that the value was sent from the right process
}
}
if ( xStart == yStart ) xChannel.Send( x, yStart );
}
The communication scheme is described more closely in the next slides.
[Slide animation, Step 1: calculating vector y. The diagrams show the NPxNQ process grid (processes 0..6 in the example), each process holding its block of matrix A, its part of vector b, its index ranges yStart..yEnd / xStart..xEnd, and channels to its top, bottom, left and right neighbours. Successive frames show "Calculate y[0]" through "Calculate y[5]", with fragments such as y[0], U00, L00, U01, L10, ySum and running sums like L10+11, U01+11, L20+21 being passed between neighbouring processes.]
The final distribution of the L and U matrix fragments across the process grid (the cell in row i, column j holds the listed running sums):

Row 0: L00,U00 | L00,U01 | L00,U02 | L00,U03 | L00,U04 | L00,U05
Row 1: L10,U00 | L10+11,U01+11 | L10+11,U02+12 | L10+11,U03+13 | L10+11,U04+14 | L10+11,U05+15
Row 2: L20,U00 | L20+21,U01+11 | L20+21+22,U02+12+22 | L20+21+22,U03+13+23 | L20+21+22,U04+14+24 | L20+21+22,U05+15+25
Row 3: L30,U00 | L30+31,U01+11 | L30+31+32,U02+12+22 | L30+31+32+33,U03+13+23+33 | L30+31+32+33,U04+14+24+34 | L30+31+32+33,U05+15+25+35
Row 4: L40,U00 | L40+41,U01+11 | L40+41+42,U02+12+22 | L40+41+42+43,U03+13+23+33 | L40+41+42+43+44,U04+14+24+34+44 | L40+41+42+43+44,U05+15+25+35+45
Row 5: L50,U00 | L50+51,U01+11 | L50+51+52,U02+12+22 | L50+51+52+53,U03+13+23+33 | L50+51+52+53+54,U04+14+24+34+44 | L50+51+52+53+54+55,U05+15+25+35+45+55
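For orientation, the Lij/Uij fragments above are pieces of an LU factorization of A. A minimal serial Doolittle sketch in Python (an illustration of what the fragments represent, not the distributed MC# algorithm, and without the pivoting a production HPL would use):

```python
# Serial Doolittle LU factorization without pivoting: a = l @ u,
# with l unit lower triangular and u upper triangular.
def lu_decompose(a):
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    u = [[0.0] * n for _ in range(n)]
    for i in range(n):
        l[i][i] = 1.0
        for j in range(i, n):        # row i of U
            u[i][j] = a[i][j] - sum(l[i][k] * u[k][j] for k in range(i))
        for j in range(i + 1, n):    # column i of L
            l[j][i] = (a[j][i] - sum(l[j][k] * u[k][i] for k in range(i))) / u[i][i]
    return l, u

a = [[4.0, 3.0], [6.0, 3.0]]
l, u = lu_decompose(a)
# l == [[1.0, 0.0], [1.5, 1.0]], u == [[4.0, 3.0], [0.0, -1.5]]
```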
[Slide animation, Step 2: calculating vector x on the same process grid. Successive frames show "Calculate x[5]", "Pass x[5] to the main method", then the same pair for x[4], x[3], x[2] and x[1], and finally "Calculate x[0] and pass it to the main method". Vector x can now be merged from its fragments on the main node!]
This implementation has the drawback that all communication goes through the cluster's frontend, because all bi-directional channels were initially created on the frontend machine. The execution time can be reduced by a mutual exchange of bi-directional channels between neighbouring processes (see the next slides for how this is done).
[Chart: execution time in seconds vs. matrix size N (100..4000) for hpl_notparallel.mcs and hpl_7.mcs.]
Here are the measurements for the algorithm described on the previous slides.
hpl_notparallel.mcs: the non-parallel version.
Cluster: SKIF, 16x2 nodes, http://skif.botik.ru/
[vadim@skif gfft]$ uname -a
Linux skif 2.4.27 #1 SMP Thu Apr 14 15:25:11 MSD 2005 i686 athlon i386 GNU/Linux

hpl_7.mcs: P=10, Q=10, NP=32 (mono hpl_7.exe N 10 10 /np 32).
Note: this version includes the time needed to generate matrix A and vector b.
[Slide animation, Step 0: exchange of bi-directional channels. Each process creates a bi-directional channel locally and passes it to its right neighbour; in the following frames each process passes its local bi-directional channel to its left, bottom, and top neighbours in turn.]
// STEP 0: Speed up the data exchange by a mutual exchange of the bi-directional channels;
// otherwise all network traffic goes through the cluster frontend...
BDChannel currentNew = new BDChannel();
int nTimes = 0; // Number of neighbours this process has
if ( right != null ) { right.Send( 1, currentNew ); nTimes++; }
if ( left != null ) { left.Send( 2, currentNew ); nTimes++; }
if ( bottom != null ) { bottom.Send( 3, currentNew ); nTimes++; }
if ( top != null ) { top.Send( 4, currentNew ); nTimes++; }
for ( i = 0; i < nTimes; i++ ) {
object[] o = current.Receive();
int t = (int) o[0];
if ( t == 1 ) left = (BDChannel) o[1]; // BDChannel came from the left process
else if ( t == 2 ) right = (BDChannel) o[1]; // BDChannel came from the right process
else if ( t == 3 ) top = (BDChannel) o[1]; // BDChannel came from the top process
else if ( t == 4 ) bottom = (BDChannel) o[1]; // BDChannel came from the bottom process
}
current = currentNew;
This is how the previous slides are expressed in the MC# language (this code should be inserted at the beginning of the method):
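The effect of this exchange can be simulated outside MC#. The Python sketch below stands in BDChannels with plain string labels and uses a hypothetical 2x3 grid; after the exchange, each process's left/right/top/bottom reference is its neighbour's freshly created local channel, so later traffic flows directly between neighbours:

```python
# Simulate Step 0: every process creates a fresh local channel and hands it
# to each existing neighbour, replacing the frontend-created channels.
P, Q = 2, 3  # illustrative process grid, 2 rows x 3 columns

# the new channel each process creates locally
local = {(r, c): f"chan@{r}{c}" for r in range(P) for c in range(Q)}

# (process, direction) -> channel that process will use after the exchange
links = {}

for r in range(P):
    for c in range(Q):
        # sending my channel to the right neighbour becomes its "left" link, etc.
        if c + 1 < Q:
            links[((r, c + 1), "left")] = local[(r, c)]
        if c - 1 >= 0:
            links[((r, c - 1), "right")] = local[(r, c)]
        if r + 1 < P:
            links[((r + 1, c), "top")] = local[(r, c)]
        if r - 1 >= 0:
            links[((r - 1, c), "bottom")] = local[(r, c)]

# Process (0,1) now talks to (0,0) over (0,0)'s own local channel:
print(links[((0, 1), "left")])  # -> chan@00
```

Edge processes simply have fewer links, which is what the nTimes counter in the MC# code accounts for.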
And this is the difference in execution time:
[Chart: execution time in seconds vs. N for hpl_7.mcs and hpl_8.mcs.]
If we compare these two implementations and look at the statistics provided by the MC# runtime, we see that in hpl_8.mcs, during the calculation of a 4000x4000 matrix, 974'032'636 bytes are transferred between nodes:
[skif gfft]$ mono hpl_8.exe 4000 10 10 /np 32
________________________________________________
==MC# Statistics================================
Number of movable calls: 100
Number of channel messages: 10
Number of movable calls (across network): 100
Number of channel messages (across network): 10
Total size of movable calls (across network): 128114740 bytes
Total size of channel messages (across network): 33890 bytes
Total time of movable calls serialization: 00:00:14.0937610
Total time of channel messages serialization: 00:00:00.0106820
Total size of transported messages: 974032636 bytes
Total time of transporting messages: 00:01:09.7663320
Session initialization time: 00:00:00.4988500 / 0.49885 sec. / 498.85 msec.
Total time: 00:20:32.4156470 / 1232.415647 sec. / 1232415.647 msec.
________________________________________________
while in hpl_7.mcs, the calculation of the same 4000x4000 matrix transfers 1'819'549'593 bytes between nodes:
[skif gfft]$ mono hpl_7.exe 4000 10 10 /np 32
________________________________________________
==MC# Statistics================================
Number of movable calls: 100
Number of channel messages: 10
Number of movable calls (across network): 100
Number of channel messages (across network): 10
Total size of movable calls (across network): 128114740 bytes
Total size of channel messages (across network): 33890 bytes
Total time of movable calls serialization: 00:00:20.9258120
Total time of channel messages serialization: 00:00:00.0719850
Total size of transported messages: 1819549593 bytes
Total time of transporting messages: 00:02:38.6718660
Session initialization time: 00:00:00.4900100 / 0.49001 sec. / 490.01 msec.
Total time: 00:21:12.4946390 / 1272.494639 sec. / 1272494.639 msec.
________________________________________________
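A quick check of the gain from the channel exchange, using the two "Total size of transported messages" figures from the statistics dumps above:

```python
# Bytes transported across the network in each version (from the runtime stats)
hpl_7_bytes = 1_819_549_593   # channels created on the frontend
hpl_8_bytes = 974_032_636     # channels exchanged between neighbours

reduction = hpl_7_bytes / hpl_8_bytes
print(f"traffic reduced by a factor of {reduction:.2f}")  # -> 1.87
```

So the neighbour-to-neighbour exchange roughly halves the inter-node traffic, which matches the drop in transport time (from 2:38.67 to 1:09.77).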
Explaining the figures / limitations of the implementation
1) The implemented HPL algorithm was selected on the principle "keep the final code as simple as possible to understand and read". Performance can be improved significantly by using more advanced panel-broadcast and update algorithms and look-ahead heuristics.
2) The MC# runtime system has not yet been optimized for really large numbers of processors; it works quite well for clusters with NP <= 16. Note that MC# is still a research project; a truly efficient runtime system is just a matter of time.
3) There are currently no broadcast operations in the MC# syntax. It looks like we will have to add this capability to the language in the future.
4) HPL makes intensive use of network bandwidth. A speedup is possible once the MC# runtime can use SCI network adapters (currently in development); in these measurements we used standard Ethernet adapters.
5) MC# uses the standard .NET binary serializer for transferring objects between nodes. This operation is quite memory-consuming; better performance can be achieved by writing custom serializers.
6) The Mono implementation of the .NET platform is not yet as fast as the implementation from Microsoft.
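Point 5 can be illustrated outside MC#: a generic binary serializer ships per-object metadata along with the data, while a custom packer for an array of doubles ships only the raw bytes. Here is a Python stand-in (pickle playing the role of the generic .NET serializer, struct the role of a custom one):

```python
import pickle
import struct

values = [float(i) for i in range(1000)]

generic = pickle.dumps(values)                     # generic serializer, with metadata
custom = struct.pack(f"{len(values)}d", *values)   # raw 8-byte doubles only

print(len(custom))                  # 8000 bytes: exactly 1000 doubles
print(len(generic) > len(custom))   # the generic form is always larger here
```

The same trade-off applies to the .NET binary serializer: for large numeric arrays a custom serializer saves both bytes on the wire and serialization time.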
Thanks for your time!
MC# homepage: http://u.pereslavl.ru/~vadim/MCSharp/ (the project site may be temporarily down in October/November due to hardware upgrade works)
Special thanks to:
Yury P. Serdyuk – for his great work on MC# project and help in preparing this document
Program Systems Institute / University of Pereslavl for hosting MC# project homepage