24 June 2015 Universidad Politécnica de Valencia1 Advances in the Optimization of Parallel Routines...

18 April 2023 Universidad Politécnica de Valencia


Advances in the Optimization of Parallel Routines (I)

Domingo Giménez
Departamento de Informática y Sistemas
Universidad de Murcia, Spain
dis.um.es/~domingo


Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Peer to peer computing


Collaborations and autoreferences
Modelling Linear Algebra Routines

+ J. Cuenca + J. González:
  Modelling the Behaviour of Linear Algebra Algorithms with Message-passing. 2001
  Towards the Design of an Automatically Tuned Linear Algebra Library. 2002
+ J. Cuenca + L. P. García + J. González + A. Vidal:
  Empirical Modelling of Parallel Linear Algebra Routines. 2003


Collaborations and autoreferences
Installation routines

+ G. Carrillo:
  Installation routines for linear algebra libraries on LANs. 2000
+ G. Carrillo + J. Cuenca + J. González:
  Optimización automática de rutinas paralelas de álgebra lineal (Automatic optimization of parallel linear algebra routines). 2000


Collaborations and autoreferences
Autotuning routines

+ J. Cuenca + J. González:
  Automatic parameterization of parallel linear algebra routines. 2001
+ J. Cuenca:
  Some considerations about the Automatic Optimization of Parallel Linear Algebra Routines. 2002


Collaborations and autoreferences
Modifications to the libraries’ hierarchy

+ J. Cuenca + J. González:
  Architecture of an Automatically Tuned Linear Algebra Library. 2002 - 2004


Collaborations and autoreferences
Polylibraries

+ P. Alberti + P. Alonso + J. Cuenca + A. Vidal:
  Designing Polylibraries to Speed Up Parallel Computations. 2003


Collaborations and autoreferences
Algorithmic schemes

+ J. P. Martínez:
  Automatic Optimization in Parallel Dynamic Programming Schemes. 2004


Collaborations and autoreferences
Heterogeneous systems

+ J. Cuenca + J. Dongarra + J. González + K. Roche:
  Automatic Optimization of Parallel Linear Algebra Routines in Systems with Variable Load. 2003
+ J. Cuenca + J. P. Martínez:
  Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems. 2004


A little history
Parallel optimization in the past:
- Hand-optimization for each platform
- Time consuming
- Incompatible with hardware evolution
- Incompatible with changes in the system (architecture and basic libraries)
- Unsuitable for systems with variable workloads
- Misuse by non-expert users


A little history
Initial solutions to this situation:
- Problem-specific solutions
- Polyalgorithms
- Installation tests


A little history
Problem-specific solutions:
- Brewer (1994): Sorting Algorithms, Differential Equations
- Frigo (1997): FFTW: The Fastest Fourier Transform in the West
- LAWRA (1997): Linear Algebra With Recursive Algorithms


A little history
Polyalgorithms:
- Brewer
- FFTW
- PHiPAC (1997): Linear Algebra


A little history
Installation tests:
- ATLAS (2001): Dense Linear Algebra, sequential
- Carrillo + Giménez (2000): Gauss elimination, heterogeneous algorithm
- I-LIB (2000): some parallel linear algebra routines


A little history
Parallel optimization today:
- Optimization based on computational kernels
- Systematic development of routines
- Auto-optimization of routines
- Middleware for auto-optimization


A little history
Optimization based on computational kernels:
- Efficient kernels (BLAS) and algorithms based on these kernels
- Auto-optimization of the basic kernels (ATLAS)


A little history
Systematic development of routines:
- FLAME project
  - R. van de Geijn + E. Quintana + …
  - Dense Linear Algebra
  - Based on Object Oriented Design
- LAWRA
  - Dense Linear Algebra
  - For Shared Memory Systems


A little history
Auto-optimization of routines:
- At installation time:
  - ATLAS, Dongarra + Whaley
  - I-LIB, Kanada + Katagiri + Kuroda
  - SOLAR, Cuenca + Giménez + González
  - LFC, Dongarra + Roche
- At execution time:
  - Solve a reduced problem in each processor (Kalinov + Lastovetsky)
  - Use a system evaluation tool (NWS)


A little history
Middleware for auto-optimization:
- LFC: Middleware for Dense Linear Algebra Software in Clusters
- Hierarchy of autotuning libraries: include in the libraries installation routines to be used in the development of higher level libraries
- FIBER: proposal of general middleware; evolution of I-LIB
- mpC: for heterogeneous systems


A little history
Parallel optimization in the future?:
- Skeletons and languages
- Heterogeneous and variable-load systems
- Distributed systems
- P2P computing


A little history
Skeletons and languages:
- Develop skeletons for parallel algorithmic schemes, together with execution time models, and provide the users with these libraries (MALLBA, Málaga-La Laguna-Barcelona) or languages (P3L, Pisa)


A little history
Heterogeneous and variable-load systems:
- Heterogeneous algorithms: unbalanced distribution of data (static or dynamic)
- Homogeneous algorithms: more processes than processors, and assignment of processes to processors (static or dynamic)
- Variable-load systems as dynamic heterogeneous systems


A little history
Distributed systems:
- Intrinsically heterogeneous and variable-load
- Very high cost of communications
- Special middleware is necessary (Globus, NWS)
- There can be servers to attend to the queries of clients


A little history
P2P computing:
- Users can join and leave dynamically
- All the users are of the same type (initially)
- It is distributed, heterogeneous and variable-load
- But special middleware is necessary


Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries’ hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Peer to peer computing


Modelling Linear Algebra Routines

It is necessary to predict the execution time accurately and to select:
- The number of processes
- The number of processors
- Which processors
- The number of rows and columns of processes (the topology)
- The assignment of processes to processors
- The computational block size (in linear algebra algorithms)
- The communication block size
- The algorithm (polyalgorithms)
- The routine or library (polylibraries)


Cost of a parallel program:

$$t_{parallel} = t_{arith} + t_{comm} + t_{overhead} - t_{overlap}$$

- $t_{arith}$: arithmetic time
- $t_{comm}$: communication time
- $t_{overhead}$: overhead, for synchronization, imbalance, process creation, ...
- $t_{overlap}$: overlapping of communication and computation


Estimation of the time:

$$t_{parallel} = t_{arith} + t_{comm}$$

Considering computation and communication divided into a number of steps:

$$t_{parallel} = t_{arith,1} + t_{comm,1} + t_{arith,2} + t_{comm,2} + \ldots$$

And for each term of the formula, the value used is that of the process which gives the highest value.
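The step-wise estimate can be sketched in a few lines of Python. This is only an illustration of the rule "take, for each part of the formula, the process with the highest value". The function name `estimate_parallel_time` and the nested-list layout of the per-process times are assumptions made for the example, and the numbers are invented.

```python
def estimate_parallel_time(arith, comm):
    """Estimate t_parallel from per-step, per-process times.

    arith[s][q] and comm[s][q] are the (hypothetical) arithmetic and
    communication times of process q in step s. In every step, and for
    each part of the formula, the slowest process determines the cost.
    """
    return sum(max(a) + max(c) for a, c in zip(arith, comm))


# Two steps, two processes (illustrative numbers only).
arith = [[1.0, 2.0], [3.0, 1.0]]
comm = [[0.5, 0.2], [0.1, 0.4]]
total = estimate_parallel_time(arith, comm)  # (2.0 + 0.5) + (3.0 + 0.4)
```

Note that the maxima of the arithmetic and communication parts of one step may come from different processes, so the estimate is an upper bound on the time of any single process.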


The time depends on the problem size (n) and the system size (p):

$$t_{parallel}(n, p)$$

But also on some ALGORITHMIC PARAMETERS, like the block size (b) and the number of rows (r) and columns (c) of processors in algorithms for a mesh of processors:

$$t_{parallel}(n, b, r, c)$$


And some SYSTEM PARAMETERS which reflect the computation and communication characteristics of the system.

Typically the cost of an arithmetic operation (tc) and the start-up (ts) and word-sending time (tw):

$$t_{parallel}(n, AP, SP)$$


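LU factorisation (Golub - Van Loan):

$$
\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix}
=
\begin{pmatrix} L_{11} & & \\ L_{21} & L_{22} & \\ L_{31} & L_{32} & L_{33} \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} & U_{13} \\ & U_{22} & U_{23} \\ & & U_{33} \end{pmatrix}
$$

Step 1: $A_{11} = L_{11} U_{11}$ (LU factorisation, no blocks)
Step 2: $A_{i1} = L_{i1} U_{11}$ (multiple lower triangular systems)
Step 3: $A_{1i} = L_{11} U_{1i}$ (multiple upper triangular systems)
Step 4: $A_{ij} = A_{ij} - L_{i1} U_{1j}$ (update south-east blocks)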
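The four steps can be sketched as a right-looking block LU in plain Python. This is a minimal sketch under stated assumptions, not the routine used in the talk: there is no pivoting, the matrix is a dense list of lists factorised in place (unit lower triangle holds L, upper triangle holds U), and the names `lu_unblocked` and `block_lu` are hypothetical. A production code would call BLAS/LAPACK kernels for each step.

```python
def lu_unblocked(A, lo, hi):
    """Step 1: in-place LU (no pivoting) of the diagonal block A[lo:hi][lo:hi]."""
    for k in range(lo, hi):
        for i in range(k + 1, hi):
            A[i][k] /= A[k][k]
            for j in range(k + 1, hi):
                A[i][j] -= A[i][k] * A[k][j]


def block_lu(A, b):
    """Block LU of the square matrix A with block size b, in place."""
    n = len(A)
    for lo in range(0, n, b):
        hi = min(lo + b, n)
        lu_unblocked(A, lo, hi)                  # Step 1: A11 = L11 U11
        for i in range(hi, n):                   # Step 2: solve L_i1 U11 = A_i1
            for k in range(lo, hi):
                for t in range(lo, k):
                    A[i][k] -= A[i][t] * A[t][k]
                A[i][k] /= A[k][k]
        for j in range(hi, n):                   # Step 3: solve L11 U_1j = A_1j
            for k in range(lo, hi):
                for t in range(lo, k):
                    A[k][j] -= A[k][t] * A[t][j]  # L11 has a unit diagonal
        for i in range(hi, n):                   # Step 4: A_ij -= L_i1 U_1j
            for j in range(hi, n):
                A[i][j] -= sum(A[i][k] * A[k][j] for k in range(lo, hi))
    return A


# Illustrative, diagonally dominant matrix so that no pivoting is needed.
A0 = [[4.0, 1.0, 2.0, 0.5],
      [1.0, 5.0, 1.0, 2.0],
      [2.0, 1.0, 6.0, 1.0],
      [0.5, 2.0, 1.0, 7.0]]
F = block_lu([row[:] for row in A0], 2)
```

With b = 1 the loops reduce to the element-wise factorisation, which is why the block size can be treated as a tunable parameter that changes the cost but not the result.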


The execution time is

$$t_{sequential}(n) = \frac{2}{3} n^3 t_c$$

If the blocks are of size 1, the operations are all on individual elements, but if the block size is b the cost is

$$t_{sequential}(n) = \frac{2}{3} k_3 n^3 + k_3 b n^2 + \frac{1}{3} k_2 b^2 n$$

with k3 and k2 the costs of operations performed with BLAS 3 or BLAS 2.


But the costs of different operations of the same level are different, and the theoretical cost could be better modelled as:

$$t_{sequential}(n) = \frac{2}{3} k_{3\_dgemm} n^3 + k_{3\_dtrsm} b n^2 + \frac{1}{3} k_{2\_dgetf2} b^2 n$$

Thus, the number of SYSTEM PARAMETERS increases (one for each basic routine), and ...


The value of each System Parameter can depend on the problem size (n) and on the value of the Algorithmic Parameters (b):

$$t_{sequential}(n, b) = \frac{2}{3} k_{3\_dgemm}(n, b) n^3 + k_{3\_dtrsm}(n, b) b n^2 + \frac{1}{3} k_{2\_dgetf2}(n, b) b^2 n$$

The formula has the form:

$$t(n, AP, SP(n, AP))$$

And what we want is to obtain the values of AP with which the lowest execution time is obtained.
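Selecting the Algorithmic Parameter from the model amounts to evaluating the formula for each candidate and keeping the minimum. The sketch below does this for the sequential block LU model with the block size b as the only parameter. The function names and the System Parameter values in `SP_TABLE` are assumptions chosen for illustration, not measurements from the talk.

```python
def t_sequential(n, b, k3_dgemm, k3_dtrsm, k2_dgetf2):
    """Modelled time of sequential block LU, in the time unit of the k values."""
    return ((2 / 3) * k3_dgemm * n**3
            + k3_dtrsm * b * n**2
            + (1 / 3) * k2_dgetf2 * b**2 * n)


def best_block_size(n, sp_table):
    """Return the block size whose modelled time is lowest.

    sp_table maps b -> (k3_dgemm, k3_dtrsm, k2_dgetf2), that is, the
    System Parameters tabulated as a function of the Algorithmic Parameter.
    """
    return min(sp_table, key=lambda b: t_sequential(n, b, *sp_table[b]))


# Hypothetical System Parameter values (microseconds), for illustration only.
SP_TABLE = {
    16: (0.0038, 0.0038, 0.0150),
    32: (0.0033, 0.0033, 0.0150),
    64: (0.0030, 0.0030, 0.0150),
    128: (0.0030, 0.0030, 0.0150),
}
b_opt = best_block_size(1024, SP_TABLE)
```

If the System Parameters were constants, the model would always prefer the smallest block, since the b and b² terms only grow. It is the dependence of the kernel costs on b that produces an interior optimum.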


The values of the System Parameters could be obtained:
- With installation routines associated with each linear algebra routine
- From information stored when the library was installed in the system, thus generating a hierarchy of libraries with auto-optimization
- At execution time, by testing the system conditions prior to the call to the routine



These values can be obtained as single values (traditional method) or as functions of the Algorithmic Parameters.

In the latter case, a multidimensional table of values, indexed by the problem size and the Algorithmic Parameters, is stored.

When a problem of a particular size is being solved, the execution time is estimated with the values of the stored size closest to the real size,

and the problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time.
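The "closest stored size" lookup can be sketched as follows. The table layout (a dictionary keyed by problem size and block size), the names `closest_stored_size` and `lookup_sp`, and the values in `K3_GEMM` are all assumptions for the example.

```python
def closest_stored_size(stored_sizes, n):
    """Return the stored problem size closest to the real size n."""
    return min(stored_sizes, key=lambda size: abs(size - n))


def lookup_sp(table, n, ap):
    """Look up a System Parameter for problem size n and Algorithmic Parameters ap.

    table maps (stored_size, ap) -> value, as written at installation time.
    """
    sizes = sorted({size for size, stored_ap in table if stored_ap == ap})
    return table[(closest_stored_size(sizes, n), ap)]


# Hypothetical table for one System Parameter, keyed by (n, b).
K3_GEMM = {
    (512, 32): 0.0040, (1024, 32): 0.0035, (2048, 32): 0.0033,
    (512, 64): 0.0036, (1024, 64): 0.0031, (2048, 64): 0.0030,
}
value = lookup_sp(K3_GEMM, 1200, 64)  # uses the entry stored for n = 1024
```

Interpolating between the two neighbouring sizes would be an alternative design. The nearest-size rule is used here because it is what the slide describes.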



Parallel block LU factorisation:

[Figure: matrix, processors, and distribution of computations in the first step.]


[Figure: Distribution of computations on successive steps: second step, third step.]



The cost of parallel block LU factorisation:

Tuning Algorithmic Parameters:
- block size: b
- 2D-mesh of p processors: p = r × c, d = max(r, c)

System Parameters:
- cost of arithmetic operations: k2,getf2, k3,trsm, k3,gemm
- communication parameters: ts, tw

$$T_{ARI} = \frac{2}{3} k_{3,gemm} \frac{n^3}{p} + k_{3,trsm} \frac{r + c}{p} b n^2 + \frac{1}{3} k_{2,getf2} b^2 n$$

$$T_{COM} = t_s \frac{2 n d}{b} + t_w \frac{2 n^2 d}{p}$$
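With the parallel model, the search covers the block size and the shape of the processor mesh. The sketch below evaluates T_ARI + T_COM for every combination of a candidate block size and a factorisation p = r × c, and keeps the cheapest. The layout of the terms follows the formulas of this slide as read here, which is itself an assumption, and the System Parameter values in `SP` and the function names are hypothetical.

```python
def t_parallel_lu(n, b, r, c, k3_gemm, k3_trsm, k2_getf2, ts, tw):
    """Modelled time of parallel block LU on an r x c mesh (T_ARI + T_COM)."""
    p, d = r * c, max(r, c)
    t_ari = ((2 / 3) * k3_gemm * n**3 / p
             + k3_trsm * (r + c) / p * b * n**2
             + (1 / 3) * k2_getf2 * b**2 * n)
    t_com = ts * 2 * n * d / b + tw * 2 * n**2 * d / p
    return t_ari + t_com


def meshes(p):
    """All (r, c) pairs with r * c == p."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]


def select_parameters(n, p, block_sizes, sp):
    """Return the (b, r, c) with the lowest modelled time."""
    candidates = [(b, r, c) for b in block_sizes for r, c in meshes(p)]
    return min(candidates, key=lambda ap: t_parallel_lu(n, *ap, **sp))


# Hypothetical System Parameters (microseconds), for illustration only.
SP = dict(k3_gemm=0.003, k3_trsm=0.003, k2_getf2=0.015, ts=60.0, tw=0.7)
b, r, c = select_parameters(2048, 8, [16, 32, 64, 128], SP)
```

Because this model is symmetric in r and c, the 2 × 4 and 4 × 2 meshes tie, and `min` returns the first one encountered. The measured selections reported for real systems need not be symmetric, which is one reason the System Parameters are later made functions of r and c.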


The cost of parallel block QR factorisation:

Tuning Algorithmic Parameters:
- block size: b
- 2D-mesh of p processors: p = r × c

System Parameters:
- cost of arithmetic operations: k2,geqr2, k2,larft, k3,gemm, k3,trmm
- communication parameters: ts, tw


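[Formulas for T_ARI and T_COM, not recoverable term by term from the transcript. T_ARI combines a term in n^3/p with terms in b n^2 divided by r or c, weighted by k3,gemm, k3,trmm, k2,geqr2 and k2,larft. T_COM combines terms in ts and tw that include log r and log c factors.]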


The same basic operations appear repeatedly in different higher level routines: the information generated for one routine (say, LU) could be stored and used for other routines (e.g. QR), and a common format is necessary to store the information.




[Figure: Comparison of execution times using different sets of Algorithm Parameters (8 processors). Horizontal axis ticks: 512 to 3072; vertical axis ticks: 0 to 200. Series: Untuned, Tuned with MCSP, Tuned with MVSP, Optimal Execution Time.]



[Figure: Parallel QR factorisation, IBM-SP2, 8 processors. Horizontal axis: problem size (512 to 3584); vertical axis: time in seconds (0 to 80). Series: mean, model, optimum.]

“mean” refers to the mean of the execution times with representative values of the Algorithmic Parameters (execution time which could be obtained by a non-expert user)

“optimum” is the lowest time of all the executions performed with representative values of the Algorithmic Parameters

“model” is the execution time with the values selected with the model



Parameter selection for the QR algorithm

IBM SP2
              p=4             p=8
   n       b   r   c       b   r   c
1024      16   1   4      16   1   8
2048      32   1   4      16   1   8
3072      32   1   4      32   2   4
4096      32   1   4      32   2   4

Origin 2000
              p=4             p=8
   n       b   r   c       b   r   c
1024      32   4   1      32   4   2
2048      64   4   1      32   4   2
3072       -   -   -      32   4   2
4096       -   -   -      64   4   2

Network of Pentium III with Fast Ethernet
              p=4             p=8
   n       b   r   c       b   r   c
1024      16   1   4      16   1   8
2048      16   1   4      16   1   8
3072      32   1   4      32   1   8
4096      32   1   4      32   1   8


Installation Routines

In the formulas (parallel block LU factorisation)

$$T_{ARI} = \frac{2}{3} k_{3,gemm}(n,b,r,c) \frac{n^3}{p} + k_{3,trsm}(n,b,r,c) \frac{r + c}{p} b n^2 + \frac{1}{3} k_{2,getf2}(n,b,r,c) b^2 n$$

$$T_{COM} = t_s(n,b,r,c) \frac{2 n d}{b} + t_w(n,b,r,c) \frac{2 n^2 d}{p}$$

the values of the System Parameters (k2,getf2, k3,trsm, k3,gemm, ts, tw) must be estimated as functions of the problem size (n) and the Algorithmic Parameters (b, r, c)


Installation Routines

By running, at installation time, Installation Routines associated with the linear algebra routine,
and storing the information generated to be used at running time.

Each linear algebra routine must be designed together with the corresponding installation routines, and the installation process must be detailed.


Installation Routines

$k_{3,gemm}(n,b,r,c)$ is estimated by performing matrix-matrix multiplications and updates of size (n/r × b) × (b × n/c).

Because the size of the matrix to work with decreases during the execution, different values can be estimated for different problem sizes, and the formula can be modified to allow for these estimations with different values, for example by splitting the formula into four formulas with different problem sizes.


Installation Routines

For $k_{3,trsm}(n,b,r,c)$, two multiple triangular systems are solved, one upper triangular of size b × n/c, and another lower triangular of size n/r × b.

Thus, two parameters are estimated, one of them depending on n, b and c, and the other depending on n, b and r.

As for the previous parameter, values can be obtained for different problem sizes.


Installation Routines

$k_{2,getf2}(n,b,r,c)$ corresponds to a level 2 sequential LU factorisation of size b × b.

At installation time each of the basic routines is executed, varying the values of the parameters it depends on over a set of representative values (selected by the routine designer or the system manager),

and the information generated is stored in a file to be used at running time, or in the code of the linear algebra routine before its installation.
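An installation routine of this kind can be sketched as a small timing harness. The sketch below times a pure-Python update C -= A·B as a stand-in for the real kernel (an actual installation would time the BLAS routine the library calls), tabulates a cost per floating point operation for each pair of sizes, and can write the table to a file. All names, the `"m,b"` key format and the JSON file format are assumptions made for the example.

```python
import json
import time


def time_gemm_update(m, b):
    """Time C -= A * B for A of size m x b and B of size b x m (pure Python)."""
    a = [[1.0] * b for _ in range(m)]
    bm = [[1.0] * m for _ in range(b)]
    c = [[0.0] * m for _ in range(m)]
    start = time.perf_counter()
    for i in range(m):
        row_c = c[i]
        for k in range(b):
            aik, row_b = a[i][k], bm[k]
            for j in range(m):
                row_c[j] -= aik * row_b[j]
    return time.perf_counter() - start


def run_installation(sizes, block_sizes):
    """Tabulate a k3-like cost per floating point operation, in microseconds."""
    table = {}
    for m in sizes:
        for b in block_sizes:
            flops = 2 * m * m * b  # one multiply and one subtract per inner step
            table[f"{m},{b}"] = 1e6 * time_gemm_update(m, b) / flops
    return table


def save_table(table, path):
    """Store the table so that it can be read back at running time."""
    with open(path, "w") as f:
        json.dump(table, f, indent=2)


# Small representative values so that the sketch runs quickly.
table = run_installation(sizes=[32, 64], block_sizes=[8, 16])
```

The representative sizes are left as arguments because, as the slide notes, they are chosen by the routine designer or the system manager.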


$t_s(n,b,r,c)$ and $t_w(n,b,r,c)$ appear in communications of three types:

- In one of them a block of size b × b is broadcast in a row, and this parameter depends on b and c
- In another a block of size b × b is broadcast in a column, and the parameter depends on b and r
- And in the other, blocks of sizes b × n/c and n/r × b are broadcast in each one of the columns and rows of processors. These parameters depend on n, b, r and c


In practice each System Parameter depends on fewer Algorithmic Parameters, but this is known only after the installation process is completed.

The routine designer also designs the installation process, and can draw on their own experience to guide the installation.

The basic installation process can be designed allowing the intervention of the system manager.



Some results in different systems (physical and logical platform)

Values of k3_DTRMM (≈ k3_DGEMM) on the different platforms (in microseconds)

                                               Block size
System            n                  16       32       64       128
SUN1  refBLAS     512, ..., 4096     0.0200   0.0200   0.0220   0.0280
SUN1  macBLAS     512, ..., 4096     0.0120   0.0110   0.0110   0.0110
SUN1  ATLAS       512, ..., 4096     0.0070   0.0060   0.0060   0.0060
SUN5  refBLAS     512, ..., 4096     0.0120   0.0130   0.0140   0.0150
SUN5  macBLAS     512, ..., 4096     0.0060   0.0050   0.0050   0.0050
SUN5  ATLAS       512, ..., 4096     0.0040   0.0032   0.0025   0.0025
PIII  ATLAS       512, ..., 4096     0.0038   0.0033   0.0030   0.0030
PPC   macBLAS     512, ..., 4096     0.0023   0.0019   0.0018   0.0018
R10K  macBLAS     512, ..., 4096     0.0070   0.0030   0.0025   0.0025


Installation Routines

Values of k2_DGEQR2 (≈ k2_DLARFT) on the different platforms (in microseconds)

System            n                  Block size 16, 32, 64, 128
SUN1  refBLAS     512, ..., 4096     0.0200
SUN1  macBLAS     512, ..., 4096     0.0500
SUN1  ATLAS       512, ..., 4096     0.0700
SUN5  refBLAS     512, ..., 4096     0.0050
SUN5  macBLAS     512, ..., 4096     0.0300
SUN5  ATLAS       512, ..., 4096     0.0500
PIII  ATLAS       512, ..., 4096     0.0150
PPC   macBLAS     512, ..., 4096     0.0100
R10K  macBLAS     512, ..., 4096     0.0250


Typically the values of the communication parameters are well estimated with a ping-pong

System               n                  ts / tw (block size 16, 32, 64, 128)
cSUN1      MPICH     512, ..., 4096     170 / 7.0
cPIII      MPICH     512, ..., 4096     60 / 0.7
IBM-SP2    Mac-MPI   512, ..., 4096     75 / 0.3
Origin 2K  Mac-MPI   512, ..., 4096     20 / 0.1
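One common way to turn ping-pong measurements into ts and tw is a least-squares fit of the linear model time(m) = ts + tw · m, where m is the message length in words. The sketch below shows only the fitting step, on synthetic samples generated from the model itself. It is not the procedure used for the values above, the name `fit_ts_tw` is hypothetical, and a real test would obtain the samples by timing message exchanges between two processes.

```python
def fit_ts_tw(samples):
    """Least-squares fit of time = ts + tw * words.

    samples is a list of (words, one_way_time) pairs with at least two
    distinct message lengths.
    """
    count = len(samples)
    mean_m = sum(m for m, _ in samples) / count
    mean_t = sum(t for _, t in samples) / count
    sxx = sum((m - mean_m) ** 2 for m, _ in samples)
    sxy = sum((m - mean_m) * (t - mean_t) for m, t in samples)
    tw = sxy / sxx
    ts = mean_t - tw * mean_m
    return ts, tw


# Synthetic samples generated with ts = 60 and tw = 0.7 (illustrative values).
samples = [(m, 60.0 + 0.7 * m) for m in (1, 128, 1024, 8192)]
ts, tw = fit_ts_tw(samples)
```

Short messages mostly determine ts and long messages mostly determine tw, so the sample lengths should span both regimes.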
