18 April 2023 Universidad Politécnica de Valencia
Advances in the Optimization of Parallel Routines (I)
Domingo Giménez
Departamento de Informática y Sistemas
Universidad de Murcia, Spain
dis.um.es/~domingo
Outline
  A little history
  Modelling Linear Algebra Routines
  Installation routines
  Autotuning routines
  Modifications to libraries’ hierarchy
  Polylibraries
  Algorithmic schemes
  Heterogeneous systems
  Peer to peer computing
Collaborations and autoreferences
Modelling Linear Algebra Routines
  + J. Cuenca + J. González:
      Modelling the Behaviour of Linear Algebra Algorithms with Message-passing. 2001
      Towards the Design of an Automatically Tuned Linear Algebra Library. 2002
  + J. Cuenca + L. P. García + J. González + A. Vidal:
      Empirical Modelling of Parallel Linear Algebra Routines. 2003
Collaborations and autoreferences
Installation routines
  + G. Carrillo:
      Installation routines for linear algebra libraries on LANs. 2000
  + G. Carrillo + J. Cuenca + J. González:
      Optimización automática de rutinas paralelas de álgebra lineal. 2000

Collaborations and autoreferences
Autotuning routines
  + J. Cuenca + J. González:
      Automatic parameterization of parallel linear algebra routines. 2001
  + J. Cuenca:
      Some considerations about the Automatic Optimization of Parallel Linear Algebra Routines. 2002

Collaborations and autoreferences
Modifications to the libraries’ hierarchy
  + J. Cuenca + J. González:
      Architecture of an Automatic Tuned Linear Algebra Library. 2002 - 2004

Collaborations and autoreferences
Polylibraries
  + P. Alberti + P. Alonso + J. Cuenca + A. Vidal:
      Designing Polylibraries to Speed Up Parallel Computations. 2003

Collaborations and autoreferences
Algorithmic schemes
  + J. P. Martínez:
      Automatic Optimization in Parallel Dynamic Programming Schemes. 2004

Collaborations and autoreferences
Heterogeneous systems
  + J. Cuenca + J. Dongarra + J. González + K. Roche:
      Automatic Optimization of Parallel Linear Algebra Routines in Systems with Variable Load. 2003
  + J. Cuenca + J. P. Martínez:
      Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems. 2004
A little history
Parallel optimization in the past:
  Hand-optimization for each platform
    Time consuming
    Incompatible with hardware evolution
    Incompatible with changes in the system (architecture and basic libraries)
    Unsuitable for systems with variable workloads
    Misuse by non-expert users

A little history
Initial solutions to this situation:
  Problem-specific solutions
  Polyalgorithms
  Installation tests

A little history
Problem-specific solutions:
  Brewer (1994): Sorting Algorithms, Differential Equations
  Frigo (1997): FFTW: The Fastest Fourier Transform in the West
  LAWRA (1997): Linear Algebra With Recursive Algorithms

A little history
Polyalgorithms:
  Brewer
  FFTW
  PHiPAC (1997): Linear Algebra

A little history
Installation tests:
  ATLAS (2001): Dense Linear Algebra, sequential
  Carrillo + Giménez (2000): Gauss elimination, heterogeneous algorithm
  I-LIB (2000): some parallel linear algebra routines
A little history
Parallel optimization today:
  Optimization based on computational kernels
  Systematic development of routines
  Auto-optimization of routines
  Middleware for auto-optimization

A little history
Optimization based on computational kernels:
  Efficient kernels (BLAS) and algorithms based on these kernels
  Auto-optimization of the basic kernels (ATLAS)

A little history
Systematic development of routines:
  FLAME project
    R. van de Geijn + E. Quintana + …
    Dense Linear Algebra
    Based on Object Oriented Design
  LAWRA
    Dense Linear Algebra
    For Shared Memory Systems

A little history
Auto-optimization of routines:
  At installation time:
    ATLAS, Dongarra + Whaley
    I-LIB, Kanada + Katagiri + Kuroda
    SOLAR, Cuenca + Giménez + González
    LFC, Dongarra + Roche
  At execution time:
    Solve a reduced problem in each processor (Kalinov + Lastovetsky)
    Use a system evaluation tool (NWS)

A little history
Middleware for auto-optimization:
  LFC:
    Middleware for Dense Linear Algebra Software in Clusters.
  Hierarchy of autotuning libraries:
    Include in the libraries installation routines to be used in the development of higher level libraries
  FIBER:
    Proposal of general middleware
    Evolution of I-LIB
  mpC:
    For heterogeneous systems
A little history
Parallel optimization in the future?:
  Skeletons and languages
  Heterogeneous and variable-load systems
  Distributed systems
  P2P computing

A little history
Skeletons and languages:
  Develop skeletons for parallel algorithmic schemes together with execution time models,
  and provide the users with these libraries (MALLBA, Málaga-La Laguna-Barcelona) or languages (P3L, Pisa)

A little history
Heterogeneous and variable-load systems:
  Heterogeneous algorithms: unbalanced distribution of data (static or dynamic)
  Homogeneous algorithms: more processes than processors, and assignment of processes to processors (static or dynamic)
  Variable-load systems treated as dynamic heterogeneous systems

A little history
Distributed systems:
  Intrinsically heterogeneous and variable-load
  Very high cost of communications
  Special middleware is necessary (Globus, NWS)
  There can be servers to answer the queries of clients

A little history
P2P computing:
  Users can join and leave dynamically
  All the users are of the same type (initially)
  It is distributed, heterogeneous and variable-load
  But special middleware is necessary
Modelling Linear Algebra Routines
It is necessary to predict the execution time accurately and to select:
  The number of processes
  The number of processors
  Which processors
  The number of rows and columns of processes (the topology)
  The assignment of processes to processors
  The computational block size (in linear algebra algorithms)
  The communication block size
  The algorithm (polyalgorithms)
  The routine or library (polylibraries)
Modelling Linear Algebra Routines
Cost of a parallel program:
  $t_{parallel} = t_{arith} + t_{comm} + t_{overhead} - t_{overlap}$
  $t_{arith}$: arithmetic time
  $t_{comm}$: communication time
  $t_{overhead}$: overhead, for synchronization, imbalance, process creation, ...
  $t_{overlap}$: overlapping of communication and computation
Modelling Linear Algebra Routines
Estimation of the time:
  $t_{parallel} = t_{arith} + t_{comm}$
Considering computation and communication divided in a number of steps:
  $t_{parallel} = t_{arith,1} + t_{comm,1} + t_{arith,2} + t_{comm,2} + \dots$
And for each part of the formula, the value used is that of the process which gives the highest value.
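As a minimal sketch of the step-wise estimate above: the function name `parallel_time` and the per-process times are hypothetical and only illustrate how each part of the formula takes the value of the slowest process.

```python
def parallel_time(arith, comm):
    """Estimate t_parallel as the sum, over steps, of the slowest process's
    arithmetic time plus the slowest process's communication time.

    arith[s][q] and comm[s][q] are the times of process q in step s
    (hypothetical measurements, in any consistent time unit).
    """
    total = 0.0
    for step_arith, step_comm in zip(arith, comm):
        total += max(step_arith) + max(step_comm)
    return total


# Two steps, two processes; the numbers are made up for illustration.
arith = [[4.0, 3.0], [2.0, 2.5]]
comm = [[1.0, 1.5], [0.5, 0.25]]
estimate = parallel_time(arith, comm)
```

Note that the slowest process in the arithmetic part of a step need not be the slowest one in the communication part, which is why the two maxima are taken separately.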
Modelling Linear Algebra Routines
The time depends on the problem size (n) and the system size (p):
  $t_{parallel}(n, p)$
But also on some ALGORITHMIC PARAMETERS, like the block size (b) and the number of rows (r) and columns (c) of processors in algorithms for a mesh of processors:
  $t_{parallel}(n, b, r, c)$
Modelling Linear Algebra Routines
And on some SYSTEM PARAMETERS, which reflect the computation and communication characteristics of the system.
Typically these are the cost of an arithmetic operation (tc) and the start-up (ts) and word-sending (tw) times:
  $t_{parallel}(n, AP, SP)$
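Modelling Linear Algebra Routines
LU factorisation (Golub - Van Loan):
  $\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix} = \begin{pmatrix} L_{11} & & \\ L_{21} & L_{22} & \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} & U_{13} \\ & U_{22} & U_{23} \\ & & U_{33} \end{pmatrix}$
Step 1: $A_{11} = L_{11} U_{11}$ (LU factorisation without blocks)
Step 2: $A_{1i} = L_{11} U_{1i}$ (multiple lower triangular systems)
Step 3: $A_{i1} = L_{i1} U_{11}$ (multiple upper triangular systems)
Step 4: $A_{ij} = A_{ij} - L_{i1} U_{1j}$ (update south-east blocks)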
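The four steps can be sketched in NumPy as follows. This is only an illustration under stated assumptions: no pivoting is performed (so it is only safe for matrices such as diagonally dominant ones), the names `unblocked_lu` and `block_lu` are hypothetical, and `np.linalg.solve` is a general solver standing in for the triangular solves that a BLAS-based routine would use.

```python
import numpy as np


def unblocked_lu(a):
    """LU factorisation without pivoting of a small square block; returns (L, U)."""
    n = a.shape[0]
    lu = a.astype(float).copy()
    for k in range(n - 1):
        lu[k + 1:, k] /= lu[k, k]
        lu[k + 1:, k + 1:] -= np.outer(lu[k + 1:, k], lu[k, k + 1:])
    return np.tril(lu, -1) + np.eye(n), np.triu(lu)


def block_lu(a, b):
    """Block LU factorisation without pivoting, with block size b; returns (L, U)."""
    n = a.shape[0]
    a = a.astype(float).copy()
    L = np.eye(n)
    U = np.zeros((n, n))
    for k in range(0, n, b):
        e = min(k + b, n)
        # Step 1: factorise the diagonal block without blocks.
        L[k:e, k:e], U[k:e, k:e] = unblocked_lu(a[k:e, k:e])
        if e < n:
            # Step 2: lower triangular systems, L11 * U1j = A1j.
            U[k:e, e:] = np.linalg.solve(L[k:e, k:e], a[k:e, e:])
            # Step 3: upper triangular systems, Li1 * U11 = Ai1 (solved transposed).
            L[e:, k:e] = np.linalg.solve(U[k:e, k:e].T, a[e:, k:e].T).T
            # Step 4: update the south-east blocks.
            a[e:, e:] -= L[e:, k:e] @ U[k:e, e:]
    return L, U
```

Most of the arithmetic is in step 4, a matrix-matrix product, which is why the cost of the block algorithm is dominated by a BLAS 3 term.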
Modelling Linear Algebra Routines
The execution time is
  $t_{sequential}(n) = \frac{2}{3} n^3 t_c$
If the blocks are of size 1, the operations are all with individual elements, but if the block size is b the cost is
  $t_{sequential}(n) = \frac{2}{3} n^3 k_3 + b n^2 k_3 + \frac{1}{3} b^2 n k_2$
with k3 and k2 the costs of operations performed with BLAS 3 or BLAS 2.
Modelling Linear Algebra Routines
But the cost of different operations of the same level is different, and the theoretical cost could be better modelled as:
  $t_{sequential}(n) = \frac{2}{3} n^3 k_{3\_dgemm} + b n^2 k_{3\_dtrsm} + \frac{1}{3} b^2 n k_{2\_dgetf2}$
Thus, the number of SYSTEM PARAMETERS increases (one for each basic routine), and ...
Modelling Linear Algebra Routines
The value of each System Parameter can depend on the problem size (n) and on the value of the Algorithmic Parameters (b):
  $t_{sequential}(n, b) = \frac{2}{3} n^3 k_{3\_dgemm}(n, b) + b n^2 k_{3\_dtrsm}(n, b) + \frac{1}{3} b^2 n k_{2\_dgetf2}(n, b)$
The formula has the form:
  $t(n, AP, SP(n, AP))$
And what we want is to obtain the values of AP with which the lowest execution time is obtained.
Modelling Linear Algebra Routines
The values of the System Parameters could be obtained:
  With installation routines associated to each linear algebra routine
  From information stored when the library was installed in the system, thus generating a hierarchy of libraries with auto-optimization
  At execution time, by testing the system conditions prior to the call to the routine
Modelling Linear Algebra Routines
These values can be obtained as single values (traditional method) or as functions of the Algorithmic Parameters.
In the latter case, a multidimensional table of values, indexed by the problem size and the Algorithmic Parameters, is stored.
When a problem of a particular size is being solved, the execution time is estimated with the values of the stored size closest to the real size,
and the problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time.
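The table lookup and the selection of the Algorithmic Parameters can be sketched as follows for the sequential block LU model, $\frac{2}{3} n^3 k_{3\_dgemm} + b n^2 k_{3\_dtrsm} + \frac{1}{3} b^2 n k_{2\_dgetf2}$. The table contents are made-up numbers, and the names `SP_TABLE`, `closest_size`, `predicted_time` and `best_block_size` are hypothetical.

```python
# Hypothetical table stored at installation time:
# (stored problem size, block size) -> (k3_dgemm, k3_dtrsm, k2_dgetf2), in microseconds.
SP_TABLE = {
    (512, 16): (0.0040, 0.0040, 0.020),
    (512, 32): (0.0032, 0.0032, 0.020),
    (512, 64): (0.0030, 0.0030, 0.020),
    (2048, 16): (0.0040, 0.0040, 0.020),
    (2048, 32): (0.0030, 0.0030, 0.020),
    (2048, 64): (0.0025, 0.0025, 0.020),
}


def closest_size(n, table):
    """Stored problem size closest to the real size n."""
    sizes = sorted({size for size, _ in table})
    return min(sizes, key=lambda s: abs(s - n))


def predicted_time(n, b, table):
    """Sequential block LU model evaluated with the System Parameters stored
    for the closest problem size and the block size b."""
    k_gemm, k_trsm, k_getf2 = table[(closest_size(n, table), b)]
    return (2 / 3) * n**3 * k_gemm + b * n**2 * k_trsm + (1 / 3) * b**2 * n * k_getf2


def best_block_size(n, table):
    """Block size with the lowest predicted execution time."""
    blocks = sorted({b for _, b in table})
    return min(blocks, key=lambda b: predicted_time(n, b, table))
```

With these illustrative values the sketch selects b = 32 for n = 600 (using the entries stored for 512) and b = 64 for n = 1800 (using the entries stored for 2048), which shows how the selected parameter can change with the problem size.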
Modelling Linear Algebra Routines
Parallel block LU factorisation:
[Figure: the matrix, the processors, and the distribution of computations in the first step]

Modelling Linear Algebra Routines
Distribution of computations on successive steps:
[Figure: second step and third step]
Modelling Linear Algebra Routines
The cost of parallel block LU factorisation:
Tuning Algorithmic Parameters:
  block size: b
  2D-mesh of p processors: p = r × c, d = max(r, c)
System Parameters:
  cost of arithmetic operations: k2,getf2  k3,trsm  k3,gemm
  communication parameters: ts  tw
  $T_{ARI} = \frac{2}{3} \frac{n^3}{p} k_{3,gemm} + \frac{r+c}{p} n^2 b \, k_{3,trsm} + \frac{1}{3} b^2 n \, k_{2,getf2}$
  $T_{COM} = \frac{2 n d}{b} t_s + \frac{2 n^2 d}{p} t_w$
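A small sketch of how this model could drive the selection of b, r and c. It assumes the arithmetic and communication costs take the form $T_{ARI} = \frac{2}{3}\frac{n^3}{p} k_{3,gemm} + \frac{r+c}{p} n^2 b\, k_{3,trsm} + \frac{1}{3} b^2 n\, k_{2,getf2}$ and $T_{COM} = \frac{2nd}{b} t_s + \frac{2n^2 d}{p} t_w$ with p = r × c and d = max(r, c), and that the System Parameters are single values. The function names and the parameter values passed in are hypothetical.

```python
def lu_parallel_time(n, b, r, c, k3_gemm, k3_trsm, k2_getf2, ts, tw):
    """Modelled time T_ARI + T_COM of the parallel block LU on an r x c mesh."""
    p = r * c
    d = max(r, c)
    t_ari = ((2 / 3) * n**3 / p * k3_gemm
             + (r + c) / p * n**2 * b * k3_trsm
             + (1 / 3) * b**2 * n * k2_getf2)
    t_com = ts * 2 * n * d / b + tw * 2 * n**2 * d / p
    return t_ari + t_com


def mesh_shapes(p):
    """All (r, c) pairs with r * c == p."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]


def best_parameters(n, p, block_sizes, **sp):
    """(b, r, c) with the lowest modelled time among the candidates."""
    candidates = [(b, r, c) for b in block_sizes for r, c in mesh_shapes(p)]
    return min(candidates, key=lambda t: lu_parallel_time(n, *t, **sp))


# Illustrative System Parameters (made-up values, all in the same time unit).
choice = best_parameters(2048, 8, [16, 32, 64],
                         k3_gemm=0.003, k3_trsm=0.003, k2_getf2=0.02,
                         ts=60.0, tw=0.7)
```

An exhaustive search is affordable here because the number of candidate block sizes and mesh shapes is small; with System Parameters that depend on (n, b, r, c), the constants would be replaced by table lookups.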
Modelling Linear Algebra Routines
The cost of parallel block QR factorisation:
Tuning Algorithmic Parameters:
  block size: b
  2D-mesh of p processors: p = r × c
System Parameters:
  cost of arithmetic operations: k2,geqr2  k2,larft  k3,gemm  k3,trmm
  communication parameters: ts  tw
[Formulas for T_ARI and T_COM: expressions in n, b, r, c and p that combine the System Parameters above; T_COM includes logarithmic terms. The expressions are not recoverable from the slide text.]
Modelling Linear Algebra Routines
The same basic operations appear repeatedly in different higher level routines:
  the information generated for one routine (let’s say LU) could be stored and used for other routines (e.g. QR),
  and a common format is necessary to store the information.
Modelling Linear Algebra Routines
[Figure: Comparison of execution times using different sets of Algorithm Parameters (8 processors). Horizontal axis ticks: 512, 1024, 1536, 2048, 2560, 3072. Series: Untuned, Tuned with MCSP, Tuned with MVSP, Optimal Execution Time.]
Modelling Linear Algebra Routines
[Figure: Parallel QR factorisation, IBM-SP2, 8 processors. Horizontal axis: problem size (512 to 3584). Vertical axis: time (seconds). Series: mean, model, optimum.]
“mean” refers to the mean of the execution times with representative values of the Algorithmic Parameters (execution time which could be obtained by a non-expert user)
“optimum” is the lowest time of all the executions performed with representative values of the Algorithmic Parameters
“model” is the execution time with the values selected with the model
Modelling Linear Algebra Routines
Parameter selection for the QR algorithm

IBM SP2
              p=4             p=8
   n       b   r   c       b   r   c
 1024     16   1   4      16   1   8
 2048     32   1   4      16   1   8
 3072     32   1   4      32   2   4
 4096     32   1   4      32   2   4

Origin 2000
              p=4             p=8
   n       b   r   c       b   r   c
 1024     32   4   1      32   4   2
 2048     64   4   1      32   4   2
 3072      -   -   -      32   4   2
 4096      -   -   -      64   4   2

Network of Pentium III with Fast Ethernet
              p=4             p=8
   n       b   r   c       b   r   c
 1024     16   1   4      16   1   8
 2048     16   1   4      16   1   8
 3072     32   1   4      32   1   8
 4096     32   1   4      32   1   8
Installation Routines
In the formulas (parallel block LU factorisation)
  $T_{ARI} = \frac{2}{3} \frac{n^3}{p} k_{3,gemm}(n,b,r,c) + \frac{r+c}{p} n^2 b \, k_{3,trsm}(n,b,r,c) + \frac{1}{3} b^2 n \, k_{2,getf2}(n,b,r,c)$
  $T_{COM} = \frac{2 n d}{b} t_s(n,b,r,c) + \frac{2 n^2 d}{p} t_w(n,b,r,c)$
the values of the System Parameters (k2,getf2, k3,trsm, k3,gemm, ts, tw) must be estimated as functions of the problem size (n) and the Algorithmic Parameters (b, r, c)
Installation Routines
  By running, at installation time, Installation Routines associated to the linear algebra routine
  And storing the information generated, to be used at running time
Each linear algebra routine must be designed together with the corresponding installation routines, and the installation process must be detailed.
Installation Routines
$k_{3,gemm}(n,b,r,c)$ is estimated by performing matrix-matrix multiplications and updates of size (n/r × b) × (b × n/c).
Because the size of the matrix to work with decreases during the execution, different values can be estimated for different problem sizes, and the formula can be modified to include the possibility of these estimations with different values, for example by splitting the formula into four formulas with different problem sizes.
Installation Routines
For $k_{3,trsm}(n,b,r,c)$, two multiple triangular systems are solved, one upper triangular of size b × n/c, and another lower triangular of size n/r × b.
Thus, two parameters are estimated, one of them depending on n, b and c, and the other depending on n, b and r.
As for the previous parameter, values can be obtained for different problem sizes.
Installation Routines
$k_{2,getf2}(n,b,r,c)$ corresponds to a level 2 sequential LU factorisation of size b × b.
At installation time each of the basic routines is executed varying the values of the parameters it depends on, using representative values (selected by the routine designer or the system manager).
The information generated is stored in a file to be used at running time, or in the code of the linear algebra routine before its installation.
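A minimal sketch of what such an installation routine could look like for the matrix-matrix multiplication parameter. It uses NumPy's matrix product as a stand-in for the BLAS 3 kernel, times products of size (n/r × b) × (b × n/c) for representative values, and stores the estimates in a JSON file. The function names, the file name and the chosen representative values are all hypothetical.

```python
import json
import os
import tempfile
import time

import numpy as np


def estimate_k3_gemm(m, b, q, repeats=3):
    """Estimated time per floating-point operation, in microseconds, of an
    (m x b) by (b x q) product (best of a few repetitions)."""
    rng = np.random.default_rng(0)
    a = rng.random((m, b))
    x = rng.random((b, q))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ x
        best = min(best, time.perf_counter() - start)
    return best * 1e6 / (2.0 * m * b * q)


def install(sizes, block_sizes, meshes, path):
    """Run the timing for every representative (n, b, r, c) and store the table."""
    table = {}
    for n in sizes:
        for b in block_sizes:
            for r, c in meshes:
                table[f"{n},{b},{r},{c}"] = estimate_k3_gemm(n // r, b, n // c)
    with open(path, "w") as f:
        json.dump(table, f, indent=2)
    return table


# Small representative values so that the sketch runs quickly.
path = os.path.join(tempfile.gettempdir(), "k3_gemm_table.json")
table = install([256], [16, 32], [(1, 4), (2, 2)], path)
```

At running time the stored file would be read back and queried with the problem size and the candidate Algorithmic Parameters; the other basic routines (triangular solves, level 2 factorisation) would get analogous timing functions.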
Installation Routines
$t_s(n,b,r,c)$ and $t_w(n,b,r,c)$ appear in communications of three types:
  In one of them a block of size b × b is broadcast in a row, and the parameter depends on b and c.
  In another a block of size b × b is broadcast in a column, and the parameter depends on b and r.
  And in the other, blocks of sizes b × n/c and n/r × b are broadcast in each one of the columns and rows of processors. These parameters depend on n, b, r and c.
Installation Routines
In practice each System Parameter depends on a smaller number of Algorithmic Parameters, but this is known only after the installation process is completed.
The routine designer also designs the installation process, and can use their experience to guide the installation.
The basic installation process can be designed to allow the intervention of the system manager.
Installation Routines
Some results in different systems (physical and logical platform)
Values of k3_DTRMM (≈ k3_DGEMM) on the different platforms (in microseconds)

                                              Block size
System   Library    n                  16       32       64       128
SUN1     refBLAS    512, ..., 4096   0.0200   0.0200   0.0220   0.0280
SUN1     macBLAS    512, ..., 4096   0.0120   0.0110   0.0110   0.0110
SUN1     ATLAS      512, ..., 4096   0.0070   0.0060   0.0060   0.0060
SUN5     refBLAS    512, ..., 4096   0.0120   0.0130   0.0140   0.0150
SUN5     macBLAS    512, ..., 4096   0.0060   0.0050   0.0050   0.0050
SUN5     ATLAS      512, ..., 4096   0.0040   0.0032   0.0025   0.0025
PIII     ATLAS      512, ..., 4096   0.0038   0.0033   0.0030   0.0030
PPC      macBLAS    512, ..., 4096   0.0023   0.0019   0.0018   0.0018
R10K     macBLAS    512, ..., 4096   0.0070   0.0030   0.0025   0.0025
Installation Routines
Values of k2_DGEQR2 (≈ k2_DLARFT) on the different platforms (in microseconds)

System   Library    n                Block size 16, 32, 64, 128
SUN1     refBLAS    512, ..., 4096   0.0200
SUN1     macBLAS    512, ..., 4096   0.0500
SUN1     ATLAS      512, ..., 4096   0.0700
SUN5     refBLAS    512, ..., 4096   0.0050
SUN5     macBLAS    512, ..., 4096   0.0300
SUN5     ATLAS      512, ..., 4096   0.0500
PIII     ATLAS      512, ..., 4096   0.0150
PPC      macBLAS    512, ..., 4096   0.0100
R10K     macBLAS    512, ..., 4096   0.0250
Installation Routines
Typically the values of the communication parameters are well estimated with a ping-pong test.

System      Library    n                ts / tw (block size 16, 32, 64, 128)
cSUN1       MPICH      512, ..., 4096   170 / 7.0
cPIII       MPICH      512, ..., 4096   60 / 0.7
IBM-SP2     Mac-MPI    512, ..., 4096   75 / 0.3
Origin 2K   Mac-MPI    512, ..., 4096   20 / 0.1
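One common way to turn ping-pong measurements into the two communication parameters is a straight-line fit of the one-way time against the message length, time ≈ ts + words × tw. The sketch below assumes that model and uses synthetic timings generated from illustrative values; a real installation routine would measure round trips with the message-passing library and halve them. The name `fit_ts_tw` is hypothetical.

```python
import numpy as np


def fit_ts_tw(words, one_way_times):
    """Least-squares fit of time = ts + words * tw; returns (ts, tw)."""
    tw, ts = np.polyfit(words, one_way_times, 1)  # slope first, then intercept
    return ts, tw


# Synthetic one-way times built from illustrative values ts = 60, tw = 0.7
# (not measurements).
words = np.array([1, 256, 1024, 4096, 16384], dtype=float)
times = 60.0 + 0.7 * words
ts, tw = fit_ts_tw(words, times)
```

Using several message lengths, rather than one short and one long message, makes the estimate less sensitive to noise in individual measurements.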