24 June 2015 Universidad Politécnica de Valencia1 Advances in the Optimization of Parallel Routines...


18 April 2023 Universidad Politécnica de Valencia

1

Advances in the Optimization of Parallel Routines (I)

Domingo Giménez
Departamento de Informática y Sistemas
Universidad de Murcia, Spain
dis.um.es/~domingo

Outline
  A little history
  Modelling Linear Algebra Routines
  Installation routines
  Autotuning routines
  Modifications to libraries' hierarchy
  Polylibraries
  Algorithmic schemes
  Heterogeneous systems
  Peer to peer computing

Collaborations and self-references

Modelling Linear Algebra Routines
  + J. Cuenca + J. González: Modelling the Behaviour of Linear Algebra Algorithms with Message-passing. 2001
    Towards the Design of an Automatically Tuned Linear Algebra Library. 2002
  + J. Cuenca + L. P. García + J. González + A. Vidal: Empirical Modelling of Parallel Linear Algebra Routines. 2003

Installation routines
  + G. Carrillo: Installation routines for linear algebra libraries on LANs. 2000
  + G. Carrillo + J. Cuenca + J. González: Optimización automática de rutinas paralelas de álgebra lineal. 2000

Autotuning routines
  + J. Cuenca + J. González: Automatic parameterization of parallel linear algebra routines. 2001
  + J. Cuenca: Some considerations about the Automatic Optimization of Parallel Linear Algebra Routines. 2002

Modifications to the libraries' hierarchy
  + J. Cuenca + J. González: Architecture of an Automatic Tuned Linear Algebra Library. 2002 - 2004

Polylibraries
  + P. Alberti + P. Alonso + J. Cuenca + A. Vidal: Designing Polylibraries to Speed Up Parallel Computations. 2003

Algorithmic schemes
  + J. P. Martínez: Automatic Optimization in Parallel Dynamic Programming Schemes. 2004

Heterogeneous systems
  + J. Cuenca + J. Dongarra + J. González + K. Roche: Automatic Optimization of Parallel Linear Algebra Routines in Systems with Variable Load. 2003
  + J. Cuenca + J. P. Martínez: Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems. 2004


A little history

Parallel optimization in the past:
  Hand-optimization for each platform
  Time consuming
  Incompatible with hardware evolution
  Incompatible with changes in the system (architecture and basic libraries)
  Unsuitable for systems with variable workloads
  Misuse by non-expert users

Initial solutions to this situation:
  Problem-specific solutions
  Polyalgorithms
  Installation tests

Problem-specific solutions:
  Brewer (1994): Sorting Algorithms, Differential Equations
  Frigo (1997): FFTW: The Fastest Fourier Transform in the West
  LAWRA (1997): Linear Algebra With Recursive Algorithms

Polyalgorithms:
  Brewer
  FFTW
  PHiPAC (1997): Linear Algebra

Installation tests:
  ATLAS (2001): Dense Linear Algebra, sequential
  Carrillo + Giménez (2000): Gauss elimination, heterogeneous algorithm
  I-LIB (2000): some parallel linear algebra routines

A little history

Parallel optimization today:
  Optimization based on computational kernels
  Systematic development of routines
  Auto-optimization of routines
  Middleware for auto-optimization

Optimization based on computational kernels:
  Efficient kernels (BLAS) and algorithms based on these kernels
  Auto-optimization of the basic kernels (ATLAS)

Systematic development of routines:
  FLAME project
    R. van de Geijn + E. Quintana + …
    Dense Linear Algebra
    Based on Object Oriented Design
  LAWRA
    Dense Linear Algebra
    For Shared Memory Systems

Auto-optimization of routines:
  At installation time:
    ATLAS, Dongarra + Whaley
    I-LIB, Kanada + Katagiri + Kuroda
    SOLAR, Cuenca + Giménez + González
    LFC, Dongarra + Roche
  At execution time:
    Solve a reduced problem in each processor (Kalinov + Lastovetsky)
    Use a system evaluation tool (NWS)

Middleware for auto-optimization:
  LFC: Middleware for Dense Linear Algebra Software in Clusters
  Hierarchy of autotuning libraries: include in the libraries installation routines to be used in the development of higher level libraries
  FIBER: Proposal of general middleware. Evolution of I-LIB
  mpC: For heterogeneous systems

A little history

Parallel optimization in the future?:
  Skeletons and languages
  Heterogeneous and variable-load systems
  Distributed systems
  P2P computing

Skeletons and languages:
  Develop skeletons for parallel algorithmic schemes together with execution time models, and provide the users with these libraries (MALLBA, Málaga-La Laguna-Barcelona) or languages (P3L, Pisa)

Heterogeneous and variable-load systems:
  Heterogeneous algorithms: unbalanced distribution of data (static or dynamic)
  Homogeneous algorithms: more processes than processors, and assignment of processes to processors (static or dynamic)
  Variable-load systems treated as dynamic heterogeneous systems

Distributed systems:
  Intrinsically heterogeneous and variable-load
  Very high cost of communications
  Special middleware is necessary (Globus, NWS)
  There can be servers to attend to the queries of clients

P2P computing:
  Users can join and leave dynamically
  All the users are of the same type (initially)
  It is distributed, heterogeneous and variable-load
  But special middleware is necessary


Modelling Linear Algebra Routines

It is necessary to predict the execution time accurately in order to select:
  The number of processes
  The number of processors
  Which processors
  The number of rows and columns of processes (the topology)
  The assignment of processes to processors
  The computational block size (in linear algebra algorithms)
  The communication block size
  The algorithm (polyalgorithms)
  The routine or library (polylibraries)

Modelling Linear Algebra Routines

Cost of a parallel program:

$$t_{parallel} = t_{arith} + t_{comm} + t_{overhead} - t_{overlap}$$

  $t_{arith}$: arithmetic time
  $t_{comm}$: communication time
  $t_{overhead}$: overhead, for synchronization, imbalance, process creation, ...
  $t_{overlap}$: overlapping of communication and computation

Modelling Linear Algebra Routines

Estimation of the time:

$$t_{parallel} = t_{arith} + t_{comm}$$

Considering computation and communication divided into a number of steps:

$$t_{parallel} = t_{arith,1} + t_{comm,1} + t_{arith,2} + t_{comm,2} + \dots$$

And for each part of the formula, the value used is that of the process which gives the highest value.
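The step-wise estimate can be illustrated with a minimal Python sketch. The function name `estimate_parallel_time` and the input layout (one list of per-process times for each step) are assumptions made for this illustration, not something taken from the slides.

```python
def estimate_parallel_time(steps):
    """Estimate t_parallel as the sum over steps of the slowest process.

    steps: list of (arith_times, comm_times) pairs; each element of the
    pair is a list with one time per process for that step.
    """
    total = 0.0
    for arith_times, comm_times in steps:
        # For each part of the formula, take the process with the highest value.
        total += max(arith_times) + max(comm_times)
    return total


if __name__ == "__main__":
    # Hypothetical times (arbitrary units) for 3 processes and 2 steps.
    example_steps = [
        ([4.0, 5.0, 3.0], [1.0, 0.5, 0.8]),
        ([2.0, 2.5, 2.2], [0.4, 0.6, 0.5]),
    ]
    print(estimate_parallel_time(example_steps))
```

Taking the maximum separately for the arithmetic and the communication part of each step gives an upper bound, since the slowest process for computation need not be the slowest for communication.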

Modelling Linear Algebra Routines

The time depends on the problem size (n) and the system size (p):

$$t_{parallel}(n, p)$$

But also on some ALGORITHMIC PARAMETERS (AP), like the block size (b) and the number of rows (r) and columns (c) of processors in algorithms for a mesh of processors:

$$t_{parallel}(n, b, r, c)$$

And on some SYSTEM PARAMETERS (SP), which reflect the computation and communication characteristics of the system. Typically these are the cost of an arithmetic operation (tc), the start-up time (ts) and the word-sending time (tw):

$$t_{parallel}(n, AP, SP)$$

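Modelling Linear Algebra Routines

LU factorisation (Golub - Van Loan):

$$\begin{pmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{pmatrix} = \begin{pmatrix} L_{11} & & \\ L_{21} & L_{22} & \\ L_{31} & L_{32} & L_{33} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} & U_{13} \\ & U_{22} & U_{23} \\ & & U_{33} \end{pmatrix}$$

Step 1: $A_{11} = L_{11} U_{11}$ (LU factorisation without blocks)
Step 2: $A_{1i} = L_{11} U_{1i}$ (multiple lower triangular systems)
Step 3: $A_{i1} = L_{i1} U_{11}$ (multiple upper triangular systems)
Step 4: $A_{ij} = A_{ij} - L_{i1} U_{1j}$ (update south-east blocks)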
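The four steps can be sketched in plain Python as an in-place block LU factorisation. This is only an illustrative sketch: it assumes no pivoting is needed (for example, a diagonally dominant matrix), stores L (unit diagonal, not stored) and U in the same array, and the names `lu_unblocked` and `block_lu` are chosen here rather than taken from the slides.

```python
def lu_unblocked(a, lo, hi):
    """Step 1: unblocked LU (no pivoting) of the diagonal block a[lo:hi][lo:hi]."""
    for k in range(lo, hi):
        for i in range(k + 1, hi):
            a[i][k] /= a[k][k]
            for j in range(k + 1, hi):
                a[i][j] -= a[i][k] * a[k][j]


def block_lu(a, b):
    """In-place block LU of the square matrix a (list of lists), block size b."""
    n = len(a)
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)
        lu_unblocked(a, k0, k1)  # Step 1: A11 = L11 U11
        # Step 2: solve L11 U1j = A1j (lower triangular systems, unit diagonal).
        for j in range(k1, n):
            for k in range(k0, k1):
                for i in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # Step 3: solve Li1 U11 = Ai1 (upper triangular systems).
        for i in range(k1, n):
            for k in range(k0, k1):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # Step 4: update the south-east blocks, Aij = Aij - Li1 U1j.
        for i in range(k1, n):
            for j in range(k1, n):
                s = 0.0
                for k in range(k0, k1):
                    s += a[i][k] * a[k][j]
                a[i][j] -= s
    return a


if __name__ == "__main__":
    m = [[4.0, 1.0, 2.0, 0.5],
         [1.0, 5.0, 1.0, 2.0],
         [2.0, 1.0, 6.0, 1.0],
         [0.5, 2.0, 1.0, 7.0]]
    for row in block_lu(m, 2):
        print(row)
```

In a tuned library steps 2 to 4 would be calls to BLAS 3 kernels, which is what makes the block size b an Algorithmic Parameter worth tuning.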

Modelling Linear Algebra Routines

The execution time is

$$t_{sequential}(n) = \frac{2}{3} n^3 t_c$$

If the blocks are of size 1, the operations are all with individual elements, but if the block size is b the cost is

$$t_{sequential}(n) = \frac{2}{3} k_3 n^3 + k_3 b n^2 + \frac{1}{3} k_2 b^2 n$$

with k3 and k2 the cost of operations performed with BLAS 3 or 2.

But the cost of different operations of the same level is different, and the theoretical cost could be better modelled as:

$$t_{sequential}(n) = \frac{2}{3} k_{3\_dgemm} n^3 + k_{3\_dtrsm} b n^2 + \frac{1}{3} k_{2\_dgetf2} b^2 n$$

Thus, the number of SYSTEM PARAMETERS increases (one for each basic routine), and ...

The value of each System Parameter can depend on the problem size (n) and on the value of the Algorithmic Parameters (b):

$$t_{sequential}(n, b) = \frac{2}{3} k_{3\_dgemm}(n, b) n^3 + k_{3\_dtrsm}(n, b) b n^2 + \frac{1}{3} k_{2\_dgetf2}(n, b) b^2 n$$

The formula has the form:

$$t(n, AP, SP(n, AP))$$

And what we want is to obtain the values of AP with which the lowest execution time is obtained.

Modelling Linear Algebra Routines

The values of the System Parameters could be obtained:
  With installation routines associated to each linear algebra routine
  From information stored when the library was installed in the system, thus generating a hierarchy of libraries with auto-optimization
  At execution time, by testing the system conditions prior to the call to the routine

These values can be obtained as simple values (traditional method) or as functions of the Algorithmic Parameters.

In the latter case, a multidimensional table of values as a function of the problem size and the Algorithmic Parameters is stored.

When a problem of a particular size is being solved, the execution time is estimated with the values of the stored size closest to the real size.

The problem is then solved with the values of the Algorithmic Parameters which predict the lowest execution time.
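The table lookup and parameter selection can be sketched as follows for the sequential block LU cost, (2/3) k3 n^3 + k3 b n^2 + (1/3) k2 b^2 n. Everything numeric here is hypothetical: the stored sizes, the k3 values per block size, and the single k2 value are invented for illustration, and one k3 value is used for both BLAS 3 terms as a simplification.

```python
# Hypothetical table stored at installation time: K3_TABLE[n][b] is the
# cost (microseconds) of a BLAS 3 operation for stored size n and block size b.
K3_TABLE = {
    512:  {16: 0.0040, 32: 0.0032, 64: 0.0025, 128: 0.0025},
    2048: {16: 0.0042, 32: 0.0033, 64: 0.0026, 128: 0.0025},
}
K2 = 0.05  # hypothetical cost of a BLAS 2 operation, taken as constant


def nearest_stored_size(n, table=K3_TABLE):
    """Return the stored problem size closest to the real size n."""
    return min(table, key=lambda stored: abs(stored - n))


def t_sequential(n, b, k3, k2=K2):
    """Sequential block LU cost model with block size b."""
    return (2.0 / 3.0) * k3 * n**3 + k3 * b * n**2 + (1.0 / 3.0) * k2 * b**2 * n


def select_block_size(n, table=K3_TABLE):
    """Pick the block size that predicts the lowest execution time."""
    row = table[nearest_stored_size(n, table)]
    return min(row, key=lambda b: t_sequential(n, b, row[b]))


if __name__ == "__main__":
    n = 1800
    print(nearest_stored_size(n), select_block_size(n))
```

With constant System Parameters this model would always prefer the smallest block size; it is the dependence of k3 on b that produces an interior optimum.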

Modelling Linear Algebra Routines

Parallel block LU factorisation:

[Figure: the matrix, the processors, and the distribution of computations in the first step]

Distribution of computations on successive steps:

[Figure: second step, third step]

Modelling Linear Algebra Routines

The cost of parallel block LU factorisation:

Tuning Algorithmic Parameters:
  block size: b
  2D-mesh of p processors: p = r × c, d = max(r, c)

System Parameters:
  cost of arithmetic operations: k2,getf2 k3,trsm k3,gemm
  communication parameters: ts tw

$$T_{ARI} = \frac{2}{3} \frac{n^3}{p} k_{3,gemm} + \frac{r + c}{p} n^2 b \, k_{3,trsm} + \frac{1}{3} b^2 n \, k_{2,getf2}$$

$$T_{COM} = t_s \frac{2 n d}{b} + t_w \frac{2 n^2 d}{p}$$
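A small Python sketch shows how this model could be used to choose b, r and c for a given n and p. The System Parameter values in the example are hypothetical and are treated as constants, and the function names are chosen for this illustration only.

```python
def t_parallel_lu(n, b, r, c, sp):
    """Modelled time T_ARI + T_COM of parallel block LU on an r x c mesh."""
    p = r * c
    d = max(r, c)
    t_ari = ((2.0 / 3.0) * n**3 / p * sp["k3_gemm"]
             + (r + c) / p * n**2 * b * sp["k3_trsm"]
             + (1.0 / 3.0) * b**2 * n * sp["k2_getf2"])
    t_com = sp["ts"] * 2.0 * n * d / b + sp["tw"] * 2.0 * n**2 * d / p
    return t_ari + t_com


def mesh_shapes(p):
    """All (r, c) pairs with r * c == p."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]


def select_parameters(n, p, block_sizes, sp):
    """Return the (b, r, c) with the lowest modelled time.

    Ties are resolved in favour of the first candidate generated.
    """
    candidates = ((b, r, c) for b in block_sizes for r, c in mesh_shapes(p))
    return min(candidates, key=lambda ap: t_parallel_lu(n, *ap, sp))


if __name__ == "__main__":
    # Hypothetical System Parameters, all in microseconds.
    sp = {"k3_gemm": 0.003, "k3_trsm": 0.003, "k2_getf2": 0.015,
          "ts": 60.0, "tw": 0.7}
    print(select_parameters(2048, 8, [16, 32, 64, 128], sp))
```

Because both (r + c) and d = max(r, c) grow as the mesh becomes more elongated, this model favours meshes that are close to square when the System Parameters are constant.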

Modelling Linear Algebra Routines

The cost of parallel block QR factorisation:

Tuning Algorithmic Parameters:
  block size: b
  2D-mesh of p processors: p = r × c

System Parameters:
  cost of arithmetic operations: k2,geqr2 k2,larft k3,gemm k3,trmm
  communication parameters: ts tw

[Formulas for T_ARI and T_COM: not recoverable from this transcript. T_ARI combines terms in k3,gemm, k3,trmm, k2,geqr2 and k2,larft; T_COM combines terms in ts and tw with logarithmic factors.]

Modelling Linear Algebra Routines

The same basic operations appear repeatedly in different higher level routines: the information generated for one routine (say, LU) could be stored and used for other routines (e.g. QR), and a common format is necessary to store the information.

Modelling Linear Algebra Routines

[Figure: Comparison of execution times using different sets of Algorithmic Parameters (8 processors). Series: Untuned; Tuned with MCSP; Tuned with MVSP; Optimal Execution Time. Horizontal axis values: 512 to 3072; vertical axis values: 0 to 200.]

Modelling Linear Algebra Routines

[Figure: Parallel QR factorisation. IBM-SP2, 8 processors. Horizontal axis: problem size (512 to 3584); vertical axis: time (seconds). Series: mean, model, optimum.]

“mean” refers to the mean of the execution times with representative values of the Algorithmic Parameters (execution time which could be obtained by a non-expert user)

“optimum” is the lowest time of all the executions performed with representative values of the Algorithmic Parameters

“model” is the execution time with the values selected with the model

Modelling Linear Algebra Routines

Parameter selection for the QR algorithm

IBM SP2
              p=4            p=8
   n       b   r   c      b   r   c
1024      16   1   4     16   1   8
2048      32   1   4     16   1   8
3072      32   1   4     32   2   4
4096      32   1   4     32   2   4

Origin 2000
              p=4            p=8
   n       b   r   c      b   r   c
1024      32   4   1     32   4   2
2048      64   4   1     32   4   2
3072                     32   4   2
4096                     64   4   2

Network of Pentium III with Fast Ethernet
              p=4            p=8
   n       b   r   c      b   r   c
1024      16   1   4     16   1   8
2048      16   1   4     16   1   8
3072      32   1   4     32   1   8
4096      32   1   4     32   1   8


Installation Routines

In the formulas (parallel block LU factorisation)

$$T_{ARI} = \frac{2}{3} \frac{n^3}{p} k_{3,gemm}(n,b,r,c) + \frac{r + c}{p} n^2 b \, k_{3,trsm}(n,b,r,c) + \frac{1}{3} b^2 n \, k_{2,getf2}(n,b,r,c)$$

$$T_{COM} = t_s(n,b,r,c) \frac{2 n d}{b} + t_w(n,b,r,c) \frac{2 n^2 d}{p}$$

the values of the System Parameters (k2,getf2, k3,trsm, k3,gemm, ts, tw) must be estimated as functions of the problem size (n) and the Algorithmic Parameters (b, r, c):
  By running, at installation time, Installation Routines associated to the linear algebra routine
  And storing the information generated, to be used at running time

Each linear algebra routine must be designed together with the corresponding installation routines, and the installation process must be detailed.

Installation Routines

$k_{3,gemm}(n,b,r,c)$ is estimated by performing matrix-matrix multiplications and updates of size (n/r × b) × (b × n/c).

Because the size of the matrix to work with decreases during the execution, different values can be estimated for different problem sizes, and the formula can be modified to include the possibility of these estimations with different values, for example by splitting the formula into four formulas with different problem sizes.
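As a rough sketch of such an installation routine, the following Python code times a multiplication of an (n/r × b) block by a (b × n/c) block and divides by the number of floating point operations. A real installation routine would call the BLAS dgemm of the target system; the pure-Python `multiply` is only a stand-in so that the example is self-contained, and all names are assumptions of this sketch.

```python
import time


def multiply(a, b_mat):
    """Plain triple-loop product, standing in for the dgemm kernel."""
    rows, inner, cols = len(a), len(b_mat), len(b_mat[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        row_out = out[i]
        for k in range(inner):
            aik = a[i][k]
            row_b = b_mat[k]
            for j in range(cols):
                row_out[j] += aik * row_b[j]
    return out


def estimate_k3_gemm(n, b, r, c, repeats=3):
    """Estimate the time per floating point operation (in seconds) for an
    update of size (n/r x b) times (b x n/c)."""
    rows, cols = n // r, n // c
    left = [[1.0] * b for _ in range(rows)]
    right = [[1.0] * cols for _ in range(b)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        multiply(left, right)
        best = min(best, time.perf_counter() - start)
    flops = 2.0 * rows * b * cols  # one multiply and one add per inner step
    return best / flops


if __name__ == "__main__":
    print(estimate_k3_gemm(n=128, b=16, r=2, c=2))
```

Keeping the best of several repetitions is a design choice of this sketch that reduces the influence of other load on the machine; a mean could be used instead.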

Installation Routines

$k_{3,trsm}(n,b,r,c)$: two multiple triangular systems are solved, one upper triangular of size b × n/c, and another lower triangular of size n/r × b.

Thus, two parameters are estimated, one of them depending on n, b and c, and the other depending on n, b and r.

As for the previous parameter, values can be obtained for different problem sizes.

Installation Routines

$k_{2,getf2}(n,b,r,c)$ corresponds to a level 2 sequential LU factorisation of size b × b.

At installation time each of the basic routines is executed varying the value of the parameters it depends on, with representative values (selected by the routine designer or the system manager).

The information generated is stored in a file to be used at running time, or in the code of the linear algebra routine before its installation.
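The installation loop and the stored file might look like the following sketch. The JSON layout, the key format "n,b", and the names `run_installation` and `load_parameter` are assumptions made for this illustration; `measure` stands for whatever timing routine is associated with the basic routine being installed.

```python
import json


def run_installation(measure, sizes, block_sizes, path):
    """Run measure(n, b) for every representative (n, b) and store the table.

    measure: callable returning the estimated System Parameter value.
    sizes, block_sizes: representative values chosen by the routine
    designer or the system manager.
    path: file where the table is written, to be read at running time.
    """
    table = {}
    for n in sizes:
        for b in block_sizes:
            table[f"{n},{b}"] = measure(n, b)
    with open(path, "w") as handle:
        json.dump(table, handle, indent=2)
    return table


def load_parameter(path, n, b):
    """Read back the stored value for the pair (n, b) at running time."""
    with open(path) as handle:
        table = json.load(handle)
    return table[f"{n},{b}"]
```

A call such as `run_installation(my_timer, [512, 1024], [16, 32, 64], "k3_gemm.json")`, where `my_timer` and the file name are hypothetical, would produce one stored value per combination.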

Installation Routines

$t_s(n,b,r,c)$ and $t_w(n,b,r,c)$ appear in communications of three types:
  In one of them, a block of size b × b is broadcast in a row, and this parameter depends on b and c
  In another, a block of size b × b is broadcast in a column, and the parameter depends on b and r
  In the other, blocks of sizes b × n/c and n/r × b are broadcast in each one of the columns and rows of processors. These parameters depend on n, b, r and c

In practice each System Parameter depends on a smaller number of Algorithmic Parameters, but this is known only after the installation process is completed.

The routine designer also designs the installation process, and can use his or her experience to guide the installation.

The basic installation process can be designed to allow the intervention of the system manager.

Installation Routines

Some results in different systems (physical and logical platform)

Values of k3_DTRMM (≈ k3_DGEMM) on the different platforms (in microseconds)

                                         Block size
System  Library   n                  16      32      64      128
SUN1    refBLAS   512, ..., 4096   0.0200  0.0200  0.0220  0.0280
SUN1    macBLAS   512, ..., 4096   0.0120  0.0110  0.0110  0.0110
SUN1    ATLAS     512, ..., 4096   0.0070  0.0060  0.0060  0.0060
SUN5    refBLAS   512, ..., 4096   0.0120  0.0130  0.0140  0.0150
SUN5    macBLAS   512, ..., 4096   0.0060  0.0050  0.0050  0.0050
SUN5    ATLAS     512, ..., 4096   0.0040  0.0032  0.0025  0.0025
PIII    ATLAS     512, ..., 4096   0.0038  0.0033  0.0030  0.0030
PPC     macBLAS   512, ..., 4096   0.0023  0.0019  0.0018  0.0018
R10K    macBLAS   512, ..., 4096   0.0070  0.0030  0.0025  0.0025

Installation Routines

Values of k2_DGEQR2 (≈ k2_DLARFT) on the different platforms (in microseconds)

System  Library   n                Block sizes 16, 32, 64, 128
SUN1    refBLAS   512, ..., 4096   0.0200
SUN1    macBLAS   512, ..., 4096   0.0500
SUN1    ATLAS     512, ..., 4096   0.0700
SUN5    refBLAS   512, ..., 4096   0.0050
SUN5    macBLAS   512, ..., 4096   0.0300
SUN5    ATLAS     512, ..., 4096   0.0500
PIII    ATLAS     512, ..., 4096   0.0150
PPC     macBLAS   512, ..., 4096   0.0100
R10K    macBLAS   512, ..., 4096   0.0250

Installation Routines

Typically the values of the communication parameters are well estimated with a ping-pong test.

System     Library   n                Block sizes 16, 32, 64, 128
cSUN1      MPICH     512, ..., 4096   170 / 7.0
cPIII      MPICH     512, ..., 4096   60 / 0.7
IBM-SP2    Mac-MPI   512, ..., 4096   75 / 0.3
Origin 2K  Mac-MPI   512, ..., 4096   20 / 0.1
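One common way to turn ping-pong measurements into communication parameters is a least-squares fit of the linear model t(m) = ts + tw · m, where m is the message size in words. The sketch below assumes this model and that the one-way time is half the measured round trip; the slides do not describe the fitting procedure, and the sample data are synthetic.

```python
def one_way_samples(round_trips):
    """Convert (words, round_trip_time) pairs into one-way samples."""
    return [(words, rtt / 2.0) for words, rtt in round_trips]


def fit_ts_tw(samples):
    """Least-squares fit of t = ts + tw * words.

    samples: list of (words, one_way_time) with at least two distinct sizes.
    Returns the pair (ts, tw).
    """
    count = len(samples)
    sum_x = sum(w for w, _ in samples)
    sum_y = sum(t for _, t in samples)
    sum_xx = sum(w * w for w, _ in samples)
    sum_xy = sum(w * t for w, t in samples)
    tw = (count * sum_xy - sum_x * sum_y) / (count * sum_xx - sum_x * sum_x)
    ts = (sum_y - tw * sum_x) / count
    return ts, tw


if __name__ == "__main__":
    # Synthetic round-trip times generated from ts = 60 and tw = 0.7.
    measured = [(m, 2.0 * (60.0 + 0.7 * m)) for m in (1, 100, 1000, 10000)]
    print(fit_ts_tw(one_way_samples(measured)))
```

In a real installation the round-trip times would come from timed message exchanges between two processes of the target system.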