Javier Cuenca, José González Department of Ingeniería y Tecnología de Computadores Domingo...

Post on 17-Jan-2016

Javier Cuenca, José González
Department of Ingeniería y Tecnología de Computadores

Domingo Giménez
Department of Informática y Sistemas

University of Murcia, SPAIN

Towards the Design of an Automatically Tuned Linear Algebra Library

Current Situation of Linear Algebra Parallel Routines

Linear Algebra: highly optimizable operations, but optimizations are platform specific

Traditional method: hand-optimization for each platform
• Time-consuming
• Incompatible with hardware evolution
• Incompatible with changes in the system (architecture and basic libraries)
• Unsuitable for systems with variable workload
• Misuse by non-expert users

Solutions to this situation?

Some groups and projects: ATLAS, GrADS, LAWRA, FLAME, I-LIB

But the problem is very complex.

Our approach

Routines parameterised: system parameters, algorithmic parameters

System parameters obtained at installation time:
• Analytical model of the routine and simple installation routines to obtain the system parameters
• A reduced number of executions at installation time

Algorithmic parameters:
• From the analytical model with the system parameters obtained in the installation process

Our approach: the scheme

[Scheme diagram. Labels in the LAR-DESIGNER part: DESIGN, LAR, MODELLING LAR, LAR-MOD, IMPLEMEN. OF LAR-ERs, LAR-ERs. Labels in the SYSTEM MANAGER part: INSTALLATION, LAR-IF, EXECUT. OF LAR-ERs, BL, LAR-SPF, OAP SELECTION, LAR-OAPF, INCLUSION PROCESS, LIBRARY.]

Design: Modelling the LAR

[Scheme diagram, partial. Labels: LAR-DESIGNER, DESIGN, LAR, MODELLING LAR, LAR-MOD.]

The behaviour of the algorithm on the platform is defined:

Texec = f(SPs, n, APs)

• SPs = f(n, APs): System Parameters
• APs: Algorithmic Parameters
• n: Problem Size
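The dependence Texec = f(SPs, n, APs), with the SPs themselves depending on the APs, can be pictured with a small sketch. Everything concrete here is an assumption made for illustration: the cost formula is a toy model rather than the authors' model of any routine, and the names `APs`, `K3_TABLE`, `TS`, `TW`, `k3` and `t_exec` as well as all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class APs:
    b: int  # block size
    p: int  # number of processors


# Hypothetical installation-time measurements: cost per floating point
# operation (seconds) observed for each block size that was tried.
K3_TABLE = {16: 4.0e-9, 32: 2.5e-9, 64: 2.0e-9, 128: 2.2e-9}
TS = 5.0e-5  # hypothetical start-up time (seconds)
TW = 1.0e-8  # hypothetical word-sending time (seconds)


def k3(aps: APs) -> float:
    """An SP as a function of the APs: the cost measured for block size b."""
    return K3_TABLE[aps.b]


def t_exec(n: int, aps: APs) -> float:
    """Toy model: arithmetic term plus communication term (illustrative only)."""
    arith = (2.0 / 3.0) * n**3 / aps.p * k3(aps)
    steps = n // aps.b
    comm = 0.0 if aps.p == 1 else steps * (TS + aps.b * n * TW)
    return arith + comm
```

Calling `t_exec(1024, APs(b=64, p=4))` evaluates the model for one candidate set of algorithmic parameters without running the routine itself.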

LAR-MOD: Analytical Model of LAR

System Parameters (SPs) reflect what determines the LARs' performance:
• Hardware platform: physical characteristics and current conditions
• Basic libraries

Two kinds of SPs:

• Communication System Parameters (CSPs):
  ts: start-up time
  tw: word-sending time
• Arithmetic System Parameters (ASPs):
  tc: arithmetic cost. Using BLAS: k1, k2 and k3
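One common way to obtain values for ts and tw, sketched here under assumptions, is to time messages of several lengths and fit the straight line time = ts + size * tw. The slides do not say how the values are extracted from the measurements, so the least-squares fit, the function name `fit_ts_tw` and the timings below are all hypothetical.

```python
def fit_ts_tw(sizes, times):
    """Least-squares fit of time = ts + size * tw over the measured points."""
    m = len(sizes)
    mean_x = sum(sizes) / m
    mean_y = sum(times) / m
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    tw = sxy / sxx            # slope: time per word
    ts = mean_y - tw * mean_x  # intercept: start-up time
    return ts, tw


# Hypothetical timings: message length in words, seconds per message.
sizes = [1_000, 10_000, 100_000, 1_000_000]
times = [6.0e-5, 1.5e-4, 1.05e-3, 1.005e-2]
ts, tw = fit_ts_tw(sizes, times)
```

On a real platform the timings would come from a communication test between processors, and the fit would be repeated whenever the network conditions change.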

How to estimate each SP?

1. Obtain the kernel of the performance cost of the LAR
2. Make an Estimation Routine from this kernel

Design: Making the LAR-ERs

[Scheme diagram, partial. Labels: LAR-DESIGNER, DESIGN, LAR, MODELLING LAR, LAR-MOD, IMPLEMEN. OF LAR-ERs, LAR-ERs.]

LAR-ERs: Estimation Routines

Arithmetic System Parameters (ASPs): the Estimation Routine is built from the computation kernel of the LAR
• Similar storage scheme
• Similar quantity of data

Communication System Parameters (CSPs): the Estimation Routine is built from the communication kernel of the LAR
• Similar kind of communication
• Similar quantity of data
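An estimation routine for an arithmetic parameter can be sketched as follows: run a computation kernel on data of a representative size and divide the elapsed time by the number of operations. This is only an illustration; the pure-Python matrix update stands in for whatever kernel a given LAR actually uses, and `matmul_kernel` and `estimate_cost_per_flop` are hypothetical names.

```python
import time


def matmul_kernel(a, b, c, n):
    """c += a * b for n x n matrices stored as lists of rows (2 * n**3 flops)."""
    for i in range(n):
        ai, ci = a[i], c[i]
        for k in range(n):
            aik = ai[k]
            bk = b[k]
            for j in range(n):
                ci[j] += aik * bk[j]


def estimate_cost_per_flop(n=48, repeats=3):
    """Time the kernel and return the best observed seconds per flop."""
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    best = float("inf")
    for _ in range(repeats):
        c = [[0.0] * n for _ in range(n)]
        start = time.perf_counter()
        matmul_kernel(a, b, c, n)
        best = min(best, time.perf_counter() - start)
    return best / (2.0 * n**3)
```

Taking the minimum over a few repetitions reduces the effect of transient load; keeping the storage scheme and data size close to those of the real routine is what makes the estimate transferable to the model.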

Design: Process has finished

[Scheme diagram, partial. Labels: LAR-DESIGNER, DESIGN, LAR, MODELLING LAR, LAR-MOD, IMPLEMEN. OF LAR-ERs, LAR-ERs. Annotation: HAND-MADE, ONLY ONCE.]

Installation: Running the LAR-ERs

[Scheme diagram, partial. Added labels: SYSTEM MANAGER, INSTALLATION, LAR-IF, EXECUT. OF LAR-ERs, BL, LAR-SPF.]

Installation: obtaining the OAP

[Scheme diagram, partial. Added labels: OAP SELECTION, LAR-OAPF.]

Algorithmic Parameters (APs)

Once the SP values are known, the Optimum Values for the APs (OAP) are calculated:
• b: block size
• p: number of processors
• r × c: logical topology, grid configuration (logical 2D mesh)
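The OAP selection step can be sketched as an exhaustive search over candidate values of b, p and r × c that keeps the combination with the lowest modelled time. The search strategy, the toy cost formula in `model_time` and the names `grids` and `select_oap` are assumptions made for illustration; they are not taken from the slides.

```python
from itertools import product


def grids(p):
    """All r x c logical meshes with r * c == p."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]


def model_time(n, b, r, c, k3, ts, tw):
    """Toy cost model for a blocked routine on an r x c mesh (illustrative only)."""
    p = r * c
    arith = (2.0 / 3.0) * n**3 / p * k3
    if p == 1:
        return arith
    steps = n / b
    comm = steps * ((r + c) * ts + b * n * (1.0 / r + 1.0 / c) * tw)
    return arith + comm


def select_oap(n, block_sizes, proc_counts, k3, ts, tw):
    """Return (time, b, p, r, c) minimising the modelled time."""
    best = None
    for b, p in product(block_sizes, proc_counts):
        for r, c in grids(p):
            t = model_time(n, b, r, c, k3, ts, tw)
            if best is None or t < best[0]:
                best = (t, b, p, r, c)
    return best
```

Because only the model is evaluated, the search costs a few arithmetic operations per candidate, which is what allows the selection to happen without executing the routine for every combination.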


Installation: putting it all together

[Scheme diagram, complete. Labels in the LAR-DESIGNER part: DESIGN, LAR, MODELLING LAR, LAR-MOD, IMPLEMEN. OF LAR-ERs, LAR-ERs. Labels in the SYSTEM MANAGER part: INSTALLATION, LAR-IF, EXECUT. OF LAR-ERs, BL, LAR-SPF, OAP SELECTION, LAR-OAPF, INCLUSION PROCESS, LIBRARY.]

Installation process finished


Experiments

• LAR: Least Squares Toeplitz Routine. Platform: Network of PCs
• LAR: One-sided Block Jacobi Method to solve the Symmetric Eigenvalue Problem. Platform: SGI Origin 2000
• LAR: Gaussian elimination. Platform: NoW (heterogeneous system)
• LAR: block LU factorization. Platforms: IBM SP2, SGI Origin 2000, NoW. Basic Libraries: reference BLAS, machine BLAS, ATLAS

LU on IBM SP2

[Figure: quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case (SEQ) and in parallel with 4 and 8 processors (PAR4, PAR8). Horizontal axis ticks: 512 to 3584 in steps of 512. Vertical axis: 0 to 1.4.]
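The quotient plotted in these figures can be computed as in the sketch below: the time measured with the parameters chosen by the model, divided by the best time measured over all parameter values tried. The function name `quotient` and the timing values are hypothetical and serve only to show the arithmetic.

```python
def quotient(times_by_params, model_choice):
    """Time with the model's parameters divided by the best measured time."""
    optimum = min(times_by_params.values())
    return times_by_params[model_choice] / optimum


# Hypothetical measured times (seconds) for one problem size, keyed by block size.
measured = {16: 12.4, 32: 10.0, 64: 10.5, 128: 11.8}
q = quotient(measured, model_choice=64)
```

A value of 1 means the model picked the best of the tried parameters; values slightly above 1 mean the model's choice was close to it.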

LU on Origin 2000

[Figure: quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case (SEQ) and in parallel with 4, 8 and 16 processors (PAR4, PAR8, PAR16). Horizontal axis ticks: 512 to 3584 in steps of 512. Vertical axis: 0 to 1.4.]

LU on NoW

[Figure: quotient between the execution time with the parameters provided by the model and the optimum execution time, in the sequential case and in parallel with 4 processors, using machine BLAS and ATLAS as basic libraries (SEQ BLAS, SEQ ATLAS, PAR4 BLAS, PAR4 ATLAS). Horizontal axis ticks: 512 to 2048 in steps of 512. Vertical axis: 0 to 1.2.]

Future Works

• We aim to develop a methodology valid for a wide range of systems and to include it in the design of linear algebra libraries; the methodology still has to be analysed on more systems and with more routines
• The basic linear algebra library to use can be considered as another parameter
• An installation strategy common to a set of routines must be developed
• At the moment we are analysing routines individually, but it could be preferable to analyse algorithmic schemes
• We are working on the design of a strategy for parameter selection in dynamic systems