High Performance Matrix Inversion on a Multi-core
Platform with Several GPUs
Pablo Ezzatti¹, Enrique S. Quintana-Ortí² and Alfredo Remón²
¹Centro de Cálculo, Instituto de Computación, Univ. de la República, [email protected]
²Depto. de Ing. y Ciencia de los Computadores, Univ. Jaume I de Castellón, {quintana,remon}@icc.uji.es
Ayia Napa – February 2011
Alfredo Remon ([email protected]) High Performance Matrix Inversion with Several GPUs
Matrix inversion of large-scale matrices
Appears in a number of scientific applications, like model reduction or optimal control
Requires a high computational effort: 2n³ floating-point operations (flops)
Graphics processors (GPUs)
Massively parallel architectures
Good results on the acceleration of linear algebra operations
Outline
1 Motivation
2 Matrix inversion
3 Implementations on a multi-core CPU and multiple GPUs
4 Experimental analysis
5 Concluding remarks
Matrix inversion
Two different approaches
1 Based on Gaussian elimination (i.e., the LU factorization)
2 Based on Gauss-Jordan elimination (GJE)
Both approaches present a similar computational cost, but different properties
Matrix inversion via LU factorization
1 PA = LU
2 U → U−1
3 Solve the system XL = U−1 for X
4 Undo the permutations A−1 := XP
Implementation
The algorithm sweeps through the matrix four times.
It presents a mild load imbalance, due to the work with triangular factors.
This is the algorithm implemented by LAPACK.
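The four LU-based steps can be sketched with NumPy/SciPy. This is a minimal sketch, not the LAPACK implementation; `inv_via_lu` is a hypothetical helper name, and note that `scipy.linalg.lu` returns A = P·L·U (rather than PA = LU), so the permutation is undone with Pᵀ in step 4.

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

def inv_via_lu(A):
    """Invert A following the four LU-based steps (hypothetical helper).

    scipy.linalg.lu returns A = P @ L @ U, whereas the slides write
    P @ A = L @ U, hence the transposed permutation in step 4."""
    n = A.shape[0]
    P, L, U = lu(A)                                    # step 1: factorize
    Uinv = solve_triangular(U, np.eye(n))              # step 2: U -> U^{-1}
    # step 3: solve X L = U^{-1} for X, via L^T X^T = (U^{-1})^T
    X = solve_triangular(L, Uinv.T, lower=True, trans='T').T
    return X @ P.T                                     # step 4: undo permutations
```

The result can be checked against `numpy.linalg.inv` on a small nonsingular matrix.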
Matrix inversion via GJE
Based on Gauss-Jordan elimination
In essence, it is a reordering of the operations
Same arithmetical cost
Implementation
The algorithm sweeps through the matrix once → fewer memory accesses
Most of the computations are highly parallel → more parallelism
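As an illustration of the single sweep, an unblocked in-place GJE inversion can be sketched as follows. This is a textbook row-oriented sketch, not the paper's kernel; `gje_inverse` is a hypothetical name, and it assumes nonzero pivots (no pivoting, for clarity).

```python
import numpy as np

def gje_inverse(A):
    """In-place Gauss-Jordan inversion (sketch; assumes nonzero pivots)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n):
        piv = A[k, k]
        A[k, k] = 1.0
        A[k, :] /= piv              # scale the pivot row
        for i in range(n):
            if i != k:
                f = A[i, k]
                A[i, k] = 0.0
                A[i, :] -= f * A[k, :]  # eliminate column k from row i
    return A
```

After the single sweep the array holds A⁻¹; a production code would add partial pivoting, as the slides discuss later.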
Matrix inversion via GJE
At a given iteration, the matrix is partitioned as

  [ A00 A01 A02 ]
  [ A10 A11 A12 ]
  [ A20 A21 A22 ]

where A11 is b × b. The current column panel is processed with the unblocked algorithm:

  [ A01 ]           ( [ A01 ] )
  [ A11 ] := GJEunb ( [ A11 ] )
  [ A21 ]           ( [ A21 ] )
Matrix inversion via GJE
The remaining column panels are then updated.

Left of the panel:
  A00 := A00 + A01 · A10
  A20 := A20 + A21 · A10
  A10 := A11 · A10

Right of the panel:
  A02 := A02 + A01 · A12
  A22 := A22 + A21 · A12
  A12 := A11 · A12
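The panel factorization plus the six updates above can be sketched in NumPy. This is a CPU-only sketch under the assumption of nonzero pivots (no pivoting, for clarity); `gje_blocked_inverse` is a hypothetical name.

```python
import numpy as np

def gje_blocked_inverse(A, b=2):
    """Blocked GJE inversion sketch (assumes nonzero pivots)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(0, n, b):
        bk = min(b, n - k)
        # Panel factorization: [A01; A11; A21] := GJEunb([A01; A11; A21]),
        # i.e. unblocked GJE restricted to columns k..k+bk-1.
        P = A[:, k:k+bk]                      # view into A
        for j in range(bk):
            r = k + j
            piv = P[r, j]
            P[r, j] = 1.0
            P[r, :] /= piv                    # pivot row of the panel
            for i in range(n):
                if i != r:
                    f = P[i, j]
                    P[i, j] = 0.0
                    P[i, :] -= f * P[r, :]    # eliminate within the panel
        # The six updates, with W holding the pivot rows before the update
        # (W1 = old A10 on the left, W2 = old A12 on the right).
        A01, A11, A21 = A[:k, k:k+bk], A[k:k+bk, k:k+bk], A[k+bk:, k:k+bk]
        for cols in (slice(0, k), slice(k + bk, n)):
            W = A[k:k+bk, cols].copy()
            A[:k, cols] += A01 @ W            # A00 := A00 + A01*W (resp. A02)
            A[k+bk:, cols] += A21 @ W         # A20 := A20 + A21*W (resp. A22)
            A[k:k+bk, cols] = A11 @ W         # A10 := A11*W       (resp. A12)
    return A
```

The GEMM-shaped updates on the left and right block columns are exactly the operations that the multi-GPU variants offload to the GPUs.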
Matrix inversion via GJE
Move boundaries for the next iteration
A00 A01 A02
A10 A11 A12
A20 A21 A22
A11 is b × b
Outline
1 Motivation
2 Matrix inversion
3 Implementations on a multi-core CPU and multiple GPUs
4 Experimental analysis
5 Concluding remarks
Implementations on a multi-core CPU and multiple GPUs
Implementation GJEmGPU
Executes every operation on the most convenient device:
GJEunb on the CPU
Matrix-matrix products on the GPUs
Data distribution
Data is uniformly distributed among the GPUs
Each GPU performs the operations that involve its respective panel
[Figure: uniform column-panel distribution across GPU 1, GPU 2, GPU 3 and GPU 4]
Implementations on a multi-core CPU and multiple GPUs
Implementation GJEmGPU
All GPUs compute concurrently
The update of the current panel on the CPU is overlapped with some updates on the GPUs
The active column panel is transferred to the CPU initially, and broadcast to the GPUs after being updated
[Figure: concurrent CPU and GPU updates across GPU 1, GPU 2, GPU 3 and GPU 4]
Implementations on a multi-core CPU and multiple GPUs
Implementation GJELA
Based on GJEmGPU, but reordering the operations following a look-ahead approach
The first b columns of the block [A02; A12; A22] are updated and transferred to the CPU
[Figure: look-ahead variant, with the next panel highlighted across GPU 1, GPU 2, GPU 3 and GPU 4]
Implementations on a multi-core CPU and multiple GPUs
Implementation GJELA
The CPU updates [A01; A11; A21] and the newly received block (that is, blocks [A01; A11; A21] of the next iteration)
Concurrently, the GPUs update the rest of the matrix.
[Figure: CPU updates the panel while GPU 1, GPU 2, GPU 3 and GPU 4 update the trailing matrix]
Implementations on a multi-core CPU and multiple GPUs
Implementation GJEML
The optimal algorithmic block-sizes for the CPU and the GPUs are significantly different
The use of the same block-size on both architectures limits the performance of GJELA
In this version, the CPU employs a blocked version of GJE instead of the unblocked one
The CPU and GPUs employ block-sizes b and bc respectively, allowing each architecture to optimize its performance
Implementations on a multi-core CPU and multiple GPUs
Implementation GJECD
At any iteration, one of the GPUs requires more time than the rest, leading to load imbalance
We can partially overcome this problem by employing a cyclic distribution
The matrix is partitioned into blocks of b columns, and the i-th block of columns is stored and updated by the (i mod k)-th GPU
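The cyclic mapping can be illustrated with a toy helper (`owner` is a hypothetical name). In contrast to the uniform distribution, where each GPU owns one contiguous chunk and the leftmost GPUs run out of columns as the iteration advances, the cyclic scheme keeps every GPU busy until the final blocks.

```python
def owner(i, k):
    """Cyclic distribution: the i-th block of b columns is stored and
    updated by GPU (i mod k), for k GPUs."""
    return i % k

# 8 column blocks on 4 GPUs:
print([owner(i, 4) for i in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```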
[Figure: cyclic column-block distribution across GPU 1, GPU 2, GPU 3 and GPU 4]
Implementations on a multi-core CPU and multiple GPUs
Implementation GJEMerge
All GPUs perform 3 operations per iteration
Performing a minor change to our algorithm, we can reduce it to one operation per iteration, with the following advantages:
Less overhead due to routine invocations
Avoids the matrix-matrix products that involve small blocks
Implementations on a multi-core CPU and multiple GPUs
Implementation GJEMerge
At a given iteration, the matrix is partitioned as

  [ A00 A01 A02 ]
  [ A10 A11 A12 ]
  [ A20 A21 A22 ]

where A11 is b × b. The current column panel is now processed with the blocked algorithm:

  [ A01 ]           ( [ A01 ] )
  [ A11 ] := GJEblk ( [ A11 ] )
  [ A21 ]           ( [ A21 ] )
Implementations on a multi-core CPU and multiple GPUs
Implementation GJEMerge
The pivot rows are copied out and zeroed:

  W1 := A10;  A10 := 0
  W2 := A12;  A12 := 0

and the whole update collapses into a single matrix-matrix product:

  [ A00 A02 ]    [ A00 A02 ]   [ A01 ]
  [ A10 A12 ] := [ A10 A12 ] + [ A11 ] · [ W1 W2 ]
  [ A20 A22 ]    [ A20 A22 ]   [ A21 ]
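The merged variant can be sketched in NumPy as follows. This is a CPU-side sketch of the single-GEMM update, not the GPU implementation; `gje_merged_inverse` is a hypothetical name and pivoting is omitted for clarity.

```python
import numpy as np

def gje_merged_inverse(A, b=2):
    """GJE with the merged update: one matrix-matrix product per iteration
    (sketch; assumes nonzero pivots)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(0, n, b):
        bk = min(b, n - k)
        # Panel factorization restricted to columns k..k+bk-1.
        for j in range(bk):
            r = k + j
            piv = A[r, k + j]
            A[r, k + j] = 1.0
            A[r, k:k+bk] /= piv
            for i in range(n):
                if i != r:
                    f = A[i, k + j]
                    A[i, k + j] = 0.0
                    A[i, k:k+bk] -= f * A[r, k:k+bk]
        # Merged update: W = [W1 W2] holds the pivot rows outside the panel;
        # zero them, then a single GEMM updates the whole remaining matrix.
        cols = np.r_[0:k, k+bk:n]             # all columns outside the panel
        W = A[k:k+bk, cols]                   # fancy indexing copies
        A[k:k+bk, cols] = 0.0                 # A10 := 0, A12 := 0
        A[:, cols] += A[:, k:k+bk] @ W        # one matrix-matrix product
    return A
```

A single large GEMM per iteration keeps the GPUs busy with one well-shaped product instead of several small ones.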
Implementations on a multi-core CPU and multiple GPUs
Implementation GJEMerge
Move boundaries for the next iteration
A00 A01 A02
A10 A11 A12
A20 A21 A22
A11 is b × b
Implementations on a multi-core CPU and multiple GPUs
Implementation GJEMerge
An important improvement in performance can be obtained by merging the extra copies with the swap stage required by pivoting.
Thus, W1 and W2 will contain blocks A10 and A12 after the pivoting has been applied.
This considerably reduces the number of memory accesses and partially hides the overhead introduced by the copies of A10 and A12.
Outline
1 Motivation
2 Matrix inversion
3 Implementations on a multi-core CPU and multiple GPUs
4 Experimental analysis
5 Concluding remarks
Experimental analysis
Processor             # Cores       Frequency (GHz)   L2 cache (MB)   Memory (GB)
Intel Xeon QuadCore   8 (2x4)       2.27              8               48
NVIDIA TESLA c1060    960 (4x240)   1.30              -               16 (4x4)
BLAS Implementations
MKL 11.1
NVIDIA CUBLAS 3.0
Results for matrices with 1000 ≤ n ≤ 64000
Experimental analysis
(Matrix inversion on 2 GPUs)
[Figure: GFLOPS vs. matrix dimension (x1000, up to 40) for LAPACK, GJEmGPU, GJELA, GJEML, GJECD and GJEMerge]
Experimental analysis
(Matrix inversion on 4 GPUs)
[Figure: GFLOPS vs. matrix dimension (x1000, up to 40) for LAPACK, GJEmGPU, GJELA, GJEML, GJECD and GJEMerge]
Experimental analysis
(Variant GJEMerge on 1, 2, 3 and 4 GPUs)
[Figure: GFLOPS vs. matrix dimension (x1000, up to 70) for GJEMerge on 1, 2, 3 and 4 GPUs]
Outline
1 Motivation
2 Matrix inversion
3 Implementations on a multi-core CPU and multiple GPUs
4 Experimental analysis
5 Concluding remarks
Concluding Remarks
We have presented five implementations of matrix inversion based on the GJE method on multiple GPUs
Those implementations are between 6 and 12 times faster than LAPACK using 4 GPUs, and exhibit excellent scalability (with a nearly linear speed-up)
The use of multiple GPUs:
Reduces the computational time
Increments the amount of memory available, allowing the inversion of larger matrices
Future Work
Evaluation of double-precision arithmetic on the new NVIDIA Fermi architecture
The GJEMerge implementation is clearly limited by the performance of the CUBLAS routine for matrix-matrix products. Other GPU kernels should be evaluated
Questions ?