Communication costs of LU decomposition algorithms for banded matrices
Razvan Carbunescu
12/02/2011
Outline (1/2)
• Sequential general LU factorization (GETRF) and Lower Bounds
  • Definitions and Lower Bounds
  • LAPACK algorithm
  • Communication cost
  • Summary
• Sequential banded LU factorization (GBTRF) and Lower Bounds
  • Definitions and Lower Bounds
  • Banded format
  • LAPACK algorithm
  • Communication cost
  • Summary
• Sequential LU Summary
Outline (2/2)
• Parallel LU definitions and Lower Bounds
• Parallel Cholesky algorithms (Saad, Schultz '85)
• SPIKE Cholesky algorithm (Sameh '85)
• Parallel banded LU factorization (PGBTRF)
  • ScaLAPACK algorithm
  • Communication cost
  • Summary
• Parallel banded LU and Cholesky Summary
• Future Work
• General Summary
GETRF – Definitions and Lower Bounds
• Variables:
n - size of the matrix
r - block size (panel width)
i - current panel number
M - size of fast memory
• GETRF fits the pattern of 3 nested loops, so the usual lower bounds apply: $\Omega(n^3/\sqrt{M})$ words moved and $\Omega(n^3/M^{3/2})$ messages
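GETRF - Communication assumptions
• BLAS2 LU on an (m x n) matrix takes
• TRSM on (n x m) with LL (n x n) takes
• GEMM of the form (m x n) - (m x k) * (k x n) takes
[Figures: panel factorization P = L * U on an m x n panel; triangular solve U = LL^-1 * A with an n x n block LL; update A = A - L * U with blocks of size m x k and k x n]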
GETRF – LAPACK algorithm
• For each panel block:
  1) Factorize panel (n x r)
  2) Permute matrix
  3) Compute U update (TRSM) of size r x (n-ir) with LL of size r x r
  4) Compute GEMM update of size:
     (n-ir) x (n-ir) - ((n-ir) x r) * (r x (n-ir))
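The four steps above can be sketched in plain Python. This is only a minimal illustration, assuming a dense square matrix stored as a list of rows; the names `blocked_lu` and `multiply_factors` are hypothetical, and the loops mirror the panel, permutation, TRSM and GEMM structure rather than reproducing the LAPACK GETRF code.

```python
def blocked_lu(A, r):
    """Right-looking blocked LU with partial pivoting (illustrative sketch).

    A is a list of n rows of n floats and r is the panel width.
    Returns (LU, perm): LU holds the unit-lower L below the diagonal and U on
    and above it; perm[i] is the original row that ended up in row i.
    """
    n = len(A)
    LU = [row[:] for row in A]
    perm = list(range(n))
    for k0 in range(0, n, r):
        k1 = min(k0 + r, n)
        # 1) Factorize the panel (rows k0..n-1, columns k0..k1-1), unblocked.
        for k in range(k0, k1):
            p = max(range(k, n), key=lambda i: abs(LU[i][k]))
            if LU[p][k] == 0.0:
                raise ValueError("matrix is singular")
            if p != k:
                # 2) Swapping whole rows also permutes the rest of the matrix.
                LU[k], LU[p] = LU[p], LU[k]
                perm[k], perm[p] = perm[p], perm[k]
            for i in range(k + 1, n):
                LU[i][k] /= LU[k][k]
                for j in range(k + 1, k1):
                    LU[i][j] -= LU[i][k] * LU[k][j]
        # 3) TRSM: solve with the unit-lower r x r block to get the U rows.
        for j in range(k1, n):
            for i in range(k0, k1):
                for t in range(k0, i):
                    LU[i][j] -= LU[i][t] * LU[t][j]
        # 4) GEMM: trailing matrix minus (L panel) * (U rows).
        for i in range(k1, n):
            for j in range(k1, n):
                s = 0.0
                for t in range(k0, k1):
                    s += LU[i][t] * LU[t][j]
                LU[i][j] -= s
    return LU, perm


def multiply_factors(LU):
    """Form the product L * U from the packed factors, for checking."""
    n = len(LU)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for t in range(min(i, j) + 1):
                l_it = 1.0 if t == i else LU[i][t]
                s += l_it * LU[t][j]
            out[i][j] = s
    return out
```

Applying `perm` to the rows of the input and comparing with `multiply_factors(LU)` checks that P * A = L * U holds up to rounding.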
GETRF – LAPACK algorithm (1/4)
• Factorize panel P
Words:
Total words:
[Figure: panel P of size (n-(i-1)r) x r factored into L and U]
GETRF – LAPACK algorithm (2/4)
• Permute matrix with pivot information from panel
Words:
Total words :
GETRF – LAPACK algorithm (3/4)
• Compute U update (TRSM) of size r x (n-ir) with LL of size r x r
Words:
Total words:
[Figure: U = LL^-1 * A, with LL of size r x r and A, U of size r x (n-ir)]
GETRF – LAPACK algorithm (4/4)
• Compute GEMM update of size (n-ir) x (n-ir) - ((n-ir) x r) * (r x (n-ir))
Words:
Total words:
[Figure: A = A - L * U, with A of size (n-ir) x (n-ir), L of size (n-ir) x r and U of size r x (n-ir)]
GETRF – Communication cost
• Communication cost
• Simplified in big-O notation, we get:
GETRF - General LU Summary
• General LU lower bounds are:
• LAPACK LU algorithm gives:
GBTRF - Banded LU factorization
• Variables:
n - size of the matrix
b - matrix bandwidth
r - block size (panel width)
M - size of fast memory
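• Banded LU also fits the 3-nested-loops pattern, so with $\Theta(nb^2)$ operations the lower bounds are: $\Omega(nb^2/\sqrt{M})$ words moved and $\Omega(nb^2/M^{3/2})$ messages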
Banded Format
• GBTRF uses a special "banded format"
• Packed data format that stores mostly data and very few zeros
• Columns map to columns; diagonals map to rows
• A square block of the original A is easy to address by using lda - 1 as the leading dimension
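The mapping can be illustrated with a short Python sketch. It assumes the usual LAPACK band storage convention for GBTRF, in which entry A(i,j) sits in row kl + ku + i - j of column j (0-based) and each column has ldab = 2*kl + ku + 1 rows; `kl` and `ku` denote the numbers of sub- and superdiagonals, and the function names `pack_band` and `band_get` are hypothetical.

```python
def pack_band(A, kl, ku):
    """Pack dense A (list of rows) into a flat column-major band array.

    Assumed layout (0-based): A[i][j] is stored in row kl + ku + i - j of
    column j, and each column has ldab = 2*kl + ku + 1 rows. The top kl rows
    are spare space for fill-in created by pivoting.
    """
    n = len(A)
    ldab = 2 * kl + ku + 1
    ab = [0.0] * (ldab * n)
    for j in range(n):
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            ab[(kl + ku + i - j) + j * ldab] = A[i][j]
    return ab, ldab


def band_get(ab, ldab, kl, ku, i, j):
    """Read A[i][j] from the band array using a column stride of ldab - 1."""
    if not (j - ku <= i <= j + kl):
        raise IndexError("entry lies outside the band")
    # Same address as (kl + ku + i - j) + j * ldab, rewritten so the band
    # looks like a dense column-major block with leading dimension ldab - 1.
    return ab[(kl + ku) + i + j * (ldab - 1)]
```

Because moving one column to the right along a row of A shifts the storage address by ldab - 1, a dense kernel given that leading dimension can work directly on a block of the band.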
Banded Format
[Figure: conceptual band matrix next to the actual banded storage array]
• Because of this format, the updates of U and of the Schur complement are split into multiple stages for the parts of the band matrix near the edges of the storage array
GBTRF Algorithm
• For each panel block:
  1) Factorize panel of size b x r
  2) Permute rest of matrix affected by panel
  3) Compute U update (TRSM) of size (b-2r) x r with LL of size (r x r)
  4) Compute U update (TRSM) of size r x r with LL of size (r x r)
  5) Compute 4 GEMM updates of sizes:
     (b-2r) x (b-2r) - ((b-2r) x r) * (r x (b-2r))
     (b-2r) x r - ((b-2r) x r) * (r x r)
     r x (b-2r) - (r x r) * (r x (b-2r))
     r x r - (r x r) * (r x r)
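As a quick consistency check on the sizes listed above, the sketch below tabulates the update shapes for one panel and adds up the GEMM work. It assumes 0 < 2r < b and counts 2*m*k*n flops for an (m x k) * (k x n) product; the function names are hypothetical. The four GEMM outputs tile a single (b-r) x (b-r) block, since (b-2r)^2 + 2r(b-2r) + r^2 = (b-r)^2.

```python
def gbtrf_panel_updates(b, r):
    """Return the TRSM and GEMM shapes for one panel, as listed in the steps."""
    if r <= 0 or 2 * r >= b:
        raise ValueError("this sketch assumes 0 < 2r < b")
    m = b - 2 * r
    trsm = [(m, r), (r, r)]  # sizes of the two U updates, as written above
    # Each GEMM is (rows, inner, cols): output rows x cols, inner dimension r.
    gemm = [(m, r, m), (m, r, r), (r, r, m), (r, r, r)]
    return trsm, gemm


def gemm_output_area(gemm):
    """Total number of matrix entries updated by the GEMM calls."""
    return sum(rows * cols for rows, _, cols in gemm)


def gemm_flops(gemm):
    """Flops of the GEMM calls, counting one multiply and one add per term."""
    return sum(2 * rows * inner * cols for rows, inner, cols in gemm)
```

For example, with b = 10 and r = 2 the four updates cover 64 = (10-2)^2 entries in total.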
GBTRF – LAPACK algorithm (1/8)
• Factorize panel P
Words:
Total words:
[Figure: panel of size b x r and its L and U factors]
GBTRF – LAPACK algorithm (2/8)
• Apply permutations
Words:
Total words :
GBTRF – LAPACK algorithm (3/8)
• Compute U update (TRSM) of size (b-2r) x r with LL of size (r x r)
Words:
Total words:
[Figure: TRSM applying the inverse of the r x r block LL to a block with dimensions b-2r and r]
GBTRF – LAPACK algorithm (4/8)
• Compute U update (TRSM) of size r x r with LL of size (r x r)
Words:
Total words:
[Figure: TRSM applying the inverse of the r x r block LL to an r x r block]
GBTRF – LAPACK algorithm (5/8)
• Compute GEMM update of size (b-2r) x (b-2r) - ((b-2r) x r) * (r x (b-2r))
Words:
Total words:
[Figure: (b-2r) x (b-2r) block updated by the product of a (b-2r) x r block and an r x (b-2r) block]
GBTRF – LAPACK algorithm (6/8)
• Compute GEMM update of size (b-2r) x r - ((b-2r) x r) * (r x r)
Words:
Total words:
[Figure: blocks with dimensions b-2r and r]
GBTRF – LAPACK algorithm (7/8)
• Compute GEMM update of size r x (b-2r) - (r x r) * (r x (b-2r))
Words:
Total words:
[Figure: blocks with dimensions b-2r and r]
GBTRF – LAPACK algorithm (8/8)
• Compute GEMM update of size r x r - (r x r) * (r x r)
Words:
Total words:
[Figure: all blocks of size r x r]
GBTRF communication cost
• A full cost would be:
• If we choose r < b/3, the leading terms simplify to:
• Since r < b, the other option is b/3 < r < b, in which case we get:
GBTRF - Banded LU Summary
• Banded LU lower bounds are:
• LAPACK banded LU algorithm gives:
Sequential Summary
Parallel banded LU - Definitions
• Variables:
n - size of the matrix
p - number of processors
b - matrix bandwidth
M - size of fast memory
Parallel banded LU – Lower Bounds
• Assuming the banded matrix is distributed in a 1D layout across n
• Lower Bounds
[Figure: neighboring processors P(i-1) and P(i) in the 1D layout]
Parallel banded algorithms – (Saad '85)
• (Saad, Schultz '85) present a computation and communication analysis of banded Cholesky (LL^T) solvers on a 1D ring, a 2D torus and an n-D hypercube, as well as a pipelined approach
• While this is a different computation from LU, Cholesky can be viewed as a minimum cost for LU, since it requires neither pivoting nor the computation of U, yet it is still a special case of Gaussian Elimination
• Since most parallel banded algorithms also increase the amount of computation, the algorithms are also compared by the multiplicative factor each one applies to the leading term of the computation cost
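To make the comparison with LU concrete, here is a minimal sequential sketch of a banded Cholesky factorization. It assumes a symmetric positive definite matrix with half-bandwidth b stored as a dense list of rows for readability; the name `banded_cholesky` is hypothetical, and this is not one of the parallel algorithms from (Saad, Schultz '85). There is no pivoting and no separate U factor, and L keeps the bandwidth of A.

```python
import math


def banded_cholesky(A, b):
    """Return lower-triangular L with A = L * L^T, touching only the band.

    A is symmetric positive definite with A[i][j] == 0 whenever |i - j| > b.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: only the last b entries of row j can be nonzero.
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(max(0, j - b), j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(s)
        # Entries below the diagonal stay inside the band: i <= j + b.
        for i in range(j + 1, min(n, j + b + 1)):
            t = A[i][j] - sum(L[i][k] * L[j][k] for k in range(max(0, i - b), j))
            L[i][j] = t / L[j][j]
    return L
```

Each column costs O(b^2) operations, so the whole factorization costs O(n b^2), the same order as banded LU.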
Parallel banded algorithms – RIGBE
Parallel banded algorithms – BIGBE
Parallel banded algorithms – HBGE
• Same algorithm as BIGBE, but the 2D grid is embedded in the hypercube to reduce communication costs
Parallel banded algorithms – WFGE
• Uses the 2D cyclic layout and then performs operations diagonally
Parallel banded algorithms – (Saad '85)
• Parallel band LU lower bounds:
• Banded Cholesky algorithms:
Parallel banded algorithms – SPIKE (1/3)
• Another parallel banded implementation is the SPIKE algorithm (Lawrie, Sameh '84), a Cholesky solver, which is just a special case of Gaussian Elimination
• This factorization and solve algorithm is extended to a pivoting LU implementation in (Sameh '05)
Parallel banded algorithms – SPIKE (2/3)
Parallel banded algorithms – SPIKE (3/3)
• Parallel band LU lower bounds:
• SPIKE Cholesky algorithm:
PGBTRF – Data Layout
• Adopts the same banded layout as the sequential algorithm, with slightly more bandwidth storage (4b instead of 3b) and a 1D block distribution
[Figure: band storage of height 4b (2b + 2b) over n columns, split into 1D blocks owned by P1, P2, P3, P4]
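A 1D block distribution of the n columns can be sketched as follows. This is an assumption-laden illustration: it gives every processor a contiguous block of ceil(n/p) columns, with the last blocks possibly shorter or empty, and the name `block_columns` is hypothetical. The exact block-size rule used by ScaLAPACK may differ.

```python
def block_columns(n, p):
    """Return half-open column ranges (start, end), one per processor.

    Assumes a uniform block size of ceil(n / p) columns, so the trailing
    ranges may be shorter than the others, or empty when p is large.
    """
    if n < 0 or p <= 0:
        raise ValueError("need n >= 0 and p > 0")
    nb = (n + p - 1) // p  # ceil(n / p) without floating point
    return [(min(n, q * nb), min(n, (q + 1) * nb)) for q in range(p)]
```

With n = 10 and p = 4, the block size is 3, so the last processor owns a single column.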
PGBTRF – Algorithm
• Description from ScaLAPACK code:
  1) Compute fully independent band LU factorizations of the submatrices located in local memory.
  2) Pass the upper triangular matrix from the end of the local storage on to the next processor.
  3) From the local factorization and the upper triangular matrix, form a reduced block bidiagonal system and store the extra data in Af (extra storage).
  4) Solve the reduced block bidiagonal system to compute the extra factors and store them in Af.
PGBTRF – Communication cost
• Parallel band LU lower bounds:
• ScaLAPACK band LU algorithm:
Parallel Summary
• Lower Bounds
• (Saad '85)
• SPIKE
• ScaLAPACK
Future Work
• Checking the lower bounds and implementation details of applying CALU to the panel in the LAPACK algorithm
• Investigate parallel band LU lower bounds for an exact cost
• Heterogeneous analysis of implemented MAGMA sgbtrf and lower bounds for a heterogeneous model
• Looking at Nested Dissection as another Divide and Conquer method for parallel banded LU
• Analysis of the cost of applying a parallel banded algorithm in the sequential model, to see if we can reduce communication by increasing computation
General Summary
Questions?