Communication costs of LU decomposition algorithms for banded matrices Razvan Carbunescu...

Communication costs of LU decomposition algorithms for banded matrices, Razvan Carbunescu, 12/02/2011

Transcript of Communication costs of LU decomposition algorithms for banded matrices Razvan Carbunescu...

  • Slide 1

Communication costs of LU decomposition algorithms for banded matrices
Razvan Carbunescu
12/02/2011

  • Slide 2

Outline (1/2)
Sequential general LU factorization (GETRF) and Lower Bounds
    Definitions and Lower Bounds
    LAPACK algorithm
    Communication cost
    Summary
Sequential banded LU factorization (GBTRF) and Lower Bounds
    Definitions and Lower Bounds
    Banded format
    LAPACK algorithm
    Communication cost
    Summary
Sequential LU Summary

  • Slide 3

Outline (2/2)
Parallel LU definitions and Lower bounds
Parallel Cholesky algorithms (Saad, Schultz 85)
SPIKE Cholesky algorithm (Sameh85)
Parallel banded LU factorization (PGBTRF)
    ScaLAPACK algorithm
    Communication cost
    Summary
Parallel banded LU and Cholesky Summary
Future Work
General Summary

  • Slide 4

GETRF Definitions and Lower Bounds
Variables:
    n - size of the matrix
    r - block size (panel width)
    i - current panel number
    M - size of fast memory
Fits into the pattern of 3-nested loops and has the usual lower bounds:

  • Slide 5

GETRF - Communication assumptions
BLAS2 LU on an (m x n) matrix takes:
TRSM on (n x m) with LL (n x n) takes:
GEMM in (m x n) - (m x k) (k x n) takes:
[Figure: block diagrams for P = L U, U = LL^-1 A and A - L U, with dimensions m, n and k]

  • Slide 6

GETRF LAPACK algorithm
For each panel block:
1) Factorize panel (n x r)
2) Permute matrix
3) Compute U update (TRSM) of size r x (n-ir) with LL of size r x r
4) Compute GEMM update of size (n-ir) x (n-ir) - ((n-ir) x r) * (r x (n-ir))

  • Slide 7

GETRF LAPACK algorithm (1/4)
Factorize panel P
Words:
Total words:
[Figure: panel P of size (n-(i-1)r) x r factored into L and U]

  • Slide 8

GETRF LAPACK algorithm (2/4)
Permute matrix with pivot information from panel
Words:
Total words:

  • Slide 9

GETRF LAPACK algorithm (3/4)
Compute U update (TRSM) of size r x (n-ir) with LL of size r x r
Words:
Total words:
[Figure: U = LL^-1 A with blocks of size r x r and r x (n-ir)]

  • Slide 10

GETRF LAPACK algorithm (4/4)
Compute GEMM update of size (n-ir) x (n-ir) - ((n-ir) x r) * (r x (n-ir))
Words:
Total words:
[Figure: A = A - L U with blocks of dimensions n-ir and r]

  • Slide 11

GETRF Communication cost
Communication cost:
Simplified in big-O notation we get:

  • Slide 12

GETRF - General LU Summary
General LU lower bounds are:
LAPACK LU algorithm gives:

  • Slide 13

GBTRF - Banded LU factorization
Variables:
    n - size of the matrix
    b - matrix bandwidth
    r - block size (panel width)
    M - size of fast memory
Also fits into the 3-nested-loops lower bounds:

  • Slide 14

Banded Format
GBTRF uses a special banded format
Packed data format that stores mostly data and very few zeros
Columns map to columns; diagonals map to rows
Easy to retrieve a square block from the original A by using lda - 1

  • Slide 15

Banded Format
[Figure: conceptual layout versus actual storage layout of the band matrix]
Because of the format, the updates of U and of the Schur complement are split into multiple stages for the parts of the band matrix near the edges of the storage array.

  • Slide 16

GBTRF Algorithm
For each panel block:
1) Factorize panel of size b x r
2) Permute rest of matrix affected by panel
3) Compute U update (TRSM) of size (b-2r) x r with LL of size (r x r)
4) Compute U update (TRSM) of size r x r with LL of size (r x r)
5) Compute 4 GEMM updates of sizes:
    (b-2r) x (b-2r) + ((b-2r) x r) * (r x (b-2r))
    (b-2r) x r + ((b-2r) x r) * (r x r)
    r x (b-2r) + (r x r) * (r x (b-2r))
    r x r + (r x r) * (r x r)

  • Slide 17

GBTRF LAPACK algorithm (1/8)
Factorize panel P
Words:
Total words:
[Figure: panel of size b x r within the band]

  • Slide 18

GBTRF LAPACK algorithm (2/8)
Apply permutations
Words:
Total words:

  • Slide 19

GBTRF LAPACK algorithm (3/8)
Compute U update (TRSM) of size (b-2r) x r with LL of size (r x r)
Words:
Total words:
[Figure: TRSM blocks with dimensions b-2r and r]

  • Slide 20

GBTRF LAPACK algorithm (4/8)
Compute U update (TRSM) of size r x r with LL of size (r x r)
Words:
Total words:
[Figure: TRSM blocks of size r x r]

  • Slide 21

GBTRF LAPACK algorithm (5/8)
Compute GEMM update of size (b-2r) x (b-2r) + ((b-2r) x r) * (r x (b-2r))
Words:
Total words:
[Figure: GEMM blocks with dimensions b-2r and r]

  • Slide 22

GBTRF LAPACK algorithm (6/8)
Compute GEMM update of size
Words:
Total words:
[Figure: GEMM blocks with dimensions b-2r and r]

  • Slide 23

GBTRF LAPACK algorithm (7/8)
Compute GEMM update of size
Words:
Total words:
[Figure: GEMM blocks with dimensions b-2r and r]

  • Slide 24

GBTRF LAPACK algorithm (8/8)
Compute GEMM update of size
Words:
Total words:
[Figure: GEMM blocks of size r x r]

  • Slide 25

GBTRF communication cost
A full cost would be:
If we choose r < b/3, this simplifies the leading terms to:
Since r < b, the other option is b/3 < r < b, which gives:
In this case we get:

  • Slide 26

GBTRF - Banded LU Summary
Banded LU lower bounds are:
LAPACK banded LU algorithm gives:

  • Slide 27

Sequential Summary

  • Slide 28

Parallel banded LU - Definitions
Variables:
    n - size of the matrix
    p - number of processors
    b - matrix bandwidth
    M - size of fast memory

  • Slide 29

Parallel banded LU Lower Bounds
Assuming the banded matrix is distributed in a 1D layout across n
Lower Bounds:
[Figure: neighbouring processors P(i-1) and P(i)]

  • Slide 30

Parallel banded algorithms (Saad 85)
(Saad, Schultz 85) presents a computation and communication analysis for banded Cholesky (LL^T) solvers on a 1D ring, a 2D torus and an n-D hypercube, as well as a pipelined approach.
While this is a different computation from LU, Cholesky can be viewed as a minimum cost for LU, since it requires neither pivoting nor the computation of U, yet it is also used for Gaussian elimination.
Since most parallel banded algorithms also increase the amount of computation done, the computation is compared between the algorithms as well, in terms of multiplicative factors on the leading term.
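The remark that Cholesky acts as a floor on the cost of LU can be checked on a small sequential example. The sketch below is not one of the (Saad, Schultz 85) parallel algorithms: it is a plain unpivoted band LU and a band Cholesky on dense Python lists, run on a hypothetical symmetric, diagonally dominant test matrix with half-bandwidth b. The names `make_spd_banded`, `banded_lu` and `banded_cholesky`, and the simple flop counter, are illustrative assumptions.

```python
import math


def make_spd_banded(n, b):
    """Hypothetical test matrix: symmetric, strictly diagonally dominant,
    zero outside half-bandwidth b (hence positive definite)."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(max(0, i - b), min(n, i + b + 1)):
            a[i][j] = 2.0 * b + 1.0 if i == j else -1.0 / (1 + abs(i - j))
    return a


def banded_lu(a, b):
    """Unpivoted LU that only touches entries inside the band.
    Returns (matrix holding L below the diagonal and U on/above it, flops)."""
    n = len(a)
    a = [row[:] for row in a]
    flops = 0
    for k in range(n):
        hi = min(n, k + b + 1)          # last row/column reached by the band
        for i in range(k + 1, hi):
            a[i][k] /= a[k][k]          # multiplier, stored in place of L
            flops += 1
            for j in range(k + 1, hi):
                a[i][j] -= a[i][k] * a[k][j]
                flops += 2
    return a, flops


def banded_cholesky(a, b):
    """Band Cholesky A = L L^T; returns (L, flops). No pivoting, no U."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    flops = 0
    for j in range(n):
        s = a[j][j]
        for k in range(max(0, j - b), j):
            s -= l[j][k] * l[j][k]
            flops += 2
        l[j][j] = math.sqrt(s)
        flops += 1
        for i in range(j + 1, min(n, j + b + 1)):
            s = a[i][j]
            for k in range(max(0, i - b), j):   # overlap of rows i and j of L
                s -= l[i][k] * l[j][k]
                flops += 2
            l[i][j] = s / l[j][j]
            flops += 1
    return l, flops


def matmul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_abs_diff(x, y):
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))


a = make_spd_banded(40, 8)
_, lu_flops = banded_lu(a, 8)
_, chol_flops = banded_cholesky(a, 8)
print("band LU flops:", lu_flops, "band Cholesky flops:", chol_flops)
```

Under these assumptions the LU count grows like 2nb² and the Cholesky count like nb², so the ratio approaches two as b grows. Pivoting, which this sketch omits, only adds to the LU side.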

  • Slide 31

Parallel banded algorithms
RIGBE

  • Slide 32

Parallel banded algorithms
BIGBE

  • Slide 33

Parallel banded algorithms
HBGE
Same algorithm as BIGBE, but the 2D grid is embedded in the hypercube to allow for faster communication

  • Slide 34

Parallel banded algorithms
WFGE
Uses the 2D cyclic layout and then performs operations diagonally

  • Slide 35

Parallel banded algorithms (Saad 85)
Parallel band LU lower bounds:
Banded Cholesky algorithms:

  • Slide 36

Parallel banded algorithms SPIKE (1/3)
Another parallel banded implementation is presented in the SPIKE algorithm (Lawrie, Sameh 84), a Cholesky solver, which is just a special case of Gaussian elimination.
This factorization and solver algorithm is extended to a pivoting LU implementation in (Sameh 05).

  • Slide 37

Parallel banded algorithms SPIKE (2/3)

  • Slide 38

Parallel banded algorithms SPIKE (3/3)
Parallel band LU lower bounds:
SPIKE Cholesky algorithm:

  • Slide 39

PGBTRF Data Layout
Adopts the same banded layout as the sequential algorithm, with slightly higher bandwidth storage (4b instead of 3b) and a 1D block distribution
[Figure: matrix of size n split across processors P1, P2, P3 and P4, with a dimension labelled 2b]

  • Slide 40

PGBTRF Algorithm
Description from ScaLAPACK code:
1) Compute fully independent band LU factorizations of the submatrices located in local memory.
2) Pass the upper triangular matrix from the end of the local storage on to the next processor.
3) From the local factorization and the upper triangular matrix, form a reduced blocked bidiagonal system and store the extra data in Af (extra storage).
4) Solve the reduced blocked bidiagonal system to compute the extra factors and store them in Af.

  • Slide 41

PGBTRF Communication cost
Parallel band LU lower bounds:
ScaLAPACK band LU algorithm:

  • Slide 42

Parallel Summary
Lower Bounds
(Saad85)
SPIKE
ScaLAPACK

  • Slide 43

Future Work
Checking the lower bounds and implementation details of applying CALU to the panel in the LAPACK algorithm
Investigate parallel band LU lower bounds for an exact cost
Heterogeneous analysis of the implemented MAGMA sgbtrf and lower bounds for a heterogeneous model
Looking at Nested Dissection as another Divide and Conquer method for parallel banded LU
Analysis of the cost of applying a parallel banded algorithm to the sequential model, to see if we can reduce the communication by increasing computation

  • Slide 44

General Summary

  • Slide 45

Questions?
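As a closing illustration of the banded format from Slide 14, the following is a minimal sketch assuming the standard LAPACK band storage used by the GBTRF routines, written zero-based: entry A(i, j) is stored at row kl + ku + i - j of column j, in an array with 2*kl + ku + 1 rows. The helper names `to_band` and `flat_index` are illustrative and are not LAPACK routines. With kl = ku = b this gives the roughly 3b rows of storage mentioned for the sequential layout.

```python
def to_band(a, kl, ku):
    """Pack a dense matrix (list of rows) into GBTRF-style band storage.

    The result has 2*kl + ku + 1 rows; the top kl rows are left as
    workspace for fill-in caused by pivoting.
    Zero-based mapping: ab[kl + ku + i - j][j] = a[i][j].
    """
    n = len(a)
    ldab = 2 * kl + ku + 1
    ab = [[0.0] * n for _ in range(ldab)]
    for j in range(n):
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            ab[kl + ku + i - j][j] = a[i][j]
    return ab


def flat_index(i, j, kl, ku):
    """Column-major offset of A(i, j) inside the band array."""
    ldab = 2 * kl + ku + 1
    return (kl + ku + i - j) + j * ldab


# Small example: n = 6, two subdiagonals, one superdiagonal.
n, kl, ku = 6, 2, 1
a = [[float(10 * i + j + 1) if -ku <= i - j <= kl else 0.0 for j in range(n)]
     for i in range(n)]
ab = to_band(a, kl, ku)
ldab = 2 * kl + ku + 1

# Columns map to columns, diagonals map to rows: the main diagonal of A
# sits in row kl + ku of the band array.
print([ab[kl + ku][j] for j in range(n)])

# Moving down a column of A advances the offset by 1; moving right along a
# row of A advances it by ldab - 1. A square block of A can therefore be
# addressed as an ordinary column-major matrix with leading dimension ldab - 1.
print(flat_index(3, 2, kl, ku) - flat_index(2, 2, kl, ku))
print(flat_index(3, 3, kl, ku) - flat_index(3, 2, kl, ku), ldab - 1)
```

The stride of ldab - 1 along a row is what lets the blocked algorithm hand square sub-blocks of the band directly to TRSM and GEMM without copying them out first.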