ICT-SSAI: Scalable Tree-based Drop-Threshold Cholesky Preconditioning
Keita Teranishi and Padma Raghavan
Department of Computer Science and Engineering, The Pennsylvania State University
Barry F. Smith
Mathematics and Computer Science, Argonne National Laboratory
iWNMSC'06, The University of Tokyo
10/25/2006
Supported by the National Science Foundation (ACI-0102537 and DMR-0205232) and in part through DOE TOPS-1
Outline
• Background
• Tree-based hybrid solvers (SPD matrices)
• Incomplete Cholesky using drop-thresholds (ICT) with Selective Sparse Approximate Inversion (ICT-SSAI)
• Analysis for model problems
• Convergence, speedups and scalability for finite-element problems
• Conclusions
Background
• Incomplete Cholesky preconditioners for CG
  • General purpose; tuning the fill moves the solver between pure direct and pure iterative
• Parallel ICT preconditioner
  • Construction is like a sparse parallel direct solver: managing fill, complex data structures, etc.
  • Application is inefficient, dominated by the large latency of inter-processor communication relative to computation
• Sparse approximate inverse (SAI)
  • Alternative to ICT; allows efficient application, but preconditioner quality and tunability lag ICT
  • Frobenius norm minimization (Grote and Huckle 97, and Chow 00)
• Our method: ICT + Selective SAI (ICT-SSAI)
Tree-Based Solvers
• Partitioned domain = tree of separators
• Tree = data dependency of Cholesky, ICT and triangular solution
• A tree supernode = separator, columns with similar nonzero pattern
• Block algorithm can be applied for efficiency
• Uses techniques from sparse direct solvers (e.g. fill-reduction and sparse left-looking update)
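\[
A = \begin{bmatrix} A_{11} & & \\ & A_{22} & \\ A_{31} & A_{32} & A_{33} \end{bmatrix}
\;\xrightarrow{\text{Factor}}\;
\hat{L} = \begin{bmatrix} \hat{L}_{11} & & \\ & \hat{L}_{22} & \\ \hat{L}_{31} & \hat{L}_{32} & \hat{L}_{33} \end{bmatrix}
\]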
Tree-Structured Parallel ICT (Raghavan, Teranishi, and Ng: NLA2003)
• Tree provides task parallelism
• Data parallelism at each distributed supernode
• Uses techniques from sparse direct solvers (e.g. fill-reduction and sparse left-looking update)
Triangular Solution at a Supernode
At each supernode:
Two main operations:
Bottleneck!
Parallel ICT + Selective Inversion (ICT-SI) (Raghavan, Teranishi, and Ng: NLA2003)
• Preconditioner application: latency-tolerant parallel matrix-vector multiplication instead of parallel substitution
• Preconditioner construction: parts of the IC factor have to be inverted (approximately)
• ICT-SI is faster and more scalable than ICT with substitution (ICT-TS)
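\[
\begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} L_{11} \\ A_{21} \end{bmatrix}
\xrightarrow{\text{inversion}}
\begin{bmatrix} L_{11}^{-1} \\ A_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} \hat{L}_{11}^{-1} \\ \hat{L}_{21} \end{bmatrix}
\]
[Figure labels: sparse, dense]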
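The contrast between substitution and multiplication by an explicit inverse can be sketched in a few lines. This is a serial NumPy illustration only, not the authors' parallel code; the matrix, its size, and the function name `forward_substitution` are assumptions made for the example.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L x = b for lower-triangular L.

    Each x[i] needs every earlier x[:i]; in a distributed setting this
    dependency chain is what makes substitution latency-bound.
    """
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x

# Hypothetical small SPD block standing in for a supernode's diagonal block.
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
A = B @ B.T + 6.0 * np.eye(6)
L = np.linalg.cholesky(A)
b = rng.standard_normal(6)

# ICT-TS style application: triangular substitution.
x_subst = forward_substitution(L, b)

# ICT-SI style application: invert once at construction time,
# then every application is a plain matrix-vector product.
L_inv = np.linalg.inv(L)
x_matvec = L_inv @ b
```

Both paths return the same vector; the difference is that the matrix-vector product has no row-to-row dependency, so it maps onto a latency-tolerant parallel multiply.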
ICT + Selective SAI (ICT-SSAI)
• The ICT submatrix at a supernode is sparse from drops at earlier columns/supernodes
• Use parallel sparse approximate inversion (SAI) instead of explicit inversion
• Latency-tolerant construction and application
[Figure: steps at a supernode. ICT-SI: factor and inversion, drop and update. ICT-SSAI: SAI and update.]
ICT-SSAI at a Supernode
• A11 is sparse: use ParaSails (Edmond Chow), parallel Frobenius norm minimization
• The diagonal block updates the off-diagonal block through parallel sparse matrix-matrix multiplication
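\[
\begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} L_{11} \\ A_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} \hat{L}_{11}^{-1} \\ A_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} \hat{L}_{11}^{-1} \\ \hat{L}_{21} \end{bmatrix}
\]
[Figure label: FILL]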
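The off-diagonal update by matrix multiplication can be illustrated with the standard block Cholesky identity A21 = L21 L11^T, which gives L21 = A21 L11^{-T}. The sketch below is a dense, serial NumPy illustration under that assumption; `G` is a hypothetical stand-in for the approximate inverse of L11 (here the exact inverse, so the result can be checked), not output from ParaSails.

```python
import numpy as np

# Hypothetical SPD matrix split into a supernode's diagonal block A11
# and the off-diagonal block A21 below it.
rng = np.random.default_rng(1)
B = rng.standard_normal((7, 7))
A = B @ B.T + 7.0 * np.eye(7)
k = 4
A11, A21 = A[:k, :k], A[k:, :k]

# Diagonal block: factor, then form G ~ inv(L11).
# ICT-SSAI would use a sparse approximate inverse here instead.
L11 = np.linalg.cholesky(A11)
G = np.linalg.inv(L11)

# Off-diagonal block: a matrix-matrix product, no triangular solve needed.
L21 = A21 @ G.T

# Reference: the same blocks taken from a factorization of the whole matrix.
L_full = np.linalg.cholesky(A)
```

With an exact `G` the product reproduces the corresponding block of the full factor; with a sparse approximate `G` it would yield the approximate block that the slide denotes with a hat.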
Generalized Tree-based Hybrids
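\[
\begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} \hat{A}_{11} \\ \hat{A}_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} \hat{L}_{11}^{-1} \\ \hat{A}_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} \hat{L}_{11}^{-1} \\ \hat{L}_{21} \end{bmatrix}
\]
Efficient parallel IC can be constructed through:
• Any parallel explicit/implicit approximate matrix inversion
  • SAI
  • IC/ICT
  • SSOR
• Support for off-diagonal matrix computation
• Support for data-parallel left-looking update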
Cost Analysis of ICT on Model Grids
• Communication cost of parallel ICT variants on model finite difference 2D and 3D matrices
• Communication cost per message of m words: t_s + m t_w
• Analysis based on recurrence associated with subgrids, separators

Communication Latency

Grid (on P processors)   Phase          Method         Latency
K x K                    Construction   ICT/ICT-SI     O(K t_s)
K x K                    Construction   ICT-SSAI       O((log_2 P)^2 t_s)
K x K                    Application    Substitution   O((K/P^0.5) t_s)
K x K                    Application    ICT-SI/SSAI    O((log_2 P)^2 t_s)
K x K x K                Construction   ICT/ICT-SI     O(K^2 t_s)
K x K x K                Construction   ICT-SSAI       O((log_2 P)^2 t_s)
K x K x K                Application    Substitution   O((K^2/P^0.67) t_s)
K x K x K                Application    ICT-SI/SSAI    O((log_2 P)^2 t_s)
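To get a feel for how these latency terms compare, the leading-order expressions can be evaluated directly. The sketch below assumes all hidden constants are 1 and reports counts in multiples of t_s, so it shows growth trends only; the function name `latency_terms` and the dictionary keys are illustrative.

```python
import math

def latency_terms(K, P, dim):
    """Leading-order latency counts (in multiples of t_s) for a grid with
    K points per side in `dim` dimensions on P processors.

    Assumption: big-O constants are taken as 1, so only ratios and growth
    rates are meaningful, not absolute values.
    """
    log_term = math.log2(P) ** 2
    if dim == 2:
        construction_ict = K              # ICT / ICT-SI construction
        application_subst = K / P ** 0.5  # triangular substitution
    elif dim == 3:
        construction_ict = K ** 2
        application_subst = K ** 2 / P ** 0.67
    else:
        raise ValueError("dim must be 2 or 3")
    return {
        "construction ICT/ICT-SI": construction_ict,
        "construction ICT-SSAI": log_term,
        "application substitution": application_subst,
        "application ICT-SI/SSAI": log_term,
    }

for name, value in latency_terms(K=64, P=16, dim=2).items():
    print(f"{name}: {value:.1f}")
```

The ICT-SSAI terms depend only on P, which is the sense in which its latency is independent of problem size.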
Empirical Evaluation
• Our ICT codes in C
• IC(0) from BlockSolve95
• SPAI from ParaSails (sparse approximate inverse) included in Hypre
• Tests by linking to PETSc; results for all preconditioners with nonzeroes in the range 1-2 x nonzeroes in A
• Problems:
  • 3D finite element (hex20) grids (parallel scalability test): Poisson ratio=0.47, 100-130 nonzero elements per row
  • Sparse matrices from applications: available at Matrix Market and the University of Florida Sparse Matrix Collection
• AMD Opteron 250 cluster
• 1-64 processors tested
• InfiniBand interconnect
Results: Hex20 Grids
• Challenge problem for testing efficiency and scalability
• Easy to precondition, not representative of problems requiring ICT-like preconditioning; i.e., not a Horror Matrix (Tim Davis)!
• Scaled problem with number of processors (weak scaling)
• Nonzeroes: 0.6 million (1 processor) to 23.6 million (32 processors)
Hex20 Iterations, ICT-??, IC(0), SPAI
[Figure: Number of iterations, ν=0.45. Iterations vs. processors (1 to 32). Curves: ICT-TS(0.01), ICT-SI(0.01), ICT-SSAI(0.01). Annotations: ICT-TS, SI coincide; ICT-SSAI.]
• Traditional ICT and ICT-SI iterations are the same
• ICT-SSAI has higher iterations
[Figure: Number of iterations, ν=0.45. Iterations vs. processors (1 to 32). Curves: IC(0), SPAI(1), ICT-SSAI(0.01).]
• ICT-SSAI iterations are lower than IC(0) and SPAI
Hex20 Construction, ICT-??, IC(0), SPAI
[Figure: Time for preconditioner construction, ν=0.45. Seconds vs. processors (1 to 32). Curves: ICT-TS(0.01), ICT-SI(0.01), ICT-SSAI(0.01). Annotations: ICT-SI time is high; ICT-TS, SSAI times match.]
• Traditional ICT and ICT-SSAI times match
• ICT-SI is slower, does not scale
[Figure: Time for preconditioner construction, ν=0.45. Seconds vs. processors (1 to 32). Curves: IC(0), SPAI(1), ICT-SSAI(0.01).]
• ICT-SSAI time matches IC(0); SPAI is slower, worse scaling
Hex20 Application, ICT-??, IC(0), SPAI
[Figure: Time for PCG iterations, ν=0.45. Seconds vs. processors (1 to 32). Curves: ICT-TS(0.01), ICT-SI(0.01), ICT-SSAI(0.01). Annotations: ICT-TS; ICT-SI, SSAI.]
• ICT-SI and ICT-SSAI match
• ICT-TS is slower, does not scale
[Figure: Time for PCG iterations, ν=0.45. Seconds vs. processors (1 to 32). Curves: IC(0), SPAI(1), ICT-SSAI(0.01).]
• ICT-SSAI time matches SPAI
• IC(0) is slower, poor scaling
FE Horror Matrices
• Matrix 1: augustus7 (Kershaw sq. mesh), Rank: 1,060,864, Nz: 9,313,87
• Matrix 2: ldoor, Rank: 952,203, Nz: 46,522,475
• Matrix 3: af_shell3, Rank: 504,855, Nz: 17,588,875
• CG fails without "strong" preconditioning; requires ICT/direct sparse solver
Iterations
[Figure: Number of iterations: all problems. Iterations vs. number of processors (4 to 64). Matrices: af_shell3, augustus7, ldoor. Preconditioners: IC(0), ICT-SSAI, SPAI.]
• ICT-SSAI iterations are 1/3 SPAI, 1/4 IC(0)
Construction
[Figure: Time for preconditioner construction: all problems. Seconds vs. number of processors (4 to 64). Matrices: af_shell3, augustus7, ldoor. Preconditioners: SPAI, ICT-SSAI, IC(0).]
PC construction time
• ICT-SSAI 1/5 SPAI, 1/3 IC(0) at 4 processors
• ICT-SSAI still faster at 64 processors
• PCG time is 10X construction time
PCG Time
[Figure: Time for PCG iterations: all problems. Seconds vs. number of processors (4 to 64). Matrices: af_shell3, augustus7, ldoor. Preconditioners: SPAI, ICT-SSAI, IC(0).]
• PCG time is 10X construction time
• ICT-SSAI 1/2 SPAI, 1/4 IC(0) at 4 processors
• ICT-SSAI still faster at 64 processors
• Relative to itself, each method has the same speedup
Total Solution
[Figure: Time for the solution: all problems. Seconds vs. number of processors (4 to 64). Matrices: af_shell3, augustus7, ldoor. Preconditioners: SPAI, ICT-SSAI, IC(0). Annotation: SPAI, IC(0) slowdown w.r.t. ICT-SSAI.]
• ICT-SSAI 1/4 SPAI, 1/5 IC(0) at 4 processors
• ICT-SSAI still faster at 64 processors
• Relative to itself, each method has the same speedup
Conclusions
• ICT-SSAI faster by factor of 4-5 compared to SPAI and IC(0)
• ICT-SSAI, IC(0), SPAI have similar fixed and scaled problem speedups
• ICT-SSAI has slower growth in iterations
• ICT-SSAI: hybrid of best of ICT and SPAI
  • Retains ICT-like preconditioning quality
  • Latency of preconditioner construction and application independent of problem size
  • Lower computational cost for preconditioner construction
• ICT-SSAI + TOPS, other app?
  • Nonlinear PDE-based apps with semi-implicit schemes requiring multiple solves
• Paper in draft form
Total Solution: ICT-??
• ICT-TS off the chart from poor application
• ICT-SI suffers from growth of PC construction
• ICT-SSAI is much faster
[Figure: Time for the solution: all problems. Seconds vs. number of processors (4 to 64). Matrices: af_shell3, augustus7, ldoor. Preconditioners: ICT-SI, ICT-SSAI.]
Computational Costs for Preconditioner Construction
Cost

Method     K x K Grids on P processors   K x K x K Grids on P processors
ICT-TS     O(C^2 K^2/P)                  O(C^2 K^3/P)
ICT-SSAI   O(C^2 K^2/P)                  O(C^2 K^3/P)
SAI        O(C^3 K^2/P)                  O(C^3 K^3/P)
ICT-SI     O(K^3/P)                      O(K^6/P)

C: Average number of nonzeroes per column
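As a rough reading of these entries, dividing the ICT-SI construction cost by the ICT-SSAI cost (ignoring the constants hidden by the O-notation, which is an assumption) gives:

\[
\text{2D: } \frac{K^{3}/P}{C^{2}K^{2}/P} = \frac{K}{C^{2}},
\qquad
\text{3D: } \frac{K^{6}/P}{C^{2}K^{3}/P} = \frac{K^{3}}{C^{2}}
\]

So, under that assumption, explicit inversion becomes the more expensive option once K exceeds C^2 in 2D or K^3 exceeds C^2 in 3D, and the gap widens as the grid is refined.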
Incomplete Factorization Preconditioning
• Approximation of matrix factor for A: \( \hat{L}\hat{U} \approx A \), or \( \hat{L}\hat{L}^{T} \approx A \)
• Applied through triangular solutions
• Very effective general-purpose preconditioning
• Heuristics required for parallelization (for triangular solution)
  • Multicoloring
  • Matrix (graph) partitioning
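A drop-threshold incomplete Cholesky factorization can be sketched as follows. This is a dense, serial illustration, not the authors' tree-structured parallel code; the drop rule (discard off-diagonal entries smaller than `droptol` times the 2-norm of the original column) is one common choice and is assumed here, as are the function name `ict` and the test matrix.

```python
import numpy as np

def ict(A, droptol):
    """Left-looking incomplete Cholesky with a relative drop threshold.

    Dense storage is used for clarity. Assumed drop rule: an off-diagonal
    entry of column j is discarded when its magnitude is below
    droptol * ||A[:, j]||_2.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        # Left-looking update: subtract contributions of earlier columns.
        col = A[j:, j] - L[j:, :j] @ L[j, :j]
        if col[0] <= 0.0:
            raise ValueError(f"nonpositive pivot at column {j}")
        col = col / np.sqrt(col[0])  # col[0] becomes the diagonal entry
        small = np.abs(col) < droptol * np.linalg.norm(A[:, j])
        small[0] = False             # never drop the diagonal
        col[small] = 0.0
        L[j:, j] = col
    return L

# Hypothetical test matrix: 5-point Laplacian on a 4 x 4 grid.
m = 4
T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))

L_exact = ict(A, 0.0)  # no dropping: the complete Cholesky factor
L_ict = ict(A, 0.1)    # dropping: a sparser, approximate factor
print(np.count_nonzero(L_exact), np.count_nonzero(L_ict))
```

Setting `droptol` to zero recovers the direct factorization and raising it moves toward a cheaper, weaker preconditioner, which is the tuning knob referred to in the Background slide.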
Sparse Approximate Inverse Preconditioning
• Directly compute a preconditioner M with \( M \approx A^{-1} \), where M is sparse
• Application through parallel matrix-vector multiplication
  • Efficient!
• Construction through
  • A-Conjugation (Benzi, 1993, 1995, 1998…)
    • Ordering required for parallelization
  • Frobenius norm minimization
    • No ordering required for efficient parallelization
Frobenius Norm Minimization
• Grote and Huckle (1997), and Chow (2000)
• Solve the minimization problem, where M is a preconditioner:
\[
\min_{M} \| AM - I \|_{F}
\]
• Treat it as multiple least-squares problems, one per column \( m_i \) of M:
\[
\min_{m_i} \| A m_i - e_i \|_{2}
\]
• Construction can be done in parallel
• If A is symmetric, least-squares can be solved by Cholesky factorization
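The column-by-column least-squares view can be sketched with NumPy. This is a dense, serial illustration with a prescribed sparsity pattern, not ParaSails; the function name `sai_frobenius`, the boolean `pattern` argument, and the test matrix are assumptions made for the example.

```python
import numpy as np

def sai_frobenius(A, pattern):
    """Sparse approximate inverse by Frobenius norm minimization.

    pattern[i, j] is True where M[i, j] is allowed to be nonzero.
    Each column of M is an independent small least-squares problem,
    which is why the construction parallelizes without any ordering.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    M = np.zeros((n, n))
    for j in range(n):
        rows = np.flatnonzero(pattern[:, j])  # allowed nonzeros of m_j
        e_j = np.zeros(n)
        e_j[j] = 1.0
        # min over y of || A[:, rows] y - e_j ||_2
        y, *_ = np.linalg.lstsq(A[:, rows], e_j, rcond=None)
        M[rows, j] = y
    return M

# Hypothetical test matrix: 1D Laplacian.
n = 6
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
I = np.eye(n)

M_diag = sai_frobenius(A, I.astype(bool))  # diagonal pattern
M_tri = sai_frobenius(A, A != 0)           # pattern of A
M_full = sai_frobenius(A, np.ones((n, n), dtype=bool))

res_diag = np.linalg.norm(A @ M_diag - I, "fro")
res_tri = np.linalg.norm(A @ M_tri - I, "fro")
print(res_diag, res_tri)
```

Enlarging the allowed pattern can only lower the residual, and with a full pattern the exact inverse is recovered; the practical trade-off is between that residual and the number of nonzeroes in M.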
Tree-Structured Parallel ICT Factorization (Raghavan, Teranishi, and Ng: NLA2003)
• Tree provides task parallelism
• Overall performance depends on efficient data parallelism at each distributed supernode
Number of Nonzeroes in Preconditioner
Nonzeroes relative to the coefficient matrix

Preconditioner   af_shell3   augustus7   ldoor
IC(0)            1.00        1.00        1.00
SPAI (const)     1.18        3.55        1.63
SPAI (apply)     0.32        1.43        0.51
ICT-SSAI         1.79        3.03        1.29