ICT-SSAI: Scalable Tree-based Drop-Threshold Cholesky...

ICT-SSAI: Scalable Tree-based Drop-Threshold Cholesky Preconditioning
Keita Teranishi and Padma Raghavan, Department of Computer Science and Engineering, The Pennsylvania State University
Barry F. Smith, Mathematics and Computer Science, Argonne National Laboratory
iWNMSC’06, The University of Tokyo, 10/25/2006
Supported by the National Science Foundation (ACI-0102537 and DMR-0205232) and in part through DOE TOPS-1


Page 1: ICT-SSAI: Scalable Tree-based Drop-Threshold Cholesky ...nkl.cc.u-tokyo.ac.jp/seminars/0610-NA/presentations/teranishi-p.pdf · ICT−SSAI(0.01) •ICT-SSAI iterations are lower than



Outline

• Background
• Tree-based hybrid solvers (SPD matrices)
• Incomplete Cholesky using drop-thresholds (ICT) with Selective Sparse Approximate Inversion (ICT-SSAI)
• Analysis for model problems
• Convergence, speedups and scalability for finite-element problems
• Conclusions


Background

• Incomplete Cholesky preconditioners for CG
  • General purpose; tuning fill ranges from pure direct to pure iterative
• Parallel ICT preconditioner
  • Construction like sparse parallel direct solver, managing fill, complex data structures, etc.
  • Application is inefficient, dominated by large latency of inter-processor communication relative to computations
• Sparse approximate inverse (SAI)
  • Alternative to ICT, allows efficient application, but preconditioner quality and tunability lag ICT
  • Frobenius norm minimization (Grote and Huckle 97, and Chow 00)
• Our method: ICT + Selective SAI (ICT-SSAI)


Tree-Based Solvers

• Partitioned domain = tree of separators
• Tree = data dependency of Cholesky, ICT and triangular solution
• A tree supernode = separator, columns with similar nonzero pattern
• Block algorithm can be applied for efficiency
• Uses techniques from sparse direct solvers (e.g. fill-reduction and sparse left-looking update)

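\[
\begin{bmatrix}
\ddots & & & \\
 & A_{11} & & \\
 & & A_{22} & \\
\cdots & A_{31} & A_{32} & A_{33}
\end{bmatrix}
\;\xrightarrow{\text{Factor}}\;
\begin{bmatrix}
\ddots & & & \\
 & \hat{L}_{11} & & \\
 & & \hat{L}_{22} & \\
\ldots & \hat{L}_{31} & \hat{L}_{32} & \hat{L}_{33}
\end{bmatrix}
\]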
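The statement that the tree encodes the data dependency of Cholesky can be illustrated with a small sketch. The function below computes the elimination tree of a symmetric sparsity pattern with the standard path-compression algorithm. The name `elimination_tree` and the five-node example (two subdomains and one separator) are assumptions made for illustration and do not come from the talk.

```python
def elimination_tree(n, edges):
    """Return parent[] of the elimination tree for a symmetric n x n pattern.

    edges: iterable of (i, j) index pairs marking off-diagonal nonzeros.
    parent[v] == -1 marks a root.
    """
    # lower[j] lists the rows i < j that have a nonzero in column j.
    lower = [[] for _ in range(n)]
    for i, j in edges:
        if i != j:
            lower[max(i, j)].append(min(i, j))

    parent = [-1] * n
    ancestor = [-1] * n  # path-compressed shortcut toward the current root
    for j in range(n):
        for i in sorted(lower[j]):
            r = i
            # Climb from i to the root of its current subtree, compressing the path.
            while ancestor[r] != -1 and ancestor[r] != j:
                nxt = ancestor[r]
                ancestor[r] = j
                r = nxt
            if ancestor[r] == -1:
                ancestor[r] = j
                parent[r] = j
    return parent


# Hypothetical example: subdomain {0, 1}, subdomain {2, 3}, separator {4}.
edges = [(0, 1), (2, 3), (1, 4), (3, 4)]
tree = elimination_tree(5, edges)
print(tree)
```

In this example the two subdomains form independent subtrees that meet only at the separator node, which is the source of the task parallelism exploited by tree-based solvers.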


Tree-Structured Parallel ICT (Raghavan, Teranishi, and Ng: NLA2003)

• Tree provides task parallelism
• Data parallelism at each distributed supernode
• Uses techniques from sparse direct solvers (e.g. fill-reduction and sparse left-looking update)


Triangular Solution at a Supernode

At each supernode:

Two main operations:

Bottleneck!


Parallel ICT + Selective Inversion (ICT-SI) (Raghavan, Teranishi, and Ng: NLA2003)

• Preconditioner application: latency-tolerant parallel matrix-vector multiplication instead of parallel substitution
• Preconditioner construction: parts of the IC factor have to be inverted (approximately)
• ICT-SI is faster and more scalable than ICT with triangular substitution (ICT-TS)

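\[
\begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} L_{11} \\ A_{21} \end{bmatrix}
\xrightarrow{\text{inversion}}
\begin{bmatrix} L_{11}^{-1} \\ A_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} \hat{L}_{11}^{-1} \\ \hat{L}_{21} \end{bmatrix}
\]

Inversion: sparse → dense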
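The difference between applying a triangular block by substitution and applying its inverse by multiplication can be sketched in a few lines. This is only an illustration under assumed data: the block `L11` is a small random lower-triangular matrix, and `forward_substitution` is a hypothetical helper, not code from the authors' solver.

```python
import numpy as np


def forward_substitution(L, b):
    """Solve L x = b for lower-triangular L, one unknown at a time."""
    n = L.shape[0]
    x = np.zeros(n)
    for i in range(n):
        # x[i] needs every previously computed x[0..i-1]: a sequential chain.
        x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
    return x


rng = np.random.default_rng(0)
n = 6
# Hypothetical diagonal block of a supernode: random strictly lower part, fixed diagonal.
L11 = np.tril(rng.standard_normal((n, n)), -1) + n * np.eye(n)
b = rng.standard_normal(n)

x_ts = forward_substitution(L11, b)  # substitution, as in ICT-TS
L11_inv = np.linalg.inv(L11)         # computed once, at construction time
x_si = L11_inv @ b                   # matrix-vector product, as in ICT-SI

print(np.allclose(x_ts, x_si))
```

The substitution loop has a dependency between consecutive unknowns, while each row of the matrix-vector product can be evaluated independently, which is what makes the multiplication form tolerant of communication latency.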


ICT + Selective SAI (ICT-SSAI)

• The ICT submatrix at a supernode is sparse from drops at earlier columns/supernodes
• Use parallel sparse approximate inversion (SAI) instead of explicit inversion
• Latency tolerant construction and application

[Figure: ICT-SI and ICT-SSAI compared; step labels "Factor and Inversion", "Drop and update", "SAI and update".]


ICT-SSAI at a Supernode

• A11 is sparse, use ParaSails (Edmond Chow), parallel Frobenius norm minimization
• Diagonal updates off-diagonal through parallel sparse matrix-matrix multiplication

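\[
\begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} L_{11} \\ A_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} \hat{L}_{11}^{-1} \\ A_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} \hat{L}_{11}^{-1} \\ \hat{L}_{21} \end{bmatrix},
\qquad
\| I - \hat{L}^{-1} L \|_F
\]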
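The update of the off-diagonal block by a matrix-matrix product can be checked on a small dense example. In exact arithmetic the off-diagonal block of the Cholesky factor is A21 times the transposed inverse of L11; the sketch below uses the exact inverse as a stand-in for the sparse approximate inverse that ParaSails would produce. The matrix sizes and the random SPD test matrix are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 4, 3
# Hypothetical small SPD matrix partitioned as [[A11, A21^T], [A21, A22]].
B = rng.standard_normal((n1 + n2, n1 + n2))
A = B @ B.T + (n1 + n2) * np.eye(n1 + n2)
A11, A21 = A[:n1, :n1], A[n1:, :n1]

L11 = np.linalg.cholesky(A11)  # factor the diagonal block
M = np.linalg.inv(L11)         # stand-in for a sparse approximate inverse of L11
L21 = A21 @ M.T                # off-diagonal update as a matrix-matrix product

# Reference: the same block taken from a Cholesky factor of the whole matrix.
L_full = np.linalg.cholesky(A)
print(np.allclose(L21, L_full[n1:, :n1]))
```

With an approximate inverse in place of `M`, the product yields an approximate off-diagonal block, and no triangular solve is needed at the supernode.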


Generalized Tree-based Hybrids

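\[
\begin{bmatrix} A_{11} \\ A_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} \hat{A}_{11} \\ \hat{A}_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} \hat{L}_{11}^{-1} \\ \hat{A}_{21} \end{bmatrix}
\rightarrow
\begin{bmatrix} \hat{L}_{11}^{-1} \\ \hat{L}_{21} \end{bmatrix}
\]

• Efficient parallel IC can be constructed through
  • Any parallel explicit/implicit approximate matrix inversion: SAI, IC/ICT, SSOR
  • Support for off-diagonal matrix computation
  • Support for data-parallel left-looking update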


Cost Analysis of ICT on Model Grids

• Communication cost of parallel ICT variants on model finite difference 2D and 3D matrices
• Communication cost per message of m words: t_s + m t_w
• Analysis based on recurrence associated with subgrids, separators


Communication Latency

Grid (on P processors)   Phase          Method         Latency
K x K                    Construction   ICT/ICT-SI     O(K t_s)
K x K                    Construction   ICT-SSAI       O((log2 P)^2 t_s)
K x K                    Application    Substitution   O((K/P^0.5) t_s)
K x K                    Application    ICT-SI/SSAI    O((log2 P)^2 t_s)
K x K x K                Construction   ICT/ICT-SI     O(K^2 t_s)
K x K x K                Construction   ICT-SSAI       O((log2 P)^2 t_s)
K x K x K                Application    Substitution   O((K^2/P^0.67) t_s)
K x K x K                Application    ICT-SI/SSAI    O((log2 P)^2 t_s)
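To get a feel for how the latency terms grow, the expressions can be evaluated for sample sizes. The sketch below drops all hidden constants, so the numbers indicate growth trends only, not predicted times. The function name `latency_terms` and the values of K and P are assumptions for illustration.

```python
import math


def latency_terms(K, P, dim):
    """Leading-order latency counts (in units of t_s), constants dropped."""
    log_term = math.log2(P) ** 2
    if dim == 2:
        return {
            "construction ICT/ICT-SI": K,
            "construction ICT-SSAI": log_term,
            "application substitution": K / P ** 0.5,
            "application ICT-SI/SSAI": log_term,
        }
    if dim == 3:
        return {
            "construction ICT/ICT-SI": K ** 2,
            "construction ICT-SSAI": log_term,
            "application substitution": K ** 2 / P ** 0.67,
            "application ICT-SI/SSAI": log_term,
        }
    raise ValueError("dim must be 2 or 3")


# Hypothetical sizes, chosen only for illustration.
for dim in (2, 3):
    for name, value in latency_terms(K=100, P=64, dim=dim).items():
        print(f"{dim}D {name}: {value:.1f} t_s")
```

The logarithmic terms do not depend on K, so for a fixed processor count they stay constant as the grid is refined, while the substitution and ICT/ICT-SI terms grow with the separator size.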


Empirical Evaluation

• Our ICT codes in C
• IC(0) from BlockSolve95
• SPAI from ParaSails (sparse approximate inverse) included in Hypre
• Tests by linking to PETSc; results for all preconditioners with nonzeroes in the range 1-2 x nonzeroes in A
• Problems:
  • 3D finite element (hex20) grids (parallel scalability test): Poisson ratio = 0.47, 100-130 nonzero elements per row
  • Sparse matrices from applications: available at Matrix Market and University of Florida Sparse Matrix Collection
• AMD Opteron 250 cluster
  • 1-64 processors tested
  • Infiniband interconnect


Results: Hex20 Grids

• Challenge problem for testing efficiency and scalability
• Easy to precondition, not representative of problems requiring ICT-like preconditioning; i.e., not a Horror Matrix (Tim Davis)!
• Scaled problem with number of processors (weak scaling)
• Nonzeroes: 0.6 million (1 processor) to 23.6 million (32 processors)


Hex20 Iterations, ICT-??, IC(0), SPAI

[Figure, two panels: "Number of iterations: ν=0.45". x-axis: Processors (1, 2, 4, 8, 16, 32); y-axis: Iterations. First panel legend: ICT-TS(0.01), ICT-SI(0.01), ICT-SSAI(0.01). Second panel legend: IC(0), SPAI(1), ICT-SSAI(0.01). Annotations: "ICT-SSAI"; "ICT-TS, SI coincide".]

• Traditional ICT and ICT-SI iterations are the same
• ICT-SSAI has higher iterations
• ICT-SSAI iterations are lower than IC(0) and SPAI


Hex20 Construction, ICT-??, IC(0), SPAI

[Figure, two panels: "Time for preconditioner construction: ν=0.45". x-axis: Processors (1, 2, 4, 8, 16, 32); y-axis: Seconds. First panel legend: ICT-TS(0.01), ICT-SI(0.01), ICT-SSAI(0.01). Second panel legend: IC(0), SPAI(1), ICT-SSAI(0.01). Annotations: "ICT-SI time is high"; "ICT-TS, SSAI times match".]

• Traditional ICT and ICT-SSAI times match
• ICT-SI is slower, does not scale
• ICT-SSAI time matches IC(0); SPAI is slower, worse scaling


Hex20 Application, ICT-??, IC(0), SPAI

[Figure, two panels: "Time for PCG iterations: ν=0.45". x-axis: Processors (1, 2, 4, 8, 16, 32); y-axis: Seconds. First panel legend: ICT-TS(0.01), ICT-SI(0.01), ICT-SSAI(0.01). Second panel legend: IC(0), SPAI(1), ICT-SSAI(0.01). Annotations: "ICT-TS"; "ICT-SI, SSAI".]

• ICT-SI and ICT-SSAI match
• ICT-TS is slower, does not scale
• ICT-SSAI time matches SPAI
• IC(0) is slower, poor scaling


FE Horror Matrices

• Matrix 1: augustus7, Rank: 1,060,864, Nz: 9,313,87 (Kershaw sq. mesh)
• Matrix 2: ldoor, Rank: 952,203, Nz: 46,522,475
• Matrix 3: af_shell3, Rank: 504,855, Nz: 17,588,875
• CG fails without “strong” preconditioning; requires ICT/direct sparse solver


Iterations

[Figure: "Number of iterations: all problems". x-axis: Number of processors (4, 8, 16, 32, 64); y-axis: Iterations. Curves for af_shell3, augustus7, ldoor, grouped by IC(0), ICT-SSAI, SPAI.]

• ICT-SSAI iterations are 1/3 SPAI, 1/4 IC(0)


Construction

[Figure: "Time for preconditioner construction: all problems". x-axis: Number of processors (4, 8, 16, 32, 64); y-axis: Seconds. Curves for af_shell3, augustus7, ldoor, grouped by SPAI, ICT-SSAI, IC(0).]

PC construction time

• ICT-SSAI 1/5 SPAI, 1/3 IC(0) at 4 processors
• ICT-SSAI still faster at 64 processors

PCG time is 10X construction time


PCG Time

[Figure: "Time for PCG iterations: all problems". x-axis: Number of processors (4, 8, 16, 32, 64); y-axis: Seconds. Curves for af_shell3, augustus7, ldoor, grouped by SPAI, ICT-SSAI, IC(0).]

• PCG time is 10X construction time
• ICT-SSAI 1/2 SPAI, 1/4 IC(0) at 4 processors
• ICT-SSAI still faster at 64 processors
• Relative to itself, each method has the same speedup


Total Solution

[Figure: "Time for the solution: all problems". x-axis: Number of processors (4, 8, 16, 32, 64); y-axis: Seconds. Curves for af_shell3, augustus7, ldoor, grouped by SPAI, ICT-SSAI, IC(0). Annotation: "SPAI, IC(0) slowdown w.r.t. ICT-SSAI".]

• ICT-SSAI 1/4 SPAI, 1/5 IC(0) at 4 processors
• ICT-SSAI still faster at 64 processors
• Relative to itself, each method has the same speedup


Conclusions

• ICT-SSAI faster by factor of 4-5 compared to SPAI and IC(0)
• ICT-SSAI, IC(0), SPAI have similar fixed and scaled problem speedups
• ICT-SSAI has slower growth in iterations
• ICT-SSAI: hybrid of best of ICT and SPAI
  • Retains ICT-like preconditioning quality
  • Latency of preconditioner construction and application independent of problem size
  • Lower computational cost for preconditioner construction
• ICT-SSAI + TOPS, other app?
  • Nonlinear PDE-based apps with semi-implicit schemes requiring multiple solves
• Paper in draft form


Total Solution: ICT-??

ICT-TS off the chart from poor application

ICT-SI suffers from growth of PC construction

ICT-SSAI is much faster

[Figure: "Time for the solution: all problems". x-axis: Number of processors (4, 8, 16, 32, 64); y-axis: Seconds. Curves for af_shell3, augustus7, ldoor, grouped by ICT-SI, ICT-SSAI.]


Computational Costs for Preconditioner Construction

Method     K x K grids on P processors   K x K x K grids on P processors
ICT-TS     O(C^2 K^2 / P)                O(C^2 K^3 / P)
ICT-SSAI   O(C^2 K^2 / P)                O(C^2 K^3 / P)
SAI        O(C^3 K^2 / P)                O(C^3 K^3 / P)
ICT-SI     O(K^3 / P)                    O(K^6 / P)

C: Average number of nonzeroes per column
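The cost expressions can be compared by taking ratios against ICT-SSAI. This is a rough sketch that ignores the constants hidden by the O-notation; the function name `construction_cost` and the values of K, P and C are assumptions chosen only for illustration.

```python
def construction_cost(method, K, P, C, dim):
    """Leading-order construction cost, constants dropped."""
    if dim not in (2, 3):
        raise ValueError("dim must be 2 or 3")
    n = K ** dim  # unknowns in a K x K or K x K x K grid
    if method in ("ICT-TS", "ICT-SSAI"):
        return C ** 2 * n / P
    if method == "SAI":
        return C ** 3 * n / P
    if method == "ICT-SI":
        return K ** (3 * (dim - 1)) / P  # K^3 in 2D, K^6 in 3D
    raise ValueError(f"unknown method: {method}")


# Hypothetical parameters, chosen only for illustration.
K, P, C = 50, 8, 7
for dim in (2, 3):
    base = construction_cost("ICT-SSAI", K, P, C, dim)
    for method in ("ICT-TS", "SAI", "ICT-SI"):
        ratio = construction_cost(method, K, P, C, dim) / base
        print(f"{dim}D {method} / ICT-SSAI: {ratio:.2f}")
```

In this model SAI costs a factor of C more than ICT-SSAI in both 2D and 3D, while the ICT-SI ratio grows with K, most sharply in 3D where dense inversion of the separator blocks dominates.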


Incomplete Factorization Preconditioning

• Approximation of matrix factor for A:

\[ \hat{L}\hat{U} \approx A, \qquad \hat{L}\hat{L}^{T} \approx A \]

• Applied through triangular solutions
• Very effective general-purpose preconditioning
• Heuristics required for parallelization (for triangular solution)
  • Multicoloring
  • Matrix (graph) partitioning
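A drop-threshold incomplete Cholesky factorization can be sketched with dense storage to show the idea. This is not the authors' parallel code: the drop rule (discard off-diagonal entries smaller than tau times the 2-norm of the corresponding part of the column of A) is an assumed, commonly used choice, and the names `ict` and `laplacian_2d` are hypothetical.

```python
import numpy as np


def ict(A, tau):
    """Dense-storage sketch of incomplete Cholesky with a drop threshold.

    Off-diagonal entries of column j smaller in magnitude than
    tau * ||A[j:, j]||_2 are dropped (an assumed drop rule).
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        # Left-looking update: subtract contributions of earlier columns.
        col = A[j:, j] - L[j:, :j] @ L[j, :j]
        if col[0] <= 0.0:
            raise ValueError(f"breakdown at column {j}")
        col = col / np.sqrt(col[0])  # col[0] becomes the square root of the pivot
        small = np.abs(col) < tau * np.linalg.norm(A[j:, j])
        small[0] = False             # never drop the diagonal entry
        col[small] = 0.0
        L[j:, j] = col
    return L


def laplacian_2d(m):
    """5-point Laplacian on an m x m grid (an M-matrix)."""
    T = 2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    return np.kron(np.eye(m), T) + np.kron(T, np.eye(m))


A = laplacian_2d(4)
L_exact = ict(A, 0.0)  # tau = 0 keeps every entry: complete Cholesky
L_drop = ict(A, 0.1)   # tau > 0 discards small fill entries
print(np.count_nonzero(L_exact), np.count_nonzero(L_drop))
```

Varying `tau` between zero and larger values moves the factor from a complete (direct) factorization toward a sparser, cheaper approximation.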


Sparse Approximate Inverse Preconditioning

• Directly compute a preconditioner M, where M is sparse:

\[ M \approx A^{-1} \]

• Application through parallel matrix-vector multiplication: efficient!
• Construction through
  • A-Conjugation (Benzi, 1993, 1995, 1998…): ordering required for parallelization
  • Frobenius norm minimization: no ordering required for efficient parallelization


Frobenius Norm Minimization

• Grote and Huckle (1997), and Chow (2000)
• Solve the minimization problem, where M is a preconditioner:

\[ \min_{M} \| AM - I \|_F \]

• Treat it as multiple least-squares problems:

\[ \min_{m_i} \| A m_i - e_i \|_2 \]

• Construction can be done in parallel
• If A is symmetric, least-squares can be solved by Cholesky factorization
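The column-by-column structure of the minimization can be sketched as follows. This is a minimal illustration, not the ParaSails implementation: the sparsity pattern of each column of M is assumed to equal the pattern of the same column of A, and `numpy.linalg.lstsq` stands in for the Cholesky-based solve mentioned on the slide. The name `sai_frobenius` and the test matrix are hypothetical.

```python
import numpy as np


def sai_frobenius(A):
    """Sparse approximate inverse M minimizing ||A M - I||_F column by column.

    Column i of M is restricted to the nonzero pattern of column i of A
    (an assumed, simple pattern choice).
    """
    n = A.shape[0]
    M = np.zeros((n, n))
    for i in range(n):
        J = np.flatnonzero(A[:, i])  # allowed nonzero positions of m_i
        e_i = np.zeros(n)
        e_i[i] = 1.0
        # Independent least-squares problem: min || A[:, J] m - e_i ||_2
        m, *_ = np.linalg.lstsq(A[:, J], e_i, rcond=None)
        M[J, i] = m
    return M


# Hypothetical test matrix: 1D Laplacian (tridiagonal, SPD).
n = 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
M = sai_frobenius(A)
err_sai = np.linalg.norm(A @ M - np.eye(n))
err_jacobi = np.linalg.norm(A @ np.diag(1.0 / np.diag(A)) - np.eye(n))
print(err_sai < err_jacobi)
```

Each loop iteration touches only one column of M, so the columns can be computed independently, which is why no special ordering is needed for parallel construction.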


Tree-Structured Parallel ICT Factorization (Raghavan, Teranishi, and Ng: NLA2003)

• Tree provides task parallelism
• Overall performance depends on efficient data parallelism at each distributed supernode


Number of Nonzeroes in Preconditioner

Nonzeroes relative to the coefficient matrix

Preconditioner   ldoor   augustus7   af_shell3
IC(0)            1.00    1.00        1.00
SPAI (const)     1.63    3.55        1.18
SPAI (apply)     0.51    1.43        0.32
ICT-SSAI         1.29    3.03        1.79