![Page 1: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/1.jpg)
On Time Integration for Strong Scalability
Jed Brown (CU Boulder and ANL)
Collaborators: Debojyoti Ghosh (LLNL), Matt Normile (CU), Martin Schreiber (Exeter), Richard Mills (ANL)
PETSc User Meeting, 2019-06-06
This talk: https://jedbrown.org/files/20190606-StrongTime.pdf
![Page 2: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/2.jpg)
[Figure: Theoretical Peak Floating Point Operations per Byte, Double Precision. Horizontal axis: End of Year (2008 to 2018); vertical axis: FLOP per Byte (log scale, ticks at 1 and 10). Series: INTEL Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, INTEL Xeon Phis.]
![Page 3: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/3.jpg)
Latency versus Throughput
[Figure: latency versus throughput. Adapted from Kronbichler and Ljungkvist (2019).]
![Page 4: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/4.jpg)
Motivation
- Hardware trends
  - Memory bandwidth a precious commodity (8+ flops/byte)
  - Vectorization necessary for floating point performance
  - Conflicting demands of cache reuse and vectorization
  - Can deliver bandwidth, but latency is hard
- Assembled sparse linear algebra is doomed!
  - Limited by memory bandwidth (1 flop/6 bytes)
  - No vectorization without blocking, return of ELLPACK
- Spatial-domain vectorization is intrusive
  - Must be unassembled to avoid bandwidth bottleneck
  - Whether it is “hard” depends on discretization
  - Geometry, boundary conditions, and adaptivity
![Page 5: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/5.jpg)
Sparse linear algebra is dead (long live sparse . . . )
- Arithmetic intensity < 1/4
- Idea: multiple right hand sides

$$\frac{(2k\ \text{flops})(\text{bandwidth})}{\texttt{sizeof(Scalar)} + \texttt{sizeof(Int)}}, \qquad k \ll \text{avg. nz/row}$$

- Problem: popular algorithms have nested data dependencies
  - Time step
    - Nonlinear solve
      - Krylov solve
        - Preconditioner/sparse matrix
- Cannot parallelize/vectorize these nested loops
- Can we create new algorithms to reorder/fuse loops?
  - Reduce latency-sensitivity for communication
  - Reduce memory bandwidth (reuse matrix while in cache)
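
The multiple right hand side estimate can be checked with a few lines of arithmetic. This is a minimal sketch, assuming CSR-like storage with 8-byte scalars and 4-byte column indices and ignoring vector and row-pointer traffic (both are assumptions, not stated on the slide); the function name `spmv_intensity` is hypothetical.

```python
def spmv_intensity(k, scalar_bytes=8, int_bytes=4):
    """Flops per byte of matrix data when a sparse matrix multiplies k vectors.

    Each stored nonzero moves one scalar and one column index through memory
    and contributes one multiply and one add per right-hand side. Vector
    traffic is ignored (assumption), which is only reasonable while k is much
    smaller than the average number of nonzeros per row.
    """
    return 2.0 * k / (scalar_bytes + int_bytes)

intensities = {k: spmv_intensity(k) for k in (1, 2, 4, 8)}
for k, ai in intensities.items():
    print(f"k = {k}: {ai:.3f} flops/byte")
```

With these sizes a single right-hand side gives 1 flop per 6 bytes, below the 1/4 bound quoted above, and the intensity grows linearly with the number of vectors that share one pass over the matrix.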
![Page 7: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/7.jpg)
Attempt: s-step methods in 3D
[Figure: three log-log panels (work, memory, communication) versus dofs/process. Horizontal axis from $10^2$ to $10^5$; vertical axis from $10^2$ to $10^6$. Legend entries are labeled by $p$ and $s$, including $p = 1$; $p = 2$; $p = 2, s = 3$; $p = 2, s = 5$.]
- Limited choice of preconditioners (none optimal, surface/volume)
- Amortizing message latency is most important for strong-scaling
- s-step methods have high overhead for small subdomains
![Page 8: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/8.jpg)
Attempt: distribute in time (multilevel SDC/Parareal)
- PFASST algorithm (Emmett and Minion, 2012)
- Zero-latency messages (cf. performance model of s-step)
- Spectral Deferred Correction: iterative, converges to IRK (Gauss, Radau, . . . )
- Stiff problems use implicit basic integrator (synchronizing on spatial communicator)
![Page 9: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/9.jpg)
Problems with SDC and time-parallel
[Figure: two panels versus number of cores (256 to 448K). Left: speedup (1 to 1792) for ideal, PFASST (theory), PMG+SDC (fine), PMG+SDC (coarse), and PMG+PFASST. Right: relative efficiency (0.0 to 1.0) for PMG and PMG+PFASST. c/o Matthew Emmett, parallel compared to sequential SDC.]
- Iteration count not uniform in s; efficiency starts low
- Low arithmetic intensity; tight error tolerance (cf. Crank-Nicolson)
- Parabolic space-time also works, but comparison flawed
![Page 10: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/10.jpg)
Runge-Kutta methods
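$$\dot{u} = F(u)$$

$$\underbrace{\begin{pmatrix} y_1 \\ \vdots \\ y_s \end{pmatrix}}_{Y} = u_n + h \underbrace{\begin{bmatrix} a_{11} & \cdots & a_{1s} \\ \vdots & \ddots & \vdots \\ a_{s1} & \cdots & a_{ss} \end{bmatrix}}_{A} F\begin{pmatrix} y_1 \\ \vdots \\ y_s \end{pmatrix}$$

$$u_{n+1} = u_n + h b^T F(Y)$$

- General framework for one-step methods
- Diagonally implicit: $A$ lower triangular, stage order 1 (or 2 with explicit first stage)
- Singly diagonally implicit: all $A_{ii}$ equal, reuse solver setup, stage order 1
- If $A$ is a general full matrix, all stages are coupled, “implicit RK”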
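
For a linear right-hand side $F(u) = Lu$, the coupled stage system can be written with Kronecker products and solved directly. The following is a minimal dense NumPy sketch, not the PETSc implementation; the function name `irk_linear_step`, the two-stage Gauss tableau, and the rotation test problem are choices made here for illustration.

```python
import numpy as np

def irk_linear_step(A, b, L, u, h):
    """One Runge-Kutta step for the linear ODE du/dt = L u.

    Stage-major ordering: Y stacks the s stage vectors y_1, ..., y_s.
    """
    s, n = len(b), len(u)
    # Coupled stage system (I_s (x) I_n - h A (x) L) Y = 1 (x) u
    G = np.eye(s * n) - h * np.kron(A, L)
    Y = np.linalg.solve(G, np.kron(np.ones(s), u))
    # Completion: u_next = u + h (b^T (x) L) Y
    return u + h * (np.kron(b[None, :], L) @ Y)

# Two-stage Gauss method (order 4)
r3 = np.sqrt(3.0)
A = np.array([[0.25, 0.25 - r3 / 6], [0.25 + r3 / 6, 0.25]])
b = np.array([0.5, 0.5])

# Hypothetical test problem: rotation with exact solution (cos t, -sin t)
L = np.array([[0.0, 1.0], [-1.0, 0.0]])
u = np.array([1.0, 0.0])
h, nsteps = 0.1, 10
for _ in range(nsteps):
    u = irk_linear_step(A, b, L, u, h)

exact = np.array([np.cos(1.0), -np.sin(1.0)])
error = np.linalg.norm(u - exact)
print("error at t = 1:", error)
```

Because $A$ is full, both stages are solved together in one system of size $sn$, which is the coupling the last bullet refers to.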
![Page 11: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/11.jpg)
Implicit Runge-Kutta
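$$\begin{array}{c|ccc}
\frac{1}{2} - \frac{\sqrt{15}}{10} & \frac{5}{36} & \frac{2}{9} - \frac{\sqrt{15}}{15} & \frac{5}{36} - \frac{\sqrt{15}}{30} \\
\frac{1}{2} & \frac{5}{36} + \frac{\sqrt{15}}{24} & \frac{2}{9} & \frac{5}{36} - \frac{\sqrt{15}}{24} \\
\frac{1}{2} + \frac{\sqrt{15}}{10} & \frac{5}{36} + \frac{\sqrt{15}}{30} & \frac{2}{9} + \frac{\sqrt{15}}{15} & \frac{5}{36} \\
\hline
 & \frac{5}{18} & \frac{4}{9} & \frac{5}{18}
\end{array}$$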
- Excellent accuracy and stability properties
- Gauss methods with $s$ stages
  - order $2s$, $(s,s)$ Padé approximation to the exponential
  - A-stable, symplectic
- Radau (IIA) methods with $s$ stages
  - order $2s-1$, A-stable, L-stable
- Lobatto (IIIC) methods with $s$ stages
  - order $2s-2$, A-stable, L-stable, self-adjoint
- Stage order $s$ or $s + 1$
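
The three-stage Gauss coefficients on this slide can be regenerated from Gauss-Legendre quadrature. This is a small sketch assuming the standard collocation conditions $\sum_j a_{ij} c_j^{k-1} = c_i^k / k$ for $k = 1, \dots, s$; the helper name `gauss_tableau` is hypothetical.

```python
import numpy as np

def gauss_tableau(s):
    """Butcher tableau (A, b, c) of the s-stage Gauss collocation method."""
    x, w = np.polynomial.legendre.leggauss(s)   # nodes and weights on [-1, 1]
    c = (x + 1) / 2                             # shift nodes to [0, 1]
    b = w / 2                                   # rescale weights to [0, 1]
    k = np.arange(1, s + 1)
    V = c[:, None] ** (k - 1)                   # V[j, k-1] = c_j^(k-1)
    C = c[:, None] ** k / k                     # C[i, k-1] = c_i^k / k
    A = C @ np.linalg.inv(V)                    # collocation conditions: A V = C
    return A, b, c

A3, b3, c3 = gauss_tableau(3)
print(np.round(A3, 6))
print(b3)   # weights 5/18, 4/9, 5/18
```

The inverse of the Vandermonde matrix is fine for a handful of stages in double precision, but it becomes ill conditioned as $s$ grows.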
![Page 12: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/12.jpg)
Method of Butcher (1976) and Bickart (1977)
- Newton linearize Runge-Kutta system at $u^*$

$$Y = u_n + hAF(Y)$$

$$\left[ I_s \otimes I_n + hA \otimes J(u^*) \right] \delta Y = \mathrm{RHS}$$

- Solve linear system with tensor product operator

$$G = S \otimes I_n + I_s \otimes J$$

  where $S = (hA)^{-1}$ is $s \times s$ dense, $J = -\partial F(u)/\partial u$ sparse
- SDC (2000) is Gauss-Seidel with low-order corrector
- Butcher/Bickart method: diagonalize $S = V \Lambda V^{-1}$
  - $\Lambda \otimes I_n + I_s \otimes J$
  - $s$ decoupled solves
  - Complex eigenvalues (overhead for real problem)
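
The decoupling can be sketched in a few lines of dense NumPy. This is an illustration under assumptions, not the method as implemented in any library: the helper `butcher_bickart_solve` is hypothetical, $J$ is a small 1D Laplacian chosen as test data, and the tableau is the two-stage Gauss method.

```python
import numpy as np

def butcher_bickart_solve(S, J, r):
    """Solve (S (x) I_n + I_s (x) J) x = r through the eigenbasis of S."""
    s, n = S.shape[0], J.shape[0]
    lam, V = np.linalg.eig(S)                      # S = V diag(lam) V^{-1}, generally complex
    R = np.linalg.solve(V, r.reshape(s, n))        # apply V^{-1} (x) I_n
    Z = np.stack([np.linalg.solve(lam[i] * np.eye(n) + J, R[i])
                  for i in range(s)])              # s decoupled complex solves
    return (V @ Z).reshape(s * n)                  # apply V (x) I_n

# Two-stage Gauss tableau and a small 1D Laplacian standing in for J (assumed test data)
r3 = np.sqrt(3.0)
A = np.array([[0.25, 0.25 - r3 / 6], [0.25 + r3 / 6, 0.25]])
h, n = 0.1, 5
S = np.linalg.inv(h * A)
J = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

rhs = np.arange(1.0, 2 * n + 1)
G = np.kron(S, np.eye(n)) + np.kron(np.eye(2), J)
x_direct = np.linalg.solve(G, rhs)
x_bb = butcher_bickart_solve(S, J, rhs)
print(np.max(np.abs(x_bb - x_direct)))
```

Each of the $s$ solves involves a complex shift of $J$ even though the original problem is real, which is the overhead noted in the last bullet.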
![Page 13: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/13.jpg)
Eigenbasis ill conditioning $A = V \Lambda V^{-1}$
![Page 14: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/14.jpg)
Skip the diagonalization
$$\underbrace{\begin{bmatrix} s_{11} I + J & s_{12} I \\ s_{21} I & s_{22} I + J \end{bmatrix}}_{S \otimes I_n + I_s \otimes J}
\qquad
\underbrace{\begin{bmatrix} S + j_{11} I & j_{12} I & \\ j_{21} I & S + j_{22} I & j_{23} I \\ & j_{32} I & S + j_{33} I \end{bmatrix}}_{I_n \otimes S + J \otimes I_s}$$

- Accessing memory for $J$ dominates cost
- Irregular vector access in application of $J$ limits vectorization
- Permute Kronecker product to reuse $J$ and make fine-grained structure regular
- Stages coupled via register transpose at spatial-point granularity
- Same convergence properties as Butcher/Bickart
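
The two orderings differ only by a symmetric permutation of the unknowns, which a small dense check makes concrete. This is a sketch with random stand-ins for $S$ and $J$ (assumed test data); `perm` maps each point-major position to the matching stage-major index.

```python
import numpy as np

rng = np.random.default_rng(0)
s, n = 3, 4
S = rng.standard_normal((s, s))      # stands in for (hA)^{-1}
J = rng.standard_normal((n, n))      # stands in for the spatial Jacobian

G_stage = np.kron(S, np.eye(n)) + np.kron(np.eye(s), J)   # stage-major unknowns
G_point = np.kron(np.eye(n), S) + np.kron(J, np.eye(s))   # point-major unknowns

# perm[p*s + i] = i*n + p: stage i at spatial point p
perm = np.arange(s * n).reshape(s, n).T.ravel()
G_permuted = G_stage[np.ix_(perm, perm)]
print(np.allclose(G_permuted, G_point))
```

Since the matrices are similar through a permutation, spectra and Krylov convergence are unchanged; only the memory layout differs.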
![Page 15: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/15.jpg)
PETSc MatKAIJ: “sparse” Kronecker product matrices
$$G = I_n \otimes S + J \otimes T$$

- $J$ is parallel and sparse, $S$ and $T$ are small and dense
- More general than multiple RHS (multivectors)
- Compare $J \otimes I_s$ to multiple right hand sides in row-major
- Runge-Kutta systems have $T = I_s$ (permuted from Butcher method)
- Stream $J$ through cache once, same efficiency as multiple RHS
- Unintrusive compared to spatial-domain vectorization or s-step
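
The action of $G$ never requires forming the Kronecker products. The sketch below is plain NumPy with a dense $J$ for brevity, not PETSc code; `kaij_like_apply` is a hypothetical name, and the point-major layout (one row of length $s$ per spatial point) is assumed.

```python
import numpy as np

def kaij_like_apply(S, T, J, x):
    """Compute (I_n (x) S + J (x) T) x without forming the Kronecker products."""
    n, s = J.shape[0], S.shape[0]
    X = x.reshape(n, s)              # row p holds the s values at spatial point p
    JX = J @ X                       # J is read once for all s columns
    return (X @ S.T + JX @ T.T).reshape(n * s)

rng = np.random.default_rng(1)
n, s = 6, 3
J = rng.standard_normal((n, n))
S = rng.standard_normal((s, s))
T = rng.standard_normal((s, s))
x = rng.standard_normal(n * s)

y_fast = kaij_like_apply(S, T, J, x)
y_ref = (np.kron(np.eye(n), S) + np.kron(J, T)) @ x
print(np.max(np.abs(y_fast - y_ref)))
```

The product `J @ X` is the same operation as a sparse matrix applied to $s$ right hand sides stored in row-major order, which is why the cost of streaming $J$ is shared across stages.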
![Page 16: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/16.jpg)
Convergence with point-block Jacobi preconditioning
- 3D centered-difference diffusion problem

| Method | Order | nsteps | Krylov its. (Average) |
|---|---|---|---|
| Gauss 1 | 2 | 16 | 130 (8.1) |
| Gauss 2 | 4 | 8 | 122 (15.2) |
| Gauss 4 | 8 | 4 | 100 (25) |
| Gauss 8 | 16 | 2 | 78 (39) |
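
In the point-major ordering, point-block Jacobi inverts one small $s \times s$ block $S + j_{pp} I_s$ per spatial point. The following is a dense NumPy sketch of that preconditioner application under assumed test data (a 1D Laplacian for $J$ and a random, diagonally shifted $S$); `point_block_jacobi` is a hypothetical helper, not a PETSc routine.

```python
import numpy as np

def point_block_jacobi(S, J, r):
    """Apply the inverse of the s-by-s diagonal blocks of I_n (x) S + J (x) I_s."""
    n, s = J.shape[0], S.shape[0]
    R = r.reshape(n, s)                                            # one row per spatial point
    blocks = S[None, :, :] + np.diag(J)[:, None, None] * np.eye(s) # block p is S + j_pp I_s
    Z = np.linalg.solve(blocks, R[:, :, None])[:, :, 0]            # n independent small solves
    return Z.reshape(n * s)

rng = np.random.default_rng(2)
n, s = 5, 3
J = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
S = rng.standard_normal((s, s)) + 4 * np.eye(s)
r = rng.standard_normal(n * s)
z = point_block_jacobi(S, J, r)

# Reference: explicit block-diagonal part of the assembled operator
G = np.kron(np.eye(n), S) + np.kron(J, np.eye(s))
D = np.zeros_like(G)
for p in range(n):
    sl = slice(p * s, (p + 1) * s)
    D[sl, sl] = G[sl, sl]
print(np.max(np.abs(D @ z - r)))
```

The small solves are independent across points, so the preconditioner needs no communication beyond what the operator application already does.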
![Page 17: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/17.jpg)
We really want multigrid
- Prolongation: $P \otimes I_s$
- Coarse operator: $I_n \otimes S + (RJP) \otimes I_s$
- Larger time steps
- GMRES(2)/point-block Jacobi smoothing
- FGMRES outer

| Method | Order | nsteps | Krylov its. (Average) |
|---|---|---|---|
| Gauss 1 | 2 | 16 | 82 (5.1) |
| Gauss 2 | 4 | 8 | 64 (8) |
| Gauss 4 | 8 | 4 | 44 (11) |
| Gauss 8 | 16 | 2 | 42 (21) |
![Page 18: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/18.jpg)
Toward a better AMG for IRK/tensor-product systems
- Start with $\mathcal{R} = R \otimes I_s$, $\mathcal{P} = P \otimes I_s$

$$G_{\text{coarse}} = \mathcal{R}(I_n \otimes S + J \otimes I_s)\mathcal{P}$$

- Imaginary component slows convergence
- Can we use a Kronecker product interpolation?
- Rotation on coarse grids (connections to shifted Laplacian)
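
The Galerkin coarse operator can be expanded with the mixed-product property $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$. The short derivation below writes the extended restriction and prolongation explicitly as $R \otimes I_s$ and $P \otimes I_s$, with $R$ and $P$ the spatial transfer operators:

$$
\begin{aligned}
G_{\text{coarse}} &= (R \otimes I_s)(I_n \otimes S + J \otimes I_s)(P \otimes I_s) \\
&= (R I_n P) \otimes (I_s S I_s) + (R J P) \otimes (I_s I_s I_s) \\
&= (RP) \otimes S + (RJP) \otimes I_s .
\end{aligned}
$$

The coarse operator therefore keeps the Kronecker structure with the same small dense $S$. It reduces to $I \otimes S + (RJP) \otimes I_s$ only when $RP$ is the identity on the coarse space; otherwise $RP$ acts like a coarse mass-type matrix in the first term.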
![Page 19: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/19.jpg)
Why implicit is silly for waves
- Implicit methods require an implicit solve in each stage.
- Time step size proportional to CFL for accuracy reasons.
- Methods higher than first order are not unconditionally strong stability preserving (SSP; Spijker 1983).
  - Empirically, $c_{\text{eff}} \le 2$, Ketcheson, Macdonald, Gottlieb (2008) and others
  - Downwind methods offer to bypass, but so far not practical
- Time step size chosen for stability
  - Increase order if more accuracy needed
  - Large errors from spatial discretization, modest accuracy
- My goal: need less memory motion per stage
  - Better accuracy, symplecticity nice bonus only
  - Cannot sell method without efficiency
![Page 20: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/20.jpg)
Implicit Runge-Kutta for advection
Table: Total number of iterations (communications or accesses of $J$) to solve linear advection to $t = 1$ on a 1024-point grid using point-block Jacobi preconditioning of implicit Runge-Kutta matrix. The relative algebraic solver tolerance is $10^{-8}$.

| Method | Order | nsteps | Krylov its. (Average) |
|---|---|---|---|
| Gauss 1 | 2 | 1024 | 3627 (3.5) |
| Gauss 2 | 4 | 512 | 2560 (5) |
| Gauss 4 | 8 | 256 | 1735 (6.8) |
| Gauss 8 | 16 | 128 | 1442 (11.2) |

- Naive centered-difference discretization
- Leapfrog requires 1024 iterations at CFL=1
- This is A-stable (can handle dissipation)
![Page 21: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/21.jpg)
Diagonalization revisited
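$$(I \otimes I - hA \otimes L) Y = (\mathbf{1} \otimes I) u_n \tag{1}$$

$$u_{n+1} = u_n + h (b^T \otimes L) Y \tag{2}$$

- Eigendecomposition $A = V \Lambda V^{-1}$

$$(V \otimes I)(I \otimes I - h\Lambda \otimes L)(V^{-1} \otimes I) Y = (\mathbf{1} \otimes I) u_n.$$

- Find diagonal $W$ such that $W^{-1}\mathbf{1} = V^{-1}\mathbf{1}$
- Commute diagonal matrices

$$(I \otimes I - h\Lambda \otimes L)\underbrace{(W V^{-1} \otimes I) Y}_{Z} = (\mathbf{1} \otimes I) u_n.$$

- Using $\tilde{b}^T = b^T V W^{-1}$, we have the completion formula

$$u_{n+1} = u_n + h(\tilde{b}^T \otimes L) Z.$$

- $\Lambda, \tilde{b}$ is new diagonal Butcher table
- Compute coefficients offline using extended precision to handle ill-conditioning of $V$
- Equivalent for linear problems, usually fails nonlinear stability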
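
The equivalence for linear problems can be checked on the scalar stability function $R(z) = 1 + z\, b^T (I - zA)^{-1} \mathbf{1}$. This is a sketch in double precision only (the slide recommends extended precision offline); the helper names are hypothetical, the two-stage Gauss tableau is used as test data, and the construction assumes no entry of $V^{-1}\mathbf{1}$ is zero.

```python
import numpy as np

def diagonalize_tableau(A, b):
    """Return (lam, b_tilde): diagonal Butcher table equivalent for linear problems."""
    lam, V = np.linalg.eig(A)                        # A = V diag(lam) V^{-1}
    w_inv = np.linalg.solve(V, np.ones(len(b)))      # diagonal of W^{-1}, from W^{-1} 1 = V^{-1} 1
    b_tilde = (b @ V) * w_inv                        # b_tilde^T = b^T V W^{-1}
    return lam, b_tilde

def stability(A, b, z):
    """R(z) = 1 + z b^T (I - z A)^{-1} 1."""
    s = len(b)
    return 1 + z * (b @ np.linalg.solve(np.eye(s) - z * A, np.ones(s)))

r3 = np.sqrt(3.0)
A = np.array([[0.25, 0.25 - r3 / 6], [0.25 + r3 / 6, 0.25]])
b = np.array([0.5, 0.5])
lam, b_tilde = diagonalize_tableau(A, b)

z = -0.3 + 0.7j
R_full = stability(A, b, z)
R_diag = stability(np.diag(lam), b_tilde, z)
print(abs(R_full - R_diag))
```

The diagonal table has complex coefficients, so each stage is an independent complex shifted solve.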
![Page 22: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/22.jpg)
Exploiting realness
- Eigenvalues come in conjugate pairs

$$A = V \Lambda V^{-1}$$

- For each conjugate pair, create unitary transformation

$$T = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ i & -i \end{bmatrix}$$

- Real $2 \times 2$ block diagonal $D$; real $\tilde{V}$ (with appropriate phase)

$$A = (V T^*)(T \Lambda T^*)(T V^{-1}) = \tilde{V} D \tilde{V}^{-1}$$

- Yields new block-diagonal Butcher table $D, \tilde{b}$.
- Halve number of stages using identity

$$(\bar{\alpha} + J)^{-1} u = \overline{(\alpha + J)^{-1} u}$$

  for real $J$ and $u$. Solve one complex problem per conjugate pair, then take twice the real part.
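
The conjugate-pair identity holds for real $J$ and real $u$ and is easy to confirm numerically. In this sketch the matrix, right-hand side, shift `alpha`, and weight `beta` are arbitrary test values chosen here, not coefficients of any particular method.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
J = rng.standard_normal((n, n)) + n * np.eye(n)   # real matrix (hypothetical test data)
u = rng.standard_normal(n)                        # real right-hand side
alpha, beta = 1.5 + 2.0j, 0.3 - 0.4j              # one member of a conjugate pair and its weight

I = np.eye(n)
x = np.linalg.solve(alpha * I + J, u)                 # the one solve that is actually needed
x_conj = np.linalg.solve(np.conj(alpha) * I + J, u)   # the solve that can be skipped

both = beta * x + np.conj(beta) * x_conj   # contribution of the full conjugate pair
one = 2 * (beta * x).real                  # same result from a single complex solve
print(np.max(np.abs(both - one)))
```

The saving applies whenever the weights of the two stages are also complex conjugates, as they are in the diagonalized table of a real method.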
![Page 23: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/23.jpg)
Conditioning
[Figure: Conditioning of diagonalization: gauss(V). Horizontal axis: number of stages (2 to 14); vertical axis: log scale from $10^0$ to $10^7$. Curves: $\|b\|_1$ and $\|b_{\text{re}}\|_1$.]
- Diagonalization in extended precision helps somewhat, as does real formulation
- Neither makes arbitrarily large number of stages viable
![Page 24: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/24.jpg)
REXI: Rational approximation of exponential
$$u(t) = e^{Lt} u(0)$$

- Haut, Babb, Martinsson, Wingate; Schreiber and Loft

$$(\alpha \otimes I + hI \otimes L) Y = (\mathbf{1} \otimes I) u_n$$

$$u_{n+1} = (\beta^T \otimes I) Y.$$

- $\alpha$ is complex-valued diagonal, $\beta$ is complex
- Constructs rational approximations of Gaussian basis functions, target (real part of) $e^{it}$
- REXI is a Runge-Kutta method: can convert via “modified Shu-Osher form”
  - Developed for SSP (strong stability preserving) methods
  - Ferracina, Spijker (2005), Higueras (2005)
  - Yields diagonal Butcher table $A = -\alpha^{-1}$, $b = -\alpha^{-2}\beta$
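
The conversion to a diagonal Butcher table can be verified on scalar stability functions: the REXI form is $\sum_i \beta_i / (\alpha_i + z)$ and the Runge-Kutta form is $1 + z \sum_i b_i / (1 - z a_i)$. The sketch below uses made-up poles and weights (not actual REXI coefficients) and assumes the consistency condition $\sum_i \beta_i / \alpha_i = 1$, which is enforced by rescaling.

```python
def rexi_to_rk(alpha, beta):
    """Diagonal Butcher table entries a_i = -1/alpha_i, b_i = -beta_i/alpha_i^2."""
    a = [-1.0 / al for al in alpha]
    b = [-be / al**2 for al, be in zip(alpha, beta)]
    return a, b

def rexi_R(alpha, beta, z):
    """Stability function in REXI (partial fraction) form."""
    return sum(be / (al + z) for al, be in zip(alpha, beta))

def rk_R(a, b, z):
    """Stability function of a Runge-Kutta method with diagonal A."""
    return 1 + z * sum(bi / (1 - z * ai) for ai, bi in zip(a, b))

# Hypothetical coefficients, rescaled so that sum(beta / alpha) = 1, i.e. R(0) = 1
alpha = [-2.0 + 1.0j, -2.0 - 1.0j, -3.0 + 0.0j]
raw = [1.0 + 0.5j, 1.0 - 0.5j, 0.7 + 0.0j]
scale = sum(be / al for al, be in zip(alpha, raw))
beta = [be / scale for be in raw]

a, b = rexi_to_rk(alpha, beta)
z = 0.2 + 0.9j
difference = abs(rexi_R(alpha, beta, z) - rk_R(a, b, z))
print(difference)
```

Without the consistency condition the two forms differ by the constant $1 - \sum_i \beta_i / \alpha_i$, so the check also indicates whether a given coefficient set reproduces $R(0) = 1$.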
![Page 25: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/25.jpg)
Abscissa for RK and REXI methods
[Figure: Abscissa. Points in the complex plane for Gauss, Gauss diag, Gauss rbdiag, REXI 33, and Cauchy 16.]
![Page 26: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/26.jpg)
Stability regions
[Figure: two panels, “Gauss(4) Stability” and “Cauchy(64) Stability”, each with horizontal axis from −10 to 10, vertical axis from −20 to 20, and a color scale from 0.0 to 2.0.]
![Page 27: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/27.jpg)
Order stars
[Figure: two panels, “Gauss(4) Order Star” and “Cauchy(64) Order Star”, each with horizontal axis from −10 to 10, vertical axis from −20 to 20, and a symmetric logarithmic color scale from $10^{-3}$ to $10^{0}$ on either side of 0.]
![Page 28: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/28.jpg)
Computational Performance (SWE on the plane)
Spectral solver with RK4 time stepping method vs. REXI with spectral solver
Resolution = $128^2$
More than one order of magnitude faster with similar accuracy
Proof of concept that REXI works with spectral methods
Computed on Linux Cluster, LRZ / Technical University of Munich
M. Schreiber, P. Peixoto, T. S. Haut, B. A. Wingate. Beyond spatial scalability limitations with a massively parallel method for linear oscillatory problems. International Journal of High Performance Computing Applications.
![Page 29: On Time Integration for Strong Scalability - Jed Brown · I Implicit methods require an implicit solve in each stage. I Time step size proportional to CFL for accuracy reasons. I](https://reader033.fdocuments.us/reader033/viewer/2022050417/5f8ce606d2e79c3793206ff9/html5/thumbnails/29.jpg)
Outlook on Kronecker product solvers
$$I \otimes S + J \otimes T$$

- (Block) diagonal $S$ is usually sufficient
- Best opportunity for “time parallel” (for linear problems)
  - Is it possible to beat explicit wave propagation with high efficiency?
- Same structure for stochastic Galerkin and other UQ methods
- IRK unintrusively offers bandwidth reuse and vectorization
- Need polynomial smoothers for IRK spectra
- Change number of stages on spatially-coarse grids (p-MG, or even increase)?
- Experiment with SOR-type smoothers
  - Prefer point-block Jacobi in smoothers for spatial parallelism
- Possible IRK correction for IMEX (non-smooth explicit function)
- PETSc implementation (works in parallel, hardening in progress)
- Thanks to DOE ASCR