Mixed-Precision GPU-Multigrid Solvers with Strong Smoothers and Applications in CFD and CSM
Dominik Göddeke
Institut für Angewandte Mathematik (LS3), TU Dortmund
SIMTECH 2011, International Conference on Simulation Technology
University of Stuttgart, June 16, 2011
Motivation
Hardware isn’t our friend any more
Paradigm shift towards parallelism and heterogeneity
In a single chip: multicores, GPUs, ...
In a workstation, cluster node, ...
In a big cluster, supercomputer, ...
Data movement cost gets prohibitively expensive
Technical reason: Power wall + memory wall + ILP wall = brick wall
Challenges in numerical HPC
Existing codes don’t run faster automatically any more
Compilers can’t solve these problems, libraries are limited
Traditional numerics is often contrary to these hardware trends
We (the numerics people) have to take action
Hardware-oriented numerics
Conflicting situations
Existing methods no longer hardware-compatible
Neither want less numerical efficiency, nor less hardware efficiency
Challenge: New algorithmic way of thinking
Balance these conflicting goals
Consider short-term hardware details in actual implementations, but long-term hardware trends in the design of numerical schemes
Locality, locality, locality
Communication-avoiding (-delaying) algorithms between all flavours of parallelism
Multilevel methods, hardware-aware preconditioning
Grid and Matrix Structures
Flexibility ↔ Performance
Grid and matrix structures
General sparse matrices (unstructured grids)
CSR (and variants): General data structure for arbitrary grids
Maximum flexibility, but during SpMV:
Indirect, irregular memory accesses
Index overhead further reduces the already low arithmetic intensity
Performance depends on nonzero pattern (grid numbering)
Structured sparse matrices
Example: Structured grids, suitable numbering ⇒ band matrices
Important: No stencils, fully variable coefficients
Direct regular memory accesses, fast independent of mesh
Exploitation in the design of strong MG components
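The band-matrix argument can be made concrete with a small sketch (illustrative only, not FEAST code; the function name and storage convention `bands[k][i] = A[i, i + offsets[k]]` are assumptions for this example). Each diagonal is stored as a full array with fully variable coefficients, so the SpMV touches memory contiguously per band instead of through CSR index indirection:

```python
import numpy as np

def band_spmv(bands, offsets, x):
    """y = A @ x for a matrix stored band-wise: bands[k][i] holds
    A[i, i + offsets[k]] (variable coefficients, no stencil).
    Each band contributes one contiguous, regular memory sweep."""
    n = x.size
    y = np.zeros(n)
    for band, off in zip(bands, offsets):
        if off >= 0:
            y[:n - off] += band[:n - off] * x[off:]   # upper/main diagonals
        else:
            y[-off:] += band[-off:] * x[:n + off]     # lower diagonals
    return y
```

For a nine-point (Q1) discretisation on a rowwise-numbered structured mesh, `offsets` would be the nine values −M−1, −M, −M+1, −1, 0, 1, M−1, M, M+1 from the macro sketch above.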
Example: Poisson on unstructured mesh
[Figure: linear solver time in seconds (smaller is better) for the numberings 2LVL, CM, XYZ, HIER and the BAND format; bars for 1 thread, 4 threads, GPU and MPI (4x).]
Nehalem vs. GT200, ≈ 2M bilinear FE, MG-JAC solver
Unstructured formats highly numbering-dependent
Multicore 2–3x over singlecore, GPU 8–12x over multicore
Banded format (here: 8 ‘blocks’) 2–3x faster than the best unstructured layout and predictably on par with multicore
Discretisation and Solver Structures in FEAST
Scalable, Locality-preserving Parallel Multilevel Solvers
Approach in FEAST
Combination of structured and unstructured advantages
Global macro-mesh: Unstructured, flexible, complex domains
Local micro-meshes: Structured (logical TP-structure), fast
Important: Structured ≠ simple meshes!
[Figure: unstructured coarse mesh of hierarchically refined, rowwise-numbered subdomains (“macros”) Ωi; per-macro “window” for matrix-vector multiplication with nine bands (LL, LD, LU, DL, DD, DU, UL, UD, UU) coupling unknown I to I±1, I±M and I±M±1.]
Solver approach ScaRC exploits data layout
Parallel efficiency: Strong and weak scalability
Numerical scalability: Convergence rates independent of problem size and partitioning (multigrid!)
Robustness: Mesh and operator anisotropies (strong smoothers!)
ScaRC: Concepts
ScaRC for scalar systems
Hybrid multilevel domain decomposition method
Minimal overlap by extended Dirichlet BCs
Inspired by parallel MG (‘best of both worlds’)
Multiplicative between levels, global coarse grid problem (MG-like)
Additive horizontally: block-Jacobi / Schwarz smoother (DD-like)
Schwarz smoother encapsulates local irregularities
Robust and fast multigrid (‘gain a digit’), strong smoothers
Maximum exploitation of local structure
global BiCGStab
preconditioned by
global multilevel (V 1+1)
additively smoothed by
for all Ωi: local multigrid
coarse grid solver: UMFPACK
ScaRC for multivariate problems
Block-structured systems
Guiding idea: Tune the scalar case once per architecture instead of over and over again per application
Blocks correspond to scalar subequations, coupling via special preconditioners
Block-wise treatment enables multivariate ScaRC solvers
$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = f,\qquad \begin{pmatrix} A_{11} & 0 & B_1 \\ 0 & A_{22} & B_2 \\ B_1^T & B_2^T & 0 \end{pmatrix}\begin{pmatrix} v_1 \\ v_2 \\ p \end{pmatrix} = f,\qquad \begin{pmatrix} A_{11} & A_{12} & B_1 \\ A_{21} & A_{22} & B_2 \\ B_1^T & B_2^T & C \end{pmatrix}\begin{pmatrix} v_1 \\ v_2 \\ p \end{pmatrix} = f$$
A11 and A22 correspond to scalar (elliptic) operators ⇒ tuned linear algebra and tuned solvers
Minimally invasive accelerator integration
Bandwidth distribution in a hybrid CPU/GPU node
Minimally invasive accelerator integration
Guiding concept: locality
Accelerators: Most time-consuming inner component
CPUs: Outer MLDD solver (only hardware capable of MPI anyway)
Block-structured approach inside an MPI rank allows double-buffering and PCIe communication overlap
Employ mixed precision approach
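The mixed precision approach can be sketched as iterative refinement (a minimal illustration, not the FEAST implementation): the outer loop keeps residual and solution in double precision, while the inner solve — the GPU multigrid in the real solver — runs in single precision. Here `np.linalg.solve` in float32 merely stands in for that inner solver:

```python
import numpy as np

def mixed_precision_refine(A, b, inner_solve, tol=1e-10, max_iter=50):
    """Iterative refinement: residual and update in float64,
    inner (accelerator) solve in float32."""
    x = np.zeros_like(b)
    for _ in range(max_iter):
        r = b - A @ x                                   # double-precision residual
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        # inner solver sees only single-precision data
        c = inner_solve(A.astype(np.float32), r.astype(np.float32))
        x = x + c.astype(np.float64)                    # double-precision update
    return x
```

The outer correction recovers double-precision accuracy even though each inner solve is only accurate to single precision, which is what makes the fast low-precision GPU path usable without sacrificing accuracy.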
global BiCGStab
preconditioned by
global multilevel (V 1+1)
additively smoothed by
for all Ωi: local multigrid
coarse grid solver: UMFPACK
Minimally invasive accelerator integration
Benefits and challenges
Balance acceleration potential and integration effort
Accelerate many different applications built on top of one central FE and solver toolkit
Diverge code paths as late as possible
Develop on a single GPU and scale out later
Retain all functionality
Do not sacrifice accuracy
No changes to application code!
Challenges
Heterogeneous task assignment to maximise throughput
Overlapping CPU and GPU computations with transfers
Strong Smoothers
Parallelising Inherently Sequential Operations
Motivation: Why strong smoothers?
Test case: Generalised Poisson problem with anisotropic diffusion
−∇ · (G ∇u) = f on unit square (one FEAST patch)
G = I: standard Poisson problem, G ≠ I: arbitrarily challenging
Example: G introduces anisotropic diffusion along some vector field
[Figure: time per digit per DOF (log10, smaller is better), CPU, double precision, for problem sizes 33² (L=5) to 1025² (L=10); curves BiCGStab(JAC), MG(JAC), BiCGStab(ADITRIGS), MG(ADITRIGS).]
Only multigrid with a strong smoother is competitive
Gauß-Seidel smoother
Disclaimer: Not necessarily a good smoother, but a good didactical example.
Sequential algorithm
Forward elimination, sequential dependencies between matrix rows
Illustrative: Coupling to the left and bottom
1st idea: Classical wavefront-parallelisation (exact)
Pro: Always works to resolve explicit dependencies
Con: Irregular parallelism and access patterns, implementable?
Gauß-Seidel smoother
2nd idea: Decouple dependencies via multicolouring (inexact)
Jacobi (red) – coupling to left (green) – coupling to bottom (blue) – coupling to left and bottom (yellow)
Analysis
Parallel efficiency: 4 sweeps with ≈ N/4 parallel work each
Regular data access, but checkerboard pattern challenging for SIMD/GPUs due to strided access
Numerical efficiency: Sequential coupling only in last sweep
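For the 5-point Laplacian two colours already suffice, so the multicolouring idea can be sketched in its red-black form (a hypothetical helper, not code from the talk): within one colour every point depends only on the other colour, so the whole colour updates in parallel — here vectorised per row, on a GPU one thread per point:

```python
import numpy as np

def redblack_gs_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep for the 5-point Laplacian
    (-u_xx - u_yy = f) on a square grid; Dirichlet boundary kept in u.
    Points of one colour are mutually independent, so each row update
    below is fully vectorised (the parallel part of the smoother)."""
    n = u.shape[1]
    for c in (0, 1):                        # red sweep, then black sweep
        for i in range(1, u.shape[0] - 1):
            j0 = 2 - ((i + c) % 2)          # first interior column of colour c
            u[i, j0:n-1:2] = 0.25 * (u[i-1, j0:n-1:2] + u[i+1, j0:n-1:2]
                                     + u[i, j0-1:n-2:2] + u[i, j0+1::2][:len(u[i, j0:n-1:2])]
                                     + h * h * f[i, j0:n-1:2])
    return u
```

The 4-colour scheme on the slide plays the same game for the 9-point coupling of bilinear elements; the price in both cases is the strided (checkerboard) access pattern mentioned above.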
Gauß-Seidel smoother
3rd idea: Multicolouring = renumbering
After decoupling: ‘Standard’ update (left+bottom) is suboptimal
Does not include all already available results
Recoupling: Jacobi (red) – coupling to left and right (green) – top and bottom (blue) – all 8 neighbours (yellow)
More computations than standard decoupling
Experiments: Convergence rates of the sequential variant recovered (in absence of a preferred direction)
Tridiagonal smoother (line relaxation)
Starting point
Good for ‘line-wise’ anisotropies
‘Alternating Direction Implicit (ADI)’ technique alternates rows and columns
CPU implementation: Thomas algorithm (inherently sequential)
Observations
One independent tridiagonal system per mesh row
⇒ top-level parallelisation across mesh rows
Implicit coupling: Wavefront and colouring techniques not applicable
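The sequential baseline can be sketched as the classical Thomas algorithm (illustrative; the storage convention — sub-, main- and super-diagonals `a`, `b`, `c` with `a[0] = c[-1] = 0` — is an assumption of this example). In the parallelisation above, one such independent solve would run per mesh row:

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system A x = d with sub-diagonal a,
    main diagonal b, super-diagonal c (a[0] = c[-1] = 0).
    Forward elimination then back substitution: inherently sequential,
    every step depends on the previous row."""
    n = b.size
    cp = np.empty(n)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                       # forward elimination
        m = b[i] - a[i] * cp[i-1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i-1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):              # back substitution
        x[i] = dp[i] - cp[i] * x[i+1]
    return x
```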
Tridiagonal smoother (line relaxation)
Cyclic reduction for tridiagonal systems
Exact, stable (w/o pivoting) and cost-efficient
Problem: Classical formulation parallelises computation but not memory accesses on GPUs (bank conflicts in shared memory)
Developed a better formulation, 2–4x faster
Index challenge, general idea: Recursive padding between odd and even indices on all levels
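The algorithmic structure of cyclic reduction (without the memory-layout and padding optimisations the slide refers to) can be sketched recursively: each level eliminates the odd-indexed unknowns, all of them independently and hence in parallel, and recurses on a half-sized tridiagonal system. A minimal sketch, assuming the same `a`, `b`, `c` diagonal convention as above:

```python
import numpy as np

def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system (sub/main/super diagonals a, b, c,
    a[0] = c[-1] = 0) by cyclic reduction: each level is parallel over
    the even-indexed equations, recursion depth is O(log2 n)."""
    n = b.size
    if n == 1:
        return np.array([d[0] / b[0]])
    e = np.arange(0, n, 2)                       # unknowns kept on the coarse level
    em1 = np.maximum(e - 1, 0)
    ep1 = np.minimum(e + 1, n - 1)
    # elimination factors toward the two neighbouring odd equations
    alpha = np.where(e > 0, -a[e] / b[em1], 0.0)
    gamma = np.where(e < n - 1, -c[e] / b[ep1], 0.0)
    a2 = alpha * a[em1]
    b2 = b[e] + alpha * c[em1] + gamma * a[ep1]
    c2 = gamma * c[ep1]
    d2 = d[e] + alpha * d[em1] + gamma * d[ep1]
    x = np.empty(n)
    x[e] = cyclic_reduction(a2, b2, c2, d2)      # recurse on half-sized system
    o = np.arange(1, n, 2)                       # back-substitute odd unknowns
    xr = np.where(o < n - 1, x[np.minimum(o + 1, n - 1)], 0.0)
    x[o] = (d[o] - a[o] * x[o - 1] - c[o] * xr) / b[o]
    return x
```

The GPU issue the slide addresses is that the natural even/odd index strides of this scheme map badly onto shared-memory banks; the recursive padding reorganises the indexing, not the arithmetic shown here.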
Combined GS and TRIDI
Starting point
CPU implementation: Shift the previous row to the RHS and solve the remaining tridiagonal system with the Thomas algorithm
Combined with ADI, this is the best general smoother (we know of) for this matrix structure
Observations and implementation
Difference to tridiagonal solvers: Mesh rows depend sequentially on each other
Use colouring (#c ≥ 2) to decouple the dependencies between rows (more colours = more similar to the sequential variant)
Evaluation: Total efficiency on CPU and GPU
Test problem: Generalised Poisson with anisotropic diffusion
Total efficiency: Time per unknown per digit (µs)
Mixed precision iterative refinement multigrid solver
Intel Westmere vs. NVIDIA Fermi
[Figure: total runtime efficiency (log10, larger is better) vs. problem size 33 (L=5) to 1025 (L=10); left, CPU: GSROW(1.0), ADITRIDI(0.8), ADITRIGS(1.0); right, GPU: MC-GSROW(1.0), ADITRIDI(0.8), MC-ADITRIGS(1.0).]
Speedup GPU vs. CPU
[Figure: speedup GPU vs. CPU (log10, larger is better) vs. problem size; curves GSROW, ADITRIDI, ADITRIGS.]
Summary: Smoother parallelisation
Factor 10–30 speedup (depending on precision and smoother selection) over an already highly tuned CPU implementation
Same functionality on CPU and GPU
Balancing of numerical and parallel efficiency (hardware-oriented numerics)
Cluster Results
Linearised elasticity
$$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} = f,\qquad A = \begin{pmatrix} (2\mu+\lambda)\,\partial_{xx} + \mu\,\partial_{yy} & (\mu+\lambda)\,\partial_{xy} \\ (\mu+\lambda)\,\partial_{yx} & \mu\,\partial_{xx} + (2\mu+\lambda)\,\partial_{yy} \end{pmatrix}$$
global multivariate BiCGStab
block-preconditioned by
global multivariate multilevel (V 1+1)
additively smoothed (block GS) by
for all Ωi: solve A11 c1 = d1 by local scalar multigrid
update RHS: d2 = d2 − A21 c1
for all Ωi: solve A22 c2 = d2 by local scalar multigrid
coarse grid solver: UMFPACK
Speedup
[Figure: linear solver time in seconds (smaller is better) for the configurations BLOCK, PIPE, CRACK, FRAME; bars for singlecore, dualcore, GPU.]
USC cluster in Los Alamos, 16 dualcore nodes (Opteron Santa Rosa, Quadro FX5600)
Problem size 128 M DOF
Dualcore 1.6x faster than singlecore (memory wall)
GPU 2.6x faster than singlecore, 1.6x than dualcore
Speedup analysis
Theoretical model of expected speedup
Integration of GPUs increases resources
Correct model: Strong scaling within each node
Acceleration potential of the elasticity solver: R_acc = 2/3 (remaining time spent in MPI and the outer solver)
S_max = 1 / (1 − R_acc)
S_model = 1 / ((1 − R_acc) + R_acc / S_local)
This example
| Quantity | Value |
|---|---|
| Accelerable fraction R_acc | 66% |
| Local speedup S_local | 9x |
| Modeled speedup S_model | 2.5x |
| Measured speedup S_total | 2.6x |
| Upper bound S_max | 3x |
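The model can be checked directly against these numbers (a straightforward evaluation of the formulas above):

```python
def smodel(r_acc, s_local):
    """Strong-scaling-within-a-node speedup model: Amdahl's law with a
    finite local speedup s_local on the accelerable fraction r_acc."""
    return 1.0 / ((1.0 - r_acc) + r_acc / s_local)

r_acc, s_local = 2.0 / 3.0, 9.0
s_model = smodel(r_acc, s_local)   # 27/11 ≈ 2.45, i.e. the ~2.5x on the slide
s_max = 1.0 / (1.0 - r_acc)        # 3.0, the bound as s_local → ∞
```

The bound S_max = 3x makes clear why the local 9x speedup shrinks to 2.5x at the application level: the unaccelerated third of the runtime dominates.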
[Figure: S_model (larger is better) as a function of S_local = 1–35 for accelerable fractions B = 0.900, 0.750, 0.666.]
Weak scalability
Simultaneous doubling of problem size and resources
Left: Poisson, 160 dual Xeon / FX1400 nodes, max. 1.3 B DOF
Right: Linearised elasticity, 64 nodes, max. 0.5 B DOF
[Figures: linear solver time in seconds (smaller is better), 2 CPUs vs. GPU; left: Poisson, 64M DOF (N=8) to 1024M DOF (N=128); right: linearised elasticity, 32M DOF (N=4) to 512M DOF (N=64).]
Results
No loss of weak scalability despite local acceleration
1.3 billion unknowns (no stencil!) on 160 GPUs in less than 50 s
Stationary laminar flow (Navier-Stokes)
$$\begin{pmatrix} A_{11} & A_{12} & B_1 \\ A_{21} & A_{22} & B_2 \\ B_1^T & B_2^T & C \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \\ p \end{pmatrix} = \begin{pmatrix} f_1 \\ f_2 \\ g \end{pmatrix}$$
fixed point iteration
assemble linearised subproblems and solve with
global BiCGStab (reduce initial residual by 1 digit)
block-Schur complement preconditioner
1) approx. solve for velocities with
global MG (V 1+0), additively smoothed by
for all Ωi: solve for u1 with local MG
for all Ωi: solve for u2 with local MG
2) update RHS: d3 = −d3 + B^T (c1, c2)^T
3) scale: c3 = (M^L_p)^{−1} d3
Stationary laminar flow (Navier-Stokes)
Solver configuration
Driven cavity: Jacobi smoother sufficient
Channel flow: ADI-TRIDI smoother required
Speedup analysis
|  | R_acc (L9 / L10) | S_local (L9 / L10) | S_total (L9 / L10) |
|---|---|---|---|
| DC Re250 | 52% / 62% | 9.1x / 24.5x | 1.63x / 2.71x |
| Channel flow | 48% / – | 12.5x / – | 1.76x / – |
Shift away from domination by linear solver
Fraction of FE assembly vs. linear solver in total time, max. problem size:

|  | DC Re250 | Channel |
|---|---|---|
| CPU | 12:88 | 38:59 |
| GPU | 31:67 | 68:28 |
Summary
ScaRC solver scheme
Globally unstructured, locally structured
Tight co-design of discretisation (grid and finite elements) with the multilevel solver
Beneficial on CPUs and GPUs
Numerically and computationally future-proof (some odd ends still to be resolved)
GPU computing
Parallelising strong recursive smoothers
Minimally invasive acceleration with legacy codes
Significant speedups
On a single device: one order of magnitude
On the application level: Reduced due to Amdahl’s Law
Acknowledgements
Collaborative work with
FEAST group (TU Dortmund): Ch. Becker, S.H.M. Buijssen, M. Geveler, D. Göddeke, M. Köster, D. Ribbrock, Th. Rohkämper, S. Turek, H. Wobker, P. Zajac
Robert Strzodka (Max-Planck-Institut für Informatik)
Jamaludin Mohd-Yusof, Patrick McCormick (Los Alamos National Laboratory)
Supported by
DFG: TU 102/22-1, TU 102/22-2
BMBF: HPC Software für skalierbare Parallelrechner: SKALB project 01IH08003D
http://www.mathematik.tu-dortmund.de/~goeddeke