On the scalability of BDDC-based fast parallel iterative solvers for the discrete Stokes problem with continuous pressures
Alberto F. Martín, Santiago Badia
HPSC Team, Centre Internacional de Mètodes Numèrics a l'Enginyeria (CIMNE), Castelldefels, Spain
Universitat Politècnica de Catalunya, Barcelona, Spain
WCCM XI - ECCM V - ECFD VI, Barcelona, July 2014
Outline
1 Introduction
2 BDDC preconditioner and its highly scalable parallel implementation
3 Preconditioners for the Stokes problem
4 Numerical Experiments
5 Conclusions and future work
Introduction
• This work builds on recent efforts∗,† towards the development of highly scalable domain decomposition linear solver codes tailored for FE analysis
• These codes rely on a novel implementation approach to Balancing Domain Decomposition by Constraints (BDDC) preconditioning ideas
• Scalability systematically assessed for 3D elliptic PDEs (Poisson, elasticity) with remarkable results (e.g., weak scaling up to 100K IBM BG/Q cores)
• The final goal is extreme-scale multiphysics solvers+, where highly scalable solver codes for one-physics problems are the critical-for-scalability basic building blocks
• On the road to more complex problems, this talk reports some experiences with BDDC-based parallel solvers for discrete Stokes flows (continuous pressure spaces)
∗ S. Badia, A. F. Martín and J. Principe. A highly scalable parallel implementation of balancing domain decomposition by constraints. SIAM J. Sci. Comput., Vol. 36(2), pp. C190-C218, 2014.
† S. Badia, A. F. Martín and J. Principe. On the scalability of inexact balancing domain decomposition by constraints with overlapped coarse/fine corrections. Submitted, 2014.
+ S. Badia, A. F. Martín and R. Planas. Block recursive LU preconditioners for the thermally coupled incompressible inductionless MHD problem. J. Comput. Phys., Vol. 274, pp. 562-591, 2014.
Domain Decomposition
Given Ω and a FE partition T(Ω) → build a conforming FE space Vh ⊂ H^1_0(Ω) (also for dG)
• Variational problem: find u ∈ Vh such that a(u, v) = (f, v) for any v ∈ Vh, assuming a(·,·) symmetric and coercive (e.g., Laplacian or linear elasticity)
• Algebraic problem: find x ∈ R^n such that Ax = b, where A is a large, sparse, symmetric positive definite matrix (also for nonsymmetric A)
Motivation: efficient exploitation of distributed-memory machines for large-scale FE problems ⇒ domain decomposition framework
[Figure: partitioned mesh with interior DoFs (I) and interface DoFs (Γ) marked]
BDDC: Balancing DD by Constraints
BDDC preconditioner [Dohrmann, …]
• Replace Vh by the sub-assembled space Ṽh (reduced continuity)
• Define the injection I : Ṽh −→ Vh (weight, communicate and add)
• Find xh ∈ Ṽh such that a(xh, vh) = 〈I^t rh, vh〉, ∀vh ∈ Ṽh, and obtain zh = M_BDDC rh = E I xh
• Last correction: E is the harmonic extension of the boundary values, which implies local Dirichlet solvers
[Diagram: spaces Vh and Ṽh related by the maps I and I^t]
• Alternatively, find x̃ such that Ã x̃ = I^t r, and obtain z = M_BDDC r = E I x̃
• Ã is a sub-assembled global matrix (only the red corners in the figure are assembled)
• E = [ 0  −A_II^{-1} A_IΓ ; 0  I_Γ ], with A_II = diag(A_II^{(1)}, A_II^{(2)}, …, A_II^{(P)}) (see the sketch below)
BDDC preconditioning
• Decompose Ṽh as Ṽh = V_F ⊕ V_C, where V_F is the subspace with zero coarse DoF values and V_C ⊥_Ã V_F
• Now the problem splits into a fine-grid (x_F) and a coarse-grid (x_C) correction
Fine-grid correction (x_F)
• Find x_F such that Ã x_F = I^t r, constrained to zero coarse DoF values of x_F
• Equivalent to P independent problems: find x_F^{(i)} ∈ R^{n^{(i)}} such that A^{(i)} x_F^{(i)} = I_i^t r, constrained to zero coarse DoF values of x_F^{(i)} (see the constrained-solve sketch below)
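One standard way to realize such constrained local solves is a saddle-point (Lagrange multiplier) formulation; a minimal sketch, with a hypothetical constraint matrix C_i whose rows extract the coarse DoF values of subdomain i:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def constrained_solve(A_i, C_i, r_i):
    """Solve A_i x = r_i subject to C_i x = 0 (zero coarse DoF values)
    via the Lagrange-multiplier system [[A_i, C_i^T], [C_i, 0]]."""
    n, m = A_i.shape[0], C_i.shape[0]
    K = sp.bmat([[A_i, C_i.T],
                 [C_i, None]], format="csc")   # None = zero block
    rhs = np.concatenate([r_i, np.zeros(m)])
    return spla.splu(K).solve(rhs)[:n]         # discard the multipliers
```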
Coarse-grid correction (x_C)
Computation of V_C = span{Φ_1, Φ_2, …, Φ_nC}
• Find Φ such that Ã Φ = 0, constrained to the coarse DoF values of Φ equal to the identity
• Equivalent to P independent problems: find Φ^{(i)} ∈ R^{n^{(i)}×n_C^{(i)}} such that A^{(i)} Φ^{(i)} = 0, constrained to the coarse DoF values of Φ^{(i)} equal to the identity
BDDC coarse corner function
[Figure: circle domain partitioned into 9 subdomains; Φ_j, one of V_C's basis vectors]
Assembly and solution of the coarse-grid problem:
A_C = assembly(Φ^{(i)t} A^{(i)} Φ^{(i)}),   solve A_C α_C = Φ^t I^t r,   x_C = Φ α_C
The coarse-grid problem is:
• global, i.e., it couples all subdomains
• but much smaller than the original Schur complement S (its size is n_C)
• a potential source of parallel efficiency loss as P grows (a serial sketch of this correction follows)
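A minimal serial sketch of this correction, assuming the local coarse bases Φ^{(i)} are available and `maps` is a hypothetical list of local-to-global coarse DoF index arrays:

```python
import numpy as np

def coarse_correction(A_loc, Phi_loc, maps, r_loc, n_C):
    """Serial illustration: A_C = assembly(Phi_i^T A_i Phi_i),
    solve A_C alpha = assembly(Phi_i^T r_i), then x_C = Phi alpha."""
    A_C = np.zeros((n_C, n_C))
    rhs = np.zeros(n_C)
    for A_i, Phi_i, p_i, r_i in zip(A_loc, Phi_loc, maps, r_loc):
        A_C[np.ix_(p_i, p_i)] += Phi_i.T @ (A_i @ Phi_i)  # local contribution
        rhs[p_i] += Phi_i.T @ r_i                          # restricted residual
    alpha = np.linalg.solve(A_C, rhs)                      # global coarse solve
    return [Phi_i @ alpha[p_i] for Phi_i, p_i in zip(Phi_loc, maps)]
```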
Overlapped implementation
Typical parallel implementation
Highly-scalable parallel implementation
[Figure: execution timelines. Left (typical): all P cores in one main MPI communicator serialize the fine-grid correction (T_F) and the coarse-grid correction (T_C), with idling around the global communication. Right (highly scalable): a fine-grid MPI communicator (cores 1..P_F) runs alongside a separate coarse-grid MPI communicator (cores 1..P_C), with OpenMP- or MPI-based coarse-grid solution.]
Typical:
• All MPI tasks have f-g duties and one/several also have c-g duties
• Computation of f-g/c-g duties is serialized (but they are independent!)
• T_C ∝ O(P^2) → idling ≃ P T_C
• mem ∝ O(P^{4/3}) → memory per core rapidly exceeded
Highly scalable:
• MPI tasks have either f-g OR c-g duties
• f-g/c-g corrections OVERLAPPED in time (asynchronous)
• c-g duties can be MASKED by f-g duties
• MPI-based or OpenMP-based (this work) solutions are possible for the c-g correction (see the communicator-split sketch below)
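A minimal sketch, assuming mpi4py, of how the main communicator could be split into disjoint fine-grid and coarse-grid communicators so that both corrections can run overlapped; the choice P_C = 1 and the rank assignment are illustrative, not FEMPAR's actual scheme:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
P = comm.Get_size()
P_C = 1                                   # ranks devoted to coarse-grid duties
is_coarse = comm.Get_rank() >= P - P_C    # e.g., the last P_C ranks

# Disjoint sub-communicators: every rank belongs to exactly one, so
# fine-grid and coarse-grid corrections can proceed asynchronously.
subcomm = comm.Split(color=1 if is_coarse else 0, key=comm.Get_rank())

if is_coarse:
    pass  # assemble/factorize and solve the coarse problem (multi-threaded)
else:
    pass  # local fine-grid duties: Dirichlet/Neumann solves, neighbor exchange
```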
The Stokes system
The continuous stationary problem:
−ν∆u +∇p = f,
∇ · u = 0.
The discrete problem (e.g., using Galerkin/stabilized FEM):
[ A  B^T ; B  −C ] [ u ; p ] = [ f ; g ],
where:
• u and p are the velocity and pressure field unknowns
• A is a vector-Laplacian matrix (SPD)
• C is a symmetric positive semi-definite stabilization matrix (C = 0 for inf-sup stable FEs; see the assembly sketch below)
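For concreteness, a minimal sketch (assuming SciPy; A, B and C are placeholder sparse blocks) of forming the monolithic operator:

```python
import scipy.sparse as sp

def stokes_operator(A, B, C=None):
    """Assemble [[A, B^T], [B, -C]]; C=None covers the inf-sup stable
    case C = 0 (bmat treats None as a zero block)."""
    return sp.bmat([[A, B.T],
                    [B, -C if C is not None else None]], format="csr")
```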
Preconditioning the discrete Stokes problem
Approach 1: Block preconditioning
Consider block triangular matrices of the form
M = [ A  B^T ; 0  −S ]  with  M ≈ [ A  B^T ; B  −C ],
where S = C + B A^{-1} B^T, the (negated) pressure Schur complement, is a dense SPD matrix
Solve Mz = r:
1: Solve S z_p = −r_p
2: Solve A z_u = r_u − B^T z_p
Some properties of M:
• Easy-to-prove optimality: condition number bounded irrespective of h (see the eigenvalue check below)
• 1 vector-Laplacian-type solve and 1 pressure Schur complement solve per iteration
• Direct solution of linear systems with A and S is computationally prohibitive
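A small numerical check of the optimality claim, using random dense test blocks (illustrative only; real computations never form A^{-1} or S explicitly): with the exact M, the preconditioned operator K M^{-1} = [[I, 0], [B A^{-1}, I]] has all eigenvalues equal to 1, so a minimal-residual method converges in at most two iterations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_u, n_p = 30, 10
R = rng.standard_normal((n_u, n_u))
A = R @ R.T + n_u * np.eye(n_u)                      # SPD velocity block
B = rng.standard_normal((n_p, n_u))
C = 0.1 * np.eye(n_p)                                # SPD stabilization block
S = C + B @ np.linalg.inv(A) @ B.T                   # pressure Schur complement

K = np.block([[A, B.T], [B, -C]])
M = np.block([[A, B.T], [np.zeros((n_p, n_u)), -S]])

eigs = np.linalg.eigvals(K @ np.linalg.inv(M))
print(np.allclose(eigs, 1.0))                        # True: all eigenvalues are 1
```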
Preconditioning the discrete Stokes problem
Approach 1: Block preconditioning
Choosing “suitable approximations” (spectrally equivalent) M_A ≈ A and M_S ≈ S,
M = [ M_A  B^T ; 0  −M_S ]  with  M ≈ [ A  B^T ; B  −C ],
condition number bounds irrespective of h can still be attained
Cheap-to-invert, highly scalable choices at our disposal:
• M_A = M_A^BDDC (BDDC preconditioner + exact, i.e., sparse direct, solvers) or M_A = M̃_A^BDDC (BDDC preconditioner + inexact, i.e., AMG, solvers)
• M_S = D_p, with D_p the lumped pressure mass matrix (i.e., a diagonal matrix)
Solve Mz = r:
1: Solve D_p z_p = −r_p
2: Solve M_A^BDDC z_u = r_u − B^T z_p (or M̃_A^BDDC z_u = r_u − B^T z_p)
Some properties of M:
• Easy-to-prove optimality: condition number bounded irrespective of h
• 1 BDDC preconditioner application and 1 diagonal preconditioner application per iteration
• Application of M_A^BDDC or M̃_A^BDDC, and of D_p, is cheap and highly scalable (a sketch of the apply follows)
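A minimal sketch of the preconditioner application inside a Krylov solver (assuming SciPy; `apply_bddc` is a hypothetical callable standing in for the action of M_A^BDDC or M̃_A^BDDC):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator

def block_triangular_prec(apply_bddc, B, Dp_diag, n_u, n_p):
    """z = M^{-1} r for M = [[M_A, B^T], [0, -D_p]]: first the (diagonal)
    pressure solve, then one BDDC application on the velocity block."""
    def matvec(r):
        r_u, r_p = r[:n_u], r[n_u:]
        z_p = -r_p / Dp_diag                  # D_p z_p = -r_p (lumped mass)
        z_u = apply_bddc(r_u - B.T @ z_p)     # M_A z_u = r_u - B^T z_p
        return np.concatenate([z_u, z_p])
    return LinearOperator((n_u + n_p, n_u + n_p), matvec=matvec)
```

The returned operator could then be passed, e.g., as the `M` argument of `scipy.sparse.linalg.gmres`.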
Preconditioning the discrete Stokes problem
Approach 2: Monolithic BDDC
• Apply the BDDC methodology to the Stokes problem
• Theory not fully understood; available only for discontinuous pressures [Li & Widlund, 2006]
Well-posedness:† the (weak) inf-sup condition is not obvious for the pair (Ṽh, Q̃h)
• BDDC(c/ce/cef) are well-posed preconditioners
• Proof inspired by the nonconforming Crouzeix-Raviart FE
• Holds both for inf-sup stable and stabilized formulations with continuous pressures
The BDDC problem: find uh ∈ Ṽh and ph ∈ Q̃h such that
ν(∇uh, ∇vh) − (ph, ∇·vh) − (qh, ∇·uh) = 〈rh, vh〉
† S. Badia and A. F. Martín. Balancing domain decomposition preconditioning for the discrete Stokes problem with continuous pressures. In preparation, 2014.
FEMPAR Software
FEMPAR (in-house developed HPC software, free software under GNU-GPL): Finite Element Multiphysics PARallel software
• Massively parallel software for FE simulation of multiphysics PDEs
• Scalable preconditioning of the fully coupled and implicit system via block preconditioning techniques
• Scalable preconditioning for one-physics (elliptic) PDEs relies on BDDC → hybrid MPI/OpenMP implementation
• Relies on highly efficient vendor implementations of the dense/sparse BLAS (Intel MKL, IBM ESSL, etc.)
• Interfaces to external multi-threaded sparse direct solvers (PARDISO, HSL MA87, etc.) and serial AMG preconditioners (HSL MI20)
Weak scalability for 3D Stokes
Target machine: HELIOS@IFERC-CSC, 4,410 bullx B510 compute blades (2 Intel Xeon E5-2680 8-core CPUs each)
• Target problem: Stokes on Ω = [0, 1]^3 (lid-driven cavity problem)
• Uniform global mesh (Q1-Q1 FEs, ASGS-stabilized) + Uniform partition
• 8, 432, …, 16,000 cores for fine duties
• Entire 16-core blade for coarse duties
• Three different preconditioners: (a) mono, (b) blk-ex, (c) blk-inex
• Direct (a & b) or AMG(1) (c) solves for Dirichlet/Neumann/coarse problems
• Gradually larger local problem sizes H/h = 5, 10, …, 35
Weak scaling of BDDC-based solvers
Stokes problem, H/h = 20 (20×40×40 = 32K FEs/core)
[Figure: # of RGMRES iterations vs. #cores (128 up to 16K); left panel: BDDC(c+e)-based solvers, right panel: BDDC(c+e+f)-based solvers; curves: mono, blk-ex, blk-inex]
• mono requires c+e+f constraints to reach an asymptotically constant # of iterations
• Also observed by [Šístek et al., 2011] for Taylor-Hood FEs (inf-sup stable)
Weak scaling of BDDC-based solvers
Stokes problem, H/h = 20 (20×40×40 = 32K FEs/core)
[Figure: total wall clock time in secs. vs. #cores (128 up to 16K); left panel: BDDC(c+e)-based solvers, right panel: BDDC(c+e+f)-based solvers; curves: mono, blk-ex, blk-inex]
• blk-ex and blk-inex much faster than mono
• mono-BDDC(cef) weak scalability degrades beyond 8K cores
• Reason: time spent on coarse duties exceeds that spent on fine duties
• Bad news for mono: H/h = 20 is the largest problem size that fits into memory!
Weak scaling of BDDC(c+e)-based solvers
Stokes problem, H/h = 25 (25×50×50 = 62K FEs/core)
[Figure: # of RGMRES iterations (left) and total wall clock time in secs. (right) vs. #cores (128 up to 16K) for blk-ex and blk-inex]
BDDC(c+e)-based solvers
• blk-ex and blk-inex weakly scalable up to 16K cores
• blk-inex becomes faster than blk-ex for “sufficiently large” H/h
• Reason: O(n) complexity of AMG vs. O(n^2) of sparse direct solvers
Weak scaling of BDDC(c+e)-based solvers
Stokes problem, H/h = 30 (30×60×60 = 108K FEs/core)
[Figure: # of RGMRES iterations (left) and total wall clock time in secs. (right) vs. #cores (128 up to 16K) for blk-ex and blk-inex]
BDDC(c+e)-based solvers
• blk-ex and blk-inex weakly scalable up to 16K cores
• blk-inex becomes faster than blk-ex for “sufficiently large” H/h
• Reason: O(n) complexity of AMG vs. O(n^2) of sparse direct solvers
Memory consumption

Memory consumption on fine-grid MPI tasks:

            Local problem size (FEs/core)
Precond.    0.5K     4K       13.5K    32K      62.5K    108K     171K
mono        155.0M   320.3M   973.2M   2.7G     †        †        †
blk-ex      97.3M    149.1M   336.9M   809.9M   1.7G     3.3G     †
blk-inex    ∗        ∗        ∗        465.5M   823.5M   1.3G     2.1G

∗: experiment not performed   †: out of memory

Memory consumption on the coarse-grid MPI task (c+e constraints):

            # subdomains
Precond.    128      1K       3.5K     8K       16K
mono        45.3M    258.0M   998.2M   2.7G     6.1G
blk-ex      34.7M    130.1M   419.8M   1.0G     2.1G
blk-inex    30.9M    109.8M   326.3M   757.4M   1.4G
Conclusions and future work
Conclusions:
• 3 different parallel preconditioners for the discrete Stokes problem evaluated: mono, blk-ex and blk-inex
• Block preconditioning is faster and more scalable than the monolithic approach
• For “large enough” H/h, blk-inex achieves the best trade-off between preconditioner quality and preconditioner set-up/application time
Future work:
• Development/evaluation of inexact BDDC variants for the monolithic problem
• Multilevel extension + MPI-based coarse-grid implementation
• Potential to scale easily up to O(10^6) cores with three levels (on the way)
References
S. Badia, A. F. Martín and J. Principe. Enhanced balancing Neumann-Neumann preconditioning in computational fluid and solid mechanics. International Journal for Numerical Methods in Engineering, Vol. 96(4), pp. 203-230, 2013.
S. Badia, A. F. Martín and J. Principe. Implementation and scalability analysis of balancing domain decomposition methods. Archives of Computational Methods in Engineering, Vol. 20(3), pp. 239-262, 2013.
S. Badia, A. F. Martín and J. Principe. A highly scalable parallel implementation of balancing domain decomposition by constraints. SIAM Journal on Scientific Computing, Vol. 36(2), pp. C190-C218, 2014.
S. Badia, A. F. Martín and R. Planas. Block recursive LU preconditioners for the thermally coupled incompressible inductionless MHD problem. Journal of Computational Physics, Vol. 274, pp. 562-591, 2014.
S. Badia, A. F. Martín and J. Principe. On the scalability of inexact balancing domain decomposition by constraints with overlapped coarse/fine corrections. Submitted, 2014.
S. Badia and A. F. Martín. Balancing domain decomposition preconditioning for the discrete Stokes problem with continuous pressures. In preparation, 2014.
Preprints available at http://badia.rmee.upc.edu/sbadia_ar.html
HPSC team: https://web.cimne.upc.edu/groups/comfus/
Work funded by the European Research Council under Starting Grant 258443 - COMFUS: Computational Methods for Fusion Technology