Scalable Solvers and Software for PDE Applications
The Pennsylvania State University, 29 April 2003
David E. Keyes
Center for Computational Science, Old Dominion University
&
Institute for Scientific Computing Research, Lawrence Livermore National Laboratory
PennState Seminar, 29 April 2003
Happy Poincaré’s Birthday!
Born: 29 April 1854, Nancy
Fundamental contributions to topology, analysis, number theory, potential theory, quantum theory, fluid mechanics, the special theory of relativity, and the philosophy of science
Académie des Sciences, 1887 (President, 1906)
Fellow, Royal Society, 1894
Died: 17 July 1912, Paris
“The last universalist in mathematics” (anon.)
“It is by logic that we prove; it is by intuition that we invent.” (Henri Poincaré)
Plan of presentation
  Imperative of “optimal” algorithms for terascale computing
  Basic domain decomposition and multilevel algorithmic concepts
  Examples of applications
  Example of Bell-prize winning solver performance on ASCI platforms
  Conclusions and outlook
Motivation: optimality
Convergence rate nearly independent of discretization parameters
  Multilevel schemes for rapid linear convergence of linear problems
  Newton-like schemes for quadratic convergence of nonlinear problems
Convergence rate as independent as possible of physical parameters
  Continuation schemes
  Physics-based preconditioning
[Figure: time to solution (0 to 200) versus problem size (increasing with number of processors, 1 to 1000), contrasting an “unscalable” and a “scalable” curve; steel/rubber composite, parallel multigrid, c/o M. Adams, Berkeley-Sandia]
The solver is a key part, but not the only part, of the simulation that needs to be scalable
Why optimal algorithms?
The more powerful the computer, the greater the importance of optimality
Example:
  Suppose Alg1 solves a problem in time $CN^2$, where $N$ is the input size
  Suppose Alg2 solves the same problem in time $CN$
  Suppose that the machine on which Alg1 and Alg2 have been parallelized to run has 10,000 processors
In constant time (compared to serial), Alg1 can run a problem 100X larger, whereas Alg2 can run a problem 10,000X larger
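The arithmetic behind this example can be checked with a short sketch. It assumes perfect parallel efficiency, so that P processors deliver P times the work in the same wall-clock time; the function name `problem_growth` and its `exponent` parameter are illustrative and not from the talk.

```python
def problem_growth(processors, exponent):
    """Factor by which N can grow at constant wall-clock time.

    Assumes cost C * N**exponent and perfect parallel speedup, so that
    N_new**exponent = processors * N_old**exponent.
    """
    return processors ** (1.0 / exponent)

alg1 = problem_growth(10_000, 2)   # O(N^2) algorithm
alg2 = problem_growth(10_000, 1)   # O(N) algorithm
print(round(alg1), round(alg2))    # prints: 100 10000
```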
Why optimal?, cont.
Alternatively, filling the machine’s memory, Alg1 requires 100X time, whereas Alg2 runs in constant time
Is 10,000 processors a reasonable expectation?
  Yes, we have it today (ASCI White (IBM), Red Storm (Cray))!
Could computational scientists really use 10,000X scaling?
  Of course; we are approximating the continuum
  A grid for weather prediction allows points every 1km versus every 100km on the earth’s surface
  In 2D 10,000X disappears fast; in 3D even faster
However, these machines are expensive (Earth Simulator is $0.5B, plus ongoing operating costs), and optimal algorithms are the only algorithms that we can afford to run on them
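Decomposition strategies for $Lu = f$ in $\Omega$
Operator decomposition: $L = \sum_k L_k$
Function space decomposition: $f = \sum_k f_k \Phi_k$, $u = \sum_k u_k \Phi_k$
Domain decomposition: $\Omega = \bigcup_k \Omega_k$
Consider, e.g., the implicitly discretized parabolic case
$\left[\frac{I}{\tau} + L_x + L_y\right] u^{(k+1)} = \frac{I}{\tau}\, u^{(k)} + f$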
Operator decomposition
Consider ADI
$\left[\frac{I}{\tau/2} + L_x\right] u^{(k+1/2)} = \left[\frac{I}{\tau/2} - L_y\right] u^{(k)} + f$
$\left[\frac{I}{\tau/2} + L_y\right] u^{(k+1)} = \left[\frac{I}{\tau/2} - L_x\right] u^{(k+1/2)} + f$
Iteration matrix consists of four sequential (“multiplicative”) substeps per timestep
  two sparse matrix-vector multiplies
  two sets of unidirectional bandsolves
Parallelism within each substep
But global data exchanges between bandsolve substeps
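Function space decomposition
Consider a spectral Galerkin method
$u(x,y,t) = \sum_{j=1}^{N} a_j(t)\,\Phi_j(x,y)$
$\frac{d}{dt}(\Phi_i, u) + (\Phi_i, L u) = (\Phi_i, f), \quad i = 1, \ldots, N$
$\sum_j (\Phi_i, \Phi_j)\,\frac{da_j}{dt} + \sum_j (\Phi_i, L\Phi_j)\, a_j = (\Phi_i, f), \quad i = 1, \ldots, N$
$\frac{da}{dt} = -M^{-1} K a + M^{-1} f$, where $M \equiv [(\Phi_i, \Phi_j)]$, $K \equiv [(\Phi_i, L\Phi_j)]$
System of ordinary differential equations
Perhaps $M$, $K$ are diagonal matrices
Perfect parallelism across spectral index
But global data exchanges to transform back to physical variables at each step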
Domain decomposition
Consider restriction and extension operators for subdomains $\Omega_i$, $R_i$, $R_i^T$, and for possible coarse grid, $R_0$, $R_0^T$
Replace discretized $Au = f$ with $B^{-1}Au = B^{-1}f$, where
$B^{-1} = R_0^T A_0^{-1} R_0 + \sum_i R_i^T A_i^{-1} R_i$, with $A_i = R_i A R_i^T$
Solve by a Krylov method, e.g., CG
Matrix-vector multiplies with
  parallelism on each subdomain
  nearest-neighbor exchanges, global reductions
  possible small global system (not needed for parabolic case)
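As an illustration of the Krylov-Schwarz idea, here is a minimal serial sketch of conjugate gradients preconditioned by one-level additive Schwarz, $\sum_i R_i^T A_i^{-1} R_i$ with no coarse grid. The 1D model matrix tridiag(-1, 2, -1), the two overlapping index ranges, and all function names are assumptions chosen for illustration; nothing here is taken from PETSc or from the talk's software.

```python
def apply_A(u):
    """y = A u for the 1D model matrix A = tridiag(-1, 2, -1)."""
    n = len(u)
    return [2.0 * u[i]
            - (u[i - 1] if i > 0 else 0.0)
            - (u[i + 1] if i < n - 1 else 0.0) for i in range(n)]

def local_solve(rhs):
    """Thomas algorithm for tridiag(-1, 2, -1): A_i^{-1} on a contiguous block."""
    m = len(rhs)
    c, d = [0.0] * m, [0.0] * m
    c[0], d[0] = -0.5, rhs[0] / 2.0
    for k in range(1, m):
        denom = 2.0 + c[k - 1]
        c[k] = -1.0 / denom
        d[k] = (rhs[k] + d[k - 1]) / denom
    x = [0.0] * m
    x[-1] = d[-1]
    for k in range(m - 2, -1, -1):
        x[k] = d[k] - c[k] * x[k + 1]
    return x

def additive_schwarz(r, blocks):
    """z = sum_i R_i^T A_i^{-1} R_i r (one level, no coarse grid)."""
    z = [0.0] * len(r)
    for lo, hi in blocks:
        for k, val in enumerate(local_solve(r[lo:hi])):   # gather, local solve
            z[lo + k] += val                               # scatter-add
    return z

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pcg(f, blocks, tol=1e-10, max_it=200):
    """Preconditioned CG for A u = f; returns (u, iterations used)."""
    u = [0.0] * len(f)
    r = list(f)
    z = additive_schwarz(r, blocks)
    p = list(z)
    rz = dot(r, z)
    for it in range(1, max_it + 1):
        Ap = apply_A(p)
        alpha = rz / dot(p, Ap)
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            return u, it
        z = additive_schwarz(r, blocks)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return u, max_it

n = 16
blocks = [(0, 10), (6, 16)]          # two overlapping index ranges (assumed)
u, iterations = pcg([1.0] * n, blocks)
```

In a parallel code each block would live on its own processor, the gather and scatter-add would become nearest-neighbor exchanges, and the dot products would become global reductions.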
Comparison
Operator decomposition (ADI)
  natural row-based assignment requires all-to-all, bulk data exchanges in each step (for transpose)
Function space decomposition (Fourier)
  natural mode-based assignment requires all-to-all, bulk data exchanges in each step (for transform)
Domain decomposition (Schwarz)
  natural domain-based assignment requires local, nearest neighbor data exchanges, global reductions, and optional small global problem
Theoretical scaling of domain decomposition (for three common network topologies)
With logarithmic-time (hypercube- or tree-based) global reductions and scalable nearest neighbor interconnects:
  optimal number of processors scales linearly with problem size (“scalable”, assumes one subdomain per processor)
With power-law-time (3D torus-based) global reductions and scalable nearest neighbor interconnects:
  optimal number of processors scales as three-fourths power of problem size (“almost scalable”)
With linear-time (common bus) network:
  optimal number of processors scales as one-fourth power of problem size (*not* scalable)
  bad news for conventional Beowulf clusters, but see 2000 & 2001 Bell Prize “price-performance awards” using multiple commodity NICs per Beowulf node!
Three Basic Concepts
  Iterative correction
  Schwarz preconditioning
  Schur preconditioning
Some “Advanced” Concepts
  Polynomial combinations of Schwarz projections
  Schwarz-Schur combinations
    Schwarz on Schur-reduced system
    Schwarz inside Schur-reduced system
  Nonlinear Schwarz
[Slide annotations: “recent”, “optimization”]
Iterative correction
The most basic idea in iterative methods:
$u \leftarrow u + B^{-1}(f - Au)$
Evaluate residual accurately, but solve approximately, where $B^{-1}$ is an approximate inverse to $A$
A sequence of complementary solves can be used, e.g., with $B_1$ first and then $B_2$ one has
$u \leftarrow u + [B_1^{-1} + B_2^{-1} - B_2^{-1} A B_1^{-1}](f - Au)$
Optimal polynomials of $(B^{-1}A)$ lead to various preconditioned Krylov methods
Scale recurrence, e.g., with $B_2^{-1} = R^T (R A R^T)^{-1} R$, leads to multilevel methods
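A minimal sketch of iterative correction, $u \leftarrow u + B^{-1}(f - Au)$, follows. It assumes the simplest approximate inverse, $B = \mathrm{diag}(A)$ (Jacobi), and a small hypothetical 3-by-3 system chosen so that the exact solution is known; the function names are illustrative.

```python
def matvec(A, u):
    return [sum(a * x for a, x in zip(row, u)) for row in A]

def iterative_correction(A, f, apply_B_inv, sweeps):
    """Repeat u <- u + B^{-1} (f - A u), starting from u = 0."""
    u = [0.0] * len(f)
    for _ in range(sweeps):
        r = [fi - ai for fi, ai in zip(f, matvec(A, u))]    # residual evaluated accurately
        u = [ui + di for ui, di in zip(u, apply_B_inv(r))]  # solve done approximately
    return u

A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
f = [3.0, 2.0, 3.0]                  # chosen so that the exact solution is [1, 1, 1]

def jacobi(r):
    """B = diag(A): an assumed, deliberately crude approximate inverse."""
    return [ri / A[i][i] for i, ri in enumerate(r)]

u = iterative_correction(A, f, jacobi, sweeps=50)
```

Swapping `jacobi` for a better `apply_B_inv` changes only the convergence rate; the accurately evaluated residual is what fixes the answer the iteration converges to.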
Multilevel preconditioning
A Multigrid V-cycle
[Figure: V-cycle schematic running from the finest grid down to the first coarse grid and back, with a smoother applied on each level]
Restriction: transfer from fine to coarse grid
Coarser grid has fewer cells (less work & storage)
Recursively apply this idea until we have an easy problem to solve
Prolongation: transfer from coarse to fine grid
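The V-cycle described here can be sketched in a few lines for the 1D model problem $-u'' = f$ with zero boundary values. The choices below (weighted-Jacobi smoother, full-weighting restriction, linear-interpolation prolongation, rediscretized coarse operators, two pre- and post-smoothing sweeps) are common textbook assumptions, not details taken from the talk or from Hypre.

```python
def residual(u, f, h):
    """r = f - A u for A = (1/h^2) tridiag(-1, 2, -1) with zero boundary values."""
    n = len(u)
    return [f[i] - (2.0 * u[i]
                    - (u[i - 1] if i > 0 else 0.0)
                    - (u[i + 1] if i < n - 1 else 0.0)) / (h * h)
            for i in range(n)]

def smooth(u, f, h, sweeps=2, omega=2.0 / 3.0):
    """Weighted-Jacobi smoother: u <- u + omega * D^{-1} r with D = 2/h^2."""
    for _ in range(sweeps):
        r = residual(u, f, h)
        u = [ui + omega * 0.5 * h * h * ri for ui, ri in zip(u, r)]
    return u

def v_cycle(u, f, h):
    n = len(u)
    if n == 1:                                   # easy problem: solve exactly
        return [0.5 * h * h * f[0]]
    u = smooth(u, f, h)                          # pre-smoothing
    r = residual(u, f, h)
    nc = (n - 1) // 2
    # Restriction (full weighting): fine residual -> coarse right-hand side.
    rc = [0.25 * (r[2 * j] + 2.0 * r[2 * j + 1] + r[2 * j + 2]) for j in range(nc)]
    ec = v_cycle([0.0] * nc, rc, 2.0 * h)        # recurse on the coarser grid
    # Prolongation (linear interpolation): coarse correction -> fine grid.
    e = [0.0] * n
    for j in range(nc):
        e[2 * j + 1] = ec[j]
    for j in range(nc + 1):
        left = ec[j - 1] if j > 0 else 0.0
        right = ec[j] if j < nc else 0.0
        e[2 * j] = 0.5 * (left + right)
    u = [ui + ei for ui, ei in zip(u, e)]
    return smooth(u, f, h)                       # post-smoothing

n = 31                                           # 2**5 - 1 interior points
h = 1.0 / (n + 1)
f = [1.0] * n                                    # -u'' = 1, u(0) = u(1) = 0
u = [0.0] * n
for _ in range(15):
    u = v_cycle(u, f, h)
```

Each level does work proportional to its number of cells, so the total cost of one cycle is a small constant multiple of the finest-grid work.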
Example of Hypre’s scaled efficiency
[Figure: PFMG-CG on Red (40x40x40); scaled efficiency (0 to 1) versus procs / problem size (0 to 4000 processors), with separate curves for Setup and Solve; problem size ranges from 64K DOFs to 200M DOFs]
Schwarz preconditioning
Given $A x = b$, partition $x$ into subvectors, corresponding to subdomains $\Omega_i$ of the domain $\Omega$ of the PDE, nonempty, possibly overlapping, whose union is all of the elements of $x \in \mathbb{R}^n$
Let Boolean rectangular matrix $R_i$ extract the $i$-th subset of $x$: $x_i = R_i x$
Let $A_i = R_i A R_i^T$ and $B^{-1} = \sum_i R_i^T A_i^{-1} R_i$
The Boolean matrices are gather/scatter operators, mapping between a global vector and its subdomain support
SPMD parallelism w/domain decomposition
Partitioning of the grid induces block structure on the Jacobian
[Figure: grid partitioned into subdomains 1, 2, 3; the block row $A_{21}$, $A_{22}$, $A_{23}$ holds the rows assigned to proc “2”]
Iteration count estimates from the Schwarz theory
In terms of $N$ and $P$, where for $d$-dimensional isotropic problems, $N = h^{-d}$ and $P = H^{-d}$, for mesh parameter $h$ and subdomain diameter $H$, iteration counts may be estimated as follows:

Preconditioning Type          in 2D              in 3D
Point Jacobi                  O(N^{1/2})         O(N^{1/3})
Domain Jacobi (δ=0)           O((NP)^{1/4})      O((NP)^{1/6})
1-level Additive Schwarz      O(P^{1/2})         O(P^{1/3})
2-level Additive Schwarz      O(1)               O(1)

Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square-root of the condition number of the Schwarz-preconditioned system
Comments on the Schwarz theory
Basic Schwarz estimates are for:
  self-adjoint operators with smooth coefficients
  positive definite operators
  exact subdomain solves, $A_i^{-1}$
  two-way overlapping with $R_i$, $R_i^T$
  generous overlap, $\delta = O(H)$ (otherwise 2-level result is $O(1 + H/\delta)$)
Extensible to:
  nonself-adjointness (e.g., convection) and jumping coefficients
  indefiniteness (e.g., wave Helmholtz)
  inexact subdomain solves
  one-way overlap communication (“restricted additive Schwarz”)
  small overlap
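Schur preconditioning
Given a partition
$\begin{bmatrix} A_{ii} & A_{i\Gamma} \\ A_{\Gamma i} & A_{\Gamma\Gamma} \end{bmatrix} \begin{bmatrix} u_i \\ u_\Gamma \end{bmatrix} = \begin{bmatrix} f_i \\ f_\Gamma \end{bmatrix}$
Condense: $S u_\Gamma = g$, with $S \equiv A_{\Gamma\Gamma} - A_{\Gamma i} A_{ii}^{-1} A_{i\Gamma}$ and $g \equiv f_\Gamma - A_{\Gamma i} A_{ii}^{-1} f_i$
Let $M$ be a good preconditioner for $S$
Then $\begin{bmatrix} A_{ii} & 0 \\ A_{\Gamma i} & I \end{bmatrix} \begin{bmatrix} I & A_{ii}^{-1} A_{i\Gamma} \\ 0 & M \end{bmatrix}$ is a preconditioner for $A$
Moreover, solves with $A_{ii}$ may be done approximately if all degrees of freedom are retained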
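The condensation step can be traced by hand on a tiny system. The sketch below assumes two interior unknowns (one per subdomain, decoupled from each other) and a single interface unknown; all numerical values are hypothetical and chosen only so the arithmetic is exact.

```python
# Two interior unknowns coupled only through one interface unknown (Gamma).
A_ii = [2.0, 2.0]          # diagonal interior block: the subdomains are decoupled
A_iG = [-1.0, -1.0]        # interior-to-interface coupling (a column)
A_Gi = [-1.0, -1.0]        # interface-to-interior coupling (a row)
A_GG = 2.0
f_i, f_G = [1.0, 1.0], 1.0

# Condense: S = A_GG - A_Gi A_ii^{-1} A_iG,  g = f_G - A_Gi A_ii^{-1} f_i
S = A_GG - sum(A_Gi[k] * A_iG[k] / A_ii[k] for k in range(2))
g = f_G - sum(A_Gi[k] * f_i[k] / A_ii[k] for k in range(2))

u_G = g / S                                                   # interface solve
u_i = [(f_i[k] - A_iG[k] * u_G) / A_ii[k] for k in range(2)]  # independent back-solves
```

The two interior solves are independent of each other, which is where the parallelism comes from; only the interface solve with S is global.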
Schwarz polynomials
Polynomials of Schwarz projections that are combinations of additive and multiplicative may be appropriate for certain implementations
We may solve the fine subdomains concurrently and follow with a coarse grid (redundantly/cooperatively):
$u \leftarrow u + \sum_i B_i^{-1}(f - Au)$
$u \leftarrow u + B_0^{-1}(f - Au)$
This leads to algorithm “Hybrid II” in S-B-G’96:
$B^{-1} = B_0^{-1} + (I - B_0^{-1}A)\left(\sum_i B_i^{-1}\right)$
Convenient for “SPMD” (single prog/multiple data)
Schwarz-on-Schur
Preconditioning the Schur complement is complex in and of itself; Schwarz can be used on the reduced problem
“Neumann-Neumann” alg:
$M^{-1} = \sum_i D_i R_i^T S_i^{-1} R_i D_i$
“Balancing Neumann-Neumann” alg:
$M^{-1} = M_0^{-1} + (I - M_0^{-1} S)\left(\sum_i D_i R_i^T S_i^{-1} R_i D_i\right)(I - S M_0^{-1})$
Multigrid on the Schur complement
[Figure: subdomain sketch with labels $S_i$ and $D_i$]
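Schwarz-inside-Schur
Consider Newton’s method for solving the nonlinear rootfinding problem derived from the necessary conditions for constrained optimization
Constraint: $f(x,u) = 0$; $x \in \mathbb{R}^N$; $u \in \mathbb{R}^M$; $f \in \mathbb{R}^N$
Objective: $\min_u \phi(x,u)$; $\phi \in \mathbb{R}$
Lagrangian: $\phi(x,u) + \lambda^T f(x,u)$; $\lambda \in \mathbb{R}^N$
Form the gradient of the Lagrangian with respect to each of $x$, $u$, and $\lambda$:
$\phi_x(x,u) + \lambda^T f_x(x,u) = 0$
$\phi_u(x,u) + \lambda^T f_u(x,u) = 0$
$f(x,u) = 0$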
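Schwarz-inside-Schur
Equality constrained optimization leads to the KKT system for states $x$, designs $u$, and multipliers $\lambda$
$\begin{bmatrix} W_{xx} & W_{ux}^T & J_x^T \\ W_{ux} & W_{uu} & J_u^T \\ J_x & J_u & 0 \end{bmatrix} \begin{bmatrix} \delta x \\ \delta u \\ \delta\lambda \end{bmatrix} = -\begin{bmatrix} g_x \\ g_u \\ f \end{bmatrix}$
Newton Reduced SQP solves the Schur complement system $H\,\delta u = g$, where $H$ is the reduced Hessian
$H = W_{uu} - J_u^T J_x^{-T} W_{ux}^T + (J_u^T J_x^{-T} W_{xx} - W_{ux})\, J_x^{-1} J_u$
$g = -g_u + J_u^T J_x^{-T} g_x - (J_u^T J_x^{-T} W_{xx} - W_{ux})\, J_x^{-1} f$
Then
$J_x\,\delta x = -f - J_u\,\delta u$
$J_x^T\,\delta\lambda = -g_x - W_{xx}\,\delta x - W_{ux}^T\,\delta u$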
Schwarz-inside-Schur, cont.
Problems
  $J_x$ is the Jacobian of a PDE: huge!
  The $W$ blocks involve Hessians of objective and constraints: second derivatives and huge
  $H$ is unreasonable to form, store, or invert
Solutions
  Use Schur preconditioning on full system
  Form forward action of Hessians by automatic differentiation (vector-to-vector map)
  Form approximate inverse action of state Jacobian and its transpose by Schwarz
Example of PDE-constrained Optimization
Lagrange-Newton-Krylov-Schur implemented in Veltisto/PETSc
Optimal control of laminar viscous flow
  optimization variables are surface suction/injection
  objective is minimum drag
  700,000 states; 4,000 controls
  128 Cray T3E processors
  ~5 hrs for optimal solution (~1 hr for analysis)
[Figure: wing tip vortices, no control (l); optimal control (r); optimal boundary controls shown as velocity vectors]
c/o G. Biros and O. Ghattas
www.cs.nyu.edu/~biros/veltisto/
Nonlinear Schwarz preconditioning
Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free
It replaces $F(u) = 0$ with a new nonlinear system possessing the same root, $\Phi(u) = 0$
Define a correction $\delta_i(u)$ to the $i$-th partition (e.g., subdomain) of the solution vector by solving the following local nonlinear system:
$R_i F(u + \delta_i(u)) = 0$
where $\delta_i(u) \in \mathbb{R}^n$ is nonzero only in the components of the $i$-th partition
Then sum the corrections: $\Phi(u) = \sum_i \delta_i(u)$
Nonlinear Schwarz, cont.
It is simple to prove that if the Jacobian of $F(u)$ is nonsingular in a neighborhood of the desired root then $\Phi(u) = 0$ and $F(u) = 0$ have the same unique root
To lead to a Jacobian-free Newton-Krylov algorithm we need to be able to evaluate for any $u, v \in \mathbb{R}^n$:
  The residual $\Phi(u) = \sum_i \delta_i(u)$
  The Jacobian-vector product $\Phi'(u)\,v$
Remarkably (Cai-Keyes, 2000), it can be shown that
$\Phi'(u)\,v \approx -\sum_i (R_i^T J_i^{-1} R_i)\, J v$
where $J = F'(u)$ and $J_i = R_i J R_i^T$
All required actions are available in terms of $F(u)$!
Example of nonlinear Schwarz
[Figure: two panels of convergence behavior over a range of Reynolds numbers (Re)]
Newton’s method: difficulty at critical Re; stagnation beyond critical Re
Additive Schwarz Preconditioned Inexact Newton (ASPIN): convergence for all Re
“Unreasonable effectiveness” of Schwarz
When does the sum of partial inverses equal the inverse of the sums? When the decomposition is right!
Let $\{r_i\}$ be a complete set of orthonormal row eigenvectors for $A$: $r_i A = a_i r_i$ or $a_i = r_i A r_i^T$
Then $A = \sum_i r_i^T a_i r_i$
and $A^{-1} = \sum_i r_i^T a_i^{-1} r_i = \sum_i r_i^T (r_i A r_i^T)^{-1} r_i$, which is the Schwarz formula!
Good decompositions are a compromise between conditioning and parallel complexity, in practice
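The identity can be verified numerically on a small case. This sketch assumes a hypothetical symmetric 2-by-2 matrix whose orthonormal eigenvectors are known in closed form, and treats each eigenvector as a one-dimensional "subdomain" restriction $r_i$.

```python
import math

A = [[2.0, 1.0],
     [1.0, 2.0]]
s = 1.0 / math.sqrt(2.0)
rows = [[s, s], [s, -s]]             # orthonormal eigenvectors r_i, stored as rows

def rayleigh(r):
    """a_i = r_i A r_i^T."""
    Ar = [sum(A[j][k] * r[k] for k in range(2)) for j in range(2)]
    return sum(r[j] * Ar[j] for j in range(2))

# Sum of "partial inverses": sum_i r_i^T (r_i A r_i^T)^{-1} r_i
A_inv = [[sum(r[j] * r[k] / rayleigh(r) for r in rows) for k in range(2)]
         for j in range(2)]

# Multiply back to check that A_inv really inverts A.
product = [[sum(A[j][m] * A_inv[m][k] for m in range(2)) for k in range(2)]
           for j in range(2)]
```

With subdomain restrictions in place of eigenvectors the sum is no longer the exact inverse, which is why it is used as a preconditioner rather than as a solver.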
Newton-Krylov-Schwarz: a parallel PDE “workhorse”
Newton: nonlinear solver, asymptotically quadratic
Krylov: accelerator, spectrally adaptive
Schwarz: preconditioner, parallelizable
Popularized in parallel Jacobian-free form under this name by Cai, Gropp, Keyes & Tidriri (1994), in PETSc since Balay’s MS project at ODU (1995)
Jacobian-free Newton-Krylov method
In the Jacobian-free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products
These are approximated by the Fréchet derivatives
$J(u)\,v \approx \frac{1}{\varepsilon}\left[F(u + \varepsilon v) - F(u)\right]$
so that the actual Jacobian elements are never explicitly needed, where $\varepsilon$ is chosen with a fine balance between approximation and floating point rounding error
Schwarz preconditions, using approximate elements
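A minimal sketch of the matrix-free product follows. The two-equation residual `F` is hypothetical and was chosen because its Jacobian is easy to write down for comparison; the formula used for `eps` is one common heuristic and is an assumption here, not the choice made in any particular library.

```python
import math

def F(u):
    """Small hypothetical nonlinear residual with a known Jacobian."""
    return [u[0] ** 2 + u[1] - 3.0,
            u[0] + u[1] ** 2 - 5.0]

def jacobian_vector_product(F, u, v):
    """Approximate J(u) v by (F(u + eps v) - F(u)) / eps without forming J."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_v == 0.0:
        return [0.0] * len(u)
    # Balance truncation error (wants small eps) against rounding error (wants large eps).
    eps = math.sqrt(2.2e-16) * (1.0 + norm_u) / norm_v
    Fu = F(u)
    Fp = F([ui + eps * vi for ui, vi in zip(u, v)])
    return [(a - b) / eps for a, b in zip(Fp, Fu)]

u, v = [1.0, 2.0], [1.0, -1.0]
Jv = jacobian_vector_product(F, u, v)
exact = [2 * u[0] * v[0] + v[1], v[0] + 2 * u[1] * v[1]]   # analytic J(u) v
```

Each product costs one extra residual evaluation, which is why a Krylov method needs only the ability to evaluate `F`.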
Philosophy of Jacobian-free NK
To evaluate the linear residual, we use the true F’(u), giving a true Newton step and asymptotic quadratic Newton convergence
To precondition the linear residual, we do anything convenient that uses understanding of the dominant physics/mathematics in the system and respects the limitations of the parallel computer architecture and the cost of various operations:
  combinations of operator-split Jacobians (for reasons of physics or reasons of numerics)
  Jacobian of related discretization (for “fast” solves)
  Jacobian of lower-order discretization (for more stability, less storage)
  Jacobian with “lagged” values for expensive terms (for less computation per degree of freedom)
  Jacobian stored in lower precision (for less memory traffic per preconditioning step)
  Jacobian blocks decomposed for parallelism
Philosophy of Jacobian-free NK, cont.
These motivations are not new; most large-scale application codes also take “short cuts” on the approximate Jacobian operator to be inverted, showing physical intuition
The problem with many codes is that they do not anywhere have an accurate global Jacobian operator; they use only the weak Jacobian
This leads to a weakly nonlinearly converging “defect correction method”
Defect correction: $B\,\delta u^k = -F(u^k)$
in contrast to preconditioned Newton: $B^{-1} J(u^k)\,\delta u^k = -B^{-1} F(u^k)$
Physics-based preconditioning
Consider an operator-split algorithm that leaves a first-order splitting error when used as a solver
In the Jacobian-free Newton-Krylov framework, this solver, which maps a residual into a correction, can be regarded as a preconditioner
The true Jacobian is never formed, yet the time-implicit nonlinear residual at each time step can be made as small as needed for nonlinear consistency in long time integrations
Physics-based preconditioning
In Newton iteration, one seeks to obtain a correction (“delta”) to solution, by inverting the Jacobian matrix on (the negative of) the nonlinear residual:
$\delta u^k = -[J(u^k)]^{-1} F(u^k)$
A typical operator-split code also derives a “delta” to the solution, by some implicitly defined means, through a series of implicit and explicit substeps
This implicitly defined mapping from residual to “delta” is a natural preconditioner:
$F(u^k) \mapsto \delta u^k$
Software must accommodate this!
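Ex.: 1D shallow water preconditioning
Define continuity residual for each timestep:
$R_\phi \equiv \frac{\phi^{n+1} - \phi^n}{\Delta t} + \frac{\partial (u\phi)^{n+1}}{\partial x}$
Define momentum residual for each timestep:
$R_{u\phi} \equiv \frac{(u\phi)^{n+1} - (u\phi)^n}{\Delta t} + \frac{\partial (u^2\phi)^n}{\partial x} + g\phi^n \frac{\partial \phi^{n+1}}{\partial x}$
Continuity delta-form (*):
$\frac{\delta\phi}{\Delta t} + \frac{\partial [\delta(u\phi)]}{\partial x} = -R_\phi$
Momentum delta form (**):
$\frac{\delta(u\phi)}{\Delta t} + g\phi^n \frac{\partial (\delta\phi)}{\partial x} = -R_{u\phi}$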
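1D Shallow water preconditioning, cont.
Solving (**) for $\delta(u\phi)$ and substituting into (*),
$\delta\phi - \Delta t^2\, \frac{\partial}{\partial x}\left[g\phi^n \frac{\partial (\delta\phi)}{\partial x}\right] = -\Delta t\, R_\phi + \Delta t^2\, \frac{\partial (R_{u\phi})}{\partial x}$
After this parabolic equation is solved for $\delta\phi$, we have
$\delta(u\phi) = -\Delta t \left[ g\phi^n \frac{\partial (\delta\phi)}{\partial x} + R_{u\phi} \right]$
This completes the application of the preconditioner to one Newton-Krylov iteration at one timestep
Of course, the parabolic solve need not be done exactly; one sweep of multigrid can be used
See paper by Mousseau et al. (2002) in Ref [1] for impressive results for longtime weather integration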
Operator-split preconditioning
Subcomponents of a PDE operator often have special structure that can be exploited if they are treated separately
Algebraically, this is just a generalization of Schwarz, by term instead of by subdomain
Suppose $J = \tau^{-1} I + S + R$ and a preconditioner is to be constructed, where $I + \tau S$ and $I + \tau R$ are each “easy” to invert
Form a preconditioned vector from $u$ as follows: $\tau\,(I + \tau R)^{-1} (I + \tau S)^{-1} u$
Equivalent to replacing $J$ with $\tau^{-1} I + S + R + \tau S R$
First-order splitting error, yet often used as a solver!
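The size of the splitting error can be seen in a scalar model. The sketch below assumes $J = 1/\tau + s + r$ with hypothetical scalar stand-ins `s` and `r` for the two sub-operators, and applies the two "easy" solves in sequence, scaled by $\tau$; the operator actually inverted is then $(1 + \tau s)(1 + \tau r)/\tau$, which differs from $J$ by $\tau s r$.

```python
def exact_inverse(tau, s, r, u):
    """J^{-1} u for the scalar model J = 1/tau + s + r."""
    return u / (1.0 / tau + s + r)

def split_inverse(tau, s, r, u):
    """tau (1 + tau r)^{-1} (1 + tau s)^{-1} u: two 'easy' solves in sequence."""
    w = u / (1.0 + tau * s)
    return tau * w / (1.0 + tau * r)

def operator_defect(tau, s, r):
    """Operator actually inverted by the split solve, minus J (equals tau*s*r)."""
    return 1.0 / split_inverse(tau, s, r, 1.0) - 1.0 / exact_inverse(tau, s, r, 1.0)

s, r = 3.0, 5.0                       # hypothetical scalar stand-ins for S and R
d1 = operator_defect(0.1, s, r)       # defect at tau = 0.1
d2 = operator_defect(0.05, s, r)      # defect at tau = 0.05
```

Halving `tau` halves the defect, which is what "first-order splitting error" means here; used as a preconditioner inside Newton-Krylov, that defect affects only the Krylov iteration count, not the converged answer.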
Operator-split preconditioning, cont.
Suppose S is convection-diffusion and R is reaction, among a collection of fields stored as gridfunctions
On a small regular 2D grid with a five-point stencil:
  R is trivially invertible in block diagonal form
  S is invertible with one multilevel solve per field
[Figure: J = S + R]
Operator-split preconditioning, cont.
Preconditioners assembled from just the “strong” elements of the Jacobian, alternating the source term and the diffusion term operators, are competitive in convergence rates with block-ILU on the Jacobian
  particularly since the decoupled scalar diffusion systems are amenable to simple multigrid treatment, which is not as trivial for the coupled system
The decoupled preconditioners store many fewer elements and significantly reduce memory bandwidth requirements and are expected to be much faster per iteration when carefully implemented
See “alternative block factorization” by Bank et al. in Ref [1]; incorporated into SciDAC TSI solver by D’Azevedo
Using Jacobian of related discretization
To precondition a variable coefficient operator, such as $\nabla\cdot(\alpha\nabla)$, use $\bar{\alpha}\nabla^2$, based on a constant coefficient average
Brown & Saad (1980) showed that, because of the availability of fast solvers, it may even be acceptable to use $-\nabla^2(\cdot)$ to precondition something like $-\nabla^2(\cdot) + u\,\frac{\partial(\cdot)}{\partial x} + v\,\frac{\partial(\cdot)}{\partial y}$
Using Jacobian of lower order discretization
Orszag popularized the use of linear finite element discretizations as preconditioners for high-order spectral element discretizations in the 1970s; both approach the same continuous operator
It is common in CFD to employ first-order upwinded convective operators as approximate inversions for higher-order operators:
  better factorization stability
  smaller matrix bandwidth and complexity
With Jacobian-free NK, we can have the best of both worlds: a stable factorization/cheap solve and a true Jacobian step
Using Jacobian with lagged terms
Newton-chord methods (e.g., papers by Smooke et al.) “freeze” the Jacobian matrices:
  saves Jacobian evaluation and factorization, which can be up to 90% of the running time of the code in some apps
  however, nonlinear convergence degrades to linear rate
In Jacobian-free NK, we can “freeze” some or all of the terms in the Jacobian preconditioner, while always accessing the action of the true Jacobian for the Krylov matrix-vector multiply:
  still saves Jacobian work
  maintains asymptotically quadratic rate for nonlinear convergence
See Knoll-Keyes (2002) for an example with coupled edge plasma and Navier-Stokes, showing a five-fold improvement for JFNK with the preconditioner refreshed once every ten timesteps, compared with full Newton with a constantly refreshed Jacobian on the LHS
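The loss of quadratic convergence when the Jacobian itself is frozen shows up even in one dimension. This sketch uses a hypothetical scalar test function and counts iterations for Newton (slope refreshed every step) against the chord method (slope frozen at the starting point); it does not model the Krylov solve or the preconditioner.

```python
def iterations_to_converge(f, slope, x0, tol=1e-12, max_it=200):
    """Iterate x <- x - f(x) / slope(x) and count steps until |f(x)| < tol."""
    x = x0
    for k in range(1, max_it + 1):
        x = x - f(x) / slope(x)
        if abs(f(x)) < tol:
            return k
    return max_it

f = lambda x: x * x - 2.0            # hypothetical scalar test problem, root sqrt(2)
x0 = 2.0
newton = iterations_to_converge(f, lambda x: 2.0 * x, x0)   # slope refreshed each step
chord = iterations_to_converge(f, lambda x: 2.0 * x0, x0)   # slope frozen at x0
print(newton, chord)
```

In Jacobian-free NK the frozen quantity is only the preconditioner, so the outer Newton iteration behaves like the first call rather than the second.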
Using Jacobian with lower precision elements
Memory bandwidth is the critical architectural parameter for sparse linear algebra computations
Storing the preconditioner elements in single precision effectively doubles memory bandwidth (and potentially halves runtime) for this critical phase
We still form the Jacobian-vector product with full precision and “zero-pad” the preconditioner elements back to full length in the arithmetic unit, so the numerical quality of the Krylov subspace does not degrade
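The storage side of this trade can be illustrated with the standard-library `array` module. The entries below are hypothetical values standing in for preconditioner elements; the sketch shows only the halved memory footprint and the size of the rounding introduced by single-precision storage, not any timing.

```python
from array import array

# Hypothetical preconditioner entries; the values are illustrative only.
entries = [0.1 * k for k in range(1, 1001)]

double_store = array("d", entries)    # 8 bytes per entry
single_store = array("f", entries)    # 4 bytes per entry

bytes_double = len(double_store) * double_store.itemsize
bytes_single = len(single_store) * single_store.itemsize

# Reading a single-precision entry back widens it to double (the "zero-padding");
# only the stored value was rounded, and arithmetic proceeds at full precision.
max_rel_error = max(abs(s - d) / abs(d) for s, d in zip(single_store, double_store))
```

Half the bytes per entry means half the memory traffic for a bandwidth-bound preconditioner sweep, at the price of a relative perturbation of roughly 1e-7 or less in each stored entry.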
Memory BW bottleneck revealed via precision reduction
Execution times for unstructured NKS Euler Simulation on Origin 2000: double precision matrices versus single precision preconditioner

                        Computational Phase
Number of       Linear Solve            Overall
Processors      Double      Single      Double      Single
16              223s        136s        746s        657s
32              117s        67s         373s        331s
64              60s         34s         205s        181s
120             31s         16s         122s        106s

Note that times are nearly halved, along with precision, for the BW-limited linear solve phase, indicating that the BW can be at least doubled before hitting the next bottleneck!
PETSc and Hypre combined in “Terascale Optimal PDE Simulations” (TOPS) ISIC
Nine institutions, five years, 24 co-PIs
Scope for TOPS
Design and implementation of “solvers”
  Time integrators (w/ sens. anal.): $f(\dot{x}, x, t, p) = 0$
  Nonlinear solvers (w/ sens. anal.): $F(x, p) = 0$
  Optimizers: $\min_u \phi(x,u)$ s.t. $F(x,u) = 0$, $u \ge 0$
  Linear solvers: $Ax = b$
  Eigensolvers: $Ax = \lambda Bx$
Software integration
Performance optimization
[Figure: dependence diagram among Optimizer, Sens. Analyzer, Time integrator, Nonlinear solver, Eigensolver, and Linear solver; arrows indicate dependence]
Ex.: Hall magnetic reconnection
Magnetic Reconnection: Applications to Sawtooth Oscillations, Error Field Induced Islands and the Dynamo Effect
The research goals of this project include producing a unique high performance code and using this code to study magnetic reconnection in astrophysical plasmas, in smaller scale laboratory experiments, and in fusion devices. The modular code that will be developed will be a fully three-dimensional, compressible Hall MHD code with options to run in slab, cylindrical and toroidal geometry and flexible enough to allow change in algorithms as needed. The code will use adaptive grid refinement, will run on massively parallel computers, and will be portable and scalable. The research goals include studies that will provide increased understanding of sawtooth oscillations in tokamaks, magnetotail substorms, error-fields in tokamaks, reverse field pinch dynamos, astrophysical dynamos, and laboratory reconnection experiments.
PI: Amitava Bhattacharjee, University of Iowa
Summary of progress on CMRS
CMRS team has provided TOPS with discretization of model 2D multicomponent MHD evolution code in PETSc’s FormFunctionLocal format using DMMG and automatic differentiation for Jacobian objects
TOPS has implemented fully nonlinearly implicit GMRES-MG-ILU parallel solver with custom deflation of nullspace in CMRS’s doubly periodic formulation
CMRS and TOPS reproduce the same dynamics on the same grids with the same time-stepping, up to a finite-time singularity due to collapse of current sheet (that falls below presently uniform mesh resolution)
TOPS code, being implicit, can choose timesteps an order of magnitude larger, with potential for higher ratio in more physically realistic parameter regimes, but is still slower in wall-clock time
PLAN: tune PETSc solver by profiling, blocking, reuse, etc.
PLAN: go to higher-order in time
PLAN: identify the numerical complexity benefits from implicitness (in suppressing fast timescales) and quantify (explicit versus implicit)
PLAN (with APDEC team): incorporate AMR
2D Hall MHD sawtooth instability
Model equations (Porcelli et al., 1993, 1999) and equilibrium: [equations shown on slide]
[Figure: vorticity at early time and at later time, with zoom; figures c/o A. Bhattacharjee, CMRS]
ex29.c in PETSc 2.5.1
Time-implicit Newton-Krylov-Schwarz
For nonlinear robustness, NKS iteration is wrapped in time-stepping:
for (l = 0; l < n_time; l++) {                    // pseudo-time loop
    select time step
    for (k = 0; k < n_Newton; k++) {              // NKS loop
        compute nonlinear residual and Jacobian
        for (j = 0; j < n_Krylov; j++) {
            forall (i = 0; i < n_Precon; i++) {
                solve subdomain problems concurrently
            } // End of loop over subdomains
            perform Jacobian-vector product
            enforce Krylov basis conditions
            update optimal coefficients
            check linear convergence
        } // End of linear solver
        perform DAXPY update
        check nonlinear convergence
    } // End of nonlinear loop
} // End of time-step loop
PETSc’s DMMG in Hall MHD application
Mesh and time refinement studies of CMRS Hall magnetic reconnection model problem (4 mesh sizes; dt=0.1 (nondimensional, near CFL limit for fastest wave) on left, dt=0.8 on right)
Measure of functional inverse to thickness of current sheet versus time, for 0<t<200 (nondimensional), where singularity occurs around t=215
PETSc’s DMMG in Hall MR application, cont.
Implicit timestep increase studies of CMRS Hall magnetic reconnection model problem, on finest (192×192) mesh of previous slide, in absolute magnitude, rather than semi-log
Ex.: Computational aerodynamics
mesh c/o D. Mavriplis, ICASE
Implemented in PETSc
www.mcs.anl.gov/petsc
Transonic “Lambda” Shock, Mach contours on surfaces
Fixed-size Parallel Scaling Results
128 nodes: 43 min
3072 nodes: 2.5 min, 226 Gf/s
11M unknowns, 15 µs/unknown, 70% efficient
This scaling study, featuring our widest range of processor number, was done for the incompressible case.
Four orders of magnitude in 13 years
c/o K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith
Scaling to new architectures w/Bulk Synchronous Processing (BSP) model
[Figure: timelines for processors P1, P2, …, Pn through the phases of one Krylov iteration (local scatter, Jac-vec multiply, precond sweep, daxpys, inner products), shown before and after rescaling]
What happens if, for instance, in this (schematicized) iteration, arithmetic speed is doubled, scalar all-gather is quartered, and local scatter is cut by one-third? Each phase is considered separately. The answer is shown in the second (right-hand) set of timelines.
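The phase-by-phase reasoning of this model amounts to rescaling each phase independently and summing. The sketch below assumes hypothetical per-iteration phase times in arbitrary units and a hypothetical grouping of the phases by what limits them; none of the numbers come from the talk.

```python
def rescale_iteration(phase_times, speedups):
    """Bulk-synchronous estimate: phases run one after another, so the new
    iteration time is the sum of each phase time divided by its own speedup."""
    return sum(t / speedups.get(kind, 1.0) for kind, t in phase_times)

# Hypothetical per-iteration costs (arbitrary units), tagged by what limits them.
phases = [("scatter", 3.0),       # local scatter (nearest-neighbor exchange)
          ("arithmetic", 8.0),    # Jac-vec multiply, precond sweep, daxpys
          ("allgather", 4.0)]     # inner products (scalar all-gather)

# Arithmetic twice as fast, all-gather time quartered, scatter time cut by one-third.
speedups = {"arithmetic": 2.0, "allgather": 4.0, "scatter": 1.5}

old_time = sum(t for _, t in phases)
new_time = rescale_iteration(phases, speedups)
```

Because the phases are separated by synchronization points, improving one phase never hides the cost of another; the slowest-improving phase gradually dominates the iteration.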
Primitive BSP-like BG/L extrapolation*
[Figure: bar chart of total time and communication time (vertical axis 0 to 400) for 1024 Red, 1024 BG/L, 2048 BG/L, 4096 BG/L, and 8192 BG/L]
Data pairs: 1.0 (11%), 2.0 (11%), 3.7 (20%), 7.9 (25%)
*performed by A. Sugavanam; data pair is speedup (w.r.t. 1024 BG/L nodes) and communication %
Conclusions
Domain decomposition and multilevel iteration are the dominant paradigm in contemporary terascale PDE simulation
Several freely available software toolkits exist, and successfully scale to thousands of tightly coupled processors for problems on quasi-static meshes
Concerted efforts are underway to make elements of these toolkits interoperate, and to allow expression of the best methods, which tend to be modular, hierarchical, recursive, and above all, adaptive!
Tunability of NKS algorithmics allows solver adaptation to application/architecture combinations
Next generation software should incorporate “best practices” in applications as preconditioners
Acknowledgments
Collaborators or Contributors:
  Xiao-Chuan Cai (Univ. Colorado, Boulder)
  Omar Ghattas (Carnegie-Mellon)
  Dinesh Kaushik (ODU)
  Dana Knoll (LANL)
  Dimitri Mavriplis (ICASE)
  PETSc team at Argonne National Laboratory: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith
Sponsors: DOE, NASA, NSF
Computer Resources: LLNL, LANL, SNL, NERSC
Related URLs
Personal homepage: papers, talks, etc.
  http://www.math.odu.edu/~keyes
SciDAC initiative
  http://www.science.doe.gov/scidac
TOPS software project
  http://www.tops-scidac.org
PETSc software project
  http://www.mcs.anl.gov/petsc
Hypre software project
  http://www.llnl.gov/CASC/hypre
Slides from 14-hour Peking University CS&E short course with Bill Gropp (in August 2002) on-line
Bibliography
Jacobian-Free Newton-Krylov Methods: Approaches and Applications, Knoll & Keyes, 2002, submitted to J. Comp. Phys.
Nonlinearly Preconditioned Inexact Newton Algorithms, Cai & Keyes, 2002, SIAM J. Sci. Comp. 24:183-200
High Performance Parallel Implicit CFD, Gropp, Kaushik, Keyes & Smith, 2001, Parallel Computing 27:337-362
Four Horizons for Enhancing the Performance of Parallel Simulations based on Partial Differential Equations, Keyes, 2000, Lect. Notes Comp. Sci., Springer, 1900:1-17
Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel CFD, Gropp, Keyes, McInnes & Tidriri, 2000, Int. J. High Performance Computing Applications 14:102-136
Achieving High Sustained Performance in an Unstructured Mesh CFD Application, Anderson, Gropp, Kaushik, Keyes & Smith, 1999, Proceedings of SC'99
Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in “Parallel Solution of Partial Differential Equations,” Springer, pp. 247-278
How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, pp. 286-297