Scalable Solvers and Software for PDE Applications


The Pennsylvania State University, 29 April 2003

David E. Keyes

Center for Computational Science, Old Dominion University
&
Institute for Scientific Computing Research, Lawrence Livermore National Laboratory

PennState Seminar, 29 April 2003

Happy Poincaré’s Birthday!

Born: 29 April 1854, Nancy
Fundamental contributions to topology, analysis, number theory, potential theory, quantum theory, fluid mechanics, the special theory of relativity, and the philosophy of science
Académie des Sciences, 1887 (President, 1906)
Fellow, Royal Society, 1894
Died: 17 July 1912, Paris
“The last universalist in mathematics” (anon.)
“It is by logic that we prove; it is by intuition that we invent.” (Henri Poincaré)


Plan of presentation

Imperative of “optimal” algorithms for terascale computing
Basic domain decomposition and multilevel algorithmic concepts
Examples of applications
Example of Bell-prize winning solver performance on ASCI platforms
Conclusions and outlook


Motivation: optimality

Convergence rate nearly independent of discretization parameters
  Multilevel schemes for rapid linear convergence of linear problems
  Newton-like schemes for quadratic convergence of nonlinear problems
Convergence rate as independent as possible of physical parameters
  Continuation schemes
  Physics-based preconditioning

[Figure: time to solution (0 to 200) versus problem size (1 to 1000, increasing with number of processors), contrasting “unscalable” and “scalable” behavior. Steel/rubber composite, parallel multigrid, c/o M. Adams, Berkeley-Sandia]

The solver is a key part, but not the only part, of the simulation that needs to be scalable


Why optimal algorithms?

The more powerful the computer, the greater the importance of optimality
Example:
  Suppose Alg1 solves a problem in time $CN^2$, where $N$ is the input size
  Suppose Alg2 solves the same problem in time $CN$
  Suppose that the machine on which Alg1 and Alg2 have been parallelized to run has 10,000 processors
In constant time (compared to serial), Alg1 can run a problem 100X larger, whereas Alg2 can run a problem 10,000X larger
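The arithmetic behind this comparison can be checked with a short sketch. It assumes ideal parallel speedup (an idealization, not a claim from the talk), so that P processors deliver P times the serial work in the same wall-clock time. The function name `size_growth` is hypothetical.

```python
def size_growth(processors: int, exponent: float) -> float:
    """Factor g by which the input size N can grow at fixed wall-clock time.

    Assumes serial cost C * N**exponent and ideal speedup, so that
    C * (g * N)**exponent / processors == C * N**exponent,
    which gives g = processors ** (1 / exponent).
    """
    return processors ** (1.0 / exponent)


alg1 = size_growth(10_000, 2)  # Alg1 costs C * N^2
alg2 = size_growth(10_000, 1)  # Alg2 costs C * N
print(f"Alg1: {alg1:.0f}X larger, Alg2: {alg2:.0f}X larger")
```

The quadratic algorithm only recovers the square root of the processor count as extra problem size, which is the gap the slide points to.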


Why optimal?, cont.

Alternatively, filling the machine’s memory, Alg1 requires 100X time, whereas Alg2 runs in constant time
Is 10,000 processors a reasonable expectation?
  Yes, we have it today (ASCI White (IBM), Red Storm (Cray))!
Could computational scientists really use 10,000X scaling?
  Of course; we are approximating the continuum
  A grid for weather prediction allows points every 1km versus every 100km on the earth’s surface
  In 2D 10,000X disappears fast; in 3D even faster
However, these machines are expensive (Earth Simulator is $0.5B, plus ongoing operating costs), and optimal algorithms are the only algorithms that we can afford to run on them


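Decomposition strategies for $Lu = f$ in $\Omega$

Operator decomposition: $L = \sum_k L_k$
Function space decomposition: $f = \sum_k f_k \Phi_k$, $u = \sum_k u_k \Phi_k$
Domain decomposition: $\Omega = \bigcup_k \Omega_k$

Consider, e.g., the implicitly discretized parabolic case:
$\left[\frac{I}{\tau} + L_x + L_y\right] u^{(k+1)} = \frac{I}{\tau} u^{(k)} + f$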


Operator decomposition

Consider ADI:
$\left[\frac{I}{\tau/2} + L_x\right] u^{(k+1/2)} = \left[\frac{I}{\tau/2} - L_y\right] u^{(k)} + f$
$\left[\frac{I}{\tau/2} + L_y\right] u^{(k+1)} = \left[\frac{I}{\tau/2} - L_x\right] u^{(k+1/2)} + f$

Iteration matrix consists of four sequential (“multiplicative”) substeps per timestep
  two sparse matrix-vector multiplies
  two sets of unidirectional bandsolves
Parallelism within each substep
But global data exchanges between bandsolve substeps


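Function space decomposition

Consider a spectral Galerkin method:
$u(x,y,t) = \sum_{j=1}^{N} a_j(t)\,\Phi_j(x,y)$
$\frac{d}{dt}(\Phi_i, u) = -(\Phi_i, Lu) + (\Phi_i, f), \quad i = 1,\ldots,N$
$\sum_j (\Phi_i, \Phi_j)\,\frac{da_j}{dt} = -\sum_j (\Phi_i, L\Phi_j)\,a_j + (\Phi_i, f), \quad i = 1,\ldots,N$

System of ordinary differential equations:
$\frac{da}{dt} = -M^{-1}Ka + M^{-1}f$, where $M \equiv [(\Phi_j, \Phi_i)]$, $K \equiv [(\Phi_j, L\Phi_i)]$

Perhaps $M$, $K$ are diagonal matrices
Perfect parallelism across spectral index
But global data exchanges to transform back to physical variables at each step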


Domain decomposition

Consider restriction and extension operators for subdomains, $R_i$, $R_i^T$, and for possible coarse grid, $R_0$, $R_0^T$
Replace discretized $Au = f$ with $B^{-1}Au = B^{-1}f$, where
$B^{-1} = R_0^T A_0^{-1} R_0 + \sum_i R_i^T A_i^{-1} R_i$, with $A_i = R_i A R_i^T$
Solve by a Krylov method, e.g., CG
Matrix-vector multiplies with
  parallelism on each subdomain
  nearest-neighbor exchanges, global reductions
  possible small global system (not needed for parabolic case)


Comparison

Operator decomposition (ADI)
  natural row-based assignment requires all-to-all, bulk data exchanges in each step (for transpose)
Function space decomposition (Fourier)
  natural mode-based assignment requires all-to-all, bulk data exchanges in each step (for transform)
Domain decomposition (Schwarz)
  natural domain-based assignment requires local, nearest neighbor data exchanges, global reductions, and optional small global problem


Theoretical scaling of domain decomposition (for three common network topologies)

With logarithmic-time (hypercube- or tree-based) global reductions and scalable nearest neighbor interconnects:
  optimal number of processors scales linearly with problem size (“scalable”, assumes one subdomain per processor)
With power-law-time (3D torus-based) global reductions and scalable nearest neighbor interconnects:
  optimal number of processors scales as three-fourths power of problem size (“almost scalable”)
With linear-time (common bus) network:
  optimal number of processors scales as one-fourth power of problem size (*not* scalable)
  bad news for conventional Beowulf clusters, but see 2000 & 2001 Bell Prize “price-performance awards” using multiple commodity NICs per Beowulf node!


Three Basic Concepts

Iterative correction
Schwarz preconditioning
Schur preconditioning

Some “Advanced” Concepts

Polynomial combinations of Schwarz projections
Schwarz-Schur combinations
  Schwarz on Schur-reduced system
  Schwarz inside Schur-reduced system
Nonlinear Schwarz

Slide callouts: “recent”, “optimization”


Iterative correction

The most basic idea in iterative methods:
$u \leftarrow u + B^{-1}(f - Au)$
Evaluate residual accurately, but solve approximately, where $B^{-1}$ is an approximate inverse to $A$
A sequence of complementary solves can be used, e.g., with first $B_1$ and then $B_2$ one has
$u \leftarrow u + [B_1^{-1} + B_2^{-1} - B_2^{-1} A B_1^{-1}](f - Au)$
Optimal polynomials of $(B^{-1}A)$ lead to various preconditioned Krylov methods
Scale recurrence, e.g., with $B_2^{-1} = R^T (R A R^T)^{-1} R$, leads to multilevel methods
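A minimal sketch of the basic correction step, u ← u + B⁻¹(f − Au), may help fix ideas. It assumes a small 1D Laplacian for A and takes B to be the diagonal of A (point Jacobi). Both choices, and the helper names `matvec` and `iterative_correction`, are illustrative assumptions rather than anything prescribed in the talk.

```python
def matvec(A, x):
    """Dense matrix-vector product for small illustrative systems."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def iterative_correction(A, f, apply_B_inv, u, sweeps):
    """Repeat u <- u + B^{-1} (f - A u); the residual always uses the true A."""
    for _ in range(sweeps):
        r = [fi - Aui for fi, Aui in zip(f, matvec(A, u))]
        du = apply_B_inv(r)  # approximate solve
        u = [ui + di for ui, di in zip(u, du)]
    return u


n = 8
# Tridiagonal (-1, 2, -1) matrix: a 1D Laplacian with Dirichlet ends.
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
f = [1.0] * n


def jacobi(r):
    """B^{-1} r with B = diag(A), a deliberately crude approximate inverse."""
    return [ri / A[i][i] for i, ri in enumerate(r)]


u = iterative_correction(A, f, jacobi, [0.0] * n, sweeps=200)
residual = max(abs(fi - Aui) for fi, Aui in zip(f, matvec(A, u)))
print(f"max residual after 200 sweeps: {residual:.2e}")
```

Even with an accurate residual, this crude B needs many sweeps on a tiny problem, which is why the slide moves on to Krylov acceleration and multilevel choices of B.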


Multilevel preconditioning

[Figure: a multigrid V-cycle between the finest grid and the first coarse grid, with a smoother applied on each level. Restriction transfers from fine to coarse grid; prolongation transfers from coarse to fine grid. The coarser grid has fewer cells (less work & storage). Recursively apply this idea until we have an easy problem to solve.]


Example of Hypre’s scaled efficiency

[Figure: PFMG-CG on Red (40x40x40); scaled efficiency (0 to 1) versus procs / problem size (0 to 4000), with Setup and Solve curves; the problem size ranges from 64K DOFs to 200M DOFs]


Schwarz preconditioning

Given $Ax = b$, partition $x$ into subvectors, corresp. to subdomains $\Omega_i$ of the domain $\Omega$ of the PDE, nonempty, possibly overlapping, whose union is all of the elements of $x \in \mathbb{R}^n$
Let Boolean rectangular matrix $R_i$ extract the $i^{th}$ subset of $x$:
$x_i = R_i x$
Let $A_i = R_i A R_i^T$ and $B^{-1} = \sum_i R_i^T A_i^{-1} R_i$
The Boolean matrices are gather/scatter operators, mapping between a global vector and its subdomain support
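The gather/solve/scatter structure of one-level additive Schwarz, B⁻¹ = Σᵢ Rᵢᵀ Aᵢ⁻¹ Rᵢ, can be sketched in a few lines. This sketch assumes a small dense 1D Laplacian and two overlapping index sets, and stores each Boolean Rᵢ as a list of global indices. The helper names (`solve`, `additive_schwarz`) and the damping factor 0.5 in the driver loop are assumptions for illustration, not part of the talk.

```python
def solve(A, b):
    """Dense Gaussian elimination with partial pivoting (small systems only)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            m = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= m * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def additive_schwarz(A, subdomains, r):
    """Return B^{-1} r = sum_i R_i^T A_i^{-1} R_i r."""
    z = [0.0] * len(r)
    for idx in subdomains:
        A_i = [[A[p][q] for q in idx] for p in idx]  # A_i = R_i A R_i^T
        r_i = [r[p] for p in idx]                    # gather: R_i r
        for p, zi in zip(idx, solve(A_i, r_i)):      # scatter-add: R_i^T (.)
            z[p] += zi
    return z


n = 8
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
subdomains = [list(range(0, 5)), list(range(3, 8))]  # overlap on indices 3 and 4
b = [1.0] * n

# Damped preconditioned Richardson iteration: x <- x + 0.5 * B^{-1} (b - A x).
x = [0.0] * n
for _ in range(100):
    r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
    z = additive_schwarz(A, subdomains, r)
    x = [xi + 0.5 * zi for xi, zi in zip(x, z)]
residual = max(abs(bi - sum(a * xj for a, xj in zip(row, x)))
               for row, bi in zip(A, b))
print(f"max residual: {residual:.2e}")
```

In practice the Schwarz operator is wrapped in a Krylov method rather than a damped stationary iteration. The loop above is only the shortest driver that exercises `additive_schwarz`.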


SPMD parallelism w/domain decomposition

Partitioning of the grid induces block structure on the Jacobian

[Figure: grid partitioned into subdomains 1, 2, 3; the block row A21, A22, A23 of the Jacobian is labeled “rows assigned to proc ‘2’”]


Iteration count estimates from the Schwarz theory

In terms of $N$ and $P$, where for d-dimensional isotropic problems, $N = h^{-d}$ and $P = H^{-d}$, for mesh parameter $h$ and subdomain diameter $H$, iteration counts may be estimated as follows:

Preconditioning Type          in 2D               in 3D
Point Jacobi                  $O(N^{1/2})$        $O(N^{1/3})$
Domain Jacobi ($\delta = 0$)  $O((NP)^{1/4})$     $O((NP)^{1/6})$
1-level Additive Schwarz      $O(P^{1/2})$        $O(P^{1/3})$
2-level Additive Schwarz      $O(1)$              $O(1)$

Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square-root of the condition number of the Schwarz-preconditioned system
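To get a feel for these estimates, the scalings can be evaluated for sample values of N and P. The sketch below drops all constants, so only ratios between rows are meaningful. It uses the observation that the tabulated exponents are 1/d, 1/(2d), and 1/d for d = 2, 3. The function name `iteration_estimates` and the sample sizes are hypothetical.

```python
def iteration_estimates(N: float, P: float, dim: int) -> dict:
    """Asymptotic iteration-count scalings from the table, constants omitted."""
    if dim not in (2, 3):
        raise ValueError("the table covers only 2D and 3D")
    return {
        "Point Jacobi": N ** (1.0 / dim),
        "Domain Jacobi": (N * P) ** (1.0 / (2 * dim)),
        "1-level Additive Schwarz": P ** (1.0 / dim),
        "2-level Additive Schwarz": 1.0,
    }


# Sample sizes chosen only for illustration: 10^6 unknowns on 100 subdomains.
est = iteration_estimates(N=1.0e6, P=1.0e2, dim=2)
for name, value in est.items():
    print(f"{name:26s} ~ {value:8.1f}")
```

Only the two-level method has an estimate that stays fixed as both N and P grow, which is the sense of “optimal” used earlier in the talk.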


Comments on the Schwarz theory

Basic Schwarz estimates are for:
  self-adjoint operators with smooth coefficients
  positive definite operators
  exact subdomain solves, $A_i^{-1}$
  two-way overlapping with $R_i$, $R_i^T$
  generous overlap, $\delta = O(H)$ (otherwise 2-level result is $O(1 + H/\delta)$)
Extensible to:
  nonself-adjointness (e.g., convection) and jumping coefficients
  indefiniteness (e.g., wave Helmholtz)
  inexact subdomain solves
  one-way overlap communication (“restricted additive Schwarz”)
  small overlap


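Schur preconditioning

Given a partition
$\begin{pmatrix} A_{ii} & A_{i\Gamma} \\ A_{\Gamma i} & A_{\Gamma\Gamma} \end{pmatrix} \begin{pmatrix} u_i \\ u_\Gamma \end{pmatrix} = \begin{pmatrix} f_i \\ f_\Gamma \end{pmatrix}$
Condense:
$S u_\Gamma = g$, where $S \equiv A_{\Gamma\Gamma} - A_{\Gamma i} A_{ii}^{-1} A_{i\Gamma}$ and $g \equiv f_\Gamma - A_{\Gamma i} A_{ii}^{-1} f_i$
Let $M$ be a good preconditioner for $S$
Then
$\begin{pmatrix} A_{ii} & 0 \\ A_{\Gamma i} & I \end{pmatrix} \begin{pmatrix} I & A_{ii}^{-1} A_{i\Gamma} \\ 0 & M \end{pmatrix}$
is a preconditioner for $A$
Moreover, solves with $A_{ii}$ may be done approximately if all degrees of freedom are retained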
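The condensation step can be traced on a toy system with two interior unknowns and one interface unknown. The numbers below are made up for illustration, and the interior block is taken to be diagonal purely so that its inverse is trivial. The subscript `g` in the variable names stands for the interface (Γ).

```python
# Block system [[A_ii, A_ig], [A_gi, A_gg]] [u_i, u_g] = [f_i, f_g].
A_ii_diag = [4.0, 5.0]   # diagonal of A_ii (assumed diagonal for brevity)
A_ig = [1.0, 2.0]        # interior-to-interface coupling (a column)
A_gi = [1.0, 2.0]        # interface-to-interior coupling (a row)
A_gg = 6.0               # single interface unknown
f_i, f_g = [1.0, 2.0], 3.0

# Condense: S = A_gg - A_gi A_ii^{-1} A_ig,  g = f_g - A_gi A_ii^{-1} f_i
S = A_gg - sum(r * c / d for r, c, d in zip(A_gi, A_ig, A_ii_diag))
g = f_g - sum(r * f / d for r, f, d in zip(A_gi, f_i, A_ii_diag))

u_g = g / S  # interface solve with the Schur complement
# Back-substitute for the interior unknowns: u_i = A_ii^{-1} (f_i - A_ig u_g)
u_i = [(f - c * u_g) / d for f, c, d in zip(f_i, A_ig, A_ii_diag)]

print(f"S = {S:.4f}, u_g = {u_g:.6f}, u_i = {u_i}")
```

In a real solver the interior solves are the subdomain problems, and S is never formed explicitly; only its action on a vector is applied.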


Schwarz polynomials

Polynomials of Schwarz projections that are combinations of additive and multiplicative may be appropriate for certain implementations
We may solve the fine subdomains concurrently and follow with a coarse grid (redundantly/cooperatively):
$u \leftarrow u + \sum_i B_i^{-1}(f - Au)$
$u \leftarrow u + B_0^{-1}(f - Au)$
This leads to algorithm “Hybrid II” in S-B-G’96:
$B^{-1} = B_0^{-1} + (I - B_0^{-1}A)\left(\sum_i B_i^{-1}\right)$
Convenient for “SPMD” (single prog/multiple data)


Schwarz-on-Schur

Preconditioning the Schur complement is complex in and of itself; Schwarz can be used on the reduced problem
“Neumann-Neumann” alg:
$M^{-1} = \sum_i D_i R_i^T S_i^{-1} R_i D_i$
“Balancing Neumann-Neumann” alg:
$M^{-1} = M_0^{-1} + (I - M_0^{-1}S)\left(\sum_i D_i R_i^T S_i^{-1} R_i D_i\right)(I - S M_0^{-1})$
Multigrid on the Schur complement

[Figure labels: $\Omega_i$, $S_i$, $D_i = 1/4$]


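Schwarz-inside-Schur

Consider Newton’s method for solving the nonlinear rootfinding problem derived from the necessary conditions for constrained optimization
Constraint: $f(x,u) = 0;\ x \in \mathbb{R}^N;\ u \in \mathbb{R}^M;\ f \in \mathbb{R}^N$
Objective: $\min_u \phi(x,u);\ \phi \in \mathbb{R}$
Lagrangian: $\mathcal{L}(x,u,\lambda) \equiv \phi(x,u) + \lambda^T f(x,u);\ \lambda \in \mathbb{R}^N$
Form the gradient of the Lagrangian with respect to each of $x$, $u$, and $\lambda$:
$\phi_x(x,u) + \lambda^T f_x(x,u) = 0$
$\phi_u(x,u) + \lambda^T f_u(x,u) = 0$
$f(x,u) = 0$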


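Schwarz-inside-Schur

Equality constrained optimization leads to the KKT system for states $x$, designs $u$, and multipliers $\lambda$:
$\begin{pmatrix} W_{xx} & W_{ux}^T & J_x^T \\ W_{ux} & W_{uu} & J_u^T \\ J_x & J_u & 0 \end{pmatrix} \begin{pmatrix} \delta x \\ \delta u \\ \delta\lambda \end{pmatrix} = -\begin{pmatrix} g_x \\ g_u \\ f \end{pmatrix}$

Newton Reduced SQP solves the Schur complement system $H\,\delta u = g$, where $H$ is the reduced Hessian:
$H = W_{uu} - J_u^T J_x^{-T} W_{ux}^T - (W_{ux} - J_u^T J_x^{-T} W_{xx})\,J_x^{-1} J_u$
$g = -g_u + J_u^T J_x^{-T} g_x + (W_{ux} - J_u^T J_x^{-T} W_{xx})\,J_x^{-1} f$

Then
$J_x\,\delta x = -f - J_u\,\delta u$
$J_x^T\,\delta\lambda = -g_x - W_{xx}\,\delta x - W_{ux}^T\,\delta u$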


Schwarz-inside-Schur, cont.

Problems
  $J_x$ is the Jacobian of a PDE: huge!
  The $W$ blocks involve Hessians of objective and constraints: second derivatives and huge
  $H$ is unreasonable to form, store, or invert
Solutions
  Use Schur preconditioning on full system
  Form forward action of Hessians by automatic differentiation (vector-to-vector map)
  Form approximate inverse action of state Jacobian and its transpose by Schwarz


Example of PDE-constrained Optimization

Lagrange-Newton-Krylov-Schur implemented in Veltisto/PETSc (c/o G. Biros and O. Ghattas)
www.cs.nyu.edu/~biros/veltisto/

Optimal control of laminar viscous flow
  optimization variables are surface suction/injection
  objective is minimum drag
  700,000 states; 4,000 controls
  128 Cray T3E processors
  ~5 hrs for optimal solution (~1 hr for analysis)

[Figure: wing tip vortices, no control (l); optimal control (r); optimal boundary controls shown as velocity vectors]


Nonlinear Schwarz preconditioning

Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free
It replaces $F(u) = 0$ with a new nonlinear system possessing the same root, $\Phi(u) = 0$
Define a correction $\delta_i(u)$ to the $i^{th}$ partition (e.g., subdomain) of the solution vector by solving the following local nonlinear system:
$R_i F(u + \delta_i(u)) = 0$
where $\delta_i(u) \in \mathbb{R}^n$ is nonzero only in the components of the $i^{th}$ partition
Then sum the corrections:
$\Phi(u) = \sum_i \delta_i(u)$


Nonlinear Schwarz, cont.

It is simple to prove that if the Jacobian of $F(u)$ is nonsingular in a neighborhood of the desired root then $\Phi(u) = 0$ and $F(u) = 0$ have the same unique root
To lead to a Jacobian-free Newton-Krylov algorithm we need to be able to evaluate for any $u, v \in \mathbb{R}^n$:
  The residual $\Phi(u) = \sum_i \delta_i(u)$
  The Jacobian-vector product $\Phi'(u)\,v$
Remarkably, (Cai-Keyes, 2000) it can be shown that
$\Phi'(u)\,v \approx -\sum_i (R_i^T J_i^{-1} R_i)\,J v$
where $J = F'(u)$ and $J_i = R_i J R_i^T$
All required actions are available in terms of $F(u)$!


Example of nonlinear Schwarz

[Figure: convergence histories for Newton’s method (difficulty at critical Re; stagnation beyond critical Re) and for Additive Schwarz Preconditioned Inexact Newton (ASPIN) (convergence for all Re)]


“Unreasonable effectiveness” of Schwarz

When does the sum of partial inverses equal the inverse of the sums? When the decomposition is right!
Let $\{r_i\}$ be a complete set of orthonormal row eigenvectors for $A$:
$r_i A = a_i r_i$ or $a_i = r_i A r_i^T$
Then
$A = \sum_i r_i^T a_i r_i$
and
$A^{-1} = \sum_i r_i^T a_i^{-1} r_i = \sum_i r_i^T (r_i A r_i^T)^{-1} r_i$, which is the Schwarz formula!
Good decompositions are a compromise between conditioning and parallel complexity, in practice
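The eigenvector identity can be verified by hand on a 2×2 example. The sketch below uses the symmetric matrix [[2, 1], [1, 2]], chosen only for illustration, whose orthonormal eigenvectors are (1, 1)/√2 and (1, −1)/√2. It forms Σᵢ rᵢᵀ (rᵢ A rᵢᵀ)⁻¹ rᵢ and multiplies the result back against A.

```python
import math

A = [[2.0, 1.0], [1.0, 2.0]]
s = 1.0 / math.sqrt(2.0)
rows = [[s, s], [s, -s]]  # orthonormal row eigenvectors r_i of A


def rayleigh(r):
    """a_i = r_i A r_i^T for a unit row vector r_i."""
    Ar = [sum(A[p][q] * r[q] for q in range(2)) for p in range(2)]
    return sum(r[p] * Ar[p] for p in range(2))


# Sum of "partial inverses": sum_i r_i^T (r_i A r_i^T)^{-1} r_i
A_inv = [[sum(r[p] * r[q] / rayleigh(r) for r in rows) for q in range(2)]
         for p in range(2)]

# Multiply back to confirm that A * A_inv is the identity.
product = [[sum(A[p][k] * A_inv[k][q] for k in range(2)) for q in range(2)]
           for p in range(2)]
print(A_inv)
print(product)
```

Here each “subdomain” is a one-dimensional eigenspace, so the sum of partial inverses is exact. Practical subdomains only approximate this ideal, hence the compromise mentioned on the slide.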


Newton-Krylov-Schwarz – a parallel PDE “workhorse”

Newton: nonlinear solver, asymptotically quadratic
Krylov: accelerator, spectrally adaptive
Schwarz: preconditioner, parallelizable

Popularized in parallel Jacobian-free form under this name by Cai, Gropp, Keyes & Tidriri (1994), in PETSc since Balay’s MS project at ODU (1995)


Jacobian-free Newton-Krylov method

In the Jacobian-free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products
These are approximated by the Fréchet derivatives
$J(u)\,v \approx \frac{1}{\epsilon}\,[F(u + \epsilon v) - F(u)]$
so that the actual Jacobian elements are never explicitly needed, where $\epsilon$ is chosen with a fine balance between approximation and floating point rounding error
Schwarz preconditions, using approximate elements
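The finite-difference Jacobian-vector product at the heart of JFNK fits in a few lines. The sketch below uses a made-up two-equation residual `F`. The rule for picking ε (square root of machine epsilon, scaled by the norms of u and v) is one common heuristic and an assumption here, not necessarily the choice made in the codes discussed in this talk.

```python
import math
import sys


def F(u):
    """Hypothetical 2-equation nonlinear residual, used only for illustration."""
    x, y = u
    return [x * x + y - 3.0, x + y * y - 5.0]


def jacobian_vector(F, u, v, eps=None):
    """Approximate J(u) v by [F(u + eps*v) - F(u)] / eps without forming J."""
    if eps is None:
        norm_u = math.sqrt(sum(ui * ui for ui in u))
        norm_v = math.sqrt(sum(vi * vi for vi in v))
        if norm_v == 0.0:
            return [0.0] * len(u)
        # Heuristic balance between truncation error and rounding error.
        eps = math.sqrt(sys.float_info.epsilon) * (1.0 + norm_u) / norm_v
    Fu = F(u)
    Fp = F([ui + eps * vi for ui, vi in zip(u, v)])
    return [(a - b) / eps for a, b in zip(Fp, Fu)]


u, v = [1.0, 2.0], [0.5, -1.0]
Jv_fd = jacobian_vector(F, u, v)
# The exact Jacobian of this F is [[2x, 1], [1, 2y]]; used for comparison only.
Jv_exact = [2.0 * u[0] * v[0] + v[1], v[0] + 2.0 * u[1] * v[1]]
print(Jv_fd, Jv_exact)
```

A Krylov method needs only this `jacobian_vector` action, which costs one extra residual evaluation per product.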


Philosophy of Jacobian-free NK

To evaluate the linear residual, we use the true F’(u), giving a true Newton step and asymptotic quadratic Newton convergence
To precondition the linear residual, we do anything convenient that uses understanding of the dominant physics/mathematics in the system and respects the limitations of the parallel computer architecture and the cost of various operations:
  combinations of operator-split Jacobians (for reasons of physics or reasons of numerics)
  Jacobian of related discretization (for “fast” solves)
  Jacobian of lower-order discretization (for more stability, less storage)
  Jacobian with “lagged” values for expensive terms (for less computation per degree of freedom)
  Jacobian stored in lower precision (for less memory traffic per preconditioning step)
  Jacobian blocks decomposed for parallelism


Philosophy of Jacobian-free NK, cont.

These motivations are not new; most large-scale application codes also take “short cuts” on the approximate Jacobian operator to be inverted – showing physical intuition
The problem with many codes is that they do not anywhere have an accurate global Jacobian operator; they use only the weak Jacobian
This leads to a weakly nonlinearly converging “defect correction method”
Defect correction:
$B\,\delta u^k = -F(u^k)$
in contrast to preconditioned Newton:
$B^{-1} J(u^k)\,\delta u^k = -B^{-1} F(u^k)$


Physics-based preconditioning

Consider an operator-split algorithm that leaves a first-order splitting error when used as a solver
In the Jacobian-free Newton-Krylov framework, this solver, which maps a residual into a correction, can be regarded as a preconditioner
The true Jacobian is never formed, yet the time-implicit nonlinear residual at each time step can be made as small as needed for nonlinear consistency in long time integrations


Physics-based preconditioning

In Newton iteration, one seeks to obtain a correction (“delta”) to solution, by inverting the Jacobian matrix on (the negative of) the nonlinear residual:
$\delta u^k = -[J(u^k)]^{-1} F(u^k)$
A typical operator-split code also derives a “delta” to the solution, by some implicitly defined means, through a series of implicit and explicit substeps:
$F(u^k) \mapsto \delta u^k$
This implicitly defined mapping from residual to “delta” is a natural preconditioner
Software must accommodate this!


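Ex.: 1D shallow water preconditioning

Define continuity residual for each timestep:
$R_\phi \equiv \frac{\phi^{n+1} - \phi^n}{\Delta t} + \frac{\partial (u\phi)^{n+1}}{\partial x}$
Define momentum residual for each timestep:
$R_{u\phi} \equiv \frac{(u\phi)^{n+1} - (u\phi)^n}{\Delta t} + \frac{\partial (u^2\phi)^n}{\partial x} + g\phi^n \frac{\partial \phi^{n+1}}{\partial x}$
Continuity delta-form (*):
$\frac{\delta\phi}{\Delta t} + \frac{\partial [\delta(u\phi)]}{\partial x} = -R_\phi$
Momentum delta form (**):
$\frac{\delta(u\phi)}{\Delta t} + g\phi^n \frac{\partial [\delta\phi]}{\partial x} = -R_{u\phi}$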


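1D Shallow water preconditioning, cont.

Solving (**) for $\delta(u\phi)$ and substituting into (*),
$\delta\phi - (\Delta t)^2 \frac{\partial}{\partial x}\left[g\phi^n \frac{\partial (\delta\phi)}{\partial x}\right] = -\Delta t\,R_\phi + (\Delta t)^2 \frac{\partial R_{u\phi}}{\partial x}$
After this parabolic equation is solved for $\delta\phi$, we have
$\delta(u\phi) = -\Delta t\left[g\phi^n \frac{\partial (\delta\phi)}{\partial x} + R_{u\phi}\right]$
This completes the application of the preconditioner to one Newton-Krylov iteration at one timestep
Of course, the parabolic solve need not be done exactly; one sweep of multigrid can be used
See paper by Mousseau et al. (2002) in Ref [1] for impressive results for longtime weather integration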


Operator-split preconditioning

Subcomponents of a PDE operator often have special structure that can be exploited if they are treated separately
Algebraically, this is just a generalization of Schwarz, by term instead of by subdomain
Suppose $J = \tau^{-1} I + S + R$ and a preconditioner is to be constructed, where $\tau^{-1} I + S$ and $I + \tau R$ are each “easy” to invert
Form a preconditioned vector from $u$ as follows:
$(I + \tau R)^{-1} (\tau^{-1} I + S)^{-1} u$
Equivalent to replacing $J$ with $\tau^{-1} I + S + R + \tau S R$
First-order splitting error, yet often used as a solver!
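The size of the splitting error can be checked numerically. The sketch below assumes the split form J = τ⁻¹I + S + R with easily inverted factors τ⁻¹I + S and I + τR. The symbol τ and this exact factorization are a reading of the slide, whose notation is not fully legible in this transcript. The 2×2 blocks `S` and `R` are made-up numbers.

```python
def matmul(X, Y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


tau = 0.1
I = [[1.0, 0.0], [0.0, 1.0]]
S = [[2.0, -1.0], [-1.0, 2.0]]  # hypothetical "diffusion-like" block
R = [[0.5, 0.3], [0.0, 0.7]]    # hypothetical "reaction-like" block

# True operator and its two easily inverted factors.
J = [[I[i][j] / tau + S[i][j] + R[i][j] for j in range(2)] for i in range(2)]
left = [[I[i][j] / tau + S[i][j] for j in range(2)] for i in range(2)]   # tau^{-1} I + S
right = [[I[i][j] + tau * R[i][j] for j in range(2)] for i in range(2)]  # I + tau R

J_split = matmul(left, right)  # the operator the split preconditioner inverts
SR = matmul(S, R)
splitting_error = [[J_split[i][j] - J[i][j] for j in range(2)] for i in range(2)]
expected = [[tau * SR[i][j] for j in range(2)] for i in range(2)]
print(splitting_error)
print(expected)
```

The difference between the factored operator and J is exactly τSR, so it shrinks linearly with τ. That is harmless in a preconditioner but shows up as a first-order error when the split operator is used as the solver itself.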


Operator-split preconditioning, cont.

Suppose S is convection-diffusion and R is reaction, among a collection of fields stored as gridfunctions
On a small regular 2D grid with a five-point stencil:
  R is trivially invertible in block diagonal form
  S is invertible with one multilevel solve per field

[Figure: sparsity structure of J = S + R]


Operator-split preconditioning, cont.

Preconditioners assembled from just the “strong” elements of the Jacobian, alternating the source term and the diffusion term operators, are competitive in convergence rates with block-ILU on the Jacobian
  particularly, since the decoupled scalar diffusion systems are amenable to simple multigrid treatment – not as trivial for the coupled system
The decoupled preconditioners store many fewer elements and significantly reduce memory bandwidth requirements and are expected to be much faster per iteration when carefully implemented
See “alternative block factorization” by Bank et al. in Ref [1]; incorporated into SciDAC TSI solver by D’Azevedo


Using Jacobian of related discretization

To precondition a variable coefficient operator, such as $\nabla\cdot(\alpha\nabla)$, use $\bar{\alpha}\nabla^2$, based on a constant coefficient average
Brown & Saad (1980) showed that, because of the availability of fast solvers, it may even be acceptable to use $\nabla^2$ to precondition something like
$\nabla^2(\cdot) + u\,\frac{\partial(\cdot)}{\partial x} + v\,\frac{\partial(\cdot)}{\partial y}$


Using Jacobian of lower order discretization

Orszag popularized the use of linear finite element discretizations as preconditioners for high-order spectral element discretizations in the 1970s; both approach the same continuous operator
It is common in CFD to employ first-order upwinded convective operators as approximate inversions for higher-order operators:
  better factorization stability
  smaller matrix bandwidth and complexity
With Jacobian-free NK, we can have the best of both worlds – a stable factorization/cheap solve and a true Jacobian step


Using Jacobian with lagged terms

Newton-chord methods (e.g., papers by Smooke et al.) “freeze” the Jacobian matrices:
  saves Jacobian evaluation and factorization, which can be up to 90% of the running time of the code in some apps
  however, nonlinear convergence degrades to linear rate
In Jacobian-free NK, we can “freeze” some or all of the terms in the Jacobian preconditioner, while always accessing the action of the true Jacobian for the Krylov matrix-vector multiply:
  still saves Jacobian work
  maintains asymptotically quadratic rate for nonlinear convergence
See Knoll-Keyes (2002) for example with coupled edge plasma and Navier-Stokes, showing five-fold improvement over full Newton with constantly refreshed Jacobian on LHS, versus JFNK with preconditioner refreshed once each ten timesteps


Using Jacobian with lower precision elements

Memory bandwidth is the critical architectural parameter for sparse linear algebra computations
Storing the preconditioner elements in single precision effectively doubles memory bandwidth (and potentially halves runtime) for this critical phase
We still form the Jacobian-vector product with full precision and “zero-pad” the preconditioner elements back to full length in the arithmetic unit, so the numerical quality of the Krylov subspace does not degrade


Memory BW bottleneck revealed via precision reduction

Execution times for unstructured NKS Euler Simulation on Origin 2000: double precision matrices versus single precision preconditioner

Number of      Linear Solve          Overall
Processors     Double    Single      Double    Single
16             223s      136s        746s      657s
32             117s      67s         373s      331s
64             60s       34s         205s      181s
120            31s       16s         122s      106s

Note that times are nearly halved, along with precision, for the BW-limited linear solve phase, indicating that the BW can be at least doubled before hitting the next bottleneck!
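The claim that linear-solve times are nearly halved can be quantified from the table. The sketch below assumes the column assignment is processors, then linear-solve double and single, then overall double and single, and prints the double-to-single time ratio for each processor count.

```python
# (processors, linear-solve double, linear-solve single,
#  overall double, overall single), all times in seconds
timings = [
    (16, 223, 136, 746, 657),
    (32, 117, 67, 373, 331),
    (64, 60, 34, 205, 181),
    (120, 31, 16, 122, 106),
]

ratios = []
for procs, lin_d, lin_s, all_d, all_s in timings:
    linear_ratio = lin_d / lin_s    # gain in the bandwidth-limited phase
    overall_ratio = all_d / all_s   # gain for the whole run
    ratios.append((procs, linear_ratio, overall_ratio))
    print(f"{procs:4d} procs: linear solve {linear_ratio:.2f}x, "
          f"overall {overall_ratio:.2f}x")
```

The linear-solve ratio approaches 2 as the per-processor problem shrinks, while the overall gain is smaller because the other phases do not benefit from the single-precision preconditioner.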


PETSc and Hypre combined in “Terascale Optimal PDE Simulations” (TOPS) ISIC

Nine institutions, five years, 24 co-PIs


Scope for TOPS

Design and implementation of “solvers”
  Time integrators (w/ sens. anal.): $f(\dot{x}, x, t, p) = 0$
  Nonlinear solvers (w/ sens. anal.): $F(x, p) = 0$
  Optimizers: $\min_u \phi(x,u)$ s.t. $F(x,u) = 0$, $u \ge 0$
  Linear solvers: $Ax = b$
  Eigensolvers: $Ax = \lambda Bx$
Software integration
Performance optimization

[Figure: dependence diagram among Optimizer, Sens. Analyzer, Time integrator, Nonlinear solver, Eigensolver, and Linear solver; an arrow indicates dependence]


Ex.: Hall magnetic reconnection

Magnetic Reconnection: Applications to Sawtooth Oscillations, Error Field Induced Islands and the Dynamo Effect

The research goals of this project include producing a unique high performance code and using this code to study magnetic reconnection in astrophysical plasmas, in smaller scale laboratory experiments, and in fusion devices. The modular code that will be developed will be a fully three-dimensional, compressible Hall MHD code with options to run in slab, cylindrical and toroidal geometry and flexible enough to allow change in algorithms as needed. The code will use adaptive grid refinement, will run on massively parallel computers, and will be portable and scalable. The research goals include studies that will provide increased understanding of sawtooth oscillations in tokamaks, magnetotail substorms, error-fields in tokamaks, reverse field pinch dynamos, astrophysical dynamos, and laboratory reconnection experiments.

PI: Amitava Bhattacharjee, University of Iowa


Summary of progress on CMRS

CMRS team has provided TOPS with discretization of model 2D multicomponent MHD evolution code in PETSc’s FormFunctionLocal format using DMMG and automatic differentiation for Jacobian objects
TOPS has implemented fully nonlinearly implicit GMRES-MG-ILU parallel solver with custom deflation of nullspace in CMRS’s doubly periodic formulation
CMRS and TOPS reproduce the same dynamics on the same grids with the same time-stepping, up to a finite-time singularity due to collapse of current sheet (that falls below presently uniform mesh resolution)
TOPS code, being implicit, can choose timesteps an order of magnitude larger, with potential for higher ratio in more physically realistic parameter regimes, but is still slower in wall-clock time
PLAN: tune PETSc solver by profiling, blocking, reuse, etc.
PLAN: go to higher-order in time
PLAN: identify the numerical complexity benefits from implicitness (in suppressing fast timescales) and quantify (explicit versus implicit)
PLAN (with APDEC team): incorporate AMR


2D Hall MHD sawtooth instability

Model equations: (Porcelli et al., 1993, 1999)
Equilibrium:

[Figures: vorticity, early time; vorticity, later time, with zoom; figures c/o A. Bhattacharjee, CMRS]

ex29.c in PETSc 2.5.1


Time-implicit Newton-Krylov-Schwarz

For nonlinear robustness, NKS iteration is wrapped in time-stepping:

    for (l = 0; l < n_time; l++) {                  // pseudo-time loop
      select time step
      for (k = 0; k < n_Newton; k++) {              // NKS loop
        compute nonlinear residual and Jacobian
        for (j = 0; j < n_Krylov; j++) {
          forall (i = 0; i < n_Precon; i++) {
            solve subdomain problems concurrently
          } // End of loop over subdomains
          perform Jacobian-vector product
          enforce Krylov basis conditions
          update optimal coefficients
          check linear convergence
        } // End of linear solver
        perform DAXPY update
        check nonlinear convergence
      } // End of nonlinear loop
    } // End of time-step loop


PETSc’s DMMG in Hall MHD application

Mesh and time refinement studies of CMRS Hall magnetic reconnection model problem (4 mesh sizes; dt=0.1 (nondimensional, near CFL limit for fastest wave) on left, dt=0.8 on right)
Measure of functional inverse to thickness of current sheet versus time, for 0<t<200 (nondimensional), where singularity occurs around t=215


PETSc’s DMMG in Hall MR application, cont.

Implicit timestep increase studies of CMRS Hall magnetic reconnection model problem, on finest (192×192) mesh of previous slide, in absolute magnitude, rather than semi-log


Ex.: Computational aerodynamics

mesh c/o D. Mavriplis, ICASE

Implemented in PETSc

www.mcs.anl.gov/petsc

Transonic “Lambda” Shock, Mach contours on surfaces


Fixed-size Parallel Scaling Results

c/o K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith

[Figure: fixed-size scaling plot; annotations: 128 nodes, 43 min; 3072 nodes, 2.5 min, 226 Gf/s; 11M unknowns, 15 µs/unknown, 70% efficient; “Four orders of magnitude in 13 years”]

This scaling study, featuring our widest range of processor number, was done for the incompressible case.


Scaling to new architectures w/Bulk Synchronous Processing (BSP) model

Phases of a Krylov iteration on processors P1, P2, …, Pn: local scatter, Jac-vec multiply, precond sweep, daxpys, inner products

What happens if, for instance, in this (schematicized) iteration, arithmetic speed is doubled, scalar all-gather is quartered, and local scatter is cut by one-third? Each phase is considered separately. Answer is to the right.

[Figure: per-processor timelines (P1, P2, …, Pn) of the Krylov iteration, before and after the changes]
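The phase-by-phase reasoning can be made concrete with a toy calculation. The baseline phase times below are invented for illustration and are not taken from the talk. Only the three scaling factors come from the question posed on the slide.

```python
# Hypothetical per-iteration phase times (arbitrary units); not from the talk.
baseline = {"local scatter": 3.0, "arithmetic": 8.0, "scalar all-gather": 4.0}

# Changes posed on the slide: arithmetic speed doubled, all-gather time
# quartered, local scatter time cut by one-third.
scale = {
    "local scatter": 2.0 / 3.0,
    "arithmetic": 1.0 / 2.0,
    "scalar all-gather": 1.0 / 4.0,
}

# In a bulk synchronous model the phases do not overlap, so each phase is
# rescaled separately and the iteration time is the sum of the phases.
new = {phase: t * scale[phase] for phase, t in baseline.items()}
old_total = sum(baseline.values())
new_total = sum(new.values())
speedup = old_total / new_total
print(f"iteration time {old_total:.1f} -> {new_total:.1f}, speedup {speedup:.2f}x")
```

Because the phases add, the overall speedup is a weighted blend of the per-phase factors, dominated by whichever phase took the most time at baseline.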


Primitive BSP-like BG/L extrapolation*

[Figure: bar chart of total time and comm. time (0 to 400) for 1024 Red, 1024 BG/L, 2048 BG/L, 4096 BG/L, and 8192 BG/L]

Data pairs for 1024, 2048, 4096, and 8192 BG/L nodes: 1.0 (11%), 2.0 (11%), 3.7 (20%), 7.9 (25%)

*performed by A. Sugavanam; data pair is speedup (w.r. to 1024 BG/L nodes) and communication %


Conclusions

Domain decomposition and multilevel iteration are the dominant paradigm in contemporary terascale PDE simulation
Several freely available software toolkits exist, and successfully scale to thousands of tightly coupled processors for problems on quasi-static meshes
Concerted efforts are underway to make elements of these toolkits interoperate, and to allow expression of the best methods, which tend to be modular, hierarchical, recursive, and above all, adaptive!
Tunability of NKS algorithmics allows solver adaptation to application/architecture combinations
Next generation software should incorporate “best practices” in applications as preconditioners


Acknowledgments

Collaborators or Contributors:
  Xiao-Chuan Cai (Univ. Colorado, Boulder)
  Omar Ghattas (Carnegie-Mellon)
  Dinesh Kaushik (ODU)
  Dana Knoll (LANL)
  Dimitri Mavriplis (ICASE)
  PETSc team at Argonne National Laboratory: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith
Sponsors: DOE, NASA, NSF
Computer Resources: LLNL, LANL, SNL, NERSC


Related URLs

Personal homepage: papers, talks, etc.
  http://www.math.odu.edu/~keyes
SciDAC initiative
  http://www.science.doe.gov/scidac
TOPS software project
  http://www.tops-scidac.org
PETSc software project
  http://www.mcs.anl.gov/petsc
Hypre software project
  http://www.llnl.gov/CASC/hypre

Slides from 14-hour Peking University CS&E short course with Bill Gropp (in August 2002) on-line

http://www.mcs.anl.gov/petsc-fun3d











Bibliography

Jacobian-Free Newton-Krylov Methods: Approaches and Applications, Knoll & Keyes, 2002, submitted to J. Comp. Phys.

Nonlinearly Preconditioned Inexact Newton Algorithms, Cai & Keyes, 2002, SIAM J. Sci. Comp. 24:183-200

High Performance Parallel Implicit CFD, Gropp, Kaushik, Keyes & Smith, 2001, Parallel Computing 27:337-362

Four Horizons for Enhancing the Performance of Parallel Simulations based on Partial Differential Equations, Keyes, 2000, Lect. Notes Comp. Sci., Springer, 1900:1-17

Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel CFD, Gropp, Keyes, McInnes & Tidriri, 2000, Int. J. High Performance Computing Applications 14:102-136

Achieving High Sustained Performance in an Unstructured Mesh CFD Application, Anderson, Gropp, Kaushik, Keyes & Smith, 1999, Proceedings of SC'99

Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in “Parallel Solution of Partial Differential Equations,” Springer, pp. 247-278

How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in “Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods,” Domain Decomposition Press, pp. 286-297

