Scalable Solvers and Software for PDE Applications


Page 1: Scalable Solvers and Software  for PDE Applications

Scalable Solvers and Software for PDE Applications

The Pennsylvania State University, 29 April 2003

David E. Keyes
Center for Computational Science, Old Dominion University
& Institute for Scientific Computing Research, Lawrence Livermore National Laboratory

Page 2: Scalable Solvers and Software  for PDE Applications

Happy Poincaré's Birthday!
Born: 29 April 1854, Nancy
Fundamental contributions to topology, analysis, number theory, potential theory, quantum theory, fluid mechanics, the special theory of relativity, and the philosophy of science
Académie des Sciences, 1887 (President, 1906)
Fellow, Royal Society, 1894
Died: 17 July 1912, Paris
"The last universalist in mathematics" - anon.
"It is by logic that we prove; it is by intuition that we invent." - Henri Poincaré

Page 3: Scalable Solvers and Software  for PDE Applications

Plan of presentation
Imperative of "optimal" algorithms for terascale computing
Basic domain decomposition and multilevel algorithmic concepts
Examples of applications
Example of Bell Prize-winning solver performance on ASCI platforms
Conclusions and outlook

Page 4: Scalable Solvers and Software  for PDE Applications

Motivation: optimality
Convergence rate nearly independent of discretization parameters:
multilevel schemes for rapid linear convergence of linear problems
Newton-like schemes for quadratic convergence of nonlinear problems
Convergence rate as independent as possible of physical parameters:
continuation schemes
physics-based preconditioning

[Figure: Time to Solution versus Problem Size (increasing with number of processors), contrasting unscalable growth with nearly flat scalable behavior; steel/rubber composite, parallel multigrid, c/o M. Adams, Berkeley-Sandia]

The solver is a key part, but not the only part, of the simulation that needs to be scalable

Page 5: Scalable Solvers and Software  for PDE Applications

Why optimal algorithms?
The more powerful the computer, the greater the importance of optimality
Example:
Suppose Alg1 solves a problem in time CN^2, where N is the input size
Suppose Alg2 solves the same problem in time CN
Suppose that the machine on which Alg1 and Alg2 have been parallelized to run has 10,000 processors
In constant time (compared to serial), Alg1 can run a problem 100X larger, whereas Alg2 can run a problem 10,000X larger (a small arithmetic sketch follows)
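For concreteness, here is a minimal Python sketch (illustrative, not part of the original talk) of the constant-time scaling arithmetic above; the baseline size N0 and processor count P are assumptions chosen to match the example.

# Largest problem solvable in the original serial time, given P processors and
# perfect parallelization: N grows like sqrt(P) for an O(N^2) algorithm but like P
# for an O(N) algorithm.
N0 = 1_000           # problem size the serial machine handles in time C*N0**2 or C*N0
P = 10_000           # number of processors

N_alg1 = N0 * P**0.5     # from C*N**2 / P = C*N0**2  ->  N = N0 * sqrt(P)
N_alg2 = N0 * P          # from C*N / P = C*N0        ->  N = N0 * P

print(f"Alg1 (O(N^2)): problem {N_alg1 / N0:,.0f}X larger in constant time")
print(f"Alg2 (O(N)):   problem {N_alg2 / N0:,.0f}X larger in constant time")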

Page 6: Scalable Solvers and Software  for PDE Applications

Why optimal?, cont.
Alternatively, filling the machine's memory, Alg1 requires 100X time, whereas Alg2 runs in constant time
Is 10,000 processors a reasonable expectation?
Yes, we have it today (ASCI White (IBM), Red Storm (Cray))!
Could computational scientists really use 10,000X scaling?
Of course; we are approximating the continuum
A grid for weather prediction allows points every 1 km versus every 100 km on the earth's surface
In 2D 10,000X disappears fast; in 3D even faster
However, these machines are expensive (Earth Simulator is $0.5B, plus ongoing operating costs), and optimal algorithms are the only algorithms that we can afford to run on them

Page 7: Scalable Solvers and Software  for PDE Applications

Decomposition strategies for $Lu = f$ in $\Omega$
Operator decomposition: $L = \sum_k L_k$
Function space decomposition: $f = \sum_k f_k \Phi_k$, $u = \sum_k u_k \Phi_k$
Domain decomposition: $\Omega = \bigcup_k \Omega_k$
Consider, e.g., the implicitly discretized parabolic case: $[I + \tau (L_x + L_y)]\, u^{k+1} = u^k + \tau f$

Page 8: Scalable Solvers and Software  for PDE Applications

Operator decomposition
Consider ADI:
$[I + \tfrac{\tau}{2} L_x]\, u^{k+1/2} = [I - \tfrac{\tau}{2} L_y]\, u^{k} + \tfrac{\tau}{2} f$
$[I + \tfrac{\tau}{2} L_y]\, u^{k+1} = [I - \tfrac{\tau}{2} L_x]\, u^{k+1/2} + \tfrac{\tau}{2} f$
The iteration matrix consists of four sequential ("multiplicative") substeps per timestep:
two sparse matrix-vector multiplies
two sets of unidirectional bandsolves
Parallelism within each substep
But global data exchanges between bandsolve substeps

Page 9: Scalable Solvers and Software  for PDE Applications

Function space decomposition
Consider a spectral Galerkin method:
$u(x, y, t) = \sum_{j=1}^{N} a_j(t)\, \Phi_j(x, y)$
$\frac{d}{dt}(u, \Phi_i) + (Lu, \Phi_i) = (f, \Phi_i), \quad i = 1, \ldots, N$
$\sum_j \frac{d a_j}{dt}(\Phi_j, \Phi_i) + \sum_j a_j (L \Phi_j, \Phi_i) = (f, \Phi_i), \quad i = 1, \ldots, N$
$\frac{da}{dt} = -M^{-1} K a + M^{-1} f$
System of ordinary differential equations
Perhaps $M \equiv [(\Phi_j, \Phi_i)]$ and $K \equiv [(L\Phi_j, \Phi_i)]$ are diagonal matrices
Perfect parallelism across spectral index
But global data exchanges to transform back to physical variables at each step

Page 10: Scalable Solvers and Software  for PDE Applications

Domain decomposition
Consider restriction and extension operators $R_i, R_i^T$ for subdomains, and $R_0, R_0^T$ for a possible coarse grid
Replace the discretized $Au = f$ with $B^{-1}Au = B^{-1}f$, where
$B^{-1} = R_0^T A_0^{-1} R_0 + \sum_i R_i^T A_i^{-1} R_i, \qquad A_i = R_i A R_i^T$
Solve by a Krylov method, e.g., CG
Matrix-vector multiplies with $B^{-1}A$ require:
parallelism on each subdomain
nearest-neighbor exchanges, global reductions
a possible small global system (not needed for parabolic case)
(a small illustrative sketch follows)
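To make the formula concrete, here is a minimal Python sketch (illustrative, not part of the original talk) of one-level additive Schwarz preconditioning applied inside CG for a 1D Poisson model problem; the coarse term $R_0^T A_0^{-1} R_0$ is omitted for brevity, and the problem size, overlap, and names are assumptions.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 64
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)    # discretized operator
f = np.ones(n)

overlap = 4                                            # two overlapping subdomains
subdomains = [np.arange(0, n//2 + overlap), np.arange(n//2 - overlap, n)]
local_mats = [A[np.ix_(idx, idx)] for idx in subdomains]   # A_i = R_i A R_i^T

def apply_Binv(r):
    # gather each subdomain residual, solve locally, scatter-add the corrections
    z = np.zeros_like(r)
    for idx, A_i in zip(subdomains, local_mats):
        z[idx] += np.linalg.solve(A_i, r[idx])         # R_i^T A_i^{-1} R_i r
    return z

u, info = cg(A, f, M=LinearOperator((n, n), matvec=apply_Binv))
print("CG converged:", info == 0, " ||f - Au|| =", np.linalg.norm(f - A @ u))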

Page 11: Scalable Solvers and Software  for PDE Applications

Comparison
Operator decomposition (ADI):
natural row-based assignment requires all-to-all, bulk data exchanges in each step (for transpose)
Function space decomposition (Fourier):
natural mode-based assignment requires all-to-all, bulk data exchanges in each step (for transform)
Domain decomposition (Schwarz):
natural domain-based assignment requires local, nearest-neighbor data exchanges, global reductions, and an optional small global problem

Page 12: Scalable Solvers and Software  for PDE Applications

Theoretical scaling of domain decomposition (for three common network topologies)
With logarithmic-time (hypercube- or tree-based) global reductions and scalable nearest-neighbor interconnects:
optimal number of processors scales linearly with problem size ("scalable"; assumes one subdomain per processor)
With power-law-time (3D torus-based) global reductions and scalable nearest-neighbor interconnects:
optimal number of processors scales as the three-fourths power of problem size ("almost scalable")
With linear-time (common bus) network:
optimal number of processors scales as the one-fourth power of problem size (not scalable)
bad news for conventional Beowulf clusters, but see the 2000 & 2001 Bell Prize "price-performance awards" using multiple commodity NICs per Beowulf node!

Page 13: Scalable Solvers and Software  for PDE Applications

Three Basic Concepts
Iterative correction
Schwarz preconditioning
Schur preconditioning
Some "Advanced" Concepts
Polynomial combinations of Schwarz projections
Schwarz-Schur combinations:
Schwarz on Schur-reduced system
Schwarz inside Schur-reduced system
Nonlinear Schwarz

Page 14: Scalable Solvers and Software  for PDE Applications

Iterative correction
The most basic idea in iterative methods:
$u \leftarrow u + B^{-1}(f - Au)$
Evaluate the residual accurately, but solve approximately, where $B^{-1}$ is an approximate inverse to $A$
A sequence of complementary solves can be used, e.g., with $B_1$ first and then $B_2$; one has
$u \leftarrow u + [B_1^{-1} + B_2^{-1} - B_2^{-1} A B_1^{-1}](f - Au)$
Optimal polynomials of $B^{-1}A$ lead to various preconditioned Krylov methods
Scale recurrence, e.g., with $B_2^{-1} = R^T (R A R^T)^{-1} R$ (a coarse-grid solve), leads to multilevel methods
(a small sketch follows)
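The complementary-solve idea can be shown in a few lines of Python (illustrative, not part of the original talk): a damped Jacobi correction alone versus the same correction followed by a coarse solve $B_2^{-1} = R^T (R A R^T)^{-1} R$, with $R^T$ taken as linear interpolation; sizes and the damping factor are assumptions.

import numpy as np

nc = 31                                  # coarse points
n = 2*nc + 1                             # fine points
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
f = np.ones(n)

P = np.zeros((n, nc))                    # prolongation R^T (linear interpolation)
for j in range(nc):
    P[2*j, j], P[2*j+1, j], P[2*j+2, j] = 0.5, 1.0, 0.5
A0 = P.T @ A @ P                         # coarse operator R A R^T

B1_inv = np.diag(1.0 / np.diag(A))       # Jacobi: a cheap approximate inverse

def correction_sweep(u, with_coarse):
    u = u + 0.8 * (B1_inv @ (f - A @ u))                      # u <- u + B1^{-1}(f - Au)
    if with_coarse:                                           # complementary solve with B2
        u = u + P @ np.linalg.solve(A0, P.T @ (f - A @ u))
    return u

for with_coarse in (False, True):
    u = np.zeros(n)
    for _ in range(20):
        u = correction_sweep(u, with_coarse)
    print(f"coarse correction = {with_coarse}: ||f - Au|| = {np.linalg.norm(f - A @ u):.2e}")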

Page 15: Scalable Solvers and Software  for PDE Applications

Multilevel preconditioning
[Figure: a multigrid V-cycle. A smoother is applied on the finest grid; Restriction transfers the residual from the fine grid to the first coarse grid (the coarser grid has fewer cells, hence less work and storage); the idea is applied recursively until we have an easy problem to solve; Prolongation transfers the correction from coarse back to fine grid]
(a recursive sketch follows)
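Extending the two-grid correction above recursively gives a V-cycle. A minimal Python sketch (illustrative, not the talk's code), assuming a 1D Poisson model problem, damped-Jacobi smoothing, and Galerkin coarsening:

import numpy as np

def poisson(n):
    return 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def prolongation(nc):
    P = np.zeros((2*nc + 1, nc))
    for j in range(nc):
        P[2*j, j], P[2*j+1, j], P[2*j+2, j] = 0.5, 1.0, 0.5
    return P

def v_cycle(A, f, u, levels, nu=2):
    if levels == 0 or A.shape[0] <= 3:
        return np.linalg.solve(A, f)                 # easy problem: solve directly
    for _ in range(nu):                              # pre-smoothing
        u = u + 0.6 * (f - A @ u) / np.diag(A)
    nc = (A.shape[0] - 1) // 2
    P = prolongation(nc)
    Ac = P.T @ A @ P                                 # Galerkin coarse operator
    rc = P.T @ (f - A @ u)                           # restrict the residual
    u = u + P @ v_cycle(Ac, rc, np.zeros(nc), levels - 1, nu)   # coarse correction
    for _ in range(nu):                              # post-smoothing
        u = u + 0.6 * (f - A @ u) / np.diag(A)
    return u

n = 2**7 - 1
A, f, u = poisson(n), np.ones(n), np.zeros(n)
for k in range(8):                                   # V-cycles as a stationary iteration
    u = v_cycle(A, f, u, levels=5)                   # (one cycle is typically used as a preconditioner)
    print(f"cycle {k+1}: ||f - Au|| = {np.linalg.norm(f - A @ u):.2e}")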

Page 16: Scalable Solvers and Software  for PDE Applications

Example of Hypre's scaled efficiency
[Figure: PFMG-CG on Red, 40x40x40 points per processor; scaled efficiency (0 to 1) of the Setup and Solve phases versus number of processors (up to about 4000), i.e., problem sizes ranging from 64K to 200M DOFs]

Page 17: Scalable Solvers and Software  for PDE Applications

Schwarz preconditioning
Given $Ax = b$, partition $x$ into subvectors, corresponding to subdomains of the domain of the PDE, nonempty, possibly overlapping, whose union is all of the elements of $x \in \mathbb{R}^n$
Let the Boolean rectangular matrix $R_i$ extract the $i$th subset of $x$: $x_i = R_i x$
Let $A_i = R_i A R_i^T$ and $B^{-1} = \sum_i R_i^T A_i^{-1} R_i$
The Boolean matrices are gather/scatter operators, mapping between a global vector and its subdomain support

Page 18: Scalable Solvers and Software  for PDE Applications

SPMD parallelism w/ domain decomposition
Partitioning of the grid induces block structure on the Jacobian
[Figure: a grid partitioned among processors 1, 2, and 3; the rows assigned to proc "2" contain the blocks A21, A22, A23]

Page 19: Scalable Solvers and Software  for PDE Applications

Iteration count estimates from the Schwarz theory
In terms of N and P, where for d-dimensional isotropic problems N = h^{-d} and P = H^{-d}, for mesh parameter h and subdomain diameter H, iteration counts may be estimated as follows:

Preconditioning Type        in 2D            in 3D
Point Jacobi                O(N^{1/2})       O(N^{1/3})
Domain Jacobi (δ = 0)       O((NP)^{1/4})    O((NP)^{1/6})
1-level Additive Schwarz    O(P^{1/2})       O(P^{1/3})
2-level Additive Schwarz    O(1)             O(1)

Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square-root of the condition number of the Schwarz-preconditioned system

Page 20: Scalable Solvers and Software  for PDE Applications

Comments on the Schwarz theory
Basic Schwarz estimates are for:
self-adjoint operators with smooth coefficients
positive definite operators
exact subdomain solves, two-way overlapping with $R_i, R_i^T$
generous overlap, $\delta = O(H)$ (otherwise the 2-level result is $O(1 + H/\delta)$)
Extensible to:
nonself-adjointness (e.g., convection) and jumping coefficients
indefiniteness (e.g., wave Helmholtz)
inexact subdomain solves (approximate $A_i^{-1}$)
one-way overlap communication ("restricted additive Schwarz")
small overlap

Page 21: Scalable Solvers and Software  for PDE Applications

Schur preconditioning
Given a partition
$\begin{bmatrix} A_{ii} & A_{i\Gamma} \\ A_{\Gamma i} & A_{\Gamma\Gamma} \end{bmatrix} \begin{bmatrix} u_i \\ u_\Gamma \end{bmatrix} = \begin{bmatrix} f_i \\ f_\Gamma \end{bmatrix}$
Condense: $S u_\Gamma = g$, where $S = A_{\Gamma\Gamma} - A_{\Gamma i} A_{ii}^{-1} A_{i\Gamma}$ and $g = f_\Gamma - A_{\Gamma i} A_{ii}^{-1} f_i$
Let $M$ be a good preconditioner for $S$
Then $\begin{bmatrix} A_{ii} & A_{i\Gamma} \\ 0 & M \end{bmatrix}^{-1}$ is a preconditioner for $A$
Moreover, solves with $A_{ii}$ may be done approximately if all degrees of freedom are retained
(a small sketch follows)
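A minimal Python sketch (illustrative, not part of the original talk) of the condensation step, assuming an arbitrary SPD block system; block sizes and names are assumptions.

import numpy as np

rng = np.random.default_rng(0)
ni, ng = 8, 3                                    # interior and interface block sizes
M = rng.standard_normal((ni + ng, ni + ng))
A = M @ M.T + (ni + ng) * np.eye(ni + ng)        # a generic SPD test matrix
f = rng.standard_normal(ni + ng)

Aii, Aig = A[:ni, :ni], A[:ni, ni:]
Agi, Agg = A[ni:, :ni], A[ni:, ni:]
fi, fg = f[:ni], f[ni:]

S = Agg - Agi @ np.linalg.solve(Aii, Aig)        # Schur complement
g = fg - Agi @ np.linalg.solve(Aii, fi)
ug = np.linalg.solve(S, g)                       # interface (Schur) solve
ui = np.linalg.solve(Aii, fi - Aig @ ug)         # back-substitute for the interior

print("condensed solve matches direct solve:",
      np.allclose(np.concatenate([ui, ug]), np.linalg.solve(A, f)))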

Page 22: Scalable Solvers and Software  for PDE Applications

Schwarz polynomials
Polynomials of Schwarz projections that are combinations of additive and multiplicative may be appropriate for certain implementations
We may solve the fine subdomains concurrently and follow with a coarse grid (redundantly/cooperatively):
$u \leftarrow u + \sum_i B_i^{-1}(f - Au)$
$u \leftarrow u + B_0^{-1}(f - Au)$
This leads to algorithm "Hybrid II" in S-B-G '96:
$B^{-1} = B_0^{-1} + (I - B_0^{-1} A)\bigl(\sum_i B_i^{-1}\bigr)$
Convenient for "SPMD" (single program/multiple data)

Page 23: Scalable Solvers and Software  for PDE Applications

Schwarz-on-Schur
Preconditioning the Schur complement is complex in and of itself; Schwarz can be used on the reduced problem
"Neumann-Neumann" algorithm: $M^{-1} = \sum_i D_i R_i^T S_i^{-1} R_i D_i$
"Balancing Neumann-Neumann" algorithm: $M^{-1} = M_0^{-1} + (I - M_0^{-1} S)\bigl(\sum_i D_i R_i^T S_i^{-1} R_i D_i\bigr)(I - S M_0^{-1})$
Multigrid on the Schur complement
Here the $S_i$ are subdomain Schur complements and the $D_i$ are diagonal weighting matrices

Page 24: Scalable Solvers and Software  for PDE Applications

Schwarz-inside-Schur
Consider Newton's method for solving the nonlinear rootfinding problem derived from the necessary conditions for constrained optimization
Constraint: $f(x, u) = 0$, with $f: \mathbb{R}^N \times \mathbb{R}^M \to \mathbb{R}^N$
Objective: $\min_u \phi(x, u)$
Lagrangian: $\phi(x, u) + \lambda^T f(x, u)$, with $\lambda \in \mathbb{R}^N$
Form the gradient of the Lagrangian with respect to each of x, u, and λ:
$\phi_x(x, u) + \lambda^T f_x(x, u) = 0$
$\phi_u(x, u) + \lambda^T f_u(x, u) = 0$
$f(x, u) = 0$

Page 25: Scalable Solvers and Software  for PDE Applications

Schwarz-inside-Schur
Equality constrained optimization leads to the KKT system for states x, designs u, and multipliers λ:
$\begin{bmatrix} W_{xx} & W_{xu} & J_x^T \\ W_{ux} & W_{uu} & J_u^T \\ J_x & J_u & 0 \end{bmatrix} \begin{bmatrix} \delta x \\ \delta u \\ \delta \lambda \end{bmatrix} = - \begin{bmatrix} g_x \\ g_u \\ f \end{bmatrix}$
Then Newton Reduced SQP solves the Schur complement system $H\,\delta u = g$, where $H$ is the reduced Hessian
$H = W_{uu} - J_u^T J_x^{-T} W_{xu} - W_{ux} J_x^{-1} J_u + J_u^T J_x^{-T} W_{xx} J_x^{-1} J_u$
and
$g = -g_u + J_u^T J_x^{-T} g_x + (W_{ux} - J_u^T J_x^{-T} W_{xx}) J_x^{-1} f$
with the state and adjoint updates recovered from $J_x\,\delta x = -f - J_u\,\delta u$ and $J_x^T\,\delta\lambda = -g_x - W_{xx}\,\delta x - W_{xu}\,\delta u$

Page 26: Scalable Solvers and Software  for PDE Applications

Schwarz-inside-Schur, cont.
Problems:
$J_x$ is the Jacobian of a PDE (huge!)
the $W$ blocks involve Hessians of objective and constraints (second derivatives, also huge)
$H$ is unreasonable to form, store, or invert
Solutions:
Use Schur preconditioning on the full system
Form the forward action of Hessians by automatic differentiation (vector-to-vector map)
Form the approximate inverse action of the state Jacobian and its transpose by Schwarz

Page 27: Scalable Solvers and Software  for PDE Applications

Example of PDE-constrained optimization
Lagrange-Newton-Krylov-Schur implemented in Veltisto/PETSc
c/o G. Biros and O. Ghattas
[Figures: wing tip vortices, no control (left) and optimal control (right); optimal boundary controls shown as velocity vectors]
Optimal control of laminar viscous flow:
optimization variables are surface suction/injection
objective is minimum drag
700,000 states; 4,000 controls
128 Cray T3E processors
~5 hrs for optimal solution (~1 hr for analysis)
www.cs.nyu.edu/~biros/veltisto/

Page 28: Scalable Solvers and Software  for PDE Applications

Nonlinear Schwarz preconditioning
Nonlinear Schwarz has Newton both inside and outside and is fundamentally Jacobian-free
It replaces $F(u) = 0$ with a new nonlinear system possessing the same root, $\Phi(u) = 0$
Define a correction $\delta_i(u)$ to the $i$th partition (e.g., subdomain) of the solution vector by solving the following local nonlinear system:
$R_i F(u + \delta_i(u)) = 0$
where $\delta_i(u) \in \mathbb{R}^n$ is nonzero only in the components of the $i$th partition
Then sum the corrections: $\Phi(u) = \sum_i \delta_i(u)$
(a small sketch follows)
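A minimal Python sketch (illustrative, not part of the original talk) of evaluating $\Phi(u)$ for a 1D diffusion-reaction model problem with two non-overlapping partitions; ASPIN would apply an inexact Newton method to $\Phi$, whereas the plain fixed-point sweep below is shown only to exercise the construction. All sizes and names are assumptions.

import numpy as np
from scipy.optimize import fsolve

n = 16
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

def F(u):                                    # global nonlinear residual
    return A @ u + u**3 - b

partitions = [np.arange(0, n//2), np.arange(n//2, n)]

def Phi(u):
    # nonlinear Schwarz function: sum of subdomain corrections delta_i(u)
    corr = np.zeros_like(u)
    for idx in partitions:
        def local(d):                        # R_i F(u + delta_i), delta_i supported on idx
            delta = np.zeros_like(u)
            delta[idx] = d
            return F(u + delta)[idx]
        corr[idx] = fsolve(local, np.zeros(len(idx)))
    return corr

u = np.zeros(n)
for k in range(6):                           # plain nonlinear block-Jacobi sweep, for illustration
    u = u + Phi(u)
    print(f"sweep {k+1}: ||F(u)|| = {np.linalg.norm(F(u)):.2e}")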

Page 29: Scalable Solvers and Software  for PDE Applications

Nonlinear Schwarz, cont.
It is simple to prove that if the Jacobian of F(u) is nonsingular in a neighborhood of the desired root, then $\Phi(u) = 0$ and $F(u) = 0$ have the same unique root
To lead to a Jacobian-free Newton-Krylov algorithm we need to be able to evaluate, for any $u, v \in \mathbb{R}^n$:
the residual $\Phi(u) = \sum_i \delta_i(u)$
the Jacobian-vector product $\Phi'(u)\, v$
Remarkably (Cai-Keyes, 2000), it can be shown that
$\Phi'(u)\, v \approx \bigl(\sum_i R_i^T J_i^{-1} R_i\bigr) J\, v$
where $J = F'(u)$ and $J_i = R_i J R_i^T$
All required actions are available in terms of $F(u)$!

Page 30: Scalable Solvers and Software  for PDE Applications

Example of nonlinear Schwarz
[Figure: nonlinear residual histories for Newton's method and for Additive Schwarz Preconditioned Inexact Newton (ASPIN): Newton has difficulty at a critical Re and stagnates beyond the critical Re; ASPIN converges for all Re]

Page 31: Scalable Solvers and Software  for PDE Applications

"Unreasonable effectiveness" of Schwarz
When does the sum of partial inverses equal the inverse of the sums? When the decomposition is right!
Let $\{r_i\}$ be a complete set of orthonormal row eigenvectors for $A$: $r_i A r_i^T = a_i$, or $A r_i^T = r_i^T a_i$
Then $A = \sum_i r_i^T a_i r_i$ and
$A^{-1} = \sum_i r_i^T a_i^{-1} r_i = \sum_i r_i^T (r_i A r_i^T)^{-1} r_i$, the Schwarz formula!
Good decompositions are a compromise between conditioning and parallel complexity, in practice
(a small numerical check follows)
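A tiny Python check (illustrative, not part of the original talk) of the identity above, with the rows $r_i$ taken as orthonormal eigenvectors of an assumed SPD test matrix:

import numpy as np

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                  # SPD test matrix
_, V = np.linalg.eigh(A)                     # columns of V are orthonormal eigenvectors

Ainv_schwarz = np.zeros((n, n))
for i in range(n):
    r = V[:, i:i+1].T                        # r_i as a 1-by-n "restriction" row
    Ainv_schwarz += r.T @ np.linalg.inv(r @ A @ r.T) @ r

print("matches A^{-1}:", np.allclose(Ainv_schwarz, np.linalg.inv(A)))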

Page 32: Scalable Solvers and Software  for PDE Applications

Newton-Krylov-Schwarz: a parallel PDE "workhorse"
Newton: nonlinear solver, asymptotically quadratic
Krylov: accelerator, spectrally adaptive
Schwarz: preconditioner, parallelizable
Popularized in parallel Jacobian-free form under this name by Cai, Gropp, Keyes & Tidriri (1994); in PETSc since Balay's MS project at ODU (1995)

Page 33: Scalable Solvers and Software  for PDE Applications

Jacobian-free Newton-Krylov method
In the Jacobian-free Newton-Krylov (JFNK) method, a Krylov method solves the linear Newton correction equation, requiring Jacobian-vector products
These are approximated by the Fréchet derivative
$J(u)\, v \approx \frac{1}{\epsilon}\,[F(u + \epsilon v) - F(u)]$
so that the actual Jacobian elements are never explicitly needed, where $\epsilon$ is chosen with a fine balance between approximation and floating point rounding error
Schwarz preconditions, using approximate elements
(a small sketch follows)
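A minimal Python sketch (illustrative, not part of the original talk) of JFNK for a small diffusion-reaction model problem: GMRES sees the Jacobian only through the Fréchet difference above, so no Jacobian matrix is ever formed. The model problem, epsilon, and names are assumptions; a Schwarz preconditioner would be supplied through the M argument.

import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

n = 32
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

def F(u):                                    # nonlinear residual: diffusion + cubic reaction
    return A @ u + u**3 - b

u = np.zeros(n)
for it in range(8):                          # Newton loop
    r = F(u)
    print(f"Newton iter {it}: ||F(u)|| = {np.linalg.norm(r):.2e}")
    if np.linalg.norm(r) < 1e-10:
        break
    def jac_vec(v, u=u, eps=1e-7):           # matrix-free Jacobian action
        return (F(u + eps*v) - F(u)) / eps
    J = LinearOperator((n, n), matvec=jac_vec)
    du, info = gmres(J, -r)                  # unpreconditioned here, for brevity
    u = u + du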

Page 34: Scalable Solvers and Software  for PDE Applications

Philosophy of Jacobian-free NK
To evaluate the linear residual, we use the true F'(u), giving a true Newton step and asymptotic quadratic Newton convergence
To precondition the linear residual, we do anything convenient that uses understanding of the dominant physics/mathematics in the system and respects the limitations of the parallel computer architecture and the cost of various operations:
combinations of operator-split Jacobians (for reasons of physics or reasons of numerics)
Jacobian of a related discretization (for "fast" solves)
Jacobian of a lower-order discretization (for more stability, less storage)
Jacobian with "lagged" values for expensive terms (for less computation per degree of freedom)
Jacobian stored in lower precision (for less memory traffic per preconditioning step)
Jacobian blocks decomposed for parallelism

Page 35: Scalable Solvers and Software  for PDE Applications

Philosophy of Jacobian-free NK, cont.
These motivations are not new; most large-scale application codes also take "short cuts" on the approximate Jacobian operator to be inverted, showing physical intuition
The problem with many codes is that they do not anywhere have an accurate global Jacobian operator; they use only the weak Jacobian
This leads to a weakly nonlinearly converging "defect correction method"
Defect correction: $B\,\delta u^k = -F(u^k)$
in contrast to preconditioned Newton: $B^{-1} J(u^k)\,\delta u^k = -B^{-1} F(u^k)$
(a small comparison sketch follows)
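A minimal Python sketch (illustrative, not part of the original talk) of the contrast: a frozen approximate operator B used in defect-correction mode converges only linearly, while a step with the true Jacobian converges quadratically; the model problem and names are assumptions.

import numpy as np

n = 16
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
F = lambda u: A @ u + u**3 - b               # nonlinear residual
J = lambda u: A + np.diag(3*u**2)            # true Jacobian
B = lambda u: A                              # "weak" Jacobian: the nonlinear term is dropped

for name, op in [("defect correction", B), ("Newton", J)]:
    u = np.zeros(n)
    print(name)
    for k in range(6):
        u = u + np.linalg.solve(op(u), -F(u))
        print(f"  iter {k+1}: ||F(u)|| = {np.linalg.norm(F(u)):.2e}")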

Page 36: Scalable Solvers and Software  for PDE Applications

Physics-based preconditioning
Consider an algorithm that, used as a solver, leaves a first-order splitting error
In the Jacobian-free Newton-Krylov framework, this solver, which maps a residual into a correction, can be regarded as a preconditioner
The true Jacobian is never formed, yet the time-implicit nonlinear residual at each time step can be made as small as needed for nonlinear consistency in long time integrations

Page 37: Scalable Solvers and Software  for PDE Applications

Physics-based preconditioning
In Newton iteration, one seeks to obtain a correction ("delta") to the solution by inverting the Jacobian matrix on (the negative of) the nonlinear residual:
$\delta u^k = -[J(u^k)]^{-1} F(u^k)$
A typical operator-split code also derives a "delta" to the solution, by some implicitly defined means, through a series of implicit and explicit substeps
This implicitly defined mapping $F(u^k) \mapsto \delta u^k$ from residual to "delta" is a natural preconditioner
Software must accommodate this!

Page 38: Scalable Solvers and Software  for PDE Applications

Ex.: 1D shallow water preconditioning
Define the continuity residual for each timestep:
$R_\phi = \frac{\phi^{n+1} - \phi^{n}}{\Delta t} + \frac{\partial (u\phi)^{n+1}}{\partial x}$
Define the momentum residual for each timestep:
$R_{u\phi} = \frac{(u\phi)^{n+1} - (u\phi)^{n}}{\Delta t} + \frac{\partial\, (u\,(u\phi))^{n}}{\partial x} + g\,\phi^{n}\,\frac{\partial \phi^{n+1}}{\partial x}$
Continuity delta-form (*):
$\frac{\delta\phi}{\Delta t} + \frac{\partial\, \delta(u\phi)}{\partial x} = -R_\phi$
Momentum delta-form (**):
$\frac{\delta(u\phi)}{\Delta t} + g\,\phi^{n}\,\frac{\partial\, \delta\phi}{\partial x} = -R_{u\phi}$

Page 39: Scalable Solvers and Software  for PDE Applications

1D shallow water preconditioning, cont.
Solving (**) for $\delta(u\phi)$ and substituting into (*) gives a scalar parabolic equation for $\delta\phi$:
$\frac{\delta\phi}{\Delta t} - \Delta t\,\frac{\partial}{\partial x}\Bigl(g\,\phi^{n}\,\frac{\partial\, \delta\phi}{\partial x}\Bigr) = -R_\phi + \Delta t\,\frac{\partial R_{u\phi}}{\partial x}$
After this parabolic equation is solved for $\delta\phi$, we have
$\delta(u\phi) = -\Delta t\Bigl(g\,\phi^{n}\,\frac{\partial\, \delta\phi}{\partial x} + R_{u\phi}\Bigr)$
This completes the application of the preconditioner to one Newton-Krylov iteration at one timestep
Of course, the parabolic solve need not be done exactly; one sweep of multigrid can be used
See the paper by Mousseau et al. (2002) in Ref. [1] for impressive results for long-time weather integration

Page 40: Scalable Solvers and Software  for PDE Applications

Operator-split preconditioning
Subcomponents of a PDE operator often have special structure that can be exploited if they are treated separately
Algebraically, this is just a generalization of Schwarz, by term instead of by subdomain
Suppose $J = I - S - R$ and a preconditioner is to be constructed, where $(I - S)$ and $(I - R)$ are each "easy" to invert
Form a preconditioned vector from $u$ as follows: $(I - R)^{-1}(I - S)^{-1} u$
Equivalent to replacing $J$ with $(I - S)(I - R)$
First-order splitting error, yet often used as a solver!
(a small sketch follows)
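A minimal Python sketch (illustrative, not part of the original talk), with $S$ and $R$ taken as small random stand-ins for the split terms, showing that the operator-split preconditioner differs from $J$ only by the first-order term $SR$:

import numpy as np

rng = np.random.default_rng(2)
n = 50
dt = 0.05
S = dt * rng.standard_normal((n, n)) / np.sqrt(n)   # stand-in for dt * (convection-diffusion)
R = dt * rng.standard_normal((n, n)) / np.sqrt(n)   # stand-in for dt * (reaction)
I = np.eye(n)
J = I - S - R

B = (I - S) @ (I - R)                                # split preconditioner, B = J + S@R
print("||B^{-1}J - I|| =", np.linalg.norm(np.linalg.solve(B, J) - I, 2))   # O(||S R||), small
print("||S @ R||       =", np.linalg.norm(S @ R, 2))
print("||J - I||       =", np.linalg.norm(J - I, 2))                        # much larger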

Page 41: Scalable Solvers and Software  for PDE Applications

Operator-split preconditioning, cont.
Suppose S is convection-diffusion and R is reaction, among a collection of fields stored as gridfunctions
On a small regular 2D grid with a five-point stencil:
R is trivially invertible in block diagonal form
S is invertible with one multilevel solve per field
[Figure: sparsity patterns of the splitting J = S + R]

Page 42: Scalable Solvers and Software  for PDE Applications

Operator-split preconditioning, cont.
Preconditioners assembled from just the "strong" elements of the Jacobian, alternating the source-term and the diffusion-term operators, are competitive in convergence rates with block-ILU on the Jacobian
particularly since the decoupled scalar diffusion systems are amenable to simple multigrid treatment, not as trivial for the coupled system
The decoupled preconditioners store many fewer elements, significantly reduce memory bandwidth requirements, and are expected to be much faster per iteration when carefully implemented
See "alternative block factorization" by Bank et al. in Ref. [1]; incorporated into the SciDAC TSI solver by D'Azevedo

Page 43: Scalable Solvers and Software  for PDE Applications

Using Jacobian of related discretization
To precondition a variable-coefficient operator, such as $\nabla \cdot (\kappa(x) \nabla)$, use $\bar{\kappa}\, \nabla^2$, based on a constant-coefficient average
Brown & Saad (1980) showed that, because of the availability of fast solvers, it may even be acceptable to use $\nabla^2$ to precondition something like $\frac{\partial}{\partial x}(u\, \cdot) + \frac{\partial}{\partial y}(v\, \cdot) - \nabla^2$

Page 44: Scalable Solvers and Software  for PDE Applications

Using Jacobian of lower-order discretization
Orszag popularized the use of linear finite element discretizations as preconditioners for high-order spectral element discretizations in the 1970s; both approach the same continuous operator
It is common in CFD to employ first-order upwinded convective operators as approximate inversions for higher-order operators:
better factorization stability
smaller matrix bandwidth and complexity
With Jacobian-free NK, we can have the best of both worlds: a stable factorization/cheap solve and a true Jacobian step

Page 45: Scalable Solvers and Software  for PDE Applications

Using Jacobian with lagged terms
Newton-chord methods (e.g., papers by Smooke et al.) "freeze" the Jacobian matrices:
saves Jacobian evaluation and factorization, which can be up to 90% of the running time of the code in some apps
however, nonlinear convergence degrades to a linear rate
In Jacobian-free NK, we can "freeze" some or all of the terms in the Jacobian preconditioner, while always accessing the action of the true Jacobian for the Krylov matrix-vector multiply:
still saves Jacobian work
maintains the asymptotically quadratic rate of nonlinear convergence
See Knoll-Keyes (2002) for an example with coupled edge plasma and Navier-Stokes, in which JFNK with the preconditioner refreshed only once every ten timesteps gives a five-fold improvement over full Newton with a constantly refreshed Jacobian on the LHS

Page 46: Scalable Solvers and Software  for PDE Applications

Using Jacobian with lower-precision elements
Memory bandwidth is the critical architectural parameter for sparse linear algebra computations
Storing the preconditioner elements in single precision effectively doubles memory bandwidth (and potentially halves runtime) for this critical phase
We still form the Jacobian-vector product in full precision and "zero-pad" the preconditioner elements back to full length in the arithmetic unit, so the numerical quality of the Krylov subspace does not degrade
(a small sketch follows)
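A minimal Python sketch (illustrative, not part of the original talk): the operator and Krylov vectors stay in double precision while the preconditioner's stored elements are single precision, promoted back to double before entering the Krylov basis. A dense inverse stands in for the real ILU/multigrid preconditioner data; all names are assumptions.

import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

n = 200
A = 2.0*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)       # double-precision operator
b = np.ones(n)

P32 = np.linalg.inv(A).astype(np.float32)                  # preconditioner elements stored in single precision

def apply_prec(r):
    z = P32 @ r.astype(np.float32)                         # single-precision preconditioner apply
    return z.astype(np.float64)                            # promote back to double

x, info = gmres(A, b, M=LinearOperator((n, n), matvec=apply_prec))
print("converged:", info == 0, " ||b - Ax|| =", np.linalg.norm(b - A @ x))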

Page 47: Scalable Solvers and Software  for PDE Applications

Memory BW bottleneck revealed via precision reduction
Execution times for unstructured NKS Euler simulation on Origin 2000: double-precision matrices versus single-precision preconditioner

Number of Processors | Linear Solve (Double) | Linear Solve (Single) | Overall (Double) | Overall (Single)
16  | 223s | 136s | 746s | 657s
32  | 117s |  67s | 373s | 331s
64  |  60s |  34s | 205s | 181s
120 |  31s |  16s | 122s | 106s

Note that times are nearly halved, along with precision, for the BW-limited linear solve phase, indicating that the BW can be at least doubled before hitting the next bottleneck!

Page 48: Scalable Solvers and Software  for PDE Applications

PETSc and Hypre combined in “Terascale Optimal PDE Simulations” (TOPS) ISIC

Nine institutions, five years, 24 co-PIs

Page 49: Scalable Solvers and Software  for PDE Applications

Scope for TOPS
Design and implementation of "solvers":
Time integrators: $f(\dot{x}, x, t, p) = 0$ (with sensitivity analysis)
Nonlinear solvers: $F(x, p) = 0$ (with sensitivity analysis)
Optimizers: $\min_u \phi(x, u)$ s.t. $F(x, u) = 0,\ u \ge 0$
Linear solvers: $Ax = b$
Eigensolvers: $Ax = \lambda B x$
Software integration
Performance optimization
[Figure: dependence diagram among Optimizer, Sensitivity Analyzer, Time integrator, Nonlinear solver, Eigensolver, and Linear solver; arrows indicate dependence]

Page 50: Scalable Solvers and Software  for PDE Applications

Ex.: Hall magnetic reconnection
Magnetic Reconnection: Applications to Sawtooth Oscillations, Error Field Induced Islands and the Dynamo Effect
The research goals of this project include producing a unique high-performance code and using this code to study magnetic reconnection in astrophysical plasmas, in smaller-scale laboratory experiments, and in fusion devices. The modular code that will be developed will be a fully three-dimensional, compressible Hall MHD code with options to run in slab, cylindrical, and toroidal geometry, and flexible enough to allow change in algorithms as needed. The code will use adaptive grid refinement, will run on massively parallel computers, and will be portable and scalable. The research goals include studies that will provide increased understanding of sawtooth oscillations in tokamaks, magnetotail substorms, error fields in tokamaks, reverse-field pinch dynamos, astrophysical dynamos, and laboratory reconnection experiments.
PI: Amitava Bhattacharjee, University of Iowa

Page 51: Scalable Solvers and Software  for PDE Applications

Summary of progress on CMRS
The CMRS team has provided TOPS with a discretization of a model 2D multicomponent MHD evolution code in PETSc's FormFunctionLocal format, using DMMG and automatic differentiation for Jacobian objects
TOPS has implemented a fully nonlinearly implicit GMRES-MG-ILU parallel solver with custom deflation of the nullspace in CMRS's doubly periodic formulation
CMRS and TOPS reproduce the same dynamics on the same grids with the same time-stepping, up to a finite-time singularity due to collapse of the current sheet (which falls below the presently uniform mesh resolution)
The TOPS code, being implicit, can choose timesteps an order of magnitude larger, with potential for a higher ratio in more physically realistic parameter regimes, but is still slower in wall-clock time
PLAN: tune the PETSc solver by profiling, blocking, reuse, etc.
PLAN: go to higher order in time
PLAN: identify the numerical complexity benefits from implicitness (in suppressing fast timescales) and quantify (explicit versus implicit)
PLAN (with the APDEC team): incorporate AMR

Page 52: Scalable Solvers and Software  for PDE Applications

2D Hall MHD sawtooth instability
Model equations: (Porcelli et al., 1993, 1999)
[Figures: equilibrium configuration; vorticity at an early time and at a later time, with zoom; figures c/o A. Bhattacharjee, CMRS]
ex29.c in PETSc 2.5.1

Page 53: Scalable Solvers and Software  for PDE Applications

Time-implicit Newton-Krylov-Schwarz
For nonlinear robustness, the NKS iteration is wrapped in pseudo-time-stepping:

for (l = 0; l < n_time; l++) {            // time-step loop
  select time step
  for (k = 0; k < n_Newton; k++) {        // nonlinear (Newton) loop
    compute nonlinear residual and Jacobian
    for (j = 0; j < n_Krylov; j++) {      // linear (Krylov) loop
      forall (i = 0; i < n_Precon; i++) {
        solve subdomain problems concurrently
      }                                   // end of loop over subdomains
      perform Jacobian-vector product
      enforce Krylov basis conditions
      update optimal coefficients
      check linear convergence
    }                                     // end of linear solver
    perform DAXPY update
    check nonlinear convergence
  }                                       // end of nonlinear loop
}                                         // end of time-step loop

Page 54: Scalable Solvers and Software  for PDE Applications

PETSc's DMMG in Hall MHD application
Mesh and time refinement studies of the CMRS Hall magnetic reconnection model problem (4 mesh sizes; dt = 0.1, nondimensional, near the CFL limit for the fastest wave, on the left; dt = 0.8 on the right)
Plotted: a measure of the functional inverse to the thickness of the current sheet versus time, for 0 < t < 200 (nondimensional), where the singularity occurs around t = 215

Page 55: Scalable Solvers and Software  for PDE Applications

PETSc's DMMG in Hall MR application, cont.
Implicit timestep increase studies of the CMRS Hall magnetic reconnection model problem, on the finest (192 x 192) mesh of the previous slide, plotted in absolute magnitude rather than semi-log

Page 56: Scalable Solvers and Software  for PDE Applications

Ex.: Computational aerodynamics

mesh c/o D. Mavriplis, ICASE

Implemented in PETSc

www.mcs.anl.gov/petsc

Transonic “Lambda” Shock, Mach contours on surfaces

Page 57: Scalable Solvers and Software  for PDE Applications

Fixed-size Parallel Scaling Results

Four orders of magnitude in 13 years

c/o K. Anderson, W. Gropp, D. Kaushik, D. Keyes and B. Smith

128 nodes: 43 min
3072 nodes: 2.5 min, 226 Gf/s
11M unknowns, 15 µs/unknown, 70% efficient

This scaling study, featuring our widest range of processor number, was done for the incompressible case.

Page 58: Scalable Solvers and Software  for PDE Applications

Scaling to new architectures w/Bulk Synchronous Processing (BSP) model

[Figure: schematic BSP timeline of one Krylov iteration on each processor, with phases: local scatter, Jacobian-vector multiply, preconditioner sweep, daxpys, inner products]

What happens if, for instance, in this (schematicized) iteration, arithmetic speed is doubled, scalar all-gather is quartered, and local scatter is cut by one-third? Each phase is considered separately. Answer is to the right.


Page 59: Scalable Solvers and Software  for PDE Applications

Primitive BSP-like BG/L extrapolation*
[Figure: bar chart of total time and communication time for 1024 Red nodes and for 1024, 2048, 4096, and 8192 BG/L nodes]
Speedups (w.r.t. 1024 BG/L nodes) and communication percentages for 1024, 2048, 4096, and 8192 BG/L nodes: 1.0 (11%), 2.0 (11%), 3.7 (20%), 7.9 (25%)
*performed by A. Sugavanam; data pair is speedup (w.r.t. 1024 BG/L nodes) and communication %

Page 60: Scalable Solvers and Software  for PDE Applications

Conclusions
Domain decomposition and multilevel iteration are the dominant paradigm in contemporary terascale PDE simulation
Several freely available software toolkits exist, and successfully scale to thousands of tightly coupled processors for problems on quasi-static meshes
Concerted efforts are underway to make elements of these toolkits interoperate, and to allow expression of the best methods, which tend to be modular, hierarchical, recursive, and, above all, adaptive!
Tunability of NKS algorithmics allows solver adaptation to application/architecture combinations
Next-generation software should incorporate "best practices" from applications as preconditioners

Page 61: Scalable Solvers and Software  for PDE Applications

Acknowledgments
Collaborators or Contributors:
Xiao-Chuan Cai (Univ. Colorado, Boulder)
Omar Ghattas (Carnegie Mellon)
Dinesh Kaushik (ODU)
Dana Knoll (LANL)
Dimitri Mavriplis (ICASE)
PETSc team at Argonne National Laboratory: Satish Balay, Bill Gropp, Lois McInnes, Barry Smith
Sponsors: DOE, NASA, NSF
Computer Resources: LLNL, LANL, SNL, NERSC

Page 62: Scalable Solvers and Software  for PDE Applications

Related URLs
Personal homepage: papers, talks, etc.: http://www.math.odu.edu/~keyes
SciDAC initiative: http://www.science.doe.gov/scidac
TOPS software project: http://www.tops-scidac.org
PETSc software project: http://www.mcs.anl.gov/petsc
Hypre software project: http://www.llnl.gov/CASC/hypre

Slides from 14-hour Peking University CS&E short course with Bill Gropp (in August 2002) on-line

Page 63: Scalable Solvers and Software  for PDE Applications

Bibliography
[1] Jacobian-Free Newton-Krylov Methods: Approaches and Applications, Knoll & Keyes, 2002, submitted to J. Comp. Phys.
[2] Nonlinearly Preconditioned Inexact Newton Algorithms, Cai & Keyes, 2002, SIAM J. Sci. Comp. 24:183-200
[3] High Performance Parallel Implicit CFD, Gropp, Kaushik, Keyes & Smith, 2001, Parallel Computing 27:337-362
[4] Four Horizons for Enhancing the Performance of Parallel Simulations based on Partial Differential Equations, Keyes, 2000, Lect. Notes Comp. Sci., Springer, 1900:1-17
[5] Globalized Newton-Krylov-Schwarz Algorithms and Software for Parallel CFD, Gropp, Keyes, McInnes & Tidriri, 2000, Int. J. High Performance Computing Applications 14:102-136
[6] Achieving High Sustained Performance in an Unstructured Mesh CFD Application, Anderson, Gropp, Kaushik, Keyes & Smith, 1999, Proceedings of SC'99
[7] Prospects for CFD on Petaflops Systems, Keyes, Kaushik & Smith, 1999, in "Parallel Solution of Partial Differential Equations," Springer, pp. 247-278
[8] How Scalable is Domain Decomposition in Practice?, Keyes, 1998, in "Proceedings of the 11th Intl. Conf. on Domain Decomposition Methods," Domain Decomposition Press, pp. 286-297

Page 64: Scalable Solvers and Software  for PDE Applications

EOF