NVIDIA MPI-enabled Iterative Solvers for Large Scale Problems
Joe Eaton, Manager, AmgX CUDA Library, NVIDIA
Fluent control flow
[Flowchart: assemble the linear system of equations, solve Ax = b, test for convergence; if not converged, loop back, otherwise stop.]
Runtime of the non-linear iterations splits roughly 33% assembly / 67% linear solve, so accelerate the solve first.
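The loop is simple enough to sketch in code. Below is a minimal C sketch of that control flow; assemble_system and linear_solve are hypothetical stand-ins for Fluent internals, not real Fluent or AmgX API.

void assemble_system(double *x, int n);   /* hypothetical: build A and b        */
double linear_solve(double *x, int n);    /* hypothetical: solve Ax = b and
                                             return the non-linear residual     */

void nonlinear_iterations(double *x, int n, double tol, int max_outer)
{
    for (int it = 0; it < max_outer; ++it) {
        assemble_system(x, n);            /* ~33% of runtime                    */
        double res = linear_solve(x, n);  /* ~67% of runtime: accelerate first  */
        if (res < tol)                    /* converged? yes: stop               */
            break;
    }
}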
Break it Down
Iterative Solvers
Low memory requirements
Best choice if only a few digits of accuracy are needed
Optimal complexity: O(N)
Multigrid methods
Optimal for discretized elliptic and mixed elliptic-hyperbolic type PDEs
For a black-box method, choose Algebraic Multigrid (AMG)
Krylov methods
PCG, GMRES, BiCGSTAB, IDR: these work together with AMG; the two are complementary (see the PCG sketch below)
Easy to create parallel implementations
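To make "work together" concrete, here is a minimal C sketch of preconditioned CG (PCG): apply_A is the sparse matrix-vector product and apply_Minv would be one AMG cycle. Both function pointers are illustrative stand-ins under that assumption, not AmgX API.

#include <math.h>
#include <stdlib.h>
#include <string.h>

typedef void (*op_fn)(const double *in, double *out, int n, void *ctx);

static double dot(const double *a, const double *b, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

/* Returns the number of iterations used. */
int pcg(op_fn apply_A, op_fn apply_Minv, void *ctx,
        const double *b, double *x, int n, double tol, int max_iters)
{
    double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
    double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);
    apply_A(x, q, n, ctx);                          /* r = b - A x          */
    for (int i = 0; i < n; ++i) r[i] = b[i] - q[i];
    apply_Minv(r, z, n, ctx);                       /* z = M^{-1} r (AMG)   */
    memcpy(p, z, n * sizeof *p);
    double rz = dot(r, z, n);
    int it;
    for (it = 0; it < max_iters && sqrt(dot(r, r, n)) > tol; ++it) {
        apply_A(p, q, n, ctx);                      /* q = A p              */
        double alpha = rz / dot(p, q, n);
        for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        apply_Minv(r, z, n, ctx);                   /* precondition again   */
        double rz_new = dot(r, z, n);
        for (int i = 0; i < n; ++i) p[i] = z[i] + (rz_new / rz) * p[i];
        rz = rz_new;
    }
    free(r); free(z); free(p); free(q);
    return it;
}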
[Figure: Scalability, a log-log plot of operation count vs. problem size N (10 to 10,000) for AMG, PCG+ILU, and PCG+AMG, with callouts at roughly 5 million, 100 thousand, and 50 thousand operations.]
Multigrid on GPUs
Notice I didn't claim multigrid was easy to implement in parallel
For GPUs we need fine-grained parallelism in addition to domain decomposition (DD):
MPI for coarse-grained parallelism, CUDA + OpenMP on each node
NVIDIA set out to create a world-class AMG implementation on GPUs
Three years of effort, 7 Ph.D. contributors, and 9 patents later:
AmgX is HERE!
Parallel AMG (Classical and Aggregation)
Smoothers (Block-Jacobi, Gauss-Seidel, DILU, Polynomial)
MG cycles (V, W, F, CG); the V-cycle recursion is sketched below
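The cycles above share one recursive skeleton. Here is a schematic C sketch of the V-cycle; every helper and the per-level storage are illustrative stand-ins assumed to be set up elsewhere, not AmgX internals. A W-cycle recurses twice per level, and an F-cycle sits in between.

#define NLEVELS 4                     /* assumed depth of the hierarchy   */

extern double *r_lvl[NLEVELS];        /* residual storage per level       */
extern double *x_lvl[NLEVELS];        /* coarse correction per level      */
extern double *b_lvl[NLEVELS];        /* coarse right-hand side per level */

void smooth(int level, double *x, const double *b);
void residual(int level, const double *x, const double *b, double *r);
void restrict_residual(int level, const double *r_fine, double *b_coarse);
void prolongate_add(int level, const double *x_coarse, double *x_fine);
void coarse_solve(double *x, const double *b);
void set_zero(int level, double *x);

void v_cycle(int level, double *x, const double *b)
{
    if (level == NLEVELS - 1) {               /* coarsest grid: direct solve */
        coarse_solve(x, b);
        return;
    }
    smooth(level, x, b);                      /* pre-smoothing               */
    residual(level, x, b, r_lvl[level]);      /* r = b - A x                 */
    restrict_residual(level, r_lvl[level], b_lvl[level + 1]);
    set_zero(level + 1, x_lvl[level + 1]);
    v_cycle(level + 1, x_lvl[level + 1], b_lvl[level + 1]);
    prolongate_add(level, x_lvl[level + 1], x);  /* x += P * e_coarse        */
    smooth(level, x, b);                      /* post-smoothing              */
}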
Classical Ruge-Stüben AMG
- Robust convergence
- High setup cost, two phases
- Scalar matrices only
Unsmoothed Aggregation AMG
- Easier to parallelize
- Lower grid and operator complexity
- Handles block-structured matrices (a toy aggregation sketch follows)
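One way to see why unsmoothed aggregation is cheap: the prolongator P is piecewise constant, so it is fully described by one aggregate id per fine node. The toy C program below uses a made-up size-4 aggregate map; applying P^T is just a sum over each aggregate.

#include <stdio.h>

int main(void)
{
    /* 8 fine nodes grouped into 2 aggregates of size 4 (hypothetical map) */
    int agg[8] = {0, 0, 0, 0, 1, 1, 1, 1};

    /* Restriction (P^T v): accumulate each fine entry into its aggregate;
       prolongation (P v_c) copies the coarse value back to every member.  */
    double fine[8] = {1, 2, 3, 4, 5, 6, 7, 8}, coarse[2] = {0, 0};
    for (int i = 0; i < 8; ++i) coarse[agg[i]] += fine[i];
    printf("coarse = [%g, %g]\n", coarse[0], coarse[1]);  /* [10, 26] */
    return 0;
}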
Extras
Preconditioned & Accelerated Eigensolvers
Power method, inverse power method, shifted inverse iteration
LOBPCG
Jacobi-Davidson
Subspace iteration
Solve Ax = λx
Find some extremal eigenvalues fast
Find eigenvalues near a given value
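The simplest of these, the power method, fits in a few lines of C; apply_A below is an illustrative stand-in for a sparse matrix-vector product, not AmgX API. Shifted inverse iteration replaces that mat-vec with a solve of (A - sigma*I) y = x, which is exactly where the preconditioned solvers above come in.

#include <math.h>

typedef void (*matvec_fn)(const double *in, double *out, int n);

/* Estimate the dominant eigenvalue of A; x is the starting vector,
   y is scratch space, both of length n. */
double power_method(matvec_fn apply_A, double *x, double *y, int n, int iters)
{
    double lambda = 0.0;
    for (int k = 0; k < iters; ++k) {
        double nrm = 0.0;                       /* normalize x             */
        for (int i = 0; i < n; ++i) nrm += x[i] * x[i];
        nrm = sqrt(nrm);
        for (int i = 0; i < n; ++i) x[i] /= nrm;
        apply_A(x, y, n);                       /* y = A x                 */
        lambda = 0.0;                           /* Rayleigh quotient x.Ax  */
        for (int i = 0; i < n; ++i) lambda += x[i] * y[i];
        for (int i = 0; i < n; ++i) x[i] = y[i];/* next iterate            */
    }
    return lambda;
}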
Krylov Methods
All rely on massively parallel SpMV (sketched below)
CG, PCG, PCGF
GMRES, FGMRES, DQGMRES
BiCGStab, BiPCGStab
IDR(s)
IDR(s) is the new kid on the block; it allows a smooth variation between BiCGStab-like and GMRES-like behavior
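Since everything above leans on SpMV, here is the CSR kernel in plain C, written serially for clarity; this is a generic sketch, not the library's kernel. On the GPU, each row (or a warp per row) becomes one unit of massive parallelism.

/* y = A * x for an n-row matrix in CSR format */
void spmv_csr(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; ++i) {        /* one row per GPU thread/warp */
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += val[j] * x[col_idx[j]];
        y[i] = sum;
    }
}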
Graph Utilities
Partitioning with METIS
Graph coloring for parallel smoothers (a greedy sketch follows)
Produces fast, high-quality colorings
New load balancing based on non-zero counts
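The coloring idea: rows that share a color have no direct coupling, so a colored Gauss-Seidel or DILU sweep can update each color class fully in parallel. Below is a serial greedy coloring over a CSR adjacency as a sketch; the library's GPU coloring algorithm differs.

#include <stdbool.h>

void greedy_color(int n, const int *row_ptr, const int *col_idx, int *color)
{
    for (int i = 0; i < n; ++i) color[i] = -1;     /* all uncolored       */
    for (int i = 0; i < n; ++i) {
        bool used[64] = {false};                   /* assumes < 64 colors */
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j) {
            int nb = col_idx[j];                   /* neighboring row     */
            if (nb != i && color[nb] >= 0 && color[nb] < 64)
                used[color[nb]] = true;
        }
        int c = 0;
        while (used[c]) ++c;                       /* smallest free color */
        color[i] = c;
    }
}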
Flexible Solver Construction
All methods composable, nestable to create complex solvers easily
(PCG+AMG+GMRES, GMRES+AMG, AMG+AMG)
Offers unique NVIDIA methods for massive parallelism
[Diagram: example nested solver hierarchies, combining Classical AMG, Aggregation AMG (aggregate sizes 4 and 8), GMRES, and LU.]
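In the released library, such nestings are expressed in a solver configuration file. The JSON below is a sketch in the style of the configs shipped with the public AmgX release (PCG accelerated by an aggregation-AMG preconditioner); the exact keys are assumptions and may differ from the beta shown here.

{
  "config_version": 2,
  "solver": {
    "solver": "PCG",
    "max_iters": 100,
    "tolerance": 1e-06,
    "monitor_residual": 1,
    "preconditioner": {
      "solver": "AMG",
      "algorithm": "AGGREGATION",
      "selector": "SIZE_4",
      "cycle": "V",
      "smoother": "BLOCK_JACOBI",
      "max_levels": 10
    }
  }
}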
Standards Compliance
Supports standard matrix formats: Matrix Market, CSR, BSR, COO (see the layout sketch below)
Easy-to-integrate C API
Tested with multiple MPI implementations: PCMPI, MVAPICH, OpenMPI
Works with CUDA 5.0 and CUDA 5.5
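For concreteness, here is one small matrix laid out in COO and CSR as C arrays (values made up for illustration); BSR is CSR with each scalar value replaced by a dense b x b block.

/*      | 4 1 0 0 |
    A = | 1 4 1 0 |
        | 0 1 4 1 |
        | 0 0 1 4 |   (4 rows, 10 nonzeros)                          */

/* COO: one (row, col, value) triple per nonzero */
int    coo_rows[10] = {0,0, 1,1,1, 2,2,2, 3,3};
int    coo_cols[10] = {0,1, 0,1,2, 1,2,3, 2,3};
double coo_vals[10] = {4,1, 1,4,1, 1,4,1, 1,4};

/* CSR: row i's entries live at indices row_ptr[i] .. row_ptr[i+1]-1 */
int    csr_row_ptr[5]  = {0, 2, 5, 8, 10};
int    csr_col_idx[10] = {0,1, 0,1,2, 1,2,3, 2,3};
double csr_vals[10]    = {4,1, 1,4,1, 1,4,1, 1,4};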
Easy to Use
No CUDA experience necessary to use the library
Small API (<30 calls), no bloat
Upload a matrix from the host, or hand it over in device memory (sketched below)
Uses CUDA Unified Virtual Addressing (UVA) for zero-copy uploads
Supports NVIDIA GPUs, Fermi and newer
Single, double, and mixed precision
Handles consolidation of the matrix on fine or coarse levels
Single GPU, multi-GPU, and clusters with MPI
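Handing over a matrix directly might look like the fragment below. The call name comes from the public AmgX release (AMGX_matrix_upload_all) and may differ from the beta API on these slides; thanks to UVA, the array pointers may live in host or device memory.

/* n, nnz and the csr_* arrays are assumed to exist as in the CSR
   sketch above; A is a matrix handle as created in the minimal
   example below.                                                  */
AMGX_matrix_upload_all(A,
                       n,            /* number of rows                 */
                       nnz,          /* number of nonzeros             */
                       1, 1,         /* block dims (1x1 = scalar CSR)  */
                       csr_row_ptr,  /* row offsets, length n+1        */
                       csr_col_idx,  /* column indices, length nnz     */
                       csr_vals,     /* values, length nnz             */
                       NULL);        /* optional external diagonal     */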
Minimal Example: 11 calls
AMGX_initialize();                        /* initialize the library core      */
AMGX_initialize_plugins();                /* initialize optional plugins      */
AMGX_create_config(&cfg, cfgfile);        /* parse the solver config file     */
AMGX_resources_create_simple(&res, cfg);  /* bind GPU resources               */
AMGX_solver_create(&solver, res, mode, cfg);
AMGX_matrix_create(&A, res, mode);
AMGX_vector_create(&x, res, mode);        /* solution vector                  */
AMGX_vector_create(&b, res, mode);        /* right-hand side                  */
AMGX_read_system(&A, &x, &b, matrixfile); /* load A, x, b from file           */
AMGX_solver_setup(solver, A);             /* setup phase: build AMG hierarchy */
AMGX_solver_solve(solver, b, x);          /* solve phase: iterate on Ax = b   */
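A natural continuation, using names from the public AmgX release (which may differ from the beta API above): check the solve status, copy the solution back to the host, and tear down in reverse order of creation.

/* x_host: a caller-allocated host array of length n (hypothetical) */
AMGX_SOLVE_STATUS status;
AMGX_solver_get_status(solver, &status);   /* AMGX_SOLVE_SUCCESS?    */
AMGX_vector_download(x, x_host);           /* solution back to host  */
AMGX_solver_destroy(solver);               /* teardown               */
AMGX_vector_destroy(x);
AMGX_vector_destroy(b);
AMGX_matrix_destroy(A);
AMGX_resources_destroy(res);
AMGX_config_destroy(cfg);
AMGX_finalize_plugins();
AMGX_finalize();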
Capacity & Performance
We have verified 2 million rows per GB of GPU memory
-> 12 million rows on Tesla K20X (6 GB)
-> 24 million rows on Tesla K40 (12 GB)
Single-GPU performance is typically 5 million rows/sec for 1 digit of accuracy
-> 10 million rows solved to 6 digits (1e-6 relative tolerance):
10/5 * 6 = 12 seconds
-> 8 million row system, to 10 digits:
8/5 * 10 = 16 seconds
Your mileage may vary: the miniFE app delivers 7.44 M rows/sec for 1 digit
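The arithmetic above is just time = rows / throughput * digits. A tiny C check, with the 5 million rows/sec-per-digit figure hard-coded for illustration:

#include <stdio.h>

int main(void)
{
    double rate = 5e6;            /* rows/sec for one digit of accuracy  */
    double rows = 10e6, digits = 6;
    double seconds = rows / rate * digits;
    printf("%.0fM rows to %g digits: ~%.0f s\n", rows / 1e6, digits, seconds);
    return 0;                     /* prints: 10M rows to 6 digits: ~12 s */
}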
MPI Performance
Dominated by coarse-grid communication overhead
Consolidate coarse grids to a single GPU
Try out asynchronous coarse corrections
o When we engage multiple nodes, we see a big hit
o Weak scaling is good to 4 GPUs (95% efficient)
o Above 16 GPUs, we see about 50% scaling efficiency
[Chart: OpenFOAM weak scaling, efficiency (0 to 1.2) vs. number of nodes (0 to 4.5), comparing GPU/AMG against CPU/GAMG.]
[Chart: strong scaling for a 3M-row problem on a single node, speedup vs. number of GPUs (up to 4), comparing Perfect, Naive, Scotch, and Metis partitionings.]
Beta Users
We have started a private beta test program: 14 institutions and ISVs
Public release in Nov 2013 at SC'13
Use cases so far:
Oil & gas (exploration and production)
CFD (external aerodynamics, incompressible and compressible flow)
Flooding and storm surge simulation
Tsunami simulation
Coupled heat transfer and flow in pipes
Network/graph metrics (PageRank, maximal flow, partitioning)
Computer vision
ANSYS Fluent for 14M Cell Aerodynamic Case (Truck Body Model)
14M mixed cells
DES turbulence
Coupled PBNS, single precision (SP)
Times for 1 iteration
AMG F-cycle on CPU; GPU: preconditioned FGMRES with AMG
NOTE: All jobs report solver time only
[Bar chart: ANSYS Fluent AMG solver time per iteration (sec), lower is better; bar values 69, 41, 28, 12, and 9 across Intel Xeon E5-2667 (2.90 GHz) CPU-only and CPU + Tesla K20X configurations: 1 node, 2 CPUs (12 cores total); 2 nodes, 4 CPUs (24 cores total) with 8 GPUs (4 per node); 4 nodes, 8 CPUs (48 cores total) with 16 GPUs (4 per node). GPU speedups of 3.5x and 3.3x.]
ANSYS Fluent 15.0 Preview Performance – Results by NVIDIA, Jun 2013
GPU Acceleration on Coupled Solvers (Sedan Model)
Sedan geometry, 3.6M mixed cells
Steady, turbulent external aerodynamics
Coupled PBNS, double precision (DP)
AMG F-cycle on CPU; AMG V-cycle on GPU
NOTE: Times are for the total solution
[Bar chart: ANSYS Fluent time (sec), lower is better; 7070 s for the segregated solver on CPU, 5883 s for the coupled solver on CPU, and 3180 s for the coupled solver on CPU+GPU, a 1.9x speedup over the coupled CPU run and 2.2x over the segregated CPU run.]
Preview of ANSYS Fluent 15.0 Performance
Tremendous Collaboration and Team
Thanks to an awesome team (in alphabetical order):
Marat Arsaev, Patrice Castonguay, Jonathan Cohen, Julien Demouth, Bhushan Desam, Joe Eaton, Justin Luitjens, Nikolay Markovskiy, Maxim Naumov, Stan Posey, Nikolai Sakharnykh, Robert Strzodka, Han Van Holder, Zhenhai Zhu
Star interns:
Peter Zaspel, Simon Layton, Lu Wang, Istvan Reguly, Francesco Rossi, Christoph Franke, Felix Abecassis, Rohit Gupta
Our collaborators and AMG advisors:
ANSYS: Sunil Sathe, Prasad Alavilli, Rongguang Jia
PSU: James Brannick, Ludmil Zikatanov, Jinchao Xu, Xiaozhe Hu
Developers at companies to be named later…