
Algorithm Design for High Performance CFD Solvers on Structured Grids

Hengjie Wang, Aparna Chandramowlishwaran (Advisor), HPC Forge Lab, University of California, Irvine

Motivations

Contributions

CFD + Deep Learning

• Unify the network properties, edge cuts (# messages), and communication volume into one cost function:

cost = latency ⋅ cuts + volume / bandwidth

Latency and bandwidth are network properties; the cuts (# messages) and volume are generated by the partitioner.
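A minimal sketch (not the poster's implementation) of evaluating this cost for a candidate block-to-rank assignment; the Face type, rank_of array, halo width, and 8-byte cell size are illustrative assumptions:

/* Illustrative: a face shared by two blocks, with its area in cells. */
typedef struct { int block_a, block_b; long cells; } Face;

/* cost = latency * cuts + volume / bandwidth
 * cuts  : # messages, i.e., faces whose two blocks land on different ranks
 * volume: bytes exchanged across those faces (nHalo layers of 8-byte values) */
double partition_cost(const Face *faces, int nfaces, const int *rank_of,
                      double latency, double bandwidth, int nHalo)
{
    long cuts = 0;
    double volume = 0.0;
    for (int f = 0; f < nfaces; ++f) {
        if (rank_of[faces[f].block_a] == rank_of[faces[f].block_b])
            continue;                         /* on-rank face: no message */
        cuts += 1;
        volume += (double)faces[f].cells * nHalo * 8.0;
    }
    return latency * (double)cuts + volume / bandwidth;
}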

• Design new algorithms for partitioning

Group small blocks with the new cost function:
‣ REB: Recursive Edge Bisection
‣ IF: Integer Factorization

Cut large blocks with novel algorithms:
‣ CCG: Cut and Combine blocks Greedily
‣ GGS: Sweep blocks with Graph Growth methods
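As one concrete example, a hedged sketch of the integer-factorization idea in its common form (not the poster's exact algorithm): enumerate the 3-D process grids px ⋅ py ⋅ pz = P for an Nx × Ny × Nz block and keep the grid that minimizes the per-rank cost model; the halo width and 8-byte cell size are illustrative assumptions.

#include <float.h>

/* Per-rank cost of cutting one N[0] x N[1] x N[2] block on a P[0] x P[1] x P[2]
 * process grid: two neighbor messages per split dimension, volume = halo faces. */
static double grid_cost(const int P[3], const long N[3],
                        double latency, double bandwidth, int nHalo)
{
    long sub[3] = { N[0] / P[0], N[1] / P[1], N[2] / P[2] };  /* local extents */
    long cuts = 0;
    double volume = 0.0;
    for (int d = 0; d < 3; ++d) {
        if (P[d] == 1) continue;                      /* dimension not split */
        long face = sub[(d + 1) % 3] * sub[(d + 2) % 3];
        cuts += 2;                                    /* two neighbors in d */
        volume += 2.0 * nHalo * (double)face * 8.0;   /* 8-byte values */
    }
    return latency * (double)cuts + volume / bandwidth;
}

/* Enumerate px*py*pz == nprocs and return the cheapest grid in best[3]. */
void best_grid(int nprocs, const long N[3],
               double latency, double bandwidth, int nHalo, int best[3])
{
    double best_cost = DBL_MAX;
    for (int px = 1; px <= nprocs; ++px) {
        if (nprocs % px) continue;
        for (int py = 1; py <= nprocs / px; ++py) {
            if ((nprocs / px) % py) continue;
            int pz = nprocs / (px * py);
            int P[3] = { px, py, pz };
            double c = grid_cost(P, N, latency, bandwidth, nHalo);
            if (c < best_cost) {
                best_cost = c;
                best[0] = px; best[1] = py; best[2] = pz;
            }
        }
    }
}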

• Performance Evaluation: evaluate the update step of a Jacobi solver on two grids, Bump3D and a rocket model based on SpaceX's Falcon Heavy, on the Mira supercomputer at Argonne National Laboratory

‣ Bump3D: 1 large block and 4 small blocks

‣ Rocket Model: 769 blocks of various sizes

• Optimal Hybrid (MPI+OpenMP) Temporal Tiling

• Overlap Communication and Computation

[Figure: Speedup of pipelined over non-pipelined execution for WJ7, WJ13, WJ27, Upwind, Weno3, and Burgers on 32 Broadwell nodes with Omni-Path.]

‣ Strong and weak scalability on 16-128 Broadwell nodes

[Figure: Weak scaling (480³ cells per node) and strong scaling (1920 × 1920 × 960 cells) on 16-128 Broadwell nodes; run time in seconds for WJ7, WJ13, WJ27, Upwind, Weno3, and Burgers.]

• Support Multi-Block Structured Grids
‣ Use DeepHalo to break the inter-block dependency that prohibits temporal tiling for multi-block structured grids

[Figure: Inter-block dependency vs. DeepHalo.]

‣ Evaluate with a 6-block grid on 32 Broadwell nodes; report speedup over MPI-Funneled with space tiling

• Generalize Networks to Geometries Unseen in Training

‣ Transfer predictions across different geometries

✓ Unify key algorithmic knobs and network properties into one cost function

✓ Design new partition algorithms for structured grids
✓ Outperform state-of-the-art methods by 1.5-3x

✓ Optimal hybrid temporal tiling, up to 1.9x over Pluto
✓ Pipeline computation and communication
✓ Applicable to multi-block structured grids
✓ 1.3-3.4x over MPI-Funneled with space tiling on distributed machines

✓ Extract local flow patterns via deep learning
✓ Predict flow over arbitrary geometries

Computational Fluid Dynamics (CFD) on structured grids is widely used in many engineering disciplines, such as aerospace engineering and vehicle design. Its computation and communication are characterized by stencil updates and halo exchanges.
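As a concrete illustration of the stencil pattern (a generic sketch, not the poster's solver code), a 7-point Jacobi sweep over one block's interior, assuming a local array padded with one halo layer; the right-hand side and weighting are omitted for brevity:

#include <stddef.h>

/* Flatten (i, j, k) for an (ni+2) x (nj+2) x (nk+2) array with one halo layer. */
#define IDX(i, j, k, nj, nk) (((size_t)(i) * ((nj) + 2) + (j)) * ((nk) + 2) + (k))

void jacobi7(const double *in, double *out, int ni, int nj, int nk)
{
    for (int i = 1; i <= ni; ++i)
        for (int j = 1; j <= nj; ++j)
            for (int k = 1; k <= nk; ++k)       /* unit-stride innermost loop */
                out[IDX(i, j, k, nj, nk)] = (1.0 / 6.0) *
                    (in[IDX(i - 1, j, k, nj, nk)] + in[IDX(i + 1, j, k, nj, nk)] +
                     in[IDX(i, j - 1, k, nj, nk)] + in[IDX(i, j + 1, k, nj, nk)] +
                     in[IDX(i, j, k - 1, nj, nk)] + in[IDX(i, j, k + 1, nj, nk)]);
}

Each sweep reads halo values written by the previous halo exchange, which is why every time step normally ends with an exchange of nHalo ghost layers per block face.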

CFD + Deep Learning (ongoing)

Distributed Stencil Computation (SC20)

Multi-Block Structured Grid Partitioner (ICS19)

[Figure: Running time per iteration (communication, computation, other) on 1024, 2048, and 4096 nodes for partitioners A: Greedy, B: Metis, C: REB+CCG, D: IF+CCG, E: REB+GGS, F: IF+GGS.]

[Figure: Star and box stencils; halo exchange.]

We identify the following key performance limitations in state-of-the-art CFD algorithms and solvers:
• Multi-Block Structured Grid Partitioner
‣ Biased towards minimizing communication volume
‣ Relies on graph partitioners for complex structured grids
• Distributed Stencil Computation
‣ Temporal tiling is not directly applicable to multi-block grids on distributed-memory systems
‣ Limited overlap of computation and communication with temporal tiling
• CFD + Deep Learning
‣ Problem-specific surrogates, i.e., unable to predict flow over geometries unseen in training


// Baseline multi-block sweep: a halo exchange follows every time step
// traverse time
for (int t = 0; t < tEnd; ++t) {
  // traverse blocks
  for (int b = 0; b < nBlocks; ++b) {
    get_block_size(b, size);
    // traverse space
    for (int i = 0; i < size[0]; ++i)
      for (int j = 0; j < size[1]; ++j)
        for (int k = 0; k < size[2]; ++k)
          compute(b, i, j, k);
  }
  // blocks' connections: exchange nHalo ghost layers
  for (int b = 0; b < nBlocks; ++b)
    exchange_boundary(b, nHalo);
}

// DeepHalo: fuse tFused time steps per block, then exchange a deep halo
// traverse time
for (int t = 0; t < tEnd; t += tFused) {
  // fused iterations
  for (int tt = 0; tt < tFused; ++tt) {
    // traverse blocks
    for (int b = 0; b < nBlocks; ++b) {
      get_block_size(b, size);
      // traverse space
      for (int i = 0; i < size[0]; ++i)
        for (int j = 0; j < size[1]; ++j)
          for (int k = 0; k < size[2]; ++k)
            compute(b, i, j, k);
    }
  }
  // update deep halo of width nHalo * tFused
  for (int b = 0; b < nBlocks; ++b)
    exchange_boundary(b, nHalo * tFused);
}

[Figure: Speedup over MPI-Funneled with space tiling for WJ7, WJ13, WJ27, Upwind, Weno3, and Burgers on the 6-block grid, 32 Broadwell nodes.]

‣ MPI and OpenMP’s memory arrangement

[Figure: The optimal decomposition of a 2 × Nj × Nk domain by 2 processes (P0, P1) or 2 threads (T0, T1). The J-K plane is 2x larger for threads.]

‣ Temporal Tiling

[Figure: A time-space tile in the t-j plane. Fuse 5 iterations and reuse J-K planes while they reside in cache; small J-K planes fit in cache easily.]

‣ MPI+OpenMP Temporal Tiling

Decompose K with processes, decompose J with threads, and march in I with temporal tiling (a simplified skeleton of this mapping follows).
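The skeleton below is a hedged sketch under stated assumptions (time fusion, halo updates, and tile sizing omitted; names such as hybrid_sweep and nk_local are illustrative): each MPI rank owns a slice of K, OpenMP threads split J into slabs, and every thread marches through I with SIMD along K.

#include <omp.h>
#include <stddef.h>

/* Flatten (i, j, k) for an (ni+2) x (nj+2) x (nk+2) array with one halo layer. */
#define IDX(i, j, k, nj, nk) (((size_t)(i) * ((nj) + 2) + (j)) * ((nk) + 2) + (k))

/* One sweep over this rank's K slice (nk_local planes, already cut by MPI). */
void hybrid_sweep(const double *in, double *out, int ni, int nj, int nk_local)
{
    #pragma omp parallel
    {
        /* Threads split the J dimension into contiguous slabs. */
        int tid = omp_get_thread_num(), nth = omp_get_num_threads();
        int jlo = 1 + tid * nj / nth, jhi = (tid + 1) * nj / nth;
        for (int i = 1; i <= ni; ++i) {          /* march in I */
            for (int j = jlo; j <= jhi; ++j) {
                #pragma omp simd                 /* vectorize along K */
                for (int k = 1; k <= nk_local; ++k)
                    out[IDX(i, j, k, nj, nk_local)] = (1.0 / 6.0) *
                        (in[IDX(i - 1, j, k, nj, nk_local)] + in[IDX(i + 1, j, k, nj, nk_local)] +
                         in[IDX(i, j - 1, k, nj, nk_local)] + in[IDX(i, j + 1, k, nj, nk_local)] +
                         in[IDX(i, j, k - 1, nj, nk_local)] + in[IDX(i, j, k + 1, nj, nk_local)]);
            }
        }
    }
}

Keeping each thread's J-K working set small is what lets the fused iterations of the real scheme stay in cache.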

[Figure: Speedup over non-tiling on Broadwell for WJ7, WJ13, WJ27, Upwind, Weno3, and Burgers; bars compare space tiling, Pluto (diamond), and the hybrid scheme.]

This mapping reduces the size of the J-K planes per process, applies prefetching and SIMD along K, and reuses J-K planes while they reside in cache.

The performance is evaluated across 6 numerical schemes and compared to space tiling and Pluto (diamond tiling):

‣ WJ7: 7-point weighted Jacobi for ∇²p = b
‣ WJ13: 13-point weighted Jacobi for ∇²p = b
‣ WJ27: 27-point weighted Jacobi for ∇²p = b
‣ Upwind: 2nd-order upwind for ∂tϕ + u⃗ ⋅ ∇ϕ = 0
‣ Weno3: 3rd-order WENO for ∂tϕ + u⃗ ⋅ ∇ϕ = 0
‣ Burgers: CD (central differencing) for ∂tu⃗ + ∇ ⋅ (u⃗u⃗) = ∇²u⃗

‣ Divide the domain into chunks and pipeline the chunks’ execution

(Pluto was unable to tile the Burgers equation.)

[Figure: Pipelining communication and computation: chunks Cl−1, Cl, Cl+1 on processes P0 and P1 across stages l and l+1.]
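A hedged sketch of this pipeline with non-blocking MPI: while chunk l is computed, the halo messages of the chunk finished at the previous stage are already in flight; post_halo_exchange and compute_chunk are illustrative placeholders, not the poster's API.

#include <mpi.h>

/* Illustrative placeholders: post Isend/Irecv for one chunk's halos (returning
 * the number of requests, at most 2 neighbors x send+recv here) and run one
 * chunk's stencil update. */
int  post_halo_exchange(int chunk, MPI_Request *reqs);
void compute_chunk(int chunk);

void pipelined_sweep(int nChunks)
{
    MPI_Request reqs[4];
    int nreq = 0;
    for (int l = 0; l < nChunks; ++l) {
        if (l > 0)
            nreq = post_halo_exchange(l - 1, reqs); /* chunk l-1's halos in flight */
        compute_chunk(l);                           /* overlaps with those messages */
        if (l > 0)
            MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    }
    nreq = post_halo_exchange(nChunks - 1, reqs);   /* last chunk's halos */
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
}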

‣ Train the network to solve ∇²p = 0 with data and physics; ground truth from a multigrid + finite difference solver:

loss = (1/N) (∑ (p − p*)² + α ∑ (∇²p)²)

[Figure: Ground truth vs. prediction for the lid-driven cavity, channel flow, and backward facing step; RMAE ~4-6%.]
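A minimal sketch of evaluating the data + physics loss above on a uniform 2-D grid with spacing h; this is a plain-C illustration of the formula (during training the same quantity would be computed inside the deep-learning framework), and the function and parameter names are illustrative:

#include <stddef.h>

/* loss = (1/N) * ( sum (p - p*)^2 + alpha * sum (laplacian p)^2 )
 * The Laplacian term uses a 5-point stencil on interior points only. */
double data_physics_loss(const double *p, const double *p_star,
                         int nx, int ny, double h, double alpha)
{
    double data = 0.0, physics = 0.0;
    size_t N = (size_t)nx * ny;
    for (int i = 0; i < nx; ++i)
        for (int j = 0; j < ny; ++j) {
            double d = p[i * ny + j] - p_star[i * ny + j];
            data += d * d;
            if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
                double lap = (p[(i - 1) * ny + j] + p[(i + 1) * ny + j] +
                              p[i * ny + j - 1] + p[i * ny + j + 1] -
                              4.0 * p[i * ny + j]) / (h * h);
                physics += lap * lap;
            }
        }
    return (data + alpha * physics) / (double)N;
}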

‣ Learn simple flow features

‣ Transfer the learned features to predict a more complex flow

• Next Steps: Simple Flow Features → Complex Flows

[Figure: Training on simple geometries (flat plate, channel, cavity, …) with local solutions (boundary layer, vortex, wave, …) yields a deep-learning predictor for simple flow features, which is then transferred and tested on unseen geometries (backward facing step, airfoil, …).]

‣ Complex flows are composed of simple flow features