Algorithm Design for High Performance CFD Solvers on Structured Grids
Hengjie Wang, Aparna Chandramowlishwaran (Advisor), HPC Forge Lab, University of California-Irvine

Motivations

Contributions

CFD + Deep Learning

• Unify the network properties, edge cuts (# messages) and communication volume into one cost function

cost = latency · cuts + volume / bandwidth

[Figures: Network Properties; Messages Generated by Partitioner]
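To make the cost model concrete, below is a minimal sketch of how a candidate partition could be scored under cost = latency · cuts + volume / bandwidth. The Partition struct, its field names, and the example network parameters are assumptions for illustration, not the poster's actual interface.

/* Hedged sketch: scoring a candidate partition with
 *   cost = latency * cuts + volume / bandwidth
 * Data layout and parameter values are assumed for illustration only. */
#include <stdio.h>

typedef struct {
    int    cuts;     /* number of inter-partition messages (edge cuts) */
    double volume;   /* total halo data moved, in bytes                */
} Partition;

/* latency in seconds per message, bandwidth in bytes per second */
static double partition_cost(const Partition *p, double latency, double bandwidth)
{
    return latency * p->cuts + p->volume / bandwidth;
}

int main(void)
{
    /* assumed network parameters: ~1 us latency, ~10 GB/s bandwidth */
    const double latency   = 1.0e-6;
    const double bandwidth = 10.0e9;

    Partition a = { .cuts = 2000, .volume = 5.0e8 };  /* few, large messages  */
    Partition b = { .cuts = 9000, .volume = 3.0e8 };  /* many, small messages */

    printf("cost(a) = %.4f s\n", partition_cost(&a, latency, bandwidth));
    printf("cost(b) = %.4f s\n", partition_cost(&b, latency, bandwidth));
    return 0;
}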

• Design new algorithms for partitioning

Group small blocks

With new cost function:
‣ REB: Recursive Edge Bisection
‣ IF: Integer Factorization

Cut large blocks:

Novel algorithms:
‣ CCG: Cut and Combine blocks Greedily
‣ GGS: Sweep blocks with Graph Growth methods
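As an illustration of how an integer-factorization (IF) cut of a large block could proceed, the sketch below factorizes the part count into three per-dimension factors and keeps the factorization with the smallest cut surface, a proxy for halo volume. This is only one plausible reading of IF, with hypothetical function names; it is not the authors' implementation.

/* Hedged sketch: cutting one large block into `parts` sub-blocks by integer
 * factorization.  All factorizations parts = px*py*pz are scored by the cut
 * surface they create, and the smallest is kept.  Illustrative only. */
#include <limits.h>
#include <stdio.h>

static void factorize_block(int nx, int ny, int nz, int parts,
                            int *px, int *py, int *pz)
{
    long best = LONG_MAX;
    for (int a = 1; a <= parts; ++a) {
        if (parts % a) continue;
        for (int b = 1; b <= parts / a; ++b) {
            if ((parts / a) % b) continue;
            int c = parts / (a * b);
            /* internal cut area created by an a x b x c decomposition */
            long surf = (long)(a - 1) * ny * nz
                      + (long)(b - 1) * nx * nz
                      + (long)(c - 1) * nx * ny;
            if (surf < best) { best = surf; *px = a; *py = b; *pz = c; }
        }
    }
}

int main(void)
{
    int px, py, pz;
    factorize_block(1920, 1920, 960, 64, &px, &py, &pz);
    printf("cut 1920x1920x960 into %d x %d x %d sub-blocks\n", px, py, pz);
    return 0;
}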

• Performance Evaluation
Evaluate the update step of a Jacobi solver on Bump3D and on a rocket model based on SpaceX's Falcon Heavy, run on the Mira supercomputer at Argonne National Laboratory

‣ Bump3D: 1 large block and 4 small blocks

‣ Rocket Model: 769 blocks of various sizes

• Optimal Hybrid (MPI+OpenMP) Temporal Tiling

• Overlap Communication and Computation

[Figure: Speedup of pipelined over non-pipelined execution for WJ7, WJ13, WJ27, Upwind, Weno3, and Burgers on 32 Broadwell nodes with Omni-Path]

‣ Strong and weak scalability on 16-128 Broadwell nodes

[Figures: Weak scaling with 480³ cells per node and strong scaling with 1920 × 1920 × 960 cells; time (s) vs. 16-128 Broadwell nodes for WJ7, WJ13, WJ27, Upwind, WENO3, and Burgers]

• Support Multi-Block Structured Grids
‣ Use DeepHalo to break the inter-block dependency that prohibits temporal tiling for multi-block structured grids

[Figure: Inter-block dependency vs. DeepHalo]

‣ Evaluate with a 6-block grid on 32 Broadwell nodes; report speedup over MPI-Funneled with space tiling

• Generalize Networks to Geometries Unseen in Training

‣ Transfer predictions across different geometries

✓ Unify key algorithmic knobs and network properties into one cost function

✓ Design new partitioning algorithms for structured grids
✓ Outperform state-of-the-art methods by 1.5-3x

✓ Optimal hybrid temporal tiling, up to 1.9x over Pluto
✓ Pipeline computation and communication
✓ Applicable to multi-block structured grids
✓ 1.3-3.4x over MPI-Funneled with space tiling on distributed machines

✓ Extract local flow patterns via deep learning
✓ Predict flow over arbitrary geometries

Computational Fluid Dynamics (CFD) with structured grids is widely used in engineering disciplines such as aerospace engineering and vehicle design. Its computation and communication are characterized by stencils and halo exchange.

Research thrusts:
‣ Distributed Stencil Computation (SC20)
‣ Multi-Block Structured Grid Partitioner (ICS19)
‣ CFD + Deep Learning (ongoing)

[Figure: Running time per iteration (communication, computation, other) on 1024, 2048, and 4096 nodes for partitioners A: Greedy, B: Metis, C: REB+CCG, D: IF+CCG, E: REB+GGS, F: IF+GGS]

[Figures: Star and box stencils; halo exchange]
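For context on the halo exchange sketched above, here is a minimal 1D sketch, assuming one ghost cell per side and MPI_Sendrecv; the array size and tags are illustrative, and real solvers exchange 2D faces of 3D blocks rather than single cells.

/* Hedged sketch: a 1D halo exchange.  Each rank owns N interior cells plus
 * one ghost cell per side; before a stencil sweep it swaps boundary cells
 * with its neighbors so the ghosts hold up-to-date remote data. */
#include <mpi.h>
#include <stdio.h>

#define N 64  /* interior cells per rank (assumed) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[N + 2];                           /* u[0], u[N+1] are ghosts   */
    for (int i = 1; i <= N; ++i) u[i] = rank;
    u[0] = u[N + 1] = -1.0;                    /* sentinel for missing side */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send rightmost interior cell right, receive left ghost from the left */
    MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                 &u[0], 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send leftmost interior cell left, receive right ghost from the right */
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  1,
                 &u[N + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d ghosts: left=%g right=%g\n", rank, u[0], u[N + 1]);
    MPI_Finalize();
    return 0;
}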

We identify the following key performance limitations in state-of-the-art CFD algorithms and solvers:
• Multi-Block Structured Grid Partitioner
‣ Biased towards minimizing communication volume
‣ Graph partitioners are used for complex structured grids
• Distributed Stencil Computation
‣ Temporal tiling is not directly applicable to multi-block grids on distributed-memory systems
‣ Limited overlap of computation and communication with temporal tiling
• CFD + Deep Learning
‣ Problem-specific surrogates, i.e., unable to predict flow over geometries unseen in training


// Baseline multi-block traversal: every time step sweeps all blocks, then
// exchanges a halo of width nHalo across the blocks' connections.
// traverse time
for (int t = 0; t < tEnd; ++t) {
    // traverse blocks
    for (int b = 0; b < nBlocks; ++b) {
        get_block_size(b, size);
        // traverse space
        for (int i = 0; i < size[0]; ++i)
            for (int j = 0; j < size[1]; ++j)
                for (int k = 0; k < size[2]; ++k)
                    compute(b, i, j, k);
    }
    // blocks' connections
    for (int b = 0; b < nBlocks; ++b)
        exchange_boundary(b, nHalo);
}

// DeepHalo: fuse tFused time steps per sweep and exchange a deeper halo of
// width nHalo * tFused, so the inter-block exchange runs tFused times less
// often and temporal tiling can be applied within each block.
// traverse time
for (int t = 0; t < tEnd; t += tFused) {
    // fused iterations
    for (int tt = 0; tt < tFused; ++tt) {
        // traverse blocks
        for (int b = 0; b < nBlocks; ++b) {
            get_block_size(b, size);
            // traverse space
            for (int i = 0; i < size[0]; ++i)
                for (int j = 0; j < size[1]; ++j)
                    for (int k = 0; k < size[2]; ++k)
                        compute(b, i, j, k);
        }
    }
    // update deep halo
    for (int b = 0; b < nBlocks; ++b)
        exchange_boundary(b, nHalo * tFused);
}

[Figure: Speedup of DeepHalo over MPI-Funneled with space tiling for WJ7, WJ13, WJ27, Upwind, Weno3, and Burgers; 6 blocks, 32 Broadwell nodes]

‣ MPI and OpenMP’s memory arrangement

[Figure: decomposition of a 2 × Nj × Nk domain in the J-K plane by 2 processes (P0, P1) vs. 2 threads (T0, T1), with J-K planes of Nj × Nk/2 and Nj × Nk respectively]

The optimal decomposition of 2 × Nj × Nk by 2 processes or 2 threads. The J-K plane is 2x larger for threads.

‣ Temporal Tiling

[Figure: a time-space tile in the t-j plane]

Fuse 5 iterations (5 colors) and reuse J-K planes while they reside in cache. Small J-K planes are easy to fit in cache.

‣ MPI+OpenMP Temporal Tiling

Decompose K with processes

March in I with Temporal Tiling

Decompose J with threads
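A minimal sketch of this decomposition, assuming a single block and a plain 7-point update: the K slab models one MPI rank's share, OpenMP threads split J, and I is the outermost spatial loop so that, in the actual scheme, tFused time steps can be marched along I while the small J-K planes stay in cache. The temporal fusion itself and the MPI exchange are omitted; this is a schematic of the decomposition, not the SC20 implementation.

/* Hedged sketch of the hybrid decomposition: K slab per rank, J split across
 * OpenMP threads, I as the outermost spatial loop.  Sizes are illustrative. */
#include <stdio.h>
#include <stdlib.h>

#define NI 64
#define NJ 64
#define NK 64
#define IDX(i, j, k) (((i) * NJ + (j)) * NK + (k))

static void jacobi7(const double *in, double *out, int k_lo, int k_hi)
{
    for (int i = 1; i < NI - 1; ++i) {             /* march along I          */
        #pragma omp parallel for                    /* threads split J        */
        for (int j = 1; j < NJ - 1; ++j)
            for (int k = k_lo; k < k_hi; ++k)       /* this rank's K slab     */
                out[IDX(i, j, k)] = (in[IDX(i - 1, j, k)] + in[IDX(i + 1, j, k)]
                                   + in[IDX(i, j - 1, k)] + in[IDX(i, j + 1, k)]
                                   + in[IDX(i, j, k - 1)] + in[IDX(i, j, k + 1)]
                                   + in[IDX(i, j, k)]) / 7.0;
    }
}

int main(void)
{
    double *a = calloc(NI * NJ * NK, sizeof *a);
    double *b = calloc(NI * NJ * NK, sizeof *b);
    a[IDX(NI / 2, NJ / 2, NK / 2)] = 1.0;           /* point source            */
    jacobi7(a, b, 1, NK - 1);                       /* pretend rank owns all K */
    printf("center after one sweep: %e\n", b[IDX(NI / 2, NJ / 2, NK / 2)]);
    free(a); free(b);
    return 0;
}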

[Figure: speedup over non-tiling on Broadwell for WJ7, WJ13, WJ27, Upwind, Weno3, and Burgers, comparing space tiling, Pluto (diamond), and the hybrid scheme]

Reduce the size of J-K planes per process

Apply prefetching and SIMD to K

Reuse J-K planes

The performance is evaluated across 6 numerical schemes and compared to space tiling and Pluto (diamond tiling):

WJ7: 7-point weighted Jacobi for ∇²p = b
WJ13: 13-point weighted Jacobi for ∇²p = b
WJ27: 27-point weighted Jacobi for ∇²p = b
Upwind: 2nd-order upwind for ∂t φ + u·∇φ = 0
Weno3: 3rd-order WENO for ∂t φ + u·∇φ = 0
Burgers: CD (central difference) for ∂t u + ∇·(u u) = ∇²u
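As an illustration of one of the benchmark kernels listed above, here is a minimal 1D second-order upwind update (a Beam-Warming style scheme) for ∂t φ + u ∂x φ = 0 with constant u > 0. The poster's Upwind benchmark is 3D and its exact discretization is not reproduced here; this is only a sketch of the kind of kernel involved.

/* Hedged sketch: second-order upwind (Beam-Warming style) update,
 *   phi_i <- phi_i - c (phi_i - phi_{i-1})
 *                  - (c/2)(1 - c)(phi_i - 2 phi_{i-1} + phi_{i-2}),
 * with c = u*dt/dx and constant u > 0.  Illustrative 1D version only. */
#include <stdio.h>
#include <string.h>

#define N 200

int main(void)
{
    double phi[N] = {0.0}, old[N];
    for (int i = 40; i < 60; ++i) phi[i] = 1.0;     /* square pulse          */

    const double c = 0.5;                            /* CFL number u*dt/dx    */
    for (int n = 0; n < 80; ++n) {                   /* advect for 80 steps   */
        memcpy(old, phi, sizeof phi);
        for (int i = 2; i < N; ++i)
            phi[i] = old[i] - c * (old[i] - old[i - 1])
                   - 0.5 * c * (1.0 - c) * (old[i] - 2.0 * old[i - 1] + old[i - 2]);
        /* inflow cells i = 0, 1 stay at zero */
    }
    /* the pulse should have moved by c * steps = 40 cells */
    printf("phi[90] = %f\n", phi[90]);
    return 0;
}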

‣ Divide the domain into chunks and pipeline their execution; a sketch follows the figure below

(Pluto was unable to tile the Burgers equation.)

[Figure: Pipelined communication and computation; chunks Cl-1, Cl, Cl+1 on processes P0 and P1 at stage l and stage l+1]
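A minimal sketch of the pipelining idea, assuming non-blocking MPI and a hypothetical per-chunk kernel: while chunk l+1's boundary messages are in flight, the rank computes on chunk l. The chunk layout, the pairing of ranks, and the compute_chunk body are illustrative assumptions, not the benchmark code.

/* Hedged sketch: pipelining chunk communication with computation using
 * non-blocking MPI.  Chunk l's halo is waited on just before it is used,
 * and chunk l+1's exchange overlaps with chunk l's computation. */
#include <mpi.h>

#define NCHUNKS 4
#define CELLS   1024

static double chunk[NCHUNKS][CELLS];   /* this rank's chunk data             */
static double halo[NCHUNKS][CELLS];    /* neighbor data received per chunk   */
static MPI_Request reqs[NCHUNKS][2];

/* post the non-blocking exchange of chunk l's boundary with a peer rank */
static void exchange_start(int l, int peer)
{
    MPI_Irecv(halo[l],  CELLS, MPI_DOUBLE, peer, l, MPI_COMM_WORLD, &reqs[l][0]);
    MPI_Isend(chunk[l], CELLS, MPI_DOUBLE, peer, l, MPI_COMM_WORLD, &reqs[l][1]);
}

/* stand-in for the stencil update on chunk l using its received halo */
static void compute_chunk(int l)
{
    for (int i = 0; i < CELLS; ++i) chunk[l][i] += 0.5 * halo[l][i];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int peer = rank ^ 1;                        /* pair up neighboring ranks     */
    if (peer >= size) peer = MPI_PROC_NULL;     /* an unpaired rank exchanges nothing */

    for (int l = 0; l < NCHUNKS; ++l)
        for (int i = 0; i < CELLS; ++i) chunk[l][i] = rank;

    exchange_start(0, peer);                    /* stage 0: chunk 0 in flight    */
    for (int l = 0; l < NCHUNKS; ++l) {
        if (l + 1 < NCHUNKS)
            exchange_start(l + 1, peer);        /* next chunk's messages fly ... */
        MPI_Waitall(2, reqs[l], MPI_STATUSES_IGNORE); /* ... chunk l's halo arrives */
        compute_chunk(l);                       /* ... while chunk l is computed */
    }
    MPI_Finalize();
    return 0;
}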

[Figures: ground truth vs. prediction, RMAE ~6% and RMAE ~4%]

‣ Train the network to solve ∇²p = 0 with data and physics:

loss = (1/N) ( Σ (p − p*)² + α Σ (∇²p)² )

Multigrid + Finite Difference
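To make the loss concrete, below is a minimal sketch of the combined data + physics term, with the Laplacian evaluated by a 5-point finite difference on interior points. The grid size, spacing, and alpha are illustrative; the poster's network and training loop are not reproduced here.

/* Hedged sketch of  loss = (1/N) * ( sum (p - p*)^2 + alpha * sum (lap p)^2 )
 * with a 5-point finite-difference Laplacian on a 2D grid. */
#include <stdio.h>

#define NX 32
#define NY 32

static double loss(double p[NX][NY], double p_star[NX][NY], double h, double alpha)
{
    double data = 0.0, phys = 0.0;
    int n = 0;
    for (int i = 1; i < NX - 1; ++i)
        for (int j = 1; j < NY - 1; ++j) {
            double d   = p[i][j] - p_star[i][j];                 /* data misfit        */
            double lap = (p[i + 1][j] + p[i - 1][j] + p[i][j + 1] + p[i][j - 1]
                          - 4.0 * p[i][j]) / (h * h);            /* physics residual   */
            data += d * d;
            phys += lap * lap;
            ++n;
        }
    return (data + alpha * phys) / n;
}

int main(void)
{
    static double p[NX][NY], p_star[NX][NY];
    for (int i = 0; i < NX; ++i)
        for (int j = 0; j < NY; ++j) {
            p_star[i][j] = 0.01 * i + 0.02 * j;   /* linear "truth", so lap = 0       */
            p[i][j]      = p_star[i][j] + 1e-3;   /* prediction with a small offset   */
        }
    printf("loss = %e\n", loss(p, p_star, 1.0 / NX, 0.1));
    return 0;
}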

‣ Learn simple flow features from the lid-driven cavity and channel flow
‣ Transfer them to predict a more complex flow: the backward-facing step

• Next Steps: Simple Flow Features → Complex Flows

[Workflow: train on simple geometries (flat plate, channel, cavity, ...) and their local solutions (boundary layer, vortex, wave, ...) → deep learning yields a predictor for simple features (Net) → transfer → test on unseen geometries (backward-facing step, airfoil, ...)]

‣ Complex flows are composed of simple flow features