Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement
SIAM PP 2014, Portland
February 19, 2014
Florian Schornbaum, Ehsan Fattahi, David Staubach, Christian Godenschwager, Martin Bauer, Ulrich Rüde
Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
Outline
Massively Parallel Lattice Boltzmann-based Simulations with Grid Refinement Florian Schornbaum - FAU Erlangen-Nürnberg - February 19, 2014
• Introduction
  • The waLBerla Simulation Framework
  • The Lattice Boltzmann Method
• Uniform Grids
  • Domain Decomposition & Parallelization
  • Performance / Benchmarks
• Statically Refined Grids
  • Lattice Boltzmann & Grid Refinement
  • Domain Decomposition & Load Balancing
  • Performance / Benchmarks
• Outlook on Dynamic Grid Refinement / Conclusion
Introduction
• The waLBerla Simulation Framework
• The Lattice Boltzmann Method
Introduction
• waLBerla (widely applicable Lattice Boltzmann framework from Erlangen):
  • main focus on CFD (computational fluid dynamics) simulations based on the lattice Boltzmann method (LBM)
  • at its very core designed as an HPC software framework:
    • scales from laptops to current petascale supercomputers
• largest simulation: 1,835,008 processes (IBM Blue Gene/Q @ Jülich)
• hybrid parallelization: MPI + OpenMP
• vectorization of compute kernels
• written in C++(11)
• support for different platforms (Linux, Windows) and compilers (GCC, Intel XE, Visual Studio, llvm/clang, IBM XL)
• coupling with in-house rigid body physics engine pe
• open source → http://www.walberla.net
Introduction
• The lattice Boltzmann method:
  • regular grid with multiple particle distribution functions (= scalar values) per cell [D2Q9, D3Q19, D3Q27, …]
• explicit method → time stepping
• two steps: stream (neighbors) & collide (cell-local)
• For the collision, different operators exist: SRT, TRT, MRT.
• Macroscopic quantities (velocity, density, …) can be calculated from the particle distribution functions.
[figure: the LBM update rule, split into the stream and collide steps]
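The stream & collide steps and the recovery of macroscopic quantities can be sketched in a few lines. This is a minimal D2Q9 SRT illustration, not waLBerla code; the names `Grid`, `timestep`, and `macroscopic` are made up for this example, and boundaries are simply periodic:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Minimal D2Q9 lattice: 9 particle distribution functions (PDFs) per cell.
static const int Q = 9;
static const int cx[Q] = { 0, 1, 0,-1, 0, 1,-1,-1, 1 };
static const int cy[Q] = { 0, 0, 1, 0,-1, 1, 1,-1,-1 };
static const double w[Q] = { 4.0/9,
                             1.0/9, 1.0/9, 1.0/9, 1.0/9,
                             1.0/36,1.0/36,1.0/36,1.0/36 };

struct Grid {
    int nx, ny;
    std::vector<std::array<double,Q>> f;
    Grid(int nx_, int ny_) : nx(nx_), ny(ny_), f(nx_*ny_) {
        for (auto& cell : f)                         // initialize at rest: f_i = w_i
            for (int i = 0; i < Q; ++i) cell[i] = w[i];
    }
    std::array<double,Q>& at(int x, int y) { return f[y*nx + x]; }
};

// Macroscopic quantities: density = sum_i f_i, velocity = sum_i c_i f_i / rho.
void macroscopic(const std::array<double,Q>& f, double& rho, double& ux, double& uy) {
    rho = ux = uy = 0.0;
    for (int i = 0; i < Q; ++i) {
        rho += f[i];
        ux  += cx[i] * f[i];
        uy  += cy[i] * f[i];
    }
    ux /= rho; uy /= rho;
}

// One time step: stream (pull PDFs from neighbors, periodic boundaries),
// then collide (cell-local SRT relaxation towards equilibrium).
void timestep(Grid& g, double omega) {
    Grid tmp = g;
    for (int y = 0; y < g.ny; ++y)
        for (int x = 0; x < g.nx; ++x) {
            std::array<double,Q> f;
            for (int i = 0; i < Q; ++i) {            // stream step
                int sx = (x - cx[i] + g.nx) % g.nx;
                int sy = (y - cy[i] + g.ny) % g.ny;
                f[i] = tmp.at(sx, sy)[i];
            }
            double rho, ux, uy;
            macroscopic(f, rho, ux, uy);
            double u2 = ux*ux + uy*uy;
            for (int i = 0; i < Q; ++i) {            // collide step
                double cu  = cx[i]*ux + cy[i]*uy;
                double feq = w[i]*rho*(1.0 + 3.0*cu + 4.5*cu*cu - 1.5*u2);
                g.at(x, y)[i] = f[i] + omega*(feq - f[i]);
            }
        }
}
```

Because the collision conserves mass and the stream step only permutes PDFs, a fluid initialized at rest stays at rest with density 1 — a quick sanity check for any LBM kernel.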
Uniform Grids
• Domain Decomposition & Parallelization
• Performance / Benchmarks
Uniform Grids
• Domain Decomposition:
  • regular decomposition into blocks containing uniform grids
• Parallelization:
  • data exchange on borders between blocks via ghost layers
[special case of our much more general forest of octrees data structure → non-uniform/refined grids]
[figure: ghost layer exchange between a sender process and a receiver process]
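A ghost-layer exchange between two horizontally adjacent blocks can be sketched as follows. This is an illustration only, not waLBerla's API: the `Block` struct and `exchangeGhostLayers` are invented names, both blocks live in one process, and in a real run the two copies would become MPI messages between the owning processes:

```cpp
#include <cassert>
#include <vector>

// Each block stores an (nx+2) x (ny+2) array: the outer ring is the ghost
// layer, mirroring the neighboring block's outermost interior cells.
struct Block {
    int nx, ny;                          // interior size
    std::vector<double> data;            // includes the ghost layer
    Block(int nx_, int ny_) : nx(nx_), ny(ny_), data((nx_+2)*(ny_+2), 0.0) {}
    double& at(int x, int y) { return data[y*(nx+2) + x]; }  // x: 0..nx+1
};

// 'left' is west of 'right': left's easternmost interior column fills
// right's west ghost column, and vice versa.
void exchangeGhostLayers(Block& left, Block& right) {
    for (int y = 1; y <= left.ny; ++y) {
        right.at(0, y)          = left.at(left.nx, y);   // left -> right ghost
        left.at(left.nx + 1, y) = right.at(1, y);        // right -> left ghost
    }
}
```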
Uniform Grids
Pipeline: geometry given by surface mesh → domain decomposition into blocks → empty blocks are discarded → load balancing
Load balancing can be based either on space-filling curves (Z-order/Morton order, Hilbert curve) using the underlying forest of octrees, or on graph partitioning (METIS, …) – whatever fits the needs of the simulation best.
(The flow simulation runs only inside the geometry – example: complex geometry of an artery.)
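The space-filling-curve variant of the load balancing can be sketched in a few lines: assign each block a key along the curve, sort, and cut the sorted sequence into (nearly) equal contiguous chunks, one per process. This sketch uses the simpler 2D Morton (Z-order) key; waLBerla's actual balancing works on its forest of octrees and also offers Hilbert curves and METIS, and the names `BlockID` and `balance` are invented here:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Morton key: interleave the bits of the block coordinates. Blocks that are
// close along this key are also close in space, which keeps neighboring
// blocks on the same or nearby processes.
uint64_t mortonKey2D(uint32_t x, uint32_t y) {
    uint64_t key = 0;
    for (int bit = 0; bit < 32; ++bit) {
        key |= (uint64_t)((x >> bit) & 1u) << (2*bit);
        key |= (uint64_t)((y >> bit) & 1u) << (2*bit + 1);
    }
    return key;
}

struct BlockID { uint32_t x, y; };

// Returns, for each block, the rank of the process it is assigned to.
std::vector<int> balance(const std::vector<BlockID>& blocks, int numProcesses) {
    std::vector<size_t> order(blocks.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(), [&](size_t a, size_t b) {
        return mortonKey2D(blocks[a].x, blocks[a].y)
             < mortonKey2D(blocks[b].x, blocks[b].y);
    });
    std::vector<int> owner(blocks.size());
    for (size_t pos = 0; pos < order.size(); ++pos)  // equal contiguous chunks
        owner[order[pos]] = (int)(pos * numProcesses / order.size());
    return owner;
}
```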
Uniform Grids
Pipeline: geometry given by surface mesh → domain decomposition into blocks → empty blocks are discarded → load balancing → allocation of block data (→ grids)
The domain decomposition and load balancing can be performed during the actual simulation … OR …
Uniform Grids
Pipeline: geometry given by surface mesh → domain decomposition into blocks → empty blocks are discarded → load balancing → [DISK] → allocation of block data (→ grids)
Writing the block structure to disk separates the domain partitioning from the simulation; file size: kilobytes to a few megabytes.
Uniform Grids
All of this (the entire pipeline) works just the same when grid refinement is used.
Uniform Grids - Performance
• Benchmark Environments:
  • JUQUEEN (TOP500: 8)
    • Blue Gene/Q, 459K cores, 1 GB/core
    • compiler: IBM XL / IBM MPI
  • SuperMUC (TOP500: 10)
    • Intel Xeon, 147K cores, 2 GB/core
    • compiler: Intel XE / IBM MPI
• Benchmarks (LBM D3Q19):
  • lid-driven cavity – weak scaling (= const. number of cells per core)
  • coronary artery tree – strong scaling (= const. total number of cells)
⇒ C. Godenschwager, F. Schornbaum, M. Bauer, H. Köstler, and U. Rüde, A Framework for Hybrid Parallel Flow Simulations with a Trillion Cells in Complex Geometries, SC13, Denver
Uniform Grids - Performance
• SuperMUC – single node (3.34 million cells per core)
⇒ LBM: low FLOPs-to-bytes ratio → memory intensive
[chart: MLUP/s (0–180) on 1, 2, 4, 8, 12, and 16 cores; legend: LBM compute kernel type – SRT, TRT]
The TRT kernel is more complex, but also much more relevant for physical simulations!
(MLUP/s: million/mega lattice cell updates per second)
Uniform Grids - Performance
• SuperMUC – single node (3.34 million cells per core)
⇒ limited by memory bandwidth
[chart: MLUP/s (0–180) for SRT and TRT on 1, 2, 4, 8, 12, and 16 cores; annotations: naïve, straightforward implementation (already quite optimized!), vectorized compute kernel, bandwidth limit]
Uniform Grids - Performance
• JUQUEEN – single node (1.73 million cells per core)
⇒ limited by memory bandwidth
[chart: MLUP/s (0–80) for SRT and TRT on 1, 2, 4, 8, and 16 cores; annotations: naïve, straightforward implementation (already quite optimized!), vectorized compute kernel, bandwidth limit]
Uniform Grids - Performance
• SuperMUC – TRT kernel (3.34 million cells per core)
[chart: MLUP/s per core (0–10), weak scaling from 32 to 147,456 cores; configurations: 16 processes × 1 thread, 4 × 4, and 2 × 8 per node (#processes per node / #threads per process)]
⇒ 0.99 × 10^12 cells updated per second! (19 values per cell)
Uniform Grids - Performance
• JUQUEEN – TRT kernel (1.73 million cells per core)
[chart: MLUP/s per core (0–5), weak scaling from 32 to 458,752 cores; configurations: 64 processes × 1 thread, 16 × 4, and 8 × 8 per node (#processes per node / #threads per process)]
⇒ 2.1 × 10^12 cells updated per second! (19 values per cell)
⇒ 40 × 10^12 values updated per second!
⇒ 0.41 PFlop/s (0.41 × 10^15 Flop/s)
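The headline numbers above are easy to cross-check against each other. A small sketch, using only the values stated on the slide (2.1 × 10^12 cell updates/s on 458,752 cores, 19 PDF values per D3Q19 cell); the function names are invented for this example:

```cpp
#include <cassert>
#include <cmath>

// Values taken from the slide.
const double totalUpdatesPerSec = 2.1e12;   // cell updates per second
const double cores              = 458752.0; // JUQUEEN cores used
const double valuesPerCell      = 19.0;     // D3Q19

// Per-core throughput in MLUP/s (million lattice cell updates per second):
// ~4.58, consistent with the chart's 0-5 MLUP/s-per-core axis.
double mlupsPerCore() { return totalUpdatesPerSec / cores / 1e6; }

// Total values updated per second: ~40e12, as stated on the slide.
double valuesPerSec() { return totalUpdatesPerSec * valuesPerCell; }
```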
Uniform Grids - Performance
• SuperMUC – TRT kernel (2.1 million fluid cells)
[chart: time steps / sec (0–7,000), "extreme" strong scaling from 32 to 32,768 cores]
⇒ more processes = shorter time to solution!
Uniform Grids - Performance
• JUQUEEN – TRT kernel (16.9 million fluid cells)
[chart: time steps / sec (0–1,000), "extreme" strong scaling from 512 to 262,144 cores]
⇒ more processes = shorter time to solution!
Statically Refined Grids
• Lattice Boltzmann & Grid Refinement
• Domain Decomposition & Load Balancing
• Performance / Benchmarks
Statically Refined Grids
• Lattice Boltzmann & Grid Refinement:
  • Almost all grid refinement schemes for the lattice Boltzmann method rely on a 2:1 balance between neighboring cells.
  • waLBerla now uses a massively parallel implementation of the refinement scheme presented in [1] (which also relies on the 2:1 balance).
  • consequences of the 2:1 balance:
    • twice as many time steps on the fine grid as on the next coarser grid
    • In 3D, for each finer grid level, the memory requirement increases by a factor of 8 and the generated workload by a factor of 16.
[1] Zhao Yu and Liang-Shih Fan, An interaction potential based lattice Boltzmann method with adaptive mesh refinement (AMR) for two-phase flow simulation, J. Comput. Phys. 228(17), September 2009, 6456–6478
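The two growth factors follow directly from the 2:1 balance: halving the cell size in 3D gives 2^3 = 8 times as many cells per unit volume (memory), and since each finer level also runs twice as many time steps, the workload grows by 8 · 2 = 16 per level. A one-line sketch of this arithmetic (function names invented for the example):

```cpp
#include <cassert>
#include <cstdint>

// Per-level growth factors in 3D under 2:1 refinement, relative to level 0.
uint64_t memoryFactor3D(int level)   { return 1ull << (3 * level); } //  8^level
uint64_t workloadFactor3D(int level) { return 1ull << (4 * level); } // 16^level
```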
Statically Refined Grids
• Domain Decomposition:
  • "blocks containing regular grids distributed among all processes"
  ⇒ distributed forest of octrees (→ 2:1 balance)
  • Each process only knows about its own blocks and their neighbors, but has no information about the rest of the domain.
  ⇒ perfectly distributed data structure that scales to huge numbers of processes without any runtime overhead
• Property of this Distributed Data Structure:
  • can be viewed as a graph: each block is connected to all of its neighboring blocks → enables all kinds of graph algorithms
Statically Refined Grids
• Comparison with Uniform LBM:
  • The setup phase remains unchanged: the decoupling (via a file) of the domain decomposition & initial load balancing from the actual simulation is still possible.
  • All the refinement "magic" happens during communication (fine and coarse blocks share ghost layer regions → refinement requires multiple ghost layers per block†).
  • Time stepping becomes more complicated (simple, uniform LBM: boundary handling → stream & collide → communication).
  • All the compute kernels remain unchanged!
† Even though multiple ghost layers are required for the computation/interpolation, the communication is heavily optimized and the average number of bytes communicated per block only slightly increases compared to a uniform simulation.
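The more complicated time stepping implied by the 2:1 balance can be sketched as a recursion: one time step on level l triggers two time steps on level l+1, so the finest of L+1 levels advances 2^L times per coarse step. This is only the control flow, not the cited scheme itself — the interface interpolation/coalescence during communication is omitted, the kernel is stubbed out as a step counter, and all names are invented:

```cpp
#include <cassert>
#include <vector>

struct MultiLevelStepper {
    int finestLevel;
    std::vector<int> stepsPerformed;   // time steps executed, per level
    explicit MultiLevelStepper(int numLevels)
        : finestLevel(numLevels - 1), stepsPerformed(numLevels, 0) {}

    // Stub standing in for the (unchanged) stream & collide kernel.
    void streamAndCollide(int level) { ++stepsPerformed[level]; }

    void timeStep(int level) {
        streamAndCollide(level);
        if (level < finestLevel) {     // 2:1 balance: two finer steps
            timeStep(level + 1);       // per coarse step
            timeStep(level + 1);
        }
        // <- coarse/fine ghost layer communication + interpolation would
        //    happen here in the real scheme
    }
};
```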
Statically Refined Grids
• Load Balancing (adapted to our LBM refinement implementation):
  • one space filling curve (Hilbert curve) per grid level
[figure: Example – 24 blocks on four different grid levels (forest with 3 trees); the numbers represent the corresponding grid levels]
Statically Refined Grids
• Load Balancing (adapted to our LBM refinement implementation):
  • one space filling curve (Hilbert curve) per grid level
  • "even" distribution to all available processes
[figure: Example (cont.) – the 24 blocks (four levels, forest with 3 trees) are distributed, level by level, to six available processes (process 0 … 5)]
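The level-by-level distribution shown in the example can be sketched as follows: blocks carry a grid level and a position along that level's space filling curve (a Hilbert curve on the slides; any per-level ordering works for the sketch), and each level is cut into nearly equal contiguous chunks across all processes, so every process receives a fair share of EVERY level. The names `Block` and `balancePerLevel` are invented for this illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Block { int level; int curvePos; };

// Returns, for each block, the rank of the process it is assigned to.
std::vector<int> balancePerLevel(const std::vector<Block>& blocks,
                                 int numProcesses, int numLevels) {
    std::vector<int> owner(blocks.size(), -1);
    for (int level = 0; level < numLevels; ++level) {
        std::vector<size_t> ids;                   // blocks of this level ...
        for (size_t i = 0; i < blocks.size(); ++i)
            if (blocks[i].level == level) ids.push_back(i);
        std::sort(ids.begin(), ids.end(), [&](size_t a, size_t b) {
            return blocks[a].curvePos < blocks[b].curvePos;
        });                                        // ... in curve order
        for (size_t pos = 0; pos < ids.size(); ++pos)  // even contiguous chunks
            owner[ids[pos]] = (int)(pos * numProcesses / ids.size());
    }
    return owner;
}
```

Balancing each level separately (rather than one global curve) matters because, with LBM refinement, a level-l block performs 2^l times as many time steps per coarse step as a level-0 block.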
Statically Refined Grids
• Benchmark Environments:
  • JUQUEEN (TOP500: 8)
    • Blue Gene/Q, 459K cores, 1 GB/core
    • compiler: IBM XL / IBM MPI
  • SuperMUC (TOP500: 10)
    • Intel Xeon, 147K cores, 2 GB/core
    • compiler: Intel XE / IBM MPI
• Benchmark (LBM D3Q19 TRT):
  • lid-driven cavity
  • weak scaling
  • 6.6875 blocks per process
  • 4 grid levels
  • finest level … covers 1.4% of space … generates 80% of workload
  • coarsest level … covers 78% of space … generates 1.1% of workload
Statically Refined Grids
• SuperMUC – TRT kernel (3.34 million cells per core)
[chart: MLUP/s per core (0–5), weak scaling from 32 to 147,456 cores; configurations: 16 processes × 1 thread, 4 × 4, and 2 × 8 per node (#processes per node / #threads per process)]
Statically Refined Grids
• JUQUEEN – TRT kernel (1.81 million cells per core)
[chart: MLUP/s per core (0–2.5), weak scaling from 32 to 262,144 cores; configurations: 64 processes × 1 thread, 16 × 4, and 8 × 8 per node (#processes per node / #threads per process)]
Statically Refined Grids
• LBM with refinement achieves only about half the number of cell updates per second of regular LBM without refinement.
• Refinement Overhead:
  • much more complex communication patterns, which additionally involve interpolation
  • more complex time stepping scheme
  • multiple blocks per process ⇔ smaller blocks
BUT: LBM with grid refinement requires much less memory and far fewer cells (100 times fewer …)!
Statically Refined Grids
• Example Application: A Study of the Vocal Fold
⇒ http://youtu.be/kUPf__THVZs
Statically Refined Grids
• Example Application: A Study of the Vocal Fold
DNS (direct numerical simulation)
Reynolds number: 1000 / D3Q19 TRT
4300 processes @ SuperMUC
101,466,432 fluid cells
25,800 blocks with 16 x 16 x 16 cells
Statically Refined Grids
• Example Application: A Study of the Vocal Fold
number of different grid levels: 5
95.5 time steps / sec (finest grid)
total number of time steps (finest grid): 864,000
without refinement: 55.2 times more memory … and 98.6 times the workload
Outlook on Dynamic Grid Refinement / Conclusion
Outlook / Conclusion
• Dynamic Grid Refinement:
  • space filling curve? (see p4est and the work done by C. Burstedde et al.)
  • multiple space filling curves – one for each grid level?
  • repartitioning based on graph algorithms? → diffusive algorithms! (see the work of H. Meyerhenke et al.)
• Conclusion:
  • The waLBerla framework now has a highly efficient, massively parallel implementation of the LBM for uniform grids as well as statically refined grids.
⇒ next milestone: support for dynamic grid refinement / runtime adaptivity
THANK YOU FOR YOUR ATTENTION!
QUESTIONS ? (visit http://www.walberla.net)