ENZO AND EXTREME SCALE AMR FOR HYDRODYNAMIC COSMOLOGY
Michael L. Norman, UC San Diego and SDSC, mlnorman@ucsd.edu
WHAT IS ENZO?
- A parallel AMR application for astrophysics and cosmology simulations
  - Hybrid physics: fluid + particle + gravity + radiation
  - Block-structured AMR
  - MPI or hybrid parallelism
- Under continuous development since 1994
  - Greg Bryan and Mike Norman @ NCSA
  - Shared memory → distributed memory → hierarchical memory
  - C++/C/Fortran, >185,000 LOC
- Community code in widespread use worldwide
  - Hundreds of users, dozens of developers
  - Version 2.0 @ http://enzo.googlecode.com
TWO PRIMARY APPLICATION DOMAINS
- Astrophysical fluid dynamics: supersonic turbulence
- Hydrodynamic cosmology: large-scale structure
ENZO PHYSICS

| Physics | Equations | Math type | Algorithm(s) | Communication |
| --- | --- | --- | --- | --- |
| Dark matter | Newtonian N-body | Numerical integration | Particle-mesh | Gather-scatter |
| Gravity | Poisson | Elliptic | FFT, multigrid | Global |
| Gas dynamics | Euler | Nonlinear hyperbolic | Explicit finite volume | Nearest neighbor |
| Magnetic fields | Ideal MHD | Nonlinear hyperbolic | Explicit finite volume | Nearest neighbor |
| Radiation transport | Flux-limited radiation diffusion | Nonlinear parabolic | Implicit finite difference, multigrid solves | Global |
| Multispecies chemistry | Kinetic equations | Coupled stiff ODEs | Explicit, BE, implicit | None |
| Inertial, tracer, source, and sink particles | Newtonian N-body | Numerical integration | Particle-mesh | Gather-scatter |
Physics modules can be used in any combination in 1D, 2D, and 3D, making ENZO a very powerful and versatile code.
ENZO MESHING
- Berger-Colella structured AMR
- Cartesian base grid and subgrids
- Hierarchical timestepping (see the sketch below)
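As a rough illustration of how hierarchical timestepping over a Berger-Colella hierarchy proceeds, here is a minimal C++ sketch with hypothetical types and function names (not Enzo's actual interfaces): each level advances with its own step, and the next finer level is subcycled, here by the refinement factor of 2, before the two levels are synchronized.

```cpp
#include <vector>

// Hypothetical, simplified level container -- not Enzo's actual classes.
struct Level {
    double dt = 0.0;   // timestep for this level (halves with each factor-2 refinement)
    // ... the grid patches of this level would live here
};

// Stubs standing in for the per-level physics update and the coarse-fine
// synchronization (flux correction, projection of fine data onto the coarse grid).
void AdvanceGridsOneStep(Level& /*level*/) {}
void SynchronizeWithFinerLevel(Level& /*coarse*/, const Level& /*fine*/) {}

// Berger-Colella hierarchical (subcycled) timestepping: advance level l by one of
// its own steps, then recursively take r finer steps on level l+1, then synchronize.
void EvolveLevel(std::vector<Level>& hierarchy, int l, int r = 2) {
    AdvanceGridsOneStep(hierarchy[l]);
    if (l + 1 < static_cast<int>(hierarchy.size())) {
        for (int sub = 0; sub < r; ++sub)
            EvolveLevel(hierarchy, l + 1, r);
        SynchronizeWithFinerLevel(hierarchy[l], hierarchy[l + 1]);
    }
}

int main() {
    std::vector<Level> hierarchy(3);   // levels 0, 1, 2
    EvolveLevel(hierarchy, 0);         // one root-level step subcycles the finer levels
}
```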
[Figure: three-level AMR grid hierarchy showing Level 0, Level 1, and Level 2 patches]

AMR = collection of grids (patches); each grid is a C++ object
Unigrid = collection of Level 0 grid patches
EVOLUTION OF ENZO PARALLELISM
- Shared memory (PowerC) parallel (1994-1998)
  - SMP and DSM architectures (SGI Origin 2000, Altix)
  - Parallel DO across grids at a given refinement level, including block-decomposed base grid
  - O(10,000) grids
- Distributed memory (MPI) parallel (1998-2008)
  - MPP and SMP cluster architectures (e.g., IBM PowerN)
  - Level 0 grid partitioned across processors
  - Level >0 grids within a processor executed sequentially
  - Dynamic load balancing by messaging grids to underloaded processors (greedy load balancing)
  - O(100,000) grids
Projection of refinement levels
160,000 grid patches at 4 refinement levels
- 1 MPI task per processor
- Task = a Level 0 grid patch and all associated subgrids, processed sequentially across and within levels
EVOLUTION OF ENZO PARALLELISM
- Hierarchical memory (MPI+OpenMP) parallel (2008-)
  - SMP and multicore cluster architectures (Sun Constellation, Cray XT4/5)
  - Level 0 grid partitioned across shared memory nodes/multicore processors
  - Parallel DO across grids at a given refinement level within a node
  - Dynamic load balancing less critical because of larger MPI task granularity (statistical load balancing)
  - O(1,000,000) grids
- N MPI tasks per SMP, M OpenMP threads per task
- Task = a Level 0 grid patch and all associated subgrids, processed concurrently within levels and sequentially across levels
- Each grid is handled by an OpenMP thread (see the sketch below)
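A minimal sketch of this hybrid scheme, with hypothetical names rather than Enzo's actual data structures: within one MPI task, the grids of a given refinement level are independent within a step, so an OpenMP parallel loop advances them concurrently, while MPI (not shown) couples the Level 0 tiles owned by different tasks.

```cpp
#include <mpi.h>
#include <omp.h>
#include <vector>

// Hypothetical grid patch type; in Enzo each grid is a C++ object.
struct Grid { /* metadata + field arrays */ };

void AdvanceOneStep(Grid& /*g*/) { /* hyperbolic update of one patch */ }

// Advance all grids of one refinement level owned by this MPI task.
// Grids at the same level can be updated concurrently by OpenMP threads;
// boundary exchange between tasks happens via MPI outside this loop (omitted).
void AdvanceLevel(std::vector<Grid>& grids_on_level) {
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < static_cast<int>(grids_on_level.size()); ++i) {
        AdvanceOneStep(grids_on_level[i]);
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);          // N MPI tasks, e.g. one per Level 0 tile
    std::vector<Grid> level0(64);    // this task's grids at one level (illustrative count)
    AdvanceLevel(level0);            // M OpenMP threads share the per-level loop
    MPI_Finalize();
    return 0;
}
```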
ENZO ON PETASCALE PLATFORMS
ENZO ON CRAY XT5: 1% OF THE 6400³ SIMULATION

- Non-AMR 6400³, 80 Mpc box
- 15,625 (25³) MPI tasks, 256³ root grid tiles
- 6 OpenMP threads per task
- 93,750 cores (see the arithmetic below)
- 30 TB per checkpoint/restart/data dump
- >15 GB/sec read, >7 GB/sec write
- Benefit of threading: reduce MPI overhead and improve disk I/O
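The task and core counts above follow directly from the domain decomposition (a worked check, assuming the root grid is split evenly into 256³-cell tiles):

```latex
\[
\frac{6400}{256} = 25 \;\Rightarrow\; 25^3 = 15{,}625 \text{ MPI tasks},
\qquad
15{,}625 \times 6 \ \tfrac{\text{threads}}{\text{task}} = 93{,}750 \text{ cores}.
\]
```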
ENZO ON PETASCALE PLATFORMS
ENZO ON CRAY XT5: 10⁵ SPATIAL DYNAMIC RANGE

- AMR 1024³, 50 Mpc box, 7 levels of refinement
- 4096 (16³) MPI tasks, 64³ root grid tiles
- 1 to 6 OpenMP threads per task (4096 to 24,576 cores)
- Benefit of threading: thread count increases with memory growth; reduces replication of grid hierarchy data
Using MPI+threads to access more RAM as the AMR calculation grows in size
ENZO ON PETASCALE PLATFORMS
ENZO-RHD ON CRAY XT5: COSMIC REIONIZATION

- Including radiation transport: ~10x more expensive
- LLNL Hypre multigrid solver dominates run time
- Near-ideal scaling to at least 32K MPI tasks
- Non-AMR 1024³, 8 and 16 Mpc boxes
- 4096 (16³) MPI tasks, 64³ root grid tiles
BLUE WATERS TARGET SIMULATION: RE-IONIZING THE UNIVERSE

- Cosmic reionization is a weak-scaling problem: large volumes at a fixed resolution to span the range of scales
- Non-AMR 4096³ with ENZO-RHD
- Hybrid MPI and OpenMP; SMT and SIMD tuning
- 128³ to 256³ root grid tiles
- 4-8 OpenMP threads per task
- 4-8 TB per checkpoint/restart/data dump (HDF5); see the size estimate below
- In-core intermediate checkpoints (?)
- 64-bit arithmetic, 64-bit integers and pointers
- Aiming for 64-128K cores, 20-40M hours (?)
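As a rough plausibility check on the checkpoint size (my own estimate, not from the slides): one double-precision field on a 4096³ mesh is roughly half a terabyte, so a dump holding on the order of ten field variables plus particle data lands in the quoted 4-8 TB range.

```latex
\[
4096^3 \ \text{cells} \times 8 \ \tfrac{\text{bytes}}{\text{cell}}
\;\approx\; 6.9\times 10^{10} \times 8 \ \text{bytes}
\;\approx\; 0.55 \ \text{TB per 64-bit field}.
\]
```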
PETASCALE AND BEYOND
- ENZO's AMR infrastructure limits scalability to O(10⁴) cores
- We are developing a new, extremely scalable AMR infrastructure called Cello: http://lca.ucsd.edu/projects/cello
- ENZO-P will be implemented on top of Cello to scale to
CURRENT CAPABILITIES: AMR VS TREECODE
CELLO EXTREME AMR FRAMEWORK: DESIGN PRINCIPLES
- Hierarchical parallelism and load balancing to improve localization
- Relax global synchronization to a minimum
- Flexible mapping between data structures and concurrency
- Object-oriented design
- Build on the best available software for fault-tolerant, dynamically scheduled concurrent objects (Charm++)
CELLO EXTREME AMR FRAMEWORK: APPROACH AND SOLUTIONS
1. Hybrid replicated/distributed octree-based AMR approach, with novel modifications to improve AMR scaling in terms of both size and depth (see the sketch after this list)
2. Patch-local adaptive time steps
3. Flexible hybrid parallelization strategies
4. Hierarchical load balancing approach based on actual performance measurements
5. Dynamic task scheduling and communication
6. Flexible reorganization of AMR data in memory to permit independent optimization of computation, communication, and storage
7. Variable AMR grid block sizes while keeping parallel task sizes fixed
8. Address numerical precision and range issues that arise in particularly deep AMR hierarchies
9. Detect and handle hardware or software faults during run-time to improve software resilience and enable software self-management
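As a purely illustrative sketch of the ideas in items 1-2 (hypothetical types, not Cello's actual classes): in octree-based AMR, each block is identified by its level and index within that level, refines into 2³ children, and in a distributed hierarchy can carry its own, patch-local timestep.

```cpp
#include <array>
#include <cstdint>
#include <memory>

// Hypothetical octree AMR block -- an illustration only, not Cello's data structures.
struct Block {
    int level = 0;                          // refinement level (depth in the octree)
    std::array<int64_t, 3> index{0, 0, 0};  // block index within its level
    double dt = 0.0;                        // patch-local adaptive timestep
    std::array<std::unique_ptr<Block>, 8> children;  // 2^3 children when refined

    bool is_leaf() const { return !children[0]; }

    // Refine this block into 8 children covering half the extent per dimension;
    // a distributed implementation would create the children as remote objects
    // rather than local pointers.
    void refine() {
        for (int c = 0; c < 8; ++c) {
            auto child = std::make_unique<Block>();
            child->level = level + 1;
            child->index = {2 * index[0] + (c & 1),
                            2 * index[1] + ((c >> 1) & 1),
                            2 * index[2] + ((c >> 2) & 1)};
            child->dt = 0.5 * dt;           // finer blocks typically take smaller steps
            children[c] = std::move(child);
        }
    }
};
```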
IMPROVING THE AMR MESH: PATCH COALESCING
IMPROVING THE AMR MESH: TARGETED REFINEMENT
IMPROVING THE AMR MESH: TARGETED REFINEMENT WITH BACKFILL
CELLO SOFTWARE COMPONENTS
http://lca.ucsd.edu/projects/cello
ROADMAP
ENZO RESOURCES

- Enzo website (code, documentation): http://lca.ucsd.edu/projects/enzo
- 2010 Enzo User Workshop slides: http://lca.ucsd.edu/workshops/enzo2010
- yt website (analysis and vis.): http://yt.enzotools.org
- Jacques website (analysis and vis.): http://jacques.enzotools.org/doc/Jacques/Jacques.html
BACKUP SLIDES
GRID HIERARCHY DATA STRUCTURE

[Figure: three-level grid hierarchy (Level 0, Level 1, Level 2) with grids labeled by (level, index): (0,0); (1,0); (2,0), (2,1)]
[Figure: AMR hierarchy tree with nodes (0); (1,0), (1,1); (2,0)-(2,4); (3,0)-(3,7); (4,0)-(4,4); vertical axis = depth (level), horizontal axis = breadth (# siblings)]

Scaling the AMR grid hierarchy in depth and breadth
1024³, 7-LEVEL AMR STATS

| Level | Grids | Memory (MB) | Work = Mem × 2^level |
| --- | --- | --- | --- |
| 0 | 512 | 179,029 | 179,029 |
| 1 | 223,275 | 114,629 | 229,258 |
| 2 | 51,522 | 21,226 | 84,904 |
| 3 | 17,448 | 6,085 | 48,680 |
| 4 | 7,216 | 1,975 | 31,600 |
| 5 | 3,370 | 1,006 | 32,192 |
| 6 | 1,674 | 599 | 38,336 |
| 7 | 794 | 311 | 39,808 |
| Total | 305,881 | 324,860 | 683,807 |
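The work column reflects the subcycled hierarchical timestepping: with a refinement factor of 2, grids at level ℓ take 2^ℓ timesteps for every root-level step, so a level's relative cost scales as its memory footprint times 2^ℓ.

```latex
\[
\text{Work}_\ell \;\propto\; \text{Mem}_\ell \times 2^{\ell},
\qquad \text{e.g. level 1: } 114{,}629 \times 2^{1} = 229{,}258 .
\]
```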
Current MPI Implementation

[Figure: each processor holds "real" grid objects (grid metadata + physics data) for the grids it owns, and "virtual" grid objects (grid metadata only) for the remaining grids in the hierarchy; see the sketch below]
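A minimal sketch of the real/virtual distinction, with a hypothetical class layout rather than Enzo's actual grid class: every processor keeps metadata for every grid in the hierarchy, but only allocates field arrays for the grids it owns; remote grids stay metadata-only.

```cpp
#include <cstddef>
#include <memory>

// Lightweight description of a patch: present on every processor.
struct GridMetadata {
    int level;            // refinement level
    int owner_rank;       // MPI rank that owns the physics data
    double left_edge[3];  // patch extent
    double right_edge[3];
    int dims[3];          // cells per dimension
};

// Hypothetical grid object: a "real" grid allocates its field arrays,
// a "virtual" grid keeps only the metadata (field pointer stays null).
class Grid {
public:
    Grid(const GridMetadata& md, int my_rank) : meta_(md) {
        if (md.owner_rank == my_rank) {
            const std::size_t ncells =
                static_cast<std::size_t>(md.dims[0]) * md.dims[1] * md.dims[2];
            density_ = std::make_unique<double[]>(ncells);   // physics data
        }
    }
    bool is_real() const { return static_cast<bool>(density_); }

private:
    GridMetadata meta_;
    std::unique_ptr<double[]> density_;   // one field shown; real grids hold many
};
```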
SCALING AMR GRID HIERARCHY
- Flat MPI implementation is not scalable because grid hierarchy metadata is replicated in every processor
  - For very large grid counts, this metadata (not the physics data!) dominates the memory requirement
- Hybrid parallel implementation helps a lot (see the estimate below)
  - Hierarchy metadata is now replicated only in every SMP node instead of every processor
  - We would prefer fewer SMP nodes (8192-4096) with bigger core counts (32-64) (= 262,144 cores)
  - Communication burden is partially shifted from MPI to intra-node memory accesses
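A rough way to see the benefit (the node and core counts are from the bullets above; the framing of the ratio is mine): flat MPI keeps one copy of the hierarchy metadata per core, while the hybrid scheme keeps one per node, a reduction by the threads-per-node count. Both node configurations quoted above give the same total core count.

```latex
\[
8192 \times 32 \;=\; 4096 \times 64 \;=\; 262{,}144 \ \text{cores},
\qquad
\frac{\text{metadata copies (flat MPI)}}{\text{metadata copies (hybrid)}}
 \;=\; \frac{\#\text{cores}}{\#\text{nodes}} \;=\; 32 \ \text{or} \ 64 .
\]
```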
CELLO EXTREME AMR FRAMEWORK
- Targeted at fluid, particle, or hybrid (fluid + particle) simulations on millions of cores
- Generic AMR scaling issues:
  - Small AMR patches restrict available parallelism
  - Dynamic load balancing
  - Maintaining data locality for deep hierarchies
  - Re-meshing efficiency and scalability
  - Inherently global multilevel elliptic solves
  - Increased range and precision requirements for deep hierarchies