Los Alamos National Laboratory
Kokkos Implementation of Albany: Towards Performance Portable Finite Element Code
I. Demeshko, O. Guba, R. P. Pawlowski, A. G. Salinger, W. F. Spotz, I. K. Tezaur, and M. A. Heroux
04/07/2016
LA-UR-16-22225
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
Performance Portability
EXASCALE SYSTEM
new architecture
new libraries
new programming models
Albany: agile component-based parallel unstructured mesh application
• A finite element based application development environment containing the "typical" building blocks needed for rapid deployment and prototyping of analysis capabilities
§ A Trilinos demonstration application, built almost exclusively from reusable libraries. Albany leverages 100+ packages/libraries.
§ Open-source
Albany Structure:
[Figure: component diagram. Block labels: Main, PDE Assembly, Solvers, Field Manager, Discretization, Interoperability Use Case, Nonlinear Model, Nonlinear Transient, Optimization, UQ, Analysis Tools, Iterative Linear Solvers, Multi-Level, Mesh Tools, Mesh I/O, Mesh Database, Problem Discretization, ManyCore Node, Multi-Core, Accelerators, Application, Linear Solve, Load Balancing, Input Parser, Node Kernels, Regression Testing, Version Control, Build System, Libraries, Interfaces, Software Quality Tools, Demo Apps, PDE Terms]
Strategic Goal: To enable the Rapid development of new Production codes embedded with Transformational capabilities.
Supports a wide variety of application physics areas:
• Heat transfer
• Fluid dynamics
• Structural mechanics
• Quantum device modeling
• Climate modeling
Albany team is Rapidly Developing Several New Component-Based Applications
1. Turbulent CFD for nuclear energy [NE]
2. Computational mechanics R&D [ASC]
3. Quantum device design [LDRD]
4. Extended MHD [ASCR]
5. CRADA partner's in-house code [CRADA]
6. Peridynamics solver [ASC]
7. Biogeochemical element cycling: climate [SciDAC]
8. Fuel rod degradation modeling [NE]
9. Ice Sheet dynamics [SciDAC]
10. Atmospheric Dynamics [LDRD]
+ Impacting Many Others
Codes are born: parallel, scalable, robust, with sensitivities, optimization, UQ … and ready to adopt: embedded UQ, multi-core kernels, adaptivity, code coupling, ROM
Our goal: To create an architecture-portable version of Albany by using the Kokkos library.
Albany to Kokkos refactoring
[Figure: software stack. Labels: Albany, Trilinos, Kokkos, Phalanx, Intrepid, Piro, Tpetra, MueLu]
• Phalanx: manages dependencies between the different components of Albany and manages data in the code.
• Intrepid: library of interoperable tools for compatible discretizations of partial differential equations.
• Tpetra: implements linear algebra objects, including sparse graphs, sparse matrices, and dense vectors.
• A new Albany-Kokkos implementation:
§ has Kokkos::Views at the base layer
§ has Kokkos::View-like temporary data
§ has Kokkos kernels in place of the original nested loops
§ is a single code base that runs and is performant on diverse HPC architectures
FELIX: Albany Greenland Ice Sheet model
Albany FELIX project
• An unstructured-grid finite element ice sheet code for land-ice modeling (Greenland, Antarctica).
• Project objective:
§ Provide sea level rise prediction
§ Run on new architecture machines (hybrid systems).
– 50% time spent in FE Assembly
– 50% time spent in Linear Solves
Funding Source: SciDAC
Collaborators: SNL, ORNL, LANL, LBNL, UT, FSU, SC, MIT, NCAR
Sandia Staff: A. Salinger, I. Kalashnikova, M. Perego, R. Tuminaro, J. Jakeman, M. Eldred
Phalanx graph for the Greenland Ice-Sheet model
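[Figure: evaluator graph. Nodes in listed order, with their edge labels in parentheses]
0. Gather Solution
1. Gather Coordinate Vector
2. Compute Basis Functions (2:1)
3. VecInterpolation (3:0, 3:2)
4. VecGradInterpolation (4:0, 4:2)
5. ViscosityFO (5:4)
6. Load State Field
7. GradInterpolation (7:2, 7:6)
8. Stokes BodyForce (8:7)
9. Stokes Resid (9:2, 9:3, 9:4, 9:5, 9:8)
10. Scatter Stokes (10:9)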
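One way to read the graph is as a dependency list: the sketch below assumes that each label i:j means "evaluator i needs the output of evaluator j", with evaluators numbered 0 to 10 in the order they are listed. Under that assumption, a topological sort (Kahn's algorithm here) gives a valid evaluation order. The function names are hypothetical and this is not Phalanx code, only an illustration of how a dependency manager could order the evaluators.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Evaluator indices follow the order in which the graph lists them (assumed):
// 0 Gather Solution, 1 Gather Coordinate Vector, 2 Compute Basis Functions,
// 3 VecInterpolation, 4 VecGradInterpolation, 5 ViscosityFO,
// 6 Load State Field, 7 GradInterpolation, 8 Stokes BodyForce,
// 9 Stokes Resid, 10 Scatter Stokes.

// Edges copied from the labels, read as {evaluator, dependency} (assumed meaning).
inline std::vector<std::pair<int, int>> stokes_edges() {
  return {{2, 1}, {3, 0}, {3, 2}, {4, 0}, {4, 2}, {5, 4}, {7, 2}, {7, 6},
          {8, 7}, {9, 2}, {9, 3}, {9, 4}, {9, 5}, {9, 8}, {10, 9}};
}

// Kahn's algorithm: returns an order in which every evaluator appears after
// all of its dependencies. Returns an empty vector if the graph has a cycle.
inline std::vector<int> evaluation_order(
    int num_nodes, const std::vector<std::pair<int, int>>& edges) {
  std::vector<std::vector<int>> dependents(num_nodes);
  std::vector<int> unmet(num_nodes, 0);
  for (const auto& e : edges) {
    dependents[e.second].push_back(e.first);  // e.first waits on e.second
    ++unmet[e.first];
  }
  // Min-heap so that ties are broken by the smallest index (deterministic).
  std::priority_queue<int, std::vector<int>, std::greater<int>> ready;
  for (int n = 0; n < num_nodes; ++n)
    if (unmet[n] == 0) ready.push(n);

  std::vector<int> order;
  while (!ready.empty()) {
    int n = ready.top();
    ready.pop();
    order.push_back(n);
    for (int d : dependents[n])
      if (--unmet[d] == 0) ready.push(d);
  }
  if (order.size() != static_cast<std::size_t>(num_nodes)) order.clear();
  return order;
}
```

Calling `evaluation_order(11, stokes_edges())` yields an order in which Scatter Stokes comes last, since it depends (directly or indirectly) on every other evaluator.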
Kokkos implementation (Greenland Ice-Sheet model)
Device:
1. Copy solution vector to the Device
2. Loop over the number of worksets, evaluating the same Phalanx graph for each workset: Gather Solution, Gather Coordinate Vector, Compute Basis Functions, VecInterpolation, VecGradInterpolation, ViscosityFO, Load State Field, GradInterpolation, Stokes BodyForce, Stokes Resid, Scatter Stokes
3. Copy residual vector to the Host
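The device-side flow on this slide (copy the solution in, loop over worksets, copy the residual out) can be sketched in plain C++. This is a minimal illustration, not Albany code: `evaluate_workset` and `assemble` are hypothetical names, the "device" is just a second set of host vectors, and the per-cell computation is a placeholder for the full evaluator chain.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the evaluator chain run on one workset:
// here each cell's "residual" is simply twice its "solution" value.
inline void evaluate_workset(const std::vector<double>& solution,
                             std::vector<double>& residual,
                             std::size_t first_cell, std::size_t last_cell) {
  for (std::size_t cell = first_cell; cell < last_cell; ++cell)
    residual[cell] = 2.0 * solution[cell];
}

// Mirrors the three steps: copy in, loop over worksets, copy out.
// Returns the number of worksets processed (0 if workset_size is 0,
// in which case host_residual is left untouched).
inline std::size_t assemble(const std::vector<double>& host_solution,
                            std::vector<double>& host_residual,
                            std::size_t workset_size) {
  if (workset_size == 0) return 0;

  // Step 1: "copy solution vector to the device" (a plain copy in this sketch).
  std::vector<double> dev_solution = host_solution;
  std::vector<double> dev_residual(host_solution.size(), 0.0);

  // Step 2: loop over worksets; the last workset may be smaller.
  std::size_t num_worksets = 0;
  for (std::size_t first = 0; first < dev_solution.size();
       first += workset_size) {
    std::size_t last = std::min(first + workset_size, dev_solution.size());
    evaluate_workset(dev_solution, dev_residual, first, last);
    ++num_worksets;
  }

  // Step 3: "copy residual vector to the host".
  host_residual = dev_residual;
  return num_worksets;
}
```

Keeping both copies outside the workset loop is the point of the structure: data crosses the host/device boundary once per assembly rather than once per workset.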
Kokkos functor example in Albany
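The code listing from this slide is not preserved in the transcript. As a rough illustration of the functor pattern only, the sketch below uses plain C++ with no Kokkos dependency: `Array2D` stands in for a rank-2 Kokkos::View, `for_each_cell` stands in for a parallel dispatch over cells, and `InterpolateToQuadPoints` is a hypothetical kernel that interpolates nodal values to quadrature points. None of these names come from Albany or Kokkos.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal 2D array with (i, j) indexing, standing in for a rank-2 view.
struct Array2D {
  std::size_t n0, n1;
  std::vector<double> data;
  Array2D(std::size_t a, std::size_t b) : n0(a), n1(b), data(a * b, 0.0) {}
  double& operator()(std::size_t i, std::size_t j) { return data[i * n1 + j]; }
  double operator()(std::size_t i, std::size_t j) const {
    return data[i * n1 + j];
  }
};

// Functor: one call handles one cell. The outer loop over cells that a
// nested-loop version would contain is supplied by the dispatcher instead.
// val_qp(cell, qp) = sum over nodes of basis(node, qp) * val_node(cell, node)
struct InterpolateToQuadPoints {
  const Array2D& val_node;  // (cell, node)
  const Array2D& basis;     // (node, qp)
  Array2D& val_qp;          // (cell, qp)

  void operator()(std::size_t cell) const {
    for (std::size_t qp = 0; qp < val_qp.n1; ++qp) {
      double sum = 0.0;
      for (std::size_t node = 0; node < basis.n0; ++node)
        sum += basis(node, qp) * val_node(cell, node);
      val_qp(cell, qp) = sum;
    }
  }
};

// Serial stand-in for a parallel dispatch: calls the functor once per cell.
template <class Functor>
void for_each_cell(std::size_t num_cells, const Functor& f) {
  for (std::size_t cell = 0; cell < num_cells; ++cell) f(cell);
}
```

Because each call writes only to its own cell's entries of `val_qp`, the calls are independent, which is what lets a real dispatcher run them on OpenMP threads or GPU threads without changing the functor body.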
FELIX Performance results
Evaluation environment:
Shannon: 32 nodes:
• Two 8-core Sandy Bridge Xeon E5-2670 @ 2.6GHz (HT deactivated) per node,
• 128GB DDR3 memory per node,
• 2x NVIDIA K20x/K40 per node
Configurations:
• Serial = 2 MPI processes
• OpenMP = 16 OpenMP threads
• CUDA = 1 NVIDIA K80 GPU
• UVM for CPU-GPU data management
FELIX performance results
Evaluation environment:
TITAN: 18,688 AMD Opteron nodes:
• 16 cores per node,
• 1 K20X Kepler GPU per node,
• 32GB + 6GB memory per node
Aeras
• Next generation global atmosphere model.
• Numerics are similar to the Community Atmosphere Model - Spectral Elements (CAM-SE)
• Model development: shallow water, X-Z hydrostatic, 3D hydrostatic, clouds, 3D non-hydrostatic
Aeras performance results
[Figure: two plots of time (sec) versus number of elements per workset (100 to 100,000). Panels: "Aeras compute time (Total time - Gather/Scatter)" and "Aeras total time". Series: Serial - 1 MPI thread per node; OpenMP - 16 OpenMP threads per node; CUDA - 1 NVIDIA K80 GPU per node]
Evaluation environment:
Shannon: 32 nodes:
• Two 8-core Sandy Bridge Xeon E5-2670 @ 2.6GHz (HT deactivated) per node,
• 128GB DDR3 memory per node,
• 2x NVIDIA K20x/K40 per node
Aeras performance results
Evaluation environment:
TITAN: 18,688 AMD Opteron nodes:
• 16 cores per node,
• 1 K20X Kepler GPU per node,
• 32GB + 6GB memory per node
Conclusion
• The new version of Albany provides architecture portability;
• Our numerical experiments on two climate applications implemented in Albany show that:
(1) a single code can execute correctly in several evaluation environments (MPI, OpenMP, CUDA UVM), and
(2) reasonable performance is achieved across the different architectures without implicit data management: speed-ups using OpenMP and GPUs can be achieved over an MPI-only run.
Acknowledgments
I would like to thank:
• C. R. Trott and H. C. Edwards for their help with Kokkos,
• Adam V. Delora for his work on Intrepid,
• Eric T. Phipps, Eric C. Cyr and Andrew Bradley for their help with Trilinos and Albany,
• Steve Price, Matt Hoffman and Mauro Perego for providing the data used in the FELIX land-ice runs.
Thank you! [email protected]