13th–14th Sept 2016, NCAR MultiCore 6 Workshop
PSyclone: a code generation and optimisation system for finite element and
finite difference codes
Rupert Ford, Andrew Porter and Karthee Sivalingam, Mike Ashworth (speaker)
STFC Daresbury Laboratory, United Kingdom
Funded by the Hartree Centre
Overview
Motivation
− Maintainable, efficient, scalable software on current and future HPC architectures
Aims
− Portable performance (today and in the future)
− Single-source science code
− High(er)-level problem specification
Overview
Challenge
− Parallelism increasing
− Memory latency growing and memory hierarchies deepening
− Different and changing architectures, software standards and compilers
− It is hard to restructure codes
Overview
Some Solutions
− HPC experts optimise for particular architectures: not single-source science, not portable performance
− Trade maintenance, portability and performance (e.g. only support MPI): not portable performance
− Use domain-specific knowledge ...
Domain Specific Knowledge
Finite element, finite volume and finite difference codes share:
− Operations over a mesh
− Typically the same operation at each element/volume/point
− Data parallelism (typically independent operations)
− Low-level functional parallelism
− Nearest-neighbour communications for stencils (see the toy sketch below)
− Global sum(s) for solver convergence and/or conservation (e.g. of temperature)
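As a concrete illustration of this pattern, here is a toy finite-difference fragment (invented for this transcript, not taken from any of the codes discussed): the same independent operation is applied at every interior point, reading only nearest neighbours.

  subroutine smooth(m, n, u, unew)
    ! Toy 5-point stencil: every interior point applies the same
    ! operation, reading only nearest neighbours, and all points are
    ! independent; exactly the data parallelism the PSy layer exploits.
    integer, intent(in)       :: m, n
    real(kind=8), intent(in)  :: u(m, n)
    real(kind=8), intent(out) :: unew(m, n)
    integer :: i, j
    do j = 2, n - 1
       do i = 2, m - 1
          unew(i, j) = 0.25d0 * (u(i-1, j) + u(i+1, j) + u(i, j-1) + u(i, j+1))
       end do
    end do
  end subroutine smooth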
Projects
GungHo
− Separate science from HPC optimisation (PSyKAl): performance portability, single-source science code
− 'Dynamo' prototype implementation
− PSyclone generation of the PSy layer
GOcean
− Apply the GungHo approach to ocean models
Intel Parallel Computing Centre
− PSyclone optimisation for Xeon and Xeon Phi
LFRic
− See Chris Maynard's talk
PSyKAl
[Diagram: the PSyKAl separation of concerns. The Algorithm and Kernel layers hold the science code; the PSy layer holds the performance/infrastructure code.]
− The Algorithm layer refers to the whole model domain
− Kernels operate on individual columns
− The Parallel System (PSy) layer handles multiple levels of parallelism
Dynamo algorithm example ...
− From lhs_alg_mod.x90
− Multiple kernels within an invoke()
− Note: 'built-ins' are now supported in PSyclone
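The listing itself is not reproduced in this transcript. Below is a hypothetical fragment in the style of the Dynamo algorithm API: the surrounding subroutine and argument names are invented, setval_c is one of the built-ins mentioned above, and matrix_vector_kernel_type names the kernel shown on the next slide.

  subroutine lhs_alg(lhs, x, mm)
    ! Hypothetical algorithm-layer code (NOT the actual contents of
    ! lhs_alg_mod.x90).  PSyclone replaces the invoke() with a call to
    ! a generated PSy-layer routine covering both operations.
    use constants_mod,            only : r_def
    use field_mod,                only : field_type
    use operator_mod,             only : operator_type
    use matrix_vector_kernel_mod, only : matrix_vector_kernel_type
    type(field_type),    intent(inout) :: lhs
    type(field_type),    intent(in)    :: x
    type(operator_type), intent(in)    :: mm

    call invoke( setval_c(lhs, 0.0_r_def),              & ! built-in: lhs = 0
                 matrix_vector_kernel_type(lhs, x, mm) )  ! lhs = lhs + mm*x
  end subroutine lhs_alg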
Dynamo kernel example ...
− From matrix_vector_kernel_mod.F90
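Again the listing is not in the transcript. The sketch below shows the shape of the column-local compute routine such a kernel module contains; the argument list follows Dynamo API conventions but is simplified, and the type-bound metadata that PSyclone reads (access descriptors, function spaces, iterates-over) is omitted.

  subroutine matrix_vector_code(cell, nlayers, lhs, x, ncell_3d, matrix, &
                                ndf1, undf1, map1, ndf2, undf2, map2)
    ! Sketch of a column-local matrix-vector product: lhs = lhs + matrix*x.
    ! The kernel sees one column; the PSy layer supplies the loop over cells.
    integer, parameter :: r_def = kind(1.0d0)       ! working precision
    integer, intent(in) :: cell, nlayers, ncell_3d
    integer, intent(in) :: ndf1, undf1, ndf2, undf2
    integer, intent(in) :: map1(ndf1), map2(ndf2)   ! dof maps for the column
    real(kind=r_def), intent(in)    :: x(undf2)
    real(kind=r_def), intent(in)    :: matrix(ndf1, ndf2, ncell_3d)
    real(kind=r_def), intent(inout) :: lhs(undf1)
    integer :: df, df2, k, ik

    do k = 0, nlayers - 1
       ik = (cell - 1) * nlayers + k + 1            ! this cell-layer's slab
       do df2 = 1, ndf2                             ! dofs of the input space
          do df = 1, ndf1                           ! dofs of the output space
             lhs(map1(df) + k) = lhs(map1(df) + k) &
                  + matrix(df, df2, ik) * x(map2(df2) + k)
          end do
       end do
    end do
  end subroutine matrix_vector_code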
PSyclone
[Diagram: PSyclone internals. The Parser reads the Algorithm code and the Kernel codes and produces ASTs plus kernel info; the PSy Generator builds a schedule for each invoke, applies Transformations to it, and emits the PSy code; the Algorithm Generator rewrites the Algorithm code to call the generated PSy layer.]
Dynamo algorithm post PSyclone
PSyclone Transformations example ...
> python ../../psyclone/src/generator.py lhs_alg_mod.x90 \
    -d ../../dynamo/kernel -oalg alg.f90 -opsy psy.f90 \
    -nodm -s ./global.py
Vanilla, no dm (no distributed memory)
PSyclone Transformations example ...
> python ../../psyclone/src/generator.py lhs_alg_mod.x90 \
    -d ../../dynamo/kernel -oalg alg.f90 -opsy psy.f90 \
    -s ./global.py
Vanilla, dm
PSyclone Transformations example ...
> python ../../psyclone/src/generator.py lhs_alg_mod.x90 \
    -d ../../dynamo/kernel -oalg alg.f90 -opsy psy.f90 \
    -s ./global.py
Colours, dm
PSyclone Transformations example ...
> python ../../psyclone/src/generator.py lhs_alg_mod.x90 \
    -d ../../dynamo/kernel -oalg alg.f90 -opsy psy.f90 \
    -s ./global.py
OpenMP + dm
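The transformation script global.py is not shown in the transcript. As a minimal sketch of its effect in the colours-plus-OpenMP case, the PSy-layer loop structure it produces looks roughly like this (invented names, not actual generated code):

  subroutine invoke_sketch(ncolour, ncells_of_colour, cmap)
    ! Colours run sequentially; cells of one colour share no dofs, so the
    ! inner loop is race-free and can be threaded without atomics.
    integer, intent(in) :: ncolour
    integer, intent(in) :: ncells_of_colour(ncolour)
    integer, intent(in) :: cmap(:, :)      ! cmap(colour, i) = cell index
    integer :: colour, cell

    do colour = 1, ncolour
       !$omp parallel do default(shared), private(cell), schedule(static)
       do cell = 1, ncells_of_colour(colour)
          ! kernel call goes here, e.g. matrix_vector_code(cmap(colour,cell), ...)
          ! with the argument list the PSy layer assembles (elided)
       end do
       !$omp end parallel do
    end do
  end subroutine invoke_sketch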
PSyclone Transformations example ...
No Scientists were harmed
Going MPI Parallel
LFRic infrastructure and PSyclone support
PSyclone command line change:
> python generator.py -oalg alg.f90 -opsy psy.f90 file.f90 -nodm
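With -nodm removed, PSyclone generates distributed-memory support in the PSy layer. Roughly, it adds logic of the following shape around each kernel loop (a fragment with invented proxy/mesh names, based on the LFRic infrastructure API, not actual generated code):

  ! Exchange the halo of a field the kernel reads, only if it is stale:
  if (x_proxy%is_dirty(depth=1)) then
     call x_proxy%halo_exchange(depth=1)
  end if
  ! Loop over cells, out to the halo depth the kernel requires:
  do cell = 1, mesh%get_last_halo_cell(1)
     ! kernel call (arguments elided)
  end do
  ! Mark the written field's halo as out of date:
  call lhs_proxy%set_dirty()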
Dynamo Initial MPI results
[Chart not transcribed]
− L is the patch of columns on each core
− Blue is weak scaling (small problem)
− Red is weak scaling (larger problem)
− From blue to red is strong scaling
Going OpenMP Parallel
Dynamo infrastructure and PSyclone support
PSyclone script:
> python generator.py -oalg alg.f90 -opsy psy.f90 file.f90 -s script.py
No Scientists were harmed
Dynamo Initial OpenMP results
[Chart: performance (/sec) vs. number of physical cores (0 to 64); series: Xeon IvyBridge, Xeon Phi KNC, KNC HT=2, KNC HT=4]
− Dual-socket 12-core IvyBridge
− Hyperthreading gives a small performance boost on the Xeon Phi
− No attempt has been made to improve vectorization
What about performance?
Shallow water model, GOcean FD API: sequential performance
− Comparing with a hand-optimised code
− The PSyKAl-restructured code can perform as well as (sometimes better than) the hand-optimised code
NEMOlite2D model, GOcean FD API: OpenMP and OpenACC performance
How well optimised is the code?
− Plan to analyse kernels to determine a realistic, achievable upper bound on performance
− Analyse kernels via a generated DAG
− Example: matrix-vector loop1
Summary
− PSyclone has domain-specific APIs to support FE in LFRic and FD in GOcean
− The full LFRic code, incorporating the GungHo dynamical core, is running using PSyclone
− From single-source science code we are generating serial, MPI, OpenMP and MPI/OpenMP code
− MPI scaling to O(10k) tasks; see Chris Maynard's talk for O(100k)
− OpenMP scaling to 8 threads on CPU, more on KNC
− Optimisations in progress
− OpenPOWER port in progress
Future Work
− Open source later this year
Currently working toward:
− MPI optimisations
− Multigrid support
− Kernel optimisations
Future:
− Physics integration
− OpenACC
− 3D finite difference / finite volume API
− Explore / search the optimisation space