13th–14th Sept 2016, NCAR MultiCore 6 Workshop
PSyclone: a code generation and optimisation system for finite element and
finite difference codes
Rupert Ford, Andrew Porter and Karthee Sivalingam, Mike Ashworth (speaker)
STFC Daresbury Laboratory, United Kingdom
Funded by the Hartree Centre
Overview
Motivation
− Maintainable, efficient, scalable software on current and future HPC architectures
Aims
− Portable performance (today and in the future)
− Single-source science code
− High(er)-level problem specification
Overview
Challenge
− Parallelism increasing
− Memory latency growing and memory hierarchies deepening
− Different and changing architectures, software standards and compilers
− It is hard to restructure codes
Overview
Some Solutions
− HPC experts optimise for particular architectures: not single-source science, not portable performance
− Trade maintenance, portability and performance (e.g. only support MPI): not portable performance
− Use domain-specific knowledge ...
Domain Specific Knowledge
Finite element, finite volume and finite difference codes share:
− Operations over a mesh
− Typically the same operation at each element/volume/point
− Data parallelism (typically independent operations)
− Low-level functional parallelism
− Nearest-neighbour communications for stencils (see the toy sketch below)
− Global sum(s) for solver convergence and/or conservation (e.g. of temperature)
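As a concrete illustration of this pattern, here is a toy finite-difference fragment (invented for this transcript, not taken from any of the codes discussed): the same independent operation is applied at every interior point, reading only nearest neighbours.

  subroutine smooth(m, n, u, unew)
    ! Toy 5-point stencil: every interior point applies the same
    ! operation, reading only nearest neighbours, and all points are
    ! independent; exactly the data parallelism the PSy layer exploits.
    integer, intent(in)       :: m, n
    real(kind=8), intent(in)  :: u(m, n)
    real(kind=8), intent(out) :: unew(m, n)
    integer :: i, j
    do j = 2, n - 1
       do i = 2, m - 1
          unew(i, j) = 0.25d0 * (u(i-1, j) + u(i+1, j) + u(i, j-1) + u(i, j+1))
       end do
    end do
  end subroutine smooth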
Projects
GungHo
− Separate science from HPC optimisation (PSyKAl): performance portability, single-source science code
− 'Dynamo' prototype implementation
− PSyclone generation of the PSy layer
GOcean
− Apply the GungHo approach to ocean models
Intel Parallel Computing Centre
− PSyclone optimisation for Xeon and Xeon Phi
LFRic
− See Chris Maynard's talk
PSyKAl
[Diagram: the PSyKAl separation of concerns. The Algorithm and Kernel layers hold the science code; the PSy layer holds the performance/infrastructure code.]
− The Algorithm layer refers to the whole model domain
− Kernels operate on individual columns
− The Parallel System (PSy) layer handles multiple levels of parallelism
Dynamo algorithm example ...
− From lhs_alg_mod.x90
− Multiple kernels within an invoke()
− Note: 'built-ins' are now supported in PSyclone
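The listing itself is not reproduced in this transcript. Below is a hypothetical fragment in the style of the Dynamo algorithm API: the surrounding subroutine and argument names are invented, setval_c is one of the built-ins mentioned above, and matrix_vector_kernel_type names the kernel shown on the next slide.

  subroutine lhs_alg(lhs, x, mm)
    ! Hypothetical algorithm-layer code (NOT the actual contents of
    ! lhs_alg_mod.x90).  PSyclone replaces the invoke() with a call to
    ! a generated PSy-layer routine covering both operations.
    use constants_mod,            only : r_def
    use field_mod,                only : field_type
    use operator_mod,             only : operator_type
    use matrix_vector_kernel_mod, only : matrix_vector_kernel_type
    type(field_type),    intent(inout) :: lhs
    type(field_type),    intent(in)    :: x
    type(operator_type), intent(in)    :: mm

    call invoke( setval_c(lhs, 0.0_r_def),              & ! built-in: lhs = 0
                 matrix_vector_kernel_type(lhs, x, mm) )  ! lhs = lhs + mm*x
  end subroutine lhs_alg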
Dynamo kernel example ...
− From matrix_vector_kernel_mod.F90
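Again the listing is not in the transcript. The sketch below shows the shape of the column-local compute routine such a kernel module contains; the argument list follows Dynamo API conventions but is simplified, and the type-bound metadata that PSyclone reads (access descriptors, function spaces, iterates-over) is omitted.

  subroutine matrix_vector_code(cell, nlayers, lhs, x, ncell_3d, matrix, &
                                ndf1, undf1, map1, ndf2, undf2, map2)
    ! Sketch of a column-local matrix-vector product: lhs = lhs + matrix*x.
    ! The kernel sees one column; the PSy layer supplies the loop over cells.
    integer, parameter :: r_def = kind(1.0d0)       ! working precision
    integer, intent(in) :: cell, nlayers, ncell_3d
    integer, intent(in) :: ndf1, undf1, ndf2, undf2
    integer, intent(in) :: map1(ndf1), map2(ndf2)   ! dof maps for the column
    real(kind=r_def), intent(in)    :: x(undf2)
    real(kind=r_def), intent(in)    :: matrix(ndf1, ndf2, ncell_3d)
    real(kind=r_def), intent(inout) :: lhs(undf1)
    integer :: df, df2, k, ik

    do k = 0, nlayers - 1
       ik = (cell - 1) * nlayers + k + 1            ! this cell-layer's slab
       do df2 = 1, ndf2                             ! dofs of the input space
          do df = 1, ndf1                           ! dofs of the output space
             lhs(map1(df) + k) = lhs(map1(df) + k) &
                  + matrix(df, df2, ik) * x(map2(df2) + k)
          end do
       end do
    end do
  end subroutine matrix_vector_code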
PSyclone
[Diagram: PSyclone internals. The Parser reads the Algorithm code and the Kernel codes and produces ASTs plus kernel info; the PSy Generator builds a schedule for each invoke, applies Transformations to it, and emits the PSy code; the Algorithm Generator rewrites the Algorithm code to call the generated PSy layer.]
Dynamo algorithm post PSyclone
PSyclone Transformations example ...
> python ../../psyclone/src/generator.py lhs_alg_mod.x90 \
    -d ../../dynamo/kernel -oalg alg.f90 -opsy psy.f90 \
    -nodm -s ./global.py
Vanilla, no dm (no distributed memory)
PSyclone Transformations example ...
> python ../../psyclone/src/generator.py lhs_alg_mod.x90 \
    -d ../../dynamo/kernel -oalg alg.f90 -opsy psy.f90 \
    -s ./global.py
Vanilla, dm
PSyclone Transformations example ...
> python ../../psyclone/src/generator.py lhs_alg_mod.x90 \
    -d ../../dynamo/kernel -oalg alg.f90 -opsy psy.f90 \
    -s ./global.py
Colours, dm
PSyclone Transformations example ...
> python ../../psyclone/src/generator.py lhs_alg_mod.x90 \
    -d ../../dynamo/kernel -oalg alg.f90 -opsy psy.f90 \
    -s ./global.py
OpenMP + dm
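The transformation script global.py is not shown in the transcript. As a minimal sketch of its effect in the colours-plus-OpenMP case, the PSy-layer loop structure it produces looks roughly like this (invented names, not actual generated code):

  subroutine invoke_sketch(ncolour, ncells_of_colour, cmap)
    ! Colours run sequentially; cells of one colour share no dofs, so the
    ! inner loop is race-free and can be threaded without atomics.
    integer, intent(in) :: ncolour
    integer, intent(in) :: ncells_of_colour(ncolour)
    integer, intent(in) :: cmap(:, :)      ! cmap(colour, i) = cell index
    integer :: colour, cell

    do colour = 1, ncolour
       !$omp parallel do default(shared), private(cell), schedule(static)
       do cell = 1, ncells_of_colour(colour)
          ! kernel call goes here, e.g. matrix_vector_code(cmap(colour,cell), ...)
          ! with the argument list the PSy layer assembles (elided)
       end do
       !$omp end parallel do
    end do
  end subroutine invoke_sketch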
PSyclone Transformations example ...
No Scientists were harmed
Going MPI Parallel
LFRic infrastructure and PSyclone support
PSyclone command line change:
> python generator.py -oalg alg.f90 -opsy psy.f90 file.f90 -nodm
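With -nodm removed, PSyclone generates distributed-memory support in the PSy layer. Roughly, it adds logic of the following shape around each kernel loop (a fragment with invented proxy/mesh names, based on the LFRic infrastructure API, not actual generated code):

  ! Exchange the halo of a field the kernel reads, only if it is stale:
  if (x_proxy%is_dirty(depth=1)) then
     call x_proxy%halo_exchange(depth=1)
  end if
  ! Loop over cells, out to the halo depth the kernel requires:
  do cell = 1, mesh%get_last_halo_cell(1)
     ! kernel call (arguments elided)
  end do
  ! Mark the written field's halo as out of date:
  call lhs_proxy%set_dirty()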
Dynamo Initial MPI results
[Chart not transcribed]
− L is the patch of columns on each core
− Blue is weak scaling (small problem)
− Red is weak scaling (larger problem)
− From blue to red is strong scaling
Going OpenMP Parallel
Dynamo infrastructure and PSyclone support
PSyclone script:
> python generator.py -oalg alg.f90 -opsy psy.f90 file.f90 -s script.py
No Scientists were harmed
Dynamo Initial OpenMP results
[Chart: performance (/sec) vs. number of physical cores (0 to 64); series: Xeon IvyBridge, Xeon Phi KNC, KNC HT=2, KNC HT=4]
− Dual-socket 12-core IvyBridge
− Hyperthreading gives a small performance boost on the Xeon Phi
− No attempt has been made to improve vectorization
What about performance?
Shallow water model, GOcean FD API: sequential performance
− Comparing with a hand-optimised code
− The PSyKAl-restructured code can perform as well as (sometimes better than) the hand-optimised code
NEMOlite2D model, GOcean FD API: OpenMP and OpenACC performance
How well optimised is the code?
− Plan to analyse kernels to determine a realistic, achievable upper bound on performance
− Analyse kernels via a generated DAG
− Example: matrix-vector loop1
Summary
− PSyclone has domain-specific APIs to support FE in LFRic and FD in GOcean
− The full LFRic code, incorporating the GungHo dynamical core, is running using PSyclone
− From single-source science code we are generating serial, MPI, OpenMP and MPI/OpenMP code
− MPI scaling to O(10k) tasks; see Chris Maynard's talk for O(100k)
− OpenMP scaling to 8 threads on CPU, more on KNC
− Optimisations in progress
− OpenPOWER port in progress
Future Work
− Open source later this year
Currently working toward:
− MPI optimisations
− Multigrid support
− Kernel optimisations
Future:
− Physics integration
− OpenACC
− 3D finite difference / finite volume API
− Explore / search the optimisation space