Firedrake: Burning the Thread at Both Ends
M. Lange, G. J. Gorman (AMCG, Imperial College London), April 13 2016

Transcript of knepley/presentations/SIAMPP2016_2.pdf (14 pages)

Page 1

Firedrake: Burning the Thread at Both Ends

M. Lange, G. J. Gorman
AMCG, Imperial College London
April 13, 2016


Page 2

To Thread or Not To Thread

In order to thread the application . . .

- A while ago, everybody wanted threading:
  - Utilise shared-memory parallelism
  - Avoid MPI communication overhead
  - Improved memory footprint

- And it was supposed to be easy:

#pragma omp parallel for

- Fluidity: a widely used finite element code
  - CFD, ocean modelling, geophysical flows, renewable energies, reservoir modelling, . . .
  - Adaptive anisotropic mesh refinement



Page 3

To Thread or Not To Thread

. . . we need to thread the solver

- PETSc-OMP:
  - An OpenMP-threaded fork of PETSc 3.3
  - Low-level threading on Mat and Vec objects
  - Optimised sparse MatVec
  - Explicit computation-communication overlap
  - Fine-grained load balance based on non-zero weights

PETSc-OMP IS NOT SUPPORTED ANYMORE!

- Was superseded by PETSc-Threadcomm
- Threadcomm already decommissioned


Page 4

Sparse MatVec results on Cray XE6

[Figure: run-time (s) and parallel efficiency (%) of the sparse MatVec on 32–8192 cores of a Cray XE6, comparing vector-based, task-based and task-based NZ-balanced threading against pure MPI and pure MPI (decomposed).]

It’s extremely hard to beat pure MPI!

[1] M. Lange, G. Gorman, M. Weiland, L. Mitchell, and J. Southern. "Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation", in Supercomputing: 28th ISC 2013, Proceedings, pages 97–108. Springer, 2013.


Page 5

Fluidity performance on Cray XE6

[Figure: total run-time for 10 time-steps (left) and run-time of the initial mesh-read I/O (right) on 512–16384 cores of a Cray XE6, comparing pure MPI against the hybrid MPI/OpenMP version.]

[1] X. Guo, M. Lange, G. Gorman, L. Mitchell, and M. Weiland. Developing a scalable hybrid MPI/OpenMP unstructured finite element model. Computers & Fluids, 110(0):227–234, 2015. ParCFD 2013.


Page 6

Fluidity performance on Cray XE6

Hybrid MPI-OpenMP looks faster at scale, but . . .

- Huge gains due to initial mesh I/O
  - Fluidity does off-line mesh decomposition
  - Partitioning and halo are read from file
  - Using threads we need 8x fewer partitions

- Sparse MatVec beats pure MPI
  - Only in the strong-scaling limit with little local work
  - Need threading to enforce asynchronous communication
  - Improvement due to better load balance, not MPI overheads!

No actual gain from threading!
- We just ameliorated some other underlying problem


Page 7

Threading: Should we even care?

Threading is never the whole story . . .

- What is my application really limited by?
  - Different tasks can have different limitations (flops vs. bandwidth)
  - Profiling (roofline plots, analysis tools) must guide optimisation!
- Can we do better algorithmically?
  - Am I using the right numerical scheme?
  - Can I use better solvers?
- What about data-intensive tasks?
  - Is my communication model appropriate?
  - Am I doing I/O right? Are there better file formats?

. . . but threading looks so much easier!
- Changing any of the above is invasive
- Fundamental changes are impractical in monolithic codes


Page 8

Firedrake - A finite element framework

Automated symbolic computation [1]

- Re-envisioned FEniCS/DOLFIN [2]

\[
\phi^{n+1/2} = \phi^{n} - \frac{\Delta t}{2}\, p^{n}
\]

\[
p^{n+1} = p^{n} + \Delta t \,\frac{\int_{\Omega} \nabla \phi^{n+1/2} \cdot \nabla v \,\mathrm{d}x}{\int_{\Omega} v \,\mathrm{d}x} \qquad \forall v \in V
\]

\[
\phi^{n+1} = \phi^{n+1/2} - \frac{\Delta t}{2}\, p^{n+1}
\]

where \(\nabla \phi \cdot n = 0\) on \(\Gamma_N\) and \(p = \sin(10\pi t)\) on \(\Gamma_D\).

from firedrake import *

mesh = Mesh("wave_tank.msh")
V = FunctionSpace(mesh, 'Lagrange', 1)
p = Function(V, name="p")
phi = Function(V, name="phi")
u = TrialFunction(V)
v = TestFunction(V)

p_in = Constant(0.0)
bc = DirichletBC(V, p_in, 1)          # forcing boundary Gamma_D (surface id 1)

T = 10.
dt = 0.001
t = 0
while t <= T:
    p_in.assign(sin(2*pi*5*t))        # p = sin(10*pi*t) on Gamma_D
    phi -= dt / 2 * p                 # half-step for phi
    p += assemble(dt * inner(grad(v), grad(phi))*dx) \
        / assemble(v*dx)              # full step for p (lumped mass)
    bc.apply(p)
    phi -= dt / 2 * p                 # second half-step for phi
    t += dt

[1] F. Rathgeber, D. Ham, L. Mitchell, M. Lange, F. Luporini, A. McRae, G. Bercea, G. Markall, and P. Kelly. Firedrake: Automating the finite element method by composing abstractions. Submitted to ACM TOMS, 2015.

[2] A. Logg, K.-A. Mardal, and G. Wells. Automated Solution of Differential Equations by the Finite Element Method. Springer, 2012.


Page 9

Firedrake - A finite element framework

Automated symbolic computation [1]

- Implements UFL [2], a finite element DSL embedded in Python
- Run-time C code generation
- PyOP2: assembly-kernel execution framework

Separation of concerns

- Expert for each layer
- Use third-party packages
- “Write as little code as possible” (see the sketch below)
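To make “write as little code as possible” concrete, here is a minimal sketch of a complete Poisson solve in Firedrake. It is not taken from the slides; the mesh resolution, polynomial degree, right-hand side and boundary ids are illustrative choices.

from firedrake import *

# Symbolic statement of -div(grad(u)) = f on the unit square, in UFL.
mesh = UnitSquareMesh(32, 32)               # illustrative resolution
V = FunctionSpace(mesh, "Lagrange", 1)
u = TrialFunction(V)
v = TestFunction(V)
f = Constant(1.0)                           # illustrative right-hand side
a = inner(grad(u), grad(v)) * dx
L = f * v * dx
bc = DirichletBC(V, 0.0, (1, 2, 3, 4))      # homogeneous BC on all four sides

# Firedrake generates and executes the assembly kernels, then hands the
# assembled system to PETSc for the linear solve.
uh = Function(V, name="u")
solve(a == L, uh, bcs=bc)

The entire mathematical content of the problem lives in the lines defining a and L; everything below them is generated and scheduled automatically by the layers described in the diagram.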

[Diagram: the Firedrake/PyOP2 tool chain. A FEM problem (weak-form PDE) is written in the Unified Form Language (the Firedrake/FEniCS language layer: geometry, (non)linear solves, assembly, compiled expressions); TSFC, using FIAT, generates local assembly kernels (ASTs), which the COFFEE AST optimiser transforms; the PyOP2 interface executes the kernels as parallel loops over mesh data structures (Set, Map, Dat), with parallel scheduling and code generation targeting CPU (OpenMP/OpenCL), GPU (PyCUDA/PyOpenCL) and future architectures; meshes, matrices, vectors and solvers come from PETSc4py (KSP, SNES, DMPlex) and MPI. Expert for each layer: domain specialists write the mathematical model (FEM, unstructured grid), numerical analysts own the generation of FEM kernels, and parallel programming experts own hardware architectures and optimisation.]

[1] F. Rathgeber, D. Ham, L. Mitchell, M. Lange, F. Luporini, A. McRae, G. Bercea, G. Markall, and P. Kelly. Firedrake: Automating the finite element method by composing abstractions. Submitted to ACM TOMS, 2015.

[2] M. Alnæs, A. Logg, K. Ølgaard, M. Rognes, and G. Wells. Unified Form Language: A domain-specific language for weak formulations of partial differential equations. ACM Transactions on Mathematical Software (TOMS), 40(2):9, 2014.


Page 10

Firedrake - A finite element framework

End-to-end optimisation

- Exploration of numerical schemes
- Automated parallelisation
- Data layout optimisations
- Automated kernel optimisation

Parallelisation model

- Mostly MPI on CPUs
- We have threads, but no gains
- Extendable to MPI+X, or just X
  - ... for some unknown X
- Model definition doesn’t change!
- Can even adjust numerics if needed (see the sketch after this list)
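As a hedged illustration of “adjust numerics if needed” (not code from the slides), the Poisson sketch from Page 9 can be re-solved under different PETSc options at run time without touching the UFL. The particular ksp/pc choices below are only examples; the hypre preconditioner assumes PETSc was built with hypre support.

# Continuing the Poisson sketch from Page 9 (same a, L, uh and bc);
# only the solver configuration changes, the model definition does not.
solve(a == L, uh, bcs=bc,
      solver_parameters={"ksp_type": "cg",      # PETSc Krylov method
                         "pc_type": "hypre"})   # assumes PETSc built with hypre

# The same problem with a direct solve, again without touching the model:
solve(a == L, uh, bcs=bc,
      solver_parameters={"ksp_type": "preonly",
                         "pc_type": "lu"})

The same unchanged script is also what runs in parallel: launching it with, e.g., mpiexec -n 48 python poisson.py (the script name is a placeholder) distributes the mesh over MPI ranks with no change to the model definition.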

[Diagram: the Firedrake/PyOP2 tool chain, as on Page 9.]


Page 11

Case study: Seigen

Seismology through code generation [1]

- Seismic model using the elastic wave equation
- Implemented purely on top of Firedrake (UFL)
- Explore end-to-end optimisation through symbolic computation

As used in energy exploration

- Full Waveform Inversion (FWI)
- Traditionally finite difference (FD)
- Explore use of unstructured meshes

[1] C. T. Jacobs, M. Lange, F. Luporini, and G. J. Gorman. Application of code generation to high-order seismic modelling with the discontinuous Galerkin finite element method. In preparation.


Page 12

Case study: Seigen

Seismology through code generation [1]

- Discontinuous Galerkin finite elements (DG-FEM) with implicit and explicit solves
- 4th-order time-stepping and up to 4th-order spatial discretisation (see the sketch below)
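For orientation, here is a hedged sketch of how discontinuous Galerkin spaces of the orders quoted above are declared in Firedrake. It is not Seigen's actual code (which the slides do not show); the mesh and the field names U and S are illustrative.

from firedrake import *

mesh = UnitSquareMesh(64, 64)                 # illustrative mesh
# Discontinuous Lagrange ("DG") spaces; degree 4 matches the highest
# spatial order on the slide.  Vector- and tensor-valued spaces would
# hold, e.g., the velocity and stress fields of the elastic model.
U = VectorFunctionSpace(mesh, "DG", 4)
S = TensorFunctionSpace(mesh, "DG", 4)
u = TrialFunction(U)
v = TestFunction(U)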

[Figure: error-cost comparison of the spatial discretisations P1-DG to P4-DG: wall time (s) against stress error in the L2 norm for mesh spacings dx = 0.008–0.25; wall time for P1-DG and P4-DG with raw versus optimised kernels; and a roofline plot of achieved performance (GFlops/s) against operational intensity (FLOPs/Byte) for P1-DG and P4-DG, raw and optimised.]

[1] C. T. Jacobs, M. Lange, F. Luporini, and G. J. Gorman. Application of code generation to high-order seismic modelling with the discontinuous Galerkin finite element method. In preparation.


Page 13

Conclusion

Threading: Yes, no, maybe . . .

- Performance optimisation is usually more complicated than #pragma omp parallel for

What matters is end-to-end optimisation

- Consider model, numerics, data optimisation and compiler tricks
- The optimisation needs to fit the parallelisation, which in turn needs to fit the hardware!

Separation of concerns through abstraction layering

- Enables end-to-end optimisation
- Allows expertise from all relevant fields
- Requires run-time decisions [1]

[1] J. Brown, M. Knepley, and B. Smith. Run-time extensibility and librarization of simulation software. IEEE Computing in Science and Engineering, 2015.


Page 14

Thank You

Don’t miss:

- Poster session: Seigen: Seismic modelling through code generation
- Friday, 4.50pm, F. Luporini: Generating High Performance Finite Element Kernels Using Optimality Criteria

www.firedrakeproject.org http://www.opesci.org
