Firedrake: Burning the Thread at Both Ends
M. Lange, G. J. Gorman
AMCG, Imperial College London
April 13, 2016
To Thread or Not To Thread
In order to thread the application . . .
- A while ago, everybody wanted threading:
  - Utilise shared-memory parallelism
  - Avoid MPI communication overhead
  - Improved memory footprint
- And it was supposed to be easy:
#pragma omp parallel for
- Fluidity: a widely used finite element code:
  - CFD, ocean modelling, geophysical flows, renewable energies, reservoir modelling, ...
  - Adaptive anisotropic mesh refinement
To Thread or Not To Thread
. . . we need to thread the solver
- PETSc-OMP:
  - An OpenMP-threaded fork of PETSc 3.3
  - Low-level threading on Mat and Vec objects
  - Optimised sparse MatVec
  - Explicit computation-communication overlap
  - Fine-grained load balance based on non-zero weights
PETSc-OMP IS NOT SUPPORTED ANYMORE!
- Was superseded by PETSc-Threadcomm
- Threadcomm already decommissioned
Sparse MatVec results on Cray XE6
[Figures: runtime (s) and parallel efficiency (%) vs. number of cores (32-8192) on the Cray XE6, comparing vector-based, task-based and NZ-balanced task-based threading against pure MPI and decomposed pure MPI.]
It’s extremely hard to beat pure MPI!
1 M. Lange, G. Gorman, M. Weiland, L. Mitchell, and J. Southern. "Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation". In: Supercomputing: 28th ISC 2013, Proceedings, pages 97–108. Springer, 2013.
Fluidity performance on Cray XE6
Total run-time for 10 time-steps
[Figure: total runtime (s) vs. number of cores (512-16384), MPI vs. Hybrid.]
I/O for initial mesh read
[Figure: I/O runtime (s) vs. number of cores (512-16384), MPI vs. Hybrid.]
1 X. Guo, M. Lange, G. Gorman, L. Mitchell, and M. Weiland. Developing a scalable hybrid MPI/OpenMP unstructured finite element model. Computers & Fluids, 110:227–234, 2015. ParCFD 2013.
Fluidity performance on Cray XE6
Hybrid MPI-OpenMP looks faster at scale, but . . .
- Huge gains due to initial mesh I/O
  - Fluidity does off-line mesh decomposition
  - Partitioning and halos are read from file
  - With threads we need 8x fewer partitions
- Sparse MatVec beats pure MPI
  - Only in the strong-scaling limit with little local work
  - Need threading to enforce asynchronous communication
  - Improvement is due to better load balance, not MPI overheads!
No actual gain from threading!
- We just ameliorated some other underlying problem
Threading: Should we even care?
Threading is never the whole story ...
- What is my application really limited by?
  - Different tasks can have different limitations (flops vs. bandwidth)
  - Profiling (roofline plots, analysis tools) must guide optimisation!
- Can we do better algorithmically?
  - Am I using the right numerical scheme?
  - Can I use better solvers? (see the sketch below)
- What about data-intensive tasks?
  - Is my communication model appropriate?
  - Am I doing I/O right? Are there better file formats?
... but threading looks so much easier!
- Changing any of the above is invasive
- Fundamental changes are impractical in monolithic codes
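As a hedged illustration of the "better solvers" point above (not from the talk; the 1D Laplacian matrix is a stand-in): with petsc4py the Krylov method and preconditioner are run-time choices, so trying a better solver requires no change to the application code.

# Minimal petsc4py sketch; the matrix below is a placeholder problem.
from petsc4py import PETSc

n = 100
A = PETSc.Mat().createAIJ([n, n], nnz=3)
for i in range(n):
    A.setValue(i, i, 2.0)
    if i > 0:
        A.setValue(i, i - 1, -1.0)
    if i < n - 1:
        A.setValue(i, i + 1, -1.0)
A.assemble()

b = A.createVecLeft()
b.set(1.0)
x = A.createVecRight()

ksp = PETSc.KSP().create()
ksp.setOperators(A)
ksp.setType("cg")            # swap the Krylov method here ...
ksp.getPC().setType("gamg")  # ... or the preconditioner
ksp.setFromOptions()         # or override either from the command line
ksp.solve(b, x)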
Firedrake - A finite element framework
Automated symbolic computation1
- Re-envisioned FEniCS/DOLFIN2
$$\phi^{n+1/2} = \phi^{n} - \frac{\Delta t}{2}\, p^{n}$$

$$p^{n+1} = p^{n} + \Delta t\, \frac{\int_{\Omega} \nabla\phi^{n+1/2} \cdot \nabla v \,\mathrm{d}x}{\int_{\Omega} v \,\mathrm{d}x} \qquad \forall v \in V$$

$$\phi^{n+1} = \phi^{n+1/2} - \frac{\Delta t}{2}\, p^{n+1}$$

where $\nabla\phi \cdot n = 0$ on $\Gamma_N$ and $p = \sin(10\pi t)$ on $\Gamma_D$.
from firedrake import *
mesh = Mesh("wave_tank.msh")
V = FunctionSpace(mesh, 'Lagrange', 1)
p = Function(V, name="p")
phi = Function(V, name="phi")
u = TrialFunction(V)
v = TestFunction(V)
p_in = Constant(0.0)
bc = DirichletBC(V, p_in, 1)
T = 10.
dt = 0.001
t = 0
while t <= T:
    p_in.assign(sin(2*pi*5*t))
    phi -= dt / 2 * p
    p += assemble(dt * inner(grad(v), grad(phi))*dx) \
        / assemble(v*dx)
    bc.apply(p)
    phi -= dt / 2 * p
    t += dt
1 F. Rathgeber, D. Ham, L. Mitchell, M. Lange, F. Luporini, A. McRae, G. Bercea, G. Markall, and P. Kelly. Firedrake: Automating the finite element method by composing abstractions. Submitted to ACM TOMS, 2015.
2 A. Logg, K.-A. Mardal, and G. Wells. Automated Solution of Differential Equations by the Finite Element Method. Springer, 2012.
Firedrake - A finite element framework
Automated symbolic computation1
- Implements UFL2, a finite element DSL embedded in Python
- Run-time C code generation (see the sketch below)
- PyOP2: assembly kernel execution framework
Separation of concerns
- Expert for each layer
- Use third-party packages
- "Write as little code as possible"
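A minimal usage sketch of what run-time code generation means for the user (an assumed example, not taken from the slides; the mesh and function space are placeholders): only the symbolic UFL form is written, and the C assembly kernel is generated, compiled and executed behind the single assemble() call.

from firedrake import *

# Placeholder problem: assemble a stiffness matrix from a symbolic form.
mesh = UnitSquareMesh(16, 16)
V = FunctionSpace(mesh, "Lagrange", 2)
u = TrialFunction(V)
v = TestFunction(V)

a = inner(grad(u), grad(v)) * dx  # symbolic weak form (UFL)
A = assemble(a)                   # kernel generated, compiled and run here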
[Diagram: the Firedrake/PyOP2 toolchain. A FEM problem (weak-form PDE) is expressed in the Firedrake/FEniCS language (Unified Form Language); TSFC and FIAT generate local assembly kernels (AST), which the COFFEE AST optimiser tunes; the PyOP2 interface handles parallel scheduling and code generation, executing kernels as parallel loops over mesh data structures (Set, Map, Dat) in explicitly parallel, hardware-specific implementations for CPU (OpenMP/OpenCL), GPU (PyCUDA/PyOpenCL) or future architectures; meshes, matrices, vectors, geometry and (non)linear solves go through PETSc4py (KSP, SNES, DMPlex) and MPI. Expert for each layer: domain specialist (mathematical model using FEM / on the unstructured grid), numerical analyst (generation of FEM kernels), parallel programming expert (hardware architectures, optimisation).]
1 F. Rathgeber, D. Ham, L. Mitchell, M. Lange, F. Luporini, A. McRae, G. Bercea, G. Markall, and P. Kelly. Firedrake: Automating the finite element method by composing abstractions. Submitted to ACM TOMS, 2015.
2 M. Alnæs, A. Logg, K. Ølgaard, M. Rognes, and G. Wells. Unified Form Language: A domain-specific language for weak formulations of partial differential equations. ACM Transactions on Mathematical Software (TOMS), 40(2):9, 2014.
Firedrake - A finite element framework
End-to-end optimisation
- Exploration of numerical schemes
- Automated parallelisation
- Data layout optimisations
- Automated kernel optimisation
Parallelisation model
- Mostly MPI on CPUs
- We have threads, but no gains
- Extendable to MPI+X, or just X
  - for some unknown X
- Model definition doesn't change!
- Can even adjust numerics if needed (see the sketch below)
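A small sketch of the "model definition doesn't change" point (an assumed example, not from the slides; the Helmholtz-type problem is a placeholder): the same symbolic problem can be handed to different numerics purely through run-time solver options passed down to PETSc.

from firedrake import *

mesh = UnitSquareMesh(32, 32)           # placeholder problem
V = FunctionSpace(mesh, "Lagrange", 1)
u = TrialFunction(V)
v = TestFunction(V)
f = Constant(1.0)
a = (inner(grad(u), grad(v)) + u * v) * dx
L = f * v * dx
uh = Function(V)

# Direct solve ...
solve(a == L, uh, solver_parameters={"ksp_type": "preonly", "pc_type": "lu"})
# ... or an iterative, scalable setup: the model definition is untouched.
solve(a == L, uh, solver_parameters={"ksp_type": "cg", "pc_type": "gamg"})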
[Diagram: the Firedrake/PyOP2 toolchain, as on the previous slide.]
Case study: Seigen
Seismology through code generation1
- Seismic model using the elastic wave equation (standard form sketched below)
- Implemented purely on top of Firedrake (UFL)
- Explore end-to-end optimisation through symbolic computation
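The slide does not spell out the governing equations; for orientation, a standard isotropic velocity-stress form of the elastic wave equation (an assumption here, not necessarily Seigen's exact formulation) is:

$$\rho \frac{\partial v}{\partial t} = \nabla \cdot \sigma + f, \qquad
\frac{\partial \sigma}{\partial t} = \lambda (\nabla \cdot v)\, I + \mu \left(\nabla v + (\nabla v)^{\mathsf{T}}\right),$$

with velocity $v$, stress $\sigma$, density $\rho$ and Lamé parameters $\lambda$, $\mu$.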
As used in energy exploration
- Full Waveform Inversion (FWI)
- Traditionally finite difference (FD)
- Explore use of unstructured meshes
1 C. T. Jacobs, M. Lange, F. Luporini, and G. J. Gorman. Application of code generation to high-order seismic modelling with the discontinuous Galerkin finite element method. Under preparation.
Case study: Seigen
Seismology through code generation1
- Discontinuous Galerkin finite elements (DG-FEM) with implicit and explicit solves
- 4th-order time-stepping and up to 4th-order spatial discretisation
[Figures: (left) error-cost comparison of spatial discretisations P1-DG to P4-DG, wall time (s) vs. stress error in the L2 norm for mesh spacings dx = 0.250 down to 0.008; (centre) wall time for P1-DG and P4-DG, raw vs. optimised kernels; (right) roofline plot of performance (GFlops/s) vs. operational intensity (FLOPs/Byte) for P1-DG and P4-DG, raw and optimised.]
1 C. T. Jacobs, M. Lange, F. Luporini, and G. J. Gorman. Application of code generation to high-order seismic modelling with the discontinuous Galerkin finite element method. Under preparation.
Conclusion
Threading: Yes, no, maybe . . .
- Performance optimisation is usually more complicated than #pragma omp parallel for
What matters is end-to-end optimisation
- Consider model, numerics, data optimisation and compiler tricks
- Optimisation needs to fit the parallelisation, which needs to fit the hardware!
Separation of concerns through abstraction layering
- Enables end-to-end optimisation
- Allows expertise from all relevant fields
- Requires run-time decisions1
1 J. Brown, M. Knepley, and B. Smith. Run-time extensibility and librarization of simulation software. IEEE Computing in Science and Engineering, 2015.
Thank You
Don’t miss:
- Poster session: Seigen - Seismic modelling through code generation
- Friday, 4.50pm: F. Luporini, Generating High Performance Finite Element Kernels Using Optimality Criteria
www.firedrakeproject.org http://www.opesci.org