Page 1

Task-based Parallelization of the Fast Multipole Method on NVIDIA GPUs and Multicore Processors

Emmanuel Agullo,a Bérenger Bramas,a Olivier Coulaud,a Eric Darve,b Matthias Messner,a Toru Takahashic

a INRIA Bordeaux – Sud-Ouest, France
b Institute for Computational and Mathematical Engineering, Stanford University
c Department of Mechanical Science and Engineering, Nagoya University

March 21, 2013


Page 2

Outline

1 Fast multipole method

2 Task-based parallelization

3 Numerical benchmarks on multicore and NVIDIA GPUs


Page 3

Fast Multipole Method

Applications:

Gravitational simulations, molecular dynamics, electrostatic forces

Integral equations for the Laplace, Poisson, Stokes, and Navier equations, in both fluid and solid mechanics

Interpolation using radial basis functions: meshless methods, graphics (data smoothing)

Imaging and inverse problems: subsurface characterization and monitoring, imaging

O(N) fast linear solvers: LU factorization, matrix inverse, for dense and sparse matrices, "FastLAPACK"


Page 4

Matrix-vector products

FMM: a method to calculate

φ(x_i) = ∑_{j=1}^{N} K(x_i, x_j) σ_j,   1 ≤ i ≤ N

in O(N) or O(N ln^α N) arithmetic operations.

There are many different FMM formulations.

Most of them rely on low-rank approximations of the kernel function K in the form:

K(x, y) = ∑_{q=1}^{p} u_q(x) ∑_{s=1}^{p} T_{qs} v_s(y) + ε
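
Substituting this low-rank form into the sum above, for a pair of well-separated clusters, shows where the speedup comes from (a standard observation, added here for clarity):

φ(x_i) ≈ ∑_{q=1}^{p} u_q(x_i) ∑_{s=1}^{p} T_{qs} ( ∑_{j} v_s(x_j) σ_j )

where j runs over the points of the source cluster. The innermost sums (the "multipole" coefficients) are computed once per s, the p × p translation by T costs O(p^2), and the outer evaluation costs O(p) per target point, so an interaction between two clusters of m points costs O(mp + p^2) instead of O(m^2).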


Page 5

Data structures and well-separated clusters

Fast multipole method on parallel systems

Figure 1: Two spheres containing r_i and r_j, illustrating the restricted domain of validity of (1).

Figure 2: 2D illustration: (a) A square is subdivided into four quadrants; approximation (1) (well-separated condition) does not apply to any pair of quadrants. (b) Each quadrant from Figure 2(a) is further subdivided into 4 quadrants. The interaction between particles in the two gray clusters can now be computed using the low-rank approximation (1). The interaction between the gray cluster and its neighboring clusters (with the diagonal pattern) cannot be computed at this level of the tree. A further decomposition of each quadrant is required.

Instead each octant is further decomposed into octants as shown in Figure 2(b). This allows computing all interactions between pairs of clusters that are not nearest neighbors, e.g., the pair of gray clusters in Figure 2(b). Interactions between particles in neighboring clusters are computed by further subdividing each octant into smaller octants. This process is recursively repeated until the clusters are small enough that interactions between particles in neighboring clusters can be computed directly, that is by direct evaluation of ∑_j K(r_i, r_j) σ_j.

In a typical situation, a cluster is required to interact only with clusters that are neighbors (i.e., they share a vertex, edge, or face) of their parent cluster. In addition, since they cannot interact with clusters that are their own neighbors, each cluster is therefore only required to interact with a relatively small number of clusters, at most 189 (= 6^3 − 3^3) in 3D. This is shown in Figure 3.

Figure 3: 2D illustration: basic interaction list. In a typical situation, the gray cluster near the center has to interact with all the light gray clusters around it. In 3D there are 189 of them.


Figure 4: Illustration of the three passes in the FMM: the upward pass propagates multipole coefficients M^C_s toward the tree root via M2M, the M2L operator transfers them to local coefficients L^{C'}_q, and the downward pass propagates local coefficients L^C_q via L2L.

In most FMMs, the expansion (1) is not applied directly to each independent level of the tree. Indeed, computing M^C_s — the multipole coefficients for cluster C — for each cluster directly using the particle positions, r_i and r_j, as in (1), the total cost would be O(N ln N) since there are N particles and O(ln N) levels in the tree. Similarly, if we compute the potential φ(r_i) directly using each level's L^C_q, the total cost would also be O(N ln N). Instead, in the FMM, one evaluates u^C_q and v^C_s only for leaf clusters C and determines operators to propagate this information up and down the tree. This is illustrated in Figure 4. We start by computing M^C_s for all leaf clusters C. Then M^{C'}_s is computed for all clusters C' in the tree by progressively propagating the information upward using Multipole-to-Multipole operators. Partial local coefficients L^C_q are computed for all clusters by applying the Multipole-to-Local operator T^{CC'}_{qs} to M^{C'}_s. Finally, this information is propagated down the tree using Local-to-Local operators, until the final set of local coefficients L^C_q are obtained for all leaf clusters C. The potential φ(r_i) can finally be estimated after adding near neighbor interactions that are computed directly.

Mathematical analysis is required to determine the exact definition of u_q and v_s and the nature of the Multipole-to-Multipole (M2M), Multipole-to-Local (M2L), and Local-to-Local (L2L) operators. However, in most cases, these operators are dense matrices which need to multiply the vectors of coefficients M^C_s and L^C_q [6]. See Figure 4.

An additional difficulty in the method arises when the particles r_j are distributed inhomogeneously in space. In that case a uniform octree cannot be used and instead an adaptive octree is constructed, in which the number of levels varies spatially. More levels are used in places where the concentration of particles is greater. See Figure 5. This allows maintaining the optimal O(N) computational cost. A few additional operators are then required. See for example Lashuk et al. [10].

2. Parallelization on distributed clusters and multicore processors

The parallelization of the calculation has effectively two components: the construction of the appropriate distributed tree structure, and the application of the FMM itself.

Many difficulties arise when using adaptive trees [4]. The construction of the tree structure is fairly straightforward when the number of nodes is not too large, but special techniques must be applied beyond that. Many implementations are based on the scheme of Warren and Salmon [20]. The basic difficulty is locating, in memory, a given cluster on another processor. Warren [20] proposed to associate each cluster with a key. The mapping of this key to memory locations is then achieved via a hash table. This was shown to be an efficient approach.

The construction of the tree itself [10] follows two steps. In step 1, a distributed array with all the leaves


E. Darve, C. Cecka, and T. Takahashi. The fast multipole method on parallel clusters, multicore processors, and graphics processing units. Comptes Rendus Mécanique, 339(2–3):185–193, 2011.


Page 6

Interpolatory FMM

Many formulas are possible to obtain a low-rank approximation.

We use an approach based on interpolation formulas (Chebyshev polynomials) for smooth functions; see the formula sketched after the list below. This allows building an FMM that is:

Very general: any smooth (even oscillatory) kernel can be used.

Simple to apply: we only need a routine that evaluates K.
With SVD acceleration, the approximation has optimal rank.

In some cases, it may be less efficient than specialized methods for K(x, y) = 1/|x − y|.
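
As a sketch of the interpolation idea (following the black-box FMM cited below; the notation S_ℓ for the Chebyshev interpolation polynomial of order ℓ and the nodes x̄_q, ȳ_s are introduced here only for illustration), the kernel is interpolated at Chebyshev nodes placed in the source and target clusters:

K(x, y) ≈ ∑_{q=1}^{p} S_ℓ(x̄_q, x) ∑_{s=1}^{p} K(x̄_q, ȳ_s) S_ℓ(ȳ_s, y)

This is exactly the separable form shown earlier, with u_q(x) = S_ℓ(x̄_q, x), v_s(y) = S_ℓ(ȳ_s, y) and T_{qs} = K(x̄_q, ȳ_s): only point evaluations of K are required, and an SVD of T can then compress the rank further.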

W. Fong, E. Darve, The black-box fast multipole method, J. Comput. Phys., 228(23):8712–8725, 2009.


Page 7

Convergence of Chebyshev interpolation with SVD acceleration

[Two plots]
Left: error vs. SVD cutoff for K = 1/r on a 2D prolate mesh; curves for interpolation orders n = 3, …, 10, with errors reaching down to about 10^-15.
Right: relative error ε_{L^2} vs. target accuracy ε_ACA (10^-2 down to 10^-12) for K = e^{ikr}/r; curves for tree levels l = 4, …, 13.

M. Messner, M. Schanz, and E. Darve. Fast directional multilevel summation for oscillatory kernels based on Chebyshev interpolation. J. Comput. Phys., 2011.


Page 8

Task-based parallelization

Task                 Concurrency   Load-balancing
M2L (lower level)    High          Homogeneous
M2L (higher level)   Low           Homogeneous
P2P                  High          Heterogeneous
M2M/L2L              Low           Homogeneous


Page 9

Multicore parallelization with OpenMP: fork-join

function FMM(tree)
    // Near-field
    P2P(tree.levels[tree.height-1]);
    // Far-field
    P2M(tree.levels[tree.height-1]);
    forall the level l from tree.height-2 to 2 do
        M2M(tree.levels[l]);
    forall the level l from 2 to tree.height-2 do
        M2L(tree.levels[l]);
        L2L(tree.levels[l]);
    M2L(tree.levels[tree.height-1]);
    L2P(tree.levels[tree.height-1]);


Page 10

Parallel for loops

The loop over the cells at a given level is parallelized:

function M2L(level)
    #pragma omp parallel for
    foreach cell cl in level.cells do
        kernel.m2l(cl.local, cl.far_field.multipole);
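
A minimal C++/OpenMP sketch of this pattern is given below; the Cell and Kernel types, their member names, and the trivial m2l body are illustrative stand-ins, not the presentation's actual data structures.

#include <cstddef>
#include <vector>

// Illustrative data layout: each cell stores its local expansion and pointers
// to the multipole expansions of the cells in its interaction (far-field) list.
struct Cell {
    std::vector<double> local;
    std::vector<const std::vector<double>*> farField;
};

struct Kernel {
    // Accumulate one source multipole into a local expansion
    // (placeholder for the dense M2L operator).
    void m2l(std::vector<double>& local, const std::vector<double>& multipole) const {
        for (std::size_t i = 0; i < local.size() && i < multipole.size(); ++i)
            local[i] += multipole[i];
    }
};

void M2L(std::vector<Cell>& levelCells, const Kernel& kernel) {
    // Cells of one level are independent targets, so the loop parallelizes directly.
    #pragma omp parallel for
    for (long c = 0; c < static_cast<long>(levelCells.size()); ++c)
        for (const std::vector<double>* multipole : levelCells[c].farField)
            kernel.m2l(levelCells[c].local, *multipole);
}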


Page 11

Improving the scalability

The top of the tree represents a parallel bottleneck.

P2P tasks can be highly heterogeneous.

Solution:

Mix heavily parallel sections with more sequential ones.
Try to execute large tasks first and finish with small tasks.
On heterogeneous platforms, map tasks to appropriate cores.


Page 12

Interleaving the far field and near field calculations

#pragma omp parallel
#pragma omp sections
    #pragma omp section
        P2Ptask(tree.levels[tree.height-1])
    #pragma omp section
        P2Mtask(tree.levels[tree.height-1])
        forall the level l from tree.height-2 to 2 do
            M2Mtask(tree.levels[l])
        forall the level l from 2 to tree.height-2 do
            M2Ltask(tree.levels[l])
            L2Ltask(tree.levels[l])
        M2Ltask(tree.levels[tree.height-1])
#pragma omp single
    L2Ptask(tree.levels[tree.height-1])
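
For reference, a self-contained C++ translation of this interleaving is sketched below. The Tree/Level types and the no-op *task functions are placeholders, and each task routine is assumed to contain its own parallel loops over cells (which requires nested parallelism to be enabled), so this shows the structure only, not the actual implementation.

#include <vector>

struct Level { /* cells of one octree level */ };
struct Tree  { int height; std::vector<Level> levels; };

// No-op stand-ins for the real operators; each is assumed to run its own
// parallel loops over the cells of the level it receives.
inline void P2Ptask(Level&) {}
inline void P2Mtask(Level&) {}
inline void M2Mtask(Level&) {}
inline void M2Ltask(Level&) {}
inline void L2Ltask(Level&) {}
inline void L2Ptask(Level&) {}

void fmmInterleaved(Tree& tree) {
    const int h = tree.height;
    #pragma omp parallel sections
    {
        #pragma omp section            // near field ...
        P2Ptask(tree.levels[h - 1]);

        #pragma omp section            // ... overlapped with the whole far-field sweep
        {
            P2Mtask(tree.levels[h - 1]);
            for (int l = h - 2; l >= 2; --l)
                M2Mtask(tree.levels[l]);
            for (int l = 2; l <= h - 2; ++l) {
                M2Ltask(tree.levels[l]);
                L2Ltask(tree.levels[l]);
            }
            M2Ltask(tree.levels[h - 1]);
        }
    }                                  // implicit barrier: both sections have finished
    L2Ptask(tree.levels[h - 1]);       // evaluate the local expansions at the particles
}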


Page 13

StarPU

StarPU is a parallel runtime system for heterogeneous platforms (multicore, GPU)

API allows expressing the DAG (directed acyclic graph) of tasks

Runtime scheduling uses a variety of schedulers, including user-defined schedulers

Abstracts the processor architecture

E. Agullo, B. Bramas, O. Coulaud, E. Darve, M. Messner, T. Takahashi. "Pipelining the Fast Multipole Method over a Runtime System." arXiv preprint arXiv:1206.0115, 2012.


Page 14

API (Application Programming Interface)

Example of a task insertion using StarPU:

insertTask( codelet, ACCESS_MODE, handle, ... )

codelet: structure containing function pointers

ACCESS_MODE ← { READ, WRITE, READ/WRITE}

handle: StarPU data interface

ACCESS_MODE and order of task insertion implicitly define the DAG.

forall the level l from 2 to tree.height-2 do
    foreach block bl in tree.levels[l].blocks do
        insertTask( kernel.m2l, READ, bl.Ilist, WRITE, bl.local );
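
For concreteness, here is a minimal, self-contained C++ sketch of what such an insertion looks like with the actual StarPU C API; the codelet name m2l_cpu, the use of plain vectors for the multipole and local data, and the trivial kernel body are illustrative assumptions, not the presentation's real code.

#include <starpu.h>
#include <cstdint>
#include <vector>

// CPU implementation of the codelet. buffers[] follow the order of the handles
// given at insertion time: 0 = multipoles (READ), 1 = local expansion (WRITE).
static void m2l_cpu(void* buffers[], void* /*cl_arg*/)
{
    const double* multipole = (const double*)STARPU_VECTOR_GET_PTR(buffers[0]);
    double*       local     = (double*)STARPU_VECTOR_GET_PTR(buffers[1]);
    const unsigned n        = STARPU_VECTOR_GET_NX(buffers[1]);
    for (unsigned i = 0; i < n; ++i)
        local[i] += multipole[i];          // stand-in for the dense M2L operator
}

int main()
{
    if (starpu_init(NULL) != 0) return 1;

    // Codelet: function pointer(s) plus the number and access modes of its data.
    static struct starpu_codelet cl;
    cl.cpu_funcs[0] = m2l_cpu;
    cl.nbuffers     = 2;
    cl.modes[0]     = STARPU_R;            // interaction-list multipoles
    cl.modes[1]     = STARPU_W;            // local expansion of the target block

    std::vector<double> multipoles(64, 1.0), locals(64, 0.0);
    starpu_data_handle_t h_mult, h_loc;
    starpu_vector_data_register(&h_mult, 0 /* home node: main memory */,
                                (uintptr_t)multipoles.data(), multipoles.size(), sizeof(double));
    starpu_vector_data_register(&h_loc, 0 /* home node: main memory */,
                                (uintptr_t)locals.data(), locals.size(), sizeof(double));

    // Access modes and insertion order implicitly build the DAG: any later task
    // reading h_loc would depend on this one.
    starpu_insert_task(&cl, STARPU_R, h_mult, STARPU_W, h_loc, 0);

    starpu_task_wait_for_all();
    starpu_data_unregister(h_mult);
    starpu_data_unregister(h_loc);
    starpu_shutdown();
    return 0;
}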


Page 15

Priority scheduler

[Diagram] The user inserts new tasks into the StarPU core; the core releases dependencies and pushes ready tasks to the MultiPrio scheduler, from which the CPU and GPU workers (PUs) pop tasks.

Task legend: ? = unknown status, X = not ready, R = ready, P = currently processed, F = finished.

Priority

P2M > M2M > L2L(l) > P2P (large) > M2L(l) > P2P (small) > L2P


Page 16

Heterogeneous priority scheduler

Operators       Homogeneous   Heterogeneous
                CPU           CPU    GPU
P2M             0             0      -
M2M             1             1      -
L2L             2             2      -
P2P (Large)     3             7      0
M2L             4             3      2
P2P (Small)     5             4      1
L2P             6             5      -
P2P Restore     7             6      -

GPUs get assigned P2P and M2L operators, with a higher priority on P2P since GPUs are most efficient at that task.
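
Read as a lookup, the heterogeneous columns of this table could be encoded as sketched below; the enum names and the convention that a smaller number means a higher priority are illustrative (StarPU's own priority convention and the actual scheduler code may differ).

#include <optional>

enum class Op   { P2M, M2M, L2L, P2PLarge, M2L, P2PSmall, L2P, P2PRestore };
enum class Arch { CPU, GPU };

// Heterogeneous priorities from the table above; smaller value = scheduled earlier.
// GPUs only receive P2P and M2L tasks, so other operators have no GPU priority.
std::optional<int> priority(Op op, Arch arch) {
    if (arch == Arch::CPU) {
        switch (op) {
            case Op::P2M:        return 0;
            case Op::M2M:        return 1;
            case Op::L2L:        return 2;
            case Op::M2L:        return 3;
            case Op::P2PSmall:   return 4;
            case Op::L2P:        return 5;
            case Op::P2PRestore: return 6;
            case Op::P2PLarge:   return 7;
        }
    } else {
        switch (op) {
            case Op::P2PLarge:   return 0;
            case Op::P2PSmall:   return 1;
            case Op::M2L:        return 2;
            default:             break;     // not executed on the GPU
        }
    }
    return std::nullopt;
}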


Page 17

Performance of OpenMP and StarPU on multicore

[Plot] Performance (Gflop/s, roughly 10 to 100) vs. number of particles (10^5 to 10^8) for tree heights h = 5 to 8, comparing the fork-join, block fork-join, interleaved, and StarPU versions on 160 CPUs against the corresponding sequential runs × 160.


Page 18

StarPU on heterogeneous parallel platform: execution trace

Trace legend: idle, P2P, M2L, other kernels, copying data.

Top: Nehalem-Fermi M2070, execution time 12.5 sec. Bottom: Nehalem-Fermi M2090, execution time 10.9 sec. N = 30 × 10^6 particles. The first 9 lanes show the occupancy of the CPU cores and the last 3 lanes show the GPUs.


Page 19

Heterogeneous execution of an imbalanced FMM

Trace legend: idle, P2P, M2L, other kernels, copying data.

Top: M2L heavy; Bottom: P2P heavy


Page 20

Scheduling with empirical execution model

The scheduler has the option of using a model of the execution time to estimate the best mapping of the DAG onto resources.
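
A hand-rolled illustration of such a model is sketched below: fit t(n) ≈ a·n + b per processing unit from measured samples and let the scheduler pick the unit with the earliest predicted completion. This is only a sketch of the idea; it is not the runtime's actual performance-model API, and a linear fit is just one possible choice.

#include <cstddef>
#include <utility>
#include <vector>

// Least-squares fit of t(n) ~= a*n + b from measured (interactions, seconds) samples.
struct LinearModel {
    double a = 0.0, b = 0.0;

    static LinearModel fit(const std::vector<std::pair<double, double>>& samples) {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (const auto& s : samples) {
            sx += s.first;  sy += s.second;
            sxx += s.first * s.first;  sxy += s.first * s.second;
        }
        const double n = static_cast<double>(samples.size());
        LinearModel m;
        m.a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        m.b = (sy - m.a * sx) / n;
        return m;
    }
    double predict(double interactions) const { return a * interactions + b; }
};

// Choose the worker with the earliest predicted completion time for a task,
// given when each worker becomes available and its fitted model.
std::size_t bestWorker(const std::vector<LinearModel>& models,
                       const std::vector<double>& availableAt,
                       double interactions) {
    std::size_t best = 0;
    double bestTime = availableAt[0] + models[0].predict(interactions);
    for (std::size_t w = 1; w < models.size(); ++w) {
        const double t = availableAt[w] + models[w].predict(interactions);
        if (t < bestTime) { bestTime = t; best = w; }
    }
    return best;
}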

[Plot] Measured execution times and fitted models vs. number of interactions in a block (about 10^8 to 10^9), for a CPU core, a GPU M2070, and a GPU M2090.


Page 21

Conclusion

Modern heterogeneous computing platforms lead to parallel DAGs that are hard to schedule.

Approaches like OpenMP and OpenACC may prove insufficient.

Runtime optimization of parallel mapping was demonstrated.

Significant improvement in parallel scalability observed using StarPU.

Code is hardware "independent": the scheduler determines the best core to execute a given task.
