Alya: Computational Solid Mechanics for Supercomputers


Arch Computat Methods Eng (2015) 22:557–576. DOI 10.1007/s11831-014-9126-8

ORIGINAL PAPER

Alya: Computational Solid Mechanics for Supercomputers

E. Casoni · A. Jérusalem · C. Samaniego · B. Eguzkitza · P. Lafortune · D. D. Tjahjanto · X. Sáez · G. Houzeaux · M. Vázquez

Received: 17 July 2014 / Accepted: 22 July 2014 / Published online: 13 August 2014. © CIMNE, Barcelona, Spain 2014

Abstract While solid mechanics codes are now conventional tools both in industry and research, the increasingly exigent requirements of both sectors are fuelling the need for more computational power and more advanced algorithms. For obvious reasons, commercial codes are lagging behind academic codes, which are often dedicated either to the implementation of one new technique or to the upscaling of current conventional codes to tackle massively large scale computational problems. Only in a few cases have both approaches been followed simultaneously. In this article, a solid mechanics simulation strategy for parallel supercomputers based on a hybrid approach is presented. Hybrid parallelization exploits the thread-level parallelism of multicore architectures, combining MPI tasks with OpenMP threads. This paper describes the proposed strategy, programmed in Alya, a parallel multi-physics code. Hybrid parallelization is especially well suited for the current trend of supercomputers, namely large clusters of multicores. The strategy is assessed through transient non-linear solid mechanics problems, both for explicit and implicit schemes, running on thousands of cores. In order to demonstrate the flexibility of the proposed strategy under the advanced algorithmic evolution of computational mechanics, a non-local parallel overset mesh method (Chimera-like) is implemented and the conservation of scalability is demonstrated.

E. Casoni (B) · C. Samaniego · B. Eguzkitza · X. Sáez · G. Houzeaux · M. Vázquez
Barcelona Supercomputing Center (BSC-CNS), Edificio NEXUS I, Campus Nord UPC, Gran Capitán 2–4, 08034 Barcelona, Spain
e-mail: [email protected]

M. Vázquez
Artificial Intelligence Research Institute CSIC (IIIA-CSIC), Campus de la UAB, 08193 Bellaterra, Spain
e-mail: [email protected]

A. Jérusalem · D. D. Tjahjanto
IMDEA Materials Institute, Tecnogetafe, C/ Eric Kandel 2, 28906 Getafe, Spain

D. D. Tjahjanto
Department of Solid Mechanics, KTH Royal Institute of Technology, Teknikringen 8D, 10044 Stockholm, Sweden

P. Lafortune
Idra Simulation S.C.P., 08004 Barcelona, Spain

A. Jérusalem
Department of Engineering Science, University of Oxford, Parks Road, Oxford OX1 3PJ, UK
e-mail: [email protected]

Keywords Computational mechanics · Finite element method · Parallel computing · Chimera

1 Introduction

With the increasing need for more advanced modeling techniques involving non-local approaches or multiphysics interactions, finite element (FE) solid mechanics code requirements have traditionally slowed down the parallel efforts aimed at increasing the computational scalability of such codes. One of the direct consequences of this is that most commercial codes cannot efficiently scale on parallel computers when more than hundreds of cores are used [5,24]. Academic FE codes, on the other hand, have often relied on the need to develop one unique technique of interest, potentially followed by a secondary development phase aimed at scaling it up because of the prohibitive cost of the technique. As a direct consequence, researchers have often focused on the scalability of one technique independently of the other ones; see for example the Arbitrary Lagrangian-Eulerian (ALE) methods [51], Discontinuous Galerkin (DG) methods [21,71], fluid–structure interaction (FSI) [27,51,67] or even expensive constitutive models (and their also expensive meshing requirements) such as crystal plasticity [70]. It is also noticeable that the range of fields of application for these references is extremely wide, ranging from bio-medical, military, seismic, or fracture mechanics to polycrystalline texture analysis.

1.1 Fluid Mechanics Versus Solid Mechanics

Fluid mechanics computational codes have generally been exploiting advanced techniques conjointly with parallel efficiency (see for instance OpenFOAM [9], Alya [1] or CODE_SATURNE [4]). This can be explained by the fact that fluid problems have traditionally required larger meshes than Solid Mechanics problems, but also by the fact that model requirements for solids may hinder the parallelization itself, for instance in fracture mechanics. In that sense, the development of parallel Solid Mechanics codes requires a more complex structure in order not only to solve the partial differential equation (PDE) involved, but also to add the complex features typical of computational Solid Mechanics: e.g., non-linear behavior of materials, non-locality of the model in fracture mechanics or continuum damage theory, non-linear (and unphysical) forces in contact mechanics, the multi-scale models needed for fracture and cracks, etc.

1.2 FE Solid Mechanics Codes

Designing computational models of complex material deformation mechanisms such as fracture, phase change or impact requires a combination of new numerical methods and improved computing power. Beyond the advanced numerical techniques, increased computational power is needed to resolve the many length and time scales of these problems. Most modern FE codes can parallelize over distributed memory clusters to allow for the solution of larger meshes or to reduce computation time. Sandia Labs has developed the SIERRA framework [77], which provides a foundation for several computational mechanics codes, for instance the highly scalable software Salinas [23] or the open source code CODE_ASTER [3]. Other open source software such as Gmsh [36] or FEniCS [61] also offer interfaces to PETSc [19] or Epetra [45]. Recently, however, shared memory multicore technology such as graphics processing units (GPUs) [55] has gained prominence due to its arithmetic ability and high power efficiency, as in the FEAST software package [41].

Large scale Solid Mechanics FE computation became a field of research on its own as early as the 1980s [37]. However, while preconditioner optimization studies for large scale computing were tackled shortly after [49], and barring a few exceptions (e.g., Ref. [35] focusing on XFEM), they have almost always focused on conventional FE codes (and often on one unique application) without consideration of all the previously listed evolutions [17,33,60,79]. In the particular case of FE for Solid Mechanics, many computer codes were written to solve one specific application and are not flexible enough to adapt to the newest schemes and technological improvements. These limitations of code structure make it difficult to integrate the latest research that is required to solve ever larger and more complicated problems. Instead, the rapid advances in parallel computation require the code to be developed from the beginning for any large scale application, and then enhanced with advanced techniques following the same large scale framework imposed by the code structure.

1.3 System of Equations

The cornerstone of any FE method involves the assembly and the solution of a linear or non-linear system of equations. In an implicit scheme, this solution process requires solving a linear system of equations, which usually has a dimension of millions or even billions of degrees of freedom. The solution of large sparse linear systems of equations is the main concern in any parallel code [43]. The algorithms to solve these systems are basically grouped into two categories: direct methods and iterative methods. For decades, research mainly focused on taking advantage of sparsity to design direct methods that can be economical. However, a substantial portion of the total computational work and storage required to solve stiff initial value problems is devoted to solving these linear algebraic systems, particularly if the system is large (see for instance the study in Ref. [38]). Over the last decade, several efficient iterative methods have been developed to solve large sparse (non-symmetric) systems of linear algebraic equations that involve a large number of variables (sometimes of the order of millions) [72]. In these cases, direct methods would be prohibitively expensive even with the best available computing power, and iterative methods appear as the only rational alternative.

1.4 Algebraic Solvers

An algebraic system Ax = b can be approached by two categories of methods. Krylov subspace methods tackle the problem by solving directly for x. Some examples are the conjugate gradient (CG) method for symmetric cases, and the generalized minimal residual (GMRes) and stabilized biconjugate gradient (BiCGSTAB) methods for non-symmetric cases [72]. The other family of methods are the primal and dual Schur complement methods, which solve for the interface unknowns, i.e., the restriction of x on the subdomain interfaces and its dual (e.g. the FETI method [32,73]), respectively. For this last class of methods, Krylov solvers can then be used to solve the interface problem. In both approaches, a preconditioner is also required to enhance the convergence. Examples of classical preconditioners are Gauss-Seidel, block diagonal, Deflated [62] or Multigrid [13,57,66]. The preconditioner can be based on domain decomposition (DD) methods as well. Some examples are Schwarz, Restricted Additive Schwarz (RAS), Block LU and Incomplete Block LU, where the support of the preconditioners coincides with the subdomains of the mesh partition.
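As a minimal, self-contained illustration of the Krylov-plus-preconditioner idea discussed above (a sketch only, not Alya's solver: the dense matrix storage, function names and tolerance handling are hypothetical, and a real FE code would use a sparse format such as CSR):

    #include <math.h>
    #include <stdlib.h>

    /* Jacobi (diagonal) preconditioned conjugate gradient for a dense SPD system.
       A is stored row-major with dimension n x n; returns the iteration count. */
    static void matvec(int n, const double *A, const double *x, double *y) {
        for (int i = 0; i < n; ++i) {
            y[i] = 0.0;
            for (int j = 0; j < n; ++j) y[i] += A[i*n + j] * x[j];
        }
    }

    static double dot(int n, const double *a, const double *b) {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    }

    int pcg(int n, const double *A, const double *b, double *x, double tol, int maxit) {
        double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
        double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);
        matvec(n, A, x, q);                                     /* r = b - A x0   */
        for (int i = 0; i < n; ++i) r[i] = b[i] - q[i];
        for (int i = 0; i < n; ++i) z[i] = r[i] / A[i*n + i];   /* z = M^{-1} r   */
        for (int i = 0; i < n; ++i) p[i] = z[i];
        double rz = dot(n, r, z);
        int it;
        for (it = 0; it < maxit && sqrt(dot(n, r, r)) > tol; ++it) {
            matvec(n, A, p, q);                                 /* q = A p        */
            double alpha = rz / dot(n, p, q);
            for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
            for (int i = 0; i < n; ++i) z[i] = r[i] / A[i*n + i];
            double rz_new = dot(n, r, z);
            for (int i = 0; i < n; ++i) p[i] = z[i] + (rz_new / rz) * p[i];
            rz = rz_new;
        }
        free(r); free(z); free(p); free(q);
        return it;
    }

The same skeleton accepts other classical preconditioners by replacing the diagonal scaling step with the corresponding application of M^{-1}.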

Some definitions are now introduced to assess the efficiency of a solver plus a preconditioner in a parallel environment. Let np and dof be the number of parallel processes and the number of degrees of freedom of the algebraic system, respectively. If t_np(dof) is the time to achieve the convergence criterion of the solver, the efficiency of the strong scalability is defined as

Efficiency of strong scalability = t_1(dof) / (np × t_np(dof)).

Thus, a solver with unit efficiency enables one to achieve convergence np times faster by using np processes, i.e. t_np(dof) = t_1(dof)/np. Krylov solvers with classical preconditioners can be made strongly scalable, as long as dof is sufficiently large on each process, in order to keep a high work versus communication ratio. However, if one wishes to increase the global number of degrees of freedom by refining the mesh, the strong scalability does not provide any information about the computing time. Weak scalability means that the computing time does not increase if one maintains a fixed number of dof per process. In this case, perfect efficiency is achieved if the run time stays constant while the workload (dof) is increased in direct proportion to the number of processes:

Efficiency of weak scalability = t_1(dof) / t_np(np × dof).

This criterion is useful to estimate the CPU time while refining the mesh and to answer the following question: if one refines the mesh by a given factor, how many more CPUs should one use to converge in the same computing time? Even though it is relatively easy to maintain the number of iterations of the solver independently of dof, it is much harder to maintain the computing time. This is due to the fact that complex preconditioners must be used, in addition to some coarse space solver to propagate information across the CPUs. These coarse space solvers are usually the bottleneck of the weak scaling [18,39].
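As an illustration with hypothetical timings (not measurements reported in this paper): if a fixed-size problem takes t_1(dof) = 1,000 s on one process and t_64(dof) = 20 s on 64 processes, the strong scalability efficiency is 1,000 / (64 × 20) ≈ 0.78. If, instead, the mesh is refined so that each of the 64 processes keeps the original dof per process and the run then takes 1,250 s, the weak scalability efficiency is 1,000 / 1,250 = 0.80.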

It must be emphasized that some solvers present the same convergence independently of np. This is the case, for example, of Krylov methods together with a diagonal preconditioner. Weakly scalable solvers do not, as the mesh partition is the basis for the construction of the local solvers. Therefore, they do not enable reproducibility. This means that if a simulation is carried out on 1,024 CPUs one day, a different convergence history or even different results may be obtained on another day using 256 CPUs. This is not a trivial remark, as reproducibility is a very relevant aspect in industry.

For some applications leading to very stiff systems, direct solvers could even be required. Direct methods appear as a feasible solution for solving linear systems that are ill-conditioned. Actually, they are commonly used in implicit FE codes for structural dynamics problems. However, these methods were initially discarded due to their high storage requirements. In the past, CPU and memory limitations prevented the use of these methods, but nowadays, with the increasing advances in computing power, new trends in direct solvers are being investigated. These are either multifrontal [16] or supernodal [34] techniques. Multifrontal techniques are basically represented by the multifrontal massively parallel solver (MUMPS) [28] and the Watson Sparse Matrix Package (WSMP) [40]. SuperLU_DIST [58] is an MPI-parallel version of the SuperLU family of solvers for unsymmetric systems based on supernodal right-looking LU factorization. A list of the available software for solving sparse linear systems via direct methods is presented in Ref. [12]. Performance results of these iterative solvers are compared against a direct solver routine of the commercial Harwell Software Library [53].

Many studies on the behavior of iterative solvers in classical Solid Mechanics problems have been carried out. For instance, a comparison was made between preconditioned conjugate gradients (PCG) and Multigrid (MG) methods, applying them to a series of test problems of plane elasticity [52]. Different iterative solvers for large scale linear algebraic systems for 3D elasticity are compared in Ref. [30]. Solving nonlinear Solid Mechanics problems with a Newton-type method can be problematic if the determination, storage or solution cost associated with the Jacobian is high. The Jacobian-Free Newton–Krylov methods [54], initially developed for CFD problems, have been applied to non-linear computational Solid Mechanics in Ref. [42]. Multigrid methods, either as iterative methods or as preconditioners, have been studied in many application areas: for the Helmholtz equation when analyzing frequency responses of a structure [14], for plasticity problems [85] or recently in fracture mechanics with the XFEM method [83]. In contrast, direct solvers are not as popular as iterative solvers. MUMPS has been used for simulations of linear elasticity coupled with acoustics in [69]. SuperLU was used to solve FSI problems with large displacements in Ref. [44]. As a conclusion, despite some recent efforts, more focused research is needed to analyze solver behavior and requirements in complex structures or in the recent techniques for the specific features of Solid Mechanics problems, such as fracture mechanics.

Fig. 1 General architecture of a supercomputer: nodes, cores and accelerators, and associated common programming languages

1.5 Parallel Context

High performance computational mechanics codes exploit one or more levels of parallelism offered by modern supercomputers. A general supercomputer architecture is depicted in Fig. 1, together with some common programming languages. At the coarse level, the architecture consists of nodes, in a distributed memory environment. At the medium level, each node is composed of a given number of cores, in a shared memory environment. These nodes can, in turn, use accelerators (e.g. GPUs or Xeon Phi) as a fine level of parallelism [68].

MPI is the most commonly used message passing library on distributed memory architectures, while OpenMP [10] is the preferred one for shared memory architectures, i.e. for parallelizing the computation at the core level. Accelerators can be programmed using languages like OpenCL [8], OpenACC [7] or CUDA [63]. Very few unstructured mesh computational mechanics codes have been ported to accelerators [2,26,46]. There are three main reasons for this: the huge cost of porting large codes; the difficulty in mapping codes with irregular memory access to the vector architecture (SIMD) of accelerators; and the lack of a standard language [25]. One standard candidate is the new version of OpenMP, namely OpenMP 4 (see new advances in [10]). On the other hand, Intel's solution for accelerators, the Xeon Phi, presents very appealing features for unstructured meshes due to its very low programming effort together with its computing power. See Ref. [82] for a more complete assessment of Solid Mechanics problems.

1.6 Libraries

Many different software packages are available to solve a linear system of equations. Most modern solid mechanics codes include an abstract interface to a LinearSolver class that operates on a sparse Matrix and a Vector for the right-hand side. Inherited from this LinearSolver base class are different solvers that can interface to external packages. Among the most prominent of these packages are PETSc [20], which also supports MPI and GPU architectures, and Hypre [31], which also provides Multigrid preconditioning in a parallel MPI environment. Moreover, MUMPS methods are also included in a direct solver library developed in Ref. [16], which is supported by MPI. The WSMP software package [40] is also developed using a hybrid implementation with MPI and Pthreads, among others. Abstracting the solver interface in this manner allows the application writer to choose the most effective solver for the particular problem.
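A minimal sketch of such an abstraction in C is given below; the type and member names are hypothetical and do not correspond to any of the packages cited above, they only illustrate how a solver back end can be hidden behind a small interface:

    /* Hypothetical handles for the assembled system (e.g. CSR storage). */
    typedef struct { int n; /* ... sparse matrix arrays ... */ } SparseMatrix;
    typedef struct { int n; double *val; } Vector;

    /* Abstract linear solver: each back end (CG, GMRes, a direct solver,
       or a call into an external library) fills this table of function pointers. */
    typedef struct {
        void *ctx;                                        /* back-end private data */
        int  (*setup)(void *ctx, const SparseMatrix *A);  /* factor / precondition */
        int  (*solve)(void *ctx, const Vector *b, Vector *x);
        void (*destroy)(void *ctx);
    } LinearSolver;

    /* Application code sees only the abstract interface. */
    int solve_system(LinearSolver *ls, const SparseMatrix *A,
                     const Vector *b, Vector *x) {
        if (ls->setup(ls->ctx, A) != 0) return 1;
        return ls->solve(ls->ctx, b, x);
    }

Swapping solvers then amounts to initializing the structure with a different set of routines, without touching the assembly or time-integration code.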

1.7 Alya

Alya [1] has been conceived as a Computational Mechanics code, developed at the Barcelona Supercomputing Center (BSC-CNS) and aimed at solving PDEs on unstructured meshes. Alya exploits the similarities between the PDE-governed problems it solves to tackle, with high parallel performance, compressible and incompressible flows, thermal flows, excitable media or quantum mechanics for transient molecular dynamics [29,47,48,81], while running on thousands of cores. Parallelization is hidden behind a common solver that assembles matrices and residuals and carries out the solution scheme. The scalability of the code thus exclusively depends on one unique set of parallel communication subroutines, independently of the physics of the problem. Additionally, Alya is especially well suited for the parallel solution of coupled multi-physics problems.

In the present work, the Solid Mechanics module of Alya is introduced. Using the large scale PDE solver capability of Alya's kernel, the solid module was implemented so that any future development of more complex FE techniques should conserve the high scalability of the code. These developments have been carried out along two parallel strategies: first, the MPI implementation at the node level, where the parallelization is based on a sub-structuring technique and uses a Master–Worker strategy [48]; and second, the OpenMP implementation used to exploit many-core processors. The overall code is thus hybrid MPI/OpenMP. Non-symmetric problems are solved using GMRes or BiCGSTAB schemes, and symmetric ones using CG methods [72]. Several preconditioners are available: diagonal, Deflated, RAS, block LU, etc.

The Alya framework is especially well suited to solve coupled problems. A Chimera overset mesh strategy is a particular way of coupling problems: two or more problems are solved on different meshes which are connected using their respective geometrical information. This connection scheme results in the fusion of all the problems into one single matrix. The problem can then be solved using the same parallelization scheme proposed earlier [29]. This scheme was thus implemented in Alya as an illustration of its large scale parallelization strategy, where the code structure naturally lends itself to complex non-local approaches without loss of scalability performance.

The article is structured as follows. Section 2 introduces the governing equations and briefly describes the numerical method. Section 3 describes the structure of Alya, with special attention to the parallel strategy and the solver in Sect. 4. The parallelization and hybrid strategies are formally presented in Sect. 5, where the optimal scalability of the code is also shown. Section 6 presents some application problems where the Alya modules have been used with good performance. The paper ends with some conclusions and future lines of work.

2 Numerical Scheme

The computational solid mechanics problem is solved using a standard Galerkin method in a large deformation framework, with a generalized Newmark time integration scheme. This framework, developed in a total Lagrangian formulation, is only briefly summarized below; for more details, see Ref. [59].

2.1 Standard Galerkin Governing Equations

Let ϕ : R^3 → R^3 be the function that maps a material point of an undeformed body, X ∈ B_0, in the reference configuration to its point x = ϕ(X) ∈ B in the current (or deformed) configuration. The deformation gradient tensor F is defined as F := ∇_0 x, where ∇_0 is the gradient operator with respect to the reference configuration. In a Cartesian basis, the components of F are given by F_iJ = ∂x_i/∂X_J. Index i in vector x and index J in vector X are written in lower and upper case to indicate that the component refers to the current and reference configurations, respectively. Since x(X) = X + u(X), where u is the displacement vector, the deformation gradient can be written as F = I + ∇_0 u, with ∇_0 u the displacement gradient.

The equation of balance of momentum with respect to the reference configuration can be written as

Div P + b_0 = ρ_0 ü,   ∀X ∈ B_0,   (1)

where ρ_0 is the mass density (with respect to the reference volume) and Div is the divergence operator with respect to the reference configuration, with Div P = ∇_0 · P. Tensor P and vector b_0 stand for the first Piola–Kirchhoff stress and the distributed body force on the undeformed body, respectively. Prescribed displacements and tractions are applied at the reference boundary Γ_0 = Γ_d0 ∪ Γ_n0, where Γ_d0 and Γ_n0 correspond to the Dirichlet and Neumann boundary conditions, respectively:

u = ū,   ∀X ∈ Γ_d0,   (2)

P · N_0 = t̄_0,   ∀X ∈ Γ_n0,   (3)

where N_0 is the normal to the boundary in the reference configuration.

As usual in finite element methods [86], the weak form of the balance of momentum (1) can be formulated for any arbitrary admissible virtual displacement w, such that

∫_B0 P · ∇_0 w dV + ∫_B0 ρ_0 ü · w dV = ∫_B0 b_0 · w dV + ∫_Γn0 w · t̄_0 dΓ.   (4)

For a FE approximation Ω_0 = ⋃_e Ω^e_0 of the undeformed continuum body B_0, let u_h be a polynomial approximation of degree k to the actual displacement u:

u_h(X) = N(X) u_h,   (5)

where u_h denotes the time-dependent nodal displacements and N the three-dimensional matrix of shape functions.

Then, the discrete form of (4) consists in solving, for the array of nodal displacements u_h ∈ R^3,

M ü_h + f_int = f_ext,   (6)

where M, f_int(u_h) and f_ext are, respectively, the mass matrix and the vectors of internal and external forces. For more references, see [59].

The generalized Newmark formulation used here for Eq. (6) can be written as [59]

M ü_h^{n+1} + f_int^{n+1} = f_ext^{n+1},   (7)

u_h^{n+1} = u_h^n + Δt u̇_h^n + (Δt)^2 [ (1/2 − β) ü_h^n + β ü_h^{n+1} ],   (8)

u̇_h^{n+1} = u̇_h^n + Δt [ (1 − γ) ü_h^n + γ ü_h^{n+1} ],   (9)

where superscripts "n" and "n+1" indicate that the variables are evaluated at time t^n and t^{n+1}, respectively. Parameters β and γ set the characteristics of the Newmark scheme. The value β = 0 leads to an explicit Newmark scheme, whereas 0 < β ≤ 0.5 leads to an implicit scheme. In the latter case, the set of equations (7)–(9) is solved for the unknown displacement u_h^{n+1}, velocity u̇_h^{n+1}, and acceleration ü_h^{n+1} using the iterative Newton-Raphson algorithm.
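For the explicit case (β = 0) with a lumped, i.e. diagonal, mass matrix, one time step reduces to simple nodal updates. The C sketch below only illustrates Eqs. (7)–(9) under these assumptions (array names and the lumped-mass simplification are hypothetical, not Alya's implementation); the internal force vector must be re-assembled from the updated displacements between the two calls:

    /* Explicit Newmark step (beta = 0). All arrays have length ndof. */

    /* Eq. (8) with beta = 0: displacement predictor u^{n+1}. */
    void newmark_predict(int ndof, double dt, double *u,
                         const double *v, const double *a) {
        for (int i = 0; i < ndof; ++i)
            u[i] += dt * v[i] + 0.5 * dt * dt * a[i];
    }

    /* Eqs. (7) and (9): acceleration from the lumped mass m, then velocity update. */
    void newmark_correct(int ndof, double dt, double gamma,
                         double *v, double *a, const double *m,
                         const double *fint, const double *fext) {
        for (int i = 0; i < ndof; ++i) {
            double a_new = (fext[i] - fint[i]) / m[i];            /* M a^{n+1} = fext - fint */
            v[i] += dt * ((1.0 - gamma) * a[i] + gamma * a_new);  /* Eq. (9)                 */
            a[i] = a_new;
        }
    }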

3 Numerical Implementation

The Alya system is a computational mechanics code developed with two main motivations. First, it was designed to run with high efficiency standards in large scale supercomputing facilities. Secondly, various physical problems should be allowed to be solved individually or in a coupled manner, while conserving exceptional scalability and retaining their individual efficiency.

Fig. 2 solidz module structure in Alya; kernel elements are in blue and services in red. (Color figure online)

Alya's architecture is modular, grouping the different tasks into kernel, modules and services. The kernel, the core of Alya, contains all the facilities required to solve any set of discretized PDEs (e.g., the solver, the I/O, the coupling, the element database, the geometrical information, etc.).

The physical description of a given problem is provided by its corresponding module (e.g., the discretized terms of the PDE, the meaning of the boundary and initial conditions, etc.). Other Alya modules than the one presented here include incompressible or compressible flow, thermal transport, or the N-body problem, among others. Finally, the services contain the toolboxes providing several independent procedures to be called by modules and kernel (e.g., parallelization or optimization).

3.1 Alya’s Solid Mechanics Module

A large database of element types and integration schemes of different orders is available from previous work in the other modules. Well known constitutive equations for large deformation elasticity, such as the neo-Hookean or specific hyperelastic models, were also implemented.

Figure 2 shows a flowchart of the solidz module. All the geometrical and physical data of the problem are introduced as input files. Once these are read, Alya initializes the computation within the solidz module, either in serial or in parallel mode. The parallel service must be specified in the input files.

3.2 Mesh Convergence

The following test aims at studying the mesh convergence of the Alya-solidz module. To this end, a target solution u^(e) with a given degree of regularity is used. By using a stationary manufactured solution belonging to the finite element space (for linear and bilinear elements), the numerical scheme should provide an exact nodewise solution, i.e., u = u^(e), thus (a) confirming that the equation is correctly coded and (b) allowing for the proposed mesh convergence study. The analysis below uses indicial notation and the following equation, obtained from Eq. (1), is solved:

ρ_0 ü_i − ∂P_iJ/∂X_J = −∂P^(e)_iJ/∂X_J,   (10)

where P^(e)_iJ = P_iJ(u^(e)). Assuming large deformations and using a linear elastic constitutive law, the last term of Eq. (10) is given by:

∂P^(e)_iJ/∂X_J = C_KJML [ (F^(e)_iK/2) ( ∂²u^(e)_n/(∂X_J ∂X_M) F^(e)_nL + ∂²u^(e)_n/(∂X_J ∂X_L) F^(e)_nM ) + ∂²u^(e)_i/(∂X_K ∂X_J) E^(e)_ML ],   (11)

where C_KJML is the fourth order elasticity tensor and F_nL (or F_iK and F_nM) and E_ML are the deformation gradient and the Green-Lagrange strain tensors, defined as:

F^(e)_mL = ∂u^(e)_m/∂X_L + δ_mL,
E^(e)_ML = F^(e)_nM F^(e)_nL − δ_ML.   (12)

The 2D manufactured solution considered here is arbitrarily chosen to be:

u^(e)_1 = X_1^3 X_2^4,
u^(e)_2 = X_1^3 X_2^3,   (13)

and is solved on a unit square. Figure 3 shows the resulting mesh convergence results computed on a regular mesh of Q1 (quadrilateral) elements. The L2-norm of the error of the displacement and of the error of its gradient, ||ε(u)||_L2 and ||ε(u)||_H1, respectively, are shown; they are computed as follows:

||ε(u)||_L2 = [ ∫_Ω (u_h − u^(e))² dΩ ]^{1/2} / [ ∫_Ω (u^(e))² dΩ ]^{1/2},

||ε(u)||_H1 = [ ∫_Ω (∇u_h − ∇u^(e))² dΩ ]^{1/2} / [ ∫_Ω (∇u^(e))² dΩ ]^{1/2}.

The results successfully confirm first order convergence of the gradient and second order convergence of the displacement in the L2-norm with respect to h for the solidz module [50].

Fig. 3 Mesh convergence study: L2-error of the displacement and of the gradient of the displacement, compared with reference slopes h and h^2
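As a quick reading of these rates with hypothetical error values: halving the element size h should divide the L2 error of the displacement by about 4 (error ∝ h^2) and the error of its gradient by about 2 (error ∝ h); for instance, going from h = 0.1 to h = 0.05, an L2 error of 10^-3 would drop to roughly 2.5 × 10^-4 and an H1 error of 10^-2 to roughly 5 × 10^-3.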

4 MPI Parallelization

4.1 Parallel Context

The parallelization for distributed memory supercomputers is based on a domain decomposition technique, using a Master–Workers paradigm and MPI as the message-passing library. At the core of the outer parallelization layer lies the automatic graph partitioning tool METIS [6]. First, the Master reads the mesh and partitions it into sub-meshes, or subdomains, using METIS. Each of these subdomains is assigned to a Worker. The Master distributes each partition among the Workers, which carry out the computational work of the solution. The METIS mesh partition is done by maximizing load balance and minimizing communication (see Ref. [48] for more details). As a second step, the Workers build the local element matrices and local right-hand sides, and are in charge of solving the resulting system in parallel. In the assembly tasks, no communication is needed between the Workers, and the scalability only depends on the load balancing. In the iterative solvers, the scalability depends on the size of the interfaces and on the communication scheduling.

Depending on the constitutive model, the resulting equations can be symmetric or non-symmetric. In the implicit case, the basic iterative solvers are GMRes and CG [72]. During the execution of these iterative solvers, two main types of asynchronous communications are required:

– Point-to-point communications via MPI_Isend and MPI_Irecv, which are used when sparse matrix-vector products are calculated.

– Collective communications via MPI_Allreduce, which are used to compute residual norms and scalar products.

In the current implementation of Alya, the solution obtained in parallel is, up to round-off errors, the same as the sequential one all the way through the computation. This is because the mesh partition is only used for distributing work, without in any way altering the actual sequential algorithm. This would not be the case if one considered more complex solvers, like primal or dual Schur complement solvers [64], or more complex preconditioners, like linelet [74] or block LU [65]. Since the explicit framework is relatively straightforward to implement, in the following we only focus our attention on the implicit framework.

The numerical solution of a PDE (and consequently of the solid mechanics equations) consists mainly of two steps: first, the construction of the matrix A and right-hand side (RHS) b of the algebraic system Ax = b; second, the solution of this system using an iterative solver. As far as the matrix and RHS assemblies are concerned, only part of the matrix is assembled for the interface nodes. For the iterative solvers, the basic operations are the matrix-vector and the dot products. Let us consider these two operations for the simple one-dimensional case illustrated in Fig. 4.

In the context of FE, the coefficients of the matrix come from element computations (see Sect. 2.1). The contribution of node 3 comes from subdomains 1 and 2, namely A^(1)_33 and A^(2)_33, respectively. Since

y_3 = A_32 x_2 + A_33 x_3 + A_34 x_4,

and by rewriting it as

y_3 = (A_32 x_2 + A^(1)_33 x_3) + (A^(2)_33 x_3 + A_34 x_4) = y^(1)_3 + y^(2)_3,

the parallelization only consists of a residual exchange. Therefore, the idea is to use the distributive law of multiplication to carry out the matrix-vector product in parallel. Let us introduce two functions, PAREXC and PARASM, which will be described formally in the next subsection. In the previous 1D example, the asynchronous matrix-vector product can be carried out in parallel as follows:

– Perform the local matrix-vector product for the boundary nodes (node 3);

– PAREXC: exchange the results on the interface, y^(1)_3 and y^(2)_3, using non-blocking MPI functions;

– Perform the local matrix-vector product for the interior nodes (rows 1 and 2 for subdomain 1, rows 4 and 5 for subdomain 2);

– Synchronization: MPI_Waitall;

– PARASM: assemble (sum) the local contributions of each subdomain: y_3 = y^(1)_3 + y^(2)_3.

Fig. 4 Iterative solver basic operations: matrix-vector product and dot product

Regarding the dot product x · y, we only need to introduce the concept of own interface node. In the current implementation, all vectors are assembled on the interface, which implies

x^(1)_3 = x^(2)_3 = x_3   and   y^(1)_3 = y^(2)_3 = y_3.

If both subdomains compute their entire local dot product, the sum of the two contributions will account for x_3 y_3 twice, hence

α = x_1 y_1 + x_2 y_2 + 2 x_3 y_3 + x_4 y_4 + x_5 y_5,

which is a wrong result. In order to account only once for the interface values, all the interface nodes are split into own interface nodes and oth (for other) interface nodes. That is, an interface node has the own status in only one subdomain, although it can be shared by others with status oth. The local dot product thus only involves interior nodes and interface nodes with own status.

In the next section we formally define the concepts of interior and interface nodes and introduce some notation and operators, as well as the functions PAREXC and PARASM that perform the parallel matrix-vector and dot products. The term boundary will refer to the interfaces between subdomains.

4.2 Formal Set Definitions

Before introducing the parallel operators, it is first necessary to define the basic sets categorizing the nodes and elements of each subdomain, as well as the relations between the subdomains. Some of the definitions are illustrated in Fig. 5.

Consider a computational domain Ω ⊂ R^3. Let N = {n_1, n_2, ..., n_N} and E = {e_1, e_2, ..., e_E} be the sets of all nodes and elements, respectively, of the FE mesh used to discretize the computational domain. Here, N and E denote the total number of nodes and elements of the computational mesh.

Fig. 5 Node sets

Any element e ∈ E can be defined as a tuple of nodes e = (n^e_1, n^e_2, ..., n^e_k), where k is the number of nodes per element.

Two sets are defined to describe the nodal connectivity with the elements and nodes of the mesh, L_ele(n) and L_nod(n), respectively. For any arbitrary node n ∈ N:

Definition (Element connectivity of n.) Let L_ele(n) denote the set of elements in E directly connected to the node n. Formally,

L_ele(n) = {e ∈ E : n ∈ e}.   (14)

Definition (Node connectivity of n.) Let L_nod(n) denote the set of nodes in N directly connected to n. Formally,

L_nod(n) = {m ∈ N : ∃ e ∈ L_ele(n), m ∈ e} \ {n}.   (15)


The latter set represents the nodal connection of the mesh, and the matrix graph as well. It contains the information required to construct the Compressed Sparse Row (CSR) format used to assemble the sparse matrix of the algebraic system.

We then consider a domain decomposition by elements (see the example of Sect. 4.1). In this work, the METIS [6] library is used. Let N^I and E^I denote the sets of nodes and elements of any arbitrary subdomain Ω_I, respectively. The total number of nodes and elements of the mesh can be grouped by subdomains:

E = ⋃_{I=1}^{S} E^I,   N = ⋃_{I=1}^{S} N^I,

where S is the number of subdomains. As the domain decomposition is performed by elements, the nodes can be shared between subdomains (the boundary nodes), but the element subdomains are non-overlapping.

The interface created by the partition of the mesh divides the set of nodes of any arbitrary subdomain Ω_I into two disjoint sets. On the one hand, the set of interior nodes:

Definition (Interior nodes of Ω_I.) Let N^I_int denote the set of interior nodes of the subdomain Ω_I. Formally,

N^I_int = N^I \ ( ⋃_{J=1, J≠I}^{S} N^J ).

These nodes do not belong to the boundary.

On the other hand, the set of boundary nodes:

Definition (Boundary nodes of Ω_I.) Let N^I_bou denote the set of boundary nodes of Ω_I. Formally,

N^I_bou = N^I \ N^I_int.

These nodes belong to the interface and are shared by different subdomains, including Ω_I.

In this new context of domain decomposition, let us define a subset of the set L_nod(n) related to any boundary node n ∈ N^I_bou as:

Definition (Node connectivity of n belonging to Ω_I.) Let L^I_nod(n) denote the set of nodes in N^I directly connected to n. Formally,

L^I_nod(n) = {m ∈ N^I : ∃ e ∈ L_ele(n), m ∈ e} \ {n}.   (16)

The concept of adjacency between subdomains is defined as:

Definition (Adjacent subdomains.) Two arbitrary subdomains I and J are adjacent when N^I ∩ N^J ≠ ∅.

In order to carry out the scalar product we also need to introduce the concepts of own boundary and oth boundary, where oth is defined in Sect. 4.1. This is achieved by splitting the interfaces between the subdomains that share them, for example with METIS.

Definition (Own and other's boundaries of Ω_I.) Let us define N^I_bou,own and N^I_bou,oth such that

N^I_bou,own ∩ N^I_bou,oth = ∅,
N^I_bou,own ∪ N^I_bou,oth = N^I_bou,
N^I_bou,own ∩ ( ⋃_{J=1, J≠I}^{S} N^J_bou,own ) = ∅.

4.3 Parallel Operators

We now define the parallel operators used in the algebraic solver to carry out the matrix-vector product and the assembly of contributions in an asynchronous way.

The node and element numbering is represented by a local index, index^I, and by an index used to exchange information between two neighboring subdomains I and J, index^{I,J}_bou. For any arbitrary subdomain Ω_I:

Definition (Local node numbering in Ω_I.) Let the function

index^I : N^I → {1, 2, 3, ..., |N^I|}   (17)

be the local node numbering in Ω_I, defined as index^I(n) = i^I, where n ∈ N^I and i^I ∈ N. |N^I| is the cardinal of the set.

For implementation purposes, the interior nodes are numbered first, followed by the boundary nodes, as shown in Fig. 6. This ordering is useful to carry out not only the matrix-vector product but also the dot product in an efficient way.

Note that for two arbitrary adjacent subdomains Ω_I and Ω_J, index^I(n) is not necessarily equal to index^J(n) for a boundary node n ∈ N^I_bou ∩ N^J_bou.

Definition (Shared boundary node numbering in Ω_I and Ω_J.) Let the function

index^{I,J}_bou : N^I_bou ∩ N^J_bou → {1, 2, 3, ..., |N^I_bou ∩ N^J_bou|}   (18)

be the shared boundary node numbering for two arbitrary adjacent subdomains Ω_I and Ω_J, defined as index^{I,J}(n) = i ∈ N, where n ∈ N^I_bou ∩ N^J_bou.

This index facilitates the data exchange between two subdomains and is constructed in such a way that both subdomains order their shared boundary nodes in the same way, see Fig. 7. In Alya, this task is carried out by the Master process, after the partition of the mesh.

Fig. 6 Node numbering of subdomain Ω_I

Fig. 7 Shared node numbering index^{I,J}(n) of the interface nodes between Ω_I and Ω_J

Taking into account the previously defined functions, the implementation details of the operators PAREXC and PARASM from the point of view of any arbitrary subdomain Ω_I are provided in Algorithms 1 and 2. As the exchange of data between subdomains is asynchronous, the operator is divided into two parts: the data exchange, given by Algorithm 1, and the assembly operation, given by Algorithm 2.

Algorithm 1 The parallel operator PAREXC from Ω_I: data exchange

Input: a numeric array data of length |N^I|
Output: void
for each adjacent subdomain J of I do
  for each node n ∈ N^I_bou ∩ N^J_bou do
    i = index^I(n)
    i_bou = index^{I,J}(n)
    Construct the array data_send(i_bou) = data(i)
    MPI_Isend to send the array data_send to J
    MPI_Irecv to receive the array data_receive^J from J
  end for
end for

4.4 Matrix-Vector and Dot Products

The two main parallel functions of an iterative solver, i.e., the matrix-vector and dot products, referred to as MATVEC and DOTPRO respectively, are described here using the previously defined parallel operators PAREXC and PARASM. The matrix-vector product is given by Algorithm 3. For the sake of clarity, the indices of the matrix are not given in the CSR format.

Algorithm 2 The parallel operator PARASM from Ω_I: data assembly

Input: a numeric array data of length |N^I|
Output: a modified array data
for each adjacent subdomain J of I do
  for each node n ∈ N^I_bou ∩ N^J_bou do
    i = index^I(n)
    i_bou = index^{I,J}(n)
    data(i) = data(i) + data_receive^J(i_bou)
  end for
end for

Algorithm 3 The parallel matrix-vector product MATVEC

for each subdomain I do
  for each n ∈ N^I_bou do
    for each a ∈ L^I(n) do
      i = index^I(a)
      Construct y^I(i) = 0
      for each b ∈ L^I(n) do
        j = index^I(b)
        Construct y^I(i) = y^I(i) + A^I(i, j) × x^I(j)
      end for
    end for
  end for
  Exchange: PAREXC(y^I)
end for
for each subdomain I do
  for each n ∈ N^I_int do
    for each a ∈ L^I(n) do
      i = index^I(a)
      Construct y^I(i) = 0
      for each b ∈ L^I(n) do
        j = index^I(b)
        Construct y^I(i) = y^I(i) + A^I(i, j) × x^I(j)
      end for
    end for
  end for
end for
MPI_Waitall
for each subdomain I do
  Assemble: y^I = PARASM(y^I)
end for


Fig. 8 Scalability test: computational mesh

Algorithm 4 describes the dot product. As mentioned previously, the loop over the nodes excludes those belonging to the set N^I_bou,oth of each subdomain.

Algorithm 4 The parallel dot product DOTPRO

for each subdomain I do
  α = 0
  for each n ∈ N^I do
    i = index^I(n)
    if n ∈ N^I_int or n ∈ N^I_bou,own then
      α = α + x^I(i) × y^I(i)
    end if
  end for
end for
MPI_Allreduce of α

It must be emphasized that the functions PAREXC and PARASM are used to compute many other arrays during the execution of the code, such as the mass matrix and the diagonal of the stiffness matrix used to construct the inverse diagonal preconditioner.

4.5 Scalability Test: Shear Stress of a Milling Cutter Punch

This example, run with the implicit scheme, addresses the capability of Alya to deal with structures composed of millions of elements while maintaining optimal scalability. The test consists of a linear elastic drill (with arbitrary mechanical properties) held in a shank. Gravity and a point force along one of its lips are applied. The computational mesh is a regular mesh of 8.5 million tetrahedra, see Fig. 8. The tests have been run on the MareNostrum cluster. The machine consists of 3,056 nodes, with two 8-core processors (Intel Xeon E5-2670 at 2.6 GHz) and 32 GBytes of memory per node. Figure 9 shows the displacement (left) and the maximum and minimum shear stress after 20 time steps. Figure 10 shows the scalability obtained up to 1,024 CPUs. Note that the curve loses its optimality at 1,024 CPUs, due to the small number of elements per CPU. In the next section the scalability with a finer mesh is studied.

Fig. 10 Scalability test: scalability for a fixed number of iterations with a mesh of 8.5 million elements

5 Hybrid Based Strategy for Parallelization

The most important trends in contemporary computer architecture are currently leaning towards processors with multiple cores per socket with access to the same memory bank. This approach allows running at lower frequencies with better overall performance than a single-core processor. The parallelization strategy for this architecture is multithreading: a Master thread forks a number of Worker threads, and the computation is divided among them. The data communication and the synchronization between threads are done using the shared memory inside the multiprocessor. This strategy is known as parallelism at the thread level, and OpenMP is the standard interface for this model [11].

As detailed in the previous section, parallelization in Alya is mainly done via MPI. The domain decomposition strategy implemented only uses parallelism at the task level, which is provided by MPI. For that reason, a hybrid OpenMP/MPI code is developed in order to take advantage of all levels of parallelism that a multicore architecture offers and also to enable one MPI task to access all the memory of a node. The main feature of this multicore architecture relies on the fact that both the shared memory inside the processor and the message passing between nodes are leveraged. The structure consists of two steps: first, one task is assigned to each node and, second, the node creates a thread per core, see Fig. 11. An important reduction of the communication cost between cores is then expected.

Fig. 9 Scalability test: displacement field (left) and maximum and minimum shear stress field (right). The minimum value is shown in blue and the maximum in red. (Color figure online)

Fig. 11 Multicore architecture for a hybrid OpenMP/MPI framework

5.1 Introducing OpenMP into the MPI Parallel Code

To exploit the thread-level parallelism of a multicore architecture, OpenMP directives are added to the MPI parallel code. The first step consists in selecting the routines that consume the most total execution time. Among these routines, not all are susceptible to be parallelized at the thread level: they must contain a loop structure with independent iterations. The targeted loops must contain a large body of code, since a considerable amount of computational work is needed to hide the overhead of thread management.

Table 1 depicts the most time-consuming routines of Alya in terms of the total execution time in the explicit case. Note that the cost refers to the percentage of the total execution time spent, including the calls to other subroutines. In order to select the proper subroutine to be optimized by the parallelization with OpenMP, it is important to identify the calls between them. For instance, the routine that computes the matrix costs the same as the assembly one, which means that the time spent by the matrix construction routine itself is minimal, since it essentially consists of a call to the assembly subroutine. Hence, the assembly routine is the one that has to be optimized.

Table 1 Routine costs in terms of the sum of the time of each function, including the time of the called subfunctions, with respect to the total execution time in the explicit case

Cost (%)   Routine                        Description
…          …                              …
93.6       Iterative scheme               Controlling the internal loop of the equations
93.6       Explicit time integration      Explicit time stepping scheme (Newmark)
93.5       Iterative solution             Solving an iteration of the equations
93.5       Matrix                         Computing the elemental matrix and RHS
93.5       Assembly                       Assembling the matrix and RHS
36.4       Finite element computations    Computing derivatives at Gauss points
…          …                              …

In the FE matrix and RHS assembly routine, the discrete system to be solved is formed by looping over all the elements of the mesh, adding the contributions from each element to the global matrix and RHS. This routine is expensive for a number of reasons: significant loop nesting, the assembly of many matrices, and several calls to other subroutines (for instance, to constitutive models).

In order to introduce thread parallelism in the assembly loop, two main OpenMP directives are used to indicate to the compiler how to parallelize appropriately:

– Guided scheduling: since the workload within each iteration is not the same, iterations are assigned to threads in blocks. In guided scheduling the block size decreases with each iteration (in contrast with dynamic scheduling), thus obtaining a better trade-off between the thread management time and the workload balance.

– Data scope: the variables that are shared among the iterations are visible to all threads, while the ones that have an independent value among iterations have a different copy per thread.

A critical point of thread parallelism lies in the so-called race conditions, which might lead to non-deterministic results. Race conditions arise when different threads try to update the same variable or array at the same time. Code paths accessing and manipulating shared data simultaneously are known as critical regions. It is clear that both the operation of assembling an element into a local matrix and the addition of that local matrix into the global matrix must be thread safe. Even though these regions cannot be fully avoided, it is important to minimize their occurrence, because their abuse can serialize the execution, hence losing the parallel efficiency of the code. One possible solution is to specify different region names, combined with the use of atomic clauses (uninterruptible instructions) for single memory locations.

Fig. 12 Scalability of the assembly routine parallelized with OpenMP in Alya-solidz

Such a scheme was implemented in Alya, paying particular attention to the specificities of solid mechanics codes. In order to evaluate the gain in execution time that the OpenMP approach provides, a reduced case was computed. As an illustration of the strategy used here for all identified subroutines, Fig. 12 depicts the speedup analysis of the assembly routine for both explicit and implicit schemes. The performance with one thread is the same as that of the sequential version, and the threaded version for both schemes speeds up quasi-optimally up to at least 16 cores.
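The sketch below shows, in C, the kind of directives discussed above applied to a generic right-hand-side assembly loop; the element routine, connectivity layout and array names are hypothetical placeholders rather than Alya's data structures:

    #include <omp.h>

    #define NNODE 4   /* nodes per element, e.g. linear tetrahedra */

    /* Hypothetical element routine (dummy body): element RHS contribution. */
    static void element_rhs(int e, double fe[NNODE]) {
        for (int a = 0; a < NNODE; ++a) fe[a] = 1.0e-3 * (double)((e + a) % 7);
    }

    /* Assembly of a global RHS with guided scheduling and atomic updates. */
    void assemble_rhs(int nelem, const int *connec, double *rhs) {
        #pragma omp parallel for schedule(guided) default(none) \
                shared(nelem, connec, rhs)
        for (int e = 0; e < nelem; ++e) {
            double fe[NNODE];                  /* private per-thread element array */
            element_rhs(e, fe);
            for (int a = 0; a < NNODE; ++a) {
                int i = connec[e*NNODE + a];   /* global node of local node a */
                #pragma omp atomic
                rhs[i] += fe[a];               /* thread-safe scatter into the RHS */
            }
        }
    }

The atomic clause protects only the single global update, so the bulk of the element computation runs without synchronization; guided scheduling then balances elements whose cost varies, for instance with the constitutive model.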

6 Numerical Examples

This section presents several examples of different nature with the purpose of showing the applicability, scalability, performance and flexibility of Alya when solving solid mechanics problems.

6.1 Example 1: Mesh Multiplication

In petascale applications, the pre- and post-processing tasks are becoming a bottleneck in the complete simulation cycle. Techniques such as parallel I/O have been introduced to mitigate these effects in post-processing, but these are only effective within a limited range. Mesh Multiplication (MM) was introduced as an alternative [47]. This technique consists in refining the mesh uniformly, recursively, on-the-fly and in parallel. For tetrahedra, hexahedra and prisms, each level multiplies the number of elements by eight, while a pyramid is divided into ten new elements. Apart from generating a fine mesh in parallel, the MM strategy makes it possible to carry out the simulation on this fine mesh without having to permanently store it at any time during the simulation process. This technique is also very useful for studying mesh convergence as well as weak or strong scalability [47]. Figure 13 shows the recursive MM algorithm.

Fig. 13 Example 1: Mesh multiplication algorithm

The example consists of a large linear elastic structure with arbitrary mechanical properties under its own weight and a magnetic load. The initial mesh is composed of 491,415 tetrahedral elements. As an example of the efficiency of the algorithm, a mesh of 250 million elements was obtained in just 1 second on 8,000 CPUs. Figure 14 shows the original mesh and the one obtained after MM. Figure 15 depicts the displacement field after 200 iterations using an implicit scheme and the hybrid version of Alya.

Figure 16 shows the speedup obtained with a mesh of 32 million elements (2-level mesh) from 64 to 2,048 CPUs. The code shows optimal scalability when solving mechanical problems. Note that, in contrast with the drill problem, the number of elements per core is significantly larger.
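As a consistency check of the element counts quoted above (one uniform refinement level multiplies a tetrahedral mesh by 8): 491,415 × 8 ≈ 3.9 million elements after one level, 491,415 × 8^2 ≈ 31.5 million after two levels (the mesh used for this scalability study), and 491,415 × 8^3 ≈ 251.6 million after three levels (the fine mesh generated in about a second on 8,000 CPUs).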

6.2 Example 2: The Chimera Method—Two-Material Cube

As an illustration of the flexibility of the Alya-solidz module, the Chimera method is applied to the solid mechanics equations described in Sect. 2.1. This family of methods allows handling non-conforming and overlapping meshes, thus simplifying the construction of computational meshes for complex geometries.

The origin of the Chimera method is found in the context of Computational Fluid Dynamics. It was originally developed by Steger and coworkers [22,75,76]. It consists in superimposing an independent mesh, referred to as the patch mesh, on a larger mesh covering the overall computational domain and called the background mesh. In Alya, the patch mesh and its corresponding constitutive properties are responsible for the mechanical deformation of the body it covers, while the mechanics of the rest of the body is defined by the part of the background mesh not covered by the patch mesh. The coupling between the boundary of the patch mesh and the background mesh is then enforced by additional transmission conditions. In the framework of solid mechanics, this non-local approach allows for the independent meshing of complex intertwined geometries without being constrained by mesh conformity conditions at their boundaries. See Ref. [29] for more details.

Fig. 14 Example 1: Initial mesh of the structure (left) and detail of the mesh after mesh multiplication (right). Fusion reactor vacuum vessel. Geometry and mesh property of F4E

Fig. 15 Example 1: Displacement field after 200 iterations using an implicit scheme

Fig. 16 Example 1: Scalability of the fusion reactor problem with a mesh of 32 million elements

Fig. 17 Example 2: Extension elements and hole (left) and boundary conditions (right)

The Chimera method is applied here to the solution of the 3D example shown in Fig. 17. The geometry is composed of two different Neo-Hookean materials, one of which corresponds to the spherical patch mesh enclosed in a second material. In this example arbitrary constitutive parameters have been considered for both materials, but material 2 (within the sphere) is stiffer than material 1. The cube is subjected to an increasing y-displacement on the top face, the bottom face being constrained in the y-direction. Additional lateral boundary conditions are applied to avoid rigid body motion (not shown for clarity).
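As a purely conceptual sketch of the first step of such an overset strategy, and not of the actual algorithm of Ref. [29] implemented in Alya, the snippet below flags the background-mesh elements covered by a spherical patch such as the one used here: fully covered elements form the hole removed from the background problem, while partially covered ones are candidates for the extension elements carrying the transmission conditions. The function name and the tiny mesh are illustrative assumptions.

    import numpy as np

    def classify_background(nodes, elements, centre, radius):
        """Label background elements with respect to a spherical patch.

        nodes:    (n_nodes, 3) background-mesh coordinates
        elements: (n_elem, 4) tetrahedral connectivity
        Returns, per element: 'hole' (fully covered, removed from the
        background problem), 'extension' (partially covered, candidate to
        carry the transmission conditions) or 'background' (untouched).
        """
        covered = np.linalg.norm(nodes - centre, axis=1) < radius   # per node
        n_cov = covered[elements].sum(axis=1)                       # per element
        return np.where(n_cov == elements.shape[1], 'hole',
                        np.where(n_cov > 0, 'extension', 'background'))

    # Tiny illustrative background mesh: two tetrahedra sharing a face.
    nodes = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.],
                      [0., 0., 1.], [1., 1., 1.]])
    elements = np.array([[0, 1, 2, 3], [1, 2, 3, 4]])
    print(classify_background(nodes, elements,
                              centre=np.array([0., 0., 0.]), radius=1.2))
    # -> ['hole' 'extension']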


Table 2 Example 2: Number of elements and mesh size

                  Total elements   Extension elements   Hole elements   Element size
Mesh 1            374,848          14,848               17,120          0.0029
Mesh 2            2,939,664        59,664               151,448         0.00036
Reference mesh    10,790,400       –                    –               0.000093

Fig. 18 Example 2: Solution on the mid yz-plane for the three meshes: y-displacement (top), stress along the y-direction (middle) and computational mesh (bottom)


The problem was solved on two different meshes with the Chimera method: a coarse one of 400,000 elements and a fine one of 3 million elements. A reference solution without the Chimera method was also computed on a refined mesh of 10 million elements. Table 2 shows the geometrical details of the three different meshes. Figure 18 shows the displacement and stress in the y direction, as well as the corresponding mesh, obtained on the mid yz-plane for the coarse and fine meshes of the Chimera method. For the Chimera meshes, the extension elements used to couple the patch and background meshes are delimited in red. Figure 19 shows the vertical displacement and stress along an additional cut. The solution on both meshes with the Chimera method is compared with the reference one. A good qualitative agreement is observed for the displacement even for the larger mesh. The stress exhibits a lower convergence, but this should be tempered by the fact that the meshes are only first order for the solution derivatives. In the overlapping zone, covered by the extension elements, two solutions coexist: the one obtained

on the patch and the one obtained on the background. The discontinuity in the stress inside the overlapping zone is appreciable for the coarsest mesh, but this jump decreases when refining the mesh. Nevertheless, the maximum error between the Chimera solution computed on the fine mesh and the reference solution is below 10−4.

6.3 Example 3: Coupled Electro-mechanical Model of the Heart

In this case, Alya is used to simulate cardiac ventricular contraction, showing its potential for simulating coupled problems. See Refs. [56,80] for a complete discussion of the methodology. The heart is made of elongated cells called myocytes arranged in a compact, fibered, helicoidal structure. As an electrical impulse propagates, the fibers contract longitudinally, making the heart pump the blood out of its cavities thanks to the fiber arrangement. At the organ level, the tissue is


Fig. 19 Example 2: Solution along the mid line of the mid yz-plane for the three meshes: y-displacement (left) and stress along the y-direction (right)

considered as a continuum composite material with anisotropic behaviour.

The cardiac computational model can be decomposed into three parts:

– Electrophysiology: Action potential propagation φ(x_i, t) is modelled through a transient anisotropic diffusion equation with a non-linear reaction term:

  \frac{\partial \phi}{\partial t} = \frac{\partial}{\partial x_i}\left( D_{ij}\, \frac{\partial \phi}{\partial x_j} \right) + L(\phi)  (19)

Diffusion is governed by the tensor D_{ij}, which represents the fiber orientation at each physical point. Its diagonal components are the axial and crosswise fibre diffusion. The crosswise diffusion is around one third of the axial diffusion. Finally, L(φ) is the non-linear reaction term corresponding to the cell model. Depending on the model used, this term ranges from a simple cubic equation to a system of one hundred ordinary differential equations, coupled and solved simultaneously. This term models the ionic current interactions behind the muscular cellular mechanisms.

– Mechanical deformation: From a mechanical point of view, cardiac tissue is a thick layered structure: endo-, epi- and myocardium. The material is compressible hyperelastic, with anisotropic behaviour ruled by the fibre structure. The composite character of the material is determined by the internal stress, which is developed in two parts, active and passive:

  \sigma = \sigma_{pas} + \sigma_{act}(\lambda, [Ca^{2+}])\, f \otimes f  (20)

The passive part is governed by a transversely isotropic exponential strain energy function W(b) that relates the Cauchy stress σ to the left Cauchy–Green deformation tensor b [56]:

  J\sigma_{pas} = \left(a\, e^{b(I_1-3)} - a\right) b + 2 a_f (I_4 - 1)\, e^{b_f (I_4-1)^2}\, f \otimes f + K(J-1) I  (21)

The strain invariant I_1 represents the non-collagenous material, while the strain invariant I_4 represents the stiffness of the muscle fibers; a, b, a_f and b_f are parameters to be determined experimentally, K is the bulk modulus and f defines the fibre direction. The active part comes through the electro-mechanical coupling and is described below (a numerical evaluation of Eqs. (20)–(22) is sketched after this list).

– Electro-mechanical Coupling: The contracting electrical component of the electro-mechanical coupling is initiated almost simultaneously in several zones of the ventricular epicardium. Cardiac mechanical deformation is the result of the active tension generated by the myocytes. The model assumes that the active stress is produced only in the direction of the fibre and depends on the calcium concentration of the cardiac cell:

  \sigma_{act} = \gamma\, \frac{[Ca^{2+}]^n}{[Ca^{2+}]^n + C_{50}^n}\, \sigma_{max}\left(1 + \beta(\lambda_f - 1)\right)  (22)

In this equation, C_50, n, σ_max, β and 0 < γ < 1 are model parameters, and λ_f is the stretch in the fibre direction.
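As an illustration of how Eqs. (20)–(22) can be evaluated at a single material point, the following sketch (ours, not code from Alya) uses numpy with arbitrary parameter values and a hypothetical deformation gradient; the push-forward f = F f_0 of the reference fibre direction and the use of the fibre stretch λ_f = sqrt(I_4) are assumptions of the sketch.

    import numpy as np

    # Arbitrary illustrative parameters for the passive law, Eq. (21) ...
    a, b, a_f, b_f, K = 0.24, 5.08, 1.46, 4.19, 50.0
    # ... and for the active law, Eq. (22).
    sigma_max, C50, n, beta, gamma = 50.0, 0.5, 2, 1.45, 0.6
    Ca = 0.8                                 # calcium concentration (made up)

    F = np.array([[1.10, 0.02, 0.00],        # hypothetical deformation gradient
                  [0.00, 0.95, 0.00],
                  [0.00, 0.00, 0.97]])
    f0 = np.array([1.0, 0.0, 0.0])           # reference fibre direction

    J = np.linalg.det(F)
    B = F @ F.T                              # left Cauchy-Green tensor b
    C = F.T @ F                              # right Cauchy-Green tensor
    I1 = np.trace(B)
    I4 = f0 @ C @ f0                         # squared fibre stretch
    lam_f = np.sqrt(I4)                      # fibre stretch (assumption)
    f = F @ f0                               # deformed fibre direction (assumption)
    ff = np.outer(f, f)

    # Passive Cauchy stress, Eq. (21), divided through by J.
    sigma_pas = ((a * np.exp(b * (I1 - 3.0)) - a) * B
                 + 2.0 * a_f * (I4 - 1.0) * np.exp(b_f * (I4 - 1.0)**2) * ff
                 + K * (J - 1.0) * np.eye(3)) / J

    # Active fibre tension, Eq. (22).
    sigma_act = gamma * Ca**n / (Ca**n + C50**n) * sigma_max * (1.0 + beta * (lam_f - 1.0))

    # Total Cauchy stress, Eq. (20).
    sigma = sigma_pas + sigma_act * ff
    print(sigma)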

Both the electrophysiology and the mechanical action are simulated in Alya on the same mesh, as fine as the case demands. The time integration scheme is staggered, with both problems solved explicitly. The fibre fields can come either from measurements (using the so-called diffusion tensor MRI technique) or from semi-empirical rule-based models [56].
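A minimal skeleton of such a staggered explicit loop is the following; the solvers are replaced by single-degree-of-freedom stubs, and the cubic reaction term and the calcium surrogate are illustrative assumptions. The sketch only conveys the order of operations at each time step (electrophysiology update, calcium-driven active stress, mechanics update), not Alya's actual implementation.

    def reaction(phi):
        # Illustrative cubic reaction term L(phi) (FitzHugh-Nagumo-like), an
        # assumption standing in for the cell model.
        return phi * (1.0 - phi) * (phi - 0.1)

    def electrophysiology_step(phi, dt):
        # Stub for the explicit update of Eq. (19); the diffusion term is
        # omitted in this single-point sketch.
        return phi + dt * reaction(phi)

    def active_stress(phi):
        # Stub coupling: the potential is used as a surrogate for [Ca2+]
        # in Eq. (22), with made-up constants.
        return 50.0 * phi / (phi + 0.5)

    def mechanics_step(u, s_act, dt):
        # Stub explicit mechanics update driven by the active stress.
        return u + dt * 1.0e-3 * s_act

    phi, u, dt = 0.2, 0.0, 1.0e-3
    for step in range(1000):
        phi = electrophysiology_step(phi, dt)   # 1) propagate the potential
        s_act = active_stress(phi)              # 2) electro-mechanical coupling
        u = mechanics_step(u, s_act, dt)        # 3) deform the tissue
    print(phi, u)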

Figure 20 shows several snapshots of a bi-ventricular geometry during systole. Figure 21 shows the evolution of the ejection rate, which represents the heart pumping action, measured as the change in the volume of the ventricular cavities.

6.4 Example 4: Fluid–Structure Interactions—Wall’s Problem

This problem [84] is proposed to demonstrate the ability to solve FSI problems with complex interactions between the flow and a flexible structure exhibiting large deformations.

The problem consists of a thin non-linear elastic shell attached to a fixed square rigid body, both of which are submerged


Fig. 20 Example 3: Bi-ventricular electrical activation propagation of a dog heart during systole

Fig. 21 Example 3: Heart ejection rate: ventricular volume (ml) versus t (ms)

in an incompressible fluid flow. Both media have arbitrary mechanical properties, chosen such that the Reynolds number is Re = 204 when the side length of the square rigid body is taken as the characteristic length. Inflow boundary conditions are imposed on the left wall of the fluid domain, outflow on the right wall and slip boundary conditions at the top and bottom

Fig. 22 Example 4: Boundary conditions

Fig. 23 Example 4: reference finite element mesh


Fig. 24 Example 4: fluid velocity field for different times

walls, see Fig. 22. No-slip boundary conditions are imposed along the body and the structure. Again, both problems are solved using Alya.

The finite element mesh is shown in Fig. 23. It is a hybrid mesh of 6,919 elements, composed of tetrahedra and quadrilateral elements. The fluid domain contains 6,695 elements, while the solid one has 224 elements.

A weak coupling scheme [15,78] was used to couple the fluid with the structure. The key point of the strategy implemented in Alya is to divide the problem geometrically into non-overlapping fluid and solid zones, so that, by properly distributing the number of processors working within each zone, the computational work performed by the CPUs is balanced.
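One way such a zone-based distribution of processors can be expressed with MPI is sketched below using mpi4py; the rank split ratio and the zone names are illustrative assumptions and do not reproduce Alya's actual load-balancing logic.

    from mpi4py import MPI

    world = MPI.COMM_WORLD
    rank, size = world.Get_rank(), world.Get_size()

    # Illustrative split: most ranks work on the (much larger) fluid zone,
    # the remaining ones on the solid zone; the 1/32 ratio is arbitrary.
    n_solid = max(1, size // 32)
    zone = 'solid' if rank < n_solid else 'fluid'

    # One sub-communicator per zone, so that assembly and solver
    # reductions stay inside each zone ...
    zone_comm = world.Split(color=0 if zone == 'solid' else 1, key=rank)

    # ... while interface data (tractions, displacements) would be
    # exchanged through the world communicator at each coupling step.
    if rank == 0:
        print(f"{size} ranks: {n_solid} solid, {size - n_solid} fluid")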

Figure 24 shows the structural displacement at times t = 0, 3.58, 7.01 and 15.58 s, along with the fluid velocity. Note that the mesh adapts to the displacement of the solid structure, since the ALE formulation used here computes the convective velocity of the fluid as the difference between the material and mesh velocities. The results can be compared with those obtained in Ref. [84].
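As a minimal numerical illustration of this ALE ingredient (not Alya code), the convective velocity at each node is simply the material velocity minus the mesh velocity; the nodal values below are made up.

    import numpy as np

    # Nodal fluid (material) velocities and mesh velocities, shape (n_nodes, dim).
    u_fluid = np.array([[1.0, 0.0], [0.8, 0.1], [0.5, -0.2]])
    u_mesh  = np.array([[0.0, 0.0], [0.2, 0.1], [0.4, -0.1]])

    # ALE convective velocity c = u - u_mesh: a node moving exactly with
    # the flow sees no convection at all.
    c = u_fluid - u_mesh
    print(c)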

7 Conclusion

This paper presents the solid mechanics module of the Alya code for solving linear and non-linear continuum problems with large deformations, also addressing its potential for solving complex coupled problems in parallel. The parallelization strategy used in Alya is formally described in the context

of MPI communications. To enhance the massively parallel performance of the code, OpenMP directives were added to the existing MPI Alya code. The code was tested successfully using thousands of CPUs to solve problems with up to millions of elements. The flexibility of the solid module in Alya was demonstrated by using more complex preprocessing techniques, such as a mesh multiplication algorithm and the Chimera method. These were successfully tested on three-dimensional structural cases, showing optimal results. Examples of FSI and electro-mechanical coupling were also shown, demonstrating the capability of the code for solving multiphysics problems with high-performance computing. Future work aims at implementing more complex FE techniques and overcoming the barrier of more than 100,000 CPUs.

Acknowledgments D.D.T. and A.J. acknowledge funding through the SIMUCOMP and ERA-NET MATERA+ projects financed by the Consejería de Educación y Empleo of the Comunidad de Madrid and by the European Union's Seventh Framework Programme (FP7/2007-2013). This work was partially supported by the grant SEV-2011-00067, Severo Ochoa Program, awarded by the Spanish Government. The authors would like to acknowledge PRACE infrastructure support.

References

1. Alya system. http://www.bsc.es/computer-applications/alya-system
2. BigDFT. http://bigdft.org/Wiki/index.php?title=Presenting_BigDFT
3. Code_Aster. http://www.code-aster.org/
4. Code_Saturne. http://code-saturne.org


5. FEBio. http://febio.org/
6. METIS, family of multilevel partitioning algorithms. http://glaros.dtc.umn.edu/gkhome/views/metis
7. OpenACC. https://developer.nvidia.com/openacc
8. OpenCL. https://developer.nvidia.com/opencl
9. OpenFOAM. http://www.openfoam.com/
10. OpenMP. http://openmp.org/
11. OpenMP. http://openmp.org/wp/
12. Summary of available software for sparse direct methods. http://www.cise.ufl.edu/research/sparse/codes/
13. Adams M (1999) Parallel multigrid algorithms for unstructured 3D large deformation elasticity and plasticity finite element problems. Technical report UCB/CSD-99-1036, EECS Department, University of California, Berkeley. http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/5398.html

14. Adams MF (2007) Algebraic multigrid methods for direct frequency response analyses in solid mechanics. Comput Mech 39:497–507
15. Malan AG, Oxtoby O (2013) An accelerated, fully-coupled, parallel 3D hybrid finite-volume fluid–structure interaction scheme. Comput Methods Appl Mech Eng 253:426–438
16. Amestoy P, Duff I, L’Excellent JY (2000) Multifrontal parallel distributed symmetric and unsymmetric solvers. Comput Methods Appl Mech Eng 184(2–4):501–520
17. Arbenz P, van Lenthe G, Mennel U, Müller R, Sala M (2008) A scalable multi-level preconditioner for matrix-free μ-finite element analysis of human bone structures. Int J Numer Methods Eng 73:937–947. doi:10.1002/nme.2101
18. Badia S, Martin A, Principe J (2014) A highly scalable parallel implementation of balancing domain decomposition by constraints. SIAM J Sci Comput 36(2):C190–C218
19. Balay S, Abhyankar S, Adams MF, Brown J, Brune P, Buschelman K, Eijkhout V, Gropp WD, Kaushik D, Knepley MG, McInnes LC, Rupp K, Smith BF, Zhang H (2014) PETSc Web page. http://www.mcs.anl.gov/petsc
20. Balay S, Adams MF, Brown J, Brune P, Buschelman K, Eijkhout V, Gropp WD, Kaushik D, Knepley MG, McInnes LC, Rupp K, Smith BF, Zhang H (2013) PETSc users manual. Technical report ANL-95/11—revision 3.4, Argonne National Laboratory. http://www.mcs.anl.gov/petsc
21. Becker G, Noels L (2013) A full-discontinuous Galerkin formulation of nonlinear Kirchhoff–Love shells: elasto-plastic finite deformations, parallel computation, and fracture applications. Int J Numer Methods Eng 93:80–117. doi:10.1002/nme.4381
22. Benek J (1986) Chimera. A grid-embedding technique. Technical report, DTIC Document
23. Bhardwaj M, Pierson K, Reese G, Walsh T, Day D, Alvin K, Peery J, Farhat C, Lesoinne M (2002) Salinas: a scalable software for high-performance structural and solid mechanics simulations. In: Proceedings of the 2002 ACM/IEEE conference on supercomputing, SC’02. IEEE Computer Society Press, Los Alamitos, pp 1–19
24. Blatt M (2009) Dune on BlueGene/P. In: Proceedings of 15th SciComp. University Press
25. Casasdei F, Avotins J (1997) A language for implementing computational mechanics applications. In: Technology of object-oriented languages and systems, 1997. TOOLS 25, Proceedings, pp 52–67
26. Ciccozzi F (2013) Towards code generation from design models for embedded systems on heterogeneous CPU-GPU platforms. In: 2013 IEEE 18th conference on emerging technologies factory automation (ETFA), pp 1–4
27. Cirak F, Deiterding R, Mauch S (2007) Large-scale fluid–structure interaction simulation of viscoplastic and fracturing thin-shells subjected to shocks and detonations. Comput Struct 85:1049–1065. doi:10.1016/j.compstruc.2006.11.014
28. Duff I, Reid J (1983) The multifrontal solution of indefinite sparse symmetric linear equations. ACM Trans Math Softw 9(3):302–325
29. Eguzkitza B, Houzeaux G, Aubry R, Owen H, Vázquez M (2013) A parallel coupling strategy for the Chimera and domain decomposition methods in computational mechanics. Comput Fluids 80:128–141
30. El maliki A, Fortin M, Tardieu N, Fortin A (2010) Iterative solvers for 3D linear and nonlinear elasticity problems: displacement and mixed formulations. Int J Numer Methods Eng 83(13):1780–1802
31. Falgout RD, Yang UM (2002) hypre: a library of high performance preconditioners. In: Preconditioners, Lecture notes in computer science, pp 632–641
32. Farhat C, Roux FX, Oden JT (1994) Implicit parallel processing in structural mechanics. Elsevier Science SA, Amsterdam
33. Flaig C, Arbenz P (2011) A scalable memory efficient multigrid solver for micro-finite element analyses based on CT images. Parallel Comput 37:846–854. doi:10.1016/j.parco.2011.08.001
34. George A, Liu JW (1981) Computer solution of large sparse positive definite systems. Prentice Hall, Englewood Cliffs (Professional technical reference)
35. Gerstenberger A, Tuminaro R (2013) An algebraic multigrid approach to solve extended finite element method based fracture problems. Int J Numer Methods Eng 94:248–272. doi:10.1002/nme.4442
36. Geuzaine C, Remacle JF (2009) Gmsh: a 3-D finite element mesh generator with built-in pre- and post-processing facilities. Int J Numer Methods Eng 79(11):1309–1331
37. Goudreau G, Hallquist J (1982) Recent developments in large-scale finite element Lagrangian hydrocode technology. Comput Methods Appl Mech Eng 33:725–757. doi:10.1016/0045-7825(82)90129-3
38. Gould NIM, Scott JA, Hu Y (2007) A numerical evaluation of sparse direct solvers for the solution of large sparse symmetric linear systems of equations. ACM Trans Math Softw 33(2):300–331
39. Grinberg L, Pekurovsky D, Sherwin SJ, Karniadakis GE (2009) Parallel performance of the coarse space linear vertex solver and low energy basis preconditioner for spectral/hp elements. Parallel Comput 35(5):284–304
40. Gupta A, Koric S, George T (2009) Sparse matrix factorization on massively parallel computers. In: Proceedings of the conference on high performance computing networking, storage and analysis, SC’09, ACM, New York, pp 1:1–1:12
41. Göddeke D, Wobker H, Strzodka R, Mohd-Yusof J, McCormick P, Turek S (2009) Co-processor acceleration of an unmodified parallel solid mechanics code with FEASTGPU. Int J Comput Sci Eng 4(4):254–269
42. Hales JD, Novascone SR, Williamson RL, Gaston DR, Tonks MR (2012) Solving nonlinear solid mechanics problems with the Jacobian-free Newton Krylov method. Comput Model Eng Sci 84:84–123
43. Heath M, Ng E, Peyton B (1991) Parallel algorithms for sparse linear systems. SIAM Rev 33(3):420–460
44. Heil M, Hazel A, Boyle J (2008) Solvers for large-displacement fluid–structure interaction problems: segregated versus monolithic approaches. Comput Mech 43(1):91–101
45. Heroux MA, Bartlett RA, Howle VE, Hoekstra RJ, Hu JJ, Kolda TG, Lehoucq RB, Long KR, Pawlowski RP, Phipps ET, Salinger AG, Thornquist HK, Tuminaro RS, Willenbring JM, Williams A, Stanley KS (2005) An overview of the Trilinos project. ACM Trans Math Softw 31(3):397–423
46. Hess B, Kutzner C, van der Spoel D, Lindahl E (2008) GROMACS 4: algorithms for highly efficient, load-balanced, and scalable molecular simulation. J Chem Theory Comput 4(3):435–447. doi:10.1021/ct700301q
47. Houzeaux G, de la Cruz R, Owen H, Vázquez M (2013) Parallel uniform mesh multiplication applied to a Navier–Stokes solver. Comput Fluids 80:142–151


48. Houzeaux G, Vázquez M, Aubry R, Cela J (2009) A massively parallel fractional step solver for incompressible flows. J Comput Phys 228(17):6316–6332
49. Hughes T, Ferencz R (1987) Large-scale vectorized implicit calculations in solid mechanics on a Cray X-MP/48 utilizing EBE preconditioned conjugate gradients. Comput Methods Appl Mech Eng 61:215–248. doi:10.1016/0045-7825(87)90005-3
50. Hughes TJ (2012) The finite element method: linear static and dynamic finite element analysis. Dover Publications
51. Hussain M, Abid M, Ahmad M, Khokhar A, Masud A (2011) A parallel implementation of ALE moving mesh technique for FSI problems using OpenMP. Int J Parallel Program 39:717–745. doi:10.1007/s10766-011-0168-3
52. Jouglard C, Coutinho A (1998) A comparison of iterative multilevel finite element solvers. Comput Struct 69(5):655–670
53. Kilic SA, Saied F, Sameh A (2004) Efficient iterative solvers for structural dynamics problems. Comput Struct 82(28):2363–2375
54. Knoll D, Keyes D (2004) Jacobian-free Newton Krylov methods: a survey of approaches and applications. J Comput Phys 193(2):357–397
55. Komatitsch D, Erlebacher G, Göddeke D, Michéa D (2010) High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster. J Comput Phys 229:7692–7714. doi:10.1016/j.jcp.2010.06.024
56. Lafortune P, Arís R, Vázquez M, Houzeaux G (2012) Coupled electromechanical model of the heart: parallel finite element formulation. Int J Numer Methods Biomed Eng 28:72–86. doi:10.1002/cnm.1494
57. Lang S, Wieners C, Wittum G (2002) The application of adaptive parallel multigrid methods to problems in nonlinear solid mechanics. In: Ramm E, Rank E, Rannacher R, Schweizerhof K, Stein E, Wendland W, Wittum G, Wriggers P, Wunderlich W (eds) Error-controlled adaptive finite elements in solid mechanics, 422 pp, ISBN: 978-0-471-49650-2
58. Li X, Demmel JW (2003) SuperLU_DIST: a scalable distributed-memory sparse direct solver for unsymmetric linear systems. ACM Trans Math Softw 29:110–140
59. Liu WK, Belytschko T, Moran B (2000) Nonlinear finite elements for continua and structures. Wiley, New York
60. Liu Y, Zhou W, Yang Q (2007) A distributed memory parallel element-by-element scheme based on Jacobi-conditioned conjugate gradient for 3D finite element analysis. Finite Elem Anal Design 43:494–503. doi:10.1016/j.finel.2006.12.007
61. Logg A, Mardal KA, Wells GN (eds) (2012) Automated solution of differential equations by the finite element method. Lecture notes in computational science and engineering, vol 84. Springer, Berlin. doi:10.1007/978-3-642-23099-8
62. Löhner R, Mut F, Cebral J, Aubry R, Houzeaux G (2010) Deflated preconditioned conjugate gradient solvers for the pressure-Poisson equation: extensions and improvements. Int J Numer Methods Eng 10196–10208
63. Luebke D (2008) CUDA: scalable parallel programming for high-performance scientific computing. In: 5th IEEE international symposium on biomedical imaging: from nano to macro, 2008. ISBI 2008, pp 836–838
64. Maday Y, Magoulès F (2006) Absorbing interface conditions for domain decomposition methods: a general presentation. Comput Methods Appl Mech Eng 195(29–32):3880–3900
65. Maurer D, Wieners C (2011) A parallel block LU decomposition method for distributed finite element matrices. Parallel Comput 37(12):742–758
66. McCormick SF (1987) Multigrid methods. Frontiers in applied mathematics. Society for Industrial and Applied Mathematics, Philadelphia
67. Moore D, Jérusalem A, Nyein M, Noels L, Jaffee M, Radovitzky R (2009) Computational biology—modeling of primary blast effects on the central nervous system. NeuroImage 47(2):T10–T20. doi:10.1016/j.neuroimage.2009.02.019
68. Owens J, Houston M, Luebke D, Green S, Stone J, Phillips J (2008) GPU computing. Proc IEEE 96(5):879–899
69. Paszyski M, Jurczyk T, Pardo D (2013) Multi-frontal solver for simulations of linear elasticity coupled with acoustics. Comput Sci 12(0). http://journals.agh.edu.pl/csci/article/view/102
70. Quey R, Dawson PR, Barbe F (2011) Large-scale 3D random polycrystals for the finite element method: generation, meshing and remeshing. Comput Methods Appl Mech Eng 200:1729–1745. doi:10.1016/j.cma.2011.01.002
71. Radovitzky R, Seagraves A, Tupek M, Noels L (2011) A scalable 3D fracture and fragmentation algorithm based on a hybrid, discontinuous Galerkin, cohesive element method. Comput Methods Appl Mech Eng 200:326–344. doi:10.1016/j.cma.2010.08.014
72. Saad Y (2003) Iterative methods for sparse linear systems. Society for Industrial and Applied Mathematics
73. Smith BF (1995) Domain decomposition methods for partial differential equations. In: Proceedings of ICASE/LaRC workshop on parallel numerical algorithms. University Press
74. Soto O, Löhner R, Camelli F (2003) A linelet preconditioner for incompressible flow solvers. Int J Numer Methods Heat Fluid Flow 13(1):133–147
75. Steger J, Dougherty FC, Benek J (1983) A Chimera grid scheme. Adv Grid Gener 5:59–69
76. Steger J, Benek J (1987) On the use of composite grid schemes in computational aerodynamics. Comput Methods Appl Mech Eng 64:301–320
77. Stewart JR, Edwards H (2004) A framework approach for developing parallel adaptive multiphysics applications. Finite Elem Anal Design 40(12):1599–1617 (The Fifteenth Annual Robert J. Melosh Competition)
78. Kalro V, Tezduyar TE (2000) A parallel 3D computational method for fluid–structure interactions in parachute systems. Comput Methods Appl Mech Eng 190:1467–1482
79. van Rietbergen B, Weinans H, Huiskes R, Polman B (1996) Computational strategies for iterative solutions of large FEM applications employing voxel data. Int J Numer Methods Eng 39:2743–2767. doi:10.1002/(SICI)1097-0207(19960830)39:162743:AID-NME9743.3.CO;2-1
80. Vázquez M, Arís R, Houzeaux G, Aubry R, Villar P, Garcia-Barnós J, Gil D, Carreras F (2011) A massively parallel computational electrophysiology model of the heart. Int J Numer Methods Biomed Eng 27(12):1911–1929. doi:10.1002/cnm.1443
81. Vázquez M, Houzeaux G, Grima R, Cela J (2007) Applications of parallel computational fluid mechanics in MareNostrum supercomputer: low-Mach compressible flows. In: PARCFD2007, Antalya (Turkey)
82. Vázquez M, Rubio F, Houzeaux G, González J, Giménez J, Beltran V, de la Cruz R, Folch A (2014) Xeon Phi performance for HPC-based computational mechanics codes
83. Waisman H, Berger-Vergiat L (2013) An adaptive domain decomposition preconditioner for crack propagation problems modeled by XFEM. Int J Multiscale Comput Eng 11(6):633–654
84. Wall WA, Ramm E. Fluid–structure interaction based upon a stabilized (ALE) finite element method. In: IV World congress on computational mechanics, Barcelona. CIMNE
85. Wieners C (2001) The application of multigrid methods to plasticity at finite strains. ZAMM J Appl Math Mech [Zeitschrift für Angewandte Mathematik und Mechanik] 81(S3):733–734
86. Zienkiewicz O, Taylor R (2000) The finite element method for solid and structural mechanics, 5th edn. Butterworth-Heinemann, Boston
