Parallel FETI Solver for Modern Architectures


Transcript of Parallel FETI Solver for Modern Architectures

Page 1: Parallel FETI Solver for Modern Architectures

Tests performed on the IT4Innovations Salomon Supercomputer.
Heat Transfer - CG Solver Runtime with Lumped Preconditioner solving 7.5 to 2912 million DOF.
Note: * denotes speedup for structural mechanics.

[Figure] Processing time [s] vs. number of compute nodes [-]; series: CPU - PARDISO - Lumped prec, CPU - SC - Lumped prec, MIC - SC - Lumped prec. Stopping criterion: 1e-4; subdomain size: 4096 DOF; cluster size: 2197 subdomains (7.5 million DOF).

[Figure] A 2D beam decomposed into subdomains 1-4: the FETI method keeps all subdomains separate, while the Hybrid FETI method groups them into cluster 1 and cluster 2.

Total FETI (2D case): the problem decomposed into 4 subdomains has 12 independent rigid motions and generates a coarse problem matrix (GG^T) with dimension 3 * (number of SUBDOMAINS) = 12.

Hybrid Total FETI (2D case): the beam decomposed into 2 clusters (each consisting of N subdomains) has 'only' 6 independent rigid motions in the modified system and generates a coarse problem matrix (GG^T) with dimension 3 * (number of CLUSTERS) = 6. Number of clusters = number of compute nodes.
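Restating the 2D example above as a formula (with d = 3 rigid body modes per 2D domain), the coarse problem dimension scales with the number of domains handled at the top level:

$$\dim(GG^T)_{\mathrm{TFETI}} = d \cdot N_{\mathrm{subdomains}} = 3 \cdot 4 = 12, \qquad \dim(GG^T)_{\mathrm{HTFETI}} = d \cdot N_{\mathrm{clusters}} = 3 \cdot 2 = 6$$

Keeping the coarse problem proportional to the number of clusters (one cluster per compute node) rather than to the number of subdomains is what keeps its solution cheap at scale.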

Hybrid Total FETI Method - Multilevel FETI

FETI and HTFETI References:
[1] Farhat, C.; Roux, F.X.: A Method of Finite-Element Tearing and Interconnecting and Its Parallel Solution Algorithm. International Journal for Numerical Methods in Engineering, Volume 32, Issue 6, 1991.

[2] Klawonn, A.; Rheinbach, O.: Highly Scalable Parallel Domain Decomposition Methods with an Application to Biomechanics. ZAMM - Zeitschrift fur Angewandte Mathematik und Mechanik, Volume 90, Issue 1, 2010.

[3] Kozubek, T.; Vondrak, V.; Mensik, M.; et al.: Total FETI Domain Decomposition Method and Its Massively Parallel Implementation. Advances in Engineering Software, Volume 60-61, 2013.

Why Hybrid FETI scales?

Parallel FETI Solver for Modern Architectures
ESPRESO Solver – espreso.it4i.cz

Lubomír Říha ([email protected]), Tomáš Brzobohatý, Michal Merta, Alexandros Markopoulos, Ondřej Meca, and Tomáš Kozubek
IT4Innovations National Supercomputing Center, Ostrava, Czech Republic - http://www.it4i.cz

ESPRESO Library

Key Features of the ESPRESO Library
• supports FEM and BEM discretization (via the BEM4I library) for advection-diffusion, Stokes flow, and structural mechanics
• support for ANSYS and OpenFOAM database file formats
• multiblock benchmark generator for large scalability tests
• C API allows ESPRESO to be used as a solver library - tested with CSC ELMER
• postprocessing and visualization based on the VTK library and ParaView (support for real-time visualization using ParaView Catalyst)

Massively Parallel Solver
• based on the highly scalable Hybrid Total FETI method – scales to ~18,000 compute nodes
• support for symmetric systems (CG with full orthogonalization) and non-symmetric systems (BiCGStab, GMRES)
• supports modern many-core architectures – GPGPU and Intel Xeon Phi
• contains a pipelined Conjugate Gradient solver – communication hiding
• supports hybrid parallelization in the form of MPI and Cilk++

Key Research Funding Projects
• Intel Xeon Phi acceleration developed under the Intel PCC at IT4Innovations
• Hybrid FETI implementation developed under the EXA2CT FP7 project
• scalability tests on the Titan machine and GPU acceleration developed under the ORNL Director's Discretionary project with 2.7 million core hours

[Figure] FEM discretization: preprocessing runtime [s] (0-250) vs. number of compute nodes (#1, #8, #27, #64, #125); series: FETI processing, Hybrid FETI processing + FETI processing.

FETI Iterative Solver for Many-core Accelerators

• FETI methods rely on sparse direct solvers (mainly forward and backward substitutions) and sparse matrix-vector multiplications
• operations on sparse matrices have low arithmetic intensity and "bad" memory access patterns
• using local Schur complements in the form of dense matrices is still memory bound, but can fully utilize the fast accelerator memory
• this is the main factor that brings the speedup over the CPU with its lower main memory bandwidth

Sparse direct solver variant (pre-processing: factorization of K):
1.) x = B^T λ - SpMV
2.) y = K^+ x - solve (forward/backward substitution with the factors of K)
3.) λ = B y - SpMV
4.) stencil data exchange in λ (MPI - Send and Recv; OpenMP - shared memory vector)

Local Schur complement variant (pre-processing: S = B K^+ B^T):
1.) - nop
2.) λ = S λ - DGEMV, DSYMV
3.) - nop
4.) stencil data exchange in λ (MPI - Send and Recv; OpenMP - shared memory vector)

Accelerated variant (pre-processing: S = B K^+ B^T, transferred to GPU/MIC):
1.) λ → GPU/MIC - PCIe transfer from CPU
2.) λ = S λ - DGEMV, DSYMV on GPU/MIC
3.) λ ← GPU/MIC - PCIe transfer to CPU
4.) stencil data exchange in λ (MPI - Send and Recv; OpenMP - shared memory vector)

90 - 95% of the runtime is spent in the application of the operator (Ap_i).

Projected Conjugate Gradient in FETI

Local Schur Complement Method for FETI
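As a minimal sketch of step 2 of the Schur complement variant above (illustrative only, not the ESPRESO implementation; the matrix layout and function name are assumptions), the per-subdomain dense apply λ := S λ reduces to a single BLAS DSYMV call when S is stored symmetrically:

```cpp
// Minimal sketch (not ESPRESO code): dense apply of a local Schur complement,
// lambda_out = S * lambda_in, using CBLAS DSYMV. Only the upper triangle of S
// needs to be valid; sizes and names are illustrative assumptions.
#include <vector>
#include <cblas.h>   // or the MKL CBLAS header when building with Intel MKL

void apply_local_schur_complement(const std::vector<double>& S,          // n*n, column-major
                                  const std::vector<double>& lambda_in,  // size n
                                  std::vector<double>& lambda_out,       // size n
                                  int n)
{
    // lambda_out = 1.0 * S * lambda_in + 0.0 * lambda_out
    cblas_dsymv(CblasColMajor, CblasUpper, n,
                1.0, S.data(), n,
                lambda_in.data(), 1,
                0.0, lambda_out.data(), 1);
}
```

In the accelerated variant the same operation runs on the GPU/MIC after a one-time transfer of S, with only the λ vectors crossing PCIe in each iteration.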

[Figure] Iterative solver time [s] for symmetric and non-symmetric systems vs. configuration: CPU (24 th.), 2x60, 2x120, 2x240 Xeon Phi threads. Test configurations: 1331 subdomains (stiffness matrix sizes 2187 x 2187, symmetric: 16.2 GB, 500 iterations) and 512 subdomains (stiffness matrix sizes 2187 x 2187, non-symmetric: 12.8 GB, symmetric: 7.6 GB, 500 iterations). Annotated speedup: 2.5x over 2x CPU.

CG Solver Acceleration using the Schur Complement Method on Intel Xeon Phi

[Figure] Local Schur complement calculation: processing time [s] and speedup/slowdown vs. matrix size (2048 to 8192, axis in units of 10^3); series: SC - many RHS, SC - PARDISO, K factorization; curves show the slowdown of SC calculation vs. K factorization and the speedup of SC - PARDISO vs. SC - many RHS; also processing time [s] vs. number of cores [-].

• Schur complement calculation is the main bottleneck of the method
• PARDISO solvers contain efficient algorithms for SC calculation (PARDISO SC and PARDISO MKL exhibit almost identical performance)

Local Schur Complement Calculation using PARDISO

Two Intel Xeon Phi 7120P accelerators are 2.5 times faster than two 12-core Xeon E5-2680v3 CPUs (the symmetric storage format for the SC further improves performance).
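The "SC - many RHS" series in the chart above corresponds to forming S = B K^+ B^T by factorizing K once and then solving against the columns of B^T as many right-hand sides. A minimal dense sketch of that baseline is shown below (illustrative only; ESPRESO works with sparse, regularized K through PARDISO, and the helper name is hypothetical):

```cpp
// Minimal dense sketch (illustration only; not ESPRESO code): compute
// S = B * K^{-1} * B^T by factorizing a regularized K once and solving
// against the columns of B^T as many right-hand sides.
#include <vector>
#include <lapacke.h>
#include <cblas.h>

// K: n x n, symmetric positive definite after regularization (column-major, copied).
// B: m x n (column-major), m = number of Lagrange multipliers on the subdomain.
// S: m x m output (column-major). Returns the LAPACK info code (0 on success).
int schur_complement_many_rhs(std::vector<double> K, const std::vector<double>& B,
                              std::vector<double>& S, int n, int m)
{
    // 1) Cholesky factorization of the regularized K.
    int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, K.data(), n);
    if (info != 0) return info;

    // 2) Y = K^{-1} * B^T  (solve with m right-hand sides).
    std::vector<double> Y(static_cast<size_t>(n) * m);
    for (int col = 0; col < m; ++col)          // Y(:,col) = row 'col' of B
        for (int row = 0; row < n; ++row)
            Y[static_cast<size_t>(col) * n + row] = B[static_cast<size_t>(row) * m + col];
    info = LAPACKE_dpotrs(LAPACK_COL_MAJOR, 'L', n, m, K.data(), n, Y.data(), n);
    if (info != 0) return info;

    // 3) S = B * Y.
    S.assign(static_cast<size_t>(m) * m, 0.0);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, m, n, 1.0, B.data(), m, Y.data(), n, 0.0, S.data(), m);
    return 0;
}
```

The plot compares this straightforward many-RHS approach against PARDISO's built-in Schur complement kernels.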

[Figure] Total solver runtime [s] vs. MPI processes / CPU cores (1000, 2000, 4000, 8000), Real vs. Linear scaling; measured runtimes 652, 224, 87, and 34 s; annotated speedups 7.8, 2.7*, 1.9, 2.2*.

Real World Problems

Intel MKL PARDISO for FETI Solvers:
• Cholesky decomposition and solve using Parallel Studio 2017 beta
• Schur complement and DSYMV using Parallel Studio 2017 beta

Experience with Knights Landing (KNL)

Schur Complement Processing on KNL

Hardware: early access to an Intel Xeon Phi 7210 (64 cores at 1.3 GHz) on Intel's Endeavor cluster, under the umbrella of the Intel PCC at IT4Innovations.

Schur complement calculation using MKL PARDISO
• in the best case the decomposition scales up to 128 threads (MCDRAM)
• 2.3x better performance with MCDRAM than with DDR (19.1 s vs. 43 s)
• the 2x Haswell CPUs of IT4I Salomon are 1.4x faster than KNL (13.4 s vs. 19.1 s)

Apply - symmetric matrix-vector multiplication (MKL)
• in DDR, very bad scalability - only a 1.2x speedup from 8 to 32 threads
• in MCDRAM it scales up to 256 threads, with up to 4x the performance of DDR
• 1x KNL is 2.5x faster than the 2x Haswell CPUs of IT4I Salomon (3.1 s vs. 7.7 s)
• 1x KNL is 2.0x faster than 1x Xeon Phi 7120P

Note: the decomposition on KNL is two times faster than on KNC.

MCDRAM significantly boosts the performance of the Schur complement method in the ESPRESO solver.

One KNL compute node can deliver performance similar to two Intel Xeon Phi 7120P accelerators for the critical part of the FETI-based solvers if the essential data is stored in MCDRAM. This corresponds to the 2.5x speedup over two 12-core Intel Xeon E5-2680v3 @ 2.5 GHz CPUs.
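One possible way to place the dense Schur complement into KNL MCDRAM is the memkind/hbwmalloc interface; the sketch below is illustrative only (ESPRESO's actual allocation strategy is not described here), with an arbitrary problem size:

```cpp
// Minimal sketch (assumes the memkind library; not ESPRESO code): allocate the
// dense local Schur complement in KNL MCDRAM so that the bandwidth-bound DSYMV
// apply streams from high-bandwidth memory instead of DDR.
#include <cstdlib>
#include <cblas.h>
#include <hbwmalloc.h>   // from the memkind library

int main()
{
    const int n = 8192;                                  // illustrative size
    const size_t bytes = static_cast<size_t>(n) * n * sizeof(double);

    // Fall back to a regular DDR allocation if no high-bandwidth memory is found.
    const bool have_hbw = (hbw_check_available() == 0);
    double* S = have_hbw ? static_cast<double*>(hbw_malloc(bytes))
                         : static_cast<double*>(std::malloc(bytes));
    double* x = static_cast<double*>(std::malloc(n * sizeof(double)));
    double* y = static_cast<double*>(std::malloc(n * sizeof(double)));
    if (!S || !x || !y) return 1;

    // ... fill S (symmetric, upper triangle) and x here ...

    // y = S * x: the kernel that dominates the iterative solver runtime.
    cblas_dsymv(CblasColMajor, CblasUpper, n, 1.0, S, n, x, 1, 0.0, y, 1);

    if (have_hbw) hbw_free(S); else std::free(S);
    std::free(x);
    std::free(y);
    return 0;
}
```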

[Figure] HTFETI on ORNL Titan: stacked processing time [s] for FETI Preprocessing, Hybrid FETI Preprocessing, K Regularization and Factorization, and CG Solver Runtime, plotted over problem size [billion DOF] (0.06 to 124) and number of compute nodes [-] (8 to 17,576).

Test performed on ORNL Titan (18,688 compute nodes)

Structural mechanics - 11 billion DOF on up to 17 576 nodes (281 216 cores) and Heat transfer (Laplace equation) - 20 billion DOF on up to 17 576 nodes

Strong Scalability Evaluation of the HTFETI Method

[Figure] Solver runtime [s] vs. number of compute nodes (2,400 to 19,200), Real vs. Linear scaling.

Weak Scalability Evaluation of the HTFETI Method

[Figure] Solver runtime [s] vs. number of compute nodes (2,400 to 19,200), Real vs. Linear scaling.

Heat transfer (Laplace equation) - up to 124 billion DOF on 17576 nodes

Intel Xeon Phi Acceleration of the Iterative Solver

Structural Mechanics - CG Solver Runtime with Lumped Preconditioner solving 192 - 1033 million DOF

[Figure] Processing time [s] vs. number of compute nodes [-]; series: CPU - PARDISO - Lumped prec, CPU - SC - Lumped prec, MIC - SC - Lumped prec. Stopping criterion: 1e-4; subdomain size: 3993 DOF; cluster size: 1000 subdomains (3.1 million DOF); annotated speedups: 2.2 and 2.7.

[Figure] Processing time [s] vs. number of compute nodes / GPUs [-]; series: CPU, GPU - symmetric, GPU - general storage format. Stopping criterion: 1e-3; subdomain size: 12,287 DOF; cluster size: 27 subdomains (0.35 million DOF); annotated speedups: 3.4 and 1.4.

GPGPU Acceleration of the Iterative Solver

Test performed on ORNL Titan.
Structural Mechanics - CG Solver Runtime with Lumped Preconditioner solving 0.3 - 300 million DOF.

Note: an AMD Opteron 6274 16-core CPU with a Tesla K20X GPU (6 GB RAM) and general storage of the SC handles 0.35 million DOF per GPU, a ~10x smaller problem than 2x Intel Xeon Phi 7120 with 2x16 GB RAM and symmetric SC storage.
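For the GPU series above ("GPU - symmetric" vs. "GPU - general storage format"), a minimal cuBLAS sketch of the per-subdomain apply y = S x under the two storage options is shown below (illustrative; whether ESPRESO uses packed symmetric storage is an assumption here):

```cpp
// Minimal sketch (illustrative, not ESPRESO code): apply the dense local Schur
// complement on the GPU, y = S * x, for the two storage formats compared above.
#include <cublas_v2.h>

// General storage: full n x n column-major matrix in device memory.
void apply_schur_general(cublasHandle_t h, const double* dS, const double* dx,
                         double* dy, int n)
{
    const double alpha = 1.0, beta = 0.0;
    cublasDgemv(h, CUBLAS_OP_N, n, n, &alpha, dS, n, dx, 1, &beta, dy, 1);
}

// Symmetric (packed) storage: only the upper triangle, n*(n+1)/2 doubles,
// roughly halving the device memory needed per subdomain.
void apply_schur_symmetric(cublasHandle_t h, const double* dSpacked,
                           const double* dx, double* dy, int n)
{
    const double alpha = 1.0, beta = 0.0;
    cublasDspmv(h, CUBLAS_FILL_MODE_UPPER, n, &alpha, dSpacked, dx, 1, &beta, dy, 1);
}
```

Packed symmetric storage needs only n(n+1)/2 entries per subdomain, which is consistent with the note above that the general storage format limits the 6 GB Titan GPUs to much smaller subdomain sets.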

Tests performed on the IT4Innovations Salomon Supercomputer.
Structural Mechanics - 300 million DOF problem generated from ANSYS Workbench.

2,016 Intel Xeon E5-2680v3, 2.5 GHz, 12 cores; 864 Intel Xeon Phi 7120P, 61 cores, 16 GB RAM

IT4Innovations Salomon

References:
[1] L. Riha, T. Brzobohaty, A. Markopoulos, O. Meca, T. Kozubek: Massively Parallel Hybrid Total FETI (HTFETI) Solver. In: Platform for Advanced Scientific Computing Conference (PASC), ACM, 2016. doi:10.1145/2929908.2929909
[2] L. Riha, T. Brzobohaty, A. Markopoulos, O. Meca, T. Kozubek, O. Schenk, W. Vanroose: Efficient Implementation of Total FETI Solver for Graphic Processing Units Using Schur Complement. In: HPCSE 2015, LNCS 9611, 2016. doi:10.1007/978-3-319-40361-8_6
[3] L. Riha, T. Brzobohaty, A. Markopoulos: Hybrid Parallelization of the Total FETI Solver. Advances in Engineering Software (2016). doi:10.1016/j.advengsoft.2016.04.004
[4] L. Říha, T. Brzobohatý, A. Markopoulos, M. Jarošová, T. Kozubek, D. Horák, V. Hapla: Implementation of the Efficient Communication Layer for the Highly Parallel Total FETI and Hybrid Total FETI Solvers. Parallel Computing (2016). doi:10.1016/j.parco.2016.05.002

Total FETI with Dirichlet preconditioner
Number of compute nodes [-]:                 23     47     93     186    372
Number of MPI processes [-]:                 512    1024   2048   4096   8192
Number of subdomains per MPI process [-]:    64     16     16     8      2
Total number of subdomains [-]:              32768  16384  32768  32768  16384
Average subdomain size [DOF]:                9155   18311  9155   9155   18311
Number of iterations [-]:                    301    273    346    286    177
Total solution time with preprocessing [s]:  683    369    195    115    61

Hybrid Total FETI with Dirichlet preconditioner
Number of compute nodes [-]:                 47      93      186     372
Number of clusters [-]:                      1024    2048    4096    8192
Number of subdomains per cluster [-]:        128     128     32      32
Total number of subdomains [-]:              131072  262144  131072  262144
Average subdomain size [DOF]:                2289    1144    2289    1144
Number of iterations [-]:                    1104    823     522     365
Total solution time with preprocessing [s]:  652     224     87      34

[Diagram] ESPRESO C++ library architecture: inputs from ANSYS/ELMER, the ESPRESO Generator, or the ESPRESO API; Preprocessing (Mesh Processing), Matrix Assembler with FEM/BEM (BEM4I), and TFETI / Hybrid TFETI solvers running on CPU, MIC, and GPU; Postprocessing and Visualization through ParaView Catalyst and EnSight / VisIt.