PG-4037, Fast modal analysis with NX Nastran and GPUs, by Leonard Hoffnung

FAST MODAL ANALYSIS WITH NX NASTRAN AND GPUS

LEONARD HOFFNUNG, SIEMENS PLM SOFTWARE

INTRODUCTION INTRO / FREQ RESP / MODES / CONCLUSIONS

3 | FAST MODAL ANALYSIS WITH NX NASTRAN AND GPUS | NOVEMBER 12, 2013 | CONFIDENTIAL

ABOUT NX NASTRAN

• Industry-standard finite element package from Siemens PLM

• Analysis options include:
  ‒ Stress, vibration, structural failure
  ‒ Heat transfer, acoustics, rotor dynamics, and more

• Advanced numerical capabilities and proven scalability:
  ‒ Problem sizes approaching 1 billion dofs
  ‒ SMP to 24 cores
  ‒ DMP to 2048 nodes


MODAL FREQUENCY RESPONSE OVERVIEW

• Bread-and-butter industrial computation: modal frequency response

• Widely used in automotive and aerospace to determine response under varying excitations
  ‒ Optimize weight, rigidity
  ‒ Minimize noise, resonance

• Two-phase calculation is more efficient than direct:
  ‒ Modal analysis
  ‒ Frequency response calculation

NASTRAN SOL 111


MODAL FREQUENCY RESPONSE

• Eigensolution: $h$ normal modes of $f \times f$ structural matrices:

$$K_{ff} \Phi_{fh} = M_{ff} \Phi_{fh} \Lambda_{hh}$$

• Frequency response: $h \times h$ complex linear solution at each of $nresp$ frequencies:

$$(K_{hh} + i \omega_k B_{hh} - \omega_k^2 M_{hh})\, x_k = b_k, \quad k = 1, \dots, nresp$$

• All parameters large in typical customer usage:
  ‒ $f$-size 10-30M for model fidelity
  ‒ $h$-size 10-60K for modal accuracy
  ‒ $nresp$ 20K for detailed response graph

COMPUTATIONAL STEPS
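The two phases are linked by projecting the physical system onto the computed modes. The block below sketches that standard reduction in its simplest form, assuming mass-normalized modes. The physical response $u_k$, physical load $p_k$, and physical damping matrix $B_{ff}$ are names introduced here for illustration; they do not appear in the slides.

```latex
u_k \approx \Phi_{fh}\, x_k, \qquad b_k = \Phi_{fh}^T p_k, \\
K_{hh} = \Phi_{fh}^T K_{ff} \Phi_{fh} = \Lambda_{hh}, \qquad
M_{hh} = \Phi_{fh}^T M_{ff} \Phi_{fh} = I, \qquad
B_{hh} = \Phi_{fh}^T B_{ff} \Phi_{fh}
```

Under these assumptions $B_{hh}$ is in general a dense matrix, which is why each frequency needs a dense $h \times h$ complex solve.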


PERFORMANCE CASE STUDY

• Shell-dominated SOL 111 model
  ‒ 245K degrees of freedom ($f$-size)
  ‒ 1200 eigenpairs ($h$-size)
  ‒ 20K frequency responses ($nresp$)

• Eigensolution time: 30 minutes

• Frequency response: 127 minutes

• Frequency response cost $O(nresp \cdot h^3)$
  ‒ Estimated run time in decades as $h \to 60\mathrm{K}$

PR MODEL – FREQUENCY RESPONSE COST
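The "decades" estimate follows from the cubic term alone. The sketch below extrapolates from the measured case-study point, assuming run time scales exactly as $nresp \cdot h^3$ at a fixed flop rate. That scaling assumption and the function name are illustrative, not taken from the slides.

```python
# Extrapolate frequency response time assuming cost ~ nresp * h**3.
# Baseline from the case study: h = 1200, nresp = 20K, 127 minutes.
BASE_H = 1200
BASE_NRESP = 20_000
BASE_MINUTES = 127.0

def estimated_minutes(h, nresp=BASE_NRESP):
    """Scale the measured time by the O(nresp * h^3) operation count."""
    return BASE_MINUTES * (nresp / BASE_NRESP) * (h / BASE_H) ** 3

minutes_60k = estimated_minutes(60_000)
years_60k = minutes_60k / (60 * 24 * 365.25)
print(f"h = 60K: {minutes_60k:.3g} minutes, about {years_60k:.0f} years")
```

Going from 1200 to 60K modes multiplies the work by 50^3 = 125,000, which turns 127 minutes into roughly 30 years.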


PERFORMANCE CASE STUDY

• More typical industrial model:
  ‒ 11 million degrees of freedom ($f$-size)
  ‒ Shell-dominated model
  ‒ Approximately 3000 eigenpairs ($h$-size)
  ‒ 300 frequency responses ($nresp$)

• Frequency response is expensive, but the modal calculation is still expensive even with RDMODES:
  ‒ Modal calculation: 375 minutes
  ‒ Frequency response time: 22 minutes

• Need to improve performance in both phases

CUSTOMER BENCHMARK

FREQUENCY RESPONSE


FREQUENCY RESPONSE IMPLEMENTATION

• NX Nastran implementation uses symmetric $LDL^T$ factorization and forward-backward substitution:

  For $k = 1, \dots, nresp$
    Assemble $A = K_{hh} + i \omega_k B_{hh} - \omega_k^2 M_{hh}$
    Factor $A = LDL^T$
    Solve $x_k = A^{-1} b_k = L^{-T} D^{-1} L^{-1} b_k$
  End for

• NX Nastran sparse factorization difficult to adapt to GPU:
  ‒ Disk oriented
  ‒ Tuned for sparse matrices
  ‒ Symmetric pivoting required for stability (indefiniteness)

DETAILS OF ORIGINAL METHOD



• For GPU code, use LU factorization instead:

  For $k = 1, \dots, nresp$
    Assemble $A = K_{hh} + i \omega_k B_{hh} - \omega_k^2 M_{hh}$
    Factor $A = LU$
    Solve $x_k = A^{-1} b_k = U^{-1} L^{-1} b_k$
  End for

• OpenCL port of LAPACK zgesv available with clMAGMA and clBLAS
  ‒ In-core storage
  ‒ Dense oriented (okay for this application)
  ‒ Benefit mainly in factorization step (cubic operation count)

DETAILS OF REVISED METHOD
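The revised loop can be sketched on the CPU with NumPy, whose `numpy.linalg.solve` uses the LAPACK gesv routine (LU with partial pivoting). This only illustrates the per-frequency assemble, factor, solve structure; it is not the clMAGMA code from the talk. The function name, matrix sizes, and test values are made up for the example.

```python
import numpy as np

def modal_frequency_response(K, B, M, loads, omegas):
    """Solve (K + i*w*B - w^2*M) x = b for each circular frequency w.

    K, B, M : (h, h) dense modal matrices
    loads   : (nresp, h) right-hand sides b_k
    omegas  : (nresp,) circular frequencies w_k
    """
    h = K.shape[0]
    X = np.empty((len(omegas), h), dtype=np.complex128)
    for k, w in enumerate(omegas):
        A = K + 1j * w * B - w**2 * M        # assemble
        X[k] = np.linalg.solve(A, loads[k])  # LU factorization and solve
    return X

# Small made-up system (illustrative values only).
rng = np.random.default_rng(0)
h = 50
K = np.diag(np.linspace(1.0e3, 1.0e6, h))    # modal stiffness
M = np.eye(h)                                # mass-normalized modes
C = rng.standard_normal((h, h))
B = 0.5 * np.eye(h) + 0.01 * (C + C.T)       # symmetric damping
omegas = np.linspace(10.0, 900.0, 20)
loads = rng.standard_normal((len(omegas), h)).astype(np.complex128)

X = modal_frequency_response(K, B, M, loads, omegas)

# Relative residual of the last solve as a sanity check.
A_last = K + 1j * omegas[-1] * B - omegas[-1] ** 2 * M
residual = (np.linalg.norm(A_last @ X[-1] - loads[-1])
            / np.linalg.norm(loads[-1]))
print(f"relative residual: {residual:.2e}")
```

Each pass through the loop is independent of the others, which is what makes the loop easy to parallelize or to hand to an accelerator one system at a time.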



• Original NX Nastran sparse symmetric solver
  ‒ Spills to disk, requires minimal memory
  ‒ Minimizes flops by utilizing symmetry
  ‒ Takes advantage of sparsity

• Improved SMP method (system462=1 in NXN9.0)
  ‒ In core, based on LAPACK zsytrf/zsytrs
  ‒ Efficient parallelization of $nresp$ loop
  ‒ Large memory requirements

• OpenCL method (to appear in NXN9 MP)
  ‒ In core, based on clMAGMA zgesv (LU factorization)
  ‒ Utilizing GPU for best performance

LINEAR SOLVER SELECTION STRATEGY


FREQUENCY RESPONSE

• Test machine:
  ‒ Magny-Cours 2.1 GHz, 24 cores
  ‒ 32GB memory
  ‒ 4GB Tahiti GPU

• GPU roughly 40% faster than 24-way SMP

INITIAL PERFORMANCE COMPARISON

[Chart: frequency response wall time (axis 0:00:00 to 2:24:00) for models e10k, e20k, e30k, e40k; series: serial, smp=8, smp=24, GPU]

Model  Modes
e10k   1785
e20k   3631
e30k   5576
e40k   7646


FREQUENCY RESPONSE – FURTHER IMPROVEMENTS

• Use single precision on GPU for improved performance
  ‒ Higher flop rate (typically 4-5 times)
  ‒ Lower memory utilization (larger dimension problems possible)
  ‒ Better scaling with larger systems

• Single precision disadvantage: lower precision
  ‒ Accuracy acceptable for most engineering purposes (largest relative error of $10^{-5}$)

SINGLE PRECISION ARITHMETIC

[Chart: relative error (logarithmic axis, 1E-08 to 1); series: Double precision, Single precision]
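The accuracy trade-off can be reproduced on a small made-up system. The sketch below uses a short hand-written elimination rather than a library solver so that the arithmetic demonstrably stays in the input precision. The matrices, damping values, and frequency are illustrative assumptions, so the measured error is not expected to match the figure quoted in the slides.

```python
import numpy as np

def gauss_solve(A, b):
    """Solve A x = b by elimination with partial pivoting.

    All arithmetic stays in A's dtype, so complex64 input gives a
    genuinely single precision solve.
    """
    A = A.copy()
    x = b.astype(A.dtype)
    n = A.shape[0]
    for j in range(n - 1):
        p = j + int(np.argmax(np.abs(A[j:, j])))    # pivot row
        if p != j:
            A[[j, p]] = A[[p, j]]
            x[[j, p]] = x[[p, j]]
        mult = A[j + 1:, j] / A[j, j]
        A[j + 1:, j + 1:] -= np.outer(mult, A[j, j + 1:])
        x[j + 1:] -= mult * x[j]
    for j in range(n - 1, -1, -1):                  # back substitution
        x[j] = (x[j] - A[j, j + 1:] @ x[j + 1:]) / A[j, j]
    return x

# Made-up modal system: modes between 10 and 400 Hz, light damping.
rng = np.random.default_rng(1)
h = 80
lam = (2 * np.pi * np.linspace(10.0, 400.0, h)) ** 2
C = rng.standard_normal((h, h))
B = np.diag(0.04 * np.sqrt(lam)) + 0.05 * (C + C.T)
w = 2 * np.pi * 123.0
A = np.diag(lam) + 1j * w * B - w**2 * np.eye(h)
b = rng.standard_normal(h) + 0j

x_double = gauss_solve(A, b)
x_single = gauss_solve(A.astype(np.complex64), b)

rel_err = np.linalg.norm(x_single - x_double) / np.linalg.norm(x_double)
# Cross-check the hand-written solver against LAPACK in double precision.
check = (np.linalg.norm(x_double - np.linalg.solve(A, b))
         / np.linalg.norm(x_double))
print(f"single vs double relative error: {rel_err:.1e}")
```

How large the single precision error gets depends on the conditioning of each assembled matrix, so a production code would want to check it on representative models.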



• 40-50% reduction in run time

• Largest example only possible in single precision

SINGLE PRECISION ACCURACY AND PERFORMANCE

[Chart: frequency response wall time on GPU (axis 0:00:00 to 0:17:17) for models e10k, e20k, e30k, e40k, e60k; series: Double, Single]

Model  Modes
e10k   1785
e20k   3631
e30k   5576
e40k   7646
e60k   12088
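A quick storage estimate shows why the largest model is a tight fit in double precision. The sketch counts only one dense $h \times h$ complex matrix for the e60k model; right-hand sides, workspace, and any per-allocation limits of the device are ignored. The slides do not say which limit was actually hit, so this is a plausibility check only.

```python
# Storage for one dense h x h complex matrix, e60k model (12088 modes).
H_E60K = 12088
BYTES_COMPLEX_DOUBLE = 16   # two 8-byte floats per entry
BYTES_COMPLEX_SINGLE = 8    # two 4-byte floats per entry
GPU_GIB = 4.0               # 4GB card from the test machine

def matrix_gib(n, bytes_per_entry):
    """Size of an n x n dense matrix in GiB."""
    return n * n * bytes_per_entry / 1024**3

double_gib = matrix_gib(H_E60K, BYTES_COMPLEX_DOUBLE)
single_gib = matrix_gib(H_E60K, BYTES_COMPLEX_SINGLE)
double_fraction = double_gib / GPU_GIB

print(f"double: {double_gib:.2f} GiB, single: {single_gib:.2f} GiB per matrix")
print(f"double precision matrix uses {double_fraction:.0%} of the card")
```

One double precision matrix already takes more than half of the card, so halving the entry size leaves considerably more headroom.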



• Perform addition of matrices at each frequency on GPU (assembly step)

$$A = K_{hh} + i \omega_k B_{hh} - \omega_k^2 M_{hh}$$

• I.e. store $K_{hh}$, $B_{hh}$, $M_{hh}$ in GPU buffers and sum using zaxpy/caxpy kernels:

$$A := K_{hh}$$
$$A := A + i \omega_k B_{hh}$$
$$A := A - \omega_k^2 M_{hh}$$

• Minimizes data transfer to/from main memory

• Additional GPU memory consumption

MATRIX SUMMATION ON GPU
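The three-step summation can be mimicked on the CPU with NumPy. In this sketch the `axpy` helper stands in for the zaxpy/caxpy kernels and the preallocated array `A` stands in for a device buffer; both names are illustrative and nothing here runs on a GPU.

```python
import numpy as np

def axpy(alpha, X, Y):
    """Y := alpha * X + Y, updating Y in place (BLAS axpy semantics).

    NumPy builds a temporary for alpha * X; a real axpy kernel would not.
    """
    Y += alpha * X
    return Y

def assemble(K, B, M, w, A):
    """Overwrite the preallocated buffer A with K + i*w*B - w^2*M."""
    np.copyto(A, K)          # A := K
    axpy(1j * w, B, A)       # A := A + i*w*B
    axpy(-w**2, M, A)        # A := A - w^2*M
    return A

rng = np.random.default_rng(2)
h = 6
K, B, M = (rng.standard_normal((h, h)) for _ in range(3))
A = np.empty((h, h), dtype=np.complex128)   # reused for every frequency

max_diff = 0.0
for w in (1.0, 2.5, 7.0):
    assemble(K, B, M, w, A)
    expected = K + 1j * w * B - w**2 * M
    max_diff = max(max_diff, float(np.max(np.abs(A - expected))))
print(f"largest deviation from direct formula: {max_diff:.1e}")
```

Keeping the three source matrices resident and reusing one output buffer means only the scalar frequency changes between iterations, which is the data-transfer saving the slide describes.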



• Double precision best result (e30k):
  ‒ Time reduced 30% from 6:52 to 4:50
  ‒ 2x faster than best CPU time

• Single precision best result (e40k):
  ‒ Time reduced 22% from 6:23 to 4:58
  ‒ 4x faster than best CPU time

• Best scaling with largest problems
  ‒ Limited by GPU memory

MATRIX SUMMATION ON GPU PERFORMANCE

[Chart: frequency response wall time on GPU (axis 0:00:00 to 0:12:58) for models e10k, e20k, e30k, e40k; series: Double, Double + zaxpy, Single, Single + caxpy]

MODAL ANALYSIS


MODAL ANALYSIS WITH RDMODES

• RDMODES: proprietary high-performance approximate eigensolver

• Tuned for typical customer use cases:
  ‒ Larger models (10 million+ dofs)
  ‒ Many modes (300+)
  ‒ Accelerated computation when few output dofs required
  ‒ Sufficient accuracy for frequency response calculations

• Performance up to 20x faster than Lanczos

• Demonstrated DMP scalability to 2048 nodes

OVERVIEW


MODAL ANALYSIS WITH RDMODES

• RDMODES method comprises multiple smaller operations; five areas are listed below

• Costs for customer benchmark:
  ‒ 11 million dofs
  ‒ Shell dominated
  ‒ 3000 modes below 400 Hz
  ‒ 300 frequency responses

• Dense operations good candidates for GPU
  ‒ Factorization, eigensolution

COST BREAKDOWN

Operation                      Wall time
Sparse factorization           18:40
Dense factorization            24:00
Sparse eigensolution           9:33
Dense eigensolution            65:00
Reduced (dense) eigensolution  21:16
Total                          250:06
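The table also indicates how much of the run the GPU could affect at all. The sketch below sums the three dense rows and applies an Amdahl-style bound, assuming the times are base-60 stamps in a common unit and that the remaining work is unchanged. The bound is an illustration added here, not a figure from the talk.

```python
def to_units(stamp):
    """Convert a base-60 'major:minor' stamp to a count of minor units."""
    major, minor = stamp.split(":")
    return int(major) * 60 + int(minor)

wall_times = {
    "Sparse factorization": "18:40",
    "Dense factorization": "24:00",
    "Sparse eigensolution": "9:33",
    "Dense eigensolution": "65:00",
    "Reduced (dense) eigensolution": "21:16",
}
total = to_units("250:06")

# Rows whose name mentions "dense" are the GPU candidates.
dense = sum(to_units(t) for name, t in wall_times.items()
            if "dense" in name.lower())
dense_share = dense / total

# Upper bound on overall speedup if the dense work cost nothing.
max_speedup = 1.0 / (1.0 - dense_share)
print(f"dense share: {dense_share:.1%}, speedup bound: {max_speedup:.2f}x")
```

The dense rows account for about 44% of the total, so accelerating them alone cannot make the whole run more than about 1.8x faster.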


RDMODES FACTORIZATION

• Fairly large quantity of each type

• Sparse factorizations:
  ‒ Typically too large to treat efficiently as dense
  ‒ NXN multifrontal solver very efficient
  ‒ Efficient sparse solution on GPU difficult (active research)

• Dense factorizations:
  ‒ Model dependent, typically small
  ‒ Symmetric positive definite, may use clMAGMA dposv
  ‒ Candidate for GPU

CLASSIFICATION


RDMODES FACTORIZATION

• Dense factorization wall times
  ‒ Costs include factorization and miscellaneous assembly

• As with frequency response, GPU suitable above threshold
  ‒ Threshold of 5000 for this example

• Dense in-core methods helpful

• GPU ineffective for this model (all linear solutions relatively small)

DENSE FACTORIZATION COST COMPARISON

[Chart: dense factorization times (axis 0:00:00 to 0:25:55) for Serial and SMP=24; series: NXN, LAPACK, GPU]


RDMODES EIGENSOLUTION

• Sparse eigensolutions:
  ‒ Large number
  ‒ Sparse, relatively large dimension
  ‒ Inexpensive with NXN sparse eigensolvers

• Dense eigensolutions:
  ‒ Large number
  ‒ Dense, small-medium dimension
  ‒ Candidate for GPU

• Reduced eigensolution:
  ‒ Only one instance
  ‒ Dense, fairly large, many modes
  ‒ Strong candidate for GPU

CLASSIFICATION



• Householder type solution for real symmetric problem (dsyev):
  ‒ Reduce to tridiagonal: $Q^T A Q = T$
  ‒ Eigenvalues of tridiagonal: $Z^T T Z = \Lambda$
  ‒ Compute eigenvectors: $\Phi = QZ$
  ‒ Then $A \Phi = \Phi \Lambda$

• Efficient choice for dense problems, and/or many eigenvectors needed
  ‒ High memory consumption

• Transform generalized eigenvalue problem as follows:
  ‒ Factor: $M = LL^T$
  ‒ Solve: $L^{-1} K L^{-T} X = X \Lambda$
  ‒ Generalized eigensolution: $K (L^{-T} X) = M (L^{-T} X) \Lambda$

DENSE SOLUTION METHODS
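The Cholesky transformation of the generalized problem can be checked numerically. The sketch below uses NumPy on the CPU; `numpy.linalg.eigh` plays the role of the dense symmetric eigensolver, and general solves are used where a production code would use triangular solves. The matrices are random stand-ins, not data from the benchmark.

```python
import numpy as np

def generalized_eigh(K, M):
    """Solve K Phi = M Phi Lambda for symmetric K and SPD M."""
    L = np.linalg.cholesky(M)            # Factor: M = L L^T
    Y = np.linalg.solve(L, K)            # L^{-1} K
    # L^{-1} (L^{-1} K)^T = L^{-1} K L^{-T}, because K is symmetric.
    C = np.linalg.solve(L, Y.T)
    C = 0.5 * (C + C.T)                  # remove rounding asymmetry
    lam, X = np.linalg.eigh(C)           # Solve: C X = X Lambda
    Phi = np.linalg.solve(L.T, X)        # Phi = L^{-T} X
    return lam, Phi

rng = np.random.default_rng(3)
n = 40
G = rng.standard_normal((n, n))
K = G @ G.T + n * np.eye(n)              # symmetric stiffness-like matrix
H = rng.standard_normal((n, n))
M = H @ H.T + n * np.eye(n)              # SPD mass-like matrix

lam, Phi = generalized_eigh(K, M)

# Column j of (M @ Phi) * lam is M Phi_j lambda_j.
residual = (np.linalg.norm(K @ Phi - (M @ Phi) * lam)
            / np.linalg.norm(K @ Phi))
print(f"relative residual of K Phi = M Phi Lambda: {residual:.1e}")
```

Because the eigenvectors $X$ of the transformed problem are orthonormal, the recovered modes satisfy $\Phi^T M \Phi = I$, i.e. they come out mass-normalized.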



• Dimensions range from 2800 to 8800
  ‒ Dense problems, modes variable

• GPU beneficial for larger sizes

• Total times (serial): 50% reduction with GPU relative to all LAPACK
  ‒ 56:29 (all Lanczos)
  ‒ 15:30 (all LAPACK)
  ‒ 7:29 (using GPU)

• Total times (SMP): 36% reduction with GPU relative to all LAPACK
  ‒ 52:22 (all Lanczos)
  ‒ 4:41 (all LAPACK)
  ‒ 3:00 (using GPU)

DENSE EIGENSOLUTION SCALABILITY

[Charts: dense eigensolution wall time (logarithmic axis, 0:00:01 to 2:24:00) versus dimension (2000, 4000, 8000); panels: Serial, SMP=24; series: Lanczos, LAPACK, GPU]



• Householder methods well suited (as expected)

• Larger dimension dense problems benefit from the GPU
  ‒ And are the most time consuming

• Send most expensive problems to GPU

• Threshold set to 3800 for this test
  ‒ Note: optimal threshold depends on hardware and SMP

GPU SUPPORT
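The size-based routing amounts to a simple dispatch rule. The sketch below is hypothetical: the function name, the backend labels, and the choice to treat the threshold as inclusive are assumptions, and only the value 3800 and the 2800 to 8800 size range come from the slides.

```python
GPU_THRESHOLD = 3800   # dimension used in this test; hardware dependent

def choose_backend(dimension, gpu_available=True, threshold=GPU_THRESHOLD):
    """Return 'gpu' for large dense problems, 'lapack' otherwise."""
    if gpu_available and dimension >= threshold:
        return "gpu"
    return "lapack"

# Dense eigenproblem dimensions in the benchmark range from 2800 to 8800.
plan = {n: choose_backend(n) for n in (2800, 3800, 5000, 8800)}
for n, backend in plan.items():
    print(f"dimension {n}: {backend}")
```

Since the best threshold depends on the hardware and the SMP level, exposing it as a parameter rather than a constant is the natural choice.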



• Reduced eigensolution
  ‒ Not ideally suited to NXN Lanczos eigensolver
  ‒ Unique, but large (14K dofs)
  ‒ Many eigenvectors needed
  ‒ GPU 30% speedup (both SMP and serial)

• GPU in RDMODES conclusions
  ‒ Dense and reduced eigensolutions benefit
  ‒ Threshold for dense eigensolution
  ‒ Dense factorization benefits from LAPACK: little additional benefit on GPU

• Sparse methods not supported yet

MOST SIGNIFICANT COST COMPONENTS

[Chart: reduced eigensolution wall time (axis 0:00:00 to 0:57:36) for Serial and SMP=24; series: NXN, LAPACK, GPU]


RDMODES AND FREQUENCY RESPONSE

• SMP=24, customer benchmark

• Compared to NXN system:
  ‒ Frequency response 3x faster
  ‒ Reduced eigensolution 2.8x faster
  ‒ Factorization 28% faster
  ‒ Dense eigensolution 9x faster
  ‒ 30% reduction in total run time

• Compared to LAPACK:
  ‒ Frequency response 3x faster
  ‒ Reduced eigensolution 2x faster
  ‒ 10% reduction in total run time

BENCHMARK PERFORMANCE RESULTS

[Chart: total run time (axis 0:00:00 to 8:24:00) for NXN, LAPACK, and GPU configurations; components: Frequency response, Reduced eigensolution, Dense eigensolution, Factorization, Other]



• Performance advantages with single precision eigensolution
  ‒ As with linear solution in frequency response, single precision faster on GPU
  ‒ Lower GPU memory consumption (larger problems)

• Dense eigensolutions (customer benchmark): 35-40% speedup

• Reduced eigensolution also benefits: 20% speedup
  ‒ 3:05 to 2:29

SINGLE PRECISION

Dense eigensolution times (customer benchmark):

         Double precision  Single precision
Serial   7:01              4:16
SMP=24   3:41              2:23



CONCLUSIONS

• Significant benefit with GPU for certain computation types
  ‒ Frequency response calculation 2x-3x faster, dense eigensolution 2x faster
  ‒ Additional 35-50% improvement possible with single precision
  ‒ 30% lower turnaround time for typical customer benchmark

• Efficient dense matrix algebra on GPU with clMath, clMAGMA

• Many thanks to: Ben-Shan Liao, Wei Zhang (Siemens PLM), Antoine Reymond (AMD)

Thank you!


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.
