PG-4037, Fast modal analysis with NX Nastran and GPUs, by Leonard Hoffnung
Transcript of PG-4037, Fast modal analysis with NX Nastran and GPUs, by Leonard Hoffnung
FAST MODAL ANALYSIS WITH NX NASTRAN AND GPUS
LEONARD HOFFNUNG SIEMENS PLM SOFTWARE
INTRODUCTION INTRO / FREQ RESP / MODES / CONCLUSIONS
3 | FAST MODAL ANALYSIS WITH NX NASTRAN AND GPUS | NOVEMBER 12, 2013 | CONFIDENTIAL
ABOUT NX NASTRAN
• Industry-standard finite element package from Siemens PLM
• Analysis options include:
  ‒ Stress, vibration, structural failure
  ‒ Heat transfer, acoustics, rotor dynamics, and more
• Advanced numerical capabilities and proven scalability:
  ‒ Problem sizes approaching 1 billion dofs
  ‒ SMP to 24 cores
  ‒ DMP to 2048 nodes
MODAL FREQUENCY RESPONSE OVERVIEW
• Bread-and-butter industrial computation: modal frequency response
• Widely used in automotive and aerospace to determine response under varying excitations
  ‒ Optimize weight, rigidity
  ‒ Minimize noise, resonance
• Two-phase calculation more efficient than direct:
  ‒ Modal analysis
  ‒ Frequency response calculation
NASTRAN SOL 111
MODAL FREQUENCY RESPONSE
• Eigensolution: $h$ normal modes of the $f \times f$ structural matrices:
  $K_{ff}\,\Phi_{fh} = M_{ff}\,\Phi_{fh}\,\Lambda_{hh}$
• Frequency response: $h \times h$ complex linear solution at each of $\mathit{nresp}$ frequencies:
  $(K_{hh} + i\,\omega_k B_{hh} - \omega_k^2 M_{hh})\,x_k = b_k, \quad k = 1, \dots, \mathit{nresp}$
• All parameters large in typical customer usage:
  ‒ $f$-size 10-30M for model fidelity
  ‒ $h$-size 10-60K for modal accuracy
  ‒ $\mathit{nresp}$ 20K for detailed response graph
COMPUTATIONAL STEPS
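The two steps are linked by the usual modal projection. The block below is a sketch of that link, assuming the physical response is approximated in the span of the computed modes. The symbols $u_k$ (physical response), $B_{ff}$ (physical damping) and $p_k$ (physical load) do not appear on the slide and are introduced here only for illustration.

```latex
u_k \approx \Phi_{fh}\, x_k, \qquad
K_{hh} = \Phi_{fh}^{T} K_{ff}\, \Phi_{fh}, \quad
B_{hh} = \Phi_{fh}^{T} B_{ff}\, \Phi_{fh}, \quad
M_{hh} = \Phi_{fh}^{T} M_{ff}\, \Phi_{fh}, \quad
b_k = \Phi_{fh}^{T} p_k
```

Under this assumption, each frequency requires only an $h \times h$ solve instead of an $f \times f$ one, which is why the two-phase calculation pays off when $h \ll f$.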
PERFORMANCE CASE STUDY
• Shell-dominated SOL 111 model
  ‒ 245K degrees of freedom ($f$-size)
  ‒ 1200 eigenpairs ($h$-size)
  ‒ 20K frequency responses ($\mathit{nresp}$)
• Eigensolution time: 30 minutes
• Frequency response: 127 minutes
• Frequency response cost $O(\mathit{nresp} \cdot h^3)$
  ‒ Estimated run time in decades as $h \to 60\mathrm{K}$
PR MODEL – FREQUENCY RESPONSE COST
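The "decades" estimate can be checked with a few lines of arithmetic. This is a minimal sketch, assuming run time scales exactly with the cubic cost model and that the number of frequencies stays at 20K; the function name is hypothetical.

```python
# Extrapolate the measured frequency response time with the O(nresp * h^3) model.
measured_minutes = 127.0   # case study: h = 1200, nresp = 20K
h_measured = 1200
h_target = 60_000

def extrapolated_minutes(minutes, h_from, h_to):
    """Scale a measured run time by (h_to / h_from)^3, with nresp held fixed."""
    return minutes * (h_to / h_from) ** 3

minutes = extrapolated_minutes(measured_minutes, h_measured, h_target)
years = minutes / (60 * 24 * 365)
print(f"{minutes:.3g} minutes, about {years:.0f} years")
```

The ratio of mode counts is 50, so the cost grows by a factor of 125,000, which puts the estimate at roughly 30 years.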
PERFORMANCE CASE STUDY
• More typical industrial model:
  ‒ 11 million degrees of freedom ($f$-size)
  ‒ Shell-dominated model
  ‒ Approximately 3000 eigenpairs ($h$-size)
  ‒ 300 frequency responses ($\mathit{nresp}$)
• Frequency response expensive, but modal calculation still expensive even with RDMODES:
  ‒ Modal calculation: 375 minutes
  ‒ Frequency response time: 22 minutes
• Need to improve performance in both phases
CUSTOMER BENCHMARK
FREQUENCY RESPONSE
FREQUENCY RESPONSE IMPLEMENTATION
• NX Nastran implementation uses symmetric $LDL^T$ factorization and forward-backward substitution:
    For $k = 1, \dots, \mathit{nresp}$
        Assemble $A = K_{hh} + i\,\omega_k B_{hh} - \omega_k^2 M_{hh}$
        Factor $A = LDL^T$
        Solve $x_k = A^{-1} b_k = L^{-T} D^{-1} L^{-1} b_k$
    End for
• NX Nastran sparse factorization difficult to adapt to GPU:
  ‒ Disk oriented
  ‒ Tuned for sparse matrices
  ‒ Symmetric pivoting required for stability (indefiniteness)
DETAILS OF ORIGINAL METHOD
FREQUENCY RESPONSE IMPLEMENTATION
• For GPU code, use LU factorization instead:
    For $k = 1, \dots, \mathit{nresp}$
        Assemble $A = K_{hh} + i\,\omega_k B_{hh} - \omega_k^2 M_{hh}$
        Factor $A = LU$
        Solve $x_k = A^{-1} b_k = U^{-1} L^{-1} b_k$
    End for
• OpenCL port of LAPACK zgesv available with clMAGMA and clBLAS
  ‒ In-core storage
  ‒ Dense oriented (okay for this application)
  ‒ Benefit mainly in factorization step (cubic operation count)
DETAILS OF REVISED METHOD
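The LU-based loop can be sketched on the CPU with NumPy, whose `np.linalg.solve` is a dense LU solve through LAPACK. This is an illustration of the loop structure only, not the NX Nastran or clMAGMA code; the function name and the small test matrices are made up.

```python
import numpy as np

def modal_frequency_response(K, B, M, loads, omegas):
    """Solve (K + i*w*B - w^2*M) x = b at each frequency with a dense LU solve."""
    h = K.shape[0]
    X = np.empty((len(omegas), h), dtype=np.complex128)
    for k, w in enumerate(omegas):
        A = K + 1j * w * B - (w ** 2) * M      # assemble
        X[k] = np.linalg.solve(A, loads[k])    # factor A = LU, then solve
    return X

# Tiny illustrative modal system (values are invented).
rng = np.random.default_rng(0)
h = 6
lam = np.linspace(1.0, 50.0, h)
K = np.diag(lam)                      # modal stiffness
M = np.eye(h)                         # modal mass
B = 0.02 * np.diag(np.sqrt(lam))      # light modal damping
omegas = np.linspace(0.5, 8.0, 4)
loads = rng.standard_normal((len(omegas), h))

X = modal_frequency_response(K, B, M, loads, omegas)
print(X.shape)
```

Each pass through the loop is independent of the others, which is what makes the loop over frequencies easy to parallelize or to hand to an accelerator one system at a time.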
FREQUENCY RESPONSE IMPLEMENTATION
• Original NX Nastran sparse symmetric solver
  ‒ Spills to disk, requires minimal memory
  ‒ Minimizes flops by utilizing symmetry
  ‒ Takes advantage of sparsity
• Improved SMP method (system462=1 in NXN9.0)
  ‒ In core, based on LAPACK zsytrf/zsytrs
  ‒ Efficient parallelization of $\mathit{nresp}$ loop
  ‒ Large memory requirements
• OpenCL method (to appear in NXN9 MP)
  ‒ In core, based on clMAGMA zgesv (LU factorization)
  ‒ Utilizing GPU for best performance
LINEAR SOLVER SELECTION STRATEGY
FREQUENCY RESPONSE
• Test machine:
  ‒ Magny-Cours 2.1 GHz, 24 cores
  ‒ 32GB memory
  ‒ 4GB Tahiti GPU
• GPU roughly 40% faster than 24-way SMP
INITIAL PERFORMANCE COMPARISON
[Figure: frequency response run time (h:mm:ss, axis 0:00:00 to 2:24:00) for models e10k, e20k, e30k, e40k; series: serial, smp=8, smp=24, GPU]

Model   Modes
e10k    1785
e20k    3631
e30k    5576
e40k    7646
FREQUENCY RESPONSE – FURTHER IMPROVEMENTS
• Use single precision on GPU for improved performance
  ‒ Higher flop rate (typically 4-5 times)
  ‒ Lower memory utilization (larger dimension problems possible)
  ‒ Better scaling with larger systems
• Single precision disadvantage: lower precision
  ‒ Accuracy acceptable for most engineering purposes (largest relative error of $10^{-5}$)
SINGLE PRECISION ARITHMETIC
[Figure: relative error (log scale, 1E-08 to 1); series: double precision, single precision]
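The accuracy trade-off can be reproduced in miniature. The sketch below solves one made-up complex system in double and in single precision and compares relative errors. It uses a small hand-written LU with partial pivoting so that all arithmetic stays in the chosen precision; this is an illustrative stand-in, not the clMAGMA solver, and the test matrix is deliberately well conditioned, so the single precision error here will be smaller than the worst case quoted on the slide.

```python
import numpy as np

def lu_solve(A, b):
    """Gaussian elimination with partial pivoting, carried out in A's dtype."""
    A = A.copy()
    x = b.copy()
    n = A.shape[0]
    for j in range(n):
        p = j + int(np.argmax(np.abs(A[j:, j])))   # pivot row
        if p != j:
            A[[j, p]] = A[[p, j]]
            x[[j, p]] = x[[p, j]]
        factors = A[j + 1:, j] / A[j, j]
        A[j + 1:, j:] -= np.outer(factors, A[j, j:])
        x[j + 1:] -= factors * x[j]
    for j in range(n - 1, -1, -1):                 # back substitution
        x[j] = (x[j] - A[j, j + 1:] @ x[j + 1:]) / A[j, j]
    return x

rng = np.random.default_rng(1)
n = 100
# Made-up complex test matrix, shifted on the diagonal to keep it well conditioned.
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)) + n * np.eye(n)
x_true = rng.standard_normal(n) + 1j * rng.standard_normal(n)
b = A @ x_true

def relative_error(dtype):
    """Solve in the given precision and compare with the known solution."""
    x = lu_solve(A.astype(dtype), b.astype(dtype))
    return np.linalg.norm(x - x_true) / np.linalg.norm(x_true)

err_double = relative_error(np.complex128)
err_single = relative_error(np.complex64)
print(f"double: {err_double:.1e}, single: {err_single:.1e}")
```

Single precision also halves the storage per matrix entry, which is the source of the "larger dimension problems possible" point.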
FREQUENCY RESPONSE – FURTHER IMPROVEMENTS
• 40-50% reduction in run time
• Largest example only possible in single precision
SINGLE PRECISION ACCURACY AND PERFORMANCE
[Figure: GPU frequency response run time (h:mm:ss, axis 0:00:00 to 0:17:17) for models e10k, e20k, e30k, e40k, e60k; series: Double, Single]

Model   Modes
e10k    1785
e20k    3631
e30k    5576
e40k    7646
e60k    12088
FREQUENCY RESPONSE – FURTHER IMPROVEMENTS
• Perform addition of matrices at each frequency on GPU (assembly step):
  $A = K_{hh} + i\,\omega_k B_{hh} - \omega_k^2 M_{hh}$
• I.e. store $K_{hh}$, $B_{hh}$, $M_{hh}$ in GPU buffers and sum using zaxpy/caxpy kernels:
  $A := K_{hh}$
  $A := A + i\,\omega_k B_{hh}$
  $A := A - \omega_k^2 M_{hh}$
• Minimizes data transfer to/from main memory
• Additional GPU memory consumption
MATRIX SUMMATION ON GPU
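The three-step summation can be mimicked on the CPU with in-place NumPy updates on a preallocated buffer. This is a sketch of the update order only: the function name is hypothetical, and unlike a true axpy kernel the scaled right-hand sides here create temporaries.

```python
import numpy as np

def assemble_in_place(A, K, B, M, omega):
    """Overwrite the complex buffer A with K + i*omega*B - omega^2*M."""
    np.copyto(A, K)              # A := K
    A += (1j * omega) * B        # A := A + i*omega*B   (axpy-style update)
    A -= (omega ** 2) * M        # A := A - omega^2*M   (axpy-style update)
    return A

h = 4
K = np.diag([1.0, 4.0, 9.0, 16.0])   # made-up modal stiffness
B = 0.05 * np.eye(h)                 # made-up modal damping
M = np.eye(h)                        # modal mass
A = np.empty((h, h), dtype=np.complex128)   # buffer reused at every frequency

for omega in (0.5, 1.5, 2.5):
    assemble_in_place(A, K, B, M, omega)
    # ... factor and solve with A here ...
```

Because $K_{hh}$, $B_{hh}$ and $M_{hh}$ do not change between frequencies, only $\omega_k$ and the right-hand side need to move per iteration once the three matrices are resident on the device.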
FREQUENCY RESPONSE – FURTHER IMPROVEMENTS
• Double precision best result (e30k):
  ‒ Time reduced 30% from 6:52 to 4:50
  ‒ 2x faster than best CPU time
• Single precision best result (e40k):
  ‒ Time reduced 22% from 6:23 to 4:58
  ‒ 4x faster than best CPU time
• Best scaling with largest problems
  ‒ Limited by GPU memory
MATRIX SUMMATION ON GPU PERFORMANCE
[Figure: run time (h:mm:ss, axis 0:00:00 to 0:12:58) for models e10k, e20k, e30k, e40k; series: Double, Double + zaxpy, Single, Single + caxpy]
MODAL ANALYSIS
MODAL ANALYSIS WITH RDMODES
• RDMODES: proprietary high-performance approximate eigensolver
• Tuned for typical customer use cases:
  ‒ Larger models (10 million+ dofs)
  ‒ Many modes (300+)
  ‒ Accelerated computation when few output dofs required
  ‒ Sufficient accuracy for frequency response calculations
• Performance up to 20x faster than Lanczos
• Demonstrated DMP scalability to 2048 nodes
OVERVIEW
MODAL ANALYSIS WITH RDMODES
• RDMODES method comprised of multiple smaller operations: five areas listed below
• Costs for customer benchmark:
  ‒ 11 million dofs
  ‒ Shell dominated
  ‒ 3000 modes below 400 Hz
  ‒ 300 frequency responses
• Dense operations good candidates for GPU
  ‒ Factorization, eigensolution
COST BREAKDOWN
Operation                        Wall time
Sparse factorization             18:40
Dense factorization              24:00
Sparse eigensolution             9:33
Dense eigensolution              65:00
Reduced (dense) eigensolution    21:16
Total                            250:06
RDMODES FACTORIZATION
• Fairly large quantity of each type
• Sparse factorizations:
  ‒ Typically too large to treat efficiently as dense
  ‒ NXN multifrontal solver very efficient
  ‒ Efficient sparse solution on GPU difficult (active research)
• Dense factorizations:
  ‒ Model dependent, typically small
  ‒ Symmetric positive definite, may use clMAGMA dposv
  ‒ Candidate for GPU
CLASSIFICATION
RDMODES FACTORIZATION
• Dense factorization wall times
  ‒ Costs include factorization and miscellaneous assembly
• As with frequency response, GPU suitable above threshold
  ‒ Threshold of 5000 for this example
• Dense in-core methods helpful
• GPU ineffective for this model
  ‒ (all linear solutions relatively small)
DENSE FACTORIZATION COST COMPARISON
[Figure: dense factorization times (h:mm:ss, axis 0:00:00 to 0:25:55) for Serial and SMP=24; series: NXN, LAPACK, GPU]
RDMODES EIGENSOLUTION
• Sparse eigensolutions:
  ‒ Large number
  ‒ Sparse, relatively large dimension
  ‒ Inexpensive with NXN sparse eigensolvers
• Dense eigensolutions:
  ‒ Large number
  ‒ Dense, small-medium dimension
  ‒ Candidate for GPU
• Reduced eigensolution:
  ‒ Only one instance
  ‒ Dense, fairly large, many modes
  ‒ Strong candidate for GPU
CLASSIFICATION
RDMODES EIGENSOLUTION
• Householder-type solution for real symmetric problem (dsyev):
  ‒ Reduce to tridiagonal: $Q^T A Q = T$
  ‒ Eigenvalues of tridiagonal: $Z^T T Z = \Lambda$
  ‒ Compute eigenvectors: $\Phi = QZ$
  ‒ Then $A\Phi = \Phi\Lambda$
• Efficient choice for dense problems, and/or many eigenvectors needed
  ‒ High memory consumption
• Transform generalized eigenvalue problem as follows:
  ‒ Factor: $M = LL^T$
  ‒ Solve: $L^{-1} K L^{-T} X = X\Lambda$
  ‒ Generalized eigensolution: $K(L^{-T}X) = M(L^{-T}X)\Lambda$
DENSE SOLUTION METHODS
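The Cholesky transformation of the generalized problem can be sketched with NumPy. This is a minimal CPU illustration, assuming symmetric $K$ and symmetric positive definite $M$; the function name and the random test matrices are made up, and `np.linalg.solve` is used for the triangular systems for brevity even though a dedicated triangular solve would be cheaper.

```python
import numpy as np

def generalized_eigh_via_cholesky(K, M):
    """Solve K*Phi = M*Phi*Lambda for symmetric K and SPD M via M = L L^T."""
    L = np.linalg.cholesky(M)          # Factor: M = L L^T
    Y = np.linalg.solve(L, K)          # Y = L^{-1} K
    C = np.linalg.solve(L, Y.T).T      # C = L^{-1} K L^{-T}
    C = 0.5 * (C + C.T)                # remove rounding asymmetry
    lam, X = np.linalg.eigh(C)         # Solve: C X = X Lambda
    Phi = np.linalg.solve(L.T, X)      # Back-transform: Phi = L^{-T} X
    return lam, Phi

rng = np.random.default_rng(2)
n = 8
G = rng.standard_normal((n, n))
H = rng.standard_normal((n, n))
K = G @ G.T + n * np.eye(n)            # made-up symmetric stiffness
M = H @ H.T + n * np.eye(n)            # made-up SPD mass

lam, Phi = generalized_eigh_via_cholesky(K, M)
residual = np.linalg.norm(K @ Phi - (M @ Phi) * lam)
print(f"residual: {residual:.1e}")
```

A side effect of this route is that the returned eigenvectors are mass-orthonormal, $\Phi^T M \Phi = I$, because $X$ is orthonormal.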
RDMODES EIGENSOLUTION
• Dimensions range from 2800 to 8800
  ‒ Dense problems, modes variable
• GPU beneficial for larger sizes
• Total times (serial), 50% reduction:
  ‒ 56:29 (all Lanczos)
  ‒ 15:30 (all LAPACK)
  ‒ 7:29 (using GPU)
• Total times (SMP), 36% reduction:
  ‒ 52:22 (all Lanczos)
  ‒ 4:41 (all LAPACK)
  ‒ 3:00 (using GPU)
DENSE EIGENSOLUTION SCALABILITY
[Figure: two panels, Serial and SMP=24; dense eigensolution time (h:mm:ss, log scale, 0:00:01 to 2:24:00) at sizes 2000, 4000, 8000; series: Lanczos, LAPACK, GPU]
RDMODES EIGENSOLUTION
• Householder methods well suited (as expected)
• Larger dimension dense problems benefit from the GPU
  ‒ And are the most time consuming
• Send most expensive problems to GPU
• Threshold set to 3800 for this test
  ‒ Note: optimal threshold depends on hardware and SMP
GPU SUPPORT
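The threshold rule amounts to a simple dispatch on problem dimension. The sketch below is hypothetical: the function name and the sample sizes are invented, and whether a problem exactly at the threshold goes to the GPU is an assumption made here.

```python
def split_by_threshold(dimensions, threshold=3800):
    """Partition eigenproblem dimensions into CPU and GPU work lists."""
    cpu, gpu = [], []
    for n in dimensions:
        # Assumption: problems at or above the threshold go to the GPU.
        (gpu if n >= threshold else cpu).append(n)
    return cpu, gpu

# Made-up sizes within the 2800 to 8800 range quoted for this benchmark.
sizes = [2800, 3500, 3800, 5200, 8800]
cpu_sizes, gpu_sizes = split_by_threshold(sizes)
print("CPU (LAPACK):", cpu_sizes)
print("GPU:", gpu_sizes)
```

Since the optimal threshold depends on hardware and SMP setting, it is kept as a parameter rather than a constant.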
RDMODES EIGENSOLUTION
• Reduced eigensolution
  ‒ Not ideally suited to NXN Lanczos eigensolver
  ‒ Unique, but large (14K dofs)
  ‒ Many eigenvectors needed
  ‒ GPU 30% speedup (both SMP and serial)
• GPU in RDMODES conclusions
  ‒ Dense and reduced eigensolutions benefit
  ‒ Threshold for dense eigensolution
  ‒ Dense factorization benefits from LAPACK: little additional benefit on GPU
• Sparse methods not supported yet
MOST SIGNIFICANT COST COMPONENTS
[Figure: reduced eigensolution time (h:mm:ss, axis 0:00:00 to 0:57:36) for Serial and SMP=24; series: NXN, LAPACK, GPU]
RDMODES AND FREQUENCY RESPONSE
• SMP=24, customer benchmark
• Compared to NXN system:
  ‒ Frequency response 3x faster
  ‒ Reduced eigensolution 2.8x faster
  ‒ Factorization 28% faster
  ‒ Dense eigensolution 9x faster
  ‒ 30% reduction in total run time
• Compared to LAPACK:
  ‒ Frequency response 3x faster
  ‒ Reduced eigensolution 2x faster
  ‒ 10% reduction in total run time
BENCHMARK PERFORMANCE RESULTS
[Figure: total run time breakdown (h:mm:ss, axis 0:00:00 to 8:24:00) for NXN, LAPACK, GPU; components: frequency response, reduced eigensolution, dense eigensolution, factorization, other]
RDMODES EIGENSOLUTION
• Performance advantages with single precision eigensolution
  ‒ As with linear solution in frequency response, single precision faster on GPU
  ‒ Lower GPU memory consumption (larger problems)
• Dense eigensolutions (customer benchmark): 35-40% speedup
• Reduced eigensolution also benefits, 20% speedup:
  ‒ 3:05 to 2:29
SINGLE PRECISION
          Double precision   Single precision
Serial    7:01               4:16
SMP=24    3:41               2:23
CONCLUSIONS
• Significant benefit with GPU for certain computation types
  ‒ Frequency response calculation 2x-3x faster, dense eigensolution 2x faster
  ‒ Additional 35-50% improvement possible with single precision
  ‒ 30% lower turnaround time for typical customer benchmark
• Efficient dense matrix algebra on GPU with clMath, clMAGMA
• Many thanks to: Ben-Shan Liao, Wei Zhang (Siemens PLM), Antoine Reymond (AMD)
Thank you!
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.