GPU-Accelerated Ab-Initio Simulations of Low-Pressure Turbines
Richard D. Sandberg, Richard Pichler – Aerodynamics and Flight Mechanics Research Group
Vittorio Michelassi – GE Global Research
NVIDIA GPU Technology Conference, session on “Extreme-Scale Supercomputing with Titan Supercomputer”
GE HPC Collaborations
External collaborations, both in the EU and USA, have greatly accelerated HPC's impact at GE ...
[Chart: million CPU hours by year]
Gas Turbines: Impacting Across Engine
• Fan – Inlet: coupling and predicting crosswind behavior
• HPT: capture hot-spot migration and other temperature non-uniformity
• Combustion: high fidelity for dynamics/emissions
• Jet noise: move from test screening to prediction
• LPT: wake / transitional boundary-layer interaction in an unsteady environment
Large-scale 3D unsteady flows
HPC impacting across components/disciplines ... and driving compute cost.
Motivation: why the Low-Pressure Turbine?
• Midspan region: profile-loss driven
• Endwall region: secondary-loss driven
[Figure: LPT cascade (Blade 2, Vane 4, Blade 6) from hub to casing, with spanwise breakdown into midspan and endwall losses]
• 1% of LPT aerodynamic efficiency is worth 0.6-0.7% SFC
• US airlines: 35B gallons of fuel in 2012
• GE GT fleet alone: $150B oil & gas per year
Motivation
• Aerodynamic design is assisted by tools based on turbulence modelling; however, modelling is generally inaccurate
• Want (model-free) DNS based on first principles: all length and time scales must be resolved
• Challenge: computational effort required increases approximately as Re^3 (the grid-point count grows roughly as Re^(9/4) and the number of time steps as Re^(3/4))
[Figures: spanwise vorticity for Ekman layer at Re_tau = 1,241 and at Re_tau = 403]
Motivation (why bother with DNS?)
• Reward for expending enormous computational resources: a wealth of reliable data, free from modelling uncertainties, used to answer basic questions regarding the physics and modelling of a variety of flows
• Advances in computing power mean DNS of Reynolds numbers/geometries of interest is now possible
[Chart: TOP500 global trends, from Jack Dongarra, IESP]
Aim of this presentation:
Introduce a novel Navier-Stokes solver purposely developed to exploit modern HPC architectures for DNS
1) Present key ingredients for an efficient algorithm
2) Performance study of the original hybrid OMP/MPI code
3) Porting of the code to a GPU-accelerated architecture (Titan) using OpenACC
4) Performance study of the OpenACC code
5) Preliminary results of an LPT at real operating conditions
Numerical Method
• In-house multi-block structured curvilinear compressible Navier-Stokes solver HiPSTAR (High-Performance Solver for Turbulence and Aeroacoustics Research), developed for DNS studies on today's computing architectures
• To minimize computation time for a given problem, the numerical algorithm was designed with these requirements:
  1) Stability of the scheme
  2) Resolution of flow features with minimal amplitude and phase errors
  3) Efficiency of the scheme (i.e. high ratio of accuracy to computational cost)
  4) High parallel efficiency on HPC systems
• Due to the very high processor clock speeds of modern HPC systems, memory access, not the number of operations, is the limiting factor in overall simulation performance
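As a rough illustration of why memory access dominates (numbers assumed purely for illustration, not measured on the systems discussed here): a finite-difference update that moves ~24 bytes per grid point while performing ~10 floating-point operations has an arithmetic intensity of ~0.4 flop/byte; at a sustained memory bandwidth of ~50 GB/s per socket, this caps throughput near 0.4 × 50 ≈ 20 GFLOP/s regardless of the peak flop rate, so reducing memory traffic pays off more than reducing the operation count.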
Numerical Method
• Important ingredient for increasing code performance on a bandwidth-limited system: minimize the allocated memory
  1) Improve the efficiency of the algorithm to reduce the number of allocated arrays: an ultra-low-storage five-step, fourth-order accurate Runge-Kutta scheme requires only two registers (see the sketch after this list)
  2) Reduce the grid-cell count required to spatially resolve the flow:
     a) Use novel parallelizable compact-difference schemes
     b) Use FFTW for the spanwise homogeneous direction
  3) Small number of 2D metric terms
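The slides do not show the scheme itself; below is a minimal C sketch of a two-register (2N-storage) Runge-Kutta loop of the kind described, using the widely published five-stage, fourth-order Carpenter-Kennedy coefficients as a stand-in. The coefficients and the rhs_axpy interface are assumptions for illustration, not HiPSTAR's actual implementation.

    /* 2N-storage Runge-Kutta: only the solution q and one accumulator dq
       are stored, independent of the number of stages. The RHS is assumed
       to fold its result into dq in one sweep: dq = a*dq + dt*f(q). */
    #include <stddef.h>

    /* user-supplied (hypothetical interface): dq[i] = a*dq[i] + dt*f_i(q) */
    void rhs_axpy(const double *q, double *dq, double a, double dt, size_t n);

    void rk45_2n_step(double *q, double *dq, size_t n, double dt)
    {
        /* five-stage, fourth-order low-storage coefficients
           (Carpenter & Kennedy, 1994) */
        static const double A[5] = { 0.0,
            -567301805773.0 / 1357537059087.0,
            -2404267990393.0 / 2016746695238.0,
            -3550918686646.0 / 2091501179385.0,
            -1275806237668.0 / 842570457699.0 };
        static const double B[5] = {
            1432997174477.0 / 9575080441755.0,
            5161836677717.0 / 13612068292357.0,
            1720146321549.0 / 2090206949498.0,
            3134564353537.0 / 4481467310338.0,
            2277821191437.0 / 14882151754819.0 };

        for (int s = 0; s < 5; ++s) {
            rhs_axpy(q, dq, A[s], dt, n);   /* register 1: dq updated in place */
            for (size_t i = 0; i < n; ++i)
                q[i] += B[s] * dq[i];       /* register 2: q updated in place  */
        }
    }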
Domain Decomposition
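The slide gives no decomposition details; purely as an illustration, a 2D Cartesian MPI decomposition of the kind a solver with a Fourier spanwise direction might use (only the two in-plane directions are split) could look like the following. The layout is an assumption, not HiPSTAR's actual scheme.

    /* Factor the MPI ranks into a 2D process grid and give each rank its
       block coordinates; each rank then owns one structured sub-block. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int nprocs, rank, coords[2];
        int dims[2] = {0, 0}, periods[2] = {0, 0};
        MPI_Comm cart;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Dims_create(nprocs, 2, dims);        /* e.g. 512 ranks -> 32 x 16 */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
        MPI_Comm_rank(cart, &rank);
        MPI_Cart_coords(cart, rank, 2, coords);  /* this rank's block indices */

        /* ... allocate and integrate the sub-block (coords[0], coords[1]) ... */

        MPI_Finalize();
        return 0;
    }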
Numerical Method
Hybrid MPI/OMP parallelisation: FFTs only in one direction
• Rather than parallelising individual FFT calls, take advantage of the independence of each call from the others
• Each transformation from/to the Fourier domain synchronises all threads, then divides the Fourier transforms between the threads before synchronising again and continuing
• Easy to implement with shared memory: each thread has access to all memory, so there is no change to the layout of data in memory (see the sketch below)
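A minimal C sketch of this pattern with FFTW and OpenMP follows; the contiguous array layout, the plan, and the function name are assumptions for illustration. The point is that entire transforms, not their internals, are divided between the threads.

    /* Many independent 1D real-to-complex transforms along the spanwise
       direction, divided statically between OpenMP threads. New-array
       execution (fftw_execute_dft_r2c) is thread-safe, provided each
       array has the same layout and alignment as the one used to plan. */
    #include <stddef.h>
    #include <fftw3.h>

    void spanwise_ffts(fftw_plan plan, double *in, fftw_complex *out,
                       int n_lines, int n_span)
    {
        int n_span_c = n_span / 2 + 1;   /* complex output length for r2c */

        #pragma omp parallel for schedule(static)
        for (int l = 0; l < n_lines; ++l)
            fftw_execute_dft_r2c(plan,
                                 in  + (size_t)l * n_span,
                                 out + (size_t)l * n_span_c);
        /* implicit barrier at the end of the parallel loop re-synchronises
           the threads before execution continues */
    }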
HiPSTAR – OMP/MPI Performance
• Performance of the DNS code evaluated on UK national HPC facilities:
1) HECToR (CRAY XE6): 90,112 cores, 0.83 PFLOP/s
   Each node contains two 16-core 2.3 GHz AMD Opteron Interlagos processors, each with 16 GB of memory.
   Each 16-core socket is coupled with a Cray Gemini routing/communications chip, achieving MPI point-to-point bandwidth of 5 GB/s and latency between two nodes of around 1-1.5 μs
2) Blue Joule (IBM Blue Gene/Q): 114,688 cores, 1.47 PFLOP/s
   Nodes equipped with 16-core 1.6 GHz Power BQC processors and 16 GB of memory
Strong Scaling – OMP/MPI
• Production-like test case on HECToR: 2048×2048×128 modes, i.e. 1.08×10^9 collocation points
• Smallest number of cores that could be used was 512: the resulting 2.15×10^6 points per core require less than the 1 GB of memory available per core
• Good scaling observed up to 65,536 cores
Weak Scaling
• 32^3 points per core: efficiency > 90%
• 64^3 points per core: efficiency > 96% (at 65,536 cores: 17×10^9 points)
• 64^3 points per core, compact FD: efficiency > 95%
  At 65,536 cores: 5.49 vs. 4.87 s/step, only a 13% increase in computational cost
• The better ratio of algorithmic operations (FLOPs) to communication for the compact FD scheme is presumably the reason why the increase in overall computational time is small despite the significantly higher algorithmic cost
Strong Scaling – Multiblock
• Production case: nine-block curvilinear configuration for the LPT study, 300×10^6 points
• 2048 MPI processes with 1, 2, 4 and 8 OMP threads
• On the CRAY XE6, good performance for all numbers of threads
• On the IBM BG/Q, performance good up to 4 OMP threads but deteriorates slightly with 8 OMP threads
• Note: all scaling tests performed with production setup and B.C.s; load balance appears good
• CRAY CoE: "In addition to good scaling of code, performance of the underlying algorithm also very efficient. On CRAY XE6, sustained performance of 1 GFLOP/s or 12.5% of peak performance measured"
Preliminary LPT results
Simulation on TITAN at realistic conditions:
• Re = 240,000
• Ma = 0.4
Simulation key data:
• Number of grid points: 1.5 billion
• Number of GPUs used: 3,552
GPU speed-up at same node count:
• 1st step of porting: 18%
• 2nd step of porting: estimated 25-30%
Preliminary LPT results
[Figure: isosurface of Q = 500 coloured by velocity magnitude]
• Stretching of vortices, streamwise-oriented
• Long streamwise streaky structures
Profiling of Hybrid OMP/MPI code
• Profiling on Titan
Overall time: USER 72%, MPI 14%, ETC 7%, MPI_SYNC 1%, OMP 6%
Within USER: FOURIER 45%, RHS 32%, DERIV 18%, REST 5%
Porting to GPU with OpenACC
Porting strategy:
• 1st step: port the routines responsible for most of the computational time
• 2nd step: port the remaining subroutines in the Runge-Kutta loop
• 3rd step: restructure the code to obtain better performance (yet to be done)
Porting to GPU: 1st step
• Define memory region and scope for code to be executed on the GPU
• Routines in general consist of 3 nested loops that are ported as shown in the sketch below
[Figure: side-by-side MPI/OMP and ACC versions of a k-indexed loop nest]
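Purely as an illustration of this pattern (the kernel, array names, and coefficient are hypothetical; the slides only indicate that existing loop nests are wrapped in ACC directives): a triple-nested loop that compiles for the GPU with OpenACC or for the CPU with OpenMP might look like this. The present clause assumes an enclosing #pragma acc data region has already placed the arrays on the device.

    #include <stddef.h>

    #define IDX(k, j, i) (((size_t)(k) * (size_t)nj + (j)) * (size_t)ni + (i))

    /* hypothetical stencil kernel: central difference in the i direction */
    void central_diff_i(const double *u, double *du,
                        int nk, int nj, int ni, double c1)
    {
    #ifdef _OPENACC
        #pragma acc parallel loop collapse(3) present(u, du)
    #else
        #pragma omp parallel for collapse(2)
    #endif
        for (int k = 0; k < nk; ++k)
            for (int j = 0; j < nj; ++j)
                for (int i = 1; i < ni - 1; ++i)
                    du[IDX(k, j, i)] =
                        c1 * (u[IDX(k, j, i + 1)] - u[IDX(k, j, i - 1)]);
    }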
GPU-code performance after step 1
• Compare with the MPI/OMP hybrid version
GPU (OpenACC): USER 93%, MPI 7%; within USER: FOURIER 18%, RHS 36%, DERIV 25%, REST 21%
CPU (MPI/OMP): USER 72%, MPI 14%, ETC 7%, MPI_SYNC 1%, OMP 6%; within USER: FOURIER 45%, RHS 32%, DERIV 18%, REST 5%
• Test case ran 6.13 s on GPU vs. 7.45 s on CPU: 18% speed-up
• Note: FFTW 'faster', Deriv 'slower'
Porting to GPU: 2nd step
• Profiling: 30% of runtime spent in CPU ↔ GPU data transfer
• Issue: routines in the RK steps require copying of the main array → port them
• BC porting straightforward
• MPI communication:
  • use GPU address-space directives
  • to maintain portability of the code, also allow these routines on the CPU → use conditional execution (see the sketch below)
• Compared to porting the core routines, a significant amount of coding work for small savings in computational time (saving is mainly memcopy)
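As an illustration of a GPU address-space directive combined with conditional execution (the routine, buffer names, on_gpu flag, and the use of CUDA-aware MPI are assumptions; the slides do not show the actual code):

    /* Halo exchange that hands device pointers straight to (CUDA-aware)
       MPI when the data live on the GPU, avoiding the CPU <-> GPU copy,
       while the same routine still works in a CPU-only run. */
    #include <mpi.h>

    void halo_exchange(double *sendbuf, double *recvbuf, int count,
                       int left, int right, int on_gpu, MPI_Comm comm)
    {
        if (on_gpu) {
            /* expose the device addresses of the buffers to MPI */
            #pragma acc host_data use_device(sendbuf, recvbuf)
            MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, right, 0,
                         recvbuf, count, MPI_DOUBLE, left,  0,
                         comm, MPI_STATUS_IGNORE);
        } else {
            MPI_Sendrecv(sendbuf, count, MPI_DOUBLE, right, 0,
                         recvbuf, count, MPI_DOUBLE, left,  0,
                         comm, MPI_STATUS_IGNORE);
        }
    }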
Summary
• Well-tuned hybrid OpenMP/MPI multi-block compressible flow solver ported to GPU using OpenACC
• To date, only the main routines of the RHS have been ported
• Despite the identified bottlenecks, an 18% speed-up
• Currently evaluating an updated version with better (less) transfer of data GPU ↔ CPU (see Tuesday keynote?)
• Next step: rewrite the entire RHS to better group the compute-intensive routines
Acknowledgements
• Director's Discretionary Grant – Jack Wells, Suzy Tichenor
• Centre of Excellence – John Levesque, Tom Edwards