Copyright 2014 by Optimal Synthesis Inc. All Rights Reserved
Massively Parallel Earthquake Simulation on GPUs
Prasenjit Sengupta, Shagandeep Kaur, Jason Kwan, and P. K. Menon, Optimal Synthesis Inc.
John Rundle and Eric Heien, University of California, Davis
NVIDIA GPU Technology Conference
San Jose, CA, March 26, 2014
Research Sponsored by NASA ROSES: Research Opportunities in Space and Earth Sciences
Agenda
About Optimal Synthesis Inc.
Research Problem: Earthquake Simulation
Profiling the C Code
Implementation of Computationally Intensive Functions on the GPU
• Matrix-Vector Multiply-Accumulate for stress calculation
• Generation of the Green's Function Matrix
Validation and Run-Time Performance Evaluation
Summary & Next Steps
Optimal Synthesis Inc.
Innovative R&D company active in aerospace and defense technologies
Located in Los Altos, CA
R&D for NASA and DoD, as well as several Fortune 500 companies
Core Expertise
• Software for High-Performance Computing
• Modeling, Simulation and Analysis
• Vehicle Navigation, Guidance and Control
• Air Traffic Management Systems
• Wireless Signal Processing Technologies
• Handwriting, Speech and Image Recognition Technologies
GPU Implementation of Computationally Intensive Algorithms at OSI
No | Project Title | Algorithm | Sponsor | Speedup
1 | Computational Appliance for Rapid Prediction of Aircraft Trajectories (CARPAT) | Trajectory Prediction | NASA | 250x
2 | Next-Generation Target State Estimation Algorithms for the Interception of Maneuvering Ballistic Missiles | Particle Filtering | MDA | 300x
3 | Trajectory Smoothing for Nonlinear Non-Gaussian Models Using Particle Smoothing on Graphics Processing Units | Particle Smoothing | Air Force | 32x
4 | Accelerating ATM Optimization Algorithms on Emerging High Performance Computing Hardware | Linear Programming | NASA | 30x
5 | Stochastic Queuing Model Analysis to Support Airspace Super Density Operations | Discrete-Event & Monte Carlo Simulation | NASA | 8x
6 | Accelerating Earthquake Simulations on General-Purpose Graphics Processors | Stress Propagation on FEM | NASA | 45x
7 | Turbine Engine Performance Estimation using Particle Filters | Particle Filtering | NASA | 125x
The Research Problem
Motivation
Virtual California (VC):
• Topologically realistic numerical simulation (boundary element code) of earthquakes
• Fault Model: Northern California
• Developed by Dr. John Rundle's research group at UC Davis over the past two decades
• Supported by NASA under the Solid Earth and Natural Hazards Program
• The Working Group on California Earthquake Probabilities (WGCEP) is considering the use of VC for earthquake insurance rate determination
Virtual California Simulation

Fault Model:
• Faults discretized into 3 km x 3 km fault elements
• Each element carries position, orientation, coefficient of friction, rate of slip, and failure stresses

Simulation:
• Long-Term Stress Accumulation: accumulate stress by applying back slip to the various elements
• Rupture Propagation: release of accumulated stress through a cascading series of fault-element failures

Simulation is continued for 10,000 to 30,000 years
Generates probabilities of earthquake occurrence of a given magnitude at a specified location
Validated by comparison against observed earthquake statistics over the past 30 to 40 years
Computational Need
Increasing the Model Resolution
• Benchmark VC model uses a 3 km x 3 km grid
• Benchmark run times are 1536 core-hours (48 hours on a 32-core machine)
• A model resolution on the order of 100 m x 100 m is desired to achieve the required prediction accuracy

Real-Time Earthquake Forecasting
• Real-time forecasting of aftershocks

Proposed Solution
• Leverage parallel computing on GPUs to accelerate execution of Virtual California simulations
Profiling of the C Code
Profiling Summary
Profiling performed using AMD CodeAnalyst

Hot Spots

Matrix-Vector Multiply-Accumulate
• Calculates the stress on all elements given the strain vector: $\sigma_{ij}^{A}(t) = \sum_{B} T_{ij}^{AB} s^{B}(t)$
• Implemented as $c = A \times b$
• Consumes 67% of the run time
• Code: a single function

Calculating the Green's Function Matrix
• Determines the matrix $T_{ij}^{AB}$ from fault element data
• Calculated using closed-form analytical expressions by Okada
• Consumes 29% of the run time
• Code: 232 functions
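The correspondence between the stress-transfer sum and the matrix-vector product $c = A \times b$ is the usual index flattening, sketched below for one fixed stress component (the shear and normal components each get their own matrix); the naming is illustrative, not taken from the VC code.

```latex
% For a fixed stress component (ij), collect the element stresses in c and
% the element slips in b, so that row A of the matrix holds the influence
% coefficients of all source elements B on receiver element A:
\sigma_{ij}^{A}(t) = \sum_{B=1}^{n} T_{ij}^{AB}\, s^{B}(t)
\quad\Longleftrightarrow\quad
c = A \times b,
\qquad
c_A = \sigma_{ij}^{A}(t),\quad
A_{AB} = T_{ij}^{AB},\quad
b_B = s^{B}(t).
```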
Parallelization Opportunities: Matrix-Vector Multiply-Accumulate

Dense Matrix-Vector Multiply: long-term stress accumulation phase
• Parallelize over the number of entries in $c$ (rows in $A$)
• Each thread $i$ calculates one element $c_i = \sum_{j=1}^{m} a_{ij} b_j$, i.e., row $i$ of $A$ times $b$
• Store the $b$ vector in shared memory

Sparse Matrix-Vector Multiply: rupture propagation phase
• Operate only on the non-zero elements of $b$

A minimal CUDA sketch of the dense scheme follows.
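A minimal sketch of the dense scheme, assuming column-major storage of $A$ (consistent with the coalescing discussion later in the deck) and a $b$ vector small enough to fit in shared memory; the kernel name, variable names, and block size are illustrative, not taken from the VC source.

```cuda
// One thread per row of A: thread i computes c_i = sum_j a_ij * b_j,
// with b staged in shared memory so every row reuses the same copy.
#define BLOCK 256

__global__ void matvec_dense(const double *A, const double *b,
                             double *c, int n, int m)
{
    extern __shared__ double b_sh[];            // m doubles, sized at launch

    // All threads of the block cooperate in copying b to shared memory.
    for (int j = threadIdx.x; j < m; j += blockDim.x)
        b_sh[j] = b[j];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // row index
    if (i < n) {
        double sum = 0.0;
        for (int j = 0; j < m; ++j)
            sum += A[(size_t)j * n + i] * b_sh[j];   // column-major: coalesced
        c[i] = sum;       // use c[i] += sum for the accumulate variant
    }
}

// Launch (illustrative):
// matvec_dense<<<(n + BLOCK - 1) / BLOCK, BLOCK, m * sizeof(double)>>>(A, b, c, n, m);
```

The sparse rupture-propagation variant follows the same pattern but loops only over a precomputed list of indices of the non-zero entries of $b$.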
Optimization: Storing the Strain Vector in Shared Memory

Matrix-vector multiply: $c = A \times b$
• The $b$ vector is used $n$ times, once by each row of $A$
• Store $b$ in shared memory
• Divide the copying of the $b$ vector into shared memory among the threads of the kernel
• Limited shared memory (48 KB per block): copy smaller segments of $b$ at a time, tiling $A$ into column blocks $A_1, A_2, A_3, A_4$ with matching segments $b_1, b_2, b_3, b_4$

A tiled variant of the kernel above is sketched below.
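A tiled variant under the same assumptions (column-major $A$; illustrative names and tile size); each pass stages one segment of $b$, so the kernel also works when $b$ exceeds the shared-memory limit.

```cuda
// Process A in column tiles A_1..A_k, staging the matching segment of b
// in shared memory before each partial dot product. TILE is illustrative.
#define TILE 4096   // doubles per segment: 32 KB, within the shared-memory limit

__global__ void matvec_tiled(const double *A, const double *b,
                             double *c, int n, int m)
{
    __shared__ double b_sh[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // row index
    double sum = 0.0;

    for (int j0 = 0; j0 < m; j0 += TILE) {
        int len = min(TILE, m - j0);

        // Cooperative copy of segment b[j0 .. j0+len-1].
        for (int j = threadIdx.x; j < len; j += blockDim.x)
            b_sh[j] = b[j0 + j];
        __syncthreads();

        if (i < n)
            for (int j = 0; j < len; ++j)
                sum += A[(size_t)(j0 + j) * n + i] * b_sh[j];   // column-major
        __syncthreads();   // keep b_sh intact until all threads finish the tile
    }
    if (i < n)
        c[i] = sum;
}
```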
Parallelization Opportunities: Generation of the Green's Function Matrix

Generate the Green's function matrix (the stress influence matrix $T_{ij}^{AB}$ in $\sigma_{ij}^{A}(t) = \sum_{B} T_{ij}^{AB} s^{B}(t)$) from fault element data

Closed-form analytical expressions: 232 functions in QuakeLibOkada.cpp

Independence: the calculation of each matrix element is independent of all others

Amount of Parallelism: number of parallel threads $= n^2$
• AllCal_NoCreep: $n = 13482$, so $n^2 \approx 181$ million

Arithmetic Intensity: number of computations performed per memory transaction
• Input data: 17 doubles for every element ($O(n)$ data)
• Mathematical operations: 14,000 to 43,000 for every matrix element ($O(n^2)$ work)

A sketch of the one-thread-per-element-pair kernel structure follows.
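A sketch of the kernel structure only: `okada_stress` and the `FaultElement` layout stand in for the 232 QuakeLibOkada routines and the actual 17-double element record, and are hypothetical, not the VC interfaces.

```cuda
// n*n threads on a 2D grid: thread (A, B) evaluates the closed-form Okada
// expression for the influence of source element B on receiver element A.
struct FaultElement { double p[17]; };   // 17 doubles: position, orientation, slip rate, ...

__device__ double okada_stress(const FaultElement &src, const FaultElement &rcv)
{
    // Stub: the real code evaluates Okada's closed-form expressions
    // (14,000 to 43,000 floating-point operations per element pair).
    return 0.0;
}

__global__ void greens_matrix(const FaultElement *elem, double *T,
                              size_t pitch, int n)
{
    int A = blockIdx.x * blockDim.x + threadIdx.x;   // receiver (matrix row)
    int B = blockIdx.y * blockDim.y + threadIdx.y;   // source   (matrix column)
    if (A < n && B < n) {
        // Column-major, pitched storage: column B is contiguous, so the
        // consecutive-A threads of a warp write consecutive addresses.
        double *col = (double *)((char *)T + (size_t)B * pitch);
        col[A] = okada_stress(elem[B], elem[A]);
    }
}

// Launch (illustrative): dim3 blk(16, 16), grd((n + 15) / 16, (n + 15) / 16);
// greens_matrix<<<grd, blk>>>(elem, T, pitch, n);
```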
Optimization: Coalescing
Green's function matrices are stored in column-major format to ensure global memory coalescing
• The $i$th thread processes the $i$th row of the matrix

Shear and normal Green's function matrices are allocated with cudaMallocPitch
• Appropriate zero padding ensures memory coalescing
• 2D array: each row of the allocation starts at a 64-byte boundary

A minimal sketch of the pitched allocation is shown below.
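A minimal sketch of the pitched, column-major allocation described above, with illustrative names.

```cuda
#include <cuda_runtime.h>

// Allocate an n x n Green's function matrix in pitched, column-major storage.
// Each matrix column is one "row" of the 2D allocation, so every column
// starts on an aligned boundary (the padding gives coalesced column access).
double *alloc_greens(int n, size_t *pitch)
{
    double *T = nullptr;
    cudaMallocPitch((void **)&T, pitch,
                    n * sizeof(double),   // width in bytes: one column of n doubles
                    n);                   // height: n columns
    return T;
}

// Accessing element (A, B) on the device:
//   double *col = (double *)((char *)T + (size_t)B * pitch);
//   double t_AB = col[A];
```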
Validation & Runtime Performance Evaluation
Virtual California Fault Models
No | Model Name | Number of Fault Elements | Size of Green's Function Matrix | CPU Execution Time (12-Core)
1 | Parkfield | 48 | 24 KB | 0.57 s
2 | SAF | 1508 | 17.48 MB | 14.95 s
3 | AllCal2_Trunc4905 | 4905 | 183.82 MB | 165.51 s (2 m 45 s)
4 | AllCal2_Trunc7453 | 7453 | 423.96 MB | 371.49 s (6 m 12 s)
5 | AllCal2_NoCreep_13482 | 13482 | 1.35 GB | 5137.03 s (1 h 25 m)
6 | AllCal_17757 | 17757 | 2.35 GB | 3432.16 s (57 m 12 s)
Simulation Time Horizon for Performance Benchmarking: 500 years
• Typical Simulation Time Horizon: 10,000 to 30,000 years
Test Hardware: In-House GPU Workstation
CPUs: 2x Intel Xeon E5620 (4 cores / 8 threads each), 2.4 GHz
RAM: 24 GB DDR3 ECC, 1.3 GHz
GPUs: Tesla C2050: 3 GB, 448 cores, 144 GB/s
Tesla K20: 5 GB, 2496 cores, 208 GB/s
Titan: 6 GB, 2688 cores, 288.4 GB/s
Run Time Performance
Run Times: Green’s Function Matrix Generation
[Bar chart, AllCal_NoCreep model: single-core CPU 3378 s; 12-core CPU 308 s; GPU 42 s]
Run Times: Matrix Multiply
[Bar chart, AllCal_NoCreep model: single-core CPU 4605 s; 12-core CPU 4813 s; GPU 102 s]
Run Times: Total
[Bar chart, AllCal_NoCreep model: single-core CPU 8000 s; 12-core CPU 5137 s; GPU 158 s]
Software Speedup
GPU Speedup: Green’s Function Matrix Generation
[Bar chart, AllCal_NoCreep model: 80x speedup over single-core CPU; 7.3x over 12-core CPU]
GPU Speedup: Matrix Multiply
[Bar chart, AllCal_NoCreep model: 45x speedup over single-core CPU; 47x over 12-core CPU]
GPU Speedup: Total
[Bar chart, AllCal_NoCreep model: 50.3x speedup over single-core CPU; 32.3x over 12-core CPU]
30,000-Year VC Simulation

CPU Benchmark
• Hardware: TACC (Texas Advanced Computing Center) Lonestar HPC cluster: 80 cores (7 compute nodes, each with two hex-core CPUs)
• Run time: 24549 s (~6 hr 50 min)

GPU Runtime
• Hardware: a single NVIDIA Titan GPU: 2688 cores
• Run time: 11591 s (~3 hr 13 min)
• Speedup: 2.12x (one NVIDIA Titan vs. 7 Dell blade servers)
Significant Accomplishments
Implemented Green's function matrix generation on the GPU, using the closed-form analytical expressions by Okada
• Speedup of 80x over a single-core CPU
• Speedup of 7.3x over a 12-core CPU running 24 threads

Implemented matrix-vector multiply-accumulate on the GPU for calculating shear and normal stresses
• Speedup of 45-47x over single-core / 12-core CPU
• Note: the multi-threaded version of the matrix multiply shows no speedup over single core: the code is memory-bandwidth bound

500-year VC simulations: 32.3x faster than a 12-core CPU running 24 threads
• Speedup approaches 45x as the number of simulated years increases

30,000-year simulation runs 2.12x faster on a single GPU than on an HPC cluster with 7 compute nodes, each with two hex-core CPUs
Current and Future R&D
Year II Research: distributing computations across
• Multiple GPUs in a single compute node
• A cluster with GPU-enabled compute nodes

Perform test runs on:
• Texas Advanced Computing Center (TACC): Stampede
  Number of nodes: 128
  GPUs per node: 1 NVIDIA K20 GPU (2496 CUDA cores)
• NASA Ames: Pleiades
  Number of nodes: 64
  GPUs per node: 1 NVIDIA Tesla M2090 GPU (512 CUDA cores)
Thank You

Contact: [email protected]
www.optisyn.com