Copyright 2014 by Optimal Synthesis Inc. All Rights Reserved
Massively Parallel Earthquake Simulation on GPUs
Prasenjit Sengupta, Shagandeep Kaur, Jason Kwan, and P. K. Menon, Optimal Synthesis Inc.
John Rundle and Eric Heien, University of California, Davis
NVIDIA GPU Technology Conference
San Jose, CA, March 26, 2014
Research Sponsored by NASA ROSES: Research Opportunities in Space and Earth Sciences
Agenda
About Optimal Synthesis Inc.
Research Problem: Earthquake Simulation
Profiling the C Code
Implementation of Computationally Intensive Functions on the GPU
• Matrix-Vector Multiply-Accumulate for stress calculation
• Generation of the Green's Function Matrix
Validation and Run-Time Performance Evaluation
Summary & Next Steps
Optimal Synthesis Inc.
Innovative R&D company active in aerospace and defense technologies
Located in Los Altos, CA
R&D for NASA and DoD, as well as several Fortune 500 companies
Core Expertise
• Software for High-Performance Computing
• Modeling, Simulation and Analysis
• Vehicle Navigation, Guidance and Control
• Air Traffic Management Systems
• Wireless Signal Processing Technologies
• Handwriting, Speech and Image Recognition Technologies
GPU Implementation of Computationally Intensive Algorithms at OSI
No | Project Title | Algorithm | Sponsor | Speedup
1 | Computational Appliance for Rapid Prediction of Aircraft Trajectories (CARPAT) | Trajectory Prediction | NASA | 250x
2 | Next-Generation Target State Estimation Algorithms for the Interception of Maneuvering Ballistic Missiles | Particle Filtering | MDA | 300x
3 | Trajectory Smoothing for Nonlinear Non-Gaussian Models Using Particle Smoothing on Graphics Processing Units | Particle Smoothing | Air Force | 32x
4 | Accelerating ATM Optimization Algorithms on Emerging High Performance Computing Hardware | Linear Programming | NASA | 30x
5 | Stochastic Queuing Model Analysis to Support Airspace Super Density Operations | Discrete-Event & Monte Carlo Simulation | NASA | 8x
6 | Accelerating Earthquake Simulations on General-Purpose Graphics Processors | Stress Propagation on FEM | NASA | 45x
7 | Turbine Engine Performance Estimation using Particle Filters | Particle Filtering | NASA | 125x
The Research Problem
Motivation
Virtual California (VC):
• Topologically realistic numerical simulation (boundary element code) of earthquakes
• Fault Model: Northern California
• Developed by Dr. John Rundle's research group at UC Davis over the past two decades
• Supported by NASA under the Solid Earth and Natural Hazards Program
• The Working Group on California Earthquake Probabilities (WGCEP) is considering the use of VC for earthquake insurance rate determination
Virtual California Simulation

Fault Model:
• Faults discretized into 3 km x 3 km fault elements
• Each element carries position, orientation, coefficient of friction, rate of slip, and failure stresses

Simulation:
• Long-Term Stress Accumulation: accumulate stress by applying back slip to the various elements
• Rupture Propagation: release of accumulated stress through a cascading series of fault-element failures

Simulation is continued for 10,000 to 30,000 years
Generates probabilities of earthquake occurrence of a given magnitude at a specified location
Validated by comparison against observed earthquake statistics over the past 30 to 40 years
Computational Need
Increasing the Model Resolution
• Benchmark VC model uses a 3 km x 3 km grid
• Benchmark run times are 1536 core-hours (48 hours on a 32-core machine)
• A model resolution on the order of 100 m x 100 m is desired to achieve the required prediction accuracy

Real-Time Earthquake Forecasting
• Real-time forecasting of aftershocks

Proposed Solution
• Leverage parallel computing on GPUs to accelerate execution of Virtual California simulations
Profiling of the C Code
Profiling Summary
Profiling performed using AMD CodeAnalyst

Hot Spots

Matrix-Vector Multiply-Accumulate
• Calculates the stress on all elements given the strain vector: $\sigma_{ij}^{A}(t) = \sum_{B} T_{ij}^{AB} s^{B}(t)$
• Implemented as $c = A \times b$
• Consumes 67% of the run time
• Code: a single function

Calculating the Green's Function Matrix
• Determines the matrix $T_{ij}^{AB}$ from fault element data
• Calculated using closed-form analytical expressions by Okada
• Consumes 29% of the run time
• Code: 232 functions
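The correspondence between the stress-transfer sum and the matrix-vector product $c = A \times b$ is the usual index flattening, sketched below for one fixed stress component (the shear and normal components each get their own matrix); the naming is illustrative, not taken from the VC code.

```latex
% For a fixed stress component (ij), collect the element stresses in c and
% the element slips in b, so that row A of the matrix holds the influence
% coefficients of all source elements B on receiver element A:
\sigma_{ij}^{A}(t) = \sum_{B=1}^{n} T_{ij}^{AB}\, s^{B}(t)
\quad\Longleftrightarrow\quad
c = A \times b,
\qquad
c_A = \sigma_{ij}^{A}(t),\quad
A_{AB} = T_{ij}^{AB},\quad
b_B = s^{B}(t).
```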
Parallelization Opportunities: Matrix-Vector Multiply-Accumulate

Dense Matrix-Vector Multiply: long-term stress accumulation phase
• Parallelize over the number of entries in $c$ (rows in $A$)
• Each thread $i$ calculates one element $c_i = \sum_{j=1}^{m} a_{ij} b_j$, i.e., row $i$ of $A$ times $b$
• Store the $b$ vector in shared memory

Sparse Matrix-Vector Multiply: rupture propagation phase
• Operate only on the non-zero elements of $b$

A minimal CUDA sketch of the dense scheme follows.
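A minimal sketch of the dense scheme, assuming column-major storage of $A$ (consistent with the coalescing discussion later in the deck) and a $b$ vector small enough to fit in shared memory; the kernel name, variable names, and block size are illustrative, not taken from the VC source.

```cuda
// One thread per row of A: thread i computes c_i = sum_j a_ij * b_j,
// with b staged in shared memory so every row reuses the same copy.
#define BLOCK 256

__global__ void matvec_dense(const double *A, const double *b,
                             double *c, int n, int m)
{
    extern __shared__ double b_sh[];            // m doubles, sized at launch

    // All threads of the block cooperate in copying b to shared memory.
    for (int j = threadIdx.x; j < m; j += blockDim.x)
        b_sh[j] = b[j];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // row index
    if (i < n) {
        double sum = 0.0;
        for (int j = 0; j < m; ++j)
            sum += A[(size_t)j * n + i] * b_sh[j];   // column-major: coalesced
        c[i] = sum;       // use c[i] += sum for the accumulate variant
    }
}

// Launch (illustrative):
// matvec_dense<<<(n + BLOCK - 1) / BLOCK, BLOCK, m * sizeof(double)>>>(A, b, c, n, m);
```

The sparse rupture-propagation variant follows the same pattern but loops only over a precomputed list of indices of the non-zero entries of $b$.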
Optimization: Storing the Strain Vector in Shared Memory

Matrix-vector multiply: $c = A \times b$
• The $b$ vector is used $n$ times, once by each row of $A$
• Store $b$ in shared memory
• Divide the copying of the $b$ vector into shared memory among the threads of the kernel
• Limited shared memory (48 KB per block): copy smaller segments of $b$ at a time, tiling $A$ into column blocks $A_1, A_2, A_3, A_4$ with matching segments $b_1, b_2, b_3, b_4$

A tiled variant of the kernel above is sketched below.
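A tiled variant under the same assumptions (column-major $A$; illustrative names and tile size); each pass stages one segment of $b$, so the kernel also works when $b$ exceeds the shared-memory limit.

```cuda
// Process A in column tiles A_1..A_k, staging the matching segment of b
// in shared memory before each partial dot product. TILE is illustrative.
#define TILE 4096   // doubles per segment: 32 KB, within the shared-memory limit

__global__ void matvec_tiled(const double *A, const double *b,
                             double *c, int n, int m)
{
    __shared__ double b_sh[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // row index
    double sum = 0.0;

    for (int j0 = 0; j0 < m; j0 += TILE) {
        int len = min(TILE, m - j0);

        // Cooperative copy of segment b[j0 .. j0+len-1].
        for (int j = threadIdx.x; j < len; j += blockDim.x)
            b_sh[j] = b[j0 + j];
        __syncthreads();

        if (i < n)
            for (int j = 0; j < len; ++j)
                sum += A[(size_t)(j0 + j) * n + i] * b_sh[j];   // column-major
        __syncthreads();   // keep b_sh intact until all threads finish the tile
    }
    if (i < n)
        c[i] = sum;
}
```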
Parallelization Opportunities: Generation of the Green's Function Matrix

Generate the Green's function matrix (the stress influence matrix $T_{ij}^{AB}$ in $\sigma_{ij}^{A}(t) = \sum_{B} T_{ij}^{AB} s^{B}(t)$) from fault element data

Closed-form analytical expressions: 232 functions in QuakeLibOkada.cpp

Independence: the calculation of each matrix element is independent of all others

Amount of Parallelism: number of parallel threads $= n^2$
• AllCal_NoCreep: $n = 13482$, so $n^2 \approx 181$ million

Arithmetic Intensity: number of computations performed per memory transaction
• Input data: 17 doubles for every element ($O(n)$ data)
• Mathematical operations: 14,000 to 43,000 for every matrix element ($O(n^2)$ work)

A sketch of the one-thread-per-element-pair kernel structure follows.
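A sketch of the kernel structure only: `okada_stress` and the `FaultElement` layout stand in for the 232 QuakeLibOkada routines and the actual 17-double element record, and are hypothetical, not the VC interfaces.

```cuda
// n*n threads on a 2D grid: thread (A, B) evaluates the closed-form Okada
// expression for the influence of source element B on receiver element A.
struct FaultElement { double p[17]; };   // 17 doubles: position, orientation, slip rate, ...

__device__ double okada_stress(const FaultElement &src, const FaultElement &rcv)
{
    // Stub: the real code evaluates Okada's closed-form expressions
    // (14,000 to 43,000 floating-point operations per element pair).
    return 0.0;
}

__global__ void greens_matrix(const FaultElement *elem, double *T,
                              size_t pitch, int n)
{
    int A = blockIdx.x * blockDim.x + threadIdx.x;   // receiver (matrix row)
    int B = blockIdx.y * blockDim.y + threadIdx.y;   // source   (matrix column)
    if (A < n && B < n) {
        // Column-major, pitched storage: column B is contiguous, so the
        // consecutive-A threads of a warp write consecutive addresses.
        double *col = (double *)((char *)T + (size_t)B * pitch);
        col[A] = okada_stress(elem[B], elem[A]);
    }
}

// Launch (illustrative): dim3 blk(16, 16), grd((n + 15) / 16, (n + 15) / 16);
// greens_matrix<<<grd, blk>>>(elem, T, pitch, n);
```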
Optimization: Coalescing
Green's function matrices are stored in column-major format to ensure global memory coalescing
• The $i$th thread processes the $i$th row of the matrix

Shear and normal Green's function matrices are allocated with cudaMallocPitch
• Appropriate zero padding ensures memory coalescing
• 2D array: each row of the allocation starts at a 64-byte boundary

A minimal sketch of the pitched allocation is shown below.
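A minimal sketch of the pitched, column-major allocation described above, with illustrative names.

```cuda
#include <cuda_runtime.h>

// Allocate an n x n Green's function matrix in pitched, column-major storage.
// Each matrix column is one "row" of the 2D allocation, so every column
// starts on an aligned boundary (the padding gives coalesced column access).
double *alloc_greens(int n, size_t *pitch)
{
    double *T = nullptr;
    cudaMallocPitch((void **)&T, pitch,
                    n * sizeof(double),   // width in bytes: one column of n doubles
                    n);                   // height: n columns
    return T;
}

// Accessing element (A, B) on the device:
//   double *col = (double *)((char *)T + (size_t)B * pitch);
//   double t_AB = col[A];
```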
Validation & Runtime Performance Evaluation
Virtual California Fault Models
No | Model Name | Number of Fault Elements | Size of Green's Function Matrix | CPU Execution Time (12-Core)
1 | Parkfield | 48 | 24 KB | 0.57 s
2 | SAF | 1508 | 17.48 MB | 14.95 s
3 | AllCal2_Trunc4905 | 4905 | 183.82 MB | 165.51 s (2 m 45 s)
4 | AllCal2_Trunc7453 | 7453 | 423.96 MB | 371.49 s (6 m 12 s)
5 | AllCal2_NoCreep_13482 | 13482 | 1.35 GB | 5137.03 s (1 h 25 m)
6 | AllCal_17757 | 17757 | 2.35 GB | 3432.16 s (57 m 12 s)
Simulation Time Horizon for Performance Benchmarking: 500 years
• Typical Simulation Time Horizon: 10,000 to 30,000 years
Test Hardware: In-House GPU Workstation
CPUs: 2x Intel Xeon E5620 (4 cores / 8 threads each), 2.4 GHz
RAM: 24 GB DDR3 ECC, 1.3 GHz
GPUs: Tesla C2050: 3 GB, 448 cores, 144 GB/s
Tesla K20: 5 GB, 2496 cores, 208 GB/s
Titan: 6 GB, 2688 cores, 288.4 GB/s
Run Time Performance
Run Times: Green’s Function Matrix Generation
[Bar chart, AllCal_NoCreep model: single-core CPU 3378 s; 12-core CPU 308 s; GPU 42 s]
Run Times: Matrix Multiply
[Bar chart, AllCal_NoCreep model: single-core CPU 4605 s; 12-core CPU 4813 s; GPU 102 s]
Run Times: Total
[Bar chart, AllCal_NoCreep model: single-core CPU 8000 s; 12-core CPU 5137 s; GPU 158 s]
Software Speedup
GPU Speedup: Green’s Function Matrix Generation
[Bar chart, AllCal_NoCreep model: 80x speedup over single-core CPU; 7.3x over 12-core CPU]
GPU Speedup: Matrix Multiply
[Bar chart, AllCal_NoCreep model: 45x speedup over single-core CPU; 47x over 12-core CPU]
GPU Speedup: Total
[Bar chart, AllCal_NoCreep model: 50.3x speedup over single-core CPU; 32.3x over 12-core CPU]
30,000-Year VC Simulation

CPU Benchmark
• Hardware: TACC (Texas Advanced Computing Center) Lonestar HPC cluster: 80 cores (7 compute nodes, each with two hex-core CPUs)
• Run time: 24549 s (~6 hr 50 min)

GPU Runtime
• Hardware: a single NVIDIA Titan GPU: 2688 cores
• Run time: 11591 s (~3 hr 13 min)
• Speedup: 2.12x (one NVIDIA Titan vs. 7 Dell blade servers)
Significant Accomplishments
Implemented Green's function matrix generation on the GPU, using the closed-form analytical expressions by Okada
• Speedup of 80x over a single-core CPU
• Speedup of 7.3x over a 12-core CPU running 24 threads

Implemented matrix-vector multiply-accumulate on the GPU for calculating shear and normal stresses
• Speedup of 45-47x over single-core / 12-core CPU
• Note: the multi-threaded version of the matrix multiply shows no speedup over single core: the code is memory-bandwidth bound

500-year VC simulations: 32.3x faster than a 12-core CPU running 24 threads
• Speedup approaches 45x as the number of simulated years increases

30,000-year simulation runs 2.12x faster on a single GPU than on an HPC cluster with 7 compute nodes, each with two hex-core CPUs
Current and Future R&D
Year II Research: distributing computations across
• Multiple GPUs in a single compute node
• A cluster with GPU-enabled compute nodes

Perform test runs on:
• Texas Advanced Computing Center (TACC): Stampede
  Number of nodes: 128
  GPUs per node: 1 NVIDIA K20 GPU (2496 CUDA cores)
• NASA Ames: Pleiades
  Number of nodes: 64
  GPUs per node: 1 NVIDIA Tesla M2090 GPU (512 CUDA cores)
Thank You

Contact: [email protected]
www.optisyn.com