Accelerating Astronomy & Astrophysics in the New Era of Parallel Computing: GPUs, Phi and Cloud Computing

Eric Ford
Penn State
Department of Astronomy & Astrophysics
Institute for CyberScience
Center for Astrostatistics
Center for Exoplanets & Habitable Worlds

IAU General Assembly, Division B
August 7, 2015

Background: NASA

Why should Astronomers care about trends in Computing?

• Theory:
  – Increase resolution/particles in simulations
  – Add more physics to models
• Observation:
  – Analyze massive datasets
  – Rapid analysis of time-domain data
• Comparing theory & observations:
  – Explore high-dimensional parameter space
  – Statistics of rare events

Image credit: LSST Corp.

Why not just wait for faster Computers?

• “The Free Lunch is Over: A Fundamental Turn Towards Concurrency in Software” (Sutter 2005)
  – CPU clock frequency has not increased since 2004
  – Increased performance comes from increased parallelism
  – Work/time is often limited by memory access, not computing power

[Figures: clock frequency vs. year, with 2004 marked (image credit: HerbSutter.com); performance gap between processor and DRAM memory (Hennessy & Patterson 2012)]

A computing revolution is underway

“Supercomputing” power is very accessible
• GPU: ~7 Tflops (NVIDIA Titan X); cost: ~$1k
• Equivalent to #1 on TOP500 in 2004; cost: ~$10s of millions

How do we harness this power?
• Choose appropriate hardware & programming model
• Keep software development costs in check
• Produce code that is accurate, maintainable, fast & portable

Balancing development costs against code execution time

Development costs go up

Execution time goes down

Realizing Performance Requires New Programming Priorities

Old
• Hand optimize arithmetic
• Reduce number of floating point operations
• Write serial program, then parallelize

New
• Code simply so compilers can optimize arithmetic
• Reduce memory accesses (that result in cache misses)
• Design algorithm for parallel architectures from the start

How to Harness Parallelism?

• Parallel operations within a core (e.g., AVX)

• Multiple cores per CPU or workstation (shared memory)

• Cluster of compute nodes (distributed memory)

• Cloud (e.g., Amazon EC2, Google Compute Engine, Domino)

• Hardware accelerators (e.g., GPUs, Intel Xeon Phi)


Parallel Programming Patterns

Compiler Optimizations (low-level parallelism)
• Often just require recompiling with different flags or setting environment variables
• No reason not to take advantage of them

Libraries (e.g., FFT, Linear Algebra, N-body):
• Easy to use, maintain. Often well-optimized.
• Limited availability & difficult to customize


Multiple Trivial Options (i.e., use these first):
• Scripting (e.g., breaking into PBS jobs)
• Compiler directives (e.g., OpenMP, OpenACC)
• Parallel language features (e.g., parallel for, map)

Advantages:
• Higher-level parallelism w/ minimal effort
• Easy to get started, maintain
• Very efficient for parallelizing outer loops and “embarrassingly parallel” applications/functions
• Impose limitations beyond hardware
• Great option if a good fit for your problem
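As a hedged illustration of the "parallel for / map" option, the sketch below applies one function to many independent inputs with Python's standard concurrent.futures module. The function simulate_one and its toy workload are hypothetical stand-ins for a real per-system calculation. A thread pool is used only to keep the sketch portable; for CPU-bound pure-Python work, ProcessPoolExecutor (which offers the same map interface) is the usual substitute.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate_one(seed):
    """Hypothetical stand-in for one independent job (e.g., one planetary system)."""
    # Toy workload: a deterministic pseudo-random sequence driven by the seed.
    state = seed
    total = 0
    for _ in range(1000):
        state = (1103515245 * state + 12345) % 2**31
        total += state % 7
    return total


def run_all(seeds, workers=4):
    """Run every job independently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate_one, seeds))


if __name__ == "__main__":
    print(run_all(range(8)))
```

Because the jobs share no state, the parallel result should match a plain serial loop over the same inputs, which makes this pattern easy to check.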


Languages built for accelerators (e.g., CUDA, OpenCL)

• Moderate time investment to start, growing to major investment to write well-optimized code
• Often requires rewriting significant code to execute on GPU
• Efficient for parallelizing inner loops
• Optimization focuses on memory, cache, branching
• Can be challenging to debug & maintain
• Wise to involve computer scientists, early and perhaps throughout project


Parallel Template Libraries (e.g., Thrust)

• Easy to start (if you have experience with C++), can grow to become moderately complex
• Require rewriting some portion of code
• Relatively easy to maintain, often very well-optimized by compiler
• Efficient for parallelizing inner loops
• Highly customizable, runs on variety of hardware
• Cache, shared memory, block size, ... are transparent to programmer
• Programmer still controls host↔device transfers

GPU Computing is about High Throughput

• Clock frequency ~ velocity of stream
• Throughput ~ volumetric flow rate of a stream

[Images illustrating the analogy. Image credits: Poco a poco; NASA]
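The stream analogy can be put into numbers: just as flow rate is velocity times cross-section, peak arithmetic throughput is clock frequency times the number of parallel units times the operations each unit completes per cycle. The sketch below is a back-of-the-envelope estimate only; the function name peak_gflops and all of the hardware numbers are illustrative assumptions, not figures taken from the slides.

```python
def peak_gflops(n_units, clock_ghz, flops_per_cycle):
    """Peak throughput in GFLOPS: parallel units x clock (GHz) x flops per unit per cycle."""
    return n_units * clock_ghz * flops_per_cycle


# Illustrative, assumed numbers (hypothetical devices):
cpu = peak_gflops(n_units=4, clock_ghz=3.5, flops_per_cycle=8)     # few fast, wide cores
gpu = peak_gflops(n_units=3072, clock_ghz=1.0, flops_per_cycle=2)  # many slower, simple cores

print(cpu, gpu)
```

The device with the lower clock can still win on throughput by a large factor, provided the problem supplies enough independent work to keep every unit busy.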

What is in an Intel Core i7?

Most transistors aren’t doing computation!
Image credit: Intel Corp.

What is in a GPU?

• Core i7: ~70 GFLOPS
• GT200: 256 cores, ~240 GFLOPS
• GF100 “Fermi”: 512 cores, ~616 GFLOPS
• GK110 “Titan Black”: 2,880 cores, ~1.7 TFLOPS

Picture: GK104 w/ 1,536 cores, Streaming Multiprocessor Clusters (SMX)
Image credit: NVIDIA Corp.

Hardware Comparison

Traditional CPUs
• Optimized for fast single-threaded execution
• Often compute-bound
• Astronomer-programmers ignore memory hierarchy

Hardware Accelerators
• Optimized for high-throughput execution
• Typically memory-bound
• All programmers should pay attention to memory hierarchy
• Greater energy efficiency
• Opportunities/challenges


GPU Challenges
• Is the problem highly parallelizable?
• Does the algorithm require extensive branching?
• When a GPU kernel speeds up by >100x, other parts of the code become significant
• Minimizing CPU↔GPU data transfers
• Memory latency (i.e., want lots of computation on relatively few numbers)
• Identifying/designing data structures & algorithms that improve memory locality and maximize cache hits
• Increased development time/cost for complex GPU codes
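The point that other parts of the code become significant once a kernel is more than 100x faster is Amdahl's law. The sketch below evaluates it for a few cases; the function name overall_speedup and the chosen fractions are illustrative assumptions rather than measurements from any real code.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: speed-up of the whole run when only part of it is accelerated.

    accelerated_fraction is the share of the original run time spent in the
    kernel; kernel_speedup is how much faster that kernel becomes.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    remaining_time = (1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup
    return 1.0 / remaining_time


if __name__ == "__main__":
    for fraction in (0.90, 0.99, 0.999):
        print(fraction, round(overall_speedup(fraction, 100.0), 1))
```

With a 100x kernel that covers 90% of the original run time, the overall gain stays below 10x, because the untouched 10% (setup, I/O, host↔device transfers) now dominates.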

GPU Programming
• Parallelism & memory hierarchy are not transparent for the programmer!
  – Explicitly write code that can execute in hundreds to millions of threads running simultaneously
  – Place data in appropriate type of memory/cache
• Programmers need to design algorithms differently:
  – Choose algorithms that avoid large, complex branching
  – Memory access is key: memory hierarchy, latency, throughput, caches, coalescing, bank conflicts, etc.
  – Data structures matter
  – Unexpected bottlenecks
  – May need new/different algorithms to harness computational power

→ Programming GPUs is more difficult than programming traditional CPUs.
  – But you can do it!
  – Students deserve it, because…
  – All programming will become parallel programming

The Future of Computing Performance: Game Over or Next Level?

• “Invest in research in and development of algorithms that can exploit parallel processing”

• “Incorporate in computer science education an increased emphasis on parallelism, and use a variety of methods and approaches to better prepare students for computing resources that they will encounter in their careers.”

NRC & CSTB: S. Fuller et al. 2011

Swarm-NG: N-Body Integration on GPU

• Newton’s Laws of Motion
• Newton’s Law of Gravity
• Several integration algorithms, e.g., time-symmetric 4th-order Hermite

https://github.com/AstroGPU/swarm
Dindar et al. 2013; Nelson et al. 2014
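Hermite integrators need both the gravitational acceleration of each body and its time derivative, the jerk. With r_ij = x_j - x_i and v_ij = v_j - v_i, the acceleration on body i is the sum over j of G m_j r_ij / |r_ij|^3, and the jerk is the sum over j of G m_j [ v_ij / |r_ij|^3 - 3 (r_ij · v_ij) r_ij / |r_ij|^5 ]. The sketch below is a plain direct-sum version written for clarity, not the Swarm-NG implementation; the function name, the list-of-tuples layout, and the default G = 1 are assumptions.

```python
import math


def accelerations_and_jerks(masses, positions, velocities, G=1.0):
    """Direct-sum gravitational acceleration and jerk for every body.

    positions and velocities are sequences of (x, y, z) triples.
    G=1.0 assumes the caller has chosen suitable units.
    """
    n = len(masses)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    jerk = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a body exerts no force on itself
            dx = [positions[j][k] - positions[i][k] for k in range(3)]
            dv = [velocities[j][k] - velocities[i][k] for k in range(3)]
            r2 = sum(c * c for c in dx)                # |r_ij|^2
            r3 = r2 * math.sqrt(r2)                    # |r_ij|^3
            rv = sum(dx[k] * dv[k] for k in range(3))  # r_ij . v_ij
            for k in range(3):
                acc[i][k] += G * masses[j] * dx[k] / r3
                jerk[i][k] += G * masses[j] * (dv[k] / r3 - 3.0 * rv * dx[k] / (r3 * r2))
    return acc, jerk
```

The double loop over bodies is the part that a GPU code would spread across threads, since every pair interaction can be evaluated independently before the sums are accumulated.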

A Simple CPU Implementation

[Flowchart for one system (Sys 1): Initial conditions → Compute accelerations, jerks → Advance x,v,t → Store output; run for N timesteps]
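The flowchart steps (initial conditions, compute accelerations and jerks, advance x,v,t, store output, repeat for N timesteps) can be sketched as a serial loop. This is a minimal sketch, not the Swarm-NG code: it assumes a single test particle orbiting a fixed point mass with GM = 1, and it uses a plain predict-evaluate-correct 4th-order Hermite step rather than the time-symmetric variant. The names acc_jerk, hermite_step, integrate and the store_every parameter are hypothetical.

```python
import math

GM = 1.0  # assumed units: gravitational constant times central mass


def acc_jerk(x, v):
    """Acceleration and jerk of a test particle around a fixed point mass at the origin."""
    r2 = sum(c * c for c in x)
    r3 = r2 * math.sqrt(r2)
    rv = sum(xc * vc for xc, vc in zip(x, v))
    a = [-GM * xc / r3 for xc in x]
    j = [-GM * (vc / r3 - 3.0 * rv * xc / (r3 * r2)) for xc, vc in zip(x, v)]
    return a, j


def hermite_step(x, v, dt):
    """One predict-evaluate-correct step of the 4th-order Hermite scheme."""
    a0, j0 = acc_jerk(x, v)
    # Predict with a Taylor expansion using acceleration and jerk.
    xp = [x[k] + v[k] * dt + a0[k] * dt**2 / 2 + j0[k] * dt**3 / 6 for k in range(3)]
    vp = [v[k] + a0[k] * dt + j0[k] * dt**2 / 2 for k in range(3)]
    # Evaluate the forces at the predicted state.
    a1, j1 = acc_jerk(xp, vp)
    # Correct using both the old and the new accelerations and jerks.
    v1 = [v[k] + (a0[k] + a1[k]) * dt / 2 + (j0[k] - j1[k]) * dt**2 / 12 for k in range(3)]
    x1 = [x[k] + (v[k] + v1[k]) * dt / 2 + (a0[k] - a1[k]) * dt**2 / 12 for k in range(3)]
    return x1, v1


def integrate(x, v, dt, n_steps, store_every):
    """Run for n_steps timesteps, storing (t, x, v) every store_every steps."""
    output = [(0.0, list(x), list(v))]  # initial conditions
    for step in range(1, n_steps + 1):
        x, v = hermite_step(x, v, dt)   # advance x, v (t follows from the step count)
        if step % store_every == 0:
            output.append((step * dt, x, v))
    return output


def energy(x, v):
    """Specific orbital energy, useful as an accuracy check."""
    return 0.5 * sum(c * c for c in v) - GM / math.sqrt(sum(c * c for c in x))


if __name__ == "__main__":
    out = integrate([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], dt=0.01, n_steps=628, store_every=100)
    print(len(out), energy(out[-1][1], out[-1][2]))
```

Checking a conserved quantity such as the energy is a cheap way to confirm that a port to new hardware still produces accurate results.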

CPU parallelization (trivial: run N = Ncores jobs at a time)

[Diagram: four independent copies of the single-system flowchart, one per core (Sys 1 to Sys 4), each with its own initial conditions, advance x,v,t, store output, and loop over N timesteps]

First Generation GPU parallelization (slightly nontrivial: N ~ 16 × Ncores jobs in 1 process)

• GPU as coprocessor
• All cores run the same program
• Easy to parallelize as 1 system/thread
• Pack N > 4,000 jobs into 1 process to hide memory latency & achieve large speed-ups
• Simple integration algorithms

Workflow: Initial conditions for N systems → Upload to GPU → Store intermediate output in GPU memory → Download results from GPU to CPU [to disk] → Analyze results (currently with CPU)


[Diagram: systems 1 to N are distributed across multiprocessors MP 1 to MP NMP, M systems per multiprocessor (Sys 1 to M on MP 1, Sys M+1 to 2M on MP 2, Sys 2M+1 to 3M on MP 3, ...); for each system, one thread computes accelerations and jerks and advances x,v,t]

High Performance GPU parallelization (challenging: N ~ few × Ncores jobs in 1 process)

• All multiprocessors (MPs) run the same program, each for M systems
• Parallelize each step optimally, for finer parallelization
• Achieve large speed-ups with only >~256 jobs
• More complex integration algorithms

Workflow: Initial conditions for N systems → Upload to GPU → Store intermediate output in GPU memory → Download results from GPU to CPU [to disk] → Analyze results (currently with CPU)


[Diagram: each multiprocessor (MP 1 to MP NMP) handles a block of M systems (Sys 1 to M on MP 1, Sys M+1 to 2M on MP 2, Sys 2M+1 to 3M on MP 3, ...); within a block, threads load x, v, t, then advance and store x,v,t, with synchronization (Sync) between steps]
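The layout in this diagram, with M systems per multiprocessor and several threads cooperating on each system, comes down to index arithmetic. The sketch below shows one possible mapping in plain Python. It is an illustration only: indices are 0-based (the slides count from 1), the choice of one thread per body pair is an assumption about what "finer parallelization" could look like, and the function names are hypothetical.

```python
import math


def systems_per_multiprocessor(n_systems, n_mps):
    """Split systems 0..n_systems-1 into contiguous blocks of M, one block per MP."""
    m = math.ceil(n_systems / n_mps)
    return [list(range(k * m, min((k + 1) * m, n_systems))) for k in range(n_mps)]


def thread_task(thread_id, n_bodies):
    """Map a thread index within a block to (local system, body i, body j).

    Assumes one thread per unordered body pair, so each system uses
    n_bodies * (n_bodies - 1) / 2 consecutive threads.
    """
    pairs = [(i, j) for i in range(n_bodies) for j in range(i + 1, n_bodies)]
    local_system, pair_index = divmod(thread_id, len(pairs))
    i, j = pairs[pair_index]
    return local_system, i, j


if __name__ == "__main__":
    print(systems_per_multiprocessor(10, 4))
    print(thread_task(4, 3))
```

Because several threads now contribute to one system, their partial results must be combined before the state is advanced, which is where the synchronization points in the diagram come in.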

Engineers Developing Highly Parallel Codes

Pros:
• Understand capabilities of hardware & languages
• Bring experience in software development
• Intrinsically interested in developing optimized algorithms

Cons:
• Limited knowledge of astronomy & astrophysics
• Continued training required as codes become more realistic and more astrophysics becomes relevant
• Concern that Computer Science may not reward efforts
• May consider our science as “just a job”
• Often get “real jobs” before finishing our projects

Astronomers Developing Highly Parallel Codes

Pros:
• Understand broader context of tasks & goals
• Focus on scientifically interesting problems
• Avoid spending lots of time on details

Cons:
• Limited training & experience with software development
• Significant time investment to become proficient
• Further time required to keep up with rapidly evolving technology
• Concern that Astronomy may not reward efforts and/or value expertise and skills acquired

GPU Applications to Astrophysics

Questions?

Illustration Credit: Lynette Cook
