Parallel Application Scaling, Performance, and Efficiency David Skinner NERSC/LBL.
Date posted: 19-Dec-2015
Parallel Scaling of MPI Codes
A practical talk on using MPI with focus on:
• Distribution of work within a parallel program
• Placement of computation within a parallel computer
• Performance costs of different types of communication
• Understanding scaling performance terminology
Scale: Practical Importance
Time required to compute the NxN matrix product C=A*B
Assuming you can address 64GB from
one task, can you wait a month?
How to balance computational goal
vs. compute resources?
Choose the right scale!
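A rough sketch of the arithmetic behind this slide. It assumes dense double-precision matrices (8 bytes per element), all three N x N matrices resident in memory at once, and the classical 2*N^3 operation count. The function names and the sustained flop rate passed to `matmul_seconds` are illustrative assumptions, not values from the talk.

```c
#include <assert.h>

/* Bytes needed to hold A, B and C as dense N x N matrices of 8-byte doubles. */
unsigned long long matmul_bytes(unsigned long long n) {
    return 3ULL * 8ULL * n * n;
}

/* Operation count of the classical algorithm: about N^3 multiplies plus N^3 adds. */
double matmul_flops(double n) {
    return 2.0 * n * n * n;
}

/* Largest N such that three N x N double matrices fit in mem_bytes.
   Integer binary search, so no floating-point square root is needed. */
unsigned long long matmul_max_n(unsigned long long mem_bytes) {
    unsigned long long budget = mem_bytes / 24ULL; /* elements allowed per matrix */
    unsigned long long lo = 0, hi = 1ULL << 31;    /* hi*hi still fits in 64 bits */
    while (lo < hi) {
        unsigned long long mid = lo + (hi - lo + 1) / 2;
        if (mid * mid <= budget) lo = mid; else hi = mid - 1;
    }
    return lo;
}

/* Estimated run time at an assumed sustained rate of flops_per_sec. */
double matmul_seconds(double n, double flops_per_sec) {
    assert(flops_per_sec > 0.0);
    return matmul_flops(n) / flops_per_sec;
}
```

With 64 GB taken as 64 * 2^30 bytes, `matmul_max_n` gives N = 53509. At an assumed sustained 1 Gflop/s, `matmul_seconds(53509.0, 1e9)` is roughly 3.5 days, and doubling N multiplies the run time by eight.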
Let’s jump to an example
• Sharks and Fish II : N^2 parallel force evaluation
• e.g. 4 CPUs evaluate force for 125 fish
• Domain decomposition: Each CPU is “in charge” of ~31 fish, but keeps a fairly recent copy of all the fishes’ positions (replicated data)
• It is not possible to uniformly decompose problems in general, especially in many dimensions
• This toy problem is simple, has fine granularity and is 2D
• Let’s see how it scales
Fish per CPU: 31 31 31 32
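The 31/31/31/32 split for 125 fish on 4 CPUs can be reproduced with a simple block decomposition. This is a minimal sketch, assuming the remainder goes to the highest-numbered ranks (which matches the counts shown). `block_count` and `block_start` are hypothetical helper names, not functions from the talk's code.

```c
#include <assert.h>

/* Number of items owned by `rank` when n_items are split as evenly as
   possible over n_tasks. Leftover items go to the highest-numbered ranks. */
int block_count(int n_items, int n_tasks, int rank) {
    assert(n_items >= 0 && n_tasks > 0 && rank >= 0 && rank < n_tasks);
    int base = n_items / n_tasks;
    int extra = n_items % n_tasks;
    return base + (rank >= n_tasks - extra ? 1 : 0);
}

/* Index of the first item owned by `rank` (sum of the counts of lower ranks). */
int block_start(int n_items, int n_tasks, int rank) {
    int start = 0;
    for (int r = 0; r < rank; ++r) {
        start += block_count(n_items, n_tasks, r);
    }
    return start;
}
```

Counts and starting offsets of this kind are what the receive-count and displacement arrays of a call such as MPI_Allgatherv would hold when every task keeps a replicated copy of all fish.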
Sharks and Fish II : Program
Data:
n_fish global
my_fish local
fish_i = {x, y, …}
Dynamics:
F = ma
…
V = Σ 1/r_ij
dq/dt = p / m
dp/dt = -dV/dq
MPI_Allgatherv(myfish_buf, len[rank], MPI_FishType…)
for (i = 0; i < my_fish; ++i) {
  for (j = 0; j < n_fish; ++j) {   // i != j
    a_i += g * mass_j * (fish_i - fish_j) / r_ij
  }
}
Move fish
Sharks and Fish II: How fast?
• 100 fish can move 1000 steps in
1 task 5.459s
32 tasks 2.756s (x 1.98 speedup)
• 1000 fish can move 1000 steps in
1 task 511.14s
32 tasks 20.815s (x 24.6 speedup)
• So what’s the “best” way to run?
–How many fish do we really have?
–How large a computer (time) do we have?
–How quickly do we need the answer?
Scaling: Good 1st Step: Do runtimes make sense?
[Plot legend: 1 task … 32 tasks]
Running fish_sim for 100-1000 fish on 1-32 CPUs we see
time ~ fish^2
Scaling: Walltimes
Walltime is (all) important, but let’s look at some other scaling metrics
Each line is a contour describing the computations doable in a given time
Scaling: terminology
• Scaling studies involve changing the degree of parallelism. Will we be changing the problem also?
–Strong scaling: fixed problem size
–Weak scaling: problem size grows with additional compute resources
• How do we measure success in parallel scaling?
–Speedup = Ts/Tp(n)
–Efficiency = Ts/(n*Tp(n))
Multiple definitions exist!
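The two definitions above are short enough to write down directly. This is a minimal sketch using the serial-time definitions given on this slide. The function names are illustrative, and other definitions (for example relative to the smallest parallel run) would give different numbers.

```c
#include <assert.h>

/* Speedup = Ts / Tp(n): serial run time over the run time on n tasks. */
double speedup(double t_serial, double t_parallel) {
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Efficiency = Ts / (n * Tp(n)): speedup per task; 1.0 means ideal scaling. */
double efficiency(double t_serial, double t_parallel, int n_tasks) {
    assert(n_tasks > 0);
    return speedup(t_serial, t_parallel) / n_tasks;
}
```

Applied to the Sharks and Fish timings quoted earlier, 1000 fish on 32 tasks (511.14 s serial, 20.815 s parallel) give a speedup of about 24.6 and an efficiency of about 0.77. For 100 fish (5.459 s and 2.756 s) the speedup is about 1.98 and the efficiency about 0.06.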
Scaling: Efficiencies
Remarkably smooth! Often algorithm and architecture make the efficiency landscape quite complex
Scaling: Analysis
Why does efficiency drop?
–Serial code sections: Amdahl’s law
–Surface to volume: communication bound
–Algorithm complexity or switching
–Communication protocol switching
–Scalability of computer and interconnect
Whoa!
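The first item, Amdahl's law, has a simple closed form: if a fraction s of the run time is serial, the speedup on n tasks is at most 1/(s + (1 - s)/n). The sketch below assumes that standard textbook form; the function names are illustrative and not from the talk.

```c
#include <assert.h>

/* Amdahl's law: upper bound on speedup with n_tasks tasks when
   serial_fraction of the single-task run time cannot be parallelized. */
double amdahl_speedup(double serial_fraction, int n_tasks) {
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(n_tasks > 0);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_tasks);
}

/* Limit of the bound as the task count grows without bound: 1 / serial_fraction. */
double amdahl_limit(double serial_fraction) {
    assert(serial_fraction > 0.0 && serial_fraction <= 1.0);
    return 1.0 / serial_fraction;
}
```

For example, with 5% serial code `amdahl_speedup(0.05, 32)` is about 12.5, and no task count can push the speedup past `amdahl_limit(0.05)`, which is 20.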
Scaling: Analysis
• In general, changing problem size and concurrency expose or remove compute resources. Bottlenecks shift.
• In general, first bottleneck wins.
• Scaling brings additional resources too.
–More CPUs (of course)
–More cache(s)
–More memory BW in some cases
Strong Scaling: Communication Bound
64 tasks, 52% comm
192 tasks, 66% comm
768 tasks, 79% comm
MPI_Allreduce buffer size is 32 bytes.
Q: What resource is being depleted here?
A: Small message latency
1) Compute per task is decreasing
2) Synchronization rate is increasing
3) Surface:Volume ratio is increasing
Sharks and Atoms:
At HPC centers like NERSC, fish are rarely modeled as point masses. The associated algorithms and their scalings are nonetheless of great practical importance for scientific problems.
Particle Mesh Ewald MatSci computation:
600s @ 256-way or
80s @ 1024-way
Topics
• Load Balance
• Synchronization
• Simple stuff
• File I/O
Now, instead of looking at the scaling of specific applications, let’s look at general issues in parallel application scalability
Load Balance : Application Cartoon
Universal App
Unbalanced:
Balanced:
Time saved by load balance
Will define synchronization later
Load Balance : performance data
MPI ranks sorted by total communication time
Communication Time: 64 tasks show 200s, 960 tasks show 230s
Load Balance : analysis
• The 64 slow tasks (with more compute work) cause 30 seconds more “communication” in 960 tasks
• This leads to 28800 CPU*seconds (8 CPU*hours) of unproductive computing
• All load imbalance requires is one slow task and a synchronizing collective!
• Pair problem size and concurrency well.
• Parallel computers allow you to waste time faster!
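The 28800 CPU*seconds figure above is just the extra wait multiplied by the number of waiting tasks. The sketch below reproduces that arithmetic and adds a hypothetical helper that totals the idle time from a list of per-task compute times. It assumes every task waits at a synchronizing collective for the slowest one, and all names are illustrative.

```c
#include <assert.h>

/* CPU-seconds lost when n_waiting_tasks each sit idle for extra_wait_seconds. */
double wasted_cpu_seconds(double extra_wait_seconds, int n_waiting_tasks) {
    assert(extra_wait_seconds >= 0.0 && n_waiting_tasks >= 0);
    return extra_wait_seconds * n_waiting_tasks;
}

double cpu_seconds_to_hours(double cpu_seconds) {
    return cpu_seconds / 3600.0;
}

/* Total idle CPU-seconds for n tasks with the given compute times, assuming
   all of them wait for the slowest task before the collective completes. */
double idle_cpu_seconds(const double *compute_seconds, int n) {
    assert(n > 0);
    double slowest = compute_seconds[0];
    for (int i = 1; i < n; ++i) {
        if (compute_seconds[i] > slowest) slowest = compute_seconds[i];
    }
    double idle = 0.0;
    for (int i = 0; i < n; ++i) {
        idle += slowest - compute_seconds[i];
    }
    return idle;
}
```

`wasted_cpu_seconds(30.0, 960)` gives 28800, and `cpu_seconds_to_hours(28800.0)` gives 8, matching the numbers on the slide.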
Load Balance: Summary
•Imbalance is most often a byproduct of data decomposition
•Must be addressed before further MPI tuning can happen
•Good software exists for graph partitioning / remeshing
•For regular grids consider padding or contracting
Synchronization: terminology
MPI_Barrier(MPI_COMM_WORLD);
T1 = MPI_Wtime();
e.g. MPI_Allreduce();
T2 = MPI_Wtime()-T1;
• For a code running on N tasks what is the distribution of the T2’s?
• The average and width of this distribution tell us how synchronizing e.g. MPI_Allreduce is relative to some given interconnect. (HW & SW)
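Once each task has measured its own T2, the average and width of the distribution are a small reduction over the values. The sketch below assumes the T2 values have already been collected into one array on a single task (for instance with MPI_Gather) and takes "width" to mean max minus min, which is only one possible choice. The type and function names are hypothetical.

```c
#include <assert.h>

/* Summary of a set of per-task timings. */
typedef struct {
    double min;
    double max;
    double mean;
} timing_stats;

/* Compute min, max and mean of n timing samples (seconds). */
timing_stats summarize_timings(const double *t2, int n) {
    assert(n > 0);
    timing_stats s = { t2[0], t2[0], 0.0 };
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        if (t2[i] < s.min) s.min = t2[i];
        if (t2[i] > s.max) s.max = t2[i];
        sum += t2[i];
    }
    s.mean = sum / n;
    return s;
}

/* Width of the distribution, taken here as the full range (max - min). */
double timing_width(timing_stats s) {
    return s.max - s.min;
}
```

A narrow width relative to the mean suggests all tasks leave the call at about the same time; a long tail suggests some tasks are held up by others or by the machine.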
How synchronizing is MPI_Allreduce?
Synchronization : MPI Functions
Completion semantics of MPI functions
• Local : leave based on local logic
– MPI_Comm_rank, MPI_Get_count
• Probably Local : try to leave w/o messaging other tasks
– MPI_Isend/Irecv
• Partially synchronizing : leave after messaging M<N tasks
– MPI_Bcast, MPI_Reduce
• Fully synchronizing : leave after everyone else enters
– MPI_Barrier, MPI_Allreduce
seaborg.nersc.gov
• It’s hard to discuss synchronization outside the context of a particular parallel computer
• MPI timings depend on HW, SW, and environment
–How much of MPI is handled by the switch adapter?
–How big are messaging buffers?
–How many thread locks per function?
–How noisy is the machine (today)?
• This is hard to model, so take an empirical approach based on an IBM SP which is largely applicable to other clusters…
[Diagram: Colony switches, GPFS]
seaborg.nersc.gov basics
Resource         Speed     Bytes
Registers        3 ns      2560 B
L1 Cache         5 ns      32 KB
L2 Cache         45 ns     8 MB
Main Memory      300 ns    16 GB
Remote Memory    19 us     7 TB
GPFS             10 ms     50 TB
HPSS             5 s       9 PB
•6080 dedicated CPUs, 96 shared login CPUs
•Hierarchy of caching, speeds not balanced
•Bottleneck determined by first depleted resource
[Diagram: IBM SP with 16-way SMP NHII nodes, main memory, adapters CSS0 and CSS1, Colony switches, GPFS and HPSS; annotation "380 x"]
MPI on the IBM SP
•2-4096 way concurrency
•MPI-1 and ~MPI-2
•GPFS aware MPI-IO
•Thread safety
•Ranks on same node bypass the switch
MPI: seaborg.nersc.gov
Intra and Inter Node Communication
MP_EUIDEVICE (fabric)    Bandwidth (MB/sec)    Latency (usec)
css0                     500 / 350             9 / 21
css1                     X                     X
csss                     500 / 350             9 / 21
•Lower latency can satisfy more syncs/sec
•What is the benefit of two adapters?
•This is for a single pair of tasks
Seaborg : point to point messaging
[Diagram: intranode vs. internode messaging between 16-way SMP NHII nodes]
Switch BW and latency are often stated in optimistic terms. The number and size of concurrent messages change things.
A fat tree / crossbar switch helps hide this.
Inter-Node Bandwidth
[Plot legend: csss, css0]
• Tune message size to optimize throughput
• Aggregate messages when possible
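Why aggregation helps can be seen with a simple cost model in which each message costs a fixed latency plus its size divided by the bandwidth. This model is an assumption made for illustration, not something measured in the talk. It also ignores contention between concurrent messages, and it treats 1 MB as 10^6 bytes. The function names are hypothetical.

```c
#include <assert.h>

/* Time in seconds for one message of `bytes` bytes under the assumed
   model: latency + bytes / bandwidth. */
double message_seconds(double latency_usec, double bandwidth_mb_per_sec, double bytes) {
    assert(latency_usec >= 0.0 && bandwidth_mb_per_sec > 0.0 && bytes >= 0.0);
    return latency_usec * 1e-6 + bytes / (bandwidth_mb_per_sec * 1e6);
}

/* Time to send `count` separate messages of bytes_each bytes, one after another. */
double many_messages_seconds(double latency_usec, double bandwidth_mb_per_sec,
                             int count, double bytes_each) {
    assert(count >= 0);
    return count * message_seconds(latency_usec, bandwidth_mb_per_sec, bytes_each);
}

/* Time to send the same payload as a single aggregated message. */
double aggregated_seconds(double latency_usec, double bandwidth_mb_per_sec,
                          int count, double bytes_each) {
    assert(count >= 0);
    return message_seconds(latency_usec, bandwidth_mb_per_sec, count * bytes_each);
}
```

Reading the 21 usec and 350 MB/sec entries of the table above as the inter-node figures (an interpretation, since the slide does not label the pairs), 1000 separate 32-byte messages cost roughly 21 ms under this model, while one aggregated 32000-byte message costs roughly 0.11 ms. Nearly all of the difference is latency paid 1000 times instead of once.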
MPI Performance is often Hierarchical
message size and task placement are key to performance
[Plot labels: intra-node, inter-node]
MPI: Latency not always 1 or 2 numbers
The set of all possible latencies describes the interconnect geometry from the application perspective
Synchronization: measurement
MPI_Barrier(MPI_COMM_WORLD);
T1 = MPI_Wtime();
e.g. MPI_Allreduce();
T2 = MPI_Wtime()-T1;
How synchronizing is MPI_Allreduce?
For a code running on N tasks what is the distribution of the T2’s?
One can derive the level of synchronization from MPI algorithms.
Instead let’s just measure …
Synchronization: MPI Collectives
2048 tasks
Beyond load balance there is a distribution on MPI timings intrinsic to the MPI Call
Synchronization: Architecture
t is the frequency
kernel process scheduling
Unix : cron et al.
…and from the machine itself
Synchronization : Summary
• As a programmer you can control
–Which MPI calls you use (it’s not required to use them all).
–Message sizes, problem size (maybe)
–The temporal granularity of synchronization, i.e., where synchronization occurs.
• Language writers and system architects control
–How hard it is to do the above
–The intrinsic amount of noise in the machine
Simple Stuff
Parallel programs are easier to mess up than serial ones. Here are some common
pitfalls.
MPI_Barrier
Is MPI_Barrier time bad? Probably. Is it avoidable?
~three cases:
1) The stray / unknown / debug barrier
2) The barrier which is masking compute balance
3) Barriers used for I/O ordering
Often very easy to fix
Parallel File I/O: Metadata
• A parallel file system is great, but it is also another place to create contention.
• Avoid unneeded disk I/O, know your file system
• Often avoid file per task I/O strategies when running at scale
Other sources of information:
• MPI Performance: http://www-unix.mcs.anl.gov/mpi/tutorial/perf/mpiperf/
• Seaborg MPI Scaling: http://www.nersc.gov/news/reports/technical/seaborg_scaling/
• MPI Synchronization : Fabrizio Petrini, Darren J. Kerbyson, Scott Pakin, "The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q", in Proc. SuperComputing, Phoenix, November 2003.
• Domain decomposition:
http://www.ddm.org/
google://”space filling”&”decomposition” etc.
• Metis : http://www-users.cs.umn.edu/~karypis/metis