Techniques for Developing Efficient Petascale Applications
![Page 1: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/1.jpg)
Techniques for Developing Efficient Petascale Applications
Laxmikant (Sanjay) Kale
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana Champaign
![Page 2: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/2.jpg)
Outline
• Basic techniques for attaining good performance
• Scalability analysis of algorithms
• Measurements and tools
• Communication optimizations:
  – Communication basics
  – Overlapping communication and computation
  – Alpha-beta optimizations
    • Combining and pipelining
  – (Topology-awareness)
• Sequential optimizations
• (Load balancing)
Performance Techniques 04/21/23
![Page 3: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/3.jpg)
Parallel Objects,
Adaptive Runtime System
Libraries and Tools
Examples based on multiple applications:
Molecular Dynamics
Crack Propagation
Space-time meshes
Computational Cosmology
Rocket Simulation
Protein Folding
Dendritic Growth
Quantum Chemistry (QM/MM)
![Page 4: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/4.jpg)
Analyze performance with both simple and sophisticated tools
![Page 5: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/5.jpg)
Simple techniques
• Timers: wall-clock timer (time.h)
• Counters: use the PAPI library's raw counters, ...
  – Especially useful:
    • Number of floating-point operations
    • Cache misses (L2, L1, ...)
    • Memory accesses
• Output method:
  – “printf” (or cout) can be expensive
  – Store timer values into an array or buffer, and print at the end
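The "store timer values in a buffer, print at the end" advice can be sketched as follows. This is a minimal Python illustration, not code from the talk: `TimerLog`, `mark`, and `run_steps` are hypothetical names, and `time.perf_counter` stands in for the wall timer.

```python
import time


class TimerLog:
    """Preallocated buffer of wall-clock samples; nothing is printed while timing."""

    def __init__(self, capacity):
        self.samples = [0.0] * capacity
        self.labels = [0] * capacity
        self.count = 0

    def mark(self, label):
        # Store the sample only; drop it if the buffer is already full.
        if self.count < len(self.samples):
            self.samples[self.count] = time.perf_counter()
            self.labels[self.count] = label
            self.count += 1

    def intervals(self):
        # Elapsed time between consecutive marks, computed after the run.
        return [(self.labels[i], self.samples[i] - self.samples[i - 1])
                for i in range(1, self.count)]


def run_steps(num_steps, log):
    total = 0
    log.mark(-1)  # start marker
    for step in range(num_steps):
        total += sum(i * i for i in range(1000))  # stand-in for real work
        log.mark(step)
    return total


if __name__ == "__main__":
    log = TimerLog(capacity=16)
    run_steps(5, log)
    for step, seconds in log.intervals():  # all output happens at the end
        print(f"step {step}: {seconds * 1e6:.1f} us")
```

The buffer is sized up front so that recording a sample costs one clock read and two stores, with no formatting or I/O inside the timed region.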
![Page 6: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/6.jpg)
Sophisticated Tools
• Many tools exist
• Have some learning curve, but can be beneficial
• Example tools:
  – Jumpshot
  – TAU
  – Scalasca
  – Projections
  – Vampir ($$)
• PMPI interface:
  – Allows you to intercept MPI calls
    • So you can write your own tools
  – PMPI interface for Projections:
    • git://charm.cs.uiuc.edu/PMPI_Projections.git
![Page 7: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/7.jpg)
Example: Projections Performance Analysis Tool
• Automatic instrumentation via runtime
• Graphical visualizations
• More insightful feedback
  – because the runtime understands application events better
![Page 8: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/8.jpg)
Exploit sophisticated performance analysis tools
• We use a tool called “Projections”
• Many other tools exist
• Need for scalable analysis
• A not-so-recent example:
  – Trying to identify the next performance obstacle for NAMD
    • Running on 8192 processors, with a 92,000-atom simulation
    • Test scenario: without PME
    • Time is 3 ms per step, but the lower bound is 1.6 ms or so
![Page 9: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/9.jpg)
![Page 10: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/10.jpg)
![Page 11: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/11.jpg)
![Page 12: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/12.jpg)
Performance Tuning with Patience and Perseverance
![Page 13: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/13.jpg)
Performance Tuning with Perseverance
• Recognize the multi-dimensional nature of the performance space
• Don’t stop optimizing until you know for sure why it cannot be sped up further
  – Measure, measure, measure ...
![Page 14: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/14.jpg)
Shallow valleys, high peaks, nicely overlapped PME
Green: communication
Red: integration
Blue/Purple: electrostatics
Turquoise: angle/dihedral
Orange: PME
94% efficiency
Apo-A1, on BlueGene/L, 1024 procs
Charm++’s “Projections” analysis tool
Time intervals on x axis, activity added across processors on y axis
![Page 15: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/15.jpg)
Cray XT3, 512 processors: initial runs
Clearly, this needed further tuning, especially PME.
But it had more potential (much faster processors).
76% efficiency
![Page 16: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/16.jpg)
On Cray XT3, 512 processors: after optimizations
96% efficiency
![Page 17: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/17.jpg)
Communication Issues
![Page 18: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/18.jpg)
Recap: Communication Basics: Point-to-point
Sending processor
Sending co-processor
Network
Receiving co-processor
Receiving processor
Each component has a per-message cost and a per-byte cost
Each cost, for an n-byte message = α + n β
Important metrics:
• Overhead at processor, co-processor
• Network latency
• Network bandwidth consumed
• Number of hops traversed
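The per-component cost model (a per-message cost plus a per-byte cost for an n-byte message) can be written down directly. The sketch below is illustrative only: the function names are hypothetical, the numeric constants are placeholders rather than measurements, and summing the components assumes the stages do not overlap, which real networks often violate.

```python
def component_cost(alpha, beta, n_bytes):
    """Cost of one component for an n-byte message: alpha + n * beta."""
    return alpha + n_bytes * beta


def end_to_end_cost(components, n_bytes):
    """Sum the costs of every component on the path (assumes no overlap)."""
    return sum(component_cost(alpha, beta, n_bytes)
               for alpha, beta in components.values())


# Hypothetical (alpha seconds, beta seconds/byte) pairs, for illustration only.
EXAMPLE_COMPONENTS = {
    "sending processor": (1.0e-6, 0.2e-9),
    "sending co-processor": (0.5e-6, 0.5e-9),
    "network": (2.0e-6, 1.0e-9),
    "receiving co-processor": (0.5e-6, 0.5e-9),
    "receiving processor": (1.0e-6, 0.2e-9),
}

if __name__ == "__main__":
    for size in (8, 1024, 1024 * 1024):
        cost = end_to_end_cost(EXAMPLE_COMPONENTS, size)
        print(f"{size:>8} bytes: {cost * 1e6:.2f} us")
```

For small messages the alpha terms dominate the total; for large messages the beta terms do, which is the distinction the later optimizations rely on.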
![Page 19: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/19.jpg)
Communication Basics
• Message Latency: time between the application sending the message and receiving it on the other processor
• Send overhead: time for which the sending processor was “occupied” with the message
• Receive overhead: the time for which the receiving processor was “occupied” with the message
• Network latency
![Page 20: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/20.jpg)
Communication: Diagnostic Techniques
• A simple technique: (find “grainsize”)
  – Count the number of messages per second of computation per processor (max, average)
  – Count the number of bytes
  – Calculate: computation per message (and per byte)
• Use profiling tools:
  – Identify time spent in different communication operations
  – Classified by modules
• Examine idle time using time-line displays
  – On important processors
  – Determine the causes
• Be careful with “synchronization overhead”
  – May be load imbalance masquerading as sync overhead
  – Common mistake
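The simple grainsize diagnostic (messages per second of computation, and computation per message and per byte) amounts to a few ratios. A minimal sketch, assuming you have already collected per-processor compute time, message counts, and byte counts; `grainsize_report` and its dictionary keys are hypothetical names.

```python
def grainsize_report(per_proc):
    """per_proc: list of (compute_seconds, messages_sent, bytes_sent), one per processor."""
    rates = [msgs / comp for comp, msgs, _ in per_proc]
    total_comp = sum(comp for comp, _, _ in per_proc)
    total_msgs = sum(msgs for _, msgs, _ in per_proc)
    total_bytes = sum(nbytes for _, _, nbytes in per_proc)
    return {
        "max_msgs_per_compute_second": max(rates),
        "avg_msgs_per_compute_second": sum(rates) / len(rates),
        "compute_seconds_per_message": total_comp / total_msgs,
        "compute_seconds_per_byte": total_comp / total_bytes,
    }


if __name__ == "__main__":
    # Made-up counts for two processors, for illustration only.
    report = grainsize_report([(2.0, 1000, 64000), (2.0, 3000, 192000)])
    for name, value in report.items():
        print(f"{name}: {value:.6g}")
```

Comparing "compute seconds per message" against the per-message overhead of the machine gives a quick indication of whether the grainsize is too small.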
![Page 21: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/21.jpg)
Communication: Problems and Issues
• Too small a grainsize
  – Total computation time / total number of messages
  – Separated by phases, modules, etc.
• Too many, but short messages
  – α vs. β tradeoff
• Processors wait too long
• Later:
  – Locality of communication
    • Local vs. non-local
    • How far is non-local? (Does that matter?)
  – Synchronization
  – Global (collective) operations
    • All-to-all operations, gather, scatter
• We will focus on communication cost (grainsize)
![Page 22: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/22.jpg)
Communication: Solution Techniques
• Overview:
  – Overlap with computation
    • Manual
    • Automatic and adaptive, using virtualization
  – Increasing grainsize
  – α-reducing optimizations
    • Message combining
    • Communication patterns
  – Controlled pipelining
  – Locality enhancement: decomposition control
    • Local-remote and bandwidth reduction
  – Asynchronous reductions
  – Improved collective-operation implementations
![Page 23: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/23.jpg)
Overlapping Communication-Computation
• Problem:
  – Processors wait for too long at “receive” statements
• Idea:
  – Instead of waiting for data, do useful work
  – Issue: How to create such work?
    • Can’t depend on the data to be received
• Routine communication optimizations in MPI
  – Move sends up and receives down
    • Keep data dependencies in mind
  – Moving receive down has a cost: system needs to buffer message
    • Use irecvs, but be careful
    • irecv allows you to post a buffer for a recv, but not wait for it
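The "post the receive early, do independent work, wait late" pattern can be imitated without MPI. The sketch below is an analogy only: a background thread with a sleep stands in for a message in flight, `fake_receive`, `interior_work`, and `step_with_overlap` are hypothetical names, and no MPI calls are involved.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_receive(delay_s, payload):
    """Stand-in for a message in flight: returns the payload after a delay."""
    time.sleep(delay_s)
    return payload


def interior_work(values):
    """Work that does not depend on the incoming data."""
    return [0.5 * v for v in values]


def step_with_overlap(local_values, delay_s=0.05):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Analogous to posting a non-blocking receive early.
        pending = pool.submit(fake_receive, delay_s, [1.0, 2.0])
        # Useful work proceeds while the "message" is in flight.
        interior = interior_work(local_values)
        # Analogous to the wait: block only when the data is needed.
        halo = pending.result()
    # Only this part depends on the received data.
    boundary = [h + interior[0] for h in halo]
    return interior, boundary


if __name__ == "__main__":
    print(step_with_overlap([2.0, 4.0]))
```

The key design point mirrors the bullets above: the work placed between posting and waiting must not read the data being received.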
![Page 24: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/24.jpg)
Major analytical/theoretical techniques
• Typically involves simple algebraic formulas, and ratios
  – Typical variables are:
    • data size (N), number of processors (P), machine constants
  – Model performance of individual operations, components, algorithms in terms of the above
    • Be careful to characterize variations across processors, and model them with (typically) max operators
      – E.g. max{Load_i}
  – Remember that constants are important in practical parallel computing
    • Be wary of asymptotic analysis: use it, but carefully
• Scalability analysis:
  – Isoefficiency
![Page 25: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/25.jpg)
Analyze Scalability of the Algorithm (say, via the isoefficiency metric)
![Page 26: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/26.jpg)
Scalability
• The program should scale up to use a large number of processors.
  – But what does that mean?
• An individual simulation isn’t truly scalable
• Better definition of scalability:
  – If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size
![Page 27: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/27.jpg)
Isoefficiency Analysis
• An algorithm (*) is scalable if
  – If you double the number of processors available to it, it can retain the same parallel efficiency by increasing the size of the problem by some amount
  – Not all algorithms are scalable in this sense
  – Isoefficiency is the rate at which the problem size must be increased, as a function of number of processors, to keep the same efficiency
  – Use η(p, N) = η(x·p, y·N) to get this equation
Parallel efficiency = T1 / (Tp * P)
T1: Time on one processor
Tp: Time on P processors
Figure: equal-efficiency curves (problem size vs. processors)
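To make the definition concrete, the sketch below computes parallel efficiency as T1 / (Tp * P) and searches for the problem size that restores a target efficiency as P grows. The timing model `toy_time` (N/P local work plus log2(P) combining steps, as in summing N numbers) is an assumption chosen for illustration and does not come from the talk; all function names are hypothetical.

```python
import math


def parallel_efficiency(t1, tp, p):
    """Efficiency = T1 / (Tp * P)."""
    return t1 / (tp * p)


def toy_time(n, p):
    """Assumed cost model: n/p units of local work plus log2(p) combining steps."""
    return n / p + math.log2(p)


def problem_size_for_efficiency(p, target, n_start=1):
    """Smallest power-of-two multiple of n_start whose efficiency reaches target."""
    n = n_start
    while parallel_efficiency(toy_time(n, 1), toy_time(n, p), p) < target:
        n *= 2
    return n


if __name__ == "__main__":
    for p in (16, 64, 256):
        n = problem_size_for_efficiency(p, target=0.75)
        eff = parallel_efficiency(toy_time(n, 1), toy_time(n, p), p)
        print(f"P={p:4d}  N={n:6d}  efficiency={eff:.3f}")
```

Under this assumed model the required N grows roughly as P log P, which is the shape of one equal-efficiency curve.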
![Page 28: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/28.jpg)
Gauss-Jacobi Relaxation
Sequential pseudocode:
while (maxError > Threshold) {
    Re-apply boundary conditions
    maxError = 0;
    for i = 0 to N-1 {
        for j = 0 to N-1 {
            B[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i,j+1] + A[i+1,j] + A[i-1,j]);
            if (|B[i,j] - A[i,j]| > maxError)
                maxError = |B[i,j] - A[i,j]|
        }
    }
    swap B and A
}
Decomposition by: row, blocks, or column
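A runnable version of the Jacobi pseudocode might look like the following. It is a sketch under stated assumptions: the grid carries one extra ring of fixed boundary cells (so "re-apply boundary conditions" becomes "never overwrite the ring"), and `max_iters` is an added safety limit that the pseudocode does not have.

```python
def jacobi(grid, threshold, max_iters=10_000):
    """Jacobi relaxation on a square grid whose outer ring holds fixed boundary values."""
    n = len(grid) - 2                     # interior is n x n
    a = [row[:] for row in grid]          # current values
    b = [row[:] for row in grid]          # next values (boundary copied once)
    iters = 0
    max_error = threshold + 1.0
    while max_error > threshold and iters < max_iters:
        max_error = 0.0
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                b[i][j] = 0.2 * (a[i][j] + a[i][j - 1] + a[i][j + 1]
                                 + a[i + 1][j] + a[i - 1][j])
                max_error = max(max_error, abs(b[i][j] - a[i][j]))
        a, b = b, a                       # swap B and A
        iters += 1
    return a, iters


if __name__ == "__main__":
    size = 6
    grid = [[0.0] * size for _ in range(size)]
    grid[0] = [1.0] * size                # hot top boundary
    result, iters = jacobi(grid, threshold=1e-4)
    print(f"converged after {iters} iterations")
```

Because every update reads only the old array, all interior cells of one iteration are independent, which is what makes the row, block, or column decompositions straightforward.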
![Page 29: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/29.jpg)
![Page 30: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/30.jpg)
Isoefficiency of Jacobi Relaxation
Row decomposition
• Computation per PE:
  – A * N * (N/P)
• Communication:
  – 16 * N
• Comm-to-comp ratio:
  – (16 * P) / (A * N) = γ
• Efficiency:
  – 1 / (1 + γ)
• Isoefficiency:
  – N^4
  – problem-size = N^2
  – = (problem-size)^2
Block decomposition
• Computation per PE:
  – A * N * (N/P)
• Communication:
  – 32 * N / P^(1/2)
• Comm-to-comp ratio:
  – (32 * P^(1/2)) / (A * N)
• Efficiency
• Isoefficiency:
  – N^2
  – Linear in problem size
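The two comm-to-comp ratios can be evaluated numerically to see how fast N must grow to hold efficiency constant. This sketch uses only the formulas on the slide; the value chosen for the constant A and the function names are assumptions, and the block case reuses the row case's efficiency formula 1 / (1 + γ), which the slide leaves blank.

```python
import math


def gamma_row(n, p, a):
    """Comm-to-comp ratio for row decomposition: (16 * P) / (A * N)."""
    return (16 * p) / (a * n)


def gamma_block(n, p, a):
    """Comm-to-comp ratio for block decomposition: (32 * sqrt(P)) / (A * N)."""
    return (32 * math.sqrt(p)) / (a * n)


def efficiency(gamma):
    return 1.0 / (1.0 + gamma)


def n_for_gamma_row(p, a, gamma):
    """N that keeps the row-decomposition ratio at the given gamma."""
    return 16 * p / (a * gamma)


def n_for_gamma_block(p, a, gamma):
    """N that keeps the block-decomposition ratio at the given gamma."""
    return 32 * math.sqrt(p) / (a * gamma)


if __name__ == "__main__":
    a, gamma = 8.0, 0.125                 # hypothetical constant A and target ratio
    for p in (64, 256, 1024):
        print(f"P={p:5d}  row N={n_for_gamma_row(p, a, gamma):9.0f}  "
              f"block N={n_for_gamma_block(p, a, gamma):7.0f}")
```

Quadrupling P quadruples the required N for rows but only doubles it for blocks, so the problem size N^2 grows as P^2 for rows and linearly in P for blocks.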
![Page 31: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/31.jpg)
NAMD: A Production MD program
NAMD
• Fully featured program
• NIH-funded development
• Distributed free of charge (~20,000 registered users)
• Binaries and source code
• Installed at NSF centers
• User training and support
• Large published simulations
![Page 32: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/32.jpg)
Molecular Dynamics in NAMD
• Collection of [charged] atoms, with bonds
  – Newtonian mechanics
  – Thousands of atoms (10,000 – 5,000,000)
• At each time-step
  – Calculate forces on each atom
    • Bonds
    • Non-bonded: electrostatic and van der Waals
      – Short-distance: every timestep
      – Long-distance: using PME (3D FFT)
      – Multiple time stepping: PME every 4 timesteps
  – Calculate velocities and advance positions
• Challenge: femtosecond time-step, millions needed!
Collaboration with K. Schulten, R. Skeel, and coworkers
![Page 33: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/33.jpg)
Traditional Approaches: non-isoefficient
• Replicated data:
  – All atom coordinates stored on each processor
    • Communication/computation ratio: P log P
• Partition the atoms array across processors
  – Nearby atoms may not be on the same processor
  – C/C ratio: O(P)
• Distribute force matrix to processors
  – Matrix is sparse, non-uniform
  – C/C ratio: sqrt(P)
![Page 34: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/34.jpg)
Spatial Decomposition Via Charm
• Atoms distributed to cubes based on their location
• Size of each cube:
  – Just a bit larger than cut-off radius
• Communicate only with neighbors
• Work: for each pair of neighboring objects
• C/C ratio: O(1)
• However:
  – Load imbalance
  – Limited parallelism
Cells, cubes, or “patches”
Charm++ is useful to handle this
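Binning atoms into cubes by position can be sketched in a few lines. This is an illustration of the idea only, not NAMD's implementation: `assign_to_patches` and `neighbor_keys` are hypothetical names, and the patch size is simply passed in (the slide says it is a bit larger than the cut-off radius).

```python
from collections import defaultdict
from itertools import product


def assign_to_patches(positions, patch_size):
    """Map each atom index to the cube (patch) that contains its position."""
    patches = defaultdict(list)
    for atom_id, (x, y, z) in enumerate(positions):
        key = (int(x // patch_size), int(y // patch_size), int(z // patch_size))
        patches[key].append(atom_id)
    return dict(patches)


def neighbor_keys(key):
    """The 26 patches that touch a given patch (faces, edges, and corners)."""
    return [tuple(k + d for k, d in zip(key, delta))
            for delta in product((-1, 0, 1), repeat=3)
            if delta != (0, 0, 0)]


if __name__ == "__main__":
    atoms = [(0.5, 0.5, 0.5), (12.5, 0.1, 0.1), (0.7, 0.2, 0.9)]
    print(assign_to_patches(atoms, patch_size=12.0))
    print(len(neighbor_keys((0, 0, 0))), "neighboring patches")
```

Because the patch edge is at least the cut-off radius, any two atoms within the cut-off are in the same patch or in adjacent patches, so each patch only needs data from its immediate neighbors.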
![Page 35: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/35.jpg)
Object Based Parallelization for MD:
Force Decomposition + Spatial Decomposition
• Now, we have many objects to load balance:
  – Each diamond can be assigned to any proc.
• Number of diamonds (3D):
  – 14 · Number of patches
• 2-away variation:
  – Half-size cubes
  – 5x5x5 interactions
• 3-away interactions: 7x7x7
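The factor of 14 can be checked by counting: each patch has 26 neighbors, each neighbor pair is shared by two patches, and each patch also interacts with itself, giving 26/2 + 1 = 14. The sketch below verifies this by brute-force enumeration on a small periodic grid. The periodic boundary and the extension of the same counting rule to the 2-away case are assumptions for illustration; the slide itself only states the 14 and the 5x5x5 and 7x7x7 neighborhood sizes.

```python
from itertools import product


def compute_objects_per_patch(k_away=1):
    """Pair objects per patch: half of the neighbors plus one self-interaction."""
    side = 2 * k_away + 1
    return (side ** 3 - 1) // 2 + 1


def count_compute_objects(dims, k_away=1):
    """Enumerate unordered patch pairs (including self pairs) on a periodic grid."""
    nx, ny, nz = dims
    pairs = set()
    for x, y, z in product(range(nx), range(ny), range(nz)):
        for dx, dy, dz in product(range(-k_away, k_away + 1), repeat=3):
            other = ((x + dx) % nx, (y + dy) % ny, (z + dz) % nz)
            pairs.add(tuple(sorted([(x, y, z), other])))
    return len(pairs)


if __name__ == "__main__":
    dims = (4, 4, 4)
    total = count_compute_objects(dims)
    print(total, "pair objects for", dims[0] * dims[1] * dims[2], "patches")
```

The enumeration needs at least 2k+1 patches per dimension so that wrapped neighbors stay distinct.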
![Page 36: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/36.jpg)
Strong Scaling on JaguarPF
Figure: runs at 6,720, 53,760, 107,520, and 224,076 cores
![Page 37: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/37.jpg)
Gauss-Seidel Relaxation
Sequential pseudocode:
while (maxError > Threshold) {
    Re-apply boundary conditions
    maxError = 0;
    for i = 0 to N-1 {
        for j = 0 to N-1 {
            old = A[i,j]
            A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i,j+1] + A[i+1,j] + A[i-1,j]);
            if (|A[i,j] - old| > maxError)
                maxError = |A[i,j] - old|
        }
    }
}
No old-new arrays.
Sequentially, how well does this work?
It works much better!
How to parallelize this?
Spring 2009 CS420: Parallel Algorithms
![Page 38: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/38.jpg)
How do we parallelize Gauss-Seidel?
• Visualize the flow of values
• Not the control flow:
  – That goes row-by-row
• Flow of dependences: which values depend on which values
• Does that give us a clue on how to parallelize?
![Page 39: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/39.jpg)
Parallelizing Gauss-Seidel
• Some ideas
  – Row decomposition, with pipelining
  – Square over-decomposition
    • Assign many squares to a processor (essentially same?)
Figure: rows assigned to PE 0, PE 1, PE 2
![Page 40: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/40.jpg)
Row decomposition, with pipelining
Figure: N x N grid split into P rows of height N/P and into column blocks of width W (# columns = N/W, # rows = P); blocks are labeled with the phase in which they execute, and the number of phases is N/W + P - 1
![Page 41: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/41.jpg)
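Figure: number of processors used vs. time (in phases); time axis marked at 0, P, N/W, and N/W + P - 1, with at most P processors in use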
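The phase count and the ramp-up and ramp-down in processor use can be reproduced with a small schedule model. It assumes each row of blocks belongs to one processor and that block (row, col) can run in phase row + col, since it needs the block above and the block to its left; the function names are hypothetical.

```python
def pipeline_schedule(n, w, p):
    """Return (number of phases, processors busy in each phase)."""
    cols = n // w                        # number of column blocks, N/W
    phases = cols + p - 1
    busy = [0] * phases
    for row in range(p):                 # one row of blocks per processor
        for col in range(cols):
            busy[row + col] += 1         # block (row, col) runs in phase row + col
    return phases, busy


def pipeline_utilization(n, w, p):
    """Fraction of processor-phases that do useful work."""
    phases, busy = pipeline_schedule(n, w, p)
    return sum(busy) / (phases * p)


if __name__ == "__main__":
    phases, busy = pipeline_schedule(n=1024, w=64, p=8)
    print("phases:", phases)
    print("busy processors per phase:", busy)
    print(f"utilization: {pipeline_utilization(1024, 64, 8):.3f}")
```

A smaller W lengthens the fully busy middle section relative to the ramps, at the price of more, smaller messages, which is the tradeoff behind controlled pipelining.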
![Page 42: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/42.jpg)
Red-Black Squares Method
• Red squares calculate values based on the black squares
  – Then black squares use values from red squares
  – Now red ones can be done in parallel, and then black ones can be done in parallel
• Each square locally can do Gauss-Seidel computation
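The red-black idea can be demonstrated at the finest granularity, where every "square" is a single grid point colored by the parity of i + j. That simplification is an assumption made to keep the sketch short; the slide describes larger squares that each run Gauss-Seidel locally. The function names and the `reverse` flag are hypothetical.

```python
def sweep_color(a, color, reverse=False):
    """Update all interior points of one color in place; return the max change."""
    n = len(a) - 2
    cells = [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)
             if (i + j) % 2 == color]
    if reverse:
        cells.reverse()                  # order within a color should not matter
    max_error = 0.0
    for i, j in cells:
        old = a[i][j]
        a[i][j] = 0.2 * (old + a[i][j - 1] + a[i][j + 1]
                         + a[i + 1][j] + a[i - 1][j])
        max_error = max(max_error, abs(a[i][j] - old))
    return max_error


def red_black_iteration(a, reverse=False):
    """One full iteration: all red points, then all black points."""
    return max(sweep_color(a, 0, reverse), sweep_color(a, 1, reverse))


if __name__ == "__main__":
    grid = [[0.0] * 6 for _ in range(6)]
    grid[0] = [1.0] * 6
    for iteration in range(5):
        print(iteration, red_black_iteration(grid))
```

Every point of one color reads only itself and points of the other color, so the points within a color can be updated in any order, or all at once on different processors.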
![Page 43: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/43.jpg)
Communication: alpha-reducing optimizations
• When you are sending too many tiny messages:
  – Alpha cost is high (a microsecond per msg, for example)
  – How to reduce it?
• Simple combining:
  – Combine messages going to the same destination
  – Cost: delay (lesser pipelining)
• More complex scenario:
  – All-to-all: everyone wants to send a short message to everyone else
  – Direct method: α·(P-1) + β·(P-1)·m
  – For small m, the α cost dominates
![Page 44: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/44.jpg)
All to all via Mesh
Organize processors in a 2D (virtual) grid
Phase 1: Each processor sends sqrt(P) - 1 messages within its row
Phase 2: Each processor sends sqrt(P) - 1 messages within its column
Message from (x1,y1) to (x2,y2) goes via (x1,y2)
2·(sqrt(P) - 1) messages instead of P - 1
For us: 26 messages instead of 192
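The message counts and a rough cost comparison for the direct and mesh all-to-all can be sketched as below. The mesh cost model is an assumption: it supposes each combined message carries the data for a whole row or column, so roughly twice as many bytes move overall. The per-byte cost is a placeholder, the per-message cost echoes the "microsecond per message" example, and the function names are hypothetical. `math.isqrt` needs Python 3.8 or later.

```python
import math


def grid_side(p):
    """Side of the smallest square virtual grid that holds p processors."""
    side = math.isqrt(p)
    return side if side * side >= p else side + 1


def direct_messages(p):
    return p - 1


def mesh_messages(p):
    return 2 * (grid_side(p) - 1)


def direct_cost(p, m, alpha, beta):
    """alpha * (P-1) + beta * (P-1) * m"""
    return alpha * (p - 1) + beta * (p - 1) * m


def mesh_cost(p, m, alpha, beta):
    """Two phases; each combined message is assumed to carry side * m bytes."""
    side = grid_side(p)
    per_phase = alpha * (side - 1) + beta * (side - 1) * side * m
    return 2 * per_phase


def mesh_route(src, dst):
    """A message from (x1, y1) to (x2, y2) goes via (x1, y2)."""
    (x1, _), (_, y2) = src, dst
    return [src, (x1, y2), dst]


if __name__ == "__main__":
    alpha, beta = 1e-6, 1e-9             # placeholder machine constants
    for m in (76, 8076, 10 ** 6):
        print(m, direct_cost(1024, m, alpha, beta), mesh_cost(1024, m, alpha, beta))
```

Under these assumptions the mesh wins when messages are small, because it pays far fewer alpha terms, and loses for large messages, because every byte is sent twice.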
![Page 45: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/45.jpg)
Figure: All-to-all on Lemieux for 76-byte messages. Time (ms) vs. number of processors (16 to 2048) for MPI, Mesh, Hypercube, and 3D Grid.
![Page 46: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/46.jpg)
All to all on Lemieux, 1024 processors
Figure: time (ms) vs. message size (76 to 8076 bytes) for Mesh and Mesh Compute.
Bigger benefit: CPU is occupied for a much shorter time!
![Page 47: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/47.jpg)
NAMD Performance on Lemieux
![Page 48: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/48.jpg)
Impact on Application Performance
Figure: step time vs. number of processors (256, 512, 1024) for Mesh, Direct, and MPI.
NAMD performance on Lemieux, with the transpose step implemented using different all-to-all algorithms
![Page 49: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/49.jpg)
Sequential Performance Issues
![Page 50: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/50.jpg)
Example program
• Imagine a sequential program running using a large array, A
• For each i, A[i] = A[i] + A[some other index]
• How long should the program take, if each addition takes a nanosecond?
• What is the performance difference you expect, depending on how the other index is chosen?

for (i=0, index2=0; i<size; i++) {
    index2 += SOME_NUMBER;               // SOME_NUMBER is smaller than size
    if (index2 >= size) index2 -= size;  // wrap around; index2 == size is out of bounds
    A[i] += A[index2];
}

for (i=0; i<size-1; i++) {
    A[i] += A[i+1];
}
Spring 2009 CS420: Cache Hierarchies
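One way to reason about the difference between the two loops is to count how often the second operand lands on a different cache line than the previous access. The sketch below is a counting model, not a benchmark and not a cache simulator: it assumes 8 array elements per cache line (for example, 8-byte values and 64-byte lines) and ignores cache capacity and prefetching. The function names are hypothetical.

```python
def second_operand_indices(size, stride):
    """Indices visited by index2 in the strided loop (stride must be < size)."""
    index2 = 0
    for _ in range(size):
        index2 += stride
        if index2 >= size:
            index2 -= size
        yield index2


def line_changes(indices, elems_per_line=8):
    """Count accesses that fall on a different cache line than the previous one."""
    changes = 0
    last_line = None
    for idx in indices:
        line = idx // elems_per_line
        if line != last_line:
            changes += 1
            last_line = line
    return changes


if __name__ == "__main__":
    size = 1 << 20
    for stride in (1, 2, 8, 64, 4097):
        changed = line_changes(second_operand_indices(size, stride))
        print(f"stride {stride:5d}: {changed / size:.3f} new lines per access")
```

With a stride of 1, about one access in eight touches a new line; once the stride reaches a full line, every access does, which is where memory latency rather than the addition itself sets the running time.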
![Page 51: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/51.jpg)
Caches and Cache Performance
• Remember the von Neumann model
Figure: CPU and memory (von Neumann model) vs. CPU with registers, cache, and memory
![Page 52: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/52.jpg)
Why and how does a cache help?
• Only because of the principle of locality
  – Programs tend to access the same and/or nearby data repeatedly
  – Temporal and spatial locality
• Size of cache
• Multiple levels of cache
• Performance impact of caches
  – Designing programs for good sequential performance
![Page 53: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/53.jpg)
Reality today: multi-level caches
• Remember the von Neumann model
Figure: CPU, cache, and memory vs. CPU, multiple levels of caches, and memory
![Page 54: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/54.jpg)
Example: Intel’s Nehalem
• Nehalem architecture, Core i7 chip:
  – 32 KB L1 instruction and 32 KB L1 data cache per core
  – 256 KB L2 cache (combined instruction and data) per core
  – 8 MB L3 (combined instruction and data), “inclusive”, shared by all cores
• Still, even an L1 cache access takes several cycles
  – (reportedly 4 cycles, increased from 3 before)
  – L2: 10 cycles
![Page 55: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/55.jpg)
A little bit about microprocessor architecture
• Architecture over the last 2-3 decades was driven by the need to make the clock cycle go faster and faster
  – Pipelining developed as an essential technique early on
  – Each instruction execution is pipelined:
    • Fetch, decode, and execute stages, at least
    • In addition, floating-point operations, which take longer to calculate, have their own separate pipeline
    • L1 cache accesses in Nehalem are pipelined, so even though it takes 4 cycles to get the result, you can keep issuing a new load every cycle, and you wouldn’t notice a difference (almost) if they are all found in L1 cache (i.e. are “hits”)
![Page 56: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/56.jpg)
Another issue: SIMD vectors
• Hardware is capable of executing multiple floating-point instructions per cycle
  – Need to enable that, by using SIMD vector instructions
  – Example: Intel’s SSE and IBM’s AltiVec
• Compilers try to automate it,
  – but are not always successful
• Learn manual vectorization
  – Or use libraries that help
Example from Wikipedia:
movaps xmm0, [v1]        ; xmm0 = v1.w | v1.z | v1.y | v1.x
addps  xmm0, [v2]        ; xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
movaps [vec_res], xmm0
![Page 57: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/57.jpg)
Broad Approach To Performance Tuning
• Understand (for a given application)
  – Fraction of peak performance
    • (10% is good for many apps!)
  – Parallel efficiency:
    • Speedup plots
    • Strong scaling: keep problem size fixed
    • Weak scaling: increase problem size with processors
  – These help you decide where to focus
• Sequential optimizations => fraction of peak
  – Use the right compiler flags (basic: -O3)
• Parallel inefficiency:
  – Grainsize, communication costs, idle time, critical paths, load imbalances
    • One way to recognize it: wait time at barriers!
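Strong-scaling and weak-scaling efficiency can be computed from a table of run times. A minimal sketch with hypothetical function names; the timings in the usage example are made up for illustration, and the smallest processor count in each table is taken as the baseline.

```python
def strong_scaling_efficiency(timings):
    """timings: {processors: seconds} for a FIXED problem size."""
    base_p = min(timings)
    base_t = timings[base_p]
    return {p: (base_t * base_p) / (t * p) for p, t in sorted(timings.items())}


def weak_scaling_efficiency(timings):
    """timings: {processors: seconds} with problem size grown in proportion to processors."""
    base_t = timings[min(timings)]
    return {p: base_t / t for p, t in sorted(timings.items())}


if __name__ == "__main__":
    strong = {1: 100.0, 2: 50.0, 4: 40.0}    # made-up timings
    weak = {1: 10.0, 4: 12.5}                # made-up timings
    print("strong:", strong_scaling_efficiency(strong))
    print("weak:  ", weak_scaling_efficiency(weak))
```

A drop in strong-scaling efficiency with a flat weak-scaling curve usually points at grainsize or communication cost rather than at sequential performance.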
![Page 58: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/58.jpg)
Decouple decomposition from Physical Processors
![Page 59: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/59.jpg)
Migratable Objects (aka Processor Virtualization)
Programmer: [Over] decomposition into virtual processors
Runtime: Assigns VPs to processors
Enables adaptive runtime strategies
Implementations: Charm++, AMPI
Figure: user view vs. system implementation
Benefits
• Software engineering
  – Number of virtual processors can be independently controlled
  – Separate VPs for different modules
• Message driven execution
  – Adaptive overlap of communication
  – Predictability:
    • Automatic out-of-core
  – Asynchronous reductions
• Dynamic mapping
  – Heterogeneous clusters
    • Vacate, adjust to speed, share
  – Automatic checkpointing
  – Change set of processors used
  – Automatic dynamic load balancing
  – Communication optimization
![Page 60: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/60.jpg)
Migratable Objects (aka Processor Virtualization)
Figure: MPI processes implemented as virtual processors (user-level migratable threads), mapped onto real processors
![Page 61: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/61.jpg)
Parallel Decomposition and Processors
• MPI-style encourages
  – Decomposition into P pieces, where P is the number of physical processors available
  – If your natural decomposition is a cube, then the number of processors must be a cube
  – …
• Charm++/AMPI style “virtual processors”
  – Decompose into natural objects of the application
  – Let the runtime map them to processors
  – Decouple decomposition from load balancing
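Once the work is decomposed into many more objects than processors, a mapping step decides where each object runs. The sketch below uses a generic greedy heuristic (heaviest object first, onto the currently least-loaded processor). It is an illustration of the idea only: it is not presented as the strategy Charm++ or AMPI actually uses, and `greedy_map` and the sample loads are hypothetical.

```python
import heapq


def greedy_map(object_loads, num_procs):
    """Assign objects to processors: heaviest first, onto the least-loaded processor."""
    heap = [(0.0, proc) for proc in range(num_procs)]
    heapq.heapify(heap)
    assignment = {}
    for obj, load in sorted(object_loads.items(), key=lambda kv: -kv[1]):
        total, proc = heapq.heappop(heap)     # least-loaded processor so far
        assignment[obj] = proc
        heapq.heappush(heap, (total + load, proc))
    proc_loads = [0.0] * num_procs
    for obj, proc in assignment.items():
        proc_loads[proc] += object_loads[obj]
    return assignment, proc_loads


if __name__ == "__main__":
    loads = {"a": 5, "b": 4, "c": 3, "d": 3, "e": 1}   # made-up object loads
    assignment, proc_loads = greedy_map(loads, num_procs=2)
    print(assignment)
    print(proc_loads)
```

The point of the decoupling is that the application only defines the objects and their work; the number of processors appears only in the mapping step, which can be redone at run time with measured loads.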
![Page 62: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/62.jpg)
Decomposition independent of numCores
• Rocket simulation example under traditional MPI vs. Charm++/AMPI framework
  – Benefit: load balance, communication optimizations, modularity
Figure: under traditional MPI, each of processors 1, 2, ..., P holds one Solid and one Fluid piece; under Charm++/AMPI, the pieces Solid1 ... Solidn and Fluid1 ... Fluidm are decomposed independently of P
![Page 63: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/63.jpg)
OpenAtom
Car-Parrinello ab initio MD
Collaborative IT project with: R. Car, M. Klein, M. Tuckerman, Glenn Martyna, N. Nystrom, ..
Specific software project (leanCP): Glenn Martyna, Mark Tuckerman, L.V. Kale and co-workers (E. Bohm, Yan Shi, Ramkumar Vadali)
Funding: NSF-CHE, NSF-CS, NSF-ITR, IBM
![Page 64: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/64.jpg)
New OpenAtom Collaboration
• Principal Investigators
  – M. Tuckerman (NYU)
  – L.V. Kale (UIUC)
  – G. Martyna (IBM TJ Watson)
  – K. Schulten (UIUC)
  – J. Dongarra (UTK/ORNL)
• Current effort is focused on
  – QMMM via integration with NAMD2
  – ORNL Cray XT4 Jaguar (31,328 cores)
  – ALCF IBM Blue Gene/P (163,840 cores)
![Page 65: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/65.jpg)
Ab initio molecular dynamics (electronic structure simulation) enables the study of many important systems
[Figure panels: Molecular Clusters, Nanowires, Semiconductor Surfaces, 3D-Solids/Liquids]
![Page 66: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/66.jpg)
Quantum Chemistry
• Car-Parrinello Molecular Dynamics
– High precision: ab initio molecular dynamics (AIMD) uses quantum mechanical descriptions of electronic structure to determine forces between atoms, permitting accurate atomistic descriptions of chemical reactions.
– PPL's OpenAtom project features a unique parallel decomposition of the Car-Parrinello method. Using Charm++ virtualization, we can efficiently scale small (32-molecule) systems to thousands of processors.
![Page 67: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/67.jpg)
Computation Flow
![Page 68: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/68.jpg)
Torus Aware Object Mapping
![Page 69: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/69.jpg)
OpenAtom Performance
![Page 70: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/70.jpg)
Benefits of Topology Mapping
[Figure panels: Watson Blue Gene/L (CO mode); PSC BigBen (XT3) (SN and VN mode)]
![Page 71: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/71.jpg)
Use Dynamic Load Balancing Based on the Principle of Persistence
Principle of persistence
Computational loads and communication patterns tend to persist, even in dynamic computations
So, the recent past is a good predictor of the near future
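One way to act on persistence is to record how long each object took in recent timesteps and use that record as the load estimate for the next ones. The sketch below assumes a moving average over a short window; the `LoadDatabase` class, its method names, and the window size are hypothetical choices for illustration, not the Charm++ load balancing framework.

```python
from collections import defaultdict, deque

class LoadDatabase:
    """Hypothetical per-object load record (not a Charm++ API)."""

    def __init__(self, window=3):
        # Keep only the most recent `window` measurements per object.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, obj_id, seconds):
        """Store the measured compute time of one object for one timestep."""
        self.history[obj_id].append(seconds)

    def predict(self, obj_id):
        """Predict the next-step load as the mean of the recent measurements."""
        samples = self.history[obj_id]
        return sum(samples) / len(samples) if samples else 0.0

db = LoadDatabase(window=3)
measured_steps = (
    {"a": 1.0, "b": 4.0},   # instrumented timestep 1
    {"a": 1.2, "b": 3.8},   # instrumented timestep 2
    {"a": 1.1, "b": 4.2},   # instrumented timestep 3
)
for step_loads in measured_steps:
    for obj_id, seconds in step_loads.items():
        db.record(obj_id, seconds)

predicted = {obj_id: db.predict(obj_id) for obj_id in ("a", "b")}
print(predicted)
```

The predicted loads are what a load balancing strategy would take as input; if loads did not persist, these measurements would say little about the next step.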
![Page 72: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/72.jpg)
Load Balancing Steps
Regular Timesteps
Instrumented Timesteps
Detailed, aggressive Load Balancing
Refinement Load Balancing
![Page 73: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/73.jpg)
Processor Utilization against Time on 128 and 1024 processors
On 128 processors, a single load balancing step suffices, but on 1024 processors, we need a “refinement” step.
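The difference between an aggressive step and a refinement step can be sketched as follows. This is a simplified illustration under stated assumptions, not the Charm++ strategies themselves: `greedy_lb`, `refine_lb`, `proc_loads`, and the 5% `tolerance` are hypothetical, and communication cost is ignored entirely.

```python
import heapq

def greedy_lb(loads, num_procs):
    """Aggressive: ignore the current placement, place the heaviest objects first
    on whichever processor is currently least loaded."""
    heap = [(0.0, p) for p in range(num_procs)]   # (total load, processor)
    placement = {}
    for obj, load in sorted(loads.items(), key=lambda kv: -kv[1]):
        total, p = heapq.heappop(heap)
        placement[obj] = p
        heapq.heappush(heap, (total + load, p))
    return placement

def proc_loads(loads, placement, num_procs):
    """Sum object loads per processor for a given placement."""
    totals = [0.0] * num_procs
    for obj, p in placement.items():
        totals[p] += loads[obj]
    return totals

def refine_lb(loads, placement, num_procs, tolerance=1.05):
    """Refinement: keep the placement and migrate objects only off processors
    whose load exceeds tolerance * average."""
    placement = dict(placement)
    totals = proc_loads(loads, placement, num_procs)
    limit = tolerance * sum(totals) / num_procs
    for _ in range(len(loads)):                   # each object moves at most once
        src = max(range(num_procs), key=totals.__getitem__)
        dst = min(range(num_procs), key=totals.__getitem__)
        if totals[src] <= limit:
            break                                 # nobody is overloaded any more
        # Only consider moves that leave the destination within the limit.
        movable = [o for o, p in placement.items()
                   if p == src and totals[dst] + loads[o] <= limit]
        if not movable:
            break
        obj = max(movable, key=loads.__getitem__)
        placement[obj] = dst
        totals[src] -= loads[obj]
        totals[dst] += loads[obj]
    return placement

loads = {"a": 5, "b": 4, "c": 3, "d": 3, "e": 2, "f": 1}
old = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(proc_loads(loads, greedy_lb(loads, 2), 2))
print([o for o in loads if refine_lb(loads, old, 2)[o] != old[o]])
```

The greedy step may move almost every object, which is acceptable once but expensive to repeat; the refinement step touches only the few objects needed to bring overloaded processors back under the limit.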
[Figure labels: Load Balancing; Aggressive Load Balancing; Refinement Load Balancing]
![Page 74: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/74.jpg)
ChaNGa: Parallel Gravity
• Collaborative project (NSF ITR)
– with Prof. Tom Quinn, Univ. of Washington
• Components: gravity, gas dynamics
• Barnes-Hut tree codes
– Oct tree is natural decomposition:
• Geometry has better aspect ratios, and so you “open” fewer nodes up.
• But is not used because it leads to bad load balance
• Assumption: one-to-one map between sub-trees and processors
• Binary trees are considered better load balanced
– With Charm++: Use Oct-Tree, and let Charm++ map subtrees to processors
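The point about oct-trees can be illustrated with a small sketch: split space into octants until each leaf holds few particles, then assign the resulting subtrees to processors in a separate step. This is not ChaNGa code; `build_octree_pieces`, the clustered test data, the leaf size of 64, and the least-loaded assignment are all assumptions made for illustration.

```python
import random

def build_octree_pieces(points, lo, hi, max_points, depth=0, max_depth=10):
    """Recursively split the cube [lo, hi) into octants until each leaf holds
    at most max_points particles; return the non-empty leaves."""
    if len(points) <= max_points or depth == max_depth:
        return [points] if points else []
    mid = tuple((l + h) / 2 for l, h in zip(lo, hi))
    octants = {}
    for p in points:
        key = tuple(int(c >= m) for c, m in zip(p, mid))   # upper or lower half per axis
        octants.setdefault(key, []).append(p)
    pieces = []
    for key, subset in octants.items():
        sub_lo = tuple(m if k else l for k, l, m in zip(key, lo, mid))
        sub_hi = tuple(h if k else m for k, m, h in zip(key, mid, hi))
        pieces += build_octree_pieces(subset, sub_lo, sub_hi,
                                      max_points, depth + 1, max_depth)
    return pieces

rng = random.Random(0)
# Clustered particles: most sit near one corner, so octants are very unequal.
particles = [tuple(rng.random() ** 3 for _ in range(3)) for _ in range(2000)]
pieces = build_octree_pieces(particles, (0.0,) * 3, (1.0,) * 3, max_points=64)

# Separate mapping step: many subtrees per processor, heaviest first,
# each placed on the processor with the fewest particles so far.
num_procs = 8
proc_particles = [0] * num_procs
for piece in sorted(pieces, key=len, reverse=True):
    target = proc_particles.index(min(proc_particles))
    proc_particles[target] += len(piece)
print(len(pieces), proc_particles)
```

With a one-to-one map, eight octants of this clustered data would give eight very unequal processors; with many more subtrees than processors, the mapping step can even out the particle counts while the tree keeps its cubic cells.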
![Page 75: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/75.jpg)
Load balancing with GreedyLB
dwarf 5M on 1,024 BlueGene/L processors
5.6s 6.1s
[Figure: bar chart comparing messages (×1000) and bytes transferred (MB), before LB and after LB]
![Page 76: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/76.jpg)
Load balancing with OrbRefineLB
dwarf 5M on 1,024 BlueGene/L processors
5.6s 5.0s
![Page 77: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/77.jpg)
ChaNGa: Parallel Gravity Code
Developed in Collaboration with Tom Quinn (Univ. Washington) using Charm++
ChaNGa Preliminary Performance
![Page 78: Techniques for Developing Efficient Petascale Applications](https://reader035.fdocuments.us/reader035/viewer/2022062802/56814645550346895db351f9/html5/thumbnails/78.jpg)
Summary
• Exciting times ahead
• Petascale computers
– unprecedented opportunities for progress in science and engineering
– Petascale Applications will require a large toolbox, with
• Algorithms, Adaptive Runtime System, Performance tools, …
• Object-based decomposition
• Dynamic Load balancing
• Scalable Performance analysis
• Early performance development via BigSim
My Research: http://charm.cs.uiuc.edu
Blue Waters: http://www.ncsa.uiuc.edu/BlueWaters/