Towards Semi-Automatic Prediction of Wavefront Code Performance
Si Hammond (Gihan Mudalige, Dr Stephen Jarvis, Prof. M.K. Vernon)
High Performance Systems Group
Department of Computer Science
Introduction
• Performance Prediction - the why.
• Current Techniques - the how.
• Introduction to the Warwick Performance Prediction Toolkit (WPPT)
• Sample Wavefront Applications - Sweep3D and Chimaera
• Results
• Conclusion
Performance Prediction
Performance Prediction: Why?
• Principal Question: How long does it take to execute a piece of code on a particular machine/network?

DO i = 0, n
   SUM = SUM + n**3
END DO

• Execution Time = Time(for one iteration) * n + Loop Overhead
Performance Prediction: Why?
• 2 processors, each does n/2 of the work

DO i = 0, n/2
   SUM = SUM + n**3
END DO
IF (I AM P0) THEN
   SEND(P1, SUM); RECV(P1, SUM0); SUM = SUM + SUM0
ELSE
   SEND(P0, SUM); RECV(P0, SUM1); SUM = SUM + SUM1
END IF

• Execution Time = Time(for one iteration) * (n/2) + Send + Recv
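As a sketch, the two cost formulas above can be written as a tiny model. All of the timing parameters below are hypothetical placeholders, not measured values:

```python
# Minimal sketch of the serial and 2-processor cost models above.
# All parameters (t_iter, overheads, message costs) are invented placeholders.

def predict_serial(t_iter, n, loop_overhead=0.0):
    """Execution time = time(for one iteration) * n + loop overhead."""
    return t_iter * n + loop_overhead

def predict_parallel(t_iter, n, t_send, t_recv):
    """Each of 2 processors does n/2 work, then exchanges partial sums."""
    return t_iter * (n / 2) + t_send + t_recv

# Example: 1 microsecond per iteration, one million iterations
serial = predict_serial(1e-6, 1_000_000)                   # ~1.0 s
parallel = predict_parallel(1e-6, 1_000_000, 1e-4, 1e-4)   # ~0.5 s plus comms
```

Changing the parameters answers the "what if" questions that follow: more processors shrink the compute term, a faster network shrinks the send/receive terms.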
Performance Prediction: Why?
• What if...
  ‣ More processors?
  ‣ Faster processors?
  ‣ Faster network?
  ‣ Different network topology?
Performance Prediction: Why?
‣ Enables speculation on different architectures
‣ Can verify whether a supplied architecture is as good as a vendor says it is
‣ Can find potential sources of optimisation within code
‣ Can find potential sources of specialisation within an architecture
‣ Can check that an upgrade/change in hardware has/hasn’t affected performance
‣ Can improve utilisation and scheduling
Performance Prediction: How?
• Two main classes of performance prediction:
• Analytical (e.g. LogP, LogGP)
  ‣ Involves decomposing an application into mathematical functions
  ‣ Requires a very deep understanding of the code and how it will execute
  ‣ Generally very accurate, but models take a long time to develop
  ‣ Once the model is developed, predictions are quick to obtain
• Simulation (e.g. PACE)
  ‣ Involves decomposing an application into some form of simulation
  ‣ Often requires less knowledge of application behaviour, provided the simulator captures this when executed
  ‣ Generally accurate if the simulator captures machine behaviour; often less time needed to develop
  ‣ Simulation runs can take a long time
The Warwick Performance Prediction Toolkit
• In a lab..... long long ago....
Application decomposed into a series of mnemonics representing machine code operations, e.g.

sum = sum + a*3

MFDG (Multiplication, Floating Point Double Global)
+ AFDG (Addition, Floating Point Double Global)
+ SFDG (Store, Floating Point Double Global)

• To generate a prediction, obtain the cost of each mnemonic by benchmarking, then add them up
Simulation History

sum = sum + a*2 + b*2

2*MFDG + 2*AFDG + SFDG
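The costing process above can be sketched as a lookup-and-sum over mnemonic counts. The per-mnemonic times below are invented placeholders; in PACE the real values come from benchmarking the target machine:

```python
# Sketch of PACE-style mnemonic costing.
# Cost (seconds) per mnemonic -- invented values, stand-ins for benchmarks.
costs = {
    "MFDG": 4e-9,  # Multiplication, Floating Point Double Global
    "AFDG": 2e-9,  # Addition, Floating Point Double Global
    "SFDG": 1e-9,  # Store, Floating Point Double Global
}

def predict(mnemonic_counts):
    """Sum the benchmarked cost of every mnemonic occurrence."""
    return sum(costs[m] * count for m, count in mnemonic_counts.items())

# sum = sum + a*3        ->  MFDG + AFDG + SFDG
t1 = predict({"MFDG": 1, "AFDG": 1, "SFDG": 1})
# sum = sum + a*2 + b*2  ->  2*MFDG + 2*AFDG + SFDG
t2 = predict({"MFDG": 2, "AFDG": 2, "SFDG": 1})
```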
[Figure: "Pipelining and Operation Codes in Action" — timelines of PACE versus the processor executing floating point multiplications; PACE's sequential total is an over-estimate of the pipelined time.]
PACE and Pipelining
• Over-prediction: pipelining and out-of-order execution make predicting the precise order of instructions largely impossible.
• Doing this in a generic way is even more difficult.
• The problem is made worse by compiler optimisation and the use of special register sets such as SSE, 3DNow!, AltiVec etc.
Alternative Approach
• Replace the use of mnemonics with blocks of time taken from actual runs (in a similar manner to the analytical approach).
• Prediction process:
  1. Inject the code with timing routines
  2. Execute the instrumented application to obtain timings
  3. Generate a simulation with the timings added
  4. Simulate the code
• Similar to PACE, but with timings used instead of mnemonics
WPPT Research Questions
• Is it possible to automate the generation of accurate models for Fortran code?
• Is it possible to automate the injection of timing routines within code?
• How accurate can the automated approach be compared to hand-written analytical models?
WPPT Modelling Process
1. Parse code through the (automated) analyser → Instrumented Code + WPPT Model
2. Compile and run the injected code → Timings
3. Add timing and networking values to the model → WPPT Model + Timings
4. Simulate → Prediction
Automatic Model Generation
• Challenge: can we take Fortran 90 code and automatically generate a model?
• Method:
1. Copy control flow from original code
2. Analyse the original code; instead of computations, add a placeholder recording that computation occurs at this point
3. Copy over expressions needed for control flow (as much as possible)
• Problem: scientific codes often read values from files; these aren't available when generating a model, so user intervention is required
Creating a Model
Fortran 90 Code:

SUM = 0
DO i = 0, 1000
   SUM = SUM + i*i
END DO

WPPT Script:

for(i=0; i<1000; i++) {
   compute(region1);
}

• region1 is costed by timing the code when it actually runs.
• The abbreviated form replaces the computation in 'region1' with a single time
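A minimal sketch of this timing-block idea, in Python rather than WPPT's script language: the instrumented run records how long the compute region takes, and the "simulation" then replays the single recorded time instead of re-executing the computation. Function and region names here are illustrative, not WPPT's actual API:

```python
import time

timings = {}  # region name -> measured wall-clock time

def instrumented_run():
    """Instrumented run: time the region while actually executing it."""
    start = time.perf_counter()
    total = 0
    for i in range(1001):        # DO i = 0, 1000
        total = total + i * i
    timings["region1"] = time.perf_counter() - start
    return total

def compute(region):
    """Simulation side: replay the recorded time instead of computing."""
    return timings[region]

instrumented_run()
predicted = compute("region1")   # one stored time stands in for the loop
```

The design point is that the simulator never needs to understand what the region computes, only how long it took on the target machine.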
Limits to Automation
• Difficult to fully analyse code using limited analysis (a balance between speed and accuracy is needed).
• Even good compiler analysis doesn't produce all of the answers we need!
• Some parameters are read from file by the code; these can't be obtained at 'analysis' time
• Answer: the user must step in to complete the model (usually by setting 'global' values which would have been read from a file).
• A model can still be produced quickly, but it often needs manual tweaking
Performance Prediction in Practice:
Wavefront Codes
Wavefront Codes
• Account for 50-80% of execution time in defence programs.
• Wavefront codes are used to solve equations where the output of each cell relies on the output of all of its neighbours.
Wavefront Codes
• Solve cells in a series of ‘sweeps’ - one sweep from each corner of the array
• We solve as much as we can in each sweep, anything left over is solved in a later sweep
[Figure: diagonal wavefront sweeping across the grid; legend: Being Processed / Processed / Unprocessed]
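The sweep described above can be sketched for a 2D grid: a cell (i, j) can only be processed once its upstream neighbours (i-1, j) and (i, j-1) are done, so all cells on the same anti-diagonal form one "wavefront" and an nx-by-ny grid is covered in nx + ny - 1 steps. This is a simplified illustration, not the Sweep3D/Chimaera implementation:

```python
# Sketch of wavefront ordering for one sweep from the top-left corner of a
# 2D grid. Cell (i, j) depends on (i-1, j) and (i, j-1), so every cell with
# the same i + j sits on one wavefront and can be solved in parallel.
def wavefront_steps(nx, ny):
    steps = []
    for d in range(nx + ny - 1):             # one step per anti-diagonal
        front = [(i, d - i) for i in range(nx) if 0 <= d - i < ny]
        steps.append(front)
    return steps

steps = wavefront_steps(3, 3)
# A 3x3 grid needs 3 + 3 - 1 = 5 steps; the middle front is the widest,
# which is where the most parallelism is available.
```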
Wavefront Codes and Prediction
• Occupy a large proportion of supercomputing execution time (~50-80%)
• Ranked by the US Department of Energy and the UK AWE as an important benchmark of machine performance
• Relatively simple codes to benchmark
• Lots of established literature on the topic
Results
WPPT (Semi-Automated) Model
for(isw=1; isw<=8; isw=isw+1) {
  for(l=lfirst; l<=llast; l=l+lstep) {
    if(from_top!=mpi_proc_null)
      if((isw == 1) || (isw == 2) || (isw == 4) || (isw == 6))
        mpi_recv(proc_top, topbot_bufl*mpi_myreal);
    if(from_right!=mpi_proc_null)
      if((isw == 1) || (isw == 2) || (isw == 3) || (isw == 5))
        mpi_recv(proc_right, rightleft_bufl*mpi_myreal);
    if(from_bot!=mpi_proc_null)
      if((isw == 3) || (isw == 5) || (isw == 7) || (isw == 8))
        mpi_recv(proc_bot, topbot_bufl*mpi_myreal);
    if(from_left!=mpi_proc_null)
      if((isw == 4) || (isw == 6) || (isw == 7) || (isw == 8))
        mpi_recv(proc_left, rightleft_bufl*mpi_myreal);

    // Computation replaced with:
    compute(cpu_time_region_1);
    ....

    if(to_top!=mpi_proc_null) {
      if((isw == 3) || (isw == 5) || (isw == 7) || (isw == 8)) {
        mpi_send(proc_top, topbot_bufl*mpi_myreal);
      }
    }
    if(to_right!=mpi_proc_null) {
      if((isw == 4) || (isw == 6) || (isw == 7) || (isw == 8)) {
        mpi_send(proc_right, rightleft_bufl*mpi_myreal);
      }
    }
    if(to_bot!=mpi_proc_null) {
      if((isw == 1) || (isw == 2) || (isw == 4) || (isw == 6)) {
        mpi_send(proc_bot, topbot_bufl*mpi_myreal);
      }
    }
    if(to_left!=mpi_proc_null) {
      if((isw == 1) || (isw == 2) || (isw == 3) || (isw == 5)) {
        mpi_send(proc_left, rightleft_bufl*mpi_myreal);
      }
    }
  }
  mark_stack();
  if(rank == 0) {
    writeln("Processor[0] completed sweep: " + isw);
  }
}
HPSG Cluster
• 2 x AMD Opteron 246 2GHz processors per node
• 2GB RAM per node (shared between the 2 processors)
• 16 nodes = 32 processors
• Gigabit Ethernet between all nodes
• MPICH v1 (MPI libraries), compiled for shared-memory transfer
• Portland Group Fortran 90, C and C++ compilers
Sweep3D on the HPSG Cluster
Processors | Prediction (s) | Actual (s) | Error
8          | 761.28         | 725.46     | 4.94%
10         | 610.35         | 575.62     | 6.03%
12         | 522.60         | 478.45     | 9.23%
16         | 402.66         | 366.19     | 9.96%
20         | 322.88         | 307.87     | 4.88%
22         | 280.98         | 305.01     | -7.88%
24         | 269.70         | 268.03     | 0.62%
32         | 203.22         | 201.16     | 1.02%

Sweep3D, 500x500x500 problem; average error ~5.57%
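The error columns in these tables are the signed relative error of the prediction against the measured run; a quick sketch of the calculation using the 8-processor Sweep3D row:

```python
def percent_error(predicted, actual):
    """Signed relative error of a prediction, as a percentage.
    Positive means the model over-predicted the run time."""
    return 100.0 * (predicted - actual) / actual

# 8-processor Sweep3D row from the table above
err = percent_error(761.28, 725.46)   # ~4.94%
```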
Chimaera on the HPSG Cluster
Processors | Prediction (s) | Actual (s) | Error
8          | 118.08         | 107.35     | 9.99%
10         | 98.42          | 91.02      | 8.14%
12         | 86.16          | 80.92      | 6.47%
16         | 70.27          | 65.74      | 6.89%
20         | 60.66          | 56.51      | 7.35%
22         | 59.77          | 59.04      | 1.23%
24         | 54.18          | 52.06      | 4.07%
32         | 47.97          | 45.59      | 5.22%

Chimaera, 60x60x60 problem size
Cray (Jaguar) XT4 Supercomputer
• 16 x dual/quad-core AMD processors, 2.6GHz, per blade
• 6 blades per cabinet = 96 processors (note: can be more cores)
• 2GB of memory per processor
• Each AMD processor is connected to a SeaStar router chip which processes network transmissions.
• 3D-torus SeaStar network
• Cray UNICOS (Catamount) operating system, a lightweight kernel
Sweep3D on the Cray XT4
Sweep3D, 50x50x50 cells per Processor
Processors | Predicted (s) | Actual (s) | Error (%)
4          | 3.96          | 4.13       | -4.09
16         | 4.52          | 4.91       | -7.87
32         | 4.96          | 5.54       | -10.53
64         | 5.64          | 5.47       | 3.06
128        | 6.38          | 6.01       | 6.19
256        | 7.86          | 7.28       | 7.96
512        | 9.35          | 10.02      | -6.73
Comparing Results
• Los Alamos model (Kerbyson et al., analytical): 5-15% error; a general wavefront model, less accurate as the number of processors increases (> 20%).
• General LogGP model (Sundaram-Stukel & Vernon, analytical), currently being revised by Mudalige and Vernon: a general wavefront model with extra contention and SMP-case terms (< 10%).
• WPPT (simulative): errors < 10% on the HPSG cluster, 10-15% on the XT4; currently does not model network contention.
Conclusions
Quick Recap
• Simulation as a performance prediction technique needs some new ideas, motivated by changing architectures, networks and more complex codes
• Wavefront codes account for a large amount of supercomputing time and are used as a method of benchmarking and understanding machine performance
• WPPT is able to produce average errors of ~10% on a variety of hardware, comparable to other techniques but not as accurate in all cases
Where in 2007-2008?
• More models - wavefront and other types (adaptive hydrocodes)
• Using simulation outputs for visualising decomposition and execution
• Improving simulation performance and automated code analysis tools
• Question of how multi-core systems will be modelled: these are extremely complex, and the technology is moving very quickly
• Question of contention when using more cores: does putting more cores on the same network equal better performance?
Thank you for your attention
Questions?