Towards Semi-Automatic Prediction of Wavefront Code Performance
Si Hammond (Gihan Mudalige, Dr Stephen Jarvis, Prof. M.K. Vernon)
High Performance Systems Group
Department of Computer Science
Introduction
• Performance Prediction - the why.
• Current Techniques - the how.
• Introduction to the Warwick Performance Prediction Toolkit (WPPT)
• Sample Wavefront Applications - Sweep3D and Chimaera
• Results
• Conclusion
Performance Prediction
Performance Prediction: Why?
• Principal Question: How long does it take to execute a piece of code on a particular machine/network?

DO i = 0, n
   SUM = SUM + n**3
END DO

• Execution Time = Time(for one iteration) * n + Loop Overhead
Performance Prediction: Why?
• 2 processors, each does n/2 of the work

DO i = 0, n/2
   SUM = SUM + n**3
END DO
IF (I AM P0) THEN
   SEND(P1, SUM); RECV(P1, SUM0); SUM = SUM + SUM0
ELSE
   SEND(P0, SUM); RECV(P0, SUM1); SUM = SUM + SUM1
END IF

• Execution Time = Time(for one iteration) * (n/2) + Send + Recv
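As a sketch, the two cost formulas above can be written as a tiny model. All of the timing parameters below are hypothetical placeholders, not measured values:

```python
# Minimal sketch of the serial and 2-processor cost models above.
# All parameters (t_iter, overheads, message costs) are invented placeholders.

def predict_serial(t_iter, n, loop_overhead=0.0):
    """Execution time = time(for one iteration) * n + loop overhead."""
    return t_iter * n + loop_overhead

def predict_parallel(t_iter, n, t_send, t_recv):
    """Each of 2 processors does n/2 work, then exchanges partial sums."""
    return t_iter * (n / 2) + t_send + t_recv

# Example: 1 microsecond per iteration, one million iterations
serial = predict_serial(1e-6, 1_000_000)                   # ~1.0 s
parallel = predict_parallel(1e-6, 1_000_000, 1e-4, 1e-4)   # ~0.5 s plus comms
```

Changing the parameters answers the "what if" questions that follow: more processors shrink the compute term, a faster network shrinks the send/receive terms.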
Performance Prediction: Why?
• What if...
  ‣ More processors?
  ‣ Faster processors?
  ‣ Faster network?
  ‣ Different network topology?
Performance Prediction: Why?
‣ Enables speculation on different architectures
‣ Can verify whether a supplied architecture is as good as a vendor says it is
‣ Can find potential sources of optimisation within code
‣ Can find potential sources of specialisation within an architecture
‣ Can check that an upgrade/change in hardware has/hasn’t affected performance
‣ Can improve utilisation and scheduling
Performance Prediction: How?
• Two main classes of performance prediction:
• Analytical (e.g. LogP, LogGP)
  ‣ Involves decomposing an application into mathematical functions
  ‣ Requires a very deep understanding of the code and how it will execute
  ‣ Generally very accurate, but models take a long time to develop
  ‣ Once the model is developed, predictions are quick to obtain
• Simulation (e.g. PACE)
  ‣ Involves decomposing an application into some form of simulation
  ‣ Often requires less knowledge of application behaviour, provided the simulator captures this when executed
  ‣ Generally accurate if the simulator captures machine behaviour; often less time needed to develop
  ‣ Simulation runs can take a long time
The Warwick Performance Prediction Toolkit
• In a lab..... long long ago....
Application decomposed into a series of mnemonics representing machine code operations, e.g.

sum = sum + a*3

MFDG (Multiplication, Floating Point Double Global)
+ AFDG (Addition, Floating Point Double Global)
+ SFDG (Store, Floating Point Double Global)

• To generate a prediction, obtain the cost of each mnemonic by benchmarking, then add them up
Simulation History

sum = sum + a*2 + b*2

2*MFDG + 2*AFDG + SFDG
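The costing process above can be sketched as a lookup-and-sum over mnemonic counts. The per-mnemonic times below are invented placeholders; in PACE the real values come from benchmarking the target machine:

```python
# Sketch of PACE-style mnemonic costing.
# Cost (seconds) per mnemonic -- invented values, stand-ins for benchmarks.
costs = {
    "MFDG": 4e-9,  # Multiplication, Floating Point Double Global
    "AFDG": 2e-9,  # Addition, Floating Point Double Global
    "SFDG": 1e-9,  # Store, Floating Point Double Global
}

def predict(mnemonic_counts):
    """Sum the benchmarked cost of every mnemonic occurrence."""
    return sum(costs[m] * count for m, count in mnemonic_counts.items())

# sum = sum + a*3        ->  MFDG + AFDG + SFDG
t1 = predict({"MFDG": 1, "AFDG": 1, "SFDG": 1})
# sum = sum + a*2 + b*2  ->  2*MFDG + 2*AFDG + SFDG
t2 = predict({"MFDG": 2, "AFDG": 2, "SFDG": 1})
```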
[Figure: "Pipelining and Operation Codes in Action" — timelines of PACE versus the processor executing floating point multiplications; PACE's sequential total is an over-estimate of the pipelined time.]
PACE and Pipelining
• Over-prediction: pipelining and out-of-order execution make predicting the precise order of instructions largely impossible.
• Doing this in a generic way is even more difficult.
• The problem is made worse by compiler optimisation and the use of special register sets such as SSE, 3DNow!, AltiVec etc.
Alternative Approach
• Replace the use of mnemonics with blocks of time taken from actual runs (in a similar manner to the analytical approach).
• Prediction process:
  1. Inject the code with timing routines
  2. Execute the instrumented application to obtain timings
  3. Generate a simulation with the timings added
  4. Simulate the code
• Similar to PACE, but with timings used instead of mnemonics
WPPT Research Questions
• Is it possible to automate the generation of accurate models for Fortran code?
• Is it possible to automate the injection of timing routines within code?
• How accurate can the automated approach be compared to hand-written analytical models?
WPPT Modelling Process
1. Parse code through the (automated) analyser → Instrumented Code + WPPT Model
2. Compile and run the injected code → Timings
3. Add timing and networking values to the model → WPPT Model + Timings
4. Simulate → Prediction
Automatic Model Generation
• Challenge: can we take Fortran 90 code and automatically generate a model?
• Method:
1. Copy control flow from original code
2. Analyse the original code; instead of computations, add a placeholder recording that computation occurs at this point
3. Copy over expressions needed for control flow (as much as possible)
• Problem: scientific codes often read values from files; these aren't available when generating a model, so user intervention is required
Creating a Model
Fortran 90 Code:

SUM = 0
DO i = 0, 1000
   SUM = SUM + i*i
END DO

WPPT Script:

for(i=0; i<1000; i++) {
   compute(region1);
}

• region1 is costed by timing the code when it actually runs.
• The abbreviated form replaces the computation in 'region1' with a single time
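A minimal sketch of this timing-block idea, in Python rather than WPPT's script language: the instrumented run records how long the compute region takes, and the "simulation" then replays the single recorded time instead of re-executing the computation. Function and region names here are illustrative, not WPPT's actual API:

```python
import time

timings = {}  # region name -> measured wall-clock time

def instrumented_run():
    """Instrumented run: time the region while actually executing it."""
    start = time.perf_counter()
    total = 0
    for i in range(1001):        # DO i = 0, 1000
        total = total + i * i
    timings["region1"] = time.perf_counter() - start
    return total

def compute(region):
    """Simulation side: replay the recorded time instead of computing."""
    return timings[region]

instrumented_run()
predicted = compute("region1")   # one stored time stands in for the loop
```

The design point is that the simulator never needs to understand what the region computes, only how long it took on the target machine.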
Limits to Automation
• Difficult to fully analyse code using limited analysis (a balance between speed and accuracy is needed).
• Even good compiler analysis doesn't produce all of the answers we need!
• Some parameters are read from file by the code; these can't be obtained at 'analysis' time
• Answer: the user must step in to complete the model (usually by setting 'global' values which would have been read from a file).
• A model can still be produced quickly, but it often needs manual tweaking
Performance Prediction in Practice:
Wavefront Codes
Wavefront Codes
• Account for 50-80% of execution time in defence programs.
• Wavefront codes are used to solve equations where the output of each cell relies on the output of all of its neighbours.
Wavefront Codes
• Solve cells in a series of ‘sweeps’ - one sweep from each corner of the array
• We solve as much as we can in each sweep, anything left over is solved in a later sweep
[Figure: diagonal wavefront sweeping across the grid; legend: Being Processed / Processed / Unprocessed]
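The sweep described above can be sketched for a 2D grid: a cell (i, j) can only be processed once its upstream neighbours (i-1, j) and (i, j-1) are done, so all cells on the same anti-diagonal form one "wavefront" and an nx-by-ny grid is covered in nx + ny - 1 steps. This is a simplified illustration, not the Sweep3D/Chimaera implementation:

```python
# Sketch of wavefront ordering for one sweep from the top-left corner of a
# 2D grid. Cell (i, j) depends on (i-1, j) and (i, j-1), so every cell with
# the same i + j sits on one wavefront and can be solved in parallel.
def wavefront_steps(nx, ny):
    steps = []
    for d in range(nx + ny - 1):             # one step per anti-diagonal
        front = [(i, d - i) for i in range(nx) if 0 <= d - i < ny]
        steps.append(front)
    return steps

steps = wavefront_steps(3, 3)
# A 3x3 grid needs 3 + 3 - 1 = 5 steps; the middle front is the widest,
# which is where the most parallelism is available.
```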
Wavefront Codes and Prediction
• Occupy a large proportion of supercomputing execution time (~50-80%)
• Ranked by the US Department of Energy and the UK AWE as an important benchmark of machine performance
• Relatively simple codes to benchmark
• Lots of established literature on the topic
Results
WPPT (Semi-Automated) Model
for(isw=1; isw<=8; isw=isw+1) {
  for(l=lfirst; l<=llast; l=l+lstep) {
    if(from_top!=mpi_proc_null)
      if((isw == 1) || (isw == 2) || (isw == 4) || (isw == 6))
        mpi_recv(proc_top, topbot_bufl*mpi_myreal);
    if(from_right!=mpi_proc_null)
      if((isw == 1) || (isw == 2) || (isw == 3) || (isw == 5))
        mpi_recv(proc_right, rightleft_bufl*mpi_myreal);
    if(from_bot!=mpi_proc_null)
      if((isw == 3) || (isw == 5) || (isw == 7) || (isw == 8))
        mpi_recv(proc_bot, topbot_bufl*mpi_myreal);
    if(from_left!=mpi_proc_null)
      if((isw == 4) || (isw == 6) || (isw == 7) || (isw == 8))
        mpi_recv(proc_left, rightleft_bufl*mpi_myreal);

    // Computation replaced with:
    compute(cpu_time_region_1);
    ....

    if(to_top!=mpi_proc_null) {
      if((isw == 3) || (isw == 5) || (isw == 7) || (isw == 8)) {
        mpi_send(proc_top, topbot_bufl*mpi_myreal);
      }
    }
    if(to_right!=mpi_proc_null) {
      if((isw == 4) || (isw == 6) || (isw == 7) || (isw == 8)) {
        mpi_send(proc_right, rightleft_bufl*mpi_myreal);
      }
    }
    if(to_bot!=mpi_proc_null) {
      if((isw == 1) || (isw == 2) || (isw == 4) || (isw == 6)) {
        mpi_send(proc_bot, topbot_bufl*mpi_myreal);
      }
    }
    if(to_left!=mpi_proc_null) {
      if((isw == 1) || (isw == 2) || (isw == 3) || (isw == 5)) {
        mpi_send(proc_left, rightleft_bufl*mpi_myreal);
      }
    }
  }
  mark_stack();
  if(rank == 0) {
    writeln("Processor[0] completed sweep: " + isw);
  }
}
HPSG Cluster
• 2 x AMD Opteron 246 2GHz processors per node
• 2GB RAM per node (shared between the 2 processors)
• 16 nodes = 32 processors
• Gigabit Ethernet between all nodes
• MPICH v1 (MPI libraries), compiled for shared-memory transfer
• Portland Group Fortran 90, C and C++ compilers
Sweep3D on the HPSG Cluster
Processors | Prediction (s) | Actual (s) | Error
8          | 761.28         | 725.46     | 4.94%
10         | 610.35         | 575.62     | 6.03%
12         | 522.60         | 478.45     | 9.23%
16         | 402.66         | 366.19     | 9.96%
20         | 322.88         | 307.87     | 4.88%
22         | 280.98         | 305.01     | -7.88%
24         | 269.70         | 268.03     | 0.62%
32         | 203.22         | 201.16     | 1.02%

Sweep3D, 500x500x500 problem; average error ~5.57%
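The error columns in these tables are the signed relative error of the prediction against the measured run; a quick sketch of the calculation using the 8-processor Sweep3D row:

```python
def percent_error(predicted, actual):
    """Signed relative error of a prediction, as a percentage.
    Positive means the model over-predicted the run time."""
    return 100.0 * (predicted - actual) / actual

# 8-processor Sweep3D row from the table above
err = percent_error(761.28, 725.46)   # ~4.94%
```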
Chimaera on the HPSG Cluster
Processors | Prediction (s) | Actual (s) | Error
8          | 118.08         | 107.35     | 9.99%
10         | 98.42          | 91.02      | 8.14%
12         | 86.16          | 80.92      | 6.47%
16         | 70.27          | 65.74      | 6.89%
20         | 60.66          | 56.51      | 7.35%
22         | 59.77          | 59.04      | 1.23%
24         | 54.18          | 52.06      | 4.07%
32         | 47.97          | 45.59      | 5.22%

Chimaera, 60x60x60 problem size
Cray (Jaguar) XT4 Supercomputer
• 16 x dual/quad-core AMD processors, 2.6GHz, per blade
• 6 blades per cabinet = 96 processors (note: can be more cores)
• 2GB of memory per processor
• Each AMD processor is connected to a SeaStar router chip which processes network transmissions.
• 3D-torus SeaStar network
• Cray UNICOS (Catamount) operating system, a lightweight kernel
Sweep3D on the Cray XT4
Sweep3D, 50x50x50 cells per Processor
Processors | Predicted (s) | Actual (s) | Error (%)
4          | 3.96          | 4.13       | -4.09
16         | 4.52          | 4.91       | -7.87
32         | 4.96          | 5.54       | -10.53
64         | 5.64          | 5.47       | 3.06
128        | 6.38          | 6.01       | 6.19
256        | 7.86          | 7.28       | 7.96
512        | 9.35          | 10.02      | -6.73
Comparing Results
• Los Alamos model (Kerbyson et al., analytical): 5-15% error; a general wavefront model, less accurate as the number of processors increases (> 20%).
• General LogGP model (Sundaram-Stukel & Vernon, analytical), currently being revised by Mudalige and Vernon: a general wavefront model with extra contention and SMP-case terms (< 10%).
• WPPT (simulative): errors < 10% on the HPSG cluster, 10-15% on the XT4; currently does not model network contention.
Conclusions
Quick Recap
• Simulation as a performance prediction technique needs some new ideas, motivated by changing architectures, networks and more complex codes
• Wavefront codes account for a large amount of supercomputing time and are used as a method of benchmarking and understanding machine performance
• WPPT is able to produce average errors of ~10% on a variety of hardware, comparable to other techniques but not as accurate in all cases
Where in 2007-2008?
• More models - wavefront and other types (adaptive hydrocodes)
• Using simulation outputs for visualising decomposition and execution
• Improving simulation performance and automated code analysis tools
• Question of how multi-core systems will be modelled: these are extremely complex, and the technology is moving very quickly
• Question of contention when using more cores: does putting more cores on the same network equal better performance?
Thank you for your attention
Questions?