Towards Semi-Automatic Prediction of Wavefront Code Performance

Si Hammond (Gihan Mudalige, Dr Stephen Jarvis, Prof. M.K. Vernon)

High Performance Systems Group

Department of Computer Science

Introduction

• Performance Prediction - the why.

• Current Techniques - the how.

• Introduction to the Warwick Performance Prediction Toolkit (WPPT)

• Sample Wavefront Applications - Sweep3D and Chimaera

• Results

• Conclusion

Performance Prediction

Performance Prediction: Why?

• Principal Question: How long does it take to execute a piece of code on a particular machine/network?

    DO i = 1, n
      SUM = SUM + n**3
    END DO

• Execution Time = Time(for one iteration) * n + Loop Overhead
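The formula above can be written as a one-line function. This is a minimal sketch rather than part of the talk: the function name and the example timings are hypothetical, chosen only to show how the two terms combine.

```python
def predict_serial_time(time_per_iteration: float, n: int,
                        loop_overhead: float = 0.0) -> float:
    """Execution Time = Time(for one iteration) * n + Loop Overhead."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return time_per_iteration * n + loop_overhead


# Hypothetical measurements (seconds): 2 ns per iteration, 1 us of loop overhead.
print(predict_serial_time(2.0e-9, 1_000_000, loop_overhead=1.0e-6))
```

In practice the per-iteration time would come from benchmarking the loop body on the target machine.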

Performance Prediction: Why?

• Principal Question: How long does it take to execute a piece of code on a particular machine/network?

• 2 processors, each does n/2 work

    DO i = 1, n/2
      SUM = SUM + n**3
    END DO

    IF (I AM P0) THEN
      SEND(P1, SUM); RECV(P1, SUM0); SUM = SUM + SUM0
    ELSE
      SEND(P0, SUM); RECV(P0, SUM1); SUM = SUM + SUM1
    END IF

• Execution Time = Time(for one iteration) * (n/2) + Send + Recv

• What if.....
  • More processors?
  • Faster processors?
  • Faster network?
  • Different Network Topology?
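The "what if" questions can be explored by turning the two-processor formula into a function and varying its inputs. The sketch below is an illustration under stated assumptions: the generalisation from 2 to `processors` workers with a single send and receive is an assumption (the slides only give the 2-processor case), all timings are hypothetical, and network topology is not represented at all.

```python
def predict_parallel_time(time_per_iteration: float, n: int, processors: int = 2,
                          send: float = 0.0, recv: float = 0.0) -> float:
    """Execution Time = Time(for one iteration) * (n / processors) + Send + Recv."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return time_per_iteration * (n / processors) + send + recv


# Hypothetical baseline (seconds); each scenario overrides some of its inputs.
baseline = dict(time_per_iteration=2.0e-9, n=1_000_000, processors=2,
                send=5.0e-5, recv=5.0e-5)
scenarios = {
    "baseline": {},
    "more processors": {"processors": 8},
    "faster processors": {"time_per_iteration": 1.0e-9},
    "faster network": {"send": 1.0e-5, "recv": 1.0e-5},
}
for name, change in scenarios.items():
    print(name, predict_parallel_time(**{**baseline, **change}))
```

Even this toy model shows why prediction is useful: each scenario is answered by re-evaluating a formula instead of buying or reconfiguring hardware.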

Performance Prediction: Why?

• Principal Question: How long does it take to execute a piece of code on a particular machine/network?

‣ Enables speculation on different architectures

‣ Can verify whether a supplied architecture is as good as a vendor says it is

‣ Can find potential sources of optimisation within code

‣ Can find potential sources of specialisation within an architecture

‣ Can check that an upgrade/change in hardware has/hasn’t affected performance

‣ Can improve utilisation and scheduling

Performance Prediction: How?

• Two main classes of performance prediction:

• Analytical (e.g. LogP, LogGP)
  ‣ Involves decomposing an application into mathematical functions
  ‣ Requires a very deep understanding of the code and how it will execute
  ‣ Generally very accurate but requires a long time to develop
  ‣ Once the model is developed, predictions are quick to obtain

• Simulation (e.g. PACE)
  ‣ Involves decomposing an application into some form of simulation
  ‣ Often requires less knowledge of application behaviour, provided the simulator captures this when executed
  ‣ Generally accurate if the simulator captures machine behaviour; often less time needed to develop
  ‣ Simulation can take a long time
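As an illustration of what "decomposing into mathematical functions" means on the analytical side, the sketch below evaluates the standard LogGP cost of a single k-byte point-to-point message, o + (k-1)G + L + o. The talk only names LogGP; the class layout and the parameter values here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogGP:
    L: float  # network latency
    o: float  # per-message processor overhead (paid by sender and by receiver)
    g: float  # minimum gap between consecutive messages
    G: float  # gap per byte for long messages
    P: int    # number of processors

    def message_time(self, k_bytes: int) -> float:
        """End-to-end time of one k-byte message: o + (k-1)*G + L + o."""
        if k_bytes < 1:
            raise ValueError("message must contain at least one byte")
        return self.o + (k_bytes - 1) * self.G + self.L + self.o


# Hypothetical parameters in microseconds.
net = LogGP(L=5.0, o=1.0, g=2.0, G=0.5, P=2)
print(net.message_time(3))
```

An analytical application model is built by composing terms like this with measured computation times, which is why it needs a deep understanding of the code's communication pattern.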

The Warwick Performance Prediction Toolkit

Simulation History

• In a lab..... long long ago....

Application decomposed into a series of mnemonics representing machine code operations e.g.

    sum = sum + a*3

    MFDG (Multiplication, Floating Point Double Global)
    + AFDG (Addition, Floating Point Double Global)
    + SFDG (Store, Floating Point Double Global)

• To generate a prediction, obtain the cost of each mnemonic by benchmarking, then add them up
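The mnemonic approach amounts to a weighted sum of benchmarked operation costs. A minimal sketch, assuming hypothetical per-operation costs (the real values would come from benchmarking each mnemonic on the target machine):

```python
def predict_from_mnemonics(op_counts: dict, op_costs: dict) -> float:
    """Add up count * benchmarked cost for every mnemonic in a statement."""
    return sum(count * op_costs[op] for op, count in op_counts.items())


# sum = sum + a*3  ->  MFDG + AFDG + SFDG
statement = {"MFDG": 1, "AFDG": 1, "SFDG": 1}

# Hypothetical benchmarked costs in seconds per operation.
costs = {"MFDG": 4.0e-9, "AFDG": 3.0e-9, "SFDG": 1.0e-9}

print(predict_from_mnemonics(statement, costs))
```

Because the costs are simply added, any overlap between operations inside the processor is ignored.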

Pipelining and Operation Codes in Action

    sum = sum + a*2 + b*

    2*MFDG + 2*AFDG + SFDG

[Figure: timeline (axis: Time) of the floating point multiplications as costed by PACE and as executed by the processor; the difference is labelled Over-Estimate]

PACE and Pipelining

• Over-prediction - pipelining and out-of-order execution make predicting the precise order of instructions largely impossible.

• Doing this in a generic way is even more difficult.

• The problem is made worse by compiler optimisation and the use of special register sets such as SSE, 3DNow!, AltiVec etc.

Alternative Approach

• Replace the use of mnemonics with blocks of time taken from actual runs (in a similar way to analytical methods).

• Prediction process:
  • Inject code with timing routines
  • Execute instrumented application to obtain timings
  • Generate a simulation with timings added
  • Simulate code

• Similar to PACE, with timings used instead of mnemonics
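The first two steps of the prediction process, injecting timing routines and running the instrumented code, can be sketched as follows. This is an illustration only: WPPT instruments Fortran, whereas this sketch uses Python's `time.perf_counter`, and the names `timed_region`, `region_times` and `mean_region_time` are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Measured wall-clock samples for each named code region.
region_times = defaultdict(list)


@contextmanager
def timed_region(name: str):
    """Record how long the enclosed block takes under the given region name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        region_times[name].append(time.perf_counter() - start)


def mean_region_time(name: str) -> float:
    """Average measured time of a region: the 'block of time' fed to the model."""
    samples = region_times[name]
    return sum(samples) / len(samples)


# Instrumented version of a simple summation loop.
total = 0
for i in range(1000):
    with timed_region("region1"):
        total += i * i

print(mean_region_time("region1"))
```

The timer overhead is included in each sample, so for very small regions it is usually better to time a larger enclosing block.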

WPPT Research Questions

• Is it possible to automate the generation of accurate models for Fortran code?

• Is it possible to automate the injection of timing routines within code?

• How accurate can the automated approach be compared to hand-written analytical models?

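WPPT Modelling Process

[Figure: flow diagram of the modelling process]

1. Parse code through (automated) analyser (outputs: Instrumented Code, WPPT Model)

2. Compile and run injected code (output: Timings)

3. Add timing and networking values to model (output: WPPT Model + Timings)

4. Simulate (output: Prediction)

Diagram label: Analytical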

Automatic Model Generation

• Challenge: can we take Fortran 90 code and automatically generate a model?

• Method:

1. Copy control flow from original code

2. Analyse the original code; in place of each computation, add a placeholder to record that computation occurs at this point

3. Copy over expressions needed for control flow (as much as possible)

• Problem: scientific codes often read values from files. These values aren’t available when generating a model, so user intervention is required

Creating a Model

Fortran 90 Code:

    SUM = 0
    DO i = 0, 999
      SUM = SUM + i*i
    END DO

WPPT Script:

    FOR(i=0; i<1000; i++) {
      compute(region1);
    }

• The time for region1 is obtained by timing the code when it actually runs.

• The abbreviated form replaces the computation in ‘region1’ with a single time
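To make the idea of a model script concrete, the sketch below "simulates" the script above by advancing a virtual clock instead of doing the arithmetic. It is not WPPT itself: `ModelClock` and `run_model` are hypothetical names, and the region time is a made-up measurement.

```python
class ModelClock:
    """Virtual clock: compute(region) adds that region's measured time."""

    def __init__(self, region_times: dict):
        self.region_times = dict(region_times)
        self.elapsed = 0.0

    def compute(self, region: str) -> None:
        self.elapsed += self.region_times[region]


def run_model(clock: ModelClock) -> float:
    """Mirror of the model script: same control flow, computation replaced."""
    for _ in range(1000):
        clock.compute("region1")
    return clock.elapsed


# Hypothetical measured time for region1 (seconds per execution).
print(run_model(ModelClock({"region1": 2.5e-9})))
```

The control flow is copied from the source, so changing the loop bound in the model changes the prediction without re-running the real code.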

Limits to Automation

• Difficult to fully analyse code using limited analysis (need to balance speed and accuracy).

• Even good compiler analysis doesn’t produce all of the answers we need!

• Some parameters are read from file by the code, so they can’t be obtained at ‘analysis’-time

• Answer: User must step in to complete model (usually by setting ‘global’ values which would have been read from a file).

• A model can still be produced quickly, but it often needs manual tweaking

Performance Prediction in Practice:

Wavefront Codes

Wavefront Codes

• Account for 50-80% of execution time in defence programs.

• Wavefront codes are used to solve equations where the output of each cell relies on the output of all of its neighbours.

Wavefront Codes

• Solve cells in a series of ‘sweeps’ - one sweep from each corner of the array

• We solve as much as we can in each sweep, anything left over is solved in a later sweep

[Figure: grid of cells during a sweep. Legend: Being Processed, Processed, Unprocessed]
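The order in which a sweep visits cells can be sketched for a 2D grid. This is a simplified illustration: it assumes each cell depends only on its two upstream neighbours in the sweep direction, so every cell on the same anti-diagonal can be processed in the same step. The function name and the `flip_x`/`flip_y` flags (which select the starting corner) are hypothetical.

```python
def wavefront_order(nx: int, ny: int, flip_x: bool = False, flip_y: bool = False):
    """Return a list of steps; each step lists the (x, y) cells on one wavefront.

    The sweep starts at corner (0, 0) unless flip_x / flip_y move the start
    to the far side of that axis.
    """
    steps = []
    for d in range(nx + ny - 1):          # d indexes the anti-diagonals
        front = []
        for i in range(nx):
            j = d - i
            if 0 <= j < ny:
                x = nx - 1 - i if flip_x else i
                y = ny - 1 - j if flip_y else j
                front.append((x, y))
        steps.append(front)
    return steps


for step, cells in enumerate(wavefront_order(3, 3)):
    print(step, cells)
```

A sweep over an nx by ny grid therefore needs nx + ny - 1 steps, and the amount of available parallelism grows and then shrinks as the front crosses the grid.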

Wavefront Codes and Prediction

• Occupy a large proportion of super-computing execution time (~50-80%)

• Ranked by US Department of Energy and UK AWE as an important benchmark of machine performance

• Relatively simple benchmarking codes

• Lots of established literature on topic

Results

WPPT (Semi-Automated) Model

    for(isw=1;isw<=8;isw=isw+1) {
      for(l=lfirst;l<=llast;l=l+lstep) {
        if(from_top!=mpi_proc_null)
          if((isw == 1) || (isw == 2) || (isw == 4) || (isw == 6))
            mpi_recv(proc_top, topbot_bufl*mpi_myreal);
        if(from_right!=mpi_proc_null)
          if((isw == 1) || (isw == 2) || (isw == 3) || (isw == 5))
            mpi_recv(proc_right, rightleft_bufl*mpi_myreal);
        if(from_bot!=mpi_proc_null)
          if((isw == 3) || (isw == 5) || (isw == 7) || (isw == 8))
            mpi_recv(proc_bot, topbot_bufl*mpi_myreal);
        if(from_left!=mpi_proc_null)
          if((isw == 4) || (isw == 6) || (isw == 7) || (isw == 8)) {
            mpi_recv(proc_left, rightleft_bufl*mpi_myreal);
          }

        // Computation Replaced with:
        compute(cpu_time_region_1);
        ....

        if(to_top!=mpi_proc_null) {
          if((isw == 3) || (isw == 5) || (isw == 7) || (isw == 8)) {
            //writeln("processor " + rank + " sending to: " + proc_top);
            mpi_send(proc_top, topbot_bufl*mpi_myreal);
          }
        }
        if(to_right!=mpi_proc_null) {
          if((isw == 4) || (isw == 6) || (isw == 7) || (isw == 8)) {
            //writeln("processor " + rank + " sending to: " + proc_right);
            mpi_send(proc_right, rightleft_bufl*mpi_myreal);
          }
        }
        if(to_bot!=mpi_proc_null) {
          if((isw == 1) || (isw == 2) || (isw == 4) || (isw == 6)) {
            //writeln("processor " + rank + " sending to: " + proc_bot);
            mpi_send(proc_bot, topbot_bufl*mpi_myreal);
          }
        }
        if(to_left!=mpi_proc_null) {
          if((isw == 1) || (isw == 2) || (isw == 3) || (isw == 5)) {
            //writeln("processor " + rank + " sending to: " + proc_left);
            mpi_send(proc_left, rightleft_bufl*mpi_myreal);
          }
        }
      }
      mark_stack();
      if(rank == 0) {
        writeln("Processor[0] completed sweep: " + isw);
      }
    }
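The receive, compute, send pattern in the model can be turned into a toy simulation of one sweep across a processor grid. This is a simplified sketch, not the WPPT simulator: it assumes a single sweep starting at processor (0, 0), a fixed compute time per block, a fixed message delay, and no network contention. All names and numbers are hypothetical.

```python
def simulate_sweep(px: int, py: int, blocks: int,
                   compute_time: float, message_time: float) -> float:
    """Return the time at which the last processor finishes one sweep.

    For every block, a processor waits for its own previous block and for the
    messages from its upstream neighbours, computes, then sends downstream.
    """
    prev = [[0.0] * py for _ in range(px)]    # finish time of the previous block
    for _ in range(blocks):
        cur = [[0.0] * py for _ in range(px)]
        for i in range(px):
            for j in range(py):
                ready = prev[i][j]
                if i > 0:                      # message from the upstream row
                    ready = max(ready, cur[i - 1][j] + message_time)
                if j > 0:                      # message from the upstream column
                    ready = max(ready, cur[i][j - 1] + message_time)
                cur[i][j] = ready + compute_time
        prev = cur
    return max(max(row) for row in prev)


# Hypothetical 4x4 processor grid, 10 blocks, times in seconds.
print(simulate_sweep(4, 4, blocks=10, compute_time=1.0e-3, message_time=1.0e-4))
```

The result shows the pipeline fill cost: downstream processors sit idle until the wavefront reaches them, which is why wavefront codes scale differently from embarrassingly parallel loops.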

HPSG Cluster

• 2 x AMD 246, 2 GHz Processors per node

• 2 GB RAM per node (shared between 2 processors)

• 16 x nodes = 32 processors

• Gigabit Ethernet between all nodes

• MPICH v1 (MPI Libraries), compiled for shared memory transfer

• Portland Group Fortran 90, C and C++ Compilers

Sweep3D on the HPSG Cluster

Processors   Prediction (s)   Actual (s)   Error
 8           761.28           725.46        4.94%
10           610.35           575.62        6.03%
12           522.60           478.45        9.23%
16           402.66           366.19        9.96%
20           322.88           307.87        4.88%
22           280.98           305.01       -7.88%
24           269.70           268.03        0.62%
32           203.22           201.16        1.02%

Sweep3D, 500x500x500 Problem

Average Error ~ 5.57%
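The error column and the quoted average can be reproduced from the table. The sketch below assumes the error is (prediction - actual) / actual and that the average is taken over absolute values; both assumptions are consistent with the figures shown, and the first column is taken to be the processor count.

```python
# (processors, prediction in s, actual in s) for Sweep3D on the HPSG cluster.
rows = [
    (8, 761.28, 725.46), (10, 610.35, 575.62), (12, 522.60, 478.45),
    (16, 402.66, 366.19), (20, 322.88, 307.87), (22, 280.98, 305.01),
    (24, 269.70, 268.03), (32, 203.22, 201.16),
]


def percent_error(predicted: float, actual: float) -> float:
    """Signed relative error of the prediction, in percent."""
    return (predicted - actual) / actual * 100.0


errors = [percent_error(p, a) for _, p, a in rows]
mean_abs_error = sum(abs(e) for e in errors) / len(errors)

for (procs, _, _), err in zip(rows, errors):
    print(f"{procs:3d}  {err:6.2f}%")
print(f"mean absolute error: {mean_abs_error:.2f}%")
```

A positive error means the model over-predicted the run time; only the 22-processor run is under-predicted.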

Chimaera on the HPSG Cluster

Processors   Prediction (s)   Actual (s)   Error
 8           118.08           107.35       9.99%
10            98.42            91.02       8.14%
12            86.16            80.92       6.47%
16            70.27            65.74       6.89%
20            60.66            56.51       7.35%
22            59.77            59.04       1.23%
24            54.18            52.06       4.07%
32            47.97            45.59       5.22%

Chimaera, 60x60x60 Problem Size

Cray (Jaguar) XT4 Supercomputer

• 16 x Dual/Quad Core AMD Processors, 2.6 GHz per blade

• 6 blades per cabinet = 96 processors (note, can be more cores)

• 2 GB of memory per processor

• Each AMD processor is connected to a SeaStar router chip which processes network transmissions.

• 3D-torus SeaStar Network

• Cray UNICOS (Catamount) Operating System, lightweight kernel

Sweep3D on the Cray XT4

Sweep3D, 50x50x50 cells per Processor

Processors   Predicted (s)   Actual (s)   Error (%)
  4          3.96             4.13         -4.09
 16          4.52             4.91         -7.87
 32          4.96             5.54        -10.53
 64          5.64             5.47          3.06
128          6.38             6.01          6.19
256          7.86             7.28          7.96
512          9.35            10.02         -6.73

Comparing Results

• Los Alamos Model - Kerbyson et al. (Analytical): general wavefront model, errors of 5-15%, becoming less accurate as the number of processors increases (> 20%).

• General LogGP Model (Analytical) - Sundaram-Stukel & Vernon, currently being revised by Mudalige and Vernon: general wavefront model with extra contention and SMP-case terms, errors < 10%.

• WPPT (Simulative): errors < 10% on the HPSG cluster, 10-15% on the XT4; currently does not model network contention.

Conclusions

Quick Recap

• Simulation as a performance prediction technique needs some new ideas - motivated by changing architectures, networks and more complex codes

• Wavefront codes - account for large amount of super-computing time, used as a method of benchmarking and understanding machine performance

• WPPT is able to produce average errors of ~10% on a variety of hardware, comparable to other techniques but not as accurate in all cases

Where in 2007-2008?

• More models - wavefront and other types (adaptive hydrocodes)

• Using simulation outputs for visualising decomposition and execution

• Improving simulation performance and automated code analysis tools

• Question of how multi-core systems will be modelled - these are extremely complex, technology moving very quickly

• Question of contention when using multiple cores: does putting more cores on the same network = better performance?

Thank you for your attention

Questions?