The University of Edinburgh

Derived Metrics with Paraver using Hardware Counters on Power 5 Chips

Nicholas Pattakos

October 8, 2008

Contents

1 Introduction
    1.1 Dissertation structure
2 Profiling and paraver configuration files
    2.1 Software performance optimisation
    2.2 Tools for profiling serial applications
    2.3 Hardware counters
    2.4 Parallel profilers and libraries for parallel profiling
    2.5 Description of this project
3 Working environment
    3.1 The HPCx super-computing service
        3.1.1 Nodes: IBM eServer pSeries p5 575
        3.1.2 Power5 chips
        3.1.3 HPCx Interconnect
    3.2 Operating system
    3.3 Compiler, libraries and other software
    3.4 HPM toolkit
        3.4.1 hpmcount
        3.4.2 libHPM
    3.5 Using paraver to profile codes
        3.5.1 Basic paraver tracing
        3.5.2 Paraver instrumentation
        3.5.3 Paraver’s UI and basic views creation
        3.5.4 Creating a configuration file
4 Aim and methodology, test case programmes
    4.1 Custom written programmes
    4.2 Stream Benchmark
    4.3 LAMMPS code
    4.4 Using paraver to obtain information on sections of a code
5 Results
    5.1 64/32bit note
    5.2 Existing configuration files
    5.3 AFLOPS
6 Conclusions and tools evaluation
    6.1 Project assessment
    6.2 HPM evaluation
    6.3 Paraver
    6.4 Future work
A Appendix
    A.1 Source code and sample Makefile
    A.2 Sample Paraver .cfg file

List of Figures

1 Xprofiler snapshot processing the profiling data of a SPEC benchmark. Picture taken from IBM’s web site.
2 This is a picture of an MCM. Picture taken from wikipedia.
3 A trace loaded to paraver
4 Filtering Window
5 A created view
6 Timescale
7 Two views to be used to create a derived
8 Derived metric view

List of Tables

1 Available registers for hardware counters for some architectures. Information taken from [7].
2 HPCx’s interconnect main characteristics.
3 AFLOPS results from libHPM and Paraver for the a(i)+b(i) calculation in the simple “Hello” code

Acknowledgements

I would like to thank my supervisor who was always helpful and Prof Jesus Labarta,

Judit Gimenez, German Llort and Harald Servat from the Barcelona Supercomputing

Centre for their tutorials and their help. Without them this dissertation would have been

impossible. I would like to thank my parents who supported me throughout this year

that I was doing my MSc in HPC.

Abstract

Parallel profiling is the key to optimising a code for High Performance Computing.

Parallel profilers monitor what every element (process and/or thread) does while exe-

cuting and visualise statistics of what was monitored. Graphical representation of these

statistics makes it easy for a developer to identify performance problems. Some of the

problems that can be easily identified are load-imbalanced decompositions, excessive

communication overheads and a poor average rate of floating-point operations per second.

Paraver is such a performance analysis and visualisation tool which can be used to analyse

MPI, threaded or mixed mode programmes. Paraver can also report statistics based

on hardware counters, which are provided by the underlying hardware. This project

aimed to develop Paraver configuration files to allow certain metrics to be analysed for

IBM Power5, based on similar metrics that already exist for Power4 architectures. Their

accuracy was verified by profiling a set of codes using both Paraver and IBM’s HPM

toolkit, which is known to be accurate.

1 Introduction

’The First Rule of Program Optimisation: Don’t do it. The Second Rule of

Program Optimisation (for experts only!): Don’t do it yet.’

Michael A. Jackson.

Significant progress in the available computational hardware is constantly made and

software, already available, usually needs to be modified so that progress can be exploited.

The frequency of CPUs available as well as their transistor density has steadily

increased, roughly following Moore’s law for the last few decades; but the highest frequency

a CPU can reach with today’s technology seems to have hit a limit that is hard

to overcome. Meanwhile the number of transistors per mm² continues to get closer

to the natural limit of miniaturisation at atomic levels, following the same exponential

pace that Moore’s law predicted in 1965 and that almost dictates the processor industry today.

However, processing power continues to increase as new technologies evolve. Nowadays,

for example, efficient utilisation of multi-core processors, use of graphics processing

units for general-purpose computing, utilising the Cell Broadband Engine (Cell/B.E.

or simply the Cell chip) from IBM-Sony-Toshiba or utilising Field Programmable Gate

Arrays (FPGAs) can provide processing power that conventional CPUs will probably

take some years to deliver. At the time of writing, RoadRunner, the world’s fastest

supercomputer, is one such example where unique performance is available if one can

exploit the underlying hardware. New tools and software development kits (SDKs) are

developed and provided with new hardware; these can help in either utilising new hardware

or adapting already existing software to it. Quite often these new

tools require tools and SDKs that already exist, whilst new software is often built on

top of, or extends, existing software. Software profiling tools are one example.

They already exist and are often updated to support new architectures and features,

and they are of vital importance to HPC software development and source code

fine tuning. These tools provide run time information about a programme that makes porting

and optimising a code for a machine a lot easier, helping developers understand

a programme’s behaviour so that they can make changes that result in the programme

running significantly faster. This optimisation process can be quite difficult and is in fact

an active research area. Some technical reports on the HPCx web

page demonstrate how to use such tools in general, or how to use such techniques

to improve a programme’s performance or simply port it: [1], [2], [3].

Code development is a non-trivial task and the most important goal is always code

correctness. No matter how well written or innovative a programme might be, it is use-

less if it does not actually do what it is supposed to. In High Performance Computing

(HPC), next to code correctness, performing efficiently is also very important. Sophis-

ticated data structures or advanced numerical algorithms will not be beneficial if poorly

implemented. Designing an application carefully is also important but it cannot always

ensure that the implementation will perform as estimated. Therefore when writing new

software or when porting software to a new machine, after code verification, optimi-

sation is needed. Out of all source code lines that can be rewritten to perform better,

only some are worth tweaking. Typically, in an HPC code, only a small fraction of the

code processes the useful calculations and takes the most time to complete. Time spent

on optimising this fraction of the code will result in greater performance improvements

than spending the same time on optimising other parts of the code. This fact applies

to both serial and parallel codes but the latter require even more work, because parallelising

a code is also non-trivial and might ruin the performance of a code which

performs very well on a single process. There are several tools to assist a programmer

in improving serial performance and there are also some that monitor parallel execution.

These performance analysis tools are very important and useful for understanding

a programme’s behaviour so as to maximise hardware utilisation. Different tools have

different features and even though their users are programmers, who usually are power

users and more capable than the average computer user, ease of use is as important as

flexibility and functionality. A lot of HPC-related optimisations on a single process

actually aim at keeping enough data flowing into and out of the CPU so that wasted CPU cycles

are minimised. This is usually the case because the processing

power of microprocessors has improved significantly faster than that of computer subsystems

such as memory, interconnects or storage.

This project is related to what was just discussed in the sense that its aim is to implement

an easy way to access advanced features of a microprocessor, features that can be used

to optimise codes running on the microprocessor. Modern microprocessors are able

to monitor execution streams and measure several features of their run time behaviour

such as the number of instructions completed. These features are not very easy to

access at a low level, and are therefore usually used through higher level programmes that are

used in profiling codes. Profiling tools make use of these hardware features to extend the

range of information provided and to offer an easy way of accessing these advanced

features. The hardware that we worked on for this project is not as revolutionary as the

systems previously described but the same principles should apply to it as well. The

problems encountered and the lessons learned from this project do not differ radically

from those of other computer systems.

1.1 Dissertation structure

The remainder of this dissertation is structured as follows. The second section discusses

software optimisation and why it is a necessity, and describes software

commonly used for optimising code that runs as a single execution thread or as

multiple threads. It also describes the aim of this project and how it relates to the

rest of the section. Section three describes the hardware and software environment

in which this project took place. Section four describes the software that

was profiled to test paraver configuration files. Section five presents some of the results.

Section six assesses the project, evaluates the HPM toolkit and Paraver, and

proposes possible future work.

2 Profiling and paraver configuration files

2.1 Software performance optimisation

Software optimisation requires finding a bottleneck: the critical part of the code, the pri-

mary consumer of processing resources, usually processor cycles. As a rule of thumb,

improving 20% of the code is responsible for 80% of the results, as the Pareto principle

apparently applies to software optimisation as well as other non computer science re-

lated fields [4]. According to observations made on several scientific fields (economics,

human population studies), 20% of the possible causes are responsible for 80% of the

consequences; this is known as the Pareto principle or the principle of factor sparsity

or the law of the vital few or simply the 80-20 rule. The Pareto principle can be applied

to resource optimisation because it is also often observed that 80% of the resources are

typically used by 20% of the operations. There are other variations of this rule as well,

one being that 90% of the resources are consumed by 10% of the consumers. This

variation of the rule is often more suitable for software optimisation,

approximating that only 10% of a code accounts for 90% of the total run time.
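
As a rough illustration of the 90-10 variation, the sketch below ranks the functions of a flat profile by time and reports how few of them are needed to cover a chosen share of the run time. The function names and timings are invented for the example and are not measurements from this project; `hotspots` is a hypothetical helper, not part of any profiler.

```python
def hotspots(times, share=0.9):
    """Return the top functions whose summed time first reaches `share` of the total."""
    total = sum(times.values())
    ranked = sorted(times.items(), key=lambda kv: kv[1], reverse=True)
    selected, covered = [], 0.0
    for name, seconds in ranked:
        if covered >= share * total:
            break
        selected.append(name)
        covered += seconds
    return selected, covered / total


# Hypothetical flat profile: seconds spent in each function (invented values).
profile = {
    "solve": 82.0, "exchange_halo": 9.0, "read_input": 3.0, "setup": 2.0,
    "write_output": 2.0, "checksum": 1.0, "log": 0.5, "timer": 0.3,
    "parse_args": 0.1, "finalise": 0.1,
}

selected, fraction = hotspots(profile, share=0.9)
print(selected, round(fraction, 3))
```

With these invented numbers, two of the ten functions already account for more than 90% of the time, which is where optimisation effort would be spent first.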

Performance improvements are often implemented by adding code that helps improve

performance. Such optimisations often complicate source code and make it harder to

maintain and debug. Maintainability and readability are more important than efficiency

in the early development stages. As Donald Knuth said:

"We should forget about small efficiencies, say about 97% of the time:

premature optimisation is the root of all evil." [5]

"Premature optimisation" refers to a situation where the design of a piece of software is

influenced by performance oriented decisions. This might result in a design that is not

as clean as it could have been or code that is incorrect, because the optimisation

complicates the code and distracts the programmer.

A simple and elegant design is often easier to optimise afterwards, and profiling

may reveal unexpected performance problems that would not have been addressed ap-

plying premature optimisation. In practice, it is often necessary to keep performance

goals in mind when first designing software, but the developer has to balance the design

and optimisation goals. Therefore the recommended approach is to design first, code

following the design and then profile and benchmark the resulting code to see what

should be optimised.

2.2 Tools for profiling serial applications

For measuring a programme’s performance one can simply add timing instructions to

do so. This approach is not very adequate as it requires source code modification and

recompilation which might introduce bugs or may not be possible at all if source code

is not available. If this method is used, it only measures execution time of the sections

of the programme timed. Usually this provides little information, especially if the time

required to modify the source code and run it is also taken into consideration.

Profilers are tools relevant to this performance analysis effort. These tools provide more

detailed information than the process described, and require less effort. They measure

the behaviour of a programme. Using a profiler usually requires recompiling the pro-

gramme to be profiled with debugging symbols enabled. Profilers use a wide variety of

techniques to collect data such as hardware interrupts, code instrumentation, operating

system hooks and performance counters. Profiling tools are also useful for visualising

memory usage and identifying memory leaks; but, in this dissertation, only performance

related features are under consideration as it is assumed that the programme analysed

works correctly and a profiler is used to decrease its total execution time. A profiling

tool records a stream of events or a statistical summary of the events observed,

which are then viewed or further analysed. Sometimes the former is called a trace

and the latter a profile, but the terms are often used interchangeably. In this dissertation, tracing

or profiling refers to the procedure that either collects a stream of recorded events or a

statistical summary while a profile or trace refers to any output (graphically visualised

or not) of them or any analysis performed on them. Some profiling tools do not record

all of the events that actually happen but they statistically record the current state of

the programme at regular intervals. Theoretically there is the possibility that events

completing in less than the sampling time interval will rarely or never be sampled and

therefore might not appear at all in the profile summary. In practice, profiles that miss

entire sections of the code are unlikely. The point is that inaccuracies

may appear when using sampling based tools; this is something to be taken into consideration.
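
The effect can be reproduced with a small simulation. The sketch below is only a model of a sampling profiler, not the mechanism of any particular tool: it walks an invented timeline of function executions (durations in microseconds), looks at which function is running at every sampling instant and attributes one whole interval to it. The names `compute` and `tiny_helper` and all durations are made up for the example.

```python
def sample_profile(timeline, interval):
    """Estimate time per function by checking what is running every `interval` units."""
    # Turn the ordered (name, duration) list into absolute (start, stop, name) segments.
    segments, clock = [], 0
    for name, duration in timeline:
        segments.append((clock, clock + duration, name))
        clock += duration
    end = clock

    estimate = {}
    instant = 0
    while instant < end:
        for start, stop, name in segments:
            if start <= instant < stop:
                # Each sample charges one full interval to the running function.
                estimate[name] = estimate.get(name, 0) + interval
                break
        instant += interval
    return estimate


# Invented timeline in microseconds: a short helper runs between long compute phases.
timeline = [
    ("compute", 9500), ("tiny_helper", 300),
    ("compute", 9700), ("tiny_helper", 300),
    ("compute", 200),
]

estimate = sample_profile(timeline, interval=1000)
print(estimate)
```

In this constructed case the helper really runs for 600 microseconds in total, but it never happens to be running at a sampling instant, so it is absent from the estimate and its time is charged to `compute`.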

Based on the type of the output, there are two kinds of profilers: flat and call graph

profilers. Both of them count the frequency and duration of function calls as the programme

being profiled runs. Flat profilers compute how much time a programme spent

in each function and how many times this function was called, but they do not break

down call times based on the callee or the context. Call graph profilers show the calling

frequency of the functions as well as the call-chains involved, based on the callee; i.e. they

show, for each function, which functions called it, which functions it called, and how many times.

There is also an estimate of how much time was spent in each function’s subroutines.

This can suggest places where you might try to eliminate function calls that are time

consuming.
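
The difference between the two kinds of output can be seen with Python's standard `cProfile` and `pstats` modules, used here purely as a stand-in for the tools discussed in this section (they are deterministic rather than statistical profilers). The dummy functions `main`, `foo1` and `foo2` are invented for the example.

```python
import cProfile
import pstats


def foo2():
    return 1


def foo1():
    total = 0
    for _ in range(100):
        total += foo2()
    return total


def main():
    for _ in range(10):
        foo1()


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

stats = pstats.Stats(profiler)

# Flat view: one entry per function, here reduced to its call count.
# Keys of stats.stats are (filename, line, function name); index 1 of the value
# is the total number of calls.
calls = {func[2]: entry[1] for func, entry in stats.stats.items()}

# Call graph view: for foo2, which functions called it and how many times.
callers_of_foo2 = {}
for func, entry in stats.stats.items():
    if func[2] == "foo2":
        callers_of_foo2 = {caller[2]: counts[0] for caller, counts in entry[4].items()}

print(calls["foo1"], calls["foo2"], callers_of_foo2)
```

The flat view only says that `foo2` was called 1000 times; the call graph view additionally attributes all of those calls to `foo1`.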

Another way to categorise profilers is by the method used to gather data. This way,

there are event based profilers, statistical profilers and instrumentation based profilers.

Instrumentation based profilers require source code modification and recompilation,

so that the instrumented programme includes calls triggering several events that the

profiler records at run time. Usually these calls are used by the programmer to instruct

the profiler which events are desired and should be recorded. The process of adding

the code needed to instruct the profiler is known as instrumentation. The fact that the

programmer can ask the profiler to record exactly what the programmer wants is an advantage

of instrumentation over other types of profiling. Instrumenting a code is flexible in both

choosing which parts of the code should be profiled and which events should be

trapped for these parts. An additional advantage of instrumenting codes is that, since

the programmer can instruct the profiler on which parts of the code to profile, there is

less data recorded, making the trace smaller and easier to manipulate. There are

two types of instrumentation. The first instructs the profiler to switch profiling on or

off, while the second simply labels different sections of the code. The drawback is that having to

modify a code in order to profile it is not very convenient and requires recompilation.
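
The two types of instrumentation call can be sketched as follows. The `Tracer` class is a toy written for this illustration and does not correspond to the API of libHPM, Paraver or any other real library: `start` and `stop` play the role of the on/off calls, and `section` plays the role of a label attached to a region of code.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Toy event recorder standing in for a real profiling library (hypothetical)."""

    def __init__(self):
        self.enabled = False
        self.events = []  # list of (label, elapsed_seconds)

    def start(self):
        """First kind of call: switch recording on."""
        self.enabled = True

    def stop(self):
        """First kind of call: switch recording off."""
        self.enabled = False

    @contextmanager
    def section(self, label):
        """Second kind of call: label a region; it is stored only while recording is on."""
        begin = time.perf_counter()
        try:
            yield
        finally:
            if self.enabled:
                self.events.append((label, time.perf_counter() - begin))


tracer = Tracer()

with tracer.section("setup"):      # not recorded: tracing is still off
    data = list(range(1000))

tracer.start()
with tracer.section("kernel"):     # recorded
    total = sum(x * x for x in data)
tracer.stop()

with tracer.section("teardown"):   # not recorded: tracing was switched off again
    data.clear()

print([label for label, _ in tracer.events])
```

Only the labelled region executed between `start` and `stop` ends up in the event list, which is how instrumentation keeps a trace small.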

In general three steps are required to profile a serial application:

1. If necessary, compile the application source code using some compiler flags or

adding instrumentation calls.

2. Run the executable to collect run time information and produce a data file.

3. Process the data file with the profiler and analyse its output.

Some of the most common utilities are briefly outlined in the following section. Most

of them are also available on HPCx.

Prof and gprof are widely known and used tools, available on almost every Unix system and often called traditional profiling tools. They work by statistically monitoring the programme counter (PC) register, from which the CPU time is estimated. Prof generates a statistical profile of the CPU time used by a programme, as well as an exact count of the number of times each function is called. Gprof generates the same information, along with an exact count of the number of times each caller-callee pair is traversed in the programme’s call graph. Prof only produces a flat profile whereas gprof outputs call graph profiles. An example of such a call graph profile for a simple programme with a few dummy functions calling each other is:

    granularity: Each sample hit covers 4 bytes. Time: 0.58 seconds

                                      called/total          parents
    index  %time    self descendents  called+self      name               index
                                      called/total          children

                                                       6.6s <spontaneous>
    [1]     52.9    0.31        0.00                   .__mcount [1]
    -----------------------------------------------
                    0.19        0.05          1000/1000     .main [3]
    [2]     41.4    0.19        0.05          1000     .foo1 [2]
                    0.05        0.00  10000000/10000000     .foo2 [5]
    -----------------------------------------------
                    0.00        0.24             1/1        .__start [4]
    [3]     41.4    0.00        0.24             1     .main [3]
                    0.19        0.05          1000/1000     .foo1 [2]
    -----------------------------------------------
                                                       6.6s <spontaneous>
    [4]     41.4    0.00        0.24                   .__start [4]
                    0.00        0.24             1/1        .main [3]
                    0.00        0.00             1/1        .__C_runtime_startup [212]
                    0.00        0.00             1/1        .exit [362]
    -----------------------------------------------
                    0.05        0.00  10000000/10000000     .foo1 [2]
    [5]      8.6    0.05        0.00  10000000         .foo2 [5]

Xprofiler is a GUI-based AIX performance profiling tool distributed as part of the IBM

Parallel Environment for AIX [6]. It can be used to graphically identify which functions

are the most CPU intensive in a code. It provides a graphical function call tree as well

as a text profile pertaining to the code. Xprofiler can be used to profile sequential and

parallel C, C++, Fortran 90, Fortran 77 and HPF programs. To use Xprofiler, you first

compile and link the program, using the -g option to create an object file with symbol

table references and the -pg option to enable profiling; then run the program to create

run time data file(s) (one for each processor involved in the execution), and finally,

invoke the Xprofiler utility to analyse and display the profiling information gathered.

Any compiler optimisation options can be enabled. Xprofiler does not provide data

while the profiled programme is sleeping and therefore it cannot be used to provide

information such as I/O or communication data.

Figure 1: Xprofiler snapshot processing the profiling data of a SPEC benchmark. Picture

taken from IBM’s web site.

2.3 Hardware counters

Nowadays a rich source of statistical information on program execution characteris-

tics is provided by processors through hardware counters. These are a set of special

purpose registers built in to modern microprocessors to store the counts of hardware

related activities within computer systems. Hardware counters monitor, in hardware,

events caused by running software that relate to a CPU’s arithmetic and logic units (ALUs), to all levels

of the memory hierarchy or to bus activity, and can be used to optimise software. When

trying to improve a programme’s performance, for example, a lot of cache misses might

suggest that restructuring it to improve data locality and increase cache reuse can improve

its performance. Compared to software profilers like the ones based on sampling,

as described before, hardware counters provide low overhead access to a wealth of detailed

performance information. What is more, monitoring them does not necessarily

require source code modification. Examples of information that can be obtained in-

clude branch prediction (or mis-prediction) accuracy, instructions completed per clock

tick, all-level cache misses and cache stall cycles, translation lookaside buffer (TLB)

misses, elapsed CPU cycles, executed instructions and floating-point operations. There

is one more kind of metric, known as derived metrics. These are calculated

by combining the values provided by hardware counters, or other derived metrics, using basic

algebraic operations. Derived metrics provide quantities that are easier to understand

than the raw counter values. For example, a high number of cache load misses

per total cache loads clearly indicates a serious performance problem, while a high number of

cache load misses on its own is not necessarily problematic.
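
A minimal sketch of this example follows. The counter names `L1_LOAD_MISSES` and `L1_LOADS` are placeholders chosen for readability, not actual Power5 event names, and the values are invented: both runs report the same raw number of misses, but only the derived ratio shows which of them has a problem.

```python
def ratio(numerator, denominator):
    """Derived metric: quotient of two raw counter values (0.0 if nothing was counted)."""
    return numerator / denominator if denominator else 0.0


# Invented raw counter values for two hypothetical runs.
run_a = {"L1_LOAD_MISSES": 2_000_000, "L1_LOADS": 4_000_000}
run_b = {"L1_LOAD_MISSES": 2_000_000, "L1_LOADS": 400_000_000}

miss_ratio_a = ratio(run_a["L1_LOAD_MISSES"], run_a["L1_LOADS"])
miss_ratio_b = ratio(run_b["L1_LOAD_MISSES"], run_b["L1_LOADS"])

print(miss_ratio_a, miss_ratio_b)
```

In run A half of all loads miss the cache, whereas in run B the same number of misses is spread over a hundred times more loads.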

The number of registers available for hardware counters differs amongst microprocessor

families and architectures. Every microprocessor employs registers to support a number of

hardware counters simultaneously and this number is defined by the family architecture.

The mapping in which registers can be used to count each hardware counter is a one

to many relationship, since the number of counter registers is limited, whereas there is

a number of hardware metrics that can map to each register. There are usually two to

eighteen registers available on a processor. Some of the hardware counters supported

by a processor can utilise any register, while others are only available on particular registers.

As a result there are a lot of rules restricting concurrent use of different events.

Consequently, not all combinations of hardware counters can be chosen in a single experiment:

only a few registers are available, so only a

limited number of sets of hardware events can be monitored at any given time. Each valid

combination of hardware counters that can be assigned to registers at the same time is

called a hardware counter group. Hardware counter grouping is also determined by

the fact that some of the hardware counters provide similar or related information. The

events that are monitored in each hardware counters group are chosen according to these

two facts. For example, branch mis-predictions and instruction cache misses are often

related, due to the fact that a branch mis-prediction causes the wrong instructions to be

loaded into the instruction cache, instructions that must then be replaced by the correct

ones. The replacement can cause an instruction cache miss or an instruction TLB (ITLB)

miss. The limited number of registers to store hardware counter values often forces

users investigating a code’s performance to conduct multiple measurements, each

time using a different group, to eventually collect all the performance metrics desired.
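
Choosing which groups to run can be thought of as a covering problem. The sketch below uses a simple greedy strategy, which is one possible approach and not a method described by the hardware vendor or used in this project. The group identifiers and event names are invented placeholders and do not correspond to real Power5 groups.

```python
def plan_runs(groups, wanted):
    """Greedily pick counter groups until every wanted event is covered."""
    remaining = set(wanted)
    plan = []
    while remaining:
        # Pick the group that counts the most still-missing events
        # (ties resolved by group name to keep the result deterministic).
        best = max(sorted(groups), key=lambda g: len(groups[g] & remaining))
        gained = groups[best] & remaining
        if not gained:
            raise ValueError(f"no group counts: {sorted(remaining)}")
        plan.append(best)
        remaining -= gained
    return plan


# Hypothetical groups: each one is a set of events that can be counted together.
groups = {
    "g1": {"CYCLES", "INSTRUCTIONS", "FP_OPS"},
    "g2": {"CYCLES", "INSTRUCTIONS", "L1_LOADS", "L1_LOAD_MISSES"},
    "g3": {"CYCLES", "INSTRUCTIONS", "BRANCH_MISPREDICTS", "ICACHE_MISSES"},
}

plan = plan_runs(groups, ["FP_OPS", "L1_LOADS", "L1_LOAD_MISSES", "CYCLES"])
print(plan)
```

Here two separate runs are needed, one per selected group, because no single invented group contains both the floating-point and the cache events.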

The types and meanings of hardware counters vary from one architecture to another

due to variations in hardware organisation. Furthermore, it is difficult to correlate the

low level performance metrics back to source code. Profiling tools that make use of

hardware counter profiling data are capable of mapping hardware counter values to

source code lines. This is usually done through instrumenting the corresponding source

code lines. They are also able to compute derived metrics. Derived metrics can convert

counters that count in cycles to time in seconds. One more possibility is to combine

hardware metrics with derived metrics to measure quantities such as total L1 cache

load/stores per second. There is a variety of hardware counter based derived metrics,

especially since each type of processor has its own set and also because a large

number of them can be combined.
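
The two conversions mentioned above can be sketched in a few lines. The clock frequency of 1.5 GHz is an assumption made for the example, and the counter names and values are invented placeholders rather than real event names or measurements.

```python
CLOCK_HZ = 1.5e9  # assumed clock frequency; replace with the real value of the machine


def cycles_to_seconds(cycles, clock_hz=CLOCK_HZ):
    """Derived metric: convert a cycle count into elapsed time."""
    return cycles / clock_hz


def per_second(count, cycles, clock_hz=CLOCK_HZ):
    """Derived metric built on another derived metric: events per second."""
    return count / cycles_to_seconds(cycles, clock_hz)


# Invented raw counter values for one profiled region.
counters = {"CYCLES": 3_000_000_000, "L1_LOADS": 600_000_000, "L1_STORES": 300_000_000}

elapsed = cycles_to_seconds(counters["CYCLES"])
l1_rate = per_second(counters["L1_LOADS"] + counters["L1_STORES"], counters["CYCLES"])

print(elapsed, l1_rate)
```

With these numbers the region takes 2 seconds and performs 4.5e8 L1 loads and stores per second.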

Table 1: Available registers for hardware counters for some architectures. Information taken from [7].

    Processor        Available hardware counters
    UltraSparc II    2
    Pentium III      2
    AMD Athlon       4
    IA-64            4
    POWER4           8
    Pentium 4        8

2.4 Parallel profilers and libraries for parallel profiling

There are numerous tools widely used for parallel profiling, some of which provide

graphical representations of their analysis, which is more convenient for the user.

HPM: the Hardware Performance Monitor (HPM) toolkit was created by IBM’s ACTC [8]

and consists of two tools. One is hpmcount and the other is libHPM. They are available

for Power architectures running AIX or Linux.

Paraver/ompitools is one of the few parallel profiling visualisation tools available today

[9]. It is designed to combine several features. It is a performance

visualisation and analysis tool that can be used to analyse MPI, OpenMP, mixed mode

(both MPI and OpenMP) and Java codes; it is able to monitor hardware counters on

several platforms and is flexible enough to handle large trace files efficiently. It is based

on a Motif GUI, which is intended to be easy to use and to run on any platform. It

is developed by the European Centre for Parallelism of Barcelona at the Technical University

of Catalonia (CEPBA/UPC). Paraver is currently available for SGI IRIX, Tru64

UNIX, IBM AIX, HP-UX, Solaris and Linux.

2.5 Description of this project

As previously discussed, hardware counters can count a number of events. However

raw values reported by the hardware counters are not very meaningful to someone trying

to optimise a code. The power of these values is shown when they are used to

calculate more complex quantities. This way, virtually anything performance related can be

measured. For example, the number of clock ticks a code needed to run and

the number of instructions that were executed by the CPU while the code was running

are of little use on their own; combined, with clock ticks converted to time, they give the

number of instructions completed per second, a metric that clearly characterises the code’s performance.

Paraver can monitor hardware counters and thus provide an easy way to access them.

This can be done if paraver is properly configured to use raw hardware counter values

and compute derived metrics which will be presented in a timeline view or as a 2D

histogram. Configuring paraver takes place through its GUI but is not trivial for reasons

that will be explained later. The steps to compute derived metrics should be taken

carefully and the computed values should be checked to verify their correctness. Creating

the configuration files is mechanically simple, as it is done through paraver’s GUI;

verifying that the values paraver reports with a configuration file are correct is the tricky part,

and it is what makes the overall process of creating configuration files difficult.

Fortunately, once paraver has been correctly instructed on how to compute a derived metric,

the configuration can be saved in a configuration file which can then be used

to obtain the same derived metric for another trace. Using these configuration files

makes it easy to obtain derived metrics based on hardware counter values

when profiling codes other than the one used to create the configuration file.

The first goal of this project is to create configuration files that will allow paraver to

easily use hardware counter information on the current HPCx system configuration. The

validity of these configuration files needs to be verified by comparing the values they produce

with the values obtained when profiling a set of programmes with libHPM. The HPM

library is part of the HPM toolkit which is provided by IBM for the HPCx system and

its functionality is verified. This is why it was used as a reference against which to compare the

results reported by paraver using the newly created configuration files. Configuration

files for the Power4 processors were already available on HPCx and their functionality

is verified. The existing Power4 configuration files could be used for reference or as

examples. Some configuration files for the Power5 chip were already available but they

had not yet been tested; testing them was part of this project.

Testing the configuration files created is not very straightforward, because one needs
to decide how much the values Paraver reports may differ from the ones libHPM reports:
a code will not always behave identically, even when the same code is run with the same
dataset on the same processor. When measuring quantities that can be estimated from the
code, if the values from paraver match those that libHPM reports, then the configuration
file is well created. For example, a configuration file that measures multiply-add
instructions per second can be tested by profiling a simple code specifically written to
execute a predefined number of multiply-add operations. On the other hand, there are
quantities that cannot be counted unless a code is executed. For example, when measuring
branch mis-predictions we cannot always expect to get the same values between subsequent
runs. Another example is verifying derived metrics that measure cache misses per memory
access or translation lookaside buffer (TLB) misses. That is why a statistical approach
is needed to verify the results obtained for these quantities.
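The two kinds of check described above can be sketched as follows. This is only an illustration in Python: the function names, the 1% tolerance, the two-standard-error criterion and all sample numbers are assumptions made for this example, not values or procedures taken from this project. Deterministic quantities are compared with a relative tolerance, while quantities that vary from run to run are compared through the means of repeated runs.

```python
import math
import statistics

def within_tolerance(paraver_value, libhpm_value, rel_tol=0.01):
    """Deterministic counts: accept if the relative difference is small."""
    return math.isclose(paraver_value, libhpm_value, rel_tol=rel_tol)

def means_agree(paraver_runs, libhpm_runs, n_sigma=2.0):
    """Varying counts: accept if the two sample means differ by less than
    n_sigma combined standard errors (a simple heuristic, not a formal test)."""
    mean_p = statistics.mean(paraver_runs)
    mean_h = statistics.mean(libhpm_runs)
    se_p = statistics.stdev(paraver_runs) / math.sqrt(len(paraver_runs))
    se_h = statistics.stdev(libhpm_runs) / math.sqrt(len(libhpm_runs))
    combined = math.sqrt(se_p ** 2 + se_h ** 2)
    return abs(mean_p - mean_h) <= n_sigma * combined

# Hypothetical numbers, for illustration only.
fma_expected = 1_000_000        # multiply-adds the test code is written to execute
fma_paraver = 1_000_050         # value a derived metric view might report
print(within_tolerance(fma_paraver, fma_expected))

branch_paraver = [5210, 5190, 5305, 5250, 5180]   # mis-predictions over five runs
branch_libhpm = [5230, 5275, 5160, 5240, 5215]
print(means_agree(branch_paraver, branch_libhpm))
```

How strict the tolerance and the number of repeated runs should be is exactly the decision referred to above; the defaults here are placeholders.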

As a secondary aim, this project intended to experiment with performance-oriented
profiling. The plan was to create configuration files for paraver to compute derived
metrics on another architecture. Some widely used codes would be profiled on both
systems and their performance on each system would be compared. The metrics created
for the other architecture should be the same as, or similar to, the derived metrics
created for the HPCx system, and should be tested as well. Systems that could have been
used for this comparison are BlueSky (IBM / BlueGene / PowerPC 440), HECTOR
(Cray / XT4 / AMD Opteron x86_64) or Ness (Sun / AMD Opteron x86_64). None of
the secondary goals were achieved, as time turned out to be insufficient to allow
any work to be done on another machine.


3 Working environment

3.1 The HPCx super-computing service

HPCx is one of the largest supercomputers in the United Kingdom and is one of the
UK’s national super-computing facilities. As of June 2008 it is ranked 261st [10] in the
list of the top five hundred supercomputers in the world, although it was once the second
fastest computer in Europe. In November 2002 it was the ninth fastest supercomputer in
the world. The machine is a cluster of IBM SMP nodes delivering 15.36 TFLOP/s of
peak performance, or 12.94 TFLOP/s of sustained performance (the Rmax value of the
Linpack benchmark). It also has 5.12 TByte of memory available and 72 TByte of disk
space. Finally, there is a library of roughly 3584 tapes that provides a total of
approximately 50 TB of tape capacity. The HPCx system is housed at the STFC’s
Daresbury Laboratory and operated by the HPCx Consortium. The HPCx Consortium,
namely UoE HPCX Ltd, is led by the University of Edinburgh, with the Science and
Technology Facilities Council (STFC) and IBM. EPCC provides the University of
Edinburgh’s contribution.

3.1.1 Nodes: IBM eServer pSeries p5 575

The HPCx system uses IBM eServer pSeries p5 575 1.5 GHz nodes for the compute,
login and disk I/O nodes. A more detailed description of the system than the one that
follows can be found on its web site [11]. The HPCx service is a typical shared memory
cluster and provides 160 nodes containing a total of 2560 IBM Power5 processors. Each
eServer node contains 16 1.5 GHz Power5 processors, in the form of eight Dual-Core
Modules (DCMs), each having two cores. In the Power5 architecture, a chip contains
two processors and four chips (8 processors) are integrated into a multi-chip module
(MCM). Each MCM is configured with 128 MB of L3 cache and 16 GB of main memory.
Two MCMs (16 processors) comprise one frame. The total main memory of 32 GB per
frame is shared between the 16 processors of the frame. Each frame is a 16-way logical
partition (LPAR). The names LPAR and system frame are synonyms for compute node
on HPCx.


3.1.2 Power5 chips

The eServer compute nodes utilise IBM Power5 processors. The Power5 is a 64-bit
RISC processor implementing the PowerPC instruction set architecture. It has a 1.5
GHz clock rate and an 8-way super-scalar architecture with a 20 cycle pipeline.
There are two floating point multiply-add units, each of which can deliver one result
per clock cycle, giving a theoretical peak performance of 6.0 GFLOP/s. Each processor
has its own Level 1 (L1) cache, which is divided into a 64KB instruction cache and a
32KB data cache with 128-byte lines. The two processors on one chip share a 1.9MB
Level 2 (L2) cache, which holds both instructions and data, and they also share a 36MB
Level 3 (L3) cache. Each node on HPCx has a main memory of 32 GB. Inter-node
communication on HPCx is provided by the High Performance Switch (HPS) from IBM
and intra-node communication is via shared memory. The following lists how many
cycles are needed to fetch data from each level of the memory hierarchy:

L1 cache 3 cycles to retrieve data.



Main Memory 350 cycles to retrieve data.

HPCx is a cluster of shared memory servers, each with a sophisticated multilevel cache
memory system. The use of superscalar processors, which have multiple functional units,
potentially magnifies any memory access inefficiency a code might suffer. The Power5 has
two floating point units and the theoretical peak performance can only be achieved if
independent instructions can be issued to both of these units during every cycle. In
practice, however, in addition to the idle processor cycles mentioned above (“vertical
waste”), there are cycles in which the processor is not idle but does not utilise all of
its functional units (“horizontal waste”).
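The peak figures quoted in this chapter can be cross-checked with a few lines of arithmetic. The sketch below is in Python and assumes, as is usual for this kind of estimate, that each multiply-add result counts as two floating point operations; the clock rate, unit count, processors per node and node count are the ones given in the text.

```python
CLOCK_HZ = 1.5e9            # clock rate of each Power5 processor
FMA_UNITS = 2               # floating point multiply-add units per processor
FLOPS_PER_FMA = 2           # assumption: one multiply plus one add per result
PROCESSORS_PER_NODE = 16
NODES = 160

peak_per_processor = CLOCK_HZ * FMA_UNITS * FLOPS_PER_FMA
peak_per_node = peak_per_processor * PROCESSORS_PER_NODE
peak_system = peak_per_node * NODES

print(f"per processor: {peak_per_processor / 1e9:.1f} GFLOP/s")
print(f"per node:      {peak_per_node / 1e9:.1f} GFLOP/s")
print(f"whole system:  {peak_system / 1e12:.2f} TFLOP/s")
```

Under these assumptions the per-processor figure is 6.0 GFLOP/s and the system figure is 15.36 TFLOP/s, which agrees with the values quoted for the Power5 and for HPCx as a whole.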

Each chip contains two processors, together with the Level 1 (L1) and Level 2 (L2)
cache. On this system, each processor has its own L1 instruction cache of 64KB and
L1 data cache of 32KB, integrated onto the chip. The size of the on-board L2 cache
(instructions and data) is 1.5MB, which is shared between the two processors. Four
chips (8 processors) are integrated into a multi-chip module (MCM) and four MCMs
(32 processors) comprise one frame. Each MCM is configured with 128MB of L3 cache
and 8GB of main memory. An MCM that holds four processors and four L3 caches is
pictured in figure 2. The total L3 cache of 512MB per frame and the total main memory
of 32GB per frame are shared between the 32 processors of the frame.

Figure 2: This is a picture of an MCM. Picture taken from wikipedia.

3.1.3 HPCx Interconnect

Inter-node communication (between frames) is provided by IBM’s High Performance
Switch (HPS). Each eServer frame has two network adapters and there are two links
per adapter, making a total of four links between each of the frames and the switch
network. HPS is a very sophisticated interconnect, with high bandwidth and low latency,
and is of vital importance to the maximum sustainable performance that HPCx is capable
of. Table 2 includes some timings which were found on IBM’s website [12] and which
indicate the capabilities of HPS.

Table 2: HPCx’s interconnect main characteristics.

Quantity     1.9 GHz POWER5+ p5-575
Latency      3.6 µs
Bandwidth    1.88 − 5.79 GB/sec


3.2 Operating system

The operating system running on each LPAR is IBM’s version of Unix, AIX version
5.3. AIX (Advanced Interactive eXecutive) is the name given to a series of proprietary
operating systems sold by IBM for several of its computer system platforms. AIX
5L is an open standards-based OS that conforms to The Open Group’s Single UNIX
Specification Version 3 [13] and it is based on UNIX System V with 4.3BSD-compatible
command and programming interface extensions. The AIX 5L 5.3 release runs on up
to 64 IBM Power or PowerPC architecture central processing units and 2TB of RAM.

3.3 Compiler, libraries and other software

For this project the following tools were also used:

• Hardware Performance Monitor toolkit (hpmcount and the HPM library).

• Paraver / ompitools.

• Small test code specifically written for the purposes of this project.

• The parallel version of the Stream benchmark, stream_mpi.

• The LAMMPS Molecular Dynamics Simulator.

The HPCx user support web site provides information on compiling and submitting jobs

to the machine.

The system has the IBM XL for AIX compiler suite available. The re-entrant versions of
the C and Fortran compilers were used, as they produce thread-safe binaries. In particular,
xlf90_r and xlc_r were used for programs that do not make use of the MPI library.
To compile using the MPI library, the mpxlf90_r and mpcc_r shell scripts were used.
These shell scripts compile Fortran or C/C++ programs while linking in the Partition
Manager, the Message Passing Interface (MPI) and (optionally) the Low-level Applications
Programming Interface (LAPI).

All codes were compiled to address 32-bit address spaces by using the -q32 compile
option, because paraver on HPCx currently works only with executables addressing
32-bit address spaces, due to a bug in IBM’s Dynamic Probe Class Library (DPCL)
that IBM has not yet fixed. All codes were also compiled using the -O2 optimisation level
so that the executables were not heavily optimised, something that might confuse a
profiler. On the other hand, some level of optimisation is needed, for example to make
sure the binary does not contain instructions or variables that are not needed (dead code
elimination, copy and constant propagation, etc.).

HPCx provides two environments in which submitted jobs can execute. One is the batch
processing system and the other is the interactive execution environment. Both
environments are accessed through LoadLeveler [14], the workload manager and
scheduler on HPCx. The first is the environment that is common in distributed systems,
where submitted jobs are scheduled to run and any output is returned to the user when
execution has completed. The interactive environment allows programs to be executed
interactively, mainly for debugging purposes. Interactive jobs are not queued. A program
executing interactively has exclusive access only to its CPUs and not to a whole LPAR.
Finally, there are only two LPARs available for interactive use, which means that at most
32 CPUs can be used. If one requests CPUs for interactive use while not enough are
available, execution is cancelled. To run either interactively or through the batch
scheduler we must use IBM’s parallel environment (poe) and we need a batch
script. The difference when running interactively, rather than through batch processing,
is that run time environment variables should be defined in one’s shell and not in the
batch file, because batch file variables are ignored when running interactively.

All jobs submitted for this project requested 16 processors. Time on HPCx is charged
in multiples of nodes (sets of 16 processors), not in multiples of the number of CPUs
that were actually used. HPCx’s batch scheduling system allocates full nodes; if fewer
processors are requested than are available on the allocated nodes, more L2 and L3
cache will be available to each processor than in production runs on HPCx, which make
use of all available CPUs. This would not affect the results of this project, but sticking
to realistic environments and conditions as much as possible was preferred.
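As a small worked example of charging by whole nodes, the following Python sketch rounds a processor request up to a number of 16-processor nodes. The function name is hypothetical, and the rounding rule is an assumption inferred from the statement that time is charged in multiples of nodes; it is not taken from the HPCx accounting documentation.

```python
import math

PROCESSORS_PER_NODE = 16

def nodes_charged(requested_cpus):
    """Whole nodes allocated (and therefore charged) for a CPU request."""
    return math.ceil(requested_cpus / PROCESSORS_PER_NODE)

for cpus in (1, 16, 17, 32):
    n = nodes_charged(cpus)
    print(f"{cpus:3d} CPUs requested -> {n} node(s), "
          f"{n * PROCESSORS_PER_NODE} CPUs charged")
```

Requesting exactly 16 processors, as was done throughout this project, therefore uses everything that is charged for.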

3.4 HPM toolkit

HPM was briefly described in section one and some information specific to HPCx and
this project is given here. The two utilities provided by the HPM toolkit are
hpmcount and the HPM library, also known as libHPM.


3.4.1 hpmcount

Hpmcount is used in the same way as the time command: one types hpmcount followed
by the name of the executable to be profiled, plus any additional options required.
This runs the executable and a trace is created upon completion. A job can be
submitted to profile a code in the usual way, by typing llsubmit parallel.ll.
The following variables need to be included in the batch file that is
used (in this example parallel.ll):

export HPM_DIR="/hpcx/usr/local/packages/actc/hpct/lib/"

export HPM_INC=


export HPM_LIB=


export HPM_EVENT_SET=1

poe.real ./programme_name

The HPM_EVENT_SET variable is used to specify the hardware counter group that
is desired. To make the groups easier to handle, a list of valid groups is usually
provided by the underlying software that supports hardware counters. On the HPCx
system, this is the pmlist utility. The following settings, which the HPCx user’s manual
suggests including in any batch file, were also used:

export MP_EAGER_LIMIT=65536

export MP_SHARED_MEMORY=yes

export MEMORY_AFFINITY=MCM

export MP_TASK_AFFINITY=MCM

This is all that is needed to run the executable, collect the necessary data and create a
report of the hardware counter values recorded during execution, along with some
predefined derived metrics that hpmcount calculates. Hpmcount profiles the whole program
under consideration, just like the time utility, rather than parts of it. It outputs one
file per process, which contains the hardware counter values recorded, the predefined
derived metrics and some other information.

A small note about hardware counters should be made here. Hardware counters are
private to each thread. This means that it is the library’s responsibility to ensure that
if a thread’s execution is suspended, the hardware counters are copied to a temporary
location and restored when the thread is switched back in. This ensures that the
measurements are not affected by noise from the operating system or from other threads.
Most importantly, although HPCx’s nodes are exclusively allocated to jobs, one might run
more threads than there are available processors in order to utilise simultaneous
multithreading (SMT), so it is important that this subtle issue is taken care of.
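The save-and-restore behaviour described above can be illustrated with a toy model. The Python class below is purely hypothetical: it is not how AIX or the HPM library implements counter virtualisation, and its names are invented for this sketch. It only shows why parking a suspended thread's count and restoring it later keeps each thread's measurement free of the other threads' events.

```python
class VirtualisedCounter:
    """Toy model: one physical counter shared by threads that take turns."""

    def __init__(self):
        self.physical = 0      # value currently held in the hardware register
        self.saved = {}        # per-thread values parked while suspended
        self.running = None    # id of the thread currently on the processor

    def switch_to(self, thread_id):
        # Park the outgoing thread's count, then restore the incoming one.
        if self.running is not None:
            self.saved[self.running] = self.physical
        self.physical = self.saved.get(thread_id, 0)
        self.running = thread_id

    def count(self, events):
        # Events are attributed to whichever thread is running.
        self.physical += events

    def read(self, thread_id):
        if thread_id == self.running:
            return self.physical
        return self.saved.get(thread_id, 0)


counter = VirtualisedCounter()
counter.switch_to("A")
counter.count(100)
counter.switch_to("B")
counter.count(7)
counter.switch_to("A")
counter.count(50)
print(counter.read("A"), counter.read("B"))
```

Thread A ends with only its own 150 events and thread B with its own 7, even though both used the same physical register.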

3.4.2 libHPM

Using hpmcount is easy and can give some early results quickly without much effort.
However, more complex analysis is usually required and cannot be done with
the hpmcount utility. As hpmcount profiles the whole code from start to end, the
measurement includes parts of the code that are of no interest for optimisation,
such as time spent waiting for I/O or in initialisation. Only the computationally intense
parts of the code need to be profiled in order to identify performance bottlenecks, but
these parts cannot be monitored separately without instrumenting the code. Using
libHPM’s instrumentation calls, a programmer can mark regions of a code that the HPM
run time library will profile. When tracing is complete, the per-process output files
contain one report per instrumented section, rather than one report for the whole code.

The interface provided by libHPM is rather simple. Several versions of the HPM toolkit
are available on HPCx in various paths, but the one that was eventually found to fully
support all the hardware counter groups available on the Power5 processor is in this location:

/usr/local/packages/actc/hpct/lib/ .

Briefly, the interface to use libHPM is this:

hpm.h Is the header file that needs to be included.

hpminit(rank,"name"), hpmterminate(rank) Functions that initialise tracing and terminate tracing.

hpmstart(instID,’sectionID’), hpmstop(instID) Functions used to identify which sections of the source code HPM will monitor.

After the hpm header file is included, tracing must be initialised by calling hpminit
immediately after any variable declarations. It does not matter when tracing terminates,
as long as it terminates before MPI is finalised. To properly terminate tracing,
hpmterminate should be called before the programme exits. The sections of the programme
that are of interest should be marked by calling hpmstart at the beginning of the
section and hpmstop at the end of it.

After instrumenting a code we need to compile it against libHPM. The following flags
need to be used so that the license and the run time objects can be loaded:

-lhpm -llicense -lpmapi

After recompiling the program, the following environment variable needs to be defined

in the batch file (or the current working shell in case of an interactive job) to specify

which hardware counters group is desired:

HPM_EVENT_SET=140.

An example of a full batch script that was used in this project follows:

#@ shell = /bin/bash

#@ job_name = helloHPM

#@ job_type = parallel

#@ CPUs = 16

#@ node_usage = not_shared

#@ bulkxfer = yes

#@ wall_clock_limit = 00:01:00

#@ account_no = z000

#@ output = $(job_name).$(schedd_host).$(jobid).out

#@ error = $(job_name).$(schedd_host).$(jobid).err

#@ notification = never

#@ queue

export MP_EAGER_LIMIT=65536

export MP_SHARED_MEMORY=yes

export MEMORY_AFFINITY=MCM

export MP_TASK_AFFINITY=MCM

export HPM_DIR="/usr/local/packages/actc/hpct/lib/"

export HPM_INC=



export HPM_LIB=


export HPM_EVENT_SET=128

poe.real ./hello

Finally, it should be pointed out that these are the C-compatible declarations for using
HPM. For Fortran codes the names are similar; the difference is that the names mentioned
in the previous list have the prefix “f_”. More detailed documentation on using both
hpmcount and libhpm is available on IBM’s web site [15] and on the HPCx support
web site [16].

3.5 Using paraver to profile codes

Available on the HPCx web site are a FAQ about using Paraver on HPCx [17] and a
beginner’s guide [18]. Two informative technical reports on using Paraver
on HPCx [2], [1] are also available. For more information, refer to the other
documentation that is available on paraver’s web site. Users interested in applying this
information on HPCx should be aware that details such as installation directories, names,
etc. are not always as specified in the above sources. However, by reading these documents
one can either quickly create a trace and start analysing a code, or learn how to make
advanced use of paraver.

Paraver uses IBM’s DPCL, which requires a .rhosts file in the home directory containing
a list of every node on which the code is likely to run. Paraver is available on HPCx
but currently only works with executables that have been compiled and linked using the
-q32 compiler option, which selects 32-bit addressing. This is because of the
incompatibility of IBM’s DPCL with 64-bit addressing.

Even though paraver is an advanced and well documented tool, some issues were
encountered in this project. Paraver often crashes, so saving frequently is required to
avoid losing work. There is also a peculiar problem which has to do with compiz,
a compositing window manager installed on the machine on which the paraver analyses
were done. When compiz is enabled, paraver windows do not work well. In particular,
notification windows that ask for user input are not easily visible, as they are only a
few pixels wide and tall. After resizing them properly, however, they work well. This
compositing window manager is not as solid as an X-window server and should probably
not be used on a production machine; nevertheless, it is quite common on desktop
systems. Furthermore, the most significant issue is that the documentation is not
sufficient for the objectives set in this project. Apparently, the work involved in this
project is advanced and is not expected to be performed by most users, as it is hardly
described in paraver’s manuals. There are several inaccuracies in the information found,
which often stalled the project until the correct information was found. A couple of
examples are mentioned later, when describing how to instrument programmes to be
profiled with paraver.

3.5.1 Basic paraver tracing

In order to use paraver we need to run a code through ompitrace, paraver’s tracing
tool, and then start paraver’s graphical view and analyse the trace. For this process
no source code instrumentation is needed. A profile, which might or might not include
hardware counter values, is created for the whole code. The steps to create a trace are
these:

1. Run the executable through ompitrace to create a temporary trace file (.mpit) for every process.

2. Run ompi2prv to combine the trace files into a single paraver (.prv) file.

3. Start paraver to visualise the generated .prv file.

As seen in this listing, there are three main tools provided by the paraver package.
The first is ompitrace, the programme that monitors the profiled code at
run time and creates a temporary trace file for each process. The second is ompi2prv,
which combines all the temporary files created at run time into a single paraver file. The
third is paraver, which is used to visualise and analyse the final file created by
ompi2prv. Executing the code to be analysed is required in order to collect
data for each process and record them in a per-process trace file. Apart from the design
features that allow paraver to handle very big traces, there is another tool in the paraver
package that makes manipulation of big traces easier. This is cutter, which can generate
horizontal or vertical cuts of an existing trace file. Cutting a trace horizontally selects a
subset of processors, whereas cutting it vertically results in a trace of a time subset.
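The difference between the two kinds of cut can be shown on a toy trace. The Python sketch below does not use cutter's interface or the real .prv file format, neither of which is described here; the record layout and function names are assumptions made purely to illustrate the idea of selecting by process versus selecting by time.

```python
from typing import NamedTuple

class Record(NamedTuple):
    task: int     # process that produced the record
    time: int     # timestamp, arbitrary units
    event: str    # event name

def horizontal_cut(trace, tasks):
    """Keep only the records belonging to the selected processes."""
    wanted = set(tasks)
    return [r for r in trace if r.task in wanted]

def vertical_cut(trace, t_start, t_end):
    """Keep only the records whose time lies in [t_start, t_end)."""
    return [r for r in trace if t_start <= r.time < t_end]

trace = [
    Record(0, 10, "MPI_Send"),
    Record(1, 12, "MPI_Recv"),
    Record(0, 40, "MPI_Barrier"),
    Record(1, 41, "MPI_Barrier"),
    Record(2, 55, "MPI_Send"),
]

print(horizontal_cut(trace, [0]))     # all times, process 0 only
print(vertical_cut(trace, 0, 30))     # all processes, early part of the run only
```

Either cut yields a smaller trace that is cheaper to load and analyse than the original.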


A very powerful feature of the paraver package is that these steps are clearly separated.
They need not be executed on the same computer system; each one can be done on any
of the systems paraver is available for, if this is convenient to the user. This feature
is also sometimes a necessity. For example, manipulating the per-process trace files may
be slow on a desktop’s hard drive compared to a cluster’s storage system. On the other
hand, if a lot of memory is required, a desktop often has more RAM than a cluster’s node.
Finally, analysing a trace file is easier on a local machine than through a connection to
a remote X-window server, which is usually unacceptably slow due to network latencies.

There are two versions of paraver available on HPCx. One is available on

/usr/local/packages/paraver/

and a newer version on

/usr/local/packages/paraver/newversion

To profile a code on HPCx using paraver the following lines should be added to the

batch file:

export OMPITRACE_HOME=/usr/local/packages/paraver/newversion

export MPTRACE_COUNTGROUP=1

The first line defines the directory where paraver is installed, so that the license and
library files can be picked up, while the second line specifies which hardware counter
group to use. The following line:

poe.real ./hello

that is usually used to run a code must be replaced by the following command:

$OMPITRACE_HOME/bin/ompitrace -counters:mpi -v -r -nosw poe.real ./hello

This runs the code through ompitrace, whose arguments are briefly explained in this
listing:

counters:mpi This argument specifies that the hardware counter values should be recorded and included in the trace.

v Invoke verbose output.

r This should be used if the re-entrant (_r) versions of the compilers are used.

nosw This switches off the software clock.

The re-entrant versions of the IBM compilers should always be used when compiling
code on HPCx, so the -r argument should always be passed to ompitrace. The
software clock should be disabled in order to get consistent times between nodes.

The first option passed to ompitrace, counters:mpi, indicates that the hardware counter
values should be recorded every time a call to the MPI library is trapped. Hardware
counters can also be recorded when entering and exiting user calls. This is done using
the :calls option, in which case a text file with the names of the functions to be
monitored also needs to be created.
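To make the role of each argument explicit, the following Python sketch assembles the same command line from named settings. The helper function and its parameter names are hypothetical and exist only for this illustration; the flags themselves are the ones listed above. The :calls variant is not modelled, because the way its function list file is supplied is not covered here.

```python
def ompitrace_command(ompitrace_home, program, verbose=True,
                      reentrant=True, software_clock=False):
    """Assemble the ompitrace invocation as a list of arguments."""
    cmd = [f"{ompitrace_home}/bin/ompitrace", "-counters:mpi"]
    if verbose:
        cmd.append("-v")        # verbose output
    if reentrant:
        cmd.append("-r")        # code built with the _r compilers
    if not software_clock:
        cmd.append("-nosw")     # switch the software clock off
    cmd += ["poe.real", program]
    return cmd

cmd = ompitrace_command("/usr/local/packages/paraver/newversion", "./hello")
print(" ".join(cmd))
```

With the defaults chosen here the printed line is the command shown earlier with $OMPITRACE_HOME expanded.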

Trace file generation is completed by merging all the temporary trace files to a single

one that can be loaded to the paraver GUI. These two commands need to be used on

HPCx to merge the temporary files:

export OMPITRACE_HOME=/usr/local/packages/paraver/newversion/

/usr/local/packages/paraver/newversion/bin/ompi2prv *.mpit -s *.sym -o name.prv

The first line exports a variable so that ompi2prv can find the libraries and the license
included in the paraver package; the second command simply creates the final trace file.

3.5.2 Paraver instrumentation

The paraver package provides an interface with which the programmer can customise a
code’s trace. A code is instrumented with calls that define custom metrics and the values
that they take. This is done by calling the ompitrace_event(event, value) function,
where event specifies which user-defined metric is being referred to and value is the
value that ompitrace should record for this event. These metrics are defined by the user
and can be anything the user would like to include in the trace that will later help
with its analysis. A couple of functions are also available to instruct the tracing
utility to pause or restart tracing. An attempt was made to use these two functions, but
they did not suit the purposes of this project, especially since we were advised by BSC
specialists to avoid using them.

Submitting an instrumented code to be profiled with paraver on HPCx is done in the same
way as for non-instrumented codes. What needs to be done is to recompile the code,
including the ompitracef.h header, which can be found in the installation directory of
paraver. As the version of paraver installed on HPCx did not work, for unknown reasons,
the paraver package was installed in a user’s local directory, and this directory tree
was used to compile and link against. A sample Makefile that illustrates how this was
done can be found in the appendix. Finally, when linking a code instrumented with
Paraver, the ompitrace objects are also needed, so the -lompitrace option needs to be
passed to the linker. The following lines, which were used in Makefiles for this project,
are enough to compile and link an instrumented code.

FF=mpxlf90_r

FFLAGS=-q32 -O2 #-qsuffix=cpp=f90

FCFLAGS=-I/hpcx/home/z004/z004/nspattak/Paraver_new/include/

LFLAGS=-L/hpcx/home/z004/z004/nspattak/Paraver_new/lib -lompitrace

The comment in the FFLAGS line was not needed in the case for which these lines were
used, but it might be needed when compiling Fortran codes. The reason is that
paraver is a C++ code and, although bindings for linking a Fortran code exist, the IBM
compiler needs this option to look for suffixes other than the default ones.

A rare error was also encountered in this project when trying to compile an instrumented
version of the stream_mpi code. The error was:

ERROR: 0031-309 Connect failed during message passing

initialisation, task 1, reason: Unable to allocate storage

This error is not unknown and is included in the HPCx FAQ, where the solution was
found. The -bmaxdata:number compiler option sets the maximum size of the area
shared by the static data (both initialised and uninitialised) and the heap to number
bytes. This value is used by the system loader to set the soft ulimit. The default
setting was -bmaxdata:0, which was apparently not enough for the stream_mpi code to be
compiled against both the MPI and paraver libraries. For the record, valid values of
number are 0 and multiples of 0x10000000.
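The rule for valid values can be expressed as a short check. The Python sketch below encodes only the rule stated above (0 or a multiple of 0x10000000, which is 256 MB); the function names are hypothetical and any upper limit imposed by the system is deliberately not modelled.

```python
SEGMENT = 0x10000000    # 256 MB, the granularity stated in the text

def is_valid_maxdata(number):
    """True if number is 0 or a positive multiple of 0x10000000."""
    return number >= 0 and number % SEGMENT == 0

def round_up_maxdata(required_bytes):
    """Smallest valid non-zero setting that covers required_bytes."""
    segments = max(1, -(-required_bytes // SEGMENT))    # ceiling division
    return segments * SEGMENT

print(is_valid_maxdata(0x80000000))
print(hex(round_up_maxdata(300 * 1024 * 1024)))
```

For example, a code needing about 300 MB of static data and heap would need the value rounded up to two such units, that is 0x20000000.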

3.5.3 Paraver’s UI and basic views creation

Paraver’s user interface follows a simple logic. There are windows that visualise several
properties of the profile, and these windows can be combined to create new windows
whose properties are a combination of the properties of the windows they are based on.
The windows that visualise properties of the trace are called “views”, while
the rest of the windows are used to control the “views” windows. The “views” windows
show either a timeline of events or a 2D statistical analysis of values. New views can
be created by modifying the number or type of events that an existing view shows.
Several views are combined by applying basic arithmetic (add, multiply, subtract
and divide) to the values of already existing views. When a suitable view has been
created it can be saved in a configuration file. The next time the same view is desired,
it can be recreated automatically by simply loading the configuration file. In the
rest of this section, some familiarity with paraver is assumed, as this is not a user guide.
Only the process of creating derived metric configuration files and 2D statistical views
is described, together with the options that are of key importance for creating and using
these files.

The default version of paraver’s GUI that is available on HPCx fails to start due to
a license problem. For this project paraver was installed in, and executed from, a user’s
directory. After a trace had been created, paraver was started from the command
line. This opens three windows, two of which are paraver’s main control windows.
The main windows are titled “paraver” and “Global Controller”, while the
third one is the “Visualiser Module” window. The first one provides loading
and unloading of input files and the second gives access to other windows that provide
several other features paraver offers. After loading a trace there is also a window that
shows a timeline view of the whole trace. When paraver is initially invoked, this view of
the whole loaded trace is the only one. Values of hardware counters can be
accessed by left clicking near an event in the main view. Events are marked by little
green flags in any view that has the FLAG button turned on. The little flags are shown
at the top of the timeline, while the FLAG button is at the bottom of the view. Figure 3
illustrates the windows that are shown after following the instructions in this paragraph.

Figure 3: A trace loaded to paraver

In order to begin creating other views we need to duplicate the initial view and start
working on the new one. This is done by right clicking on the existing view and
choosing clone. The Visualiser Module controls the views that have been created
and shows the relationships between views, which is something to be clarified later.
To customise the view created we need to use the Filter Module to filter the events
traced, so that we only show the ones we want to see. For example, to create a view
showing the instructions completed we should choose the clone of the first view, open
the Filter Module window from the Global Controller window and choose
to show the user events of type instructions completed. There is one
more option next to the type field, which should be changed from “all” to “=” so that
only this event is shown in the view. The Filter Module is shown in figure 4.

Having selected the desired event, paraver does not update the view to depict values
of the type just chosen until the REDRAW button is clicked. Often when this is done,
a small triangle with an exclamation mark in it appears at the bottom left of the view,
denoting a problem with the colour scaling. This can be fixed by right
clicking on the view and selecting:

Scale -> Fit Y-scale -> Fit both Y-scale

Having followed these, there should be a new window showing the values recorded for

an event, like the one in figure 5, which can now be used to create a derived metric view.


Figure 4: Filtering Window

Before creating a derived metric view, another of paraver’s control windows should
be introduced. It is the Semantic Module, which can be accessed from the Global
Controller window. The views created so far colour the timeline according
to the values recorded at the next event that triggered hardware counter monitoring. This
is not always the desired behaviour. To illustrate this, suppose there are two events which
trigger monitoring of hardware counter values and the code that is profiled executes
between these two events, as shown in figure 6. The default behaviour is to colour
according to the value V2 at time T2. The derived metric we need to calculate might
depend on the value V1 at time T1 or on the weighted mean value between the two. For
example, one might want each section in a view to be coloured according to the value the
event had at the beginning of each section, or according to the time average between
events. This can be done by opening the Semantic Module and choosing the appropriate
option under

Thread -> Event

The following options can be used to achieve the examples described above:

Last Evt Val , Average Next Evt Val

One available option is to show the value that was recorded at the beginning or at the
end of an interval between two events (Last Evt Val, Next Evt Val). Another useful option
is to automatically divide either of these values by the duration of the interval
(Avg Last Evt Val, Avg Next Evt Val). It is important to know these options if one needs
to reproduce the work done in this project, as the wrong choice was often the source of
mis-configured derived metric views.

Figure 5: A created view
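The four options can be illustrated with a small Python sketch. It is not Paraver's implementation; it only mimics the behaviour described above for a hypothetical list of (time, value) events, and the function name `interval_values` is an assumption made for this example.

```python
def interval_values(events, mode):
    """Return (t_start, t_end, shown_value) for each interval between events.

    events: list of (time, value) pairs sorted by time (hypothetical data).
    mode:   one of the four option names discussed in the text.
    """
    shown = []
    for (t1, v1), (t2, v2) in zip(events, events[1:]):
        if mode == "Last Evt Val":        # value at the start of the interval
            value = v1
        elif mode == "Next Evt Val":      # value at the end of the interval
            value = v2
        elif mode == "Avg Last Evt Val":  # start value divided by the duration
            value = v1 / (t2 - t1)
        elif mode == "Avg Next Evt Val":  # end value divided by the duration
            value = v2 / (t2 - t1)
        else:
            raise ValueError(f"unknown mode: {mode}")
        shown.append((t1, t2, value))
    return shown


# Three hypothetical events: (time, counter value).
events = [(0.0, 100), (10.0, 250), (30.0, 400)]
for mode in ("Last Evt Val", "Next Evt Val", "Avg Next Evt Val"):
    print(mode, interval_values(events, mode))
```

Comparing the printed lists shows how the same trace can colour an interval with V1, V2 or a rate, which is why the choice matters when two views are later combined.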

3.5.4 Creating a configuration file

In order to create a derived metric view, one first needs to open two views that show the
metrics to be combined, as shown in figure 7. This section’s example is a derived metric
that calculates the fixed point operations per cycle, so we need one view showing the
“FXU produced a result” metric and one showing the processor cycles. Both views were
created as described in the previous section.

To create the new derived metric window, one selects the first metric’s window in the
Window Browser of the Visualiser Module and then presses the Derived button. A small
window opens and asks for the other view that the first one should be multiplied with.
After the other view’s window name has been selected, a new view is created that shows
the product of the two previous ones and looks like figure 8.

Figure 6: Timescale

By accessing the Semantic Module for the derived window, one can change the
operation to add, subtract, divide, maximum or minimum and assign weights to each
operand. The newly created view shows a derived metric which is the result of the
chosen operation. To save the view to a configuration file, right click on the view and
choose save as.
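As a rough illustration of what a derived view computes, the following Python sketch combines two views value by value. The function `derive`, its parameter names and the sample numbers are assumptions made for this example; they are not part of paraver.

```python
import operator

# Operations the text lists for a derived window; "product" is the default.
OPERATIONS = {
    "product": operator.mul,
    "add": operator.add,
    "subtract": operator.sub,
    "divide": operator.truediv,
    "maximum": max,
    "minimum": min,
}


def derive(first, second, operation="product", w_first=1.0, w_second=1.0):
    """Combine two views sampled on the same intervals, value by value.

    Each operand is multiplied by its weight before the operation is applied.
    """
    op = OPERATIONS[operation]
    return [op(w_first * a, w_second * b) for a, b in zip(first, second)]


# Hypothetical per-interval counter values, not measurements from HPCx.
fxu_results = [1200.0, 3000.0, 0.0]
cycles = [4000.0, 6000.0, 5000.0]

fixed_point_per_cycle = derive(fxu_results, cycles, "divide")
print(fixed_point_per_cycle)
```

Note that a "divide" view is only meaningful where the second operand is non-zero; this sketch would raise an error on an interval with zero cycles.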

If the steps briefly described here are not followed carefully, the results reported by
paraver might not be correct. No derived view should be used unless it has been verified
as correct, and that verification was the main goal of this project. Now that the process
of creating configuration files has been described, it is easy to understand how important
it is for anyone willing to use such features of paraver to be aware of these details.

Figure 7: Two views to be used to create a derived

4 Aim and methodology, test case programmes

This project’s objective was to create configuration files for paraver that automatically
create views of the trace, showing values collected from the hardware counters or metrics
derived from those counters. Such files already exist, but not for the Power5 processor
hardware counters. These views can also be used to obtain statistical summaries of all
these values, recorded or derived, through paraver. The files were validated to ensure
that their results are correct. Validation took place by profiling some codes using both
paraver and the HPM toolkit and then comparing the results to check whether they match.
The difficulty in deciding whether the results match is that some of the measured
quantities are not always reproducible, nor can they be estimated a priori.

Validating the results was easy for some of the derived metrics and harder for others.
Derived metrics that count quantities which can be estimated by looking at the source
code are the easy ones to verify. Examples of such quantities are floating point
operations or fused multiply-add instructions. The harder ones are quantities that can
be estimated but are not known in advance. For example, the amount of data loaded from
the L3 cache can be estimated to be the total amount of a programme’s data only if the
programme does not use any of its variables twice. There are other quantities that are
even harder to verify, such as cache misses or TLB misses.

Figure 8: Derived metric view

When profiling parallel applications using libHPM, one text file is created for each
process traced. This file contains, for each instrumented section, the difference between
the hardware counter values at the end and at the start of the section. These files can be
very big, but as they are text files they can easily be manipulated using standard unix
tools to extract the values of interest. For this project the following procedure was used:

1. Use a couple of commands to extract the hardware counter values for each
instrumented section and paste them into a single text file.

2. Import this file into a spreadsheet and work on the data to compute derived metrics.

The commands that were used are:

for i in *.hpm; do sed -n '38,43p;85,90p' $i | \
perl -np -e 's/.*?: *(\d*).*?/\1/g' > ${i%%.*}.values ; done

paste *.values > ALL.values

One script file containing the above commands was created for each hardware counter
group. Values of interest were not always at the same line in the output files of different
hardware counter groups. The process of submitting a code to be traced was done using
scripts. Several other script files copied these values to a spreadsheet, so that derived
metrics could be calculated and compared to the ones paraver calculated, which made the
procedure almost automated. The process of obtaining the derived metric values for parts
of a code with paraver is described later in this chapter.
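For readers who prefer a single script to the sed, perl and paste pipeline, the extraction step can be sketched in Python as follows. This is an illustration only: it assumes that each selected line of a libHPM output file has the form "label : value", the line ranges are the ones used in the sed command for one counter group, and the function names are hypothetical.

```python
import re
from pathlib import Path

# Line ranges (1-based, inclusive) taken from the sed command in the text.
RANGES = [(38, 43), (85, 90)]


def extract_values(path, ranges=RANGES):
    """Return the first integer after a ':' on each selected line of one file."""
    lines = Path(path).read_text().splitlines()
    values = []
    for first, last in ranges:
        for line in lines[first - 1:last]:
            match = re.search(r":\s*(\d+)", line)
            if match:  # lines without a "label : value" pair are skipped
                values.append(int(match.group(1)))
    return values


def collect(directory="."):
    """Map each *.hpm file name in a directory to its extracted values."""
    return {p.name: extract_values(p) for p in sorted(Path(directory).glob("*.hpm"))}


if __name__ == "__main__":
    for name, values in collect().items():
        print(name, *values, sep="\t")
```

As with the shell commands, the line ranges have to be adjusted for every hardware counter group.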

Once the spreadsheet was populated with the numbers that libHPM and paraver reported,
it held a very large amount of data. For each trace and each processor, the desired
derived metrics were calculated within the spreadsheet, and then an average and a
standard deviation were computed. These values were then compared to the ones paraver
reports.
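The spreadsheet step amounts to applying a formula to each process's counters and summarising the results. A minimal Python sketch of that calculation is given below; the counter values are made up for illustration and the helper name `summarise` is an assumption.

```python
from statistics import mean, stdev


def summarise(per_process_counters, metric):
    """Apply `metric` to each process's counters, then return (mean, stdev)."""
    values = [metric(counters) for counters in per_process_counters]
    return mean(values), stdev(values)


# Hypothetical counter values for four processes, not measured data.
counters = [
    {"PM_INST_CMPL": 9.0e6, "PM_CYC": 1.2e7},
    {"PM_INST_CMPL": 9.1e6, "PM_CYC": 1.3e7},
    {"PM_INST_CMPL": 8.9e6, "PM_CYC": 1.2e7},
    {"PM_INST_CMPL": 9.0e6, "PM_CYC": 1.1e7},
]

# Instructions per cycle as an example of a derived metric.
ipc_mean, ipc_std = summarise(counters, lambda c: c["PM_INST_CMPL"] / c["PM_CYC"])
print(f"IPC mean = {ipc_mean:.3f}, standard deviation = {ipc_std:.3f}")
```

A small standard deviation across processes suggests that the quantity is reproducible enough to be compared with the single value paraver reports.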

4.1 Custom written programmes

Some simple and short codes were written and used as test cases in this project. They
were profiled with both paraver and libHPM so as to verify the results paraver reported.
Although these codes did not perform useful calculations or provide a service, they were
useful for two reasons. The first was that learning how to use paraver and the HPM
toolkit was not straightforward, and using programmes of complexity similar to the
“Hello, world” programme made it easier to learn than using production-ready codes,
which are quite complex. Both paraver and the HPM toolkit are documented and some
technical reports are also available on HPCx, but it was not always easy to reproduce
what was documented. The second was that simple codes can be written specifically for
validating the profiling of quantities that can be estimated, such as floating point
operations completed. The computational parts of the code that was finally used for this
purpose are these:

integer :: ierror, i, j, rank
integer, parameter :: N=2000, K=1000
real*8 :: a(N), b(N)

a=1.123123d0
b=101.321456d0

do 10 j=1,K
  do 20 i=1,N
    b(i) = b(i) + a(i)
20 continue
10 continue

do 30 j=1,K
  do 40 i=1,N
    b(i) = 1.0 + a(i)
40 continue
30 continue

end program hello

The code used also included MPI calls to perform the same calculations on several
processors. This was done so as to utilise all processors available on a compute node on
HPCx. This does not affect the results or the process of creating configuration files, but
it was preferred as it mimics a real code better than running on a single CPU. This
programme performs a known number of floating point operations and accesses a known
number of variables. Depending on the parameters N and K chosen, one can also
experiment with cache misses at all levels while maintaining a run time that is not too
short. Different array sizes can be used to verify quantities that are not easy to test. For
example, if the arrays are small enough to fit in a certain cache level, then the data
loaded from this cache should be equal to the arrays’ sizes.
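The expected values for this test code can be worked out in advance. The short Python sketch below does the arithmetic for the parameters in the listing; the 64KB L1 data cache size is the per-core figure quoted in section 4.2 of this dissertation, and the variable names are chosen for this example only.

```python
N, K = 2000, 1000          # parameters from the listing
BYTES_PER_REAL8 = 8        # size of one real*8 element
L1_DATA_CACHE_KB = 64      # per-core L1 size as quoted in section 4.2

# One floating point addition per inner iteration, in each of the two loop nests.
flops_per_loop_nest = N * K

# Arrays a and b together form the working set.
working_set_bytes = 2 * N * BYTES_PER_REAL8
fits_in_l1 = working_set_bytes <= L1_DATA_CACHE_KB * 1024

print("expected flops per loop nest:", flops_per_loop_nest)
print("working set in bytes:", working_set_bytes)
print("fits in L1 data cache:", fits_in_l1)
```

With N=2000 and K=1000 each loop nest performs two million additions, which is the number a correct floating point operation count should report for one instrumented section.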

4.2 Stream Benchmark

Sustainable Memory Bandwidth in High Performance Computers (STREAM) is a simple
synthetic benchmark programme that measures sustainable memory bandwidth (in MB/s)
and the corresponding computation rate for simple vector kernels. It tries to measure
the effective memory bandwidth of a computer by fetching data from memory to the
processor. This is often the bottleneck on modern computers. The floating point
operations per second achieved can be severely reduced if there are a lot of cache misses
and a programme suffers all the L2, L3 or main memory latencies. This benchmark uses
two to three arrays and performs the following calculations:

c(j) = a(j)                  copies one array to the other.
b(j) = scalar*c(j)           scales one array and stores it in the other.
c(j) = a(j) + b(j)           adds two arrays and stores the result in the third array.
a(j) = b(j) + scalar*c(j)    scales one array, adds it to another and stores the result in the third array.
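The four calculations, usually called copy, scale, add and triad, can be written out as a small Python sketch. It only illustrates the data flow between the arrays; it is not the STREAM source and makes no attempt to measure bandwidth.

```python
def stream_kernels(a, b, c, scalar):
    """Run the four STREAM kernels once, in order, on equal-length lists."""
    n = len(a)
    for j in range(n):              # copy
        c[j] = a[j]
    for j in range(n):              # scale
        b[j] = scalar * c[j]
    for j in range(n):              # add
        c[j] = a[j] + b[j]
    for j in range(n):              # triad
        a[j] = b[j] + scalar * c[j]
    return a, b, c


n = 4
a, b, c = stream_kernels([1.0] * n, [2.0] * n, [0.0] * n, scalar=3.0)
print(a, b, c)
```

Copy and scale touch two arrays per iteration while add and triad touch three, which is why the text speaks of two to three arrays.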

Stream is a well-established benchmark and is also used in the HPC Challenge benchmark,
which targets big super-computing clusters. The MPI version of stream was used in order
to utilise every processor in an LPAR. In order to stress the memory system of a computer,
the benchmark operates on arrays that are too large to be held in the caches. It is
recommended that the arrays are at least four times the size of the available caches. On
HPCx, each LPAR has 16 processors, each having 64KB of L1 data cache, with 1.9MB of L2
cache for every pair of cores and 128MB of L3 cache for every MCM (i.e. four chips/eight
cores). In total there are 2 * 139366.4 = 278732.8KB of cache available. However, as this
benchmark was used as a test code and not for benchmarking the machine, several array
sizes were used, some of which fitted in the L1 cache, in the L1 and L2 caches, or in all
cache levels. They were used to test that the hardware counters count cache hit or cache
miss rates. Stream was also chosen because it is a simple code, yet a fully functional
code that is used in real life.
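The total cache figure can be reproduced from the per-level sizes given in the text. The following Python lines are only a worked version of that arithmetic; the constant names are chosen for this example.

```python
CORES_PER_MCM = 8            # four chips, two cores each
L1_KB_PER_CORE = 64          # L1 data cache per processor, as quoted in the text
L2_MB_PER_CORE_PAIR = 1.9    # L2 cache shared by a pair of cores
L3_MB_PER_MCM = 128          # L3 cache per MCM
MCMS_PER_LPAR = 2            # 16 processors per LPAR / 8 cores per MCM

per_mcm_kb = (CORES_PER_MCM * L1_KB_PER_CORE
              + (CORES_PER_MCM // 2) * L2_MB_PER_CORE_PAIR * 1024
              + L3_MB_PER_MCM * 1024)
total_kb = MCMS_PER_LPAR * per_mcm_kb

print(f"cache per MCM:  {per_mcm_kb:.1f} KB")
print(f"cache per LPAR: {total_kb:.1f} KB")
```

The per-MCM sum is 512 + 7782.4 + 131072 = 139366.4KB, and doubling it gives the 278732.8KB quoted for one LPAR.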

4.3 LAMMPS code

LAMMPS is a molecular dynamics code that

models an ensemble of particles in a liquid, solid or gaseous state. It can be
used to model atomic, polymeric, biological, metallic or granular systems.

as is mentioned on the HPCx web site. It is a code that scales very well to very large
processor numbers. It was used to test the paraver configuration files created using a
complicated production-ready code.

4.4 Using paraver to obtain information on sections of a code

The views created as described in the previous chapter, and the configuration files that
were then saved, were used to obtain profiling information for the code sections
described earlier in this chapter. The first problem in doing this was how to trigger
hardware counter monitoring before and after the regions of interest. For the hello and
stream_mpi codes this was done by adding barriers to mark the interesting sections of
the code. One MPI_Barrier call was made just before the section was entered, and one
just after it finished executing. An alternative is to move these sections of the code to
a function and ask the ompitrace tool to monitor the counters just before this function
is called and immediately after it exits. The second method was used for triggering
hardware counters for the lammps code, as the first method was not applicable. A text
file was created containing the name of the function that performs the main calculation.
This file was passed as an argument to ompitrace so that it knew at which function calls
it should trigger hardware counter monitoring.

Unlike libHPM, in which sections of interest are marked by calls to libHPM, paraver
monitors hardware counters every time an event (MPI or user function call) happens, as
was said in the section describing instrumentation with paraver. Currently there is no
way to mimic libHPM’s behaviour in paraver, nor a way to switch profiling off or on;
therefore a marking technique was applied.

A user defined event was used, and different values of it corresponded to different
regions of the code. The value one was used to mark the regions of the code that were
not of interest to us, while the values two, three, etc. were used to mark the sections
that were of interest. Having done this, one can then get a summary of the results for
each region by using the 2D statistical analyser paraver provides. The way to do this is
as follows. The trace is loaded with paraver in the usual way and a configuration file is
loaded as has already been described. Then a new view is created and set to show the
user defined metric. The Statistical Analyser is then opened and the cursor changes to
indicate that paraver is expecting the user to choose a section of one of the views that
are active in the current analysis. The user then chooses the whole timeline that shows
the user defined metric and a 2D statistical analysis is produced, initially containing
the values the user set the metric to. In our case that is 1, 2, 3, ... and this is shown
in the following screenshot:

To compute the statistics of the interesting parts, one needs to choose a function under
Statistic, then select the derived metric under Data Window and then select Repeat.
There are several functions available under Statistic, the most interesting ones for the
purposes of this project being average and maximum. This view can also be saved to a
configuration file, just like the derived metric views.

It was decided to first change the user defined value and then trigger hardware counter
monitoring.
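The effect of the marking technique can be pictured with a short Python sketch: every interval of the trace carries the current marker value, and the statistics are computed per marker. This is only an analogy for what the 2D statistical analyser reports; the function name and the data are hypothetical.

```python
from collections import defaultdict


def region_statistics(intervals):
    """Group metric values by region marker and return {marker: (average, maximum)}.

    intervals: (region_marker, metric_value) pairs, one per trace interval.
    Marker 1 stands for code that is not of interest; 2, 3, ... mark regions.
    """
    grouped = defaultdict(list)
    for marker, value in intervals:
        grouped[marker].append(value)
    return {m: (sum(v) / len(v), max(v)) for m, v in grouped.items()}


# Hypothetical derived metric values for a handful of intervals.
intervals = [(1, 0.5), (2, 2.0), (2, 4.0), (1, 1.5), (3, 6.0)]
for marker, (average, maximum) in sorted(region_statistics(intervals).items()):
    print(f"region {marker}: average = {average}, maximum = {maximum}")
```

Because the marker value is changed before hardware counter monitoring is triggered, each recorded interval is attributed to exactly one region.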


5 Results

5.1 64/32bit note

Even though the Power5 is a 64-bit chip, all compilations were in 32-bit mode. When the
hello code with 32-bit wide variables was profiled using hardware counter group 128,
there were 50% more level 1 cache load/store instructions performed than expected.
While running the simple code it was observed that, when using 32-bit variables and
addressing, one more operation was performed. It appears that all calculations are
performed in 64-bit arithmetic, as the Power5 is a 64-bit microprocessor, and that
storing 32-bit floating point variables required one more instruction. Using 64-bit wide
variables decreased the memory load/store operations to the estimated number. Using
SIMD instructions (named VMX by IBM) would probably further improve the code’s
performance.

5.2 Existing configuration files

There were numerous configuration files that were tested for this project. We did not go
as far as verifying that the numbers Paraver calculates for these three codes are correct.
If there had been more time, testing could have been completed and more results would
have been included here. The files were not verified according to the intended work plan,
but all of them are properly defined. This was verified by manually looking at the
hardware counter values recorded by paraver and libHPM, which match. Opening text files
to look at libHPM’s results and then comparing them by eye with the values Paraver
reports is time consuming and error prone. However, it was done because the intended
testing plan did not work out. The plan was to compile tables for each test case
containing values from both Paraver and libHPM. These values should be either the same
or similar, depending on the quantities measured. Testing was not completed as planned
because we were not able to create such tables, for a number of reasons which vary from
case to case. Anyone using these configuration files should be careful with scales
(especially time scales), with the visual representation of the data in the views, and
with how the hardware counter values are used to compute the derived metrics. The
following list shows the derived metrics that were tested. It gives the name of the
configuration file, the hardware counter group it requires and a short description:

Configuration file          Group  Description
aflops                      137    MFLOP/s.
data_L25_modified           50     # of lines from L2.5 that were modified.
data_from_L25_shared        50     # of lines from L2.5 that are shared.
data_from_L275_modified     50     # of lines from L2.75 that were modified.
data_from_L275_shared       50     # of lines from L2.75 that were shared.
data_loaded_from_L3         129    Data loaded from L3 cache.
data_loaded_from_lmem       129    Data loaded from local memory.
dataTLB_misses              43     # of data TLB misses.
Fixed_point_op_per_cyc      144    FXU produced a result per processor cycle.
FMA_percentage              137    FPU executed one multiply-add instruction per aflops.
Instructions_per_cycle      most   Instructions completed per cycle.
Instr_per_load_store        128    Instructions per load/store.
Instr_per_run_cycle         most   Instructions completed per run cycle.
L1_load_misses              142    L1 load misses.
L1misses_per_us             142    L1 misses per microsecond.
L1_store_misses             142    L1 store misses.
Loads_stores_perTLBmiss     130    # of loads per stores per TLB miss.
Loads_per_load_miss         43     # of loads per load miss.
Loads_per_TLB_miss          43     # of loads per TLB miss.
Local_L2_load_traffic       128    Local L2 load traffic.
Local_L3_load_traffic       131    Local L3 load traffic.
Local_memory_load_traf      131    Local memory load traffic.
MIPS                        all    Millions of instructions per second.
Stores_per_store_miss       44     # of stores per store miss.
%TLB_misses_per_run_cy      130    % TLB misses per run cycle.
TLB_misses_per_run_cy       130    TLB misses per run cycle.
Total_FP_Load_Store_Op      141    Total FP load/store operations.
Total_L1_misses             142    Total L1 misses.
Total_loads_from_L2         128    Total loads from local L2 cache.
Total_loads_from_L3         131    Total loads from local L3 cache.
Total_Loads_From_local_mem  131    Total loads from local memory.

All of these configuration files read the correct values from the hardware counters and
the derived metrics are properly defined. This was verified by comparing by hand the raw
values reported by libHPM and by Paraver. These values were verified only for the simple
code, as it was the only one for which this could be done manually. The other two
programmes could not be checked this way, but only by comparing the tables reported by
Paraver’s Statistical Analyser to the same quantities computed from libHPM’s output.
These quantities were computed in a spreadsheet, as described in an earlier chapter.
When loading some of the other configuration files, minor adjustments, such as scaling,
might be necessary. We were not able to determine exactly which configuration files are
ready to use and which are not, because a configuration file could appear to be correct
when profiling the simple code but not when profiling the stream_mpi code, or vice versa.
One common problem was that the configuration files calculated incorrectly scaled metrics
and we had to change the scale.

5.3 AFLOPS

One of the first derived metrics we worked on was the algebraic floating point
operations per second. The floating point operations per second achieved is one of the
most interesting metrics and serves well as an example of the testing process we
followed. This is a tricky metric, as the Power5 chip does not provide a hardware counter
that counts FLOPs. Instead, the FLOPs rate can be computed algebraically from the
PM_FPU_FIN, PM_FPU_STF and PM_FPU_1FLOP hardware counters. The PM_FPU_FIN
hardware counter counts how many instructions the floating point unit finished. As the
Power5 processor supports a fused multiply-add instruction, such operations are counted
as a single instruction, which they are. From a performance perspective, however, this
instruction counts for two floating point operations. To complicate things further, this
counter also counts floating point stores. As non-FMA flops are counted by PM_FPU_1FLOP
and floating point stores by PM_FPU_STF, the actual flop count is given by this
expression:

2 * (PM_FPU_FIN - PM_FPU_STF) - PM_FPU_1FLOP

This can also be found in [2]. It should be noted here that there is no single counter
group that contains all three of these counters. To overcome this, the following
approximation can be used:

PM_FPU_1FLOP + 2 * PM_FPU_FMA

It is this formula that the aflops configuration file uses. Table 3 shows the average
value over all 16 processors that was recorded by libHPM for each counter, and then
computes the aflops derived metric. The last line shows the aflops metric as Paraver
calculates it.

a(i)+b(i): aflops calculated (FLOPs)

Source    Quantity                             Value
libHPM    PM_FPU_1FLOP                         2000010.31
libHPM    PM_FPU_FMA                           0
libHPM    (FPU_1FLOP + 2 * FPU_FMA) * 1e-6     2
Paraver   aflops                               2

Table 3: AFLOPS results from libHPM and Paraver for the a(i)+b(i) calculation in the
simple “Hello” code
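The two expressions for the flop count can be checked against each other with a few lines of Python. The function names are chosen for this sketch; the only measured numbers used are the libHPM averages from Table 3.

```python
def aflops_exact(fpu_fin, fpu_stf, fpu_1flop):
    """Flop count from the three-counter expression."""
    return 2 * (fpu_fin - fpu_stf) - fpu_1flop


def aflops_approx(fpu_1flop, fpu_fma):
    """Flop count from the two-counter formula used by the aflops configuration file."""
    return fpu_1flop + 2 * fpu_fma


# libHPM averages from Table 3 for the a(i)+b(i) loop.
mflop = aflops_approx(2000010.31, 0) * 1e-6
print(round(mflop, 2))

# Made-up counters for a consistency check: when every non-store FPU instruction
# is either a single flop or an FMA, both expressions give the same count.
one_flop, fma, stores = 100, 30, 20
fin = one_flop + fma + stores
print(aflops_exact(fin, stores, one_flop) == aflops_approx(one_flop, fma))
```

The rounded result agrees with the value of 2 in the table and with the two million additions expected from N=2000 and K=1000.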

The aflops.cfg and results.odf files are included in the electronically submitted
archive, and aflops.cfg is also included in the appendix. The aflops.cfg file can be used
in paraver to create a view that is suitable for the statistical analyser to produce
results similar to the ones shown here. The results.odf file contains all the
measurements that this section discusses.

6 Conclusions and tools evaluation

6.1 Project assessment

Only a few of the aims of this project were accomplished, since the project ran short of
time. Delivering more configuration files was a major objective for this project which
was not fully achieved, as time constraints did not allow it. Had they been created and
tested, they would have been of great use to users who are willing to use paraver for
optimising their software on HPCx. However, the ones tested are some of the ones most
likely to be used, and they are now available. Another objective of this project was to
create the same, or similar, configuration files for a cluster of a different
architecture and use them to compare the performance of several codes on different
architectures.

The original work plan was to finish creating and testing the configuration files for
the Power5 architecture by the end of July. This is not how things turned out, and there
are several reasons for that. At the very beginning of this project, a problem was that
the available documentation concerning the environment we worked on was not accurate.
This resulted in a lot of time being spent on trying to find out where paraver and the
HPM toolkit were installed and on getting some profiling results from them. Even when
some directories were found, the versions of the tools that these directories contained
either did not fully support the Power5 microprocessors or were not used correctly. Use
of the HPM toolkit suffered the most from these documentation issues.

Another reason that this project was so much delayed is that it relied heavily on
instructions that we would be given by paraver’s authors. Even though they were very
helpful and did give us the instructions we needed, this process was not finished before
early July. The author was inexperienced in profiling, which did not help this project
at all. Additionally, the fact that he did not know how to use paraver, which is quite a
hard tool to use, contributed to major stalls in the research. His inexperience and
paraver’s user interface combined to magnify these effects on the project. These
problems could have been avoided if it had already been known, before the project began,
how these tools could be used for its needs. No matter how painful this trial and error
procedure was, it was still an educational process that gave results which can also
benefit other people who attempt to work on profiling code.

It should be clear by now that this project involves a lot of work and has a lot of
parameters one should take care of. It is the author’s view that if this dissertation
can help people use the HPM toolkit and paraver without troubles like the ones
encountered in this project, then this project should be considered successful, more so
than by simply delivering some configuration files. The value of the work done turns out
to lie in the fact that both tools can be tricky to use, while the actual process of
creating configuration files is easy. As computer hardware becomes outdated every few
years, being able to use a powerful, even though not perfect, tool like paraver to
create configuration files is as important as having the specific files for an
architecture.

6.2 HPM evaluation

The HPM toolkit is pretty straightforward to use. If there were not multiple versions
on HPCx, it could have been used within minutes. This would also be the case if the
available documentation on the HPCx web site were updated. Hopefully someone who has
read this dissertation can now do so quickly, without searching multiple directory
trees. As it is backed by IBM, it is solid and it integrates very well with the HPCx
environment.

If more functionality were to be added, then a way for the user to define derived
metrics would probably be a very useful feature. One more step could be taken by
creating a GUI to graphically view the values recorded.

6.3 Paraver

Paraver is a very powerful tool with a lot of features, which is also available free of
charge even though it is not under any of the licenses approved by the Open Source
Initiative. It is a great tool that can profile parallel codes and can easily provide a
lot of information about a code’s run time behaviour. This information would otherwise
be difficult to obtain. Special thanks need to go to the Barcelona Supercomputing Centre
(BSC), namely Prof Jesus Labarta, Judit Gimenez, German Llort and Harald Servat, for
their tutorials and their support.

There are several things that could be improved because paraver, like any other
programme, is not perfect. To begin with, its stability needs to be seriously improved.
When it was run for this project it crashed very often and it occasionally caused the
operating system to crash. Its GUI layout could perhaps be changed so that it is easier
to use. As it is, it is very difficult to understand what the many available options
are, and it is hard to find which option implements a feature the user is thinking of.
This results in a user thinking paraver cannot do something because he cannot find how
to do it, while it actually can be done. After all, a utility that uses a GUI is
supposed to be easy to use. It is the writer’s belief that if it were an open source
code it could be greatly assisted by an open source development model.

One feature that can be improved is that the windows showing timeline views of a
profile, or windows that contain values found in the trace, are not always updated as
the user analyses a profile, and the user needs to explicitly press the REDRAW button so
that paraver will update the values displayed in a view. The same applies when one tries
to resize a window: one needs to manually ask paraver to rescale the view to fit the
window. There is no apparent reason why this is not done automatically. As regards
hardware counter usage, another feature that could be added and would make paraver
easier to use is to automatically pick or disable options depending on other options. It
would also be very helpful if there were a way to manually trigger hardware counter
recording when instrumenting a code, just like libHPM does. Finally, a command line
interface could also be of use to automate tasks performed repetitively on several
profile files.

6.4 Future work

Future work might involve creating and testing more configuration files for HPCx. This
could probably be done after questioning current users who are interested in using these
configuration files, so that they can give a clear picture of their needs. Their answers
could indicate which other derived metrics would be the most useful. Another piece of
work that could be done is to create configuration files for other architectures and
compare how several widely available codes perform on different architectures. Moving in
a slightly different direction than using paraver, one could create programmes or
scripts to manipulate HPM’s output to obtain profile information based on hardware
counter information.

A Appendix

A.1 Source code and sample Makefile

This is the source code of the two codes that were used for the tests in this dissertation. LAMMPS’ source code, the third code used, is too long to be included here. Instead, only the file that was instrumented is included here. There were two versions of each programme, one instrumented using libHPM and one using Paraver. There are also two sample Makefiles that can be used to compile a programme that is instrumented with either libHPM or Paraver.

Hello-libHPM

program hello
implicit none
include 'mpif.h'
#include "f_hpm.h"

integer :: ierror, i, j, rank
integer, parameter :: N=2000, K=1000
real*8 :: a(N), b(N)

call MPI_INIT(ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
call f_hpminit(rank, 'test4')

a=1.123123d0
b=101.321456d0

print *, 'Hello, starting ',N,' adds of a_i+b_i',K,' times.'
call f_hpmstart(80, 'bna')
do 10 j=1,K
  do 20 i=1,N
    b(i) = b(i) + a(i)
20 continue
10 continue
call f_hpmstop(80)

call MPI_BARRIER(MPI_COMM_WORLD,ierror)

print *, 'Starting ',N,' adds of 1+a(i)',K,' times.'
call f_hpmstart(90, 'an1')
do 30 j=1,K
  do 40 i=1,N
    b(i) = 1.0 + a(i)
40 continue
30 continue
call f_hpmstop(90)

call f_hpm_terminate(rank)
call MPI_FINALIZE(ierror)

end program hello

Hello, Paraver

program hello
implicit none
include 'mpif.h'
include 'ompitracef.h'

integer :: ierror, i, j
integer, parameter :: N=2000, K=1000
real*8 :: a(N), b(N)

call MPI_INIT(ierror)
call OMPITRACE_EVENT(6000019, 1)

a=1.123123d0
b=101.321456d0

print *, 'Hello, starting ',N,' adds of a_i+b_i',K,' times.'

do 10 j=1,K
  do 20 i=1,N
    b(i) = b(i) + a(i)
20 continue
10 continue

print *, 'Starting ',N,' adds of 1+a(i)',K,' times.'

do 30 j=1,K
  do 40 i=1,N
    b(i) = 1.0 + a(i)
40 continue
30 continue

call MPI_FINALIZE(ierror)

end program hello

Makefile Paraver

FF=mpxlf90_r
SRC=test4.f90
EXE=hello

# Compiling and linking 32bit applications, some optimizations are welcome.
FFLAGS=-q32 -O2 #-qsuffix=cpp=f90 .NO MORE NEEDED.

# Compilation only flags. Now using paraver include dirs.
FCFLAGS=-I/hpcx/home/z004/z004/nspattak/Paraver_new/include/

# Link only flags. Now using paraver libs and ompitrace library.
LFLAGS=-L/hpcx/home/z004/z004/nspattak/Paraver_new/lib -lompitrace

#
# No need to edit below this line
#

.SUFFIXES:
.SUFFIXES: .f90 .o

OBJ= $(SRC:.f90=.o)

.f90.o:
	$(FF) $(FFLAGS) $(FCFLAGS) -c $<

all: $(EXE)

$(EXE): $(OBJ)
	$(FF) $(FFLAGS) $(LFLAGS) -o $@ $(OBJ)

$(OBJ): $(MF)

tar:
	tar cvf $(EXE).tar $(MF) $(SRC) *.prv *.pcf inter.ll parallel.ll

clean:
	rm -f $(OBJ) $(EXE) core

A.2 Sample Paraver .cfg file

The aflops.cfg file is included here as an example of a configuration file.

aflops.cfg

ConfigFile.Version: 3.4

ConfigFile.NumWindows: 3

ConfigFile.BeginDescription

Algebraic floating point operations.

This is the same as MFlop/s if floating divides and square roots are small.

aflops = (FPU executed one flop instruction

+ (2 * FPU executed mult-add instruction)) * 0.000001

Uses counter group 137

ConfigFile.EndDescription

#######################################################################
< NEW DISPLAYING WINDOW FPU executed mult-add instruction >
#######################################################################

window_name FPU executed mult-add instruction

window_type single

window_id 1

window_position_x 250

window_position_y 138

window_width 600

window_height 300

window_flags_enabled true

window_units Nanoseconds

window_maximum_y 18.000000

window_scale_relative 1.000000

window_object appl { 1, { All } }

window_begin_time_relative 0.000000

window_pos_to_disp 80

window_pos_of_x_scale 18

window_pos_of_y_scale 80

window_number_of_row 12

window_click_options 1 0 1 1 1 0

window_click_info 0 0 0 0 0

window_expanded true

window_open false

window_selected_functions { 14, { {cpu, Active Thd}, {appl, Adding},

{task, Adding}, {thread, Next Evt Val}, {node, Adding},

{system, Adding}, {workload, Adding}, {from_obj, All},

{to_obj, All}, {tag_msg, All}, {size_msg, All}, {bw_msg, All},

{evt_type, =}, {evt_value, All} } }

window_compose_functions { 9, { {compose_cpu, As Is}, {compose_appl, As Is},

{compose_task, As Is}, {compose_thread, As Is},

{compose_node, As Is}, {compose_system, As Is},

{compose_workload, As Is}, {topcompose1, As Is}, {topcompose2, As Is} } }

window_analyzer_executed 0

window_analyzer_info 0.000000 0.000000 0 0

window_filter_module evt_type 1 42001054

########################################################################

< NEW DISPLAYING WINDOW FPU executed 1 flop instruction >

########################################################################

window_name FPU executed 1 flop instruction

window_type single

window_id 2



window_width 600

window_height 300

window_flags_enabled true













window_open false

window_selected_functions { 14, { {cpu, Active Thd}, {appl, Adding},

{task, Adding}, {thread, Next Evt Val}, {node, Adding},

{system, Adding}, {workload, Adding}, {from_obj, All},

{to_obj, All}, {tag_msg, All}, {size_msg, All},

{bw_msg, All}, {evt_type, =}, {evt_value, All} } }



{compose_workload, As Is}, {topcompose1, As Is}, {topcompose2, As Is} } }



window_filter_module evt_type 1 42000056

#########################################################################

< NEW DISPLAYING WINDOW Algebraic floating point operations >

#########################################################################

window_name Algebraic floating point operations

window_type composed

window_id 3

window_factors 2.000000 1.000000

window_operation add

window_identifiers 1 2



window_width 600

window_height 300

window_comm_lines_enabled false

window_color_mode window_in_null_gradient_mode



window_minimum_y 4.000000











window_open true




{compose_workload, As Is}, {topcompose1, As Is},

{topcompose2, As Is} } }
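As a rough illustration of the arithmetic this configuration describes, the following Python sketch reproduces the aflops formula from the description block. It is not Paraver code: the function names `compose_add` and `aflops` and the counter values are hypothetical. It assumes the composed window forms a weighted sum, with `window_factors 2.000000 1.000000` applied in the order given by `window_identifiers 1 2` (mult-add window first, one-flop window second). The 0.000001 scaling is taken from the description text, since it does not appear in the window lines shown above.

```python
# Sketch of the aflops derived metric; names are illustrative, not Paraver's API.

def compose_add(factors, values):
    """Weighted sum of per-window values (assumed meaning of
    window_factors combined with window_operation add)."""
    if len(factors) != len(values):
        raise ValueError("need exactly one factor per window")
    return sum(f * v for f, v in zip(factors, values))


def aflops(fma_count, one_flop_count, scale=0.000001):
    """aflops = (one-flop instructions + 2 * mult-add instructions) * 0.000001."""
    # Order follows window_identifiers 1 2: mult-add first, one-flop second.
    return compose_add([2.0, 1.0], [fma_count, one_flop_count]) * scale


if __name__ == "__main__":
    # Hypothetical counter values for one interval, for illustration only.
    print(aflops(fma_count=500_000, one_flop_count=1_000_000))
```

The factor of 2 reflects that a fused multiply-add instruction performs two floating point operations, so each counted mult-add instruction contributes two to the total.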



