HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS PERFORMANCE MEASUREMENT & ANALYSIS

Prof. Thomas Sterling, Department of Computer Science, Louisiana State University, March 1, 2011


Page 1: HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS PERFORMANCE MEASUREMENT & ANALYSIS

HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS

PERFORMANCE MEASUREMENT & ANALYSIS

Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
March 1, 2011

Page 2

Contact Info

• Steven R. Brandt
• [email protected]
• AIM: RegexGuy

Page 3

Links

• http://cct.lsu.edu/~sbrandt/csc7600l15demos.zip
• X-Ming:
– http://www.straightrunning.com/XmingNotes/
– Scroll down, click on Xming public release and install
• Putty:
– http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
– Click on putty.exe and save to the desktop

Page 4

CSC 7600 Lecture 13 : Perf. Analysis, Spring 2011

Topics

• Introduction

• Measuring System Operation

• Gprof

• Perfsuite

• PAPI

• Tau & PAPI

• Benchmarks b_eff

• MPI Tracing with PMPI

• Tau & MPI

• Summary – Material for the Test

Page 5

Topics: Introduction

Page 6

Opening Remarks

• Up until now, 2 strategies for measuring performance:
– 1) wall-clock time for user applications
– 2) benchmarks for comparing
    • Machines of different type
    • Machines of different scale
• But, we have identified factors that contribute to system operational performance, e.g.:
– Effective use of parallelism
– Cache behavior
• To make better use of HPC systems, need to measure operational behavior
– How the system is performing during application execution
– What are the application demands and bottlenecks
• Focus on SMP class system operation during this Segment
– Next Segment: measuring MPP & cluster behavior

Page 7

What you’ll Need to Know

• This is a skills-oriented lecture

• Understand the kinds and levels of metrics of system and processor operation that you can measure

• Know the kinds of tools that can expose valuable parameters of system & application operation
– Hardware counters
– Software instrumentation, data acquisition, and presentation
• Learn the basics of how to use specific tools when running your application code
– Gprof
– Perfsuite
– PAPI
– TAU

Page 8

Final initial comments (yes, I know that’s an oxymoron)

• We are only going to scratch the surface today
– Try to get the basic ideas
• This will expose you to a range of concepts, strategies, and tools
– Lots of details will be left to future discussions
• Over the next weeks, we will extend our abilities in using these tools
– But don’t hesitate to read through the documentation
– Hey, try some things out for yourself
– You’ve got a sandbox to play in (Arete)

Page 9

Topics: Measuring System Operation

Page 10

Hardware Counters

• Each processor has the ability to monitor events of various kinds

• Small set of registers used to count events. Very processor specific.

[Figure: block diagram of a node. Surviving labels: MP, L1, L2, L3, M1, M2, Mn, Controller, S, NIC, USB, Peripherals, JTAG, Ethernet, PCI-e]

Pages 11-12: slides from Philip J. Mucci, “Performance Analysis Tools and PAPI”, UTK ICL (slide images not included in this transcript)

Page 13

Hardware Events

• Floating point operations, Multiplies, Adds, Multiply-Adds, etc.

• L1/L2 cache hits/misses (see http://en.wikipedia.org/wiki/CPU_cache)

• Translation Lookaside Buffer hits/misses (virtual to physical address translation table)

• Branch prediction counters (pipelined systems must guess the next instruction to fetch)

Page 14

A Goal: Optimization

• Compile Time:
– Various levels enabled by compiler options
– Examine Compiler Output
• Run Time (Performance Analysis):
– Instrument code or execution to produce a trace
– Tools to analyze trace:
    • Standard/basic tool is gprof, but there are many others
    • Note: Java Hot-Spot environment collects data about execution and uses it to optimize a program as it runs

Page 15

Performance Analysis Tools

• Widely Ported Low-Level Interface to hardware counters: PAPI (Performance API): Supports AIX, Linux, Solaris, and even Windows! http://icl.cs.utk.edu/papi/custom/index.html?lid=62&slid=96

• Many tools built on PAPI
– Perfsuite (NCSA), psrun command
– TAU (University of Oregon)
– etc.
• Useful for:
– Finding performance bottlenecks
– Identifying cache problems (badly sized arrays)

Page 16

time

• A simple Unix command to give resource usage.

• Runs a specified program

• time [options] command [arguments …]
• Gives timing statistics about program run
– The elapsed real time between invocation and termination
– User CPU time
– System CPU time

• See: man time
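The same three figures can be read from inside a program. The sketch below assumes a POSIX system: `getrusage()` supplies user and system CPU time and `gettimeofday()` the wall clock. The helper names (`tv_seconds`, `cpu_fraction`, `self_cpu_seconds`, `wall_seconds`) are illustrative and do not come from the lecture.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval (seconds + microseconds) to seconds. */
double tv_seconds(struct timeval tv) {
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Fraction of elapsed (real) time spent on the CPU: (user + sys) / real. */
double cpu_fraction(double real_s, double user_s, double sys_s) {
    assert(real_s > 0.0);
    return (user_s + sys_s) / real_s;
}

/* User + system CPU seconds consumed so far by the calling process. */
double self_cpu_seconds(void) {
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc; /* keeps the variable used when assertions are compiled out */
    return tv_seconds(ru.ru_utime) + tv_seconds(ru.ru_stime);
}

/* Wall-clock time in seconds since the epoch. */
double wall_seconds(void) {
    struct timeval now;
    gettimeofday(&now, NULL);
    return tv_seconds(now);
}
```

Call `wall_seconds()` and `self_cpu_seconds()` before and after the region of interest and subtract. A small `cpu_fraction` suggests the run spent most of its time waiting (for example on I/O) rather than computing: for a run with real 34.274 s, user 0.082 s and sys 0.957 s it is about 0.03.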

Page 17

top

• Gives an overview of system process status and resource usage

• Provides a dynamic realtime view of a running system
– System summary information
– Currently managed tasks
– Updates every few (e.g. 5) seconds

• top -hv | -bcisS -d delay -n iterations -p pid [, pid …]

• See: man top

Page 18

Basic Tools

• Time

$ time du -s /usr > /dev/null 2>&1

real 0m34.274s
user 0m0.082s
sys 0m0.957s

• top/ps

top - 11:29:40 up 49 min, 2 users, load average: 0.32, 0.26, 0.25
Tasks: 125 total, 3 running, 121 sleeping, 0 stopped, 1 zombie
Cpu(s): 4.5%us, 0.3%sy, 0.0%ni, 94.7%id, 0.2%wa, 0.3%hi, 0.0%si, 0.0%st
Mem: 1030940k total, 1013376k used, 17564k free, 124616k buffers
Swap: 2104472k total, 32k used, 2104440k free, 411968k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4136 sbrandt 15 0 35208 15m 10m S 6 1.5 0:03.35 gnome-terminal
3761 root 16 0 82676 50m 12m R 3 5.0 1:02.82 X
5195 sbrandt 16 0 2176 1172 852 R 1 0.1 0:00.03 top
3487 root 17 0 1820 572 496 S 0 0.1 0:00.25 hald-addon-stor
3930 sbrandt 16 0 99.8m 40m 14m S 0 4.0 0:36.27 beagled

Page 19

Topics: Gprof

Page 20

gprof : quick overview

• gprof
– a utility which profiles procedures in programs, available in most Unix systems.
• gprof provides information about:
– An index for each procedure
– Parent of each procedure
– The percentage of CPU time utilized by a procedure and its calls.
– Breakdown of time used by the procedure and its descendants
– Number of times a procedure was called.
– Direct descendants of each procedure
• To use gprof:
– Compile the source code with the -pg option
– Running the executable created generates an output file gmon.out for serial programs.
– For serial programs: gprof exe gmon.out
– For parallel programs, set env variable GMON_OUT_PREFIX (e.g. to gmon.out) so that each process writes its own file, then: gprof exe gmon.out.*

Page 21

GPROF: one minute tutorial

• Steps to use gprof:
– gcc -pg -g -o prog prog.c
– ./prog
– gprof prog gmon.out
• More reading: http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html
• Finds subroutines where the most time is spent
• Cannot tell you why some routines are more costly than others. Need more information...

Page 22

Demo of gprof

Page 23

Topics: Perfsuite

Page 24: slide from Philip J. Mucci, “Performance Analysis Tools and PAPI”, UTK ICL (slide image not included in this transcript)

Page 25

Using psrun

psrun cmd (e.g. psrun du -s /usr)
– This test will measure performance counters used by the du command. No special compilation of du is required for this to work.

psprocess cmd.* (e.g. psprocess du.*.xml)
– At the bottom of this file, you will see summary events about numerous counters.

Page 26

Demo of psrun

Page 27

Topics: PAPI

Pages 28-41: slides from Philip J. Mucci, “Performance Analysis Tools and PAPI”, UTK ICL (slide images not included in this transcript)

Page 42

By hand: Verifying the PAPI Version

// When hand-instrumenting you need to check the PAPI version first
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

...

/* Verifying PAPI Version */
int v = PAPI_library_init(PAPI_VER_CURRENT);
if (v != PAPI_VER_CURRENT) {
  fprintf(stderr, "Bad PAPI version\n");
  exit(2);
}

Page 43

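By Hand: Measuring PAPI Counters

Use "papi_avail -a" to identify counters

Link with -lpapi

#include <stdio.h>
#include "papi.h"

#define NUM 3
/* Events to count: FP operations, total instructions, L1 data cache misses */
int events[NUM] = { PAPI_FP_OPS, PAPI_TOT_INS, PAPI_L1_DCM };

int main(int argc, char *argv[]) {
  int i;
  int r;
  long_long values[NUM];
  /* start counting the selected events */
  r = PAPI_start_counters(events, NUM);

  ...

  /* stop counting and copy the counts into values[] */
  r = PAPI_stop_counters(values, NUM);
  printf("end ret=%d\n", r);
  for (i = 0; i < NUM; i++) {
    printf("ctr[%d]: %f\n", i, (double)values[i]);
  }
  return 0;
}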
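Raw counts are usually easier to interpret as ratios. The sketch below assumes three counts taken in the same order as the events array (PAPI_FP_OPS, PAPI_TOT_INS, PAPI_L1_DCM); the function names are illustrative, and nothing here calls PAPI, so it compiles without the library.

```c
#include <assert.h>

/* Floating point operations per retired instruction. */
double fp_ops_per_ins(long long fp_ops, long long tot_ins) {
    assert(tot_ins > 0);
    return (double)fp_ops / (double)tot_ins;
}

/* L1 data cache misses per 1000 retired instructions. */
double l1_misses_per_kilo_ins(long long l1_dcm, long long tot_ins) {
    assert(tot_ins > 0);
    return 1000.0 * (double)l1_dcm / (double)tot_ins;
}
```

With the values array from the program above, `fp_ops_per_ins(values[0], values[1])` and `l1_misses_per_kilo_ins(values[2], values[1])` give two numbers that can be compared between runs of different length.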

Page 44

Demo: Hand instrumentation with PAPI

Page 45

Statistical profiling

• profil() - Unix library call that periodically samples the program counter of a running program. Identifies subroutines where code spends most time.

• Used by Gprof

• PAPI_profil() - Emulates profil(), but looks at a specific hardware counter. Identifies file/line where code spends most time.
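The idea behind statistical profiling can be sketched without any profiling library: each sampled program counter value is attributed to the routine whose address range contains it, and the routine with the most samples is the hot spot. This is a simplified illustration, not how profil() or PAPI_profil() is implemented; the `routine` struct and both functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one routine's address range [start, end). */
struct routine {
    const char *name;
    unsigned long start;
    unsigned long end;
    unsigned long hits;   /* number of samples that landed in this routine */
};

/* Attribute one sampled program counter value to the routine containing it.
   Returns the routine's index, or -1 if the sample is outside all ranges. */
int record_sample(struct routine *table, size_t n, unsigned long pc) {
    for (size_t i = 0; i < n; i++) {
        if (pc >= table[i].start && pc < table[i].end) {
            table[i].hits++;
            return (int)i;
        }
    }
    return -1;
}

/* Index of the routine with the most samples (the hot spot). */
size_t hottest(const struct routine *table, size_t n) {
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (table[i].hits > table[best].hits) best = i;
    }
    return best;
}
```

Because only samples are counted, the result is an estimate: a routine that runs briefly between two samples may not appear at all, which is the usual trade-off for the low overhead of sampling.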

Page 46

Using psrun to find hot spots

• gcc -g -o cmd cmd.c

• psrun -C -c papi_profile_cycles.xml cmd

– "-C" Instructs papi to use xml configurations that are in the install path rather than current directory.

– "-c papi_profile_cycles.xml" Use the named config file rather than the default.

– "papi_profile_cycles.xml" directs papi to collect file/line data.

• psprocess cmd.*.xml

– display results

Page 47

Demo : 2nd Demo of psrun

Page 48

Topics: Tau & PAPI

Pages 49-51: slides from Philip J. Mucci, “Performance Analysis Tools and PAPI”, UTK ICL (slide images not included in this transcript)

Page 52

Measuring PAPI Counters with TAU

* Set up environment

- Select counters

- export TAU_METRICS=TIME:PAPI_FP_OPS:PAPI_TOT_INS

- Select TAU makefile

- export TAU_MAKEFILE=${TAU}/lib/Makefile.tau-papi-pdt

- export TAU_MAKEFILE=${TAU}/lib/Makefile.tau-papi-mpi-pdt-trace

* Compile with special TAU compiler:

- e.g. tau_cc.sh cmd.c

* Run your code

* Use pprof to read the profile files: profile.*

Page 53

More TAU options...

• Diagnostic:

– export TAU_OPTIONS=-optKeepFiles

– Examine instrumented code (if you want to):

• eg. tau_cc.sh cmd.c

• vi cmd.inst.c

• Throttling:

– export TAU_THROTTLE=1

– export TAU_THROTTLE_NUMCALLS=400000

– export TAU_THROTTLE_PERCALL=3000

• Exploring Data Graphically:

– Windows users:

• Start Xming

• Enable X11 forwarding on Putty

– Linux users:

• ssh -X [email protected]

– Run paraprof
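The throttling variables above are understood to work as a pair: a routine stops being measured once it has been called more than TAU_THROTTLE_NUMCALLS times while averaging less than TAU_THROTTLE_PERCALL microseconds of inclusive time per call. The sketch below only models that rule as an assumption about TAU's behavior; `would_throttle` is a hypothetical helper, not a TAU function, so check the TAU documentation for your version.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the throttling test: a routine is throttled when it
   has been called more than `numcalls` times AND its average inclusive time
   per call is below `percall_usec` microseconds. */
bool would_throttle(long long calls, double inclusive_usec,
                    long long numcalls, double percall_usec) {
    assert(calls >= 0 && inclusive_usec >= 0.0);
    if (calls <= numcalls) {
        return false;   /* not called often enough to matter */
    }
    return (inclusive_usec / (double)calls) < percall_usec;
}
```

Under this reading, with the settings on the slide (400000 calls, 3000 microseconds per call), a small routine called 500000 times for a total of one second would be throttled, while a routine called only 100 times would not.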

Pages 54-56: slides from Philip J. Mucci, “Performance Analysis Tools and PAPI”, UTK ICL (slide images not included in this transcript)

Page 57

Demo of TAU

Page 58

Topics: Performance Characteristics

Page 59

Where are we?

• Three classes of parallel computing
– Capacity
– Cooperative
– Capability
• Three execution models
– Throughput
– Communicating sequential processes (message passing)
– Shared memory multithreaded
• Programming formalisms
– Condor
– MPI
– OpenMP / Pthreads
• More performance measurement
– For cooperative/message passing/MPI

Page 60

What has changed? SMP to MPP

• SMP – symmetric multiprocessor
– Shared memory
    • UMA – uniform memory access with cache coherence
– Multithreaded parallelism
– Communication through main memory
– Not scalable
– Programming in OpenMP
– DSM and PGAS provide alternative shared memory structures
    • DSM – distributed shared memory (with cache coherence)
    • PGAS – Partitioned global address space (without cache coherence)
    • Both are NUMA
• MPP – massively parallel processor
– Distributed memory
    • NUMA – non-uniform memory access
– Concurrent sequential processes parallelism
– Communication through messages between nodes
– Scalable
– Programming in MPI
– Same for commodity clusters but usually with weaker networks

Page 61

MPI Performance Characteristics

• Latency
– Time to send first bits of data across link to remote node
– Does not include overhead
• Bandwidth
– Rate of data transfer across link to remote node
• Buffers
– System or user buffers take up time to manage capacity etc.
• Blocking versus Asynchronous
– Forced ordering of computation and communication
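Latency and bandwidth are often combined in a simple first-order cost model, T = latency + bytes / bandwidth. The sketch below uses that model as an assumption (it ignores buffering, contention and software overhead); the function and parameter names are illustrative.

```c
#include <assert.h>

/* Time in seconds to move `bytes` over a link under the linear model
   T = latency + bytes / bandwidth. */
double transfer_time(double latency_s, double bandwidth_Bps, double bytes) {
    assert(latency_s >= 0.0 && bandwidth_Bps > 0.0 && bytes >= 0.0);
    return latency_s + bytes / bandwidth_Bps;
}

/* Bandwidth the application actually observes for one message: bytes / T. */
double effective_bandwidth(double latency_s, double bandwidth_Bps, double bytes) {
    assert(bytes > 0.0);
    return bytes / transfer_time(latency_s, bandwidth_Bps, bytes);
}
```

With an assumed latency of 10 microseconds and a link bandwidth of 1 GB/s, a 1 KB message sees roughly 91 MB/s while a 1 MB message sees roughly 990 MB/s: for short messages the latency term dominates.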

Page 62

Performance Factors

• Platform / Architecture Related:
– cpu - clock speed, number of cpus
– Memory subsystem - memory and cache configuration, memory-cache-cpu bandwidth, memory copy bandwidth
– Network adapters - type, latency and bandwidth characteristics
– Operating system characteristics - many
• Network Related:
– Hardware - ethernet, FDDI, switch, intermediate hardware (routers)
– Protocols - TCP/IP, UDP/IP, other
– Configuration, routing, etc
– Network tuning options ("no" command)
– Network contention / saturation

source : http://www.llnl.gov/computing/tutorials/mpi_performance/

Page 63

Performance Factors (2)

• Application Related:
– Algorithm efficiency and scalability
– Communication to computation ratios
– Load balance
– Memory usage patterns
– I/O
– Message size used
– Types of MPI routines used - blocking, non-blocking, point-to-point, collective communications
• MPI Implementation Related:
– Message buffering
– Message passing protocols - eager, rendezvous, other
– Sender-Receiver synchronization - polling, interrupt
– Routine internals - efficiency of algorithm used to implement a given routine

source : http://www.llnl.gov/computing/tutorials/mpi_performance/

Page 64

Performance Impact of Message Sizes

• Message size can be a very significant contributor to MPI application performance. In most cases, increasing the message size will yield better performance.

• For communication intensive applications, algorithm modifications that take advantage of message size "economies of scale" may be worth the effort. Performance can often improve significantly within a relatively small range of message sizes.

• The following three graphs demonstrate how increasing message size can improve bandwidth for different message size ranges
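One way to see the "economies of scale" is to compare sending many small messages with packing the same data into one. The sketch below assumes each message costs a fixed per-message latency plus bytes divided by bandwidth, and ignores the cost of packing; the function names and the figures in the note are illustrative, not measurements.

```c
#include <assert.h>

/* Cost of sending `count` messages of `bytes_each` bytes one at a time:
   the per-message latency is paid `count` times. */
double separate_cost(double latency_s, double bandwidth_Bps,
                     long count, double bytes_each) {
    assert(count > 0 && bandwidth_Bps > 0.0);
    return (double)count * (latency_s + bytes_each / bandwidth_Bps);
}

/* Cost of packing the same data into a single message:
   the latency is paid once. */
double packed_cost(double latency_s, double bandwidth_Bps,
                   long count, double bytes_each) {
    assert(count > 0 && bandwidth_Bps > 0.0);
    return latency_s + ((double)count * bytes_each) / bandwidth_Bps;
}
```

For 100 messages of 80 bytes with an assumed 10 microsecond latency and 1 GB/s bandwidth, the separate sends cost about 1.0 ms and the packed send about 18 microseconds. The difference is (count - 1) times the latency, which is why aggregation helps most when messages are small.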

Page 65

Topics: Benchmarks b_eff

Page 66

HPC Challenge Benchmarks

• HPC Challenge: http://icl.cs.utk.edu/hpcc/
– See results tab
– b_eff benchmark is a part of this larger database
– more info than just HPL!

Page 67

b_eff

• Standard Benchmark – part of HPC Challenge
– Provides effective bandwidth and latency
• Averages a variety of message sizes and communication patterns
• Determines an effective latency and bandwidth
• b_eff depends on:
– hardware: interconnect, memory
– software: MPI implementation
– tuneable parameters of the os: buffers
– etc.

See : http://www.hlrs.de/organization/par/services/models/mpi/b_eff/

Page 68

Effective Bandwidth Benchmark

Page 69

Example: Send/Recv, ring & random

Page 70

Demo

• running of b_eff

Page 71

Topics: MPI Tracing with PMPI

Page 72

Portable MPI Tracing: PMPI

• An API to MPI for tracing, debugging, performance measurements of MPI applications

• MPI_<command>() calls PMPI_<command>()

• MPI_Pcontrol(int)
– 0: disabled
– 1: enabled – Default Level
– 2: flush trace buffers

Page 73

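Demo : Custom MPI Tracing

#include <stdio.h>
#include <time.h>
#include <mpi.h>

int sends = 0;     /* number of MPI_Send calls seen while tracing is on */
int pcontrol = 1;  /* current profiling level; 1 is the default */

/* Record the requested level, then forward to the real implementation.
   The prototype has to match the one declared in mpi.h. */
int MPI_Pcontrol(const int level, ...) {
  pcontrol = level;
  return PMPI_Pcontrol(level);
}

/* Count the send, then forward to the real MPI_Send.
   MPI-3 headers declare buf as const void *; MPI-2 headers use void *. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) {
  if (pcontrol >= 1) sends++;
  return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

/* Report the per-rank total before shutting MPI down. */
int MPI_Finalize(void) {
  if (pcontrol >= 1) {
    int myrank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("MYTRACE: sends = %d by rank = %d\n", sends, myrank);
  }
  return PMPI_Finalize();
}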

Page 74

Demo

• MPI tracing, custom implementation

Page 75

Topics: Tau & MPI

Page 76

TAU and MPI

• Tau uses the PMPI interface to track MPI calls

• Jumpshot is used as the viewer
– Shows subroutine calls and MPI calls

Page 77

TAU Performance System Architecture

[Figure: TAU performance system architecture diagram. Surviving labels: EPILOG, Paraver]

Page 78

TAU Measurement Options

• Parallel profiling
– Function-level, block-level, statement-level
– Supports user-defined events
– TAU parallel profile data stored during execution
– Hardware counter values
– Support for multiple counters
– Support for call-path profiling
• Tracing
– All profile-level events
– Inter-process communication events
– Timestamp synchronization
– Trace merging and format conversion

Page 79

How To Use TAU?

• Instrumentation
– Application code and libraries
– Selective instrumentation
– Multiple configurations for different measurement options
– Selective measurement control
• Execute “experiments” to produce performance data
– Performance data generated at end or during execution

• Use analysis tools to look at performance results

Page 80

Using Tau

* Setup Environment

- export TAU_MAKEFILE=/usr/local/tau-2.20.1b2/x86_64/lib/Makefile.tau-papi-mpi-pdt-trace

* Use tau_cc.sh, tau_f90.sh, etc. to compile

* Run with mpiexec

* Post-process:

- tau_treemerge.pl

- tau2slog2 tau.trc tau.edf -o tau.slog2

Page 81

Demo

• Tau and Jumpshot

Page 82

Topics: Summary – Material for the Test

Page 83

Summary – Material for the Test

• Performance Counters: 11-15

• Basic Unix Utilities: 16,17,18

• Gprof: 20-21

• Perfsuite: 24,25

• PAPI: 28-46

• TAU: 49,50, 51, 53

• Performance Characteristics: 60, 61, 62, 63, 64

• Benchmarking b_eff: 67

• PMPI: 72, 73

• TAU & MPI: 78, 79, 80

Page 84

Sources

• http://www.cs.uoregon.edu/research/tau/docs.php (tau)

• http://www.llnl.gov/computing/tutorials/mpi_performance/

• http://www.netlib.org/utk/papers/mpi-book/node182.html (mpi profiling interface)

• http://www-unix.mcs.anl.gov/mpi/tutorial/perf/index.html (Gropp course)

• http://www.hlrs.de/organization/par/services/models/mpi/b_eff/ (b_eff bench)

• http://icl.cs.utk.edu/hpcc/ (hpc challenge)
