HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS PERFORMANCE MEASUREMENT & ANALYSIS


Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
March 1, 2011

Contact Info

• Steven R. Brandt
• [email protected]
• AIM: RegexGuy

Links

• http://cct.lsu.edu/~sbrandt/csc7600l15demos.zip
• X-Ming:
  – http://www.straightrunning.com/XmingNotes/
  – Scroll down, click on Xming public release and install
• Putty:
  – http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
  – Click on putty.exe and save to the desktop

CSC 7600 Lecture 13: Perf. Analysis, Spring 2011

4

Topics

• Introduction

• Measuring System Operation

• Gprof

• Perfsuite

• PAPI

• Tau & PAPI

• Benchmarks b_eff

• MPI Tracing with PMPI

• Tau & MPI

• Summary – Material for the Test


5




Opening Remarks

• Up until now, 2 strategies for measuring performance:
  – 1) wall-clock time for user applications
  – 2) benchmarks for comparing
    • Machines of different type
    • Machines of different scale
• But we have identified factors that contribute to system operational performance, e.g.:
  – Effective use of parallelism
  – Cache behavior
• To make better use of HPC systems, need to measure operational behavior
  – How the system is performing during application execution
  – What are the application demands and bottlenecks
• Focus on SMP class system operation during this Segment
  – Next Segment: measuring MPP & cluster behavior

6


What you’ll Need to Know

• This is a skills-oriented lecture

• Understand the kinds and levels of metrics of system and processor operation that you can measure

• Know the kinds of tools that can expose valuable parameters of system & application operation
  – Hardware counters
  – Software instrumentation, data acquisition, and presentation
• Learn the basics of how to use specific tools when running your application code
  – Gprof
  – Perfsuite
  – PAPI
  – TAU

7


Final initial comments (yes, I know that’s an oxymoron)

• We are only going to scratch the surface today
  – Try to get the basic ideas
• This will expose you to a range of concepts, strategies, and tools
  – Lots of details will be left to future discussions
• Over the next weeks, we will extend our abilities in using these tools
  – But don’t hesitate to read through the documentation
  – Hey, try some things out for yourself
  – You’ve got a sandbox to play in (Arete)

8


9




Hardware Counters

• Each processor has the ability to monitor events of various kinds

• Small set of registers used to count events. Very processor specific.

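[Figure: block diagram of an SMP node. Labels: four microprocessors (MP), each with L1 and L2 caches; two shared L3 caches; memory banks M1, M2, ... Mn; Controller; S; NICs; USB peripherals; JTAG; Ethernet; PCI-e]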

10


Philip J. Mucci, “Performance Analysis Tools and PAPI” UTK ICL

11



12


Hardware Events

• Floating point operations, Multiplies, Adds, Multiply-Adds, etc.

• L1/L2 cache hits/misses (see http://en.wikipedia.org/wiki/CPU_cache)

• Translation Lookaside Buffer hits/misses (virtual to physical address translation table)

• Branch prediction counters (pipelined systems must guess the next instruction to fetch)

13


A Goal: Optimization

• Compile Time:
  – Various levels enabled by compiler options
  – Examine Compiler Output
• Run Time (Performance Analysis):
  – Instrument code or execution to produce a trace
  – Tools to analyze trace:
    • Standard/basic tool is gprof, but there are many others
    • Note: Java Hot-Spot environment collects data about execution and uses it to optimize a program as it runs

14


Performance Analysis Tools

• Widely Ported Low-Level Interface to hardware counters: PAPI (Performance API)
  – Supports AIX, Linux, Solaris, and even Windows!
  – http://icl.cs.utk.edu/papi/custom/index.html?lid=62&slid=96
• Many tools built on PAPI
  – Perfsuite (NCSA), psrun command
  – TAU (University of Oregon)
  – etc.
• Useful for:
  – Finding performance bottlenecks
  – Identifying cache problems (badly sized arrays)
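The "cache problems" bullet can be made concrete with a small sketch. The two routines below compute the same sum over the same matrix, but one walks memory contiguously and the other strides through it. The matrix size N and all function names are assumptions chosen for illustration; how large the difference is on a given machine depends on its cache hierarchy, and a counter such as PAPI_L1_DCM would be one way to observe it.

```c
#include <stddef.h>

#define N 512  /* hypothetical matrix dimension, chosen only for illustration */

static double grid[N][N];

/* Fill the matrix with a simple known pattern: grid[i][j] = i + j. */
void fill_grid(void) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            grid[i][j] = (double)(i + j);
}

/* Row-major traversal: consecutive accesses touch adjacent memory. */
double sum_row_major(void) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += grid[i][j];
    return s;
}

/* Column-major traversal: consecutive accesses are N * sizeof(double)
 * bytes apart, so each one may land on a different cache line. */
double sum_col_major(void) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += grid[i][j];
    return s;
}
```

Both functions return the same value; only the access pattern differs, which is why wall-clock time alone cannot explain a gap between them and hardware counters are needed.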

15


time

• A simple Unix command to give resource usage.

• Runs a specified program

• time [options] command [arguments …]
• Gives timing statistics about program run
  – The elapsed real time between invocation and termination
  – User CPU time
  – System CPU time

• See: man time
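The three figures that time reports can also be collected from inside a program. The following is a minimal sketch using the POSIX calls clock_gettime() and getrusage(); the struct, the measure() helper, and the spin() workload are hypothetical names introduced here, not part of any tool discussed in this lecture.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

/* The three figures that `time` prints, in seconds. */
struct run_times {
    double real;  /* elapsed wall-clock time */
    double user;  /* CPU time spent in user mode */
    double sys;   /* CPU time spent in the kernel on the process's behalf */
};

static double tv_seconds(struct timeval tv) {
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Run work(arg) and report the real, user, and system time it consumed. */
struct run_times measure(void (*work)(void *), void *arg) {
    struct timespec t0, t1;
    struct rusage r0, r1;
    struct run_times out;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    getrusage(RUSAGE_SELF, &r0);
    work(arg);
    getrusage(RUSAGE_SELF, &r1);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    out.real = (double)(t1.tv_sec - t0.tv_sec)
             + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    out.user = tv_seconds(r1.ru_utime) - tv_seconds(r0.ru_utime);
    out.sys  = tv_seconds(r1.ru_stime) - tv_seconds(r0.ru_stime);
    return out;
}

/* Hypothetical CPU-bound workload used only to exercise measure(). */
void spin(void *arg) {
    volatile unsigned long x = 0;
    unsigned long n = *(unsigned long *)arg;
    for (unsigned long i = 0; i < n; i++)
        x += i;
}
```

A large gap between real time and user + system time usually means the program spent its time waiting (for example on I/O) rather than computing.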

16


top

• Gives an overview of system process status and resource usage

• Provides a dynamic realtime view of a running system
  – System summary information
  – Currently managed tasks
  – Updates every few (e.g. 5) seconds

• top -hv | -bcisS -d delay -n iterations -p pid [, pid ...]

• See: man top

17


Basic Tools

• Time

$ time du -s /usr > /dev/null 2>&1

real 0m34.274s
user 0m0.082s
sys  0m0.957s

• top/ps

top - 11:29:40 up 49 min, 2 users, load average: 0.32, 0.26, 0.25
Tasks: 125 total, 3 running, 121 sleeping, 0 stopped, 1 zombie
Cpu(s): 4.5%us, 0.3%sy, 0.0%ni, 94.7%id, 0.2%wa, 0.3%hi, 0.0%si, 0.0%st
Mem: 1030940k total, 1013376k used, 17564k free, 124616k buffers
Swap: 2104472k total, 32k used, 2104440k free, 411968k cached

 PID USER    PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
4136 sbrandt 15  0 35208  15m  10m S    6  1.5 0:03.35 gnome-terminal
3761 root    16  0 82676  50m  12m R    3  5.0 1:02.82 X
5195 sbrandt 16  0  2176 1172  852 R    1  0.1 0:00.03 top
3487 root    17  0  1820  572  496 S    0  0.1 0:00.25 hald-addon-stor
3930 sbrandt 16  0 99.8m  40m  14m S    0  4.0 0:36.27 beagled

18


19




gprof : quick overview

• gprof
  – a utility which profiles procedures in programs, available in most Unix systems.
• gprof provides information about:
  – An index for each procedure
  – Parents of each procedure
  – The percentage of CPU time utilized by a procedure and its calls
  – Breakdown of time used by the procedure and its descendants
  – Number of times a procedure was called
  – Direct descendants of each procedure
• To use gprof:
  – Compile the source code with the -pg option
  – Running the resulting executable generates an output file, gmon.out
  – For serial programs: gprof exe gmon.out
  – For parallel programs, set env variable GMON_OUT_PREFIX so that each process writes its own profile file, then: gprof exe gmon.out.*

20


GPROF: one minute tutorial

• Steps to use gprof:
  – gcc -pg -g -o prog prog.c
  – ./prog
  – gprof prog gmon.out
• More reading: http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html
• Finds subroutines where the most time is spent
• Cannot tell you why some routines are more costly than others. Need more information...
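To try the steps above, it helps to have a program in which one routine is clearly more expensive than another. The routines below are a hypothetical workload for that purpose (none of these names come from the lecture demos): cheap_step does linear work, costly_step does quadratic work, and run_workload calls each the same number of times.

```c
#include <stddef.h>

/* Cheap routine: a single pass over the input. */
double cheap_step(const double *v, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Costly routine: a quadratic number of operations on the same input. */
double costly_step(const double *v, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += v[i] * v[j];
    return s;
}

/* Driver: calls both routines `repeats` times on a vector of ones.
 * n is capped at 256 so the buffer can live on the stack. */
double run_workload(size_t n, int repeats) {
    double v[256];
    double total = 0.0;
    if (n > 256)
        n = 256;
    for (size_t i = 0; i < n; i++)
        v[i] = 1.0;
    for (int r = 0; r < repeats; r++)
        total += cheap_step(v, n) + costly_step(v, n);
    return total;
}
```

Calling run_workload from a main() in prog.c and building without optimization (so the routines are not inlined) should give a flat profile in which both routines show the same call count while costly_step accounts for most of the time. That is exactly the situation where gprof says where the time goes but not why.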

21


Demo of gprof

22


23





24


Using psrun

psrun cmd (e.g. psrun du -s /usr)
  – This test will measure performance counters for the du command. No special compilation of du is required for this to work.

psprocess cmd.* (e.g. psprocess du.*.xml)
  – At the bottom of this file, you will see summary events about numerous counters.

25


Demo of psrun

26


27





28



29



30



31



32



33



34



35



36



37



38



39



40



41


By hand: Verifying the PAPI Version

// When hand-instrumenting, you need to check the PAPI version

#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

...

/* Verifying PAPI Version */
int v = PAPI_library_init(PAPI_VER_CURRENT);
if (v != PAPI_VER_CURRENT) {
    fprintf(stderr, "Bad PAPI version\n");
    exit(2);
}

42


Use "papi_avail -a" to identify counters

Link with -lpapi

By Hand: Measuring PAPI Counters

#include <stdio.h>
#include "papi.h"

#define NUM 3

int events[NUM] = { PAPI_FP_OPS, PAPI_TOT_INS, PAPI_L1_DCM };

int main(int argc, char **argv) {
    int i;
    int r;
    long_long values[NUM];

    /* Start counting the selected events */
    r = PAPI_start_counters(events, NUM);

    ...

    /* Stop counting and read the totals into values[] */
    r = PAPI_stop_counters(values, NUM);
    printf("end ret=%d\n", r);
    for (i = 0; i < NUM; i++) {
        printf("ctr[%d]: %f\n", i, (double)values[i]);
    }
    return 0;
}

43


Demo: Hand instrumentation with PAPI

44


Statistical profiling

• profil() - Unix library call that periodically samples the program counter. Identifies subroutines where code spends most time.

• Used by Gprof

• PAPI_profil() - Emulates profil(), but looks at a specific hardware counter. Identifies file/line where code spends most time.
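The idea behind statistical profiling can be sketched without profil() itself. The code below is a simplified stand-in, not how gprof or PAPI is implemented: instead of recording the program counter, it charges each timer tick to a "region" flag that the program sets by hand. It uses the POSIX calls sigaction() and setitimer() with ITIMER_PROF, which delivers SIGPROF as the process consumes CPU time. All function and variable names are assumptions for illustration.

```c
#define _DEFAULT_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/time.h>

#define REGIONS 2

/* Which region of the program is executing right now (set by the program). */
static volatile sig_atomic_t current_region = 0;
/* Number of timer samples that landed in each region. */
static volatile sig_atomic_t samples[REGIONS];

/* SIGPROF handler: charge one sample to whatever region is running. */
static void on_sample(int signo) {
    (void)signo;
    samples[current_region]++;
}

/* Start sampling every interval_us (> 0) microseconds of CPU time.
 * Returns 0 on success, -1 on failure. */
int start_sampling(long interval_us) {
    struct sigaction sa;
    struct itimerval it;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sample;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGPROF, &sa, NULL) != 0)
        return -1;
    it.it_interval.tv_sec = interval_us / 1000000;
    it.it_interval.tv_usec = interval_us % 1000000;
    it.it_value = it.it_interval;
    return setitimer(ITIMER_PROF, &it, NULL);
}

/* Stop the sampling timer. */
int stop_sampling(void) {
    struct itimerval off;
    memset(&off, 0, sizeof off);
    return setitimer(ITIMER_PROF, &off, NULL);
}

/* Burn CPU inside the given region until it has collected `want` samples. */
void spin_in_region(int region, int want) {
    volatile unsigned long x = 0;
    current_region = region;
    while (samples[region] < want)
        x++;
}

/* Read back the sample count for a region. */
int samples_in(int region) {
    return (int)samples[region];
}
```

The ratio of samples between regions approximates the ratio of CPU time spent in them. The result is statistical: short-lived code may receive no samples at all, which is one reason sampled profiles need runs that are long compared with the sampling interval.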

45


Using psrun to find hot spots

• gcc -g -o cmd cmd.c

• psrun -C -c papi_profile_cycles.xml cmd

– "-C" Instructs papi to use xml configurations that are in the install path rather than current directory.

– "-c papi_profile_cycles.xml" Use the named config file rather than the default.

– "papi_profile_cycles.xml" directs papi to collect file/line data.

• psprocess cmd.*.xml

– display results

46


Demo : 2nd Demo of psrun

47


48





49



50



51


Measuring PAPI Counters with TAU

* Set up environment

- Select counters

- export TAU_METRICS=TIME:PAPI_FP_OPS:PAPI_TOT_INS

- Select TAU makefile

- export TAU_MAKEFILE=${TAU}/lib/Makefile.tau-papi-pdt

- export TAU_MAKEFILE=${TAU}/lib/Makefile.tau-papi-mpi-pdt-trace

* Compile with special TAU compiler:

- e.g. tau_cc.sh cmd.c

* Run your code

* Use pprof to read the profile files: profile.*

52


More TAU options...

• Diagnostic:

– export TAU_OPTIONS=-optKeepFiles

– Examine instrumented code (if you want to):

• e.g. tau_cc.sh cmd.c

• vi cmd.inst.c

• Throttling:

– export TAU_THROTTLE=1

– export TAU_THROTTLE_NUMCALLS=400000

– export TAU_THROTTLE_PERCALL=3000

• Exploring Data Graphically:

– Windows users:

• Start Xming

• Enable X11 forwarding on Putty

– Linux users:

• ssh -X [email protected]

– Run paraprof
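The throttling settings above are easier to read with the rule written out. As we understand the TAU documentation, a routine stops being measured once it has been called more than TAU_THROTTLE_NUMCALLS times while averaging less than TAU_THROTTLE_PERCALL microseconds of inclusive time per call. The predicate below is a sketch of that rule under this assumption; the struct and function names are hypothetical and are not part of TAU.

```c
#include <stdbool.h>

/* Hypothetical per-routine totals, as a profiler might accumulate them. */
struct routine_stats {
    unsigned long calls;     /* number of times the routine was entered */
    double inclusive_usec;   /* total inclusive time, in microseconds */
};

/* True when the routine is called often and each call is short: the case
 * in which measurement overhead dominates and throttling is worthwhile. */
bool should_throttle(struct routine_stats s,
                     unsigned long numcalls_limit,
                     double percall_limit_usec) {
    if (s.calls == 0 || s.calls <= numcalls_limit)
        return false;
    return (s.inclusive_usec / (double)s.calls) < percall_limit_usec;
}
```

With the values on this slide (400000 calls, 3000 microseconds per call), a routine called 500000 times at 10 microseconds per call would be throttled, while one called just as often at 5000 microseconds per call would not.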

53



54



55



56


Demo of TAU

57


Topics

• Introduction


• Gprof

• Perfsuite

• PAPI

• Tau & PAPI

• Performance Characteristics



• Tau & MPI


58


59

Where are we?

• Three classes of parallel computing
  – Capacity
  – Cooperative
  – Capability
• Three execution models
  – Throughput
  – Communicating sequential processes (message passing)
  – Shared memory multithreaded
• Programming formalisms
  – Condor
  – MPI
  – OpenMP / Pthreads
• More performance measurement
  – For cooperative/message passing/MPI


60

What has changed? SMP to MPP

• SMP – symmetric multiprocessor
  – Shared memory
    • UMA – uniform memory access with cache coherence
  – Multithreaded parallelism
  – Communication through main memory
  – Not scalable
  – Programming in OpenMP
  – DSM and PGAS provide alternative shared memory structures
    • DSM – distributed shared memory (with cache coherence)
    • PGAS – Partitioned global address space (without cache coherence)
    • Both are NUMA
• MPP – massively parallel processor
  – Distributed memory
    • NUMA – non-uniform memory access
  – Concurrent sequential processes parallelism
  – Communication through messages between nodes
  – Scalable
  – Programming in MPI
  – Same for commodity clusters but usually with weaker networks


61

MPI Performance Characteristics

• Latency
  – Time to send first bits of data across link to remote node
  – Does not include overhead
• Bandwidth
  – Rate of data transfer across link to remote node
• Buffers
  – System or user buffers take up time to manage capacity etc.
• Blocking versus Asynchronous
  – Forced ordering of computation and communication


62

Performance Factors

• Platform / Architecture Related:
  – cpu - clock speed, number of cpus
  – Memory subsystem - memory and cache configuration, memory-cache-cpu bandwidth, memory copy bandwidth
  – Network adapters - type, latency and bandwidth characteristics
  – Operating system characteristics - many
• Network Related:
  – Hardware - ethernet, FDDI, switch, intermediate hardware (routers)
  – Protocols - TCP/IP, UDP/IP, other
  – Configuration, routing, etc.
  – Network tuning options ("no" command)
  – Network contention / saturation

source : http://www.llnl.gov/computing/tutorials/mpi_performance/


63

Performance Factors (2)

• Application Related:
  – Algorithm efficiency and scalability
  – Communication to computation ratios
  – Load balance
  – Memory usage patterns
  – I/O
  – Message size used
  – Types of MPI routines used - blocking, non-blocking, point-to-point, collective communications
• MPI Implementation Related:
  – Message buffering
  – Message passing protocols - eager, rendezvous, other
  – Sender-Receiver synchronization - polling, interrupt
  – Routine internals - efficiency of algorithm used to implement a given routine

source : http://www.llnl.gov/computing/tutorials/mpi_performance/


64

Performance Impact of Message Sizes

• Message size can be a very significant contributor to MPI application performance. In most cases, increasing the message size will yield better performance.

• For communication intensive applications, algorithm modifications that take advantage of message size "economies of scale" may be worth the effort. Performance can often improve significantly within a relatively small range of message sizes.

• The following three graphs demonstrate how increasing message size can improve bandwidth for different message size ranges
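The message-size effect described on this slide can be approximated with a simple linear cost model: the time for one message is a fixed latency plus the message size divided by the peak bandwidth. This model is an assumption introduced here for illustration (real networks also show protocol switches such as eager versus rendezvous), and the function names are hypothetical.

```c
/* Time in seconds to move `bytes` under a linear cost model:
 * a fixed startup latency plus size divided by peak bandwidth. */
double transfer_time(double bytes, double latency_s, double peak_bytes_per_s) {
    return latency_s + bytes / peak_bytes_per_s;
}

/* Bandwidth the application actually observes for one message of this size. */
double effective_bandwidth(double bytes, double latency_s,
                           double peak_bytes_per_s) {
    return bytes / transfer_time(bytes, latency_s, peak_bytes_per_s);
}

/* Message size at which half of the peak bandwidth is reached
 * (the size whose transfer term equals the latency term). */
double half_peak_size(double latency_s, double peak_bytes_per_s) {
    return latency_s * peak_bytes_per_s;
}
```

For small messages the latency term dominates and effective bandwidth is far below peak; as the message grows, effective bandwidth approaches the peak without reaching it. With an assumed latency of 10 microseconds and a peak of 1 GB/s, the half-peak message size under this model is 10 KB.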


65




66

HPC Challenge Benchmarks

• HPC Challenge: http://icl.cs.utk.edu/hpcc/
  – See results tab
  – b_eff benchmark is a part of this larger database
  – more info than just HPL!


67

b_eff

• Standard Benchmark – part of HPC Challenge
  – Provides effective bandwidth and latency

• Averages a variety of message sizes and communication patterns

• Determines an effective latency and bandwidth

• b_eff depends on:
  – hardware: interconnect, memory
  – software: MPI implementation
  – tuneable parameters of the OS: buffers
  – etc.

See : http://www.hlrs.de/organization/par/services/models/mpi/b_eff/


68

Effective Bandwidth Benchmark


69

Example: Send/Recv, ring & random


70

Demo

• running of b_eff


71




72

Portable MPI Tracing: PMPI

• An API to MPI for tracing, debugging, performance measurements of MPI applications

• MPI_<command>() calls PMPI_<command>()

• MPI_Pcontrol(int)
  – 0: disabled
  – 1: enabled (default level)
  – 2: flush trace buffers


73

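Demo : Custom MPI Tracing

#include <stdio.h>
#include <mpi.h>

int sends = 0;      /* number of MPI_Send calls counted so far */
int pcontrol = 1;   /* current profiling level; 1 is the default */

/* Intercept MPI_Pcontrol to remember the requested level.
   The prototype matches the one declared in mpi.h. */
int MPI_Pcontrol(const int n, ...) {
    pcontrol = n;
    return PMPI_Pcontrol(n);
}

/* Intercept MPI_Send, count it, then forward to the real implementation.
   Note: const void *buf is the MPI-3 signature; MPI-2 headers use void *buf. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    if (pcontrol >= 1) sends++;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

/* Intercept MPI_Finalize to print the per-rank total before shutting down */
int MPI_Finalize(void) {
    if (pcontrol >= 1) {
        int myrank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        printf("MYTRACE: sends = %d by rank = %d\n", sends, myrank);
    }
    return PMPI_Finalize();
}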


74

Demo

• MPI tracing, custom implementation


75




76

TAU and MPI

• Tau uses the PMPI interface to track MPI calls

• Jumpshot is used as the viewer
  – Shows subroutine calls and MPI calls


77

TAU Performance System Architecture

[Figure: TAU performance system architecture diagram. Surviving labels: EPILOG, Paraver]


78

TAU Measurement Options

• Parallel profiling
  – Function-level, block-level, statement-level
  – Supports user-defined events
  – TAU parallel profile data stored during execution
  – Hardware counter values
  – Support for multiple counters
  – Support for call-path profiling
• Tracing
  – All profile-level events
  – Inter-process communication events
  – Timestamp synchronization
  – Trace merging and format conversion


79

How To Use TAU?

• Instrumentation
  – Application code and libraries
  – Selective instrumentation
  – Multiple configurations for different measurement options
  – Selective measurement control
• Execute “experiments” to produce performance data
  – Performance data generated at end or during execution
• Use analysis tools to look at performance results


80

Using Tau

* Setup Environment

- export TAU_MAKEFILE=/usr/local/tau-2.20.1b2/x86_64/lib/Makefile.tau-papi-mpi-pdt-trace

* Use tau_cc.sh, tau_f90.sh, etc. to compile

* Run with mpiexec

* Post-process:

- tau_treemerge.pl

- tau2slog2 tau.trc tau.edf -o tau.slog2


81

Demo

• Tau and Jumpshot


82




83

Summary – Material for the Test

• Performance Counters: 11-15

• Basic Unix Utilities: 16,17,18

• Gprof: 20-21

• Perfsuite: 24,25

• PAPI: 28-46

• TAU: 49,50, 51, 53

• Performance Characteristics: 60, 61, 62, 63, 64

• Benchmarking b_eff: 67

• PMPI: 72, 73

• TAU & MPI: 78, 79, 80


84

Sources

• http://www.cs.uoregon.edu/research/tau/docs.php (tau)

• http://www.llnl.gov/computing/tutorials/mpi_performance/

• http://www.netlib.org/utk/papers/mpi-book/node182.html (mpi profiling interface)

• http://www-unix.mcs.anl.gov/mpi/tutorial/perf/index.html (Gropp course)

• http://www.hlrs.de/organization/par/services/models/mpi/b_eff/ (b_eff bench)

• http://icl.cs.utk.edu/hpcc/ (hpc challenge)


85
