HIGH PERFORMANCE COMPUTING : MODELS, METHODS, & MEANS PERFORMANCE MEASUREMENT & ANALYSIS
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
PERFORMANCE MEASUREMENT & ANALYSIS
Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
March 1, 2011
Contact Info
• Steven R. Brandt
• [email protected]
• AIM: RegexGuy
Links
• http://cct.lsu.edu/~sbrandt/csc7600l15demos.zip
• X-Ming:
  – http://www.straightrunning.com/XmingNotes/
  – Scroll down, click on Xming public release and install
• Putty:
  – http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
  – Click on putty.exe and save to the desktop
CSC 7600 Lecture 13: Perf. Analysis, Spring 2011
4
Topics
• Introduction
• Measuring System Operation
• Gprof
• Perfsuite
• PAPI
• Tau & PAPI
• Benchmarks b_eff
• MPI Tracing with PMPI
• Tau & MPI
• Summary – Material for the Test
Opening Remarks
• Up until now, 2 strategies for measuring performance:
  – 1) wall-clock time for user applications
  – 2) benchmarks for comparing
    • Machines of different type
    • Machines of different scale
• But, we have identified factors that contribute to system operational performance, e.g.:
  – Effective use of parallelism
  – Cache behavior
• To make better use of HPC systems, need to measure operational behavior
  – How the system is performing during application execution
  – What are the application demands and bottlenecks
• Focus on SMP class system operation during this Segment
  – Next Segment: measuring MPP & cluster behavior
6
What you’ll Need to Know
• This is a skills-oriented lecture
• Understand the kinds and levels of metrics of system and processor operation that you can measure
• Know the kinds of tools that can expose valuable parameters of system & application operation
  – Hardware counters
  – Software instrumentation, data acquisition, and presentation
• Learn the basics of how to use specific tools when running your application code
  – Gprof
  – Perfsuite
  – PAPI
  – TAU
7
Final initial comments (yes, I know that’s an oxymoron)
• We are only going to scratch the surface today
  – Try to get the basic ideas
• This will expose you to a range of concepts, strategies, and tools
  – Lots of details will be left to future discussions
• Over the next weeks, we will extend our abilities in using these tools
  – But don’t hesitate to read through the documentation
  – Hey, try some things out for yourself
  – You’ve got a sandbox to play in (Arete)
8
Hardware Counters
• Each processor has the ability to monitor events of various kinds
• Small set of registers used to count events. Very processor specific.
[Figure: block diagram of a multiprocessor node. Surviving labels: MP with L1/L2 (four instances), L3 (two instances), M1 M2 Mn, Controller, S, NIC, USB, Peripherals, JTAG, Ethernet, PCI-e]
10
[Slides 11-12: figures from Philip J. Mucci, “Performance Analysis Tools and PAPI”, UTK ICL]
Hardware Events
• Floating point operations, Multiplies, Adds, Multiply-Adds, etc.
• L1/L2 cache hits/misses (see http://en.wikipedia.org/wiki/CPU_cache)
• Translation Lookaside Buffer hits/misses (virtual to physical address translation table)
• Branch prediction counters (pipelined systems must guess the next instruction to fetch)
13
A Goal: Optimization
• Compile Time:
  – Various levels enabled by compiler options
  – Examine Compiler Output
• Run Time (Performance Analysis):
  – Instrument code or execution to produce a trace
  – Tools to analyze trace:
    • Standard/basic tool is gprof, but there are many others
    • Note: the Java HotSpot environment collects data about execution and uses it to optimize a program as it runs
14
Performance Analysis Tools
• Widely ported low-level interface to hardware counters: PAPI (Performance API)
  – Supports AIX, Linux, Solaris, and even Windows!
  – http://icl.cs.utk.edu/papi/custom/index.html?lid=62&slid=96
• Many tools built on PAPI
  – Perfsuite (NCSA), psrun command
  – TAU (University of Oregon)
  – etc.
• Useful for:
  – Finding performance bottlenecks
  – Identifying cache problems (badly sized arrays)
15
time
• A simple Unix command to give resource usage.
• Runs a specified program
• time [options] command [arguments ...]
• Gives timing statistics about program run
  – The elapsed real time between invocation and termination
  – User CPU time
  – System CPU time
• See: man time
16
top
• Gives an overview of system process status and resource usage
• Provides a dynamic real-time view of a running system
  – System summary information
  – Currently managed tasks
  – Updates every few (e.g. 5) seconds
• top -hv | -bcisS -d delay -n iterations -p pid [, pid ...]
• See: man top
17
Basic Tools
• Time

$ time du -s /usr > /dev/null 2>&1

real 0m34.274s
user 0m0.082s
sys 0m0.957s

• top/ps

top - 11:29:40 up 49 min, 2 users, load average: 0.32, 0.26, 0.25
Tasks: 125 total, 3 running, 121 sleeping, 0 stopped, 1 zombie
Cpu(s): 4.5%us, 0.3%sy, 0.0%ni, 94.7%id, 0.2%wa, 0.3%hi, 0.0%si, 0.0%st
Mem: 1030940k total, 1013376k used, 17564k free, 124616k buffers
Swap: 2104472k total, 32k used, 2104440k free, 411968k cached

 PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
4136 sbrandt  15  0 35208  15m  10m S    6  1.5 0:03.35 gnome-terminal
3761 root     16  0 82676  50m  12m R    3  5.0 1:02.82 X
5195 sbrandt  16  0  2176 1172  852 R    1  0.1 0:00.03 top
3487 root     17  0  1820  572  496 S    0  0.1 0:00.25 hald-addon-stor
3930 sbrandt  16  0 99.8m  40m  14m S    0  4.0 0:36.27 beagled
18
gprof: quick overview
• gprof
  – a utility that profiles the procedures in a program, available on most Unix systems
• gprof provides information about:
  – An index for each procedure
  – The parents (callers) of each procedure
  – The percentage of CPU time used by a procedure and the procedures it calls
  – Breakdown of time used by the procedure and its descendants
  – Number of times a procedure was called
  – Direct descendants of each procedure
• To use gprof:
  – Compile the source code with the -pg option
  – Running the resulting executable generates an output file, gmon.out
  – For serial programs: gprof exe gmon.out
  – For parallel programs, set the environment variable GMON_OUT_PREFIX (e.g. to gmon.out) so that each process writes its own file, then: gprof exe gmon.out.*
20
GPROF: one minute tutorial
• Steps to use gprof:
  – gcc -pg -g -o prog prog.c
  – ./prog
  – gprof prog gmon.out
• More reading: http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html
• Finds subroutines where the most time is spent
• Cannot tell you why some routines are more costly than others. Need more information...
21
Demo of gprof
22
[Slide 24: figure from Philip J. Mucci, “Performance Analysis Tools and PAPI”, UTK ICL]
Using psrun
• psrun cmd (e.g. psrun du -s /usr)
  – This measures the performance counters used by the du command. No special compilation of du is required for this to work.
• psprocess cmd.* (e.g. psprocess du.*.xml)
  – At the bottom of the output, you will see summary events about numerous counters.
25
Demo of psrun
26
[Slides 28-41: figures from Philip J. Mucci, “Performance Analysis Tools and PAPI”, UTK ICL]
By hand: Verifying the PAPI Version
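/* When hand-instrumenting, first check that the PAPI library
   you link against matches the header you compiled with. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>
...
/* Verify the PAPI version */
int v = PAPI_library_init(PAPI_VER_CURRENT);
if (v != PAPI_VER_CURRENT) {
    fprintf(stderr, "Bad PAPI version\n");
    exit(2);
}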
42
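By Hand: Measuring PAPI Counters
Use "papi_avail -a" to identify counters
Link with -lpapi
#include <stdio.h>
#include "papi.h"

#define NUM 3
/* Events to count: floating point operations, total instructions,
   L1 data cache misses. */
int events[NUM] = { PAPI_FP_OPS, PAPI_TOT_INS, PAPI_L1_DCM };

int main(int argc, char **argv) {
    int i;
    int r;
    long_long values[NUM];
    /* PAPI_start_counters/PAPI_stop_counters belong to the classic
       high-level API, which was removed in PAPI 6.0. */
    r = PAPI_start_counters(events, NUM);
    ...
    r = PAPI_stop_counters(values, NUM);
    printf("end ret=%d\n", r);
    for (i = 0; i < NUM; i++) {
        printf("ctr[%d]: %f\n", i, (double)values[i]);
    }
    return 0;
}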
43
Demo: Hand instrumentation with PAPI
44
Statistical profiling
• profil() - Unix library call that periodically samples the program counter of a running program. Identifies the subroutines where the code spends most of its time.
• Used by Gprof
• PAPI_profil() - Emulates profil(), but looks at a specific hardware counter. Identifies file/line where code spends most time.
45
Using psrun to find hot spots
• gcc -g -o cmd cmd.c
• psrun -C -c papi_profile_cycles.xml cmd
– "-C" Instructs papi to use xml configurations that are in the install path rather than current directory.
– "-c papi_profile_cycles.xml" Use the named config file rather than the default.
– "papi_profile_cycles.xml" directs papi to collect file/line data.
• psprocess cmd.*.xml
– display results
46
Demo : 2nd Demo of psrun
47
[Slides 49-51: figures from Philip J. Mucci, “Performance Analysis Tools and PAPI”, UTK ICL]
Measuring PAPI Counters with TAU
* Set up environment
- Select counters
- export TAU_METRICS=TIME:PAPI_FP_OPS:PAPI_TOT_INS
- Select TAU makefile
- export TAU_MAKEFILE=${TAU}/lib/Makefile.tau-papi-pdt
- export TAU_MAKEFILE=${TAU}/lib/Makefile.tau-papi-mpi-pdt-trace
* Compile with special TAU compiler:
- e.g. tau_cc.sh cmd.c
* Run your code
* Use pprof to read the profile files: profile.*
52
More TAU options...
• Diagnostic:
– export TAU_OPTIONS=-optKeepFiles
– Examine instrumented code (if you want to):
• eg. tau_cc.sh cmd.c
• vi cmd.inst.c
• Throttling:
– export TAU_THROTTLE=1
– export TAU_THROTTLE_NUMCALLS=400000
– export TAU_THROTTLE_PERCALL=3000
• Exploring Data Graphically:
– Windows users:
• Start Xming
• Enable X11 forwarding on Putty
– Linux users:
• ssh -X [email protected]
– Run paraprof
53
[Slides 54-56: figures from Philip J. Mucci, “Performance Analysis Tools and PAPI”, UTK ICL]
Demo of TAU
57
Topics
• Introduction
• Measuring System Operation
• Gprof
• Perfsuite
• PAPI
• Tau & PAPI
• Performance Characteristics
• Benchmarks b_eff
• MPI Tracing with PMPI
• Tau & MPI
• Summary – Material for the Test
58
59
Where are we?
• Three classes of parallel computing
  – Capacity
  – Cooperative
  – Capability
• Three execution models
  – Throughput
  – Communicating sequential processes (message passing)
  – Shared memory multithreaded
• Programming formalisms
  – Condor
  – MPI
  – OpenMP / Pthreads
• More performance measurement
  – For cooperative/message passing/MPI
60
What has changed? SMP to MPP
• SMP – symmetric multiprocessor
  – Shared memory
    • UMA – uniform memory access with cache coherence
  – Multithreaded parallelism
  – Communication through main memory
  – Not scalable
  – Programming in OpenMP
  – DSM and PGAS provide alternative shared memory structures
    • DSM – distributed shared memory (with cache coherence)
    • PGAS – partitioned global address space (without cache coherence)
    • Both are NUMA
• MPP – massively parallel processor
  – Distributed memory
    • NUMA – non-uniform memory access
  – Concurrent sequential processes parallelism
  – Communication through messages between nodes
  – Scalable
  – Programming in MPI
  – Same for commodity clusters but usually with weaker networks
61
MPI Performance Characteristics
• Latency
  – Time to send first bits of data across link to remote node
  – Does not include overhead
• Bandwidth
  – Rate of data transfer across link to remote node
• Buffers
  – System or user buffers take up time to manage capacity etc.
• Blocking versus Asynchronous
  – Forced ordering of computation and communication
62
Performance Factors
• Platform / Architecture Related:
  – CPU - clock speed, number of CPUs
  – Memory subsystem - memory and cache configuration, memory-cache-CPU bandwidth, memory copy bandwidth
  – Network adapters - type, latency and bandwidth characteristics
  – Operating system characteristics - many
• Network Related:
  – Hardware - ethernet, FDDI, switch, intermediate hardware (routers)
  – Protocols - TCP/IP, UDP/IP, other
  – Configuration, routing, etc.
  – Network tuning options ("no" command)
  – Network contention / saturation
source: http://www.llnl.gov/computing/tutorials/mpi_performance/
63
Performance Factors (2)
• Application Related:
  – Algorithm efficiency and scalability
  – Communication to computation ratios
  – Load balance
  – Memory usage patterns
  – I/O
  – Message size used
  – Types of MPI routines used - blocking, non-blocking, point-to-point, collective communications
• MPI Implementation Related:
  – Message buffering
  – Message passing protocols - eager, rendezvous, other
  – Sender-Receiver synchronization - polling, interrupt
  – Routine internals - efficiency of algorithm used to implement a given routine
source: http://www.llnl.gov/computing/tutorials/mpi_performance/
64
Performance Impact of Message Sizes
• Message size can be a very significant contributor to MPI application performance. In most cases, increasing the message size will yield better performance.
• For communication intensive applications, algorithm modifications that take advantage of message size "economies of scale" may be worth the effort. Performance can often improve significantly within a relatively small range of message sizes.
• The following three graphs demonstrate how increasing message size can improve bandwidth for different message size ranges
66
HPC Challenge Benchmarks
• HPC Challenge: http://icl.cs.utk.edu/hpcc/
  – See results tab
  – b_eff benchmark is a part of this larger database
  – more info than just HPL!
67
b_eff
• Standard Benchmark – part of HPC Challenge
  – Provides effective bandwidth and latency
• Averages a variety of message sizes and communication patterns
• Determines an effective latency and bandwidth
• b_eff depends on:
  – hardware: interconnect, memory
  – software: MPI implementation
  – tuneable parameters of the OS: buffers
  – etc.
See : http://www.hlrs.de/organization/par/services/models/mpi/b_eff/
68
Effective Bandwidth Benchmark
69
Example: Send/Recv, ring & random
70
Demo
• running of b_eff
72
Portable MPI Tracing: PMPI
• An API to MPI for tracing, debugging, performance measurements of MPI applications
• MPI_<command>() calls PMPI_<command>()
• MPI_Pcontrol(int)
  – 0: disabled
  – 1: enabled (default level)
  – 2: flush trace buffers
73
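Demo: Custom MPI Tracing
#include <stdio.h>
#include <time.h>
#include <mpi.h>

int sends = 0;     /* number of MPI_Send calls seen by this rank */
int pcontrol = 1;  /* current tracing level (1 = enabled) */

/* Record the requested level, then forward to the real routine.
   The MPI standard declares MPI_Pcontrol as variadic. */
int MPI_Pcontrol(const int n, ...) {
    pcontrol = n;
    return PMPI_Pcontrol(n);
}

/* Count the send, then forward to the real routine.
   buf is const-qualified as in MPI-3 headers; older MPI-2 headers
   declare it as void *buf. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    if (pcontrol >= 1)
        sends++;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

/* Print the per-rank send count before shutting MPI down. */
int MPI_Finalize(void) {
    if (pcontrol >= 1) {
        int myrank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        printf("MYTRACE: sends = %d by rank = %d\n", sends, myrank);
    }
    return PMPI_Finalize();
}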
74
Demo
• MPI tracing, custom implementation
76
TAU and MPI
• Tau uses the PMPI interface to track MPI calls
• Jumpshot is used as the viewer
  – Shows subroutine calls and MPI calls
77
TAU Performance System Architecture
[Figure: TAU architecture diagram; surviving labels: EPILOG, Paraver]
78
TAU Measurement Options
• Parallel profiling
  – Function-level, block-level, statement-level
  – Supports user-defined events
  – TAU parallel profile data stored during execution
  – Hardware counter values
  – Support for multiple counters
  – Support for call-path profiling
• Tracing
  – All profile-level events
  – Inter-process communication events
  – Timestamp synchronization
  – Trace merging and format conversion
79
How To Use TAU?
• Instrumentation
  – Application code and libraries
  – Selective instrumentation
  – Multiple configurations for different measurement options
  – Selective measurement control
• Execute “experiments” to produce performance data
  – Performance data generated at end or during execution
• Use analysis tools to look at performance results
80
Using Tau
* Setup Environment
- export TAU_MAKEFILE=/usr/local/tau-2.20.1b2/x86_64/lib/Makefile.tau-papi-mpi-pdt-trace
* Use tau_cc.sh, tau_f90.sh, etc. to compile
* Run with mpiexec
* Post-process:
- tau_treemerge.pl
- tau2slog2 tau.trc tau.edf -o tau.slog2
81
Demo
• Tau and Jumpshot
83
Summary – Material for the Test
• Performance Counters: 11-15
• Basic Unix Utilities: 16,17,18
• Gprof: 20-21
• Perfsuite: 24,25
• PAPI: 28-46
• TAU: 49,50, 51, 53
• Performance Characteristics: 60, 61, 62, 63, 64
• Benchmarking b_eff: 67
• PMPI: 72, 73
• TAU & MPI: 78, 79, 80
84
Sources
• http://www.cs.uoregon.edu/research/tau/docs.php (tau)
• http://www.llnl.gov/computing/tutorials/mpi_performance/
• http://www.netlib.org/utk/papers/mpi-book/node182.html (mpi profiling interface)
• http://www-unix.mcs.anl.gov/mpi/tutorial/perf/index.html (Gropp course)
• http://www.hlrs.de/organization/par/services/models/mpi/b_eff/ (b_eff bench)
• http://icl.cs.utk.edu/hpcc/ (hpc challenge)
85