© 2011 Pittsburgh Supercomputing Center
Getting the Most Out of the TeraGrid SGI Altix UV Systems
Mahin Mahmoodi
Raghu Reddy
TeraGrid 11 Conference
July 18, 2011
Salt Lake City
Outline
• Blacklight memory BW and latency w.r.t. processor-core mapping
• GRU environment variable
• Portable performance evaluation tools on Blacklight
  – Case study: PSC Hybrid Benchmark
  – PAPI
  – IPM
  – SCALASCA
  – TAU
Blacklight memory BW and latency with respect to processor-core mapping
Blacklight per Blade/Processor/Core Memory Layout
• Node: 1 blade + 1 HUB
• Blade: 2 processors (sockets) sharing 128 GB of local memory, attached to the HUB via QPI
• Processor: 8 cores
• Caches: L1 64 KB per core, L2 256 KB per core, L3 (last-level cache) 24 MB shared per processor
Blacklight Node Pair Architecture
• A “node” is one blade: a UV Hub plus two Intel Nehalem EX-8 sockets connected by QPI, each socket with 64 GB of RAM (128 GB per node)
• A “node pair” is two such nodes whose UV Hubs are connected by NUMAlink-5
HPCC STREAM Benchmark
• Memory bandwidth is the rate at which data can be read from or stored into memory by the processor
• STREAM measures sustainable main-memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels
• Triad kernel: compute a = b + α·c, where b and c are vectors of random 64-bit floating-point values and α is a given scalar
• Problem size: STREAM is specifically designed to work with datasets much larger than the available cache on any given system, so that the results are more indicative of the performance of very large, vector-style applications
• Design purpose: it is designed to stress local memory bandwidth. The vectors may be allocated in an aligned manner such that no communication is required to perform the computation
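The Triad kernel described above can be sketched in a few lines of C. This is illustrative only (the array size, fill values, and the `triad_demo` helper are not part of the HPCC source, which also times many repetitions over cache-busting array sizes):

```c
#include <stdlib.h>

/* STREAM Triad kernel: a(i) = b(i) + q * c(i).
   The real benchmark sizes n far beyond the last-level cache
   (24 MB per socket on Blacklight); this sketch shows only the
   kernel itself. */
void triad(double *a, const double *b, const double *c, double q, long n) {
    for (long i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}

/* Hypothetical driver: fills b and c with constants so the result
   is easy to verify, runs one Triad pass, returns the last element. */
double triad_demo(long n) {
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    for (long i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }
    triad(a, b, c, 3.0, n);           /* a[i] = 1 + 3*2 = 7 */
    double last = a[n - 1];
    free(a); free(b); free(c);
    return last;
}
```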
Blacklight Memory Bandwidth w.r.t. Process-core Mapping
• HPCC STREAM used for memory BW measurement (MB/s)

Measurement                        Triad BW
Single Triad (one core)            5
Star Triad (per core)              2.37
Star Triad (per socket, 8 cores)   8 × 2.37 = 18.96
Speedup per socket                 18.96 / 5 ≈ 3.79
Effect of -openmp and omplace on STREAM Bandwidth
• -openmp is the compilation flag
• omplace is the run-time command for an OpenMP code that ensures threads do not migrate across cores

Function  Rate (MB/s)  Rate (MB/s)        Rate (MB/s)  Rate (MB/s)
          -openmp      -openmp & omplace               omplace
Copy      840.06       4363.83            4205.21      4186.23
Scale     728.38       3946.61            3957.36      3968.78
Add       970.93       4934.30            5007.15      4977.13
Triad     979.62       4998.90            5017.49      4995.37

Take-home message: if the code is compiled with OpenMP, be sure to use omplace.
Example: mpirun -np 16 omplace -nt 4 ./myhybrid
Modified STREAM Benchmark
• Notation: blk-stride-arraysize
• Units are words (1 word = 8 bytes); M = mega-words, g = giga-words
• Single-core modified STREAM is benchmarked

Function    200M-200M-200M (MB/s)   8-8-200M (MB/s)   1-1-200M (MB/s)
Strided CP  4200                    5175              2145
Random CP   4200                    1050              288
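The blk-stride access pattern above can be sketched as follows. The modified-STREAM source itself is not shown in these slides, so the function name, parameters, and exact loop structure here are assumptions, not the benchmark's code:

```c
/* One plausible reading of the blk-stride-arraysize pattern:
   copy `blk` contiguous words, then advance the base pointer by
   `stride` words, across an n-word array. With blk == stride == n
   this degenerates to one contiguous copy of the whole array. */
void strided_copy(double *dst, const double *src, long n, long blk, long stride) {
    for (long base = 0; base + blk <= n; base += stride)
        for (long j = 0; j < blk; j++)
            dst[base + j] = src[base + j];
}
```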
Remote Memory Access
• Modified STREAM code is benchmarked
• Data is initialized on thread 0 and resides on thread 0
• Data is then accessed by thread <n> (remote access)
• Block = blk, Stride = S, Arraysize = n

Accessing   BW (MB/s)          BW (MB/s)    BW (MB/s)
thread      blk=200M, S=200M   blk=8, S=8   blk=1, S=8
            n=200M             n=200M       n=200M
0           1826.18            1624.53      557.39
8           1410.17            1376.59      463.20
16          594.83             641.88       187.24
24          673.43             622.25       188.44
32          541.75             534.93       156.57
48          481.22             459.93       140.14

(Cores 0–7 and 8–15 share one HUB; cores 16–23 and 24–31 share a second HUB; the sockets attach to their HUBs via QPI.)
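The bandwidth drop-off with thread distance in the table above is the classic NUMA first-touch effect: the pages live on the node where thread 0 first wrote them. A common remedy, sketched below, is to initialize the data in parallel so each thread's pages land on its own node. This is a generic illustration, not code from the benchmark:

```c
#include <stddef.h>

/* First-touch initialization: on a NUMA machine such as Blacklight,
   Linux places each page on the node of the thread that first writes
   it. Initializing with the same static schedule as the compute loop
   keeps every thread's slice of the array in local memory. The code
   also compiles and runs serially (the pragma is then ignored). */
void first_touch_init(double *a, size_t n, double val) {
    #pragma omp parallel for schedule(static)
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n; i++)
        a[i] = val;
}
```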
HPCC Ping-pong Benchmark
• Latency: the time required to send an 8-byte message from one process to another
• What does the ping-pong benchmark measure?
The ping-pong benchmark is executed on two processes. A message (ping) is sent from the client process to the server process and then bounced back to the client (pong). Standard MPI blocking send and receive are used. The ping-pong pattern is repeated in a loop. To obtain the communication time of one message, the total communication time is measured on the client process and divided by twice the loop length. Additional startup latencies are masked out by starting the measurement after one non-measured ping-pong. The benchmark in HPCC uses 8-byte messages and loop length 8 for measuring the communication latency. The benchmark is repeated 5 times and the shortest latency is reported. To measure the communication bandwidth, 2,000,000-byte messages with loop length 1 are repeated twice.
• How is ping-pong measured on more than 2 processors?
The ping-pong benchmark reports the maximum latency and minimum bandwidth over a number of non-simultaneous ping-pong tests, performed between as many distinct, exclusive pairs of processors as possible (there is an upper bound on the time this test may take).
Reference: http://icl.cs.utk.edu/hpcc/faq/index.html
Blacklight Latency with Respect to Process-core Mapping
• HPCC ping-pong used for latency measurement
• Ranks send and receive an 8-byte message one at a time

Cores   Msg length (bytes)   MPI latency (microseconds)
1024    8                    1.6 – 2.0
GRU Environment Variable
Global Reference Unit (GRU) Hardware Overview
• The UV Hub connects two Nehalem EX sockets (4, 6, or 8 cores each) and their memory DIMMs via QPI, and attaches to NUMAlink-5; there are 2 GRU chiplets per HUB
• GRU is a coprocessor that resides in the HUB (node controller) of a UV system
• GRU provides high-BW, low-latency socket communication
• The SGI MPT library uses GRU features to optimize node communication
Run-time Tuning with GRU in the PSC Hybrid Benchmark
• Setting the GRU_RESOURCE_FACTOR variable at run time may improve the communication time:
  ‘setenv GRU_RESOURCE_FACTOR <n>’, n = 2, 4, 6, 8
• All runs are on 64 cores
• (ranks, threads): (64, 1), (8, 8), (16, 4)

Walltime:
Config   No-GRU  GRU=2  GRU=4  GRU=6  GRU=8
64R1T    174     167    169    169    169
8R8T     184     143    141    140    140
16R4T    170     134    130    131    136

CommTime:
Config   No-GRU  GRU=2  GRU=4  GRU=6  GRU=8
64R1T    93      87     89     89     88
8R8T     101     60     57     56     57
16R4T    88      53     48     50     55
Effect of GRU on HPCC Ping-pong BW
• HPCC ping-pong is used in two runs
• The following environment variables are set in one of the runs:
  setenv MPI_GRU_CBS 0
  setenv GRU_RESOURCE_FACTOR 4
  setenv MPI_BUFFER_MAX 2048

Cores   Msg (bytes)   BW (MB/s), No-GRU   BW (MB/s), GRU
1024    2,000,000     1109.5              2663.6
Case study: PSC Hybrid Benchmark
A Case Study: PSC Hybrid Benchmark Code (Laplace Solver)
• The code uses MPI and OpenMP to parallelize the solution of a partial differential equation (PDE)
• Tests the MPI/OpenMP performance of the code on a NUMA system
• Computation: each process is assigned the task of updating the entries of the part of the array it owns
• Communication: each process communicates with its two neighbors only at block boundaries, to receive the values of neighboring points owned by another process
• No collective communication
• Communication is simplified by allocating an overlap area on each process for storing the values received from its neighbors
The Laplace Equation
• To solve the equation, we want to find T(x,y) at the grid points subject to the following boundary conditions:
  – T = 0 along the top and left boundaries.
  – T varies linearly from 0 to 100 along the right and bottom boundaries.
• The solution method is known as the Point Jacobi Iteration.
The Point Jacobi Iteration
• In this iterative method, the value of each T(i,j) is replaced by the average of its four neighbors until the convergence criteria are met:
• T(i,j) = 0.25 * [T(i-1,j) + T(i+1,j) + T(i,j-1) + T(i,j+1)]
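One Jacobi sweep can be sketched in C as below, using flat row-major storage with a one-cell ghost border for the fixed boundary values. The names and storage layout are illustrative, not taken from the benchmark's homb.c source:

```c
#include <math.h>

/* One Point Jacobi sweep over the n x n interior of an
   (n+2) x (n+2) grid stored row-major; the outer ring holds the
   fixed boundary values. Returns the largest |change|, which the
   solver tests against its convergence tolerance. */
double jacobi_sweep(double *tnew, const double *told, int n) {
    int w = n + 2;                     /* row width including ghosts */
    double dmax = 0.0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++) {
            int k = i * w + j;
            tnew[k] = 0.25 * (told[k - w] + told[k + w]
                            + told[k - 1] + told[k + 1]);
            double d = fabs(tnew[k] - told[k]);
            if (d > dmax) dmax = d;
        }
    return dmax;
}
```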
Data Decomposition in the PSC Laplace Benchmark
• A 1D, row-wise block partition is used
• Each processor (PE) computes the Jacobi points in its block and communicates with its neighbor(s) only at block boundaries (PE0 … PE3 own consecutive row blocks)
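The row-wise block partition can be computed as below. This is a sketch of the standard even-block formula; the function and variable names are illustrative, not from the benchmark:

```c
/* Row-wise 1D block partition of nrows grid rows over np ranks:
   the first nrows % np ranks each get one extra row, so block
   sizes differ by at most one. *first is the first owned row,
   *count the number of owned rows. */
void block_range(int nrows, int np, int rank, int *first, int *count) {
    int base = nrows / np;             /* rows every rank gets      */
    int rem  = nrows % np;             /* ranks with one extra row  */
    *count = base + (rank < rem ? 1 : 0);
    *first = rank * base + (rank < rem ? rank : rem);
}
```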
Portable performance evaluation tools on Blacklight
Portable Performance Evaluation Tools on Blacklight
Goals:
• Give an overview of the programming-tools suite available on Blacklight
• Explain the functionality of the individual tools
• Teach how to use the tools effectively
  – Capabilities
  – Basic use
  – Hybrid profiling analysis
  – Reducing the profiling overhead
  – Common environment variables
Available Open-source Performance Evaluation Tools on Blacklight
• PAPI
• IPM
• SCALASCA
• TAU
• ‘module avail <tool>’ to view the available versions
• ‘module load <tool>’ to bring a tool into the environment
  e.g.: module load tau
What is PAPI?
• Middleware that provides a consistent programming interface to the hardware performance counters found in most major microprocessors.
• Countable hardware events:
  – PRESET: platform-neutral events
  – NATIVE: platform-dependent events
  – Derived: preset events can be derived from multiple native events
  – Multiplexed: events can be multiplexed if counters are limited
PAPI Utilities
• Utilities are available in the PAPI bin directory. Load the module first to append it to the PATH, or use the absolute path to the utility.
  Example:
  % module load papi
  % which papi_avail
  /usr/local/packages/PAPI/usr/4.1.3/bin/papi_avail
• Execute the utilities on compute nodes, as mmtimer is not available on login nodes.
• Use ‘<utility> -h’ for more information.
  Example:
  % papi_cost -h
  Computes min / max / mean / std. deviation for PAPI start/stop pairs, for PAPI reads, and for PAPI_accums.
  Usage: cost [options] [parameters] …
PAPI Utilities Cont.
• Execute papi_avail to list PAPI preset events:
  % papi_avail
  ……
  Name          Code        Avail  Deriv  Description (Note)
  PAPI_L1_DCM   0x80000000  Yes    No     Level 1 data cache misses
  PAPI_L2_DCM   0x80000002  Yes    Yes    Level 2 data cache misses
• Execute papi_native_avail to list available native events:
  % papi_native_avail
  ……
  Event Code  Symbol                       | Long Description
  0x40000005  LAST_LEVEL_CACHE_REFERENCES  | This is an alias for LLC_REFERENCES
• Execute papi_event_chooser to select a compatible set of events that can be counted simultaneously:
  % papi_event_chooser
  Usage: papi_event_chooser NATIVE|PRESET evt1 evt2 ...
  % papi_event_chooser PRESET PAPI_FP_OPS PAPI_L1_DCM
  event_chooser.c PASSED
PAPI High-level Interface
• Meant for application programmers wanting coarse-grained measurements
• Calls the lower-level API
• Allows only PAPI preset events
• Easier to use and requires less setup (less additional code) than the low-level interface
• Supports 8 calls, in C or Fortran
PAPI High-level Example

#include "papi.h"
#define NUM_EVENTS 2
long_long values[NUM_EVENTS];
int Events[NUM_EVENTS] = {PAPI_TOT_INS, PAPI_TOT_CYC};

/* Start the counters */
PAPI_start_counters(Events, NUM_EVENTS);

/* What we are monitoring… */
do_work();

/* Stop counters and store results in values */
retval = PAPI_stop_counters(values, NUM_EVENTS);
PAPI Low-level Interface
• Increased efficiency and functionality over the high-level PAPI interface
• Obtains information about the executable, the hardware, and the memory environment
• Multiplexing
• Callbacks on counter overflow
• Profiling
• About 60 functions
PAPI Low-level Example

#include "papi.h"
#define NUM_EVENTS 2
int Events[NUM_EVENTS] = {PAPI_FP_INS, PAPI_TOT_CYC};
int EventSet;
long_long values[NUM_EVENTS];

/* Initialize the library */
retval = PAPI_library_init(PAPI_VER_CURRENT);
/* Allocate space for the new event set and do setup */
retval = PAPI_create_eventset(&EventSet);
/* Add flops and total cycles to the event set */
retval = PAPI_add_events(EventSet, Events, NUM_EVENTS);
/* Start the counters */
retval = PAPI_start(EventSet);

do_work(); /* What we want to monitor */

/* Stop counters and store results in values */
retval = PAPI_stop(EventSet, values);
Example: FLOPS with PAPI Calls

program mflops_example
implicit none
#include 'fpapi.h'
integer :: i
double precision :: a, b, c
integer, parameter :: n = 1000000
integer (kind=8) :: flpops = 0
integer :: check
real (kind=4) :: real_time = 0., proc_time = 0., mflops = 0.
a = 1.e-8
b = 2.e-7
c = 3.e-6
call PAPIF_flops(real_time, proc_time, flpops, mflops, check)
print *, "first: ", flpops, proc_time, mflops, check
do i = 1, n
a = a + b * c
end do
call PAPIF_flops(real_time, proc_time, flpops, mflops, check)
print *, "second: ", flpops, proc_time, mflops, check
print *, 'sum = ', a
end program mflops_example
Compilation:
% module load papi
% ifort -fpp $PAPI_INC -o mflops mflops_example.f $PAPI_LIB

Execution:
% module load papi
% ./mflops

Output (flpops, proc_time, mflops, check):
first:  0        0.0000000E+00  0.0000000E+00  0
second: 1000009  1.4875773E-03  672.2400       0
sum = 6.100000281642980E-007
IPM: Integrated Performance Monitoring
• Lightweight and easy to use
• Profiles only MPI codes (not serial, not OpenMP)
• Profiles only MPI routines (not computational routines)
• Accesses hardware performance counters through PAPI
• Lists message-size information
• Provides communication topology
• Reports walltime, comm%, flops, total memory usage, MPI routines' load imbalance and time breakdown
• IPM-1 and IPM-2 (pre-release) are installed on Blacklight
• Generates a text report and visual data (HTML-based)
How to Use IPM on Blacklight: Basics
Compilation
• module load ipm
• Link your code to the IPM library at compile time
  eg_1: icc test.c $PAPI_LIB $IPM_LIB -lmpi
  eg_2: ifort -openmp test.f90 $PAPI_LIB $IPM_LIB -lmpi
Execution
• % module load ipm
• Optionally, set the run-time environment variables
  Example:
  export IPM_REPORT=FULL
  export IPM_HPM=PAPI_FP_OPS,PAPI_L1_DCM (a comma-separated list of PAPI counters)
• Execute the binary normally
  (This step generates an XML file for visual data)
Profiling report
• The text report will be available in the batch output after the execution completes
• For an HTML-based report, run ‘ipm_parse -html <xml_file>’. Transfer the generated directory to your workstation and click on index.html for the visual data
IPM Communication Statistics: PSC Hybrid Benchmark
Communication event statistics (100.00% detail, -5.4590e-03 error)

            Buffer size  Ncalls   Total time  Min time   Max time   %MPI   %Wall
MPI_Wait    2097152      4999814  4907.002    4.764e-08  5.658e-01  76.10  7.98
MPI_Irecv   2097152      2520000  1374.856    1.050e-06  5.639e-01  21.32  2.24
MPI_Wait    192          40000    144.849     1.376e-07  3.014e-01  2.25   0.24
MPI_Isend   2097152      2520000  17.616      2.788e-07  5.527e-01  0.27   0.03
IPM Profiling: Message Sizes
• Message size per MPI call: 100% of the communication time uses 2 MB messages, in MPI_Wait and MPI_Irecv
IPM Profiling: Load Imbalance Information
SCALASCA
• Automated profile-based performance analysis
• Automatic search for bottlenecks based on properties formalizing expert knowledge
  – MPI wait states
  – Processor utilization hardware counters
• Scalable performance analysis of large-scale applications
  – Particularly focused on the MPI & OpenMP paradigms
  – Analysis of communication & synchronization overheads
• Automatic and manual instrumentation capabilities
• Runtime summarization and/or event-trace analyses
• Automatic search of event traces for patterns of inefficiency
How to Use SCALASCA on Blacklight: Basics
• module load scalasca
• Run the scalasca command (% scalasca) without arguments for basic usage info
• ‘scalasca -h’ shows the quick reference guide (PDF document)
• Instrumentation
  – Prepend skin (or scalasca -instrument) to compile/link commands
    Example: skin icc -openmp test.c -lmpi (hybrid code)
• Measurement & analysis
  – Prepend scan (or scalasca -analyze) to the usual execution command
    (This step generates the epik directory)
  – Example: omplace -nt 4 scan -t mpirun -np 16 ./exe (optional -t for trace generation)
• Report examination
  – Run square (or scalasca -examine) on the generated epik measurement directory to examine the report interactively (visual data)
    Example: square epik_a.out_32x2_sum
  – or run ‘cube3_score -s’ on the epik directory for a text report
Distribution of Time for the Selected Call Tree by Process/Thread
(Metric pane, call-tree pane, process/thread pane)
Distribution of Load Imbalance for the work_sync Routine by Process/Thread
(Color-coded; profile of a 64-core job, 8 threads per rank, on Blacklight)
Global Computational Imbalance (not individual functions)
SCALASCA Metric On-line Description (right-click on the metric)
Instructions for a Scalasca Textual Report
% module load scalasca
• Run cube3_score with the -r flag on the cube file generated in the epik directory to see the text report (example command follows the listing)
• Region classification:
  MPI (pure MPI functions)
  OMP (pure OpenMP regions)
  USR (user-level computational routines)
  COM (combined USR + MPI/OpenMP)
  ANY/ALL (aggregate of all region types)
flt type max_tbc time % region
ANY 5788698 20951.46 100.00 (summary) ALL
MPI 5760322 8876.37 42.37 (summary) MPI
OMP 23384 12063.81 57.58 (summary) OMP
COM 4896 3.35 0.02 (summary) COM
USR 72 1.10 0.01 (summary) USR
MPI 2000050 16.38 0.08 MPI_Isend
MPI 1920024 7785.68 37.16 MPI_Wait
MPI 1840000 1063.18 5.07 MPI_Irecv
OMP 8800 56.31 0.27 !$omp parallel @homb.c:754
OMP 4800 8102.48 38.67 !$omp for @homb.c:758
COM 4800 3.26 0.02 work_sync
OMP 4800 3620.97 17.28 !$omp ibarrier @homb.c:765
OMP 4800 2.41 0.01 !$omp ibarrier @homb.c:773
MPI 120 11.03 0.05 MPI_Barrier
EPK 48 6.83 0.03 TRACING
OMP 44 0.03 0.00 !$omp parallel @homb.c:465
OMP 44 121.81 0.58 !$omp parallel @homb.c:557
MPI 40 0.01 0.00 MPI_Gather
MPI 40 0.00 0.00 MPI_Reduce
USR 24 0.00 0.00 gtimes_report
COM 24 0.00 0.00 timeUpdate
MPI 24 0.05 0.00 MPI_Finalize
OMP 24 23.46 0.11 !$omp ibarrier @homb.c:601
OMP 24 136.24 0.65 !$omp for @homb.c:569
COM 24 0.00 0.00 initializeMatrix
USR 24 1.10 0.01 createMatrix
…
% cube3_score -r epik_homb_8x8_sum/epitome.cube
Scalasca Notable Run-time Environment Variables
• Set EPK_METRICS to a colon-separated list of PAPI counters
  Example: setenv EPK_METRICS PAPI_TOT_INS:PAPI_FP_OPS:PAPI_L2_TCM
• Set ELG_BUFFER_SIZE to avoid intermediate flushes to disk
  Example: setenv ELG_BUFFER_SIZE 10000000 (bytes)
  To size ELG_BUFFER_SIZE, run the following command on the epik directory:
  % scalasca -examine -s epik_homb_8x8_sum
  …
  Estimated aggregate size of event trace (total_tbc): 41694664 bytes
  Estimated size of largest process trace (max_tbc): 5788698 bytes
  (Hint: when tracing, set ELG_BUFFER_SIZE > max_tbc to avoid intermediate flushes, or reduce requirements using a file listing the names of USR regions to be filtered.)
• Set EPK_FILTER to the name of a file of filtered routines to reduce the instrumentation and measurement overhead
  Example: setenv EPK_FILTER routines_filt
  % cat routines_filt
  sumTrace
  gtimes_report
  statistics
  stdoutIO
Time Spent in the Selected omp Region, and Idle Threads
(Source code pane; idle threads are greyed out)
TAU Parallel Performance Evaluation Toolset
• Portable to essentially all computing platforms
• Supported programming languages and paradigms: Fortran, C/C++, Java, Python, MPI, OpenMP, hybrid, multithreading
• Supported instrumentation methods: source code instrumentation, object and binary code, library wrapping
• Levels of instrumentation: routine, loop, block, I/O BW & volume, memory tracking, CUDA, hardware counters, tracing
• Data analyzers: ParaProf, PerfExplorer, Vampir, Jumpshot
• Throttling of frequently called small subroutines
• Automatic and manual instrumentation
• Interface to databases (Oracle, MySQL, …)
How to Use TAU on Blacklight: Basics
Step 0
% module avail tau (shows available TAU versions)
% module load tau
Step 1: Compilation
• Choose a TAU Makefile stub based on the kind of profiling you want. The available Makefile stubs are listed by:
  ls $TAU_ROOT_DIR/x86_64/lib/Makefile*
  e.g.: Makefile.tau-icpc-mpi-pdt-openmp-opari for an MPI+OpenMP code
• Optionally set TAU_OPTIONS to specify compilation-specific options
  – e.g.: setenv TAU_OPTIONS "-optVerbose -optKeepFiles" for verbose output and keeping the instrumented files
  – export TAU_OPTIONS='-optTauSelectFile=select.tau -optVerbose' (selective instrumentation)
• Use one of the TAU wrapper scripts to compile your code (tau_f90.sh, tau_cc.sh, or tau_cxx.sh)
  – e.g.: tau_cc.sh foo.c (generates an instrumented binary)
Step 2: Execution
• Optionally, set TAU run-time environment variables to choose the desired metrics
  – e.g.: setenv TAU_CALLPATH 1 (for call-graph generation)
  – e.g.: setenv TAU_METRICS <papi counters>
• Run the instrumented binary from step 1 normally (profile files will be generated)
Step 3: Data analysis
• Run pprof, where the profile files reside, for a text profile
• Run paraprof for visual data
• Run PerfExplorer for multiple sets of profiles
• Run Jumpshot or Vampir for trace-file analysis
Hybrid Code Profiled with TAU
Routines' time breakdown per node/thread
Hybrid Code Profiled with TAU (cont.)
Routines' exclusive time %, on node 0 & thread 0
Routines' exclusive time %, on rank 3 & thread 4
TAU Profiling: Thread Load Imbalance in the MPI Routines of a Hybrid Code
Reducing TAU Instrumentation & Measurement Overhead
• By default TAU throttles routines that are called more than 100,000 times with less than 10 microseconds per call
  – TAU accumulates the timer up to 100,000 calls, then stops and adds the remaining time to the routine's parent
• Tiny routines, or selected routines (selective instrumentation), can be excluded from instrumentation/measurement via TAU directives
• Methods of selective instrumentation are discussed next
Selective Instrumentation of Routines in TAU
• Specify a list of routines to exclude or include (case sensitive) in a text file (e.g.: select.tau)
• # is a wildcard in a routine name. It cannot appear in the first column.
  BEGIN_EXCLUDE_LIST
  Foo
  Bar
  D#EMM
  END_EXCLUDE_LIST
• Specify a list of routines to include for instrumentation:
  BEGIN_INCLUDE_LIST
  int main(int, char **)
  F1
  F3
  END_INCLUDE_LIST
• Specify either an include list or an exclude list, not both!
• Use the text file name in the compilation stage:
  export TAU_OPTIONS='-optTauSelectFile=select.tau'
Selective Instrumentation of Files in TAU
• Optionally specify a list of files to exclude or include (case sensitive) in a text file
• * and ? may be used as wildcard characters in a file name:
  BEGIN_FILE_EXCLUDE_LIST
  f*.f90
  Foo?.cpp
  END_FILE_EXCLUDE_LIST
• Specify a list of files to include for instrumentation:
  BEGIN_FILE_INCLUDE_LIST
  main.cpp
  foo.f90
  END_FILE_INCLUDE_LIST
• Specify either an include list or an exclude list, not both!
• Use the text file name in the compilation stage:
  export TAU_OPTIONS='-optTauSelectFile=select.tau'
  (select.tau is the selective instrumentation file)
Instrumenting Code Sections in TAU
• User instrumentation commands are placed in an INSTRUMENT section
• ? and * are used as wildcard characters for file names, # for routine names
• \ is the escape character for quotes
• Routine entry/exit, arbitrary code insertion
• Outer-loop level instrumentation
  BEGIN_INSTRUMENT_SECTION
  loops file="foo.f90" routine="matrix#"
  memory file="foo.f90" routine="#"
  io routine="matrix#"
  [static/dynamic] phase routine="MULTIPLY"
  dynamic [phase/timer] name="foo" file="foo.cpp" line=22 to line=35
  file="foo.f90" line=123 code=" print *, \" Inside foo\""
  exit routine="int foo()" code="cout <<\"exiting foo\"<<endl;"
  END_INSTRUMENT_SECTION
• Use the text file name in the compilation stage:
  export TAU_OPTIONS='-optTauSelectFile=select.tau'
  (select.tau is the selective instrumentation file)
TAU Commonly Used Run-time Environment Variables
• ‘setenv TAU_CALLPATH 1’ to obtain callpath profiling and a call graph
• ‘setenv TAU_CALLPATH_DEPTH <n>’ (n specifies the depth of the callpath)
• Set TAU_METRICS to a colon-separated list of PAPI counters for HW event counts
  – Example: setenv TAU_METRICS PAPI_FP_OPS:PAPI_NATIVE_<event>
• ‘setenv TAU_TRACE 1’ for trace generation
• ‘setenv TAU_COMM_MATRIX 1’ for communication-topology generation
• TAU_TRACK_MEMORY_LEAKS: setting it to 1 turns on leak detection (for use with tau_exec -memory)
• TAU_THROTTLE: set to 1 or 0 to turn throttling on/off
  – TAU_THROTTLE_NUMCALLS specifies the number of calls before testing for throttling
  – TAU_THROTTLE_PERCALL specifies the per-call threshold in microseconds
  (Throttle a routine if it is called over 100,000 times and takes less than 10 μs of inclusive time per call)
Which Performance Tool to Use?
• IPM: low-overhead tool for MPI communication statistics, message sizes, and PAPI event counts
• TAU: advanced profile and trace capability for MPI, OpenMP, hybrid, Java, Python, etc. Selective instrumentation reduces the overhead.
• SCALASCA: ‘automatic’ performance analysis tool for MPI and OpenMP routines. Filtering out the computational routines reduces the measurement overhead.
References
TAU
• http://www.cs.uoregon.edu/research/tau/tau-usersguide.pdf
• http://www.psc.edu/general/software/packages/tau/TAU-quickref.pdf
• http://www.cs.uoregon.edu/research/tau/docs/newguide/bk03ch02.html
PAPI
• http://icl.cs.utk.edu/papi/
SCALASCA
• http://www.scalasca.org/
IPM
• http://ipm-hpc.sourceforge.net/
Others
• https://www.teragrid.org/web/user-support/tau
• http://www.psc.edu/general/software/packages/tau/
• http://www.psc.edu/general/software/packages/ipm/