© 2011 Pittsburgh Supercomputing Center
Getting the Most Out of the TeraGrid SGI Altix UV Systems
Mahin Mahmoodi
Raghu Reddy
TeraGrid 11 Conference
July 18, 2011
Salt Lake City
Outline
• Blacklight memory BW and latency w.r.t. processor-core mapping
• GRU environment variable
• Portable performance evaluation tools on Blacklight
  – Case study: PSC Hybrid Benchmark
  – PAPI
  – IPM
  – SCALASCA
  – TAU
Blacklight memory BW and latency with respect to processor-core mapping
Blacklight per Blade/Processor/Core Memory Layout
• Node: 1 blade + 1 HUB
• Blade: 2 processors (sockets) sharing 128 GB of local memory, attached to the HUB via QPI
• Processor: 8 cores
• Caches: L1 64 KB per core, L2 256 KB per core, L3 (last-level cache) 24 MB shared per processor
Blacklight Node Pair Architecture
• A “node” is one blade: a UV Hub plus two Intel Nehalem EX-8 sockets connected by QPI, each socket with 64 GB of RAM (128 GB per node)
• A “node pair” is two such nodes whose UV Hubs are connected by NUMAlink-5
HPCC STREAM Benchmark
• Memory bandwidth is the rate at which data can be read from or stored into memory by the processor
• STREAM measures sustainable main-memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels
• Triad kernel: compute a = b + α·c, where b and c are vectors of random 64-bit floating-point values and α is a given scalar
• Problem size: STREAM is specifically designed to work with datasets much larger than the available cache on any given system, so that the results are more indicative of the performance of very large, vector-style applications
• Design purpose: it is designed to stress local memory bandwidth. The vectors may be allocated in an aligned manner such that no communication is required to perform the computation
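The Triad kernel described above can be sketched in a few lines of C. This is illustrative only (the array size, fill values, and the `triad_demo` helper are not part of the HPCC source, which also times many repetitions over cache-busting array sizes):

```c
#include <stdlib.h>

/* STREAM Triad kernel: a(i) = b(i) + q * c(i).
   The real benchmark sizes n far beyond the last-level cache
   (24 MB per socket on Blacklight); this sketch shows only the
   kernel itself. */
void triad(double *a, const double *b, const double *c, double q, long n) {
    for (long i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}

/* Hypothetical driver: fills b and c with constants so the result
   is easy to verify, runs one Triad pass, returns the last element. */
double triad_demo(long n) {
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    for (long i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }
    triad(a, b, c, 3.0, n);           /* a[i] = 1 + 3*2 = 7 */
    double last = a[n - 1];
    free(a); free(b); free(c);
    return last;
}
```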
Blacklight Memory Bandwidth w.r.t. Process-core Mapping
• HPCC STREAM used for memory BW measurement (MB/s)

Measurement                        Triad BW
Single Triad (one core)            5
Star Triad (per core)              2.37
Star Triad (per socket, 8 cores)   8 × 2.37 = 18.96
Speedup per socket                 18.96 / 5 ≈ 3.79
Effect of -openmp and omplace on STREAM Bandwidth
• -openmp is the compilation flag
• omplace is the run-time command for an OpenMP code that ensures threads do not migrate across cores

Function  Rate (MB/s)  Rate (MB/s)        Rate (MB/s)  Rate (MB/s)
          -openmp      -openmp & omplace               omplace
Copy      840.06       4363.83            4205.21      4186.23
Scale     728.38       3946.61            3957.36      3968.78
Add       970.93       4934.30            5007.15      4977.13
Triad     979.62       4998.90            5017.49      4995.37

Take-home message: if the code is compiled with OpenMP, be sure to use omplace.
Example: mpirun -np 16 omplace -nt 4 ./myhybrid
Modified STREAM Benchmark
• Notation: blk-stride-arraysize
• Units are words (1 word = 8 bytes); M = mega-words, g = giga-words
• Single-core modified STREAM is benchmarked

Function    200M-200M-200M (MB/s)   8-8-200M (MB/s)   1-1-200M (MB/s)
Strided CP  4200                    5175              2145
Random CP   4200                    1050              288
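The blk-stride access pattern above can be sketched as follows. The modified-STREAM source itself is not shown in these slides, so the function name, parameters, and exact loop structure here are assumptions, not the benchmark's code:

```c
/* One plausible reading of the blk-stride-arraysize pattern:
   copy `blk` contiguous words, then advance the base pointer by
   `stride` words, across an n-word array. With blk == stride == n
   this degenerates to one contiguous copy of the whole array. */
void strided_copy(double *dst, const double *src, long n, long blk, long stride) {
    for (long base = 0; base + blk <= n; base += stride)
        for (long j = 0; j < blk; j++)
            dst[base + j] = src[base + j];
}
```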
Remote Memory Access
• Modified STREAM code is benchmarked
• Data is initialized on thread 0 and resides on thread 0
• Data is then accessed by thread <n> (remote access)
• Block = blk, Stride = S, Arraysize = n

Accessing   BW (MB/s)          BW (MB/s)    BW (MB/s)
thread      blk=200M, S=200M   blk=8, S=8   blk=1, S=8
            n=200M             n=200M       n=200M
0           1826.18            1624.53      557.39
8           1410.17            1376.59      463.20
16          594.83             641.88       187.24
24          673.43             622.25       188.44
32          541.75             534.93       156.57
48          481.22             459.93       140.14

(Cores 0–7 and 8–15 share one HUB; cores 16–23 and 24–31 share a second HUB; the sockets attach to their HUBs via QPI.)
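The bandwidth drop-off with thread distance in the table above is the classic NUMA first-touch effect: the pages live on the node where thread 0 first wrote them. A common remedy, sketched below, is to initialize the data in parallel so each thread's pages land on its own node. This is a generic illustration, not code from the benchmark:

```c
#include <stddef.h>

/* First-touch initialization: on a NUMA machine such as Blacklight,
   Linux places each page on the node of the thread that first writes
   it. Initializing with the same static schedule as the compute loop
   keeps every thread's slice of the array in local memory. The code
   also compiles and runs serially (the pragma is then ignored). */
void first_touch_init(double *a, size_t n, double val) {
    #pragma omp parallel for schedule(static)
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n; i++)
        a[i] = val;
}
```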
HPCC Ping-pong Benchmark
• Latency: the time required to send an 8-byte message from one process to another
• What does the ping-pong benchmark measure?
The ping-pong benchmark is executed on two processes. A message (ping) is sent from the client process to the server process and then bounced back to the client (pong). Standard MPI blocking send and receive are used. The ping-pong pattern is repeated in a loop. To obtain the communication time of one message, the total communication time is measured on the client process and divided by twice the loop length. Additional startup latencies are masked out by starting the measurement after one non-measured ping-pong. The benchmark in HPCC uses 8-byte messages and loop length 8 for measuring the communication latency. The benchmark is repeated 5 times and the shortest latency is reported. To measure the communication bandwidth, 2,000,000-byte messages with loop length 1 are repeated twice.
• How is ping-pong measured on more than 2 processors?
The ping-pong benchmark reports the maximum latency and minimum bandwidth over a number of non-simultaneous ping-pong tests, performed between as many distinct, exclusive pairs of processors as possible (there is an upper bound on the time this test may take).
Reference: http://icl.cs.utk.edu/hpcc/faq/index.html
Blacklight Latency with Respect to Process-core Mapping
• HPCC ping-pong used for latency measurement
• Ranks send and receive an 8-byte message one at a time

Cores   Msg length (bytes)   MPI latency (microseconds)
1024    8                    1.6 – 2.0
GRU Environment Variable
Global Reference Unit (GRU) Hardware Overview
• The UV Hub connects two Nehalem EX sockets (4, 6, or 8 cores each) and their memory DIMMs via QPI, and attaches to NUMAlink-5; there are 2 GRU chiplets per HUB
• GRU is a coprocessor that resides in the HUB (node controller) of a UV system
• GRU provides high-BW, low-latency socket communication
• The SGI MPT library uses GRU features to optimize node communication
Run-time Tuning with GRU in the PSC Hybrid Benchmark
• Setting the GRU_RESOURCE_FACTOR variable at run time may improve the communication time:
  ‘setenv GRU_RESOURCE_FACTOR <n>’, n = 2, 4, 6, 8
• All runs are on 64 cores
• (ranks, threads): (64, 1), (8, 8), (16, 4)

Walltime:
Config   No-GRU  GRU=2  GRU=4  GRU=6  GRU=8
64R1T    174     167    169    169    169
8R8T     184     143    141    140    140
16R4T    170     134    130    131    136

CommTime:
Config   No-GRU  GRU=2  GRU=4  GRU=6  GRU=8
64R1T    93      87     89     89     88
8R8T     101     60     57     56     57
16R4T    88      53     48     50     55
Effect of GRU on HPCC Ping-pong BW
• HPCC ping-pong is used in two runs
• The following environment variables are set in one of the runs:
  setenv MPI_GRU_CBS 0
  setenv GRU_RESOURCE_FACTOR 4
  setenv MPI_BUFFER_MAX 2048

Cores   Msg (bytes)   BW (MB/s), No-GRU   BW (MB/s), GRU
1024    2,000,000     1109.5              2663.6
Case study: PSC Hybrid Benchmark
A Case Study: PSC Hybrid Benchmark Code (Laplace Solver)
• The code uses MPI and OpenMP to parallelize the solution of a partial differential equation (PDE)
• Tests the MPI/OpenMP performance of the code on a NUMA system
• Computation: each process is assigned the task of updating the entries of the part of the array it owns
• Communication: each process communicates with its two neighbors only at block boundaries, to receive the values of neighboring points owned by another process
• No collective communication
• Communication is simplified by allocating an overlap area on each process for storing the values received from its neighbors
The Laplace Equation
• To solve the equation, we want to find T(x,y) at the grid points subject to the following boundary conditions:
  – T = 0 along the top and left boundaries.
  – T varies linearly from 0 to 100 along the right and bottom boundaries.
• The solution method is known as the Point Jacobi Iteration.
The Point Jacobi Iteration
• In this iterative method, the value of each T(i,j) is replaced by the average of its four neighbors until the convergence criteria are met:
• T(i,j) = 0.25 * [T(i-1,j) + T(i+1,j) + T(i,j-1) + T(i,j+1)]
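One Jacobi sweep can be sketched in C as below, using flat row-major storage with a one-cell ghost border for the fixed boundary values. The names and storage layout are illustrative, not taken from the benchmark's homb.c source:

```c
#include <math.h>

/* One Point Jacobi sweep over the n x n interior of an
   (n+2) x (n+2) grid stored row-major; the outer ring holds the
   fixed boundary values. Returns the largest |change|, which the
   solver tests against its convergence tolerance. */
double jacobi_sweep(double *tnew, const double *told, int n) {
    int w = n + 2;                     /* row width including ghosts */
    double dmax = 0.0;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++) {
            int k = i * w + j;
            tnew[k] = 0.25 * (told[k - w] + told[k + w]
                            + told[k - 1] + told[k + 1]);
            double d = fabs(tnew[k] - told[k]);
            if (d > dmax) dmax = d;
        }
    return dmax;
}
```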
Data Decomposition in the PSC Laplace Benchmark
• A 1D, row-wise block partition is used
• Each processor (PE) computes the Jacobi points in its block and communicates with its neighbor(s) only at block boundaries (PE0 … PE3 own consecutive row blocks)
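The row-wise block partition can be computed as below. This is a sketch of the standard even-block formula; the function and variable names are illustrative, not from the benchmark:

```c
/* Row-wise 1D block partition of nrows grid rows over np ranks:
   the first nrows % np ranks each get one extra row, so block
   sizes differ by at most one. *first is the first owned row,
   *count the number of owned rows. */
void block_range(int nrows, int np, int rank, int *first, int *count) {
    int base = nrows / np;             /* rows every rank gets      */
    int rem  = nrows % np;             /* ranks with one extra row  */
    *count = base + (rank < rem ? 1 : 0);
    *first = rank * base + (rank < rem ? rank : rem);
}
```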
Portable performance evaluation tools on Blacklight
Portable Performance Evaluation Tools on Blacklight
Goals:
• Give an overview of the programming-tools suite available on Blacklight
• Explain the functionality of the individual tools
• Teach how to use the tools effectively
  – Capabilities
  – Basic use
  – Hybrid profiling analysis
  – Reducing the profiling overhead
  – Common environment variables
Available Open-source Performance Evaluation Tools on Blacklight
• PAPI
• IPM
• SCALASCA
• TAU
• ‘module avail <tool>’ to view the available versions
• ‘module load <tool>’ to bring a tool into the environment
  e.g.: module load tau
What is PAPI?
• Middleware that provides a consistent programming interface to the hardware performance counters found in most major microprocessors.
• Countable hardware events:
  – PRESET: platform-neutral events
  – NATIVE: platform-dependent events
  – Derived: preset events can be derived from multiple native events
  – Multiplexed: events can be multiplexed if counters are limited
PAPI Utilities
• Utilities are available in the PAPI bin directory. Load the module first to append it to the PATH, or use the absolute path to the utility.
  Example:
  % module load papi
  % which papi_avail
  /usr/local/packages/PAPI/usr/4.1.3/bin/papi_avail
• Execute the utilities on compute nodes, as mmtimer is not available on login nodes.
• Use ‘<utility> -h’ for more information.
  Example:
  % papi_cost -h
  Computes min / max / mean / std. deviation for PAPI start/stop pairs, for PAPI reads, and for PAPI_accums.
  Usage: cost [options] [parameters] …
PAPI Utilities Cont.
• Execute papi_avail to list PAPI preset events:
  % papi_avail
  ……
  Name          Code        Avail  Deriv  Description (Note)
  PAPI_L1_DCM   0x80000000  Yes    No     Level 1 data cache misses
  PAPI_L2_DCM   0x80000002  Yes    Yes    Level 2 data cache misses
• Execute papi_native_avail to list available native events:
  % papi_native_avail
  ……
  Event Code  Symbol                       | Long Description
  0x40000005  LAST_LEVEL_CACHE_REFERENCES  | This is an alias for LLC_REFERENCES
• Execute papi_event_chooser to select a compatible set of events that can be counted simultaneously:
  % papi_event_chooser
  Usage: papi_event_chooser NATIVE|PRESET evt1 evt2 ...
  % papi_event_chooser PRESET PAPI_FP_OPS PAPI_L1_DCM
  event_chooser.c PASSED
PAPI High-level Interface
• Meant for application programmers wanting coarse-grained measurements
• Calls the lower-level API
• Allows only PAPI preset events
• Easier to use and requires less setup (less additional code) than the low-level interface
• Supports 8 calls, in C or Fortran
PAPI High-level Example

#include "papi.h"
#define NUM_EVENTS 2
long_long values[NUM_EVENTS];
int Events[NUM_EVENTS] = {PAPI_TOT_INS, PAPI_TOT_CYC};

/* Start the counters */
PAPI_start_counters(Events, NUM_EVENTS);

/* What we are monitoring… */
do_work();

/* Stop counters and store results in values */
retval = PAPI_stop_counters(values, NUM_EVENTS);
PAPI Low-level Interface
• Increased efficiency and functionality over the high-level PAPI interface
• Obtains information about the executable, the hardware, and the memory environment
• Multiplexing
• Callbacks on counter overflow
• Profiling
• About 60 functions
PAPI Low-level Example

#include "papi.h"
#define NUM_EVENTS 2
int Events[NUM_EVENTS] = {PAPI_FP_INS, PAPI_TOT_CYC};
int EventSet;
long_long values[NUM_EVENTS];

/* Initialize the library */
retval = PAPI_library_init(PAPI_VER_CURRENT);
/* Allocate space for the new event set and do setup */
retval = PAPI_create_eventset(&EventSet);
/* Add flops and total cycles to the event set */
retval = PAPI_add_events(EventSet, Events, NUM_EVENTS);
/* Start the counters */
retval = PAPI_start(EventSet);

do_work(); /* What we want to monitor */

/* Stop counters and store results in values */
retval = PAPI_stop(EventSet, values);
Example: FLOPS with PAPI Calls

program mflops_example
implicit none
#include 'fpapi.h'
integer :: i
double precision :: a, b, c
integer, parameter :: n = 1000000
integer (kind=8) :: flpops = 0
integer :: check
real (kind=4) :: real_time = 0., proc_time = 0., mflops = 0.
a = 1.e-8
b = 2.e-7
c = 3.e-6
call PAPIF_flops(real_time, proc_time, flpops, mflops, check)
print *, "first: ", flpops, proc_time, mflops, check
do i = 1, n
a = a + b * c
end do
call PAPIF_flops(real_time, proc_time, flpops, mflops, check)
print *, "second: ", flpops, proc_time, mflops, check
print *, 'sum = ', a
end program mflops_example
Compilation:
% module load papi
% ifort -fpp $PAPI_INC -o mflops mflops_example.f $PAPI_LIB

Execution:
% module load papi
% ./mflops

Output (flpops, proc_time, mflops, check):
first:  0        0.0000000E+00  0.0000000E+00  0
second: 1000009  1.4875773E-03  672.2400       0
sum = 6.100000281642980E-007
IPM: Integrated Performance Monitoring
• Lightweight and easy to use
• Profiles only MPI codes (not serial, not OpenMP)
• Profiles only MPI routines (not computational routines)
• Accesses hardware performance counters through PAPI
• Lists message-size information
• Provides communication topology
• Reports walltime, comm%, flops, total memory usage, MPI routines' load imbalance and time breakdown
• IPM-1 and IPM-2 (pre-release) are installed on Blacklight
• Generates a text report and visual data (HTML-based)
How to Use IPM on Blacklight: Basics
Compilation
• module load ipm
• Link your code to the IPM library at compile time
  eg_1: icc test.c $PAPI_LIB $IPM_LIB -lmpi
  eg_2: ifort -openmp test.f90 $PAPI_LIB $IPM_LIB -lmpi
Execution
• % module load ipm
• Optionally, set the run-time environment variables
  Example:
  export IPM_REPORT=FULL
  export IPM_HPM=PAPI_FP_OPS,PAPI_L1_DCM (a comma-separated list of PAPI counters)
• Execute the binary normally
  (This step generates an XML file for visual data)
Profiling report
• The text report will be available in the batch output after the execution completes
• For an HTML-based report, run ‘ipm_parse -html <xml_file>’. Transfer the generated directory to your workstation and click on index.html for the visual data
IPM Communication Statistics: PSC Hybrid Benchmark
Communication event statistics (100.00% detail, -5.4590e-03 error)

            Buffer size  Ncalls   Total time  Min time   Max time   %MPI   %Wall
MPI_Wait    2097152      4999814  4907.002    4.764e-08  5.658e-01  76.10  7.98
MPI_Irecv   2097152      2520000  1374.856    1.050e-06  5.639e-01  21.32  2.24
MPI_Wait    192          40000    144.849     1.376e-07  3.014e-01  2.25   0.24
MPI_Isend   2097152      2520000  17.616      2.788e-07  5.527e-01  0.27   0.03
IPM Profiling: Message Sizes
• Message size per MPI call: 100% of the communication time uses 2 MB messages, in MPI_Wait and MPI_Irecv
IPM Profiling: Load Imbalance Information
SCALASCA
• Automated profile-based performance analysis
• Automatic search for bottlenecks based on properties formalizing expert knowledge
  – MPI wait states
  – Processor utilization hardware counters
• Scalable performance analysis of large-scale applications
  – Particularly focused on the MPI & OpenMP paradigms
  – Analysis of communication & synchronization overheads
• Automatic and manual instrumentation capabilities
• Runtime summarization and/or event-trace analyses
• Automatic search of event traces for patterns of inefficiency
How to Use SCALASCA on Blacklight: Basics
• module load scalasca
• Run the scalasca command (% scalasca) without arguments for basic usage info
• ‘scalasca -h’ shows the quick reference guide (PDF document)
• Instrumentation
  – Prepend skin (or scalasca -instrument) to compile/link commands
    Example: skin icc -openmp test.c -lmpi (hybrid code)
• Measurement & analysis
  – Prepend scan (or scalasca -analyze) to the usual execution command
    (This step generates the epik directory)
  – Example: omplace -nt 4 scan -t mpirun -np 16 ./exe (optional -t for trace generation)
• Report examination
  – Run square (or scalasca -examine) on the generated epik measurement directory to examine the report interactively (visual data)
    Example: square epik_a.out_32x2_sum
  – or run ‘cube3_score -s’ on the epik directory for a text report
Distribution of Time for the Selected Call Tree by Process/Thread
(Metric pane, call-tree pane, process/thread pane)
Distribution of Load Imbalance for the work_sync Routine by Process/Thread
(Color-coded; profile of a 64-core job, 8 threads per rank, on Blacklight)
Global Computational Imbalance (not individual functions)
SCALASCA Metric On-line Description (right-click on the metric)
Instructions for a Scalasca Textual Report
% module load scalasca
• Run cube3_score with the -r flag on the cube file generated in the epik directory to see the text report (example command follows the listing)
• Region classification:
  MPI (pure MPI functions)
  OMP (pure OpenMP regions)
  USR (user-level computational routines)
  COM (combined USR + MPI/OpenMP)
  ANY/ALL (aggregate of all region types)
flt type max_tbc time % region
ANY 5788698 20951.46 100.00 (summary) ALL
MPI 5760322 8876.37 42.37 (summary) MPI
OMP 23384 12063.81 57.58 (summary) OMP
COM 4896 3.35 0.02 (summary) COM
USR 72 1.10 0.01 (summary) USR
MPI 2000050 16.38 0.08 MPI_Isend
MPI 1920024 7785.68 37.16 MPI_Wait
MPI 1840000 1063.18 5.07 MPI_Irecv
OMP 8800 56.31 0.27 !$omp parallel @homb.c:754
OMP 4800 8102.48 38.67 !$omp for @homb.c:758
COM 4800 3.26 0.02 work_sync
OMP 4800 3620.97 17.28 !$omp ibarrier @homb.c:765
OMP 4800 2.41 0.01 !$omp ibarrier @homb.c:773
MPI 120 11.03 0.05 MPI_Barrier
EPK 48 6.83 0.03 TRACING
OMP 44 0.03 0.00 !$omp parallel @homb.c:465
OMP 44 121.81 0.58 !$omp parallel @homb.c:557
MPI 40 0.01 0.00 MPI_Gather
MPI 40 0.00 0.00 MPI_Reduce
USR 24 0.00 0.00 gtimes_report
COM 24 0.00 0.00 timeUpdate
MPI 24 0.05 0.00 MPI_Finalize
OMP 24 23.46 0.11 !$omp ibarrier @homb.c:601
OMP 24 136.24 0.65 !$omp for @homb.c:569
COM 24 0.00 0.00 initializeMatrix
USR 24 1.10 0.01 createMatrix
…
% cube3_score -r epik_homb_8x8_sum/epitome.cube
Scalasca Notable Run-time Environment Variables
• Set EPK_METRICS to a colon-separated list of PAPI counters
  Example: setenv EPK_METRICS PAPI_TOT_INS:PAPI_FP_OPS:PAPI_L2_TCM
• Set ELG_BUFFER_SIZE to avoid intermediate flushes to disk
  Example: setenv ELG_BUFFER_SIZE 10000000 (bytes)
  To size ELG_BUFFER_SIZE, run the following command on the epik directory:
  % scalasca -examine -s epik_homb_8x8_sum
  …
  Estimated aggregate size of event trace (total_tbc): 41694664 bytes
  Estimated size of largest process trace (max_tbc): 5788698 bytes
  (Hint: when tracing, set ELG_BUFFER_SIZE > max_tbc to avoid intermediate flushes, or reduce requirements using a file listing the names of USR regions to be filtered.)
• Set EPK_FILTER to the name of a file of filtered routines to reduce the instrumentation and measurement overhead
  Example: setenv EPK_FILTER routines_filt
  % cat routines_filt
  sumTrace
  gtimes_report
  statistics
  stdoutIO
Time Spent in the Selected omp Region, and Idle Threads
(Source code pane; idle threads are greyed out)
TAU Parallel Performance Evaluation Toolset
• Portable to essentially all computing platforms
• Supported programming languages and paradigms: Fortran, C/C++, Java, Python, MPI, OpenMP, hybrid, multithreading
• Supported instrumentation methods: source code instrumentation, object and binary code, library wrapping
• Levels of instrumentation: routine, loop, block, I/O BW & volume, memory tracking, CUDA, hardware counters, tracing
• Data analyzers: ParaProf, PerfExplorer, Vampir, Jumpshot
• Throttling of frequently called small subroutines
• Automatic and manual instrumentation
• Interface to databases (Oracle, MySQL, …)
How to Use TAU on Blacklight: Basics
Step 0
% module avail tau (shows available TAU versions)
% module load tau
Step 1: Compilation
• Choose a TAU Makefile stub based on the kind of profiling you want. The available Makefile stubs are listed by:
  ls $TAU_ROOT_DIR/x86_64/lib/Makefile*
  e.g.: Makefile.tau-icpc-mpi-pdt-openmp-opari for an MPI+OpenMP code
• Optionally set TAU_OPTIONS to specify compilation-specific options
  – e.g.: setenv TAU_OPTIONS "-optVerbose -optKeepFiles" for verbose output and keeping the instrumented files
  – export TAU_OPTIONS='-optTauSelectFile=select.tau -optVerbose' (selective instrumentation)
• Use one of the TAU wrapper scripts to compile your code (tau_f90.sh, tau_cc.sh, or tau_cxx.sh)
  – e.g.: tau_cc.sh foo.c (generates an instrumented binary)
Step 2: Execution
• Optionally, set TAU run-time environment variables to choose the desired metrics
  – e.g.: setenv TAU_CALLPATH 1 (for call-graph generation)
  – e.g.: setenv TAU_METRICS <papi counters>
• Run the instrumented binary from step 1 normally (profile files will be generated)
Step 3: Data analysis
• Run pprof, where the profile files reside, for a text profile
• Run paraprof for visual data
• Run PerfExplorer for multiple sets of profiles
• Run Jumpshot or Vampir for trace-file analysis
Hybrid Code Profiled with TAU
Routines' time breakdown per node/thread
Hybrid Code Profiled with TAU (cont.)
Routines' exclusive time %, on node 0 & thread 0
Routines' exclusive time %, on rank 3 & thread 4
TAU Profiling: Thread Load Imbalance in the MPI Routines of a Hybrid Code
Reducing TAU Instrumentation & Measurement Overhead
• By default TAU throttles routines that are called more than 100,000 times with less than 10 microseconds per call
  – TAU accumulates the timer up to 100,000 calls, then stops and adds the remaining time to the routine's parent
• Tiny routines, or selected routines (selective instrumentation), can be excluded from instrumentation/measurement via TAU directives
• Methods of selective instrumentation are discussed next
Selective Instrumentation of Routines in TAU
• Specify a list of routines to exclude or include (case sensitive) in a text file (e.g.: select.tau)
• # is a wildcard in a routine name. It cannot appear in the first column.
  BEGIN_EXCLUDE_LIST
  Foo
  Bar
  D#EMM
  END_EXCLUDE_LIST
• Specify a list of routines to include for instrumentation:
  BEGIN_INCLUDE_LIST
  int main(int, char **)
  F1
  F3
  END_INCLUDE_LIST
• Specify either an include list or an exclude list, not both!
• Use the text file name in the compilation stage:
  export TAU_OPTIONS='-optTauSelectFile=select.tau'
Selective Instrumentation of Files in TAU
• Optionally specify a list of files to exclude or include (case sensitive) in a text file
• * and ? may be used as wildcard characters in a file name:
  BEGIN_FILE_EXCLUDE_LIST
  f*.f90
  Foo?.cpp
  END_FILE_EXCLUDE_LIST
• Specify a list of files to include for instrumentation:
  BEGIN_FILE_INCLUDE_LIST
  main.cpp
  foo.f90
  END_FILE_INCLUDE_LIST
• Specify either an include list or an exclude list, not both!
• Use the text file name in the compilation stage:
  export TAU_OPTIONS='-optTauSelectFile=select.tau'
  (select.tau is the selective instrumentation file)
Instrumenting Code Sections in TAU
• User instrumentation commands are placed in an INSTRUMENT section
• ? and * are used as wildcard characters for file names, # for routine names
• \ is the escape character for quotes
• Routine entry/exit, arbitrary code insertion
• Outer-loop level instrumentation
  BEGIN_INSTRUMENT_SECTION
  loops file="foo.f90" routine="matrix#"
  memory file="foo.f90" routine="#"
  io routine="matrix#"
  [static/dynamic] phase routine="MULTIPLY"
  dynamic [phase/timer] name="foo" file="foo.cpp" line=22 to line=35
  file="foo.f90" line=123 code=" print *, \" Inside foo\""
  exit routine="int foo()" code="cout <<\"exiting foo\"<<endl;"
  END_INSTRUMENT_SECTION
• Use the text file name in the compilation stage:
  export TAU_OPTIONS='-optTauSelectFile=select.tau'
  (select.tau is the selective instrumentation file)
TAU Commonly Used Run-time Environment Variables
• ‘setenv TAU_CALLPATH 1’ to obtain callpath profiling and a call graph
• ‘setenv TAU_CALLPATH_DEPTH <n>’ (n specifies the depth of the callpath)
• Set TAU_METRICS to a colon-separated list of PAPI counters for HW event counts
  – Example: setenv TAU_METRICS PAPI_FP_OPS:PAPI_NATIVE_<event>
• ‘setenv TAU_TRACE 1’ for trace generation
• ‘setenv TAU_COMM_MATRIX 1’ for communication-topology generation
• TAU_TRACK_MEMORY_LEAKS: setting it to 1 turns on leak detection (for use with tau_exec -memory)
• TAU_THROTTLE: set to 1 or 0 to turn throttling on/off
  – TAU_THROTTLE_NUMCALLS specifies the number of calls before testing for throttling
  – TAU_THROTTLE_PERCALL specifies the per-call threshold in microseconds
  (Throttle a routine if it is called over 100,000 times and takes less than 10 μs of inclusive time per call)
Which Performance Tool to Use?
• IPM: low-overhead tool for MPI communication statistics, message sizes, and PAPI event counts
• TAU: advanced profile and trace capability for MPI, OpenMP, hybrid, Java, Python, etc. Selective instrumentation reduces the overhead.
• SCALASCA: ‘automatic’ performance analysis tool for MPI and OpenMP routines. Filtering out the computational routines reduces the measurement overhead.
References
TAU
• http://www.cs.uoregon.edu/research/tau/tau-usersguide.pdf
• http://www.psc.edu/general/software/packages/tau/TAU-quickref.pdf
• http://www.cs.uoregon.edu/research/tau/docs/newguide/bk03ch02.html
PAPI
• http://icl.cs.utk.edu/papi/
SCALASCA
• http://www.scalasca.org/
IPM
• http://ipm-hpc.sourceforge.net/
Others
• https://www.teragrid.org/web/user-support/tau
• http://www.psc.edu/general/software/packages/tau/
• http://www.psc.edu/general/software/packages/ipm/