Introduction to Scientific Computing on the IBM SP and Regatta
Transcript of Introduction to Scientific Computing on the IBM SP and Regatta
Outline

• Friendly Users (access to the Regatta)
• hardware
• batch queues (LSF)
• compilers
• libraries
• MPI
• OpenMP
• debuggers
• profilers and hardware counters
Friendly Users
Friendly Users
• Regatta– not presently open to general user
community– will be open to a small number of
“friendly users” to help us make sure everything’s working ok
![Page 5: Introduction to Scientific Computing on the IBM SP and Regatta](https://reader036.fdocuments.us/reader036/viewer/2022062314/56814483550346895db11c33/html5/thumbnails/5.jpg)
Friendly Users (cont’d)

• Friendly-user rules
  1. We expect the friendly-user period to last 4-6 weeks
  2. No charge for CPU time!
  3. Must have “mature” code
     – code must currently run (we don’t want to test how well the Regatta runs emacs!)
     – serial or parallel
Friendly Users (3)

• Friendly-user rules (cont’d)
  4. We want feedback!
     – What did you encounter that prevented porting your code from being a “plug and play” operation?
     – If it was not obvious to you, it was not obvious to some other users!
  5. Timings are required for your code
     – use the time command
     – report wall-clock time
     – web-based form for reporting results
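The wall-clock number asked for above is the “real” line printed by the time command. A minimal sketch, with sleep 1 standing in for an actual run such as `time ./mycode < myin > myout`:

```shell
# "real" is the wall-clock time to report;
# "user" and "sys" are CPU time.
time sleep 1
```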
![Page 7: Introduction to Scientific Computing on the IBM SP and Regatta](https://reader036.fdocuments.us/reader036/viewer/2022062314/56814483550346895db11c33/html5/thumbnails/7.jpg)
Friendly Users (4)

• Friendly-user application and report form:
  – first go to the SP/Regatta repository: http://scv.bu.edu/SCV/IBMSP/
  – click on the Friendly Users link at the bottom of the menu on the left-hand side of the page
  – timings required for the Regatta and either the O2k or SP (both would be great!)
Hardware
Hal (SP)
• Power3 processors
  – 375 MHz
• 4 nodes
  – 16 processors each
  – shared memory on each node
  – 8GB memory per node
• presently can use up to 16 procs.
Hal (cont’d)
• L1 cache
  – 64 KB
  – 128-byte line
  – 128-way set associative
• L2 cache
  – 4 MB
  – 128-byte line
  – direct-mapped (“1-way” set assoc.)
Twister (Regatta)
• Power4 processors
  – 1.3 GHz
  – 2 CPUs per chip (interesting!)
• 3 nodes
  – 32 processors each
  – shared memory on each node
  – 32GB memory per node
• presently can use up to 32 procs.
Twister (cont’d)
• L1 cache
  – 32 KB per proc. (64 KB per chip)
  – 128-byte line
  – 2-way set associative
Twister (3)
• L2 cache
  – 1.41 MB
    • shared by both procs. on a chip
  – 128-byte line
  – 4-to-8-way set associative
  – unified
    • data, instructions, page table entries
Twister (4)
• L3 cache
  – 128 MB
  – off-chip
  – shared by 8 procs.
  – 512-byte “blocks”
    • coherence maintained at 128 bytes
  – 8-way set associative
Batch Queues
Batch Queues
• LSF batch system
• bqueues for a list of queues

QUEUE_NAME  PRIO  STATUS       MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SUSP
p4-mp32       10  Open:Active    1     1     -     -      0     0    0     0
p4-mp16        9  Open:Active    2     1     -     -      0     0    0     0
p4-short       8  Open:Active    2     1     -     -      0     0    0     0
p4-long        7  Open:Active   16     5     -     -      0     0    0     0
sp-mp16        6  Open:Active    2     1     -     1      2     1    1     0
sp-mp8         5  Open:Active    2     1     -     -      1     0    1     0
sp-long        4  Open:Active    8     2     -     -     20    12    8     0
sp-short       3  Open:Active    2     1     -     -      0     0    0     0
graveyard      2  Open:Inact     -     -     -     -      0     0    0     0
donotuse       1  Open:Active    -     -     -     -      0     0    0     0
Batch Queues (cont’d)
• p4 queues are on the Regatta
• sp queues are on the SP (surprise!)
• “long” and “short” queues are serial
• for details see http://scv.bu.edu/SCV/scf-techsumm.html
  – will not include Regatta info. until it’s open to all users
• bsub to submit a job
• bjobs to monitor a job
Compilers
Compiler Names
• AIX uses different compiler names to perform some tasks that are handled by compiler flags on many other systems
  – parallel compiler names differ for SMP, message-passing, and combined parallelization methods
Compilers (cont’d)
|            | Serial | MPI     | OpenMP  | Mixed     |
|------------|--------|---------|---------|-----------|
| Fortran 77 | xlf    | mpxlf   | xlf_r   | mpxlf_r   |
| Fortran 90 | xlf90  | mpxlf90 | xlf90_r | mpxlf90_r |
| Fortran 95 | xlf95  | mpxlf95 | xlf95_r | mpxlf95_r |
| C          | cc     | mpcc    | cc_r    | mpcc_r    |
| C          | xlc    | mpxlc   | xlc_r   | mpxlc_r   |
| C++        | xlC    | mpCC    | xlC_r   | mpCC_r    |

gcc and g++ are also available
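As a sketch, the same Fortran 77 source would be built with a different driver for each programming model (mycode.f is a placeholder; -qsmp=omp is covered in the OpenMP section):

```
xlf      mycode.f                # serial
mpxlf    mycode.f                # MPI
xlf_r    -qsmp=omp mycode.f      # OpenMP
mpxlf_r  -qsmp=omp mycode.f      # mixed MPI + OpenMP
```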
Compilers (3)
• xlc default flags
  -qalias=ansi
    • optimizer assumes that pointers can only point to an object of the same type (potentially better optimization)
  -qlanglvl=ansi
    • ANSI C
  -qro
    • string literals (e.g., char *p = "mystring";) placed in “read-only” memory (text segment); cannot be modified
Compilers (4)
• xlc default flags (cont’d)
  -qroconst
    • constants placed in read-only memory
Compilers (5)

• cc default flags
  -qalias=extended
    • optimizer assumes that pointers may point to any object whose address is taken, regardless of type (potentially weaker optimization)
  -qlanglvl=extended
    • extended (not ANSI) C
    • “compatibility with the RT compiler and classic language levels”
  -qnoro
    • string literals (e.g., char *p = "mystring";) can be modified
    • may use more memory than -qro
Compilers (6)

• cc default flags (cont’d)
  -qnoroconst
    • constants not placed in read-only memory
Default Fortran Suffixes
| Compiler | Default suffix |
|----------|----------------|
| xlf      | .f             |
| xlf90    | .f             |
| f90      | .f90           |
| xlf95    | .f             |
| f95      | .f             |
| mpxlf    | .f             |
| mpxlf90  | .f90           |
| mpxlf95  | .f             |

Same except for suffix
Compiler flags
• Specify source file suffix
  -qsuffix=f=f90 (lets you use xlf90 with the .f90 suffix)
• 64-bit
  -q64
  – use if you need more than 2GB
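Combining the two flags above, a hypothetical build of free-form source that needs a large address space might look like:

```
# mycode.f90 is a placeholder; -qsuffix lets xlf90 accept the
# .f90 suffix, and -q64 allows more than 2GB of data.
xlf90 -q64 -qsuffix=f=f90 mycode.f90
```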
flags cont’d
• Presently a foible on twister (Regatta)
  – if compiling with -q64 and using MPI, must compile with an mp…_r compiler, even if you’re not using SMP parallelization
flags (3)
• IBM optimization levels
  -O   basic optimization
  -O2  same as -O
  -O3  more aggressive optimization
  -O4  even more aggressive optimization; optimize for the current architecture; IPA (interprocedural analysis)
  -O5  aggressive IPA
flags (4)
• If using -O3 or below, can optimize for the local hardware (done automatically for -O4 and -O5):
  -qarch=auto   optimize for the resident architecture
  -qtune=auto   optimize for the resident processor
  -qcache=auto  optimize for the resident cache
flags (5)
• If you’re using IPA and you get warnings about partition sizes, try -qipa=partition=large
• default data segment limit is 256MB
  – the data segment contains static, common, and allocatable variables and arrays
  – can increase the limit to a maximum of 2GB with 32-bit compilation: -bmaxdata:0x80000000
  – can use more than 2GB of data with -q64
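As a sanity check on the hex value above, 0x80000000 bytes is exactly 2GB:

```shell
# 0x80000000 = 2147483648 bytes = 2 * 1024^3, the 32-bit
# -bmaxdata ceiling. Both lines print 2147483648.
printf '%d\n' 0x80000000
echo $((2 * 1024 * 1024 * 1024))
```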
flags (6)
• -O5 does not include function inlining
• function inlining flags:
  -Q                 compiler decides which functions to inline
  -Q+func1:func2     only inline the specified functions
  -Q -Q-func1:func2  let the compiler decide, but do not inline the specified functions
Libraries
Scientific Libraries
• Contain
  – Linear Algebra Subprograms
  – Matrix Operations
  – Linear Algebraic Equations
  – Eigensystem Analysis
  – Fourier Transforms, Convolutions and Correlations, and Related Computations
  – Sorting and Searching
  – Interpolation
  – Numerical Quadrature
  – Random Number Generation
Scientific Libs. Cont’d
• Documentation: go to the IBM Repository: http://scv.bu.edu/SCV/IBMSP/
  – click on Libraries
• ESSLSMP
  – for use with “SMP processors” (that’s us)
  – some serial, some parallel
    • parallel versions use multiple threads
    • thread-safe; serial versions may be called within multithreaded regions (or on a single thread)
  – link with -lesslsmp
Scientific Libs. (3)
• PESSLSMP
  – message-passing (MPI, PVM)
  – link with -lpesslsmp -lesslsmp -lblacssmp
Fast Math
• MASS library
  – Mathematical Acceleration SubSystem
    • faster versions of some Fortran intrinsic functions
      – sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y
    • work with Fortran or C
    • differ from the standard functions in the last bit (at most)
Fast Math (cont’d)
• simply link to the MASS library:
  Fortran: -lmass
  C:       -lmass -lm
• sample approx. speedups
  exp           2.4
  log           1.6
  sin           2.2
  complex atan  4.7
Fast Math (3)
• Vector routines offer even more speedup, but require minor code changes
• link with -lmassv
• subroutine calls
  – prefix the name with vs for 4-byte reals (single precision) and v for 8-byte reals (double precision)
Fast Math (4)
• example: single-precision exponential
  call vsexp(y,x,n)
  – x is the input vector of length n
  – y is the output vector of length n
• sample speedups (single & double)
  exp            9.7    6.7
  log           12.3   10.4
  sin           10.0    9.8
  complex atan  16.7   16.5
Fast Math (5)
• For details see the following file on hal: /usr/lpp/mass/MASS.readme
MPI
MPI
• MPI works differently on the IBM than on other systems
• first compile the code using a compiler with the mp prefix, e.g., mpcc
  – this automatically links to the MPI libraries; do not use -lmpi
POE
• Parallel Operating Environment
  – controls parallel operation, including running MPI code
Running MPI Code
• Do not use mpirun!
• poe mycode -procs 4
• file re-direction:
  poe mycode < myin > myout -procs 4
  – note: no quotes
• a useful flag: -labelio yes
  – labels output with the process number (0, 1, 2, …)
  – can also setenv MP_LABELIO yes
OpenMP
SMP Compilation
• OpenMP
  – append _r to the compiler name
  – use the -qsmp=omp flag
    SGI: f77 -mp mycode.f
    IBM: xlf_r -qsmp=omp mycode.f
• Automatic parallelization
    SGI: f77 -apo mycode.f
    IBM: xlf_r -qsmp mycode.f
SMP Compilation cont’d
• Listing files for auto-parallelization
    SGI: f77 -apo list mycode.f
    IBM: xlf_r -qsmp -qreport=smplist mycode.f
SMP Environment
• Per-thread stack limit
  – default 4MB
  – can be increased with an environment variable:
    setenv XLSMPOPTS $XLSMPOPTS\:stack=size
    where size is the new size limit in bytes
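Since size is given in bytes, it helps to compute it explicitly; e.g., a hypothetical 64MB per-thread stack:

```shell
# 64 MB in bytes, for use as the stack= suboption, e.g. (csh,
# simplified sketch):  setenv XLSMPOPTS stack=67108864
echo $((64 * 1024 * 1024))    # prints 67108864
```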
![Page 49: Introduction to Scientific Computing on the IBM SP and Regatta](https://reader036.fdocuments.us/reader036/viewer/2022062314/56814483550346895db11c33/html5/thumbnails/49.jpg)
Running SMP
• Running is the same as on other systems, e.g.,
#!/bin/tcsh
setenv OMP_NUM_THREADS 4
mycode < myin > myout
exit
OpenMP functions
• On the IBM, OpenMP Fortran functions must be declared:
  integer OMP_GET_NUM_THREADS
  (not necessary on SGI)
Debuggers
Debuggers
• dbx - standard command-line UNIX debugger
• pdbx - parallel version of dbx
  – editorial comment: I have used command-line parallel debuggers. I prefer print statements.
• xldb - serial and multi-thread debugger with a graphical interface
Debuggers cont’d
• xpdbx - parallel debugger with graphical interface
• pedb - synonymous with xpdbx
xldb
xldb cont’d
• xldb mycode
  – a window pops up with the source, etc.
• group of blue bars at the top right
  – click on a bar to open its window
  – to minimize a window, click on the bar at its top to get a menu, then click on “minimize”
• to set a breakpoint, click on a source line
• to navigate, see the “commands” window
pedb
pedb cont’d
• pedb mycode
  – a window pops up with the source, etc.
• to set a breakpoint, double-click on a source line
• to delete a breakpoint, right-click the icon in the left margin and select “delete”
pedb cont’d (3)
• to navigate, use the buttons below the source listing
  – the names are cut off; they should be: step over, step into, step return, continue, halt, play, stop
• “tasks” (processes for MPI) are chosen using buttons in the “task” window
pedb cont’d (4)
• to examine data, right-click a task button in the “local data” or “global data” window
  – Fortran will not have “global data”
• select “open variable viewer”
  – all current variables will be listed
  – use “Find” to find a specific variable
Profilers and Hardware Counters
Profiling - prof
• Compile code with -p
• run code normally
• the file mon.out will be created for a serial run; mon.out.0, mon.out.1, etc. for a multiple-process run
• prof > prof.serial
• prof -m mon.out.0 > prof.parallel.1_4
• parallel prof presently broken on twister
Prof (cont’d)

Name        %Time  Seconds  Cumsecs     #Calls  msec/call
.conduct     22.5    97.00    97.00         10  9700.0
.btri         7.8    33.44   130.44     189880     0.1761
.kickpipes    7.7    33.37   163.81
.getxyz       7.3    31.60   195.41        323    97.83
.rmnmod       5.3    22.69   218.10  309895200     0.0001
.__mcount     3.5    15.25   233.35
.putxyz       2.7    11.83   245.18         60   197.2
.smatrx       2.4    10.43   255.61     189880     0.0549
.sy           2.4    10.23   265.84    3024000     0.0034
.getq         2.4    10.20   276.04        269    37.92
.pertri       2.0     8.59   284.63     288000     0.0298
Profiling - gprof
• Also have gprof available
  – more extensive profile
• compile with -pg
• the file gmon.out will be created
• gprof >& myprof
  – note that gprof output goes to stderr (&)
• for multiple procs. (MPI), copy or link gmon.out.n to gmon.out, then run gprof
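The last step might be sketched as follows (file names per the slide; mycode is a placeholder):

```
# Examine process 0's profile, then repeat for process 1, etc.
ln -sf gmon.out.0 gmon.out
gprof mycode >& myprof.0
```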
gprof (cont’d)
ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

                                   called/total      parents
index  %time   self  descendents   called+self   name        index
                                   called/total      children

        0.00   340.50      1/1         .__start [2]
[1]    78.3    0.00   340.50      1    .main [1]
        2.12   319.50     10/10        .contrl [3]
        0.04     7.30     10/10        .force [34]
        0.00     5.27      1/1         .initia [40]
        0.56     3.43      1/1         .plot3da [49]
        0.00     1.27      1/1         .data [73]
gprof (3)
ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

  %    cumulative    self                self     total
 time    seconds    seconds      calls  ms/call  ms/call   name
 20.5      89.17      89.17         10  8917.00  10918.00  .conduct [5]
  7.6     122.34      33.17        323   102.69    102.69  .getxyz [8]
  7.5     154.77      32.43                                .__mcount [9]
  7.2     186.16      31.39     189880     0.17      0.17  .btri [10]
  7.2     217.33      31.17                                .kickpipes [12]
  5.1     239.58      22.25  309895200     0.00      0.00  .rmnmod [16]
  2.3     249.67      10.09        269    37.51     37.51  .getq [24]
Xprofiler
• Graphical interface to gprof
• compile with -g -pg -Ox
  – Ox represents whatever level of optimization you’re using (e.g., O5)
• run code
  – produces a gmon.out file
• type the xprofiler command
Hardware Counters
• HPM Toolkit
  – see /usr/local/hpm/doc/HPM_README.html for documentation
• hpmcount produces hardware counter data for pre-defined or custom sets of events
  – presently only working on hal; should be available on twister “soon”
• hpmcount -o counter_data mycode
  – writes counter data to the ASCII file specified with -o
Hardware Counters (cont’d)
• For MPI:
  poe hpmcount -o counter_data mycode -procs 4
• Each set contains data from 8 counters
  – some counters are duplicated between sets
• 4 data sets are available
  – to specify a set, use -sn, where n is the set number:
    hpmcount -o counter_data -s2 mycode
  – default is set number 1
Hardware Counters (3)
• default data set (set 1)
  – Cycles
  – Instructions completed
  – TLB misses
  – Stores completed
  – Loads completed
  – FPU 0 instructions
  – FPU 1 instructions
  – FMAs executed
Hardware Counters (4)
• set 2
  – Cycles
  – Instructions completed
  – TLB misses
  – Loads dispatched
  – Load misses in L1
  – Stores dispatched
  – Store misses in L1
  – Load store unit idle
Hardware Counters (5)
• set 3
  – Cycles
  – Instructions dispatched
  – Instructions completed
  – No instructions completed
  – Instruction cache misses
  – FXU 0 instructions
  – FXU 1 instructions
  – FXU 2 instructions
Hardware Counters (6)
• set 4
  – Cycles
  – Loads dispatched
  – Load misses in L1
  – Master generated load op not retried
  – Stores dispatched
  – Store misses in L2
  – Completion unit waiting on load
  – Load store unit idle
MPI Profiler
• Not officially supported by IBM
  – written by Bob Walkup at Watson Lab
• simply link with -L/usr/local/mpi_trace -lmpitrace
• run code normally
• an ASCII trace file will appear in the working directory for each process:
  mpi_profile.0, mpi_profile.1, ...
MPI Profiler (cont’d)
---------------------------------------------------------------------
MPI Routine          #calls     avg. bytes   time(sec)
---------------------------------------------------------------------
MPI_Comm_size             1            0.0       0.000
MPI_Comm_rank             1            0.0       0.000
MPI_Send                240     86940000.0      30.578
MPI_Bcast                80     86940000.0      24.913
---------------------------------------------------------------------
total communication time = 55.491 seconds.
total elapsed time       = 107.604 seconds.
user cpu time            = 106.970 seconds.
system time              = 0.620 seconds.
maximum memory size      = 253120 KBytes.
MPI Profiler (3)
Message size distributions:

MPI_Send     #calls    avg. bytes   time(sec)
                  3       40000.0       0.006
                 51    97000000.0       6.715

MPI_Bcast    #calls    avg. bytes   time(sec)
                  2      820000.0       0.006
                 12    48086666.7       3.902
mpihpm
• another of Bob Walkup’s utilities
• combines mpitrace and the hardware counters
• link with -L/usr/local/mpi_trace -lmpihpm -lpmapi
• presently Power4 (twister) only
mpihpm (cont’d)
• counters are contained in 58 groups of 8 counters each
  – a description of the counter groups is in /usr/local/mpi_trace/power4.ref
• specify a group through an environment variable:
  setenv HPM_GROUP 53
• the writer of mpihpm recommends groups 53, 56, and 58
  – they must be run one at a time
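Since the groups must be run one at a time, a run script might simply loop over them; a sketch in sh syntax (the slides use csh's setenv), with mycode as a placeholder:

```
# One run per recommended counter group.
for g in 53 56 58; do
    HPM_GROUP=$g poe mycode -procs 4
done
```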
mpihpm (3)
• Run code normally
• an ASCII trace file will appear for each process in the working directory:
  mpi_profile_group53.0 …
• output is the same as that from mpitrace, with counter data added at the bottom of the report
mpihpm (4)
--------------------------------------------------------------------------
Power-4 counter report for group 53.
pm_pe_bench1, PE Benchmarker group for FP analysis
--------------------------------------------------------------------------
   401  FPU executed FDIV instruction (PM_FPU_FDIV)
  3453  FPU executed multiply-add instruction (PM_FPU_FMA)
  5357  FXU produced a result (PM_FXU_FIN)
  3840  FPU produced a result (PM_FPU_FIN)
 98083  Processor cycles (PM_CYC)
     0  FPU executed FSQRT instruction (PM_FPU_FSQRT)
 62065  Instructions completed (PM_INST_CMPL)
  1961  FPU executing FMOV or FEST instructions (PM_FPU_FMOV_FEST)