NCCS User Forum March 20, 2012


Breakout:

Debugging and Performance Tuning on Discover

Chongxun (Doris) [email protected]

Debugging on Discover

- Compilers can do some debugging work very easily and effectively
  - Array bound checking
  - Uninitialized variables and arrays
  - Floating point exception catching
- Tools are a great help for debugging
  - idb, gdb, DDD, TotalView

Compiler options

Debugging tools on Discover

Before you start debugging

- What tool to use?
  - Sequential code? ddd/gdb, idb
  - Threaded code? idb, totalview
  - MPI or MPI/OpenMP hybrid code? totalview
- Got a core dump?
  - setenv decfor_dump_flag y (allows a core dump to be generated)
  - limit coredumpsize xxxx
- -g has to be used for debugging. Using -g automatically adds -O0 unless an optimization level is given explicitly.


Debugging a threaded code with idb

    setenv OMP_NUM_THREADS 4
    idb ./omp.exe

TotalView

An interactive tool that lets you debug serial, multi-threaded, and multi-processor programs, with support for Fortran and C/C++.

NCCS User Forum, July 19, 2011*

TotalView

- Details on how to configure TV for your run and how to use TV for serial or MPI/OpenMP jobs are in the NCCS Primer.
- As announced in March 2012, the base TV solution now includes ReplayEngine and CUDA debugging support for no additional fee.
- Major features:
  - Parallel debugging: MPI, Pthreads, OpenMP, CUDA
  - Advanced memory debugging with MemoryScape
  - Reverse debugging with ReplayEngine
  - Add-on ThreadSpotter for optimization

Performance Analysis

Typical Bottlenecks
- Your Application: synchronization, load balance, communication, memory usage, I/O usage
- System Architecture: memory hierarchy, network latency, processor architecture, I/O system setup
- Software: compiler options, libraries, runtime environment, communication protocols

Instrumentation, Measurement, Analysis, Optimization

General Tuning Considerations

- Auto-parallelization
  - Through library calls
  - Through compiler options
- I/O tuning
  - TMPDIR. Run-time environment
  - Reduce number of I/O requests
  - Use unformatted files
  - Use large record sizes
- Communication tuning
  - Balance communication
  - Communication/computation overlapping
  - Run-time environment
- Memory and CPU tuning
  - Cache misses (stride-1 access, padding)
  - TLB misses (large pages, loop blocking)
  - Page faults
  - Loop optimization
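The I/O tuning items above (fewer I/O requests, unformatted files, large records) can be sketched in Python. This is an illustration only, not code from the talk: the file names and the array size are made up, and Python's own buffering hides part of the per-request cost that many small formatted writes would pay in a compiled code.

```python
import os
import tempfile
from array import array

# Hypothetical data set: 10,000 double-precision values.
values = array("d", (i * 0.1 for i in range(10000)))

# tempfile honors the TMPDIR environment variable, so scratch files
# land wherever the run-time environment points them.
with tempfile.TemporaryDirectory() as scratch:
    text_path = os.path.join(scratch, "formatted.txt")
    bin_path = os.path.join(scratch, "unformatted.bin")

    # Formatted output: one small write call per value, and every
    # number is converted to text first.
    with open(text_path, "w") as f:
        for v in values:
            f.write(f"{v:.17e}\n")

    # Unformatted output: the whole array goes out as one large record.
    with open(bin_path, "wb") as f:
        values.tofile(f)

    text_bytes = os.path.getsize(text_path)
    bin_bytes = os.path.getsize(bin_path)

    # Reading the binary record back needs no text parsing.
    restored = array("d")
    with open(bin_path, "rb") as f:
        restored.fromfile(f, len(values))

print("formatted bytes:  ", text_bytes)
print("unformatted bytes:", bin_bytes)
print("round trip exact: ", restored == values)
```

The unformatted file is smaller and is written with a single request; the formatted file needs one conversion and one write call per value.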

Optimization with Compiler Options - Intel

- Start with a reasonably good set of options and build up from there
- Default is -O2. -O = -O2
- -O3: recommended over -O2 only for codes with loops that heavily use FP calculations and process large data sets
- Codes with many small function calls: -ip -ipo
- Codes with many FP operations: -fp-model fast=2
- Be careful with -fast. -fast = -O3 -ipo -no-prec-div -static
- -xSSE4.2: generates optimized code specialized for the Intel processors
- -openmp for OpenMP code
- Fall back to -fp-model precise if correctness is an issue
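One reason aggressive floating-point options can change results is that floating-point addition is not associative, so a compiler that is allowed to reorder operations may produce slightly different answers. The short Python sketch below only illustrates that arithmetic effect; it does not reproduce what any particular compiler flag does, and the values are chosen purely for demonstration.

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 0.1, 0.2, 0.3
left_to_right = (a + b) + c   # evaluation order as written
reassociated = a + (b + c)    # an order a reordering optimizer might pick

print(left_to_right == reassociated)   # False
print(left_to_right, reassociated)

# The effect is larger when magnitudes differ widely.
big, small = 1e16, 1.0
absorbed = (big + small) - big    # small is lost when added to big first
preserved = (big - big) + small   # small survives with a different grouping

print(absorbed, preserved)
```

If a code is sensitive to differences like these, that is a sign that a stricter floating-point model may be needed for it.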

Optimization with Compiler Options (Cont'd)

- Intel compilers provide different reports to identify performance issues
  - Optimization report: -opt-report 3 -opt-report-phase=hlo (or ipo, hpo, ecg_swp)
  - Vectorization report: -vec-report 3
  - Auto-parallelization report: -par-report 3
  - OpenMP report: -openmp-report 2

Run-time Environment Tuning

- Select proper process layout. Default is group round-robin.
- Set I_MPI_PERHOST to override the default layout:
  - I_MPI_PERHOST=n
  - I_MPI_PERHOST=allcores: maps to cores on a node
- Or use mpirun/mpiexec options:
  - mpirun -perhost n
- For MPI/OpenMP hybrid jobs:
  - Set OMP_NUM_THREADS and use -perhost n
  - Set I_MPI_PIN_DOMAIN=auto (or omp)
  - Set KMP_AFFINITY=compact
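To see what a per-host setting does to rank placement, the following Python sketch models a simplified group round-robin layout: n consecutive ranks go to the first node, the next n to the second node, and so on, wrapping around. The function names, node counts, and core counts are hypothetical, and the model is an assumption for illustration, not the MPI library's actual placement code.

```python
def node_for_rank(rank, perhost, num_nodes):
    """Node index for a rank under a simplified group round-robin model."""
    return (rank // perhost) % num_nodes


def layout(num_ranks, perhost, num_nodes):
    """List the ranks placed on each node."""
    nodes = [[] for _ in range(num_nodes)]
    for rank in range(num_ranks):
        nodes[node_for_rank(rank, perhost, num_nodes)].append(rank)
    return nodes


# Hypothetical job: 8 ranks on 4 nodes.
packed = layout(8, perhost=2, num_nodes=4)   # neighbors share a node
spread = layout(8, perhost=1, num_nodes=4)   # neighbors on different nodes

# Hypothetical hybrid job: 8-core nodes, 4 OpenMP threads per rank,
# so each node has room for 2 ranks.
cores_per_node = 8
omp_num_threads = 4
hybrid_perhost = cores_per_node // omp_num_threads

print(packed)
print(spread)
print(hybrid_perhost)
```

With perhost=2, neighboring ranks share a node; with perhost=1, they land on different nodes. Which is better depends on the application's communication pattern.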

Run-time Environment Tuning (Cont'd)

- Use scalable DAPL progress for large jobs
  - Set I_MPI_DAPL_SCALABLE_PROGRESS to 1 to enable the scalable algorithm for the DAPL read progress engine. It offers a performance advantage for large (>64) numbers of processes.
- Use Intel MPI lightweight statistics
  - Set I_MPI_STATS to a non-zero integer value to gather MPI communication statistics
  - Adjust I_MPI_STATS_SCOPE to increase the effectiveness of the analysis
  - Example: I_MPI_STATS=3, I_MPI_STATS_SCOPE=coll

Run-time Environment Tuning (Cont'd)

- Adjust the eager/rendezvous protocol threshold
  - Eager sends data immediately, regardless of whether a matching receive request is available.
  - Rendezvous notifies the receiving side that data is pending and transfers it once the receive request is set.
  - I_MPI_EAGER_THRESHOLD controls the high-level protocol switchover point. Shorter messages are sent using the eager protocol; larger ones are sent using the more memory-efficient rendezvous protocol.
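The switchover rule can be pictured with a few lines of Python. This is a toy model, not the MPI library's implementation: the function name is made up, the default value of 262144 bytes is an assumption to check against the Intel MPI reference for the installed version, and treating a message exactly at the threshold as eager is also an assumption.

```python
# Assumed default threshold in bytes; verify against the installed MPI version.
ASSUMED_DEFAULT_EAGER_THRESHOLD = 262144


def choose_protocol(message_bytes, eager_threshold=ASSUMED_DEFAULT_EAGER_THRESHOLD):
    """Pick the protocol a message of the given size would use in this toy model."""
    if message_bytes <= eager_threshold:
        return "eager"        # sent immediately, buffered at the receiver
    return "rendezvous"       # handshake first, then transfer


for size in (1024, 262144, 262145, 4 * 1024 * 1024):
    print(size, choose_protocol(size))

# Raising the threshold moves mid-sized messages to the eager protocol,
# at the cost of more buffer memory on the receiving side.
print(choose_protocol(1024 * 1024, eager_threshold=2 * 1024 * 1024))
```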

Performance Analysis Tools (Check Primer for usage!)

- Interval Timers: elapsed time between two timer calls
  - time (shell); mpi_wtime, system_clock (subroutines)
- Application Profiling Tools: periodically sample the program counter
  - gprof
- Event Counters: count the number of times hardware events occur
  - TAU
- Event Tracers: record complete sequences of events
  - I_MPI_STATS, mpiP, TAU

Notes:
- Intrusive. Longer run time
- Only profiles CPU usage. Lacks I/O and communication info
- time is simple to use
- Provides hardware performance counter info (e.g., cache misses) and hardware metrics (e.g., loads per cache miss or # of page faults)
- Detailed information about communication time
- Good for analyzing scaling
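An interval timer simply records the clock before and after a region of code and reports the difference. The Python sketch below shows the pattern with time.perf_counter; in a Fortran or MPI code the two clock reads would be calls such as system_clock or mpi_wtime. The helper name and the timed loop are made up for illustration.

```python
import time
from contextlib import contextmanager


@contextmanager
def interval_timer(label, results):
    """Store the elapsed wall-clock time of the enclosed block in results[label]."""
    start = time.perf_counter()          # first timer call
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start   # second timer call


timings = {}
with interval_timer("sum_loop", timings):
    total = sum(i * i for i in range(100000))

print(f"sum_loop took {timings['sum_loop']:.6f} s")
```

Timing a few coarse regions this way is cheap and usually enough to decide where a profiler or tracer is worth the extra overhead.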

Top 3 ways to avoid performance problems - 1

Never, ever write your own code unless you absolutely have to
- Libraries, libraries, libraries!
  - MKL library
  - GNU Scientific Library (GSL), under /usr/local/other/SLES11/gsl
  - LAPACK, under /usr/local/other/SLES11/lapack
  - PETSc, the Portable, Extensible Toolkit for Scientific Computation, under /usr/local/other/SLES11/petsc
- Spend time doing research; chances are you will find a package that suits your needs

Top 3 ways to avoid performance problems - 2

Let the compiler do the work
- Modern compilers are much better now at optimizing most code and providing hints for your manual optimization
- For example, -xSSE4.2 -O3 -no-prec-div are recommended for the latest Intel processors
- Spend some time reading the man page and ask around

Top 3 ways to avoid performance problems - 3

Never use more data than absolutely necessary
- Only use high precision when necessary
- A reduction in the amount of data the CPU needs ALWAYS translates to an increase in performance
- Always keep in mind that the memory subsystem and the network are the ultimate bottlenecks
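The data-volume point is easy to quantify: the same number of values stored in single precision takes half the bytes of double precision, so half as much has to move through memory and across the network. The Python sketch below uses the standard array module to show the sizes and the precision that is given up; the array length is arbitrary.

```python
from array import array

n = 1_000_000  # arbitrary number of values for illustration

double_data = array("d", [0.0]) * n   # double precision
single_data = array("f", [0.0]) * n   # single precision

double_bytes = double_data.itemsize * len(double_data)
single_bytes = single_data.itemsize * len(single_data)

print("double precision bytes:", double_bytes)
print("single precision bytes:", single_bytes)

# The trade-off: single precision keeps only about 7 significant digits.
stored = array("f", [0.1])[0]
rounding_error = abs(stored - 0.1)
print("0.1 stored in single precision:", stored)
```

Whether that rounding error is acceptable depends on the algorithm, which is why the advice is to use high precision only where it is needed.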

Top 3 ways to avoid performance problems - 3.5

Finally, make friends with Computer Scientists!

Learning even a little about modern computer architectures will result in much better code.
