IEEE HPDC 9 Conference

Software Configuration for Clusters in a Production HPC Environment

Doug Johnson, Jim Giuliani, and Troy Baer

Ohio Supercomputer Center

Introduction

• Linux clusters are becoming mature

• No longer just for post-processing.

• Rich tool environment

• Third-party adoption
  – LS-Dyna
  – Fluent
  – Large choice of commercial compilers
  – Integrators

Tutorial Content

• Development Environment
  – User Environment
  – Compilers and Languages
  – Programming Models

• Application Performance Analysis
  – Non-intrusive
  – Intrusive

• System Management
  – Job Scheduling
  – System

User Environment

• Shell and Convenience Environment Variables

• Interface with Mass Storage

• Parallel Debuggers

• Languages and Compilers

Shell and Convenience Environment Variables

• Need to present users with a uniform, well-designed shell environment.

• Documentation of needed shell environment variables for different programs is critical for usability.

• Must support users’ shell preferences; this is a personal thing.
  – Forcing a shell is akin to forcing vi or emacs.

• OSC has used a run-alike version of Cray modules.

• Mixed results.
  – Uniform environment.
  – Reliability problems.

• Convenience environment variables
  – $TMPDIR, $USER
  – $MPI_FFLAGS, $MPI_C[XX]FLAGS, and $MPI_LIBS
  – Environment variables for compiling ScaLAPACK programs.
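As an illustration of how these might be used, compile and link lines could look like the following (file names are hypothetical; the variables expand to the site’s MPI include and library options):

    g77 $MPI_FFLAGS -O2 -o mycode mycode.f $MPI_LIBS
    gcc $MPI_CFLAGS -O2 -o mycode mycode.c $MPI_LIBS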

Interface with Mass Storage

• Linux has followed a rocky development path.

• NFS Version 3 support is just now becoming stable.
  – Supports 64-bit filesystems
  – Network Lock Manager (NLM)
  – Asynchronous writes

• What is the hope of achieving good performance with NFS and Linux?

• The following plots can show us; first, a few tuning parameters.
  – Modified /proc/sys/net/core/rmem_max, wmem_max, rmem_default, and wmem_default:

      echo 2097152 > /proc/sys/net/core/[rw]mem_max
      echo 524288 > /proc/sys/net/core/[rw]mem_default

  – rmem_default and wmem_default set the defaults for SO_RCVBUF and SO_SNDBUF.
  – Warning: On limited-memory systems with many socket communications, these settings may cause memory pressure.
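The new values can be verified simply by reading the files back, for example:

    cat /proc/sys/net/core/rmem_max
    cat /proc/sys/net/core/wmem_default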

TCP Performance

[Chart: TCP stream performance, in Megabits/second, versus block size in bytes (up to ~9,000,000), for HIPPI, Gigabit Ethernet, and Fast Ethernet.]

TCP Performance

[Chart: TCP stream performance, in Megabits/second, versus block size in bytes (up to 250,000), for HIPPI, Gigabit Ethernet, and Fast Ethernet.]

UDP Performance

./netperf -l 60 -H fe.ovl.osc.edu -i 10,2 -I 99,10 -t UDP_STREAM -- -m 1472 -s 32768 -S 32768

UDP UNIDIRECTIONAL SEND TEST to fe.ovl.osc.edu : +/-5.0% @ 99% conf.

Socket  Message  Elapsed      Messages
Size    Size     Time         Okay     Errors   Throughput
bytes   bytes    secs         #        #        10^6bits/sec

131070  1472     59.99        3229909  0        634.03
524288           59.99        2169706           425.91

Debugging and Parallel Programs

• Developing code always introduces bugs.

• Strategic print statements sometimes are not enough.

• Postmortem analysis.
  – debugger a.out core

• Parallel programs started in the same NFS directory may be a problem.
  – Multiple processes trying to dump to the same file.

• Kernel patch to make the name of the core file unique.
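A postmortem session might look like the following (gdb is shown as a representative debugger; the file and variable names are hypothetical):

    gdb ./a.out core        # load the executable together with the core file
    (gdb) bt                # print the stack backtrace at the point of the crash
    (gdb) frame 2           # select a stack frame of interest
    (gdb) print ivar        # inspect a variable in that frame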

TotalView Parallel Debugger

• Available from http://www.etnus.com.

• Debugger for MPI, OpenMP, and threaded programs on many platforms.
  – Some features such as OpenMP and threads are not supported on some platforms.

[Diagram: MPI processes with send and receive buffers; tvdmain is attached to the process with MPI_Comm_rank = 0, and a tvdsvr daemon to each of the others.]

TotalView Debugger Installation

• Can be downloaded from the Etnus website.

• Will need to be installed on an NFS filesystem visible to all nodes, or installed in the same location on local disk on each node.

• Simple script-based install, follow prompts.

• Environment variables are critical!
  – Must be present for rsh commands.
  – /etc/profile.d/[] will not be evaluated.
  – bash and bash2 will evaluate .bashrc
  – csh and tcsh will evaluate .cshrc
  – pdksh will not evaluate any “.” files.
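In practice this means the TotalView search path has to be set in the per-shell “.” file that rsh-spawned shells actually read. A minimal sketch, assuming a hypothetical install prefix of /usr/totalview:

    # csh/tcsh users, in ~/.cshrc:
    setenv PATH ${PATH}:/usr/totalview/bin

    # bash users, in ~/.bashrc:
    export PATH=$PATH:/usr/totalview/bin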


TotalView Interface

Message Queues

• The message queue window provides:
  – MPI_COMM_WORLD info
  – Size
  – Rank
  – Pending sends and receives

TotalView Features

• Can have multiple object files open in the same TotalView session (local and remote).

• Automatic attach to child processes.

• X-Window and CLI user interfaces.

• Data visualization.

Languages and Compilers

• With the widespread adoption of Linux, the availability of commercial compilers has increased.
  – Lahey Fortran 95, http://www.lahey.com
  – Portland Group Compiler Suite, http://www.pgroup.com
  – Absoft Fortran 95, C, and C++, http://www.absoft.com
  – NAG Fortran 95, http://www.nag.com
  – SGI Compiler Suite for Linux IA-64, http://www.oss.sgi.com

GCC

• Ubiquitous C compiler with additional languages added over the years; easily re-targeted to new platforms.

• Languages supported include C, C++, Fortran 77…

• Other languages are supported but won’t be covered in this tutorial; the three above cover the majority of scientific and engineering codes.

GCC Advantages:

• Free, in the monetary and liberty sense of the word.

• Common back-end.

• Flexible
  – Extendable
  – Inline assembly
  – Extra language features

GCC

Language extensions:

• Complex

• __alignof__

• inline
  – Only works with -O or greater optimization level.

• Lexical scoping for nested functions.

• Inline assembly.
  – C expressions are allowed as operands.

GCC and EGCS

• The compiler back-end is lacking in optimizations for specific architectures.

• The C++ front-end has in the past been criticized for not tracking the C++ standard.

• Since the Cygnus EGCS/GCC integration, C++ performance and conformance have significantly improved.

STL Performance: http://www.physics.ohio-state.edu/~wilkins/computing/benchmark/STC.html

Created to test the implementation of STL. From the web site:

“To verify how efficiently C++ (and in particular STL) is compiled by the present day compilers, benchmark outputs 13 numbers computed with increasing abstractions. In the ideal world these numbers should be the same. In the real world, however, …”

STC Description

– 0 - uses a simple Fortran-like for loop.
– 1-12 - use the STL-style accumulate template function with a plus function object.
– 1, 3, 5, 7, 9, 11 - use doubles.
– 2, 4, 6, 8, 10, 12 - use Double (a double wrapped in a class).
– 1, 2 - use regular pointers.
– 3, 4 - use pointers wrapped in a class.
– 5, 6 - use pointers wrapped in a reverse-iterator adaptor.
– 7, 8 - use wrapped pointers wrapped in a reverse-iterator adaptor.
– 9, 10 - use pointers wrapped in a reverse-iterator adaptor wrapped in a reverse-iterator adaptor.
– 11, 12 - use wrapped pointers wrapped in a reverse-iterator adaptor wrapped in a reverse-iterator adaptor.

G77

• Fortran “front-end” that uses the GCC “back-end” for code generation.

• Implements most of the Fortran 77 standard.

• Not completely integrated with GCC
  – No inline assembly
  – Warning of implicit type conversions

• No aggressive optimizations.

Portland Group Compiler Suite

• Vendor of compilers for traditional HPC systems.

• Contracted by DOE and Intel to provide compilers for Intel ASCI Red.

• Optimizing compiler for Intel P6 core.

• Linux, Solaris and MS Windows (X86 only).

• Compiler suite includes C, C++, Fortran 77 and 90, and HPF.

• Link compatible with GCC objects and libraries.

• Includes debugger and profiler (can use GDB).

Optimizations for Portland Compilers

• The vectorizer can optimize countable loops with large arrays.

• Use -Minfo=loop to have the compiler report which optimizations (unrolling, vectorization) were applied to the loops.

• The cache size can be specified to maximize cache re-use, -Mvect=cachesize:…

• Use -Mneginfo=loop to provide information about why a loop was not a candidate for vectorization.

• Can specify number of times to unroll a loop.

• Can use -Minline to inline functions. This can improve the performance of calls to functions inside of subroutines.

– Is not useful for functions that have an execution time >> penalty for the jump.

– This option will sacrifice code compactness for efficiency.

Optimizations for Portland Compilers (cont.)

• All command-line optimizations are available through directives or pragmas.

• Can be used to enable or disable specific optimizations.

Caveats for Portland Compilers

• F77 and F90 are separate front-ends.

• Debugger cannot display floating point registers.

• Code compiled with the Portland compilers is compatible with GDB
  – Initial listing of code does not work.
  – Set a breakpoint or watchpoint where desired.

• Profiler can be difficult or impossible to use on parallel codes.

Other Sources of Compiler Information

• Linux Fortran web page, http://studbolt.physast.uga.edu/templon/fortran.html

• Cygnus/FSF GCC homepage, http://gcc.gnu.org

• Scientific Applications on Linux, http://SAL.KachinaTech.COM/index.shtml

Parallel Programming Models for Clusters

There are a number of programming models available on clusters of x86-based systems running Linux:

• Threading

• Compiler directives

• Message passing

• Multi-level (hybrid message passing with directives)

• Parallel numerical libraries and application frameworks

Threading

• Threading is a common concurrent programming approach in which several “threads” of execution share a memory address space.

• Because of the requirement for shared memory, threaded programs will only run on individual systems, although they typically see performance boosts when run on SMP systems.

• The most common interface to threads on UNIX-like systems such as Linux is the POSIX threads (pthreads) API, although on other systems there are numerous other interfaces available (DCE threads, Java threads, Win32 threads, etc.).

• Programming threaded applications can be extremely tedious and difficult to debug, and thus threading is not often used in HPC-oriented scientific applications.

Compiler Directives

• Compiler directives allow the relatively simple alteration of a serial code into a parallel code by inserting directives into the serial code which act as “hints” to the compiler, telling it where to look for parallelism. The directives will show up as comments to compilers which do not support the directives.

• This approach obviously requires the availability of a compiler which supports the directives in question.

• The most commonly supported directive sets for Linux systems are:
  – High Performance Fortran (HPF)
  – OpenMP

Compiler Directives: HPF

• HPF was developed in the early 1990s as a parallel extension to the Fortran programming language. It consists of directives which allow the programmer to distribute data arrays across multiple processors using an “owner-computes” model, as well as a library of parallel intrinsic routines.

• One HPF compiler for Linux/x86 is the Portland Group’s pghpf compiler, which supports parallelization on both a single SMP system (using shared memory) and clusters of systems (using a message passing layer over MPI). OSC has used the pghpf compiler on its 132 processor IA32 cluster and has seen scalability comparable to a hand-coded MPI program for some applications.

• Other HPF compilers for Linux/x86 include Pacific-Sierra Research’s VAST-HPF compiler and NA Software’s HPF-Plus compiler.

Compiler Directives: OpenMP

• OpenMP was developed in the late 1990s as a portable solution for directive-based parallel programming, specifically for shared memory architectures. Like HPF, it is a collection of directives with a support library; however, unlike HPF, OpenMP does not give explicit control over data placement. Also unlike HPF, OpenMP supports C and C++ as well as Fortran.

• There are several OpenMP-enabled compilers for Linux/x86, including those from the Portland Group (Fortran 77/90, C, and C++) and Kuck and Associates (C++). OSC has used both the Portland Group and Kuck compilers and found them to be acceptable, although OpenMP codes rarely scale well past two processors due to the limited memory bandwidth on four- and eight-way x86-based SMP systems.
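For reference, compiling and running an OpenMP code with the Portland Group compilers looks roughly like the following sketch (the -mp flag and file names are illustrative; consult the compiler documentation for the exact options):

    pgf90 -mp -O2 -o omp_code omp_code.f90    # -mp enables OpenMP directive processing
    setenv OMP_NUM_THREADS 4                  # number of threads used by parallel regions
    ./omp_code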

Message Passing and MPI

• Message passing is the most widely used approach for developing applications on distributed memory parallel systems. In message passing, data movement between processors is achieved by explicitly calling communication routines to send data from one processor to another.

• The standard and most commonly used message passing library is the Message Passing Interface (MPI), originally developed in the mid-1990s. There are numerous implementations of the MPI-1.1 standard, including:

– MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/) -- freely available, supports MPI over shared memory and TCP/IP as well as a number of high speed interconnects including Myrinet, implements the parallel I/O portions of the MPI-2 standard.

– LAM (http://www.mpi.nd.edu/lam/) -- freely available, supports MPI over shared memory and TCP/IP, implements much (most?) of the MPI-2 standard.
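The basic MPICH workflow is to compile with the wrapper scripts and then launch the job; a minimal sketch (program name and process count are hypothetical, and launcher details vary by installation):

    mpicc -O2 -o nblock2 nblock2.c    # wrapper adds the MPI include and library paths
    mpirun -np 4 ./nblock2            # start four MPI processes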

Multi-Level/Hybrid

• In clusters of SMP systems, it is sometimes advantageous to use a hybrid of message passing and directive-based approaches, often referred to as multi-level parallel programming.

• In the multi-level approach, the domain is decomposed in a coarse manner using message passing. Within the message passing code, compiler directives are inserted to run the computationally intensive portions in parallel in a shared memory node.

• This approach works best for systems and applications where contention for interconnect interfaces becomes a hindrance to scalability; in general, it does not increase performance at low processor counts, but it extends the region of near-linear scalability beyond that possible with message passing or compiler directives alone.

Multi-Level: MPI + OpenMP

• The most common and portable way to do multi-level parallel programming is to use MPI for the coarse-grained domain decomposition and message passing, and OpenMP for the finer-grained loop-level parallelism.

• The main restriction to this programming approach is that MPI routines must not be called within OpenMP parallel regions.

• This approach also requires compilers which support both MPI and OpenMP.

Parallel Numerical Libraries

• Another approach to parallel programming is to use a parallel numerical library or application framework which abstracts away (to some extent) the distributed memory nature of a cluster system.

• Parallel numerical libraries are libraries which perform a particular class of mathematical operations, such as the Fourier transform or matrix/vector operations, in parallel.

• Examples of parallel numerical libraries include:
  – FFTW (http://www.fftw.org/), which implements parallel FFTs using either pthreads or MPI.
  – ScaLAPACK (http://www.netlib.org/scalapack/), which implements parallel matrix and vector operations using MPI.

Parallel Application Frameworks

• Parallel application frameworks are similar to parallel numerical libraries, but often include other features such as I/O, visualization, or steering capabilities.

• Parallel application frameworks tend to be aimed at a particular application domain rather than a class of mathematical operations.

• Examples of parallel application frameworks include:
  – Cactus (http://www.cactuscode.org/), which is a parallel toolkit for solving general relativity and astrophysical fluid mechanics problems.
  – PETSc (http://www-fp.mcs.anl.gov/petsc/), which is a general-purpose parallel toolkit for solving problems modeled by partial differential equations.

Application Performance Analysis

The availability of tools which give users the ability to characterize the performance of their applications is critical to the acceptance of clusters as “real” production HPC systems. Performance analysis tools fall into three broad categories:

• Timing

• Profiling

• Hardware Performance Counters

Timing

The simplest way of determining the performance of an application is to measure how long it takes to run.

• Timing on a per-process basis can be accomplished using the time command.

• Timing on a more arbitrary basis within an application can be done using timing routines such as the gettimeofday() system call or the MPI_Wtime() function in the MPI message passing library.
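For example, per-process timing of the cdnz3d code used later in this tutorial would simply be:

    time ./cdnz3d     # reports real (wall-clock), user, and system time for the run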

Profiling

Profiling is an approach in which the time spent in each routine is logged and analyzed in some fashion. This allows the programmer to determine which routines are taking the most time to execute and hence are candidates for optimization. In clusters and other distributed memory parallel environments, this can be taken a step further by profiling a program’s computational routines as well as its communication patterns.

• Computation profiling
  – gprof
  – pgprof

• Communication profiling
  – jumpshot

Computation Profiling: gprof

gprof is the GNU profiler. To use it, you need to do the following:

• Compile and link your code with the GNU compilers (gcc, egcs, g++, g77) using the -pg option flag.

• Run your code as usual. A file called gmon.out will be created containing the profile data for that run.

• Run gprof progname gmon.out to analyze the profile data.

Computation Profiling: gprof Example

troy@oscbw:/home/troy/Beowulf/cdnz3d> make

g77 -O2 -pg -c cdnz3d.f -o cdnz3d.o

g77 -O2 -pg -c sdbdax.f -o sdbdax.o

g77 -O2 -pg -o cdnz3d cdnz3d.o sdbdax.o

troy@oscbw:/home/troy/Beowulf/cdnz3d> ./cdnz3d

(…gmon.out created…)

troy@oscbw:/home/troy/Beowulf/cdnz3d> gprof cdnz3d gmon.out | more

Computation Profiling: gprof Example (con’t.)

Flat profile:

Each sample counts as 0.01 seconds.

  %   cumulative    self               self    total
 time    seconds   seconds    calls   s/call   s/call  name
24.67     942.76    942.76  4100500     0.00     0.00  lxi_
23.51    1841.45    898.69  4100500     0.00     0.00  leta_
20.10    2609.66    768.21  4100500     0.00     0.00  damping_
12.64    3092.90    483.24  4100500     0.00     0.00  lzeta_
11.55    3534.28    441.38  4100500     0.00     0.00  sum_
 4.12    3691.73    157.45      250     0.63    14.83  page_
 2.91    3802.84    111.11      250     0.44     0.44  tmstep_
 0.41    3818.62     15.78      500     0.03     0.03  bc_
 0.03    3819.59      0.97                             pow_dd

(…output continues…)

Computation Profiling: pgprof

pgprof is the profiler from the Portland Group compiler suite; it is somewhat more powerful than gprof. To use it, you need to do the following:

• Compile and link your code with the Portland Group compilers (pgcc, pgCC, pgf77, pgf90, pghpf) using the -Mprof=func or -Mprof=lines options depending whether you want function-level or line-level profiling.

• Run your code as usual. A file called pgprof.out will be created containing the profile data for that run.

• Run pgprof pgprof.out to analyze the profile data.

Computation Profiling: pgprof Example

troy@oscbw:/home/troy/Beowulf/cdnz3d> make

pgf77 -fast -tp p6 -Mvect=assoc,cachesize:524288 -Mprof=func -c cdnz3d.f -o cdnz3d.o

pgf77 -fast -tp p6 -Mvect=assoc,cachesize:524288 -Mprof=func -c sdbdax.f -o sdbdax.o

pgf77 -fast -tp p6 -Mvect=assoc,cachesize:524288 -Mprof=func -o cdnz3d cdnz3d.o sdbdax.o

Linking:

troy@oscbw:/home/troy/Beowulf/cdnz3d> ./cdnz3d

(…pgprof.out created…)

troy@oscbw:/home/troy/Beowulf/cdnz3d> pgprof pgprof.out

Computation Profiling: pgprof Example (con’t.)

• pgprof will present a graphical display if it finds a functional X display as part of the user’s environment:

Computation Profiling: pgprof Example (con’t.)

• Without a functional X display, pgprof will present a command line interface like the following:

troy@oscbw:/home/troy/Beowulf/cdnz3d> pgprof pgprof.out

Loading....

Datafile  : pgprof.out
Processes : 1

pgprof> print

             Time/                       Function
  Calls      Call(%)  Time(%)  Cost(%)   Name:
------------------------------------------------------------------------
4100500      0.00     23.43    23        lxi (cdnz3d.f:1632)
4100500      0.00     21.90    22        damping (cdnz3d.f:2319)
4100500      0.00     21.87    22        leta (cdnz3d.f:1790)
4100500      0.00     11.68    12        lzeta (cdnz3d.f:1947)

pgprof> quit

Communication Profiling: jumpshot

jumpshot is a Java-based GUI profiling tool which is included in the MPICH implementation of MPI. It allows the programmer to profile all calls to MPI routines. To use jumpshot, you need to do the following:

• Compile your MPI code using one of the MPI compiler wrappers (mpicc, mpiCC, mpif77, mpif90) supplied with MPICH using the -mpilog option, and link using -lmpe.

• Run your MPI code as usual. A .clog file will be created (i.e. if your executable is named progname, a log file called progname.clog will be created).

• Run jumpshot on the .clog file (eg. jumpshot progname.clog)

Communication Profiling: jumpshot Example

troy@oscbw:/home/troy/Beowulf/mpi-c> more jumpshot.pbs

#PBS -l nodes=2:ppn=4

#PBS -N jumpshot

#PBS -j oe

cd $HOME/Beowulf/mpi-c

mpicc -mpilog nblock2.c -o nblock2 -lmpe

mpiexec ./nblock2

troy@oscbw:/home/troy/Beowulf/mpi-c> qsub jumpshot.pbs

(…nblock2.clog created…)

troy@oscbw:/home/troy/Beowulf/mpi-c> jumpshot nblock2.clog


Communication Profiling: jumpshot Example (con’t)

Hardware Performance Counters

Hardware performance counters are a way of measuring the performance of an application or system at a very low level. This can be extremely useful for diagnosing performance problems such as cache thrashing or memory bandwidth bottlenecks. There are two ways of accessing performance counters:

• Non-invasive (command-line driven)
  – lperfex

• Invasive (instrumentation library)
  – libperf
  – PAPI

Hardware Performance Counters: lperfex

• OSC has developed a utility called lperfex (http://www.osc.edu/~troy/lperfex/) to access the hardware performance counters built into newer Intel P6-based processors.

• lperfex functions much like the time command, in that it is run on other programs. However, it also gives the ability to count and report on hardware events. The default events if none are specified are floating point operations and L2 cache line loads.

• No special compilation is required, and lperfex can be used within batch jobs and with MPI programs (eg. mpiexec lperfex -y ./a.out). However, it currently does not work with multithreaded programs, such as those using OpenMP or pthreads. It also requires the use of a kernel patch which exposes the MSRs (Model Specific Registers), available at http://www.beowulf.org/software/perf-0.7.tar.gz.


Hardware Performance Counters: lperfex Example

troy@oscbw:/home/troy/Beowulf/cdnz3d> lperfex -e 41 -y ./cdnz3d

838.239990 seconds of CPU time elapsed and 0.000000 MB of memory on oscbw.cluster.osc.edu

Event #   Event                                Events Counted
-------   -----                                --------------
41        Floating point operations retired    3042728032

Statistics:

-----------

MFLOPS 65.694389


Hardware Performance Counters: lperfex -- Events

• 0: Memory references

• 1: L1 data cache lines loaded

• 3: L1 data cache lines flushed

• 13: L2 cache lines loaded

• 14: L2 cache lines flushed

• 31: I/O transactions

• 35: Memory transactions

• 41: Floating point operations retired (counter 0 only)

• 43: Floating point exceptions handled by microcode (counter 1 only)

• 50: Instructions retired

• 51: Ops retired

• 53: Hardware interrupts received

• 67: Cycles during which the processor is not halted


Hardware Performance Counters: libperf and PerfAPI

• lperfex is built on top of libperf, which is a user-callable library which is included with the NASA Goddard performance counters patch (http://www.beowulf.org/software/perf-0.7.tar.gz).

• It is also possible to instrument a code directly with libperf rather than use lperfex; this would be of interest if you wanted to measure the performance of a single routine rather than the entire code.

• There is an effort under the auspices of the Parallel Tools Consortium (http://www.ptools.org/) to develop a standard library for doing portable low-level performance measurement, called PerfAPI (http://icl.cs.utk.edu/projects/papi/). The current Linux/x86 release of this project fortunately uses a kernel patch which should be compatible with libperf and lperfex.

Why Job Scheduling Software

In an ideal world, users would coordinate with each other and no conflicts would be encountered when running jobs on a cluster.

Unfortunately, in real life we have limited resources (processors, memory, and network interfaces):
  – Users, faced with time deadlines of their own, will want to execute jobs on the cluster as it fits with their schedule
  – High-throughput users can swamp the whole system, if allowed
  – Users can check for CPU availability (system load), but how many will check memory or network interface availability?

A job scheduling system allows you to enforce a system policy:
  – Policy can be established by management or peer review
  – Enforcement of policy controls what the maximum available resources are, and in what order jobs will be allocated these resources

OSC User Environment Configuration

Front End System

• Designated for code development and pre/post processing

• Interactive resource limits (10 min. CPU time, 32MB memory on the front end node -- use the limit command to check this).

Compute Nodes

• Private network ensures no direct access (i.e., no rlogin directly to the compute nodes)

• Users specify what their compute requirements are and the scheduling policy allocates nodes as a resource

Batch System Eval

• Cluster management software requirements were identified and seven batch systems were evaluated

• Two systems met the basic requirements, PBS and LSF, and a side-by-side comparison was made of both packages

Some observations made at the time of the comparison (1999):
  – Not all packages of the LSF suite have been ported to Linux
  – Microsoft NT was apparently Platform Computing, Inc.’s operating system of choice for the Intel architecture, while PBS fully supports Linux
  – LSF was designed for clusters of systems, although not necessarily dedicated clusters
  – PBS was designed for single system image systems and is evolving to support clusters, specifically dedicated clusters
  – No significant difference in functionality between the two
  – PBS provides more opportunity for optimization and development

Portable Batch System - Brief Overview

• The most widely used batch queuing system for clusters

• PBS (“Portable Batch System”) from MRJ Technology Solutions (Veridian Corporation). This package was developed by MRJ for the NAS Facility at NASA Ames Research Center; it is the successor of the venerable NQS package, which was also developed at NASA Ames.

• PBS is a software system for managing system resources on workstations, SMP systems, MPPs, and vector supercomputers.

• Developed with the intent to conform with the POSIX Batch Environment Standard

• For the purposes of this tutorial, we will concentrate on how PBS may be applied to a space-shared cluster of small SMP systems (i.e. cluster systems).

PBS Structure

[Diagram: the SERVER, SCHEDULER, and MOM processes and their interactions.]

PBS Server
• There is one server process
• It creates and receives batch jobs
• Modifies batch jobs
• Invokes the scheduler
• Instructs moms to execute jobs

PBS Scheduler
• There is one scheduler process
• Contains the policy controlling which job is run, where and when it is run
• Communicates with the “moms” to learn about the state of the system
• Communicates with the server to learn about the availability of jobs

PBS Machine Oriented Miniserver (Mom)
• One process required for each compute node
• Places jobs into execution
• Takes instruction from the server
• Requires that each instance have its own local file system

PBS provides an Application Program Interface (API) to communicate with the server and another to interface with the moms

How PBS Handles Jobs

[Flow diagram: a batch script is submitted with the qsub command; the server, the scheduler, and the pool of moms then cooperate to run it.]

Batch Script
1) PBS directives
2) Commands required for program execution

Server
1) Based on resource requirements, place the job into an execution queue
2) Instruct the scheduler to examine queued jobs
3) Instruct the “mother superior” to execute the commands section of the batch script

Scheduler
1) Query the moms to determine available resources
2) Examine queued jobs to see if any can be started, and allocate resources
3) Return the job id and resource list to the server for execution

Mother superior
1) Execute the batch commands
2) Monitor resource usage of child processes and report back to the server
3) If a parallel job, create remote processes on the nodes allocated to this job (the mom pool)
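To make the “Batch Script” box concrete, a minimal script in the style of the jumpshot.pbs example shown earlier might look like this (resource requests, paths, and the program name are illustrative); it is handed to the server with qsub, and its commands are run by the mother superior:

    #PBS -l nodes=2:ppn=4
    #PBS -l walltime=1:00:00
    #PBS -N example
    #PBS -j oe

    cd $HOME/Beowulf/mpi-c
    mpiexec ./nblock2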

Starting and Stopping PBS Services

• Recommended, but not required, starting order
  – Mom
  – Server (generates an “are you there” ping to all moms at startup)
  – Scheduler

• Server
  – ‘-t create’ required for first startup
  – ‘pbs_server -t hot’ starts up the server and looks for jobs currently running
  – ‘qterm -t quick’ kills the server but leaves jobs running

• Mom(s)
  – ‘kill -9’ will leave jobs running
  – ‘pbs_mom -p’ lets running jobs continue to run
  – ‘pbs_mom -r’ kills any running jobs

• Scheduler
  – No impact on performance
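Following the recommended order above, a first startup might look like the following sketch (run as root; the scheduler daemon is pbs_sched in a default FIFO-scheduler install, and init-script integration is site-specific):

    pbs_mom                 # on each compute node
    pbs_server -t create    # on the server host; first-ever startup only, use -t hot on later restarts
    pbs_sched               # the default FIFO scheduler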

PBS Server

• Configuring the server can be separated into two parts
  – Configuring the server attributes
  – Configuring queues and their attributes

• The server is configured with the qmgr command while it is running

• Commonly used commands:
  – set, unset, print, create, delete, quit

• Commands operate on three main entities
  – server   set/change server parameters
  – node     set/change properties of individual nodes
  – queue    set/change properties of individual queues

Usage: qmgr [-c command] [-n]
  -c   Execute a single command and exit qmgr
  -n   No commands are executed; syntax checking only is performed

Server Attributes

Default queue
  – Declares the default queue to which jobs are submitted if a queue is not specified
  – The OSC cluster is structured so that all jobs go first through a routing queue called ‘batch’ and then to the specific destination queue
  – For the OSC cluster, ‘batch’ is the default queue

set server default_queue = batch

Access Control List (ACL)

Hosts: a list of hosts from which jobs may be submitted

set server acl_hosts = *.osc.edu
set server acl_host_enable = True

True = turn this feature on
False = turn this feature off

Users: a list of users who may submit jobs

set server acl_user = wrk001@*,wrk002@*,wrk003@*

set server acl_user_enable = True

Server Attributes

Managers
  Defines which users at a specified host are granted batch system administrator privilege

set server managers = admin01@*.osc.edu
set server managers += pinky@*.osc.edu
set server managers += brain@*.osc.edu

Node Packing
  Defines the order in which multiple-CPU cluster nodes are allocated to jobs
  True: jobs are packed into the fewest possible nodes
  False: jobs are scattered across the most possible nodes

set server node_pack = True

Query Other Jobs
  True: qstat allows you to see all jobs on the system
  False: qstat only allows you to see your own jobs

set server query_other_jobs = True

Server Attributes

Logging

There are two types of logging
• account logging
• events

Within qmgr, you set the mask that determines the level of event logging:

1 Error Events

2 Batch System/Server Events

4 Administration Events

8 Job Events

16 Job Resource Usage

32 Security Violations

64 Scheduler Calls

128 Debug Messages

256 Extra Debug Messages

The specified events are logically “OR-ed”:

set server log_events = 511     Everything turned on
set server log_events = 127     Good balance

Queue Structure

PBS defines two types of queues

Routing
• Used to move jobs to other queues
• Jobs cannot execute in a routing queue

Execution
• A job must reside in an execution queue to be eligible to run
• The job remains in this queue during execution

OSC queue configuration
  – One routing queue that is the entry point for all jobs
  – The routing queue dispatches jobs to execution queues defined by CPU time and number of processors requested

Queue Structure

The ‘batch’ routing queue dispatches jobs to execution queues organized by time range (columns) and number of processors (rows):

               q1 (0-5 hrs)   q2 (5-10 hrs)   q3 (10-20 hrs)   q4 (20-40 hrs)   q5 (40-160 hrs)
        p4     q1p4           q2p4            q3p4             q4p4             q5p4
        p8     q1p8           q2p8            q3p8             q4p8             q5p8
        p16    q1p16          q2p16           ..               ..               ..
        p32    q1p32          q2p32           ..               ..               ..
        p64    q1p64          q2p64           ..               ..               ..
        p128   q1p128         q2p128          ..               ..               q5p128

• Queue division by processor count allows for management of parallel jobs

• Queue division by time allows job control for system maintenance

• Currently no OS checkpoint support for Linux

• Jobs running at system shutdown must restart from the beginning

• The structure allows queues to be turned off incrementally as downtime approaches, preventing the need to kill and restart jobs

PBS Queue Attributes

The server is configured with the qmgr command while it is running

Usage:
[oscbw.osc.edu]$ qmgr
Qmgr: create|set queue queue_name attribute_name = value

See man pbs_queue_attributes for a complete list of queue attributes

Creating a queue

Before queue attributes can be set, the queue must be created:

create queue batch

create queue short_16pe

create queue long_16pe

Required PBS Queue Attributes

Queue Type
  Must be set to either execution or routing

set queue batch queue_type = Routing
set queue short_16pe queue_type = Execution

Enabled
  Logical flag that specifies whether jobs will be accepted into the queue
  True - the specified queue will accept jobs
  False - jobs will not be accepted into the queue

set queue short_16pe enabled = True

Started
  – Logical flag that specifies whether jobs in the queue will be processed
  – Good method for draining queues when system maintenance is needed

True - jobs in the queue will be processed, either routed or scheduled

False - jobs will be held in the queue

set queue short_16pe started = True

Recommended PBS Queue Attributes

Max running
  – Controls how many jobs in this queue can run simultaneously
  – Customize this value based on the hardware resources available

set queue short_16pe max_running = 4

Max user run
  – Controls how many jobs an individual userid can execute simultaneously across the entire server
  – Helps prevent a single user from monopolizing system resources

set queue short_16pe max_user_run = 2

Priority
  – Sets the priority of a queue, relative to other queues
  – Provides a method of giving smaller jobs quicker turnaround

set queue q5p128 Priority = 90

Recommended PBS Queue Attributes

Maximum and Minimum resources
  – Limits can be placed on various resources
  – This restricts which jobs may enter the queue based on the resources requested

Usage:

set queue short_16pe resources_max.resource = value

Look at man pbs_resources_linux to see all resource attributes for Linux, and man pbs_resources_###### for aix4, sp2, sunos4, unicos8

cput       maximum amount of CPU time used by all processes
nodes      number of nodes to be reserved
ppn        number of processors to be reserved on each node
pmem       maximum amount of physical memory used by any single process
walltime   maximum amount of real time during which the job can be in the running state

Example PBS Execution Queue

create queue short_16pe

set queue short_16pe queue_type = Execution

set queue short_16pe Priority = 90

set queue short_16pe max_running = 8

set queue short_16pe resources_max.cput = 10:00:00

set queue short_16pe resources_max.nodect = 4

set queue short_16pe resources_max.nodes = 4:ppn=4

set queue short_16pe resources_min.nodect = 2

set queue short_16pe resources_default.cput = 05:00:00

set queue short_16pe resources_default.mem = 1900mb

set queue short_16pe resources_default.nodect = 4

set queue short_16pe resources_default.nodes = 4:ppn=4

set queue short_16pe resources_default.vmem = 1900mb

set queue short_16pe max_user_run = 4

set queue short_16pe enabled = True

set queue short_16pe started = True

Routing Queue Attributes

Route destinations
  – Specifies potential destinations to which a job may be routed
  – Will be processed in the order listed
  – Job will be sent to the first queue which meets the resource requirements of the job

create queue batch
set queue batch queue_type = Route
set queue batch max_running = 4
set queue batch route_destinations = short_16pe
set queue batch route_destinations += long_16pe
set queue batch enabled = True
set queue batch started = True

PBS Scheduler

PBS implements the scheduler as a module, so that different sites can “plug in” the scheduler that meets their specific needs

The material in this tutorial will cover the default FIFO scheduler

FIFO Scheduler - Default Characteristics
  – All jobs in a queue will be considered for execution before the next queue is examined
  – All queues are sorted by priority
  – Within each queue, jobs are sorted by requested CPU time (jobs can be sorted on multiple keys)
  – Jobs which have been queued for more than 24 hours will be considered starving

PBS Scheduler

Configuring the scheduler
• The configuration file is read when the scheduler is started
• $PBS_HOME/sched_priv/sched_config
• The FIFO scheduler will require some customization initially, but should remain fairly static

Format of config file
• One line for each attribute

      name: value { prime | non_prime | all }

• Some attributes will require a prime option
• If nothing is placed after the value, the default of “all” will be assigned
• Lines starting with a “#” will be interpreted as comments
• When PBS is installed, an initial sched_config is created
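For illustration, a fragment of sched_config in this format might look like the following (the values shown are a sketch, not OSC’s actual policy; the individual attributes are described on the next slides):

    # sample sched_config fragment (illustrative values only)
    strict_fifo: False            all
    help_starving_jobs: True      all
    max_starve: 48:00:00
    sort_by: shortest_job_first   all
    log_filter: 71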

Scheduler Attributes

strict_fifo
  Controls whether jobs will be scheduled in strict FIFO order or not
  Type: boolean
  True - jobs will be run in a strict first-in, first-out order
  False - jobs will be scheduled based on resource usage

help_starving_jobs
  Once a queued job has waited a certain period of time, PBS will cease running jobs until the “starving job” can be run
  The waiting period for starving job status is defined in starv_max
  Type: boolean
  True - starving job support is enabled
  False - starving job support is disabled

Recommendation: Turn starving jobs off or set starv_max high

Scheduler Attributes

sort_by
  Controls how the jobs are sorted within the queues
  Type: string

no_sort - do not sort the jobs

shortest_job_first - ascending by the cput attribute (default)

longest_job_first - descending by the cput attribute

smallest_memory_first - ascending by the mem attribute

largest_memory_first - descending by the mem attribute

high_priority_first - descending by the job priority attribute

low_priority_first - ascending by the job priority attribute

large_walltime_first - descending by job walltime attribute

cmp_job_walltime_asc - ascending by job walltime attribute

fair_share - not covered here; see PBS Administrator Guide

multi_sort - sort on more than one key

Scheduler Attributes

sort_by (cont)

Examples:

sort_by: smallest_memory_first

sort_by: shortest_job_first

If multi_sort is set, multiple key fields are used

Each key field will be a key for the multi sort and the order of the key fields decides which sort type is used first

sort_by: multi_sort

key: shortest_job_first

key: smallest_memory_first

key: high_priority_first

starv_max
  The amount of time before a job is considered starving
  Type: time

max_starve: 48:00:00

Scheduler Attributes

log_filter
  Defines the level of scheduler logging
  Type: number

1 internal errors

2 system (server) events

4 admin events

8 job related events

16 End of Job accounting

32 security violation events

64 scheduler events

128 common debug messages

256 less needed debug messages

Example: to log internal errors, system events, admin events, and scheduler events (1 + 2 + 4 + 64 = 71):

log_filter 71

PBS Mom

• Configuring the execution server (Mom) is achieved with a configuration file, which is read in at startup

• Configuration file location
  – Default: $PBS_HOME/mom_priv/config
  – You can specify a different file with the ‘-c’ option when the pbs_mom daemon is started

• The configuration file contains two types of information
  – Initialization values
  – Static resources

Initialization Values

$clienthost hostname
  – Adds hostname to the list of hosts which will be allowed to connect to Mom
  – Both the host that runs the scheduler and the host that runs the server must be listed as a clienthost

$logevent value

1 Error Events

2 Batch System/Server Events

4 Administration Events

8 Job Events

16 Job Resource Usage

32 Security Violations

64 Scheduler Calls

128 Debug Messages

256 Extra Debug Messages

The specified events are logically “or-ed”:

$logevent 511     Everything turned on
$logevent 127     Good balance

Initialization Values

$max_load
  – Declares the load value at which the node will be marked busy
  – If the load value exceeds max_load, the node will be marked as busy
  – If a node is marked busy, no new jobs will be scheduled

$max_load 4.0

$ideal_load
  – Declares the load value at which the “busy” label will be removed from a node
  – If the load value drops below ideal_load, the node will no longer be marked as busy

$ideal_load 3.0

$cputmult
  Sets a factor used to adjust the CPU time used by a job. Allows adjustment of time where the job might run on systems with different CPU performance
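Putting these initialization values together, a minimal mom configuration file (by default $PBS_HOME/mom_priv/config) might look like the following sketch, assuming the server and scheduler run on the front-end host:

    $clienthost oscbw.cluster.osc.edu
    $logevent 127
    $max_load 4.0
    $ideal_load 3.0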

Page 84: IEEE HPDC 9 Conference 1 Software Configuration for Clusters in a Production HPC Environment Doug Johnson, Jim Giuliani, and Troy Baer Ohio Supercomputer.

IEEE HPDC 9 Conference

84

Static Resources

• Static resources are names and values that you assign to a given node that identify its special characteristics

• These resources can then be requested in the batch script if a job needs some special resource

ncpus 4

physmem 2009644

myrinet 2

fasteth 1

• Given the above definitions, jobs that want up to 4 CPUs, 2 GB of memory, 2 Myrinet interfaces (in this case myrinet is a network interface) or 1 Fast Ethernet interface "could" be scheduled on this node

• If a job asked for 2 fast ethernet interfaces, it could not be scheduled on this node

#PBS -l nodes=1:myrinet=2 could be scheduled on this node

#PBS -l nodes=1:myrinet=3 could not be scheduled on this node
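As a usage sketch, a complete job script requesting one of these static resources might look like the following; the walltime, job name and executable are illustrative:

#!/bin/sh
#PBS -l nodes=1:myrinet=2
#PBS -l walltime=1:00:00
#PBS -N myri_test
# run out of the job's temporary scratch directory
cd $TMPDIR
cp $HOME/a.out .    # a.out is an illustrative executable name
./a.out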


IEEE HPDC 9 Conference

85

Prologue Script

PBS provides the ability to run a site-supplied script before and/or after each job runs. This provides the capability to perform initialization or cleanup of resources.

• Prologue script runs prior to each job

• The script name and path is $PBS_HOME/mom_priv/prologue

• The script must be owned by root

• The script must have permissions: root read/write/execute, group & world none
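In practice the ownership and permission requirements amount to commands like these (path taken from the slide above):

chown root $PBS_HOME/mom_priv/prologue
chmod 700 $PBS_HOME/mom_priv/prologue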


IEEE HPDC 9 Conference

86

Prologue and Epilogue Arguments

The prologue script is passed the following three arguments that can be used in the script:

1 the job id

2 the user id under which the job executes

3 the group id under which the job executes

The epilogue script is passed these arguments plus the following six:

4 the job name

5 the session id

6 the requested resource limits

7 the list of resources used

8 the name of the queue in which the job resides

9 the account string (if one exists)


IEEE HPDC 9 Conference

87

Sample Prologue Script

#!/bin/csh

# Copyright 2000, The Ohio Supercomputer Center, Troy Baer

# Create TMPDIR on all the nodes

# prologue gets 3 arguments:

# 1 -- jobid

# 2 -- userid

# 3 -- grpid

setenv TMPDIR /tmp/pbstmp.$1

setenv USER $2

setenv GROUP $3

if ( -e /var/spool/pbs/aux/$1 ) then

foreach node ( `cat /var/spool/pbs/aux/$1 | uniq` )

rsh $node "mkdir $TMPDIR ; chown $USER $TMPDIR ; chgrp $GROUP $TMPDIR ;

chmod 700 $TMPDIR" >& /dev/null

end

else

mkdir $TMPDIR

chown $USER $TMPDIR

chgrp $GROUP $TMPDIR

chmod 700 $TMPDIR

endif


IEEE HPDC 9 Conference

88

Sample Epilogue Script

#!/bin/csh

# Copyright 2000, The Ohio Supercomputer Center, Troy Baer

# Clear out TMPDIR

# epilogue gets 9 arguments:

# 1 -- jobid

# 2 -- userid

# 3 -- grpid

# 4 -- job name

# 5 -- sessionid

# 6 -- resource limits

# 7 -- resources used

# 8 -- queue

# 9 -- account

setenv TMPDIR /tmp/pbstmp.$1

foreach node ( `cat /var/spool/pbs/aux/$1 | uniq` )

rsh $node /bin/rm -rf $TMPDIR

end


IEEE HPDC 9 Conference

89

User Environment Customization

Interactive limits (front end - /etc/profile.d/limits.sh)

#!/bin/sh

ulimit -t 600      # limit CPU time to 600 seconds

#ulimit -d 65536   # data segment size limit (KB)

#ulimit -m 65536   # resident memory limit (KB)

#ulimit -l 65536   # locked-in-memory limit (KB)

#ulimit -p 64

User environment modifications for $TMPDIR (compute nodes - /etc/profile.d/tmpdir.sh)

#!/bin/sh

# If PBS_ENVIRONMENT exists and is "PBS_BATCH" or "PBS_INTERACTIVE",

# set TMPDIR

if [ -n "$PBS_ENVIRONMENT" ]

then

if [ "$PBS_ENVIRONMENT" = PBS_BATCH -o "$PBS_ENVIRONMENT" = PBS_INTERACTIVE ]

then

export TMPDIR=/tmp/pbstmp.$PBS_JOBID

fi

fi
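With $TMPDIR pointing at the per-job scratch directory, a typical job stages its data there and copies results home before the epilogue removes the directory; the file and program names below are illustrative:

cd $TMPDIR
cp $HOME/input.dat .                  # illustrative input file
$HOME/bin/my_app input.dat > out.dat  # illustrative application
cp out.dat $HOME/results/             # copy results back before the job ends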


IEEE HPDC 9 Conference

90

Interactive Batch Job Support

PBS supports the option of an interactive batch job for debugging purposes through PBS directives

General limitations
– qsub command reads standard input and passes the data to the job, which is connected via a pseudo-tty
– PBS only handles standard input, output and error

Additional OSC handicap
– Compute nodes are on a private network

OSC customization for interactive graphics support

A technique has been devised that gives graphics capability within an interactive PBS batch job. It takes advantage of the special “X11 forwarding” implemented by SSH.
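For reference, an interactive batch job is normally requested with qsub's -I flag; the resource request below is illustrative:

qsub -I -V -l nodes=1:ppn=2,walltime=0:30:00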


IEEE HPDC 9 Conference

91

Interactive Batch Job Graphics Support

• Only supported if the user connects to the front end via ssh

• A utility installed in /etc/profile.d grabs the DISPLAY environment variable from the front-end session, builds and installs a new X authorization entry, and forwards the display to the front end, which is sent to the user's workstation via the X11 forwarding proxy server

• Must pass environment variables to PBS batch job with “qsub -V …”

/etc/profile.d/pbsX.sh (compute nodes only)

if [ -n "$DISPLAY" -a "$PBS_ENVIRONMENT" == "PBS_INTERACTIVE" ]then export AUTHKEY=`xauth list|grep $DISPLAY | sed "s/oscbw[0-9]*.osc.edu/node00.cluster.osc.edu/" |

head -1` export DISPLAY=`echo $DISPLAY | sed 's/oscbw01/node00.cluster'` xauth add $AUTHKEYfi


IEEE HPDC 9 Conference

92

Parallel Job Control

MPIRUN implementation

• Current default method based on rsh
– Not scalable; a maximum of 512 processes can be started from a single node
– Uses too many sockets
– Under PBS, Mom starts the first process, which then reports to Mom
– PBS knows nothing about processes started up by rsh since they were not executed by Mom

• Spawned processes are not children of the moms

• Moms do not have control over spawned processes

• Moms do not know about spawned processes

• Resource utilization is not reported to moms, so accounting is not accurate

• New with MPICH 1.2.0 is the MultiPurpose Daemon (MPD)
– Depends on the user id; still must use rsh or some other mechanism to start up the daemons
– Daemons start up the jobs, not the moms, so there is still no control over processes and accounting is incorrect


IEEE HPDC 9 Conference

93

Parallel Job Control - mpiexec

PBS Task Manager

• In addition to the PBS API, which provides access to the PBS server, there is a task manager interface for the moms

• Based on the PSCHED API (http://parallel.nas.nasa.gov/PSCHED)

Mpiexec uses the task manager library of PBS to spawn copies of the executable on all the nodes in a PBS allocation. It is functionally equivalent to

rsh node "cd $cwd; $SHELL -c 'cd $cwd; exec executable arguments'"

The PBS server API is used to extract resource request information and construct the resource configuration file (nodes, etc.)

We use GM, which requires information about the NICs; that information is constructed by mpiexec as well (PBS does not know about NICs)


IEEE HPDC 9 Conference

94

mpiexec Format

mpiexec [OPTION]... executable [args]...

-n numproc          Use only the specified number of processes

-tv, -totalview     Debug using TotalView

-perif              Allocate only one process per Myrinet interface. This flag can be used to ensure the maximum communication bandwidth is available to each process

-pernode            Allocate only one process per compute node. For SMP nodes, only one processor will be allocated a job. This flag is used to implement multiple-level parallelism, with MPI between nodes and threads within a node

-config configfile  Process executable and arguments are specified in the given configuration file. This flag permits the use of heterogeneous jobs using multiple executables, architectures, and command line arguments

-bg, -background    Do not redirect stdin to task zero. Similar to the "-n" flag in rsh(1)


IEEE HPDC 9 Conference

95

MPIEXEC

C program written at OSC for PBS and available under the GPL

[Diagram: mpiexec contacts the pbs_server and the mother-superior pbs_mom, which directs the pbs_moms on nodes 0 through N]

1) Establish connection to the task manager on the mother superior node

2) Query the server to get host names and cpu numbers (vpn's)

3) Instruct the task manager to spawn off processes, based on the info from step 2

4) Mother superior instructs the moms in her pool to start the indicated tasks


IEEE HPDC 9 Conference

96

Hardware Level Access

• To connect to an individual node's serial console line you will need a program such as Kermit, which can be downloaded from http://www.columbia.edu/kermit/ckermit.html

• If you needed to connect to the console on your front end (more about how to configure this later), you would type:

[oscbw.osc.edu]% kermit

set line /dev/ttyC0

set speed 9600

set carrier-watch off

connect


IEEE HPDC 9 Conference

97

Serial Communication Programs

• Minicom is a standard package included in most Linux distributions.

• More flexible than Kermit, with better terminal emulation.

• Can create separate "configurations" for different nodes
– /etc/minirc.node01 can contain

pr port /dev/ttyC0

pu baudrate 9600

pu bits 8

pu parity N

– Would then connect to the node by typing 'minicom node01'


IEEE HPDC 9 Conference

98

Serial Line Console

• To configure your Linux computer to have a serial console we will want to recreate the /dev/console entry:

rm -f /dev/console

mknod -m 666 /dev/console c 5 1

• Next, we will need to spawn a getty (the process that allows logins) on the proper serial line.

• This is done by adding the following line to /etc/inittab:

# Spawn getty for the serial console.

S1:12345:respawn:/sbin/getty ttyS0 DT9600 vt100

• To have init re-examine the inittab file, type '/sbin/init q'

• It is now possible to login over /dev/ttyS0.


IEEE HPDC 9 Conference

99

Allowing root Logins and Console Redirection

• root is only allowed to log in at the ttys defined in /etc/securetty.

• Must add an entry for serial lines
– For our example, ttyS0

• Console redirection requires the following additions to /etc/lilo.conf
serial=0,9600n8 in the global section
append="console=ttyS0" in the per-image section
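Putting these together, the relevant pieces might look like the sketch below; the kernel image path, label and root device are illustrative:

/etc/securetty (add the serial line so root may log in on it)

ttyS0

/etc/lilo.conf

serial=0,9600n8
image=/boot/vmlinuz
    label=linux
    root=/dev/hda1
    append="console=ttyS0"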


IEEE HPDC 9 Conference

100

Serial Line Hardware Level Access

• In addition to serial consoles, newer Intel motherboards include support for the Intelligent Platform Management Interface (IPMI).

• Allows power cycling, fan monitoring and BIOS access.

• Implemented in hardware; not dependent on, and does not affect, the OS.

• No robust implementations for Linux and limited hardware support.

• Instead of serial line connections one could install keyboard-video-mouse (KVM) switches.
– Pros: more easily understood.
– Cons: lots of extra cables, limits on the number of nodes, difficult remote access.


IEEE HPDC 9 Conference

101

System Management and Monitoring

• Performance Co-Pilot (PCP) can be used for system-wide performance management and monitoring.

• Hierarchical storage and manipulation of the system data.

• Renormalization is possible.
– Choose the level of detail

• Data returned to PCP are self-describing.

• Designed to return performance data while minimally affecting what it is measuring.

• Command-line interfaces such as pminfo.

• Can build graphical representations using pmview; the idea is visual presentation of large amounts of data.

• Can monitor log files.

• Built-in logging routines for later analysis.


IEEE HPDC 9 Conference

102

PMCD and PMDAs

• The Performance Metrics Collection Daemon (PMCD) is the core of PCP.

• Performance Metrics Domain Agents (PMDAs) collect self-describing statistics from a variety of sources.
– Disk, cpu, network, logs, switches and routers.

[Diagram: clients (pmview, pminfo, logger) query the PMCD, which collects data from multiple PMDAs]


IEEE HPDC 9 Conference

103

Pminfo

pminfo: option requires an argument -- h

Usage: pminfo [options] [metricname ...]

Options:

-a archive metrics source is a PCP log archive

-b batchsize fetch this many metrics at a time for -f or -v (default 20)

-d get and print metric description

-f fetch and print value(s) for all instances

-F fetch and print values for non-enumerable indoms too

-h host metrics source is PMCD on host

-m print PMID

-M print PMID in verbose format

-n pmnsfile use an alternative PMNS

-O time origin for a fetch from the archive

-t get and display (terse) oneline text

-T get and display (verbose) help text

-v verify mode, be quiet and only report errors

(forces other output control options off)

-Z timezone set timezone for -O

-z set timezone for -O to local time for host from -a


IEEE HPDC 9 Conference

104

Pminfo

djohnson:~> pminfo -f disk.all.read_bytes

disk.all.read_bytes

value 177955

djohnson:~> pminfo -d disk.all.read_bytes

disk.all.read_bytes

Data Type: 32-bit unsigned int InDom: PM_INDOM_NULL 0xffffffff

Semantics: counter Units: Kbyte
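Since pminfo can also query a remote PMCD with the -h option (see the usage listing above), the same metrics can be fetched from any node in the cluster; the hostname and metric below are illustrative:

djohnson:~> pminfo -h node01.cluster.osc.edu -f kernel.all.load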


IEEE HPDC 9 Conference

105

Pmview


IEEE HPDC 9 Conference

106

Pmview