Transcript of HPCC: Training Session – II (Advanced), Srirangam Addepalli, Huijun Zhu, High Performance Computing Center, Jan 2011

Page 1:

HPCC: Training Session – II (Advanced)

Srirangam Addepalli, Huijun Zhu

High Performance Computing Center

Jan 2011


Page 3:

Introduction and Outline

In this session:

1. Compiling and running serial and MPI programs

2. Debugging serial and parallel code

3. Profiling serial and parallel code

4. Features of Sun Grid Engine and the local setup

5. Features of the shell

6. Shell commands

7. Additional SGE options

8. Applications of interest

9. Questions and contacts

Page 4:

Compiler Optimizations

Windows      Linux      Comment
/Od          -O0        No optimization
/O1          -O1        Optimize for size
/O2          -O2        Optimize for speed and enable some optimizations
/O3          -O3        Enable all -O2 optimizations plus intensive loop optimizations
/QxO         -xO        Enable SSE3, SSE2 and SSE instruction set optimizations for non-Intel CPUs
/Qprof-gen   -prof_gen  Compile the program and instrument it for a profile-generating run
/Qprof-use   -prof_use  May only be used after running a program previously compiled with -prof_gen; uses the profile information during each step of compilation
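As a sketch of the profile-guided optimization workflow implied by the last two rows (the source file and training input names here are placeholders, not from the slides):

icc -prof_gen myprog.c -o myprog     # instrumented build
./myprog < training_input.dat       # representative run writes profile (.dyn) files
icc -prof_use myprog.c -o myprog     # rebuild using the recorded profile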

Page 5:

Compiler Optimization: UNROLL

for (int i = 0; i < 1000; i++) {
    a[i] = b[i] + c[i];
}

icc -unroll=8 unroll.c

for (int i = 0; i < 1000; i += 8) {
    a[i]   = b[i]   + c[i];
    a[i+1] = b[i+1] + c[i+1];
    a[i+2] = b[i+2] + c[i+2];
    a[i+3] = b[i+3] + c[i+3];
    a[i+4] = b[i+4] + c[i+4];
    a[i+5] = b[i+5] + c[i+5];
    a[i+6] = b[i+6] + c[i+6];
    a[i+7] = b[i+7] + c[i+7];
}
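One way to check what the compiler actually did is its optimization report. The flag spellings below are from classic icc of roughly this era and may differ by version, so treat them as an assumption:

icc -unroll=8 -opt-report -opt-report-phase=hlo unroll.c   # report high-level loop optimizations, including unrolling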

Page 6:

Debugging

Compiling code with debug and profiling options:

icc -unroll=8 unroll.c

Debug:
icc -g -unroll=8 unroll.c -o test.exe
icc -debug -unroll=8 unroll.c -o test.exe

idb ./test.exe
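A minimal debugger session sketch (gdb commands shown, since idb offers a gdb-compatible mode; the breakpoint location is just an example):

$ gdb ./test.exe
(gdb) break main      # stop at the entry to main
(gdb) run             # start the program
(gdb) next            # step over one statement
(gdb) print i         # inspect a variable
(gdb) quit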

Page 7:

Profiling

Compiling code with profiling:
icc -unroll=8 unroll.c

Profile:
icc -g -c -prof_gen -unroll=8 unroll.c   # generates instrumented object files
icc -g -prof_use -unroll=8 unroll.o      # generates the executable

Use: Thread Profiler, VTune Performance Analyzer

Using GCC:
cp /lustre/work/apps/examples/Advanced/TimeWaste.c .
gcc TimeWaste.c -pg -o TimeWaste -O2 -lc
./TimeWaste
gprof TimeWaste gmon.out $options
  -p : flat profile
  -q : call graph
  -A : annotated source (the -g -pg compile options must be used)
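The slides do not show the contents of TimeWaste.c; a hypothetical stand-in that simply burns CPU, so gprof has something to report, could look like this:

/* Hypothetical stand-in for TimeWaste.c: deliberately wastes time
   so the gprof flat profile and call graph have entries. */
#include <stdio.h>

double waste(int n)
{
    int i, j;
    double s = 0.0;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            s += (double)i * (double)j;   /* useless arithmetic */
    return s;
}

int main(void)
{
    printf("%f\n", waste(5000));
    return 0;
}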

Page 8:

Parallel Code

Compile with the -g option to enable debugging.

1. cp /lustre/work/apps/examples/Advanced/pimpi.c .
   mpicc -g pimpi.c

2. mpirun -np 4 -dbg=gdb a.out

To run cpi with 2 processes, where the second process runs under the debugger, the session looks something like this on the first machine:

mpirun -np 2 cpi -p4norem
waiting for process on host compute-19-15.local:
/lustre/home/saddepal/cpi compute-19-15.local 38357 -p4amslave

and on the second machine:

% gdb cpi

3. Intel Parallel Studio is a very good option. We currently do not have a license, but users can request a free license.

(Show idb)

Page 9:

Resources: Hrothgar / Janus

• Hrothgar Cluster:
  7680 cores, 640 nodes (12 cores/node)
  Intel(R) Xeon(R) @ 2.8 GHz
  24 GB of memory per node
  DDR InfiniBand for MPI communication & storage
  80 TB parallel Lustre file system
  86.2 Tflop peak performance
  1024 cores for serial jobs
  320 cores for the community cluster
  Top 500 list: 109th in the world; top 10 among academic institutions in the USA

• JANUS Windows Cluster:
  40 cores, 5 nodes (8 cores/node)
  64 GB memory
  Visual Studio with Intel Fortran
  700 GB storage

Page 10:

File System

Hrothgar has multiple file systems available for your use. There are three Lustre parallel file systems (physically all the same) and a physical-disk scratch file system on each compute node.

$HOME is backed up and persistent, with quotas. $WORK is not backed up but is persistent, with quotas. $SCRATCH and /state/partition1 are not backed up, are not persistent, and have no quotas.

By "not persistent" we mean the area will be cleared based on earliest last-access time. Lustre and physical disk have different performance characteristics. Lustre has much higher bandwidth and much higher latency, so it is better for large reads and writes. Disk is better for small reads and writes, particularly when they are intermixed. Backups of $HOME are taken at night. Removed files are gone forever, but if they were on $HOME they can be restored from the last backup. Critical files that cannot be replicated should be backed up on your personal system.

Page 11:

File System – 2

Location                    Quota    Alias
/lustre/home/eraiderid      100 GB   $HOME
/lustre/work/eraiderid      500 GB   $WORK
/lustre/scratch/eraiderid   none     $SCRATCH
/state/partition1/          none

Location                    Backed up   Size
/lustre/home/eraiderid      yes         7 TB
/lustre/work/eraiderid      no          22 TB
/lustre/scratch/eraiderid   no          43 TB
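To see how much of a quota you are using, the standard Lustre client tool can report it (a sketch, assuming the locations above are the Lustre mount points):

lfs quota -u $USER /lustre/home     # usage and limits for $HOME
lfs quota -u $USER /lustre/work     # usage and limits for $WORK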

Page 12:

MPI Run

Let's compile and run an MPI program.

mkdir mpi
cd mpi
cp /lustre/work/apps/examples/mpi/* .
mpicc cpi.c, or
mpif77 fpi.f
qsub mpi.sh

$ echo $MCMD
mpirun
$ echo $MFIL
machinefile

For mvapich2, those are mpirun_rsh and hostfile. These are SGE variables: $NSLOTS is the number of cores; $SGE_CWD_PATH is the submit directory.

Page 13:

MPI.sh

#!/bin/bash
#$ -V                      # export env to job: needed
#$ -cwd                    # change to submit dir
#$ -j y                    # ignore -e, merge
#$ -S /bin/bash            # the job shell
#$ -N mpi                  # the name, qstat label
#$ -o $JOB_NAME.o$JOB_ID   # the job output file
#$ -e $JOB_NAME.e$JOB_ID   # the job error file
#$ -q development          # the queue, dev or normal
#$ -pe fill 8              # 8, 16, 32, etc.
cmd="$MCMD -np $NSLOTS -$MFIL \
$SGE_CWD_PATH/machinefile.$JOB_ID \
$SGE_CWD_PATH/a.out"
echo cmd=$cmd              # this expands and prints
$cmd                       # runs $cmd

Page 14:

Sun Grid Engine

◮ There are two queues: normal (48 hrs) and serial (120 hrs).

◮ When the time is up, your job is killed, but a signal is sent 2 minutes earlier so you can prepare for restart, and requeue if you want.

◮ serial supports both parallel and serial jobs (normal supports only parallel jobs).

◮ There are no queue limits per user, so we request that you do not submit, say, 410 one-node jobs if you are the first person on after a restart.

◮ We ask that you use either (1) all the cores or (2) all the memory on each node you request, so the minimum core (pe) count is 12, and increments are 12. If you are not using all the cores, request pe 12 anyway, but don't use $NSLOTS.

◮ The interactive script /lustre/work/apps/bin/qstata.sh will show all hosts with running jobs on the system (it's 724 lines of output).

Page 15:

Sun Grid Engine: Restart

SGE does not auto-restart a timed-out job. You have to tell it to in your command file. The #$ SGE header lines are unchanged and are omitted here.

2 minutes before the job is killed, SGE will send a "usr1" signal to the job.

The "trap" command intercepts the signal, runs a script called "restart.sh", and exits with a signal 99. That signal 99 tells SGE to requeue the job.

trap "$SGE_CWD_PATH/restart.sh;exit 99" usr1 echo MASTER=$HOSTNAME #optional, prints run node echo RESTARTED=$RESTARTED #optional, 1st run or not $SGE_CWD_PATH/myjob.sh Ec=$? #these 5 lines if [ $ec == 0 ]; then #let the job finish when echo "COMPLETED" #it’s done by sending an fi #code 0 exit $ec #normal end

Page 16:

Serial Jobs

Most serial jobs won't use 16 GB of memory, so run 8 at once; use "-pe fill 8" for this. This example assumes you run in 8 subdirectories of your submit directory, but you don't have to if the files don't conflict.

ssh $HOSTNAME "cd $PWD/r1;$HOME/bin/mys <i.dat 1>out 2>&1" &ssh $HOSTNAME "cd $PWD/r2;$HOME/bin/mys <i.dat 1>out 2>&1" &ssh $HOSTNAME "cd $PWD/r3;$HOME/bin/mys <i.dat 1>out 2>&1" &ssh $HOSTNAME "cd $PWD/r4;$HOME/bin/mys <i.dat 1>out 2>&1" &ssh $HOSTNAME "cd $PWD/r5;$HOME/bin/mys <i.dat 1>out 2>&1" &ssh $HOSTNAME "cd $PWD/r6;$HOME/bin/mys <i.dat 1>out 2>&1" &ssh $HOSTNAME "cd $PWD/r7;$HOME/bin/mys <i.dat 1>out 2>&1" &ssh $HOSTNAME "cd $PWD/r8;$HOME/bin/mys <i.dat 1>out 2>&1" &r=8 #these 8 lines countwhile [ "$r" -ge 1 ] #the ssh processesdo #and sleep the master while theysleep 60 #are running so the job won’t dier=‘ps ux | grep $HOSTNAME | grep -v grep | wc -l‘done

Page 17:

Application Checkpoint

Application checkpointing is supported for serial jobs.

cr_run – to run the application:
./a.out should be replaced with cr_run ./a.out

cr_checkpoint --term PID (where PID is the process id of the job)

cr_restart contextfile.pid
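Putting the three commands together, a minimal session sketch (assuming BLCR's default checkpoint file name, context.<pid>):

cr_run ./a.out &              # start the application under checkpoint control
pid=$!                        # remember its process id
cr_checkpoint --term $pid     # checkpoint, then terminate; writes context.$pid
cr_restart context.$pid       # later: resume from the checkpoint file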

Page 18:

Serial Jobs Auto Checkpoint

#!/bin/bash
#$ -V
#$ -cwd
#$ -S /bin/bash
#$ -N GeorgeGains
#$ -o $JOB_NAME.o$JOB_ID
#$ -e $JOB_NAME.e$JOB_ID
#$ -q kharecc
export tmpdir=$SGE_CKPT_DIR/ckpt.$JOB_ID
export currcpr=`cat $tmpdir/currcpr`
export ckptfile=$tmpdir/context_$JOB_ID.$currcpr
if [ "$RESTARTED" = "1" ] && [ -e "$tmpdir" ]; then
    echo "Restarting from $ckptfile" >> /tmp/restart.log
    /usr/bin/cr_restart $ckptfile
else
    /usr/bin/cr_run a.out
fi

Page 19:

Applications – NWChem

/lustre/work/apps/examples/nwchem/nwchem.sh is an NWChem script almost like the MPI command file. There is a predefined variable for your Lustre scratch area, called $SCRATCH. If NWChem is started in that directory with no scratch directory defined in the .nw file, it will use that area for scratch, as intended. Here the machinefile and input file remain in the submit directory. This requires +intel, +mvapich1, and +nwchem-5.1 in .soft. You can check before submitting; a correct setup will show some files:

ls $NWCHEM_TOP

INFILE="siosi3.nw"

cd $SCRATCH

run="$MCMD -np $NSLOTS -$MFIL \

$SGE_CWD_PATH/machinefile.$JOB_ID \

$NWCHEM_TOP/bin/LINUX64/nwchem \

${SGE_CWD_PATH}/$INFILE"

echo run=$run

$run

Page 20:

Applications – Matlab

• /lustre/work/apps/examples/matlab/matlab.sh is a Matlab script with a small test job, new4.m. It requires any compiler/MPI pair (except if using MEX), and +matlab in .soft.

#!/bin/sh

#$ -V

#$ -cwd

#$ -S /bin/bash

#$ -N testjob

#$ -o $JOB_NAME.o$JOB_ID

#$ -e $JOB_NAME.e$JOB_ID

#$ -q development

#$ -pe fill 8

matlab -nodisplay -nojvm -r new4

Page 21:

Interactive Jobs

• qlogin -q "qname" "resource requirements"

• e.g.: qlogin -q HiMem -pe mpi 8

• qlogin -q normal -pe mpi 16

• Disadvantage: does not run in batch mode; if the terminal closes, your job is killed.

• Advantage: you can debug jobs more easily.

Page 22:

Array Jobs

#!/bin/sh
#$ -S /bin/bash
~/programs/program -i ~/data/input -o ~/results/output

Now, let's complicate things. Assume you have input files input.1, input.2, ..., input.10000, and you want the output placed in files with a similar numbering scheme. You could use perl to generate 10000 shell scripts, submit them, and then clean up the mess later. Or you could use an array job. The modification to the previous shell script is simple:

#!/bin/sh
#$ -S /bin/bash
# Tell SGE that this is an array job, with "tasks" numbered 1 to 10000
#$ -t 1-10000
# When a single task of the array job is sent to a compute node,
# its task number is stored in the variable SGE_TASK_ID,
# so we can use the value of that variable to get the results we want:
~/programs/program -i ~/data/input.$SGE_TASK_ID -o ~/results/output.$SGE_TASK_ID
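Submitting and monitoring an array job works like any other job (the script file name here is assumed):

qsub array.sh       # one submission creates all 10000 tasks
qstat -u $USER      # each running task shows up with its own task id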

Page 23:

Application Guides / Questions

Parallel Matlab, Matlab toolboxes

Questions?

• Support

• Email: [email protected]

• Phone: 806-742-4378