Job Submission on WestGrid Feb 15 2005 on Access Grid.


Page 1: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Job Submission on WestGrid

Feb 15 2005 on Access Grid

Page 2: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Introduction

Simon Sharpe, one member of the WestGrid support team

The best way to contact us is to email [email protected]

This seminar tells you:
  How to run, monitor, or cancel your jobs
  How to select the best site for your job
  How to adapt your job submission for different sites
  How to get your jobs running as quickly as possible

Feel free to interrupt if you have questions

Page 3: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Getting into the Queue

HPC resources are valuable research tools

A batch queuing system is needed to:
  Match jobs to resources
  Deliver maximum bang for the research buck
  Distribute jobs and collect output across parallel CPUs
  Ensure a fair sharing of resources

Page 4: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Getting into the Queue

WestGrid compute sites use TORQUE/Moab
  Based on PBS (Portable Batch System)
  You need just a few commands common to WestGrid machines
  There are important differences in job submission among sites you need to know about

With the diversity of WestGrid, it is possible that there is more than one machine suitable for your job

Page 5: Job Submission on WestGrid Feb 15 2005 on Access Grid.

A Simple Sample

The script file serialhello.pbs tells TORQUE how to run the C program serialhello

This example shows how to run a serial job on Glacier, which is a good choice for serial jobs

The qsub command tells TORQUE to run the job described in the script file serialhello.pbs

When your job completes, TORQUE creates two new files in the current directory, capturing the job's standard error and standard output
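The slide's screenshots are not part of this transcript, so here is a minimal sketch of what a script like serialhello.pbs and its submission might look like. The script contents are an assumption, not the original; `PBS_O_WORKDIR` is the variable TORQUE sets to the directory the job was submitted from.

```shell
# Write a minimal job script. Its contents are an assumption; the
# original slide showed the real script only as an image.
cat > serialhello.pbs <<'EOF'
#!/bin/bash
# Start in the directory the job was submitted from
cd "$PBS_O_WORKDIR"
# Run the serial C program
./serialhello
EOF

# Submit only where TORQUE is actually installed
if command -v qsub >/dev/null 2>&1; then
    qsub serialhello.pbs
else
    echo "qsub not found; wrote serialhello.pbs without submitting"
fi
```

On a real login node qsub prints the job identifier, which is what qstat and qdel take as their argument.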

Page 6: Job Submission on WestGrid Feb 15 2005 on Access Grid.

End of Seminar

Thanks for coming

I wish it was that easy

Page 7: Job Submission on WestGrid Feb 15 2005 on Access Grid.

HPC: One Size Does Not Fit All

When the only tool you have is a hammer, every job looks like a nail

Things that affect system selection:
  System dictated by executable or licensing
  MPI or OpenMP
  Availability: How busy is the system?
  Amount of RAM required
  Speed or number of processors

Page 8: Job Submission on WestGrid Feb 15 2005 on Access Grid.

HPC: One Size Does Not Fit All

Things that affect system selection (continued):
  Scalability of your application
  Inter-processor communication requirements
  Queue limits (walltime, number of CPUs)
  Inertia: It is where we’ve always run it

http://www.westgrid.ca/support/System_Status

http://www.westgrid.ca/support/Facilities

http://www.westgrid.ca/support/software

Page 9: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Uses of WestGrid Machines

Machine                  Use                            Interconnect                CPUs
Glacier (IBM Xeon)       Serial, moderate parallel MPI  GigE, shared in node        1680, dual CPUs/node
Matrix (HP XC Opteron)   MPI parallel                   Infiniband, shared in node  256, dual CPUs/node
Lattice (HP SC Alpha)    Moderate MPI parallel, serial  Quadrics, shared in node    144, 68 (G03), quad CPUs/node
Cortex (IBM Power5)      OpenMP, MPI parallel           Shared memory               64, 64, 4
Nexus (SGI Origin MIPS)  OpenMP, MPI parallel           Shared memory               256, 64, 64, 36, 32, 32, 8
Robson (IBM Power5)      Serial, moderate MPI parallel  GigE, shared in node        56, dual CPUs/node

Page 10: Job Submission on WestGrid Feb 15 2005 on Access Grid.

TORQUE and Moab Commands

qsub script
  Submit this job to the queue; common options include
    -l mem=1GB
    -l nodes=4:ppn=2 or, on Nexus, -l ncpus=4
    -l walltime=06:00:00
    -q queue-name
    -m and -M for email notifications

showq
  Show me the jobs in the queue

qstat jobid
  Show the status of the job in the queue; common options include
    -a and -an

qdel jobid
  Delete this job number from the queue
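The commands above fit together as a submit, monitor, cancel cycle. The sketch below strings them together under a dry-run wrapper so it can be read and tried without a scheduler. The wrapper `run`, the script name myjob.pbs, the job number 12345, and the address user@example.com are all hypothetical placeholders.

```shell
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0
# on a login node where TORQUE and Moab are installed to execute them.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Submit: 4 nodes x 2 processors, 6 hours, 1 GB of memory,
# with mail sent on abort (a) and on end (e) of the job
run qsub -l nodes=4:ppn=2 -l walltime=06:00:00 -l mem=1GB \
    -m ae -M user@example.com myjob.pbs

run showq             # all jobs in the queue
run qstat -a 12345    # status of one job
run qstat -an 12345   # same, plus the nodes assigned to it
run qdel 12345        # remove the job from the queue
```

In real use the job number comes from the identifier that qsub prints when the job is accepted.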

Page 11: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Sample MPI job on Glacier

Parallel jobs have differing degrees of parallelism

Glacier, which has a slower interconnect than other WestGrid machines, may not turn out to be the best place for your parallel job

Latency: Like the time it takes to dial and say “hello”

Bandwidth: How fast can you talk?

If your parallel job does not require intensive communications between processes, it may be worth testing on Glacier

More info on Glacier submissions at:

http://www.westgrid.ca/support/programming/glacier.php

http://guide.westgrid.ca/guide-pages/jobs.html

Page 12: Job Submission on WestGrid Feb 15 2005 on Access Grid.

MPI Submission on Glacier

We need to tell TORQUE how many processors we need

This asks for 2 nodes and 2 processors per node (4 CPUs)

Similar script to last time, but now calling program parallelized with MPI

Adding the walltime estimate helps TORQUE schedule the job

Note that we can pass directives on the command line or in the script

This time we wait in the queue
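As a rough illustration of the two ways to pass directives, the sketch below puts the request in the script as `#PBS` lines and then prints the equivalent command-line form. The script name mpihello.pbs, the 10-minute walltime, and the `mpirun` launch line are assumptions; the launcher and its flags depend on the MPI installation at each site.

```shell
# Directives inside the script: 2 nodes x 2 processors, with a walltime
# estimate. The walltime value here is a made-up example.
cat > mpihello.pbs <<'EOF'
#!/bin/bash
#PBS -l nodes=2:ppn=2
#PBS -l walltime=00:10:00
cd "$PBS_O_WORKDIR"
# Launcher name and flags are an assumption; check the site pages
mpirun -np 4 ./mpihello
EOF

# The same request passed on the command line instead
echo "qsub -l nodes=2:ppn=2 -l walltime=00:10:00 mpihello.pbs"
```

Keeping the directives in the script makes the request repeatable, while the command-line form is handy for trying a different node count without editing the file.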

Page 13: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Sample MPI job on Matrix

Matrix is an HP XC cluster using AMD Opterons and Infiniband Interconnect

64-bit Linux

Not intended for serial work

A good home for parallel jobs

More info on Matrix submissions at:

http://www.westgrid.ca/support/programming/matrix.php

Page 14: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Running MPI Jobs on Matrix

For Matrix, use nodes and processors/node (ppn) to tell TORQUE how many CPUs your job needs

Matrix machines have 2 CPUs/Node

A minimal TORQUE script to run a parallel MPI job on Matrix

Standard and Error output dropped into the directory we submitted from
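Turning a CPU count into a nodes and ppn request is simple arithmetic, sketched here as a small helper. The function name `nodes_spec` is hypothetical and not part of TORQUE. It rounds up to whole nodes, so a count that is not a multiple of ppn requests one node more than strictly needed.

```shell
# Print a TORQUE nodes request for a CPU count and CPUs per node.
# Rounds up to whole nodes.
nodes_spec() {
    cpus=$1
    ppn=$2
    nodes=$(( (cpus + ppn - 1) / ppn ))
    echo "nodes=${nodes}:ppn=${ppn}"
}

# Matrix has 2 CPUs per node, so 8 CPUs means 4 nodes
nodes_spec 8 2    # prints nodes=4:ppn=2
```

The result can be passed straight to qsub, for example as `qsub -l "$(nodes_spec 8 2)" script`.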

Page 15: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Sample MPI job on Lattice

Lattice is an HP Alpha cluster connected with Quadrics

64-bit Tru64

Intended for parallel work

Four processor shared memory

Quadrics interconnect for more than 4 processors

MPI communicates through interconnect or shared memory, as appropriate

Also being used for some serial work

More info on Lattice submissions at:

http://hpc.ucalgary.ca/westgrid/running.html

http://www.westgrid.ca/support/programming/lattice.php

Page 16: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Running MPI Jobs on Lattice

For Lattice, use nodes and processors/node to set number of processors. Lattice has 4 processors on each node.

In this case we ask for 2 CPUs on one box and 2 on another

A minimal TORQUE script to run a parallel MPI job on Lattice

Standard and error out dropped into the directory we submitted from

Page 17: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Sample Serial Job on Lattice

Lattice has a high-speed Quadrics interconnect

If your job is serial, it does not take advantage of the Quadrics interconnect

Glacier may be an alternative

Having said that, many serial jobs are run on Lattice

Page 18: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Running Serial Jobs on Lattice

On Lattice, we tell TORQUE to run the job described in the script file serialhello.pbs

A minimal TORQUE script to run a serial job on Lattice

Standard and error out dropped into the directory we submitted from

Page 19: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Sample Parallel job on Cortex

Cortex is a machine with IBM Power5 SMP processors

Running AIX

Not for serial work

A good home for large parallel applications needing shared memory and/or fast interconnection

Good for large memory jobs

More info on Cortex submissions at:

http://www.westgrid.ca/support/cortex

http://www.westgrid.ca/support/programming/cortex.php

Page 20: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Running Parallel Jobs on Cortex

On Cortex, we tell TORQUE to run the job described in the script file mpihello.pbs

The script which describes how we want Cortex to run the parallel program mpihello

The standard output file, dropped into our working directory

Page 21: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Sample Parallel Job on Nexus

Nexus is a collection of SGI SMP machines

Several sizes serviced by different queues

Test on smaller machines, heavy lifting on large ones

A good home for parallel jobs with intense communication requirements and/or large memory needs

More information at:

http://www.ualberta.ca/AICT/RESEARCH/PBS/index.westgrid.html

Page 22: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Running OpenMP Jobs on Nexus

For Nexus, match ncpus with OMP_NUM_THREADS

In this case we ask for 8 CPUs on the Helios machine (8-32 CPUs)

You can try trivial OpenMP jobs from the command line. This job ran interactively on the head node.

You should not use more than 2 processors for interactive jobs.

To run jobs requiring real processing, you must submit them to TORQUE
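One way to keep ncpus and OMP_NUM_THREADS matched is to generate both from a single value, as in this sketch. The script name omphello.pbs and the executable omphello are hypothetical. The queue that selects a particular Nexus machine is left out because its name is not given here.

```shell
# One value drives both the CPU request and the OpenMP thread count
NCPUS=8

# Unquoted heredoc: ${NCPUS} is filled in now, while \$PBS_O_WORKDIR
# is written literally so it expands when the job runs.
cat > omphello.pbs <<EOF
#!/bin/bash
#PBS -l ncpus=${NCPUS}
cd "\$PBS_O_WORKDIR"
export OMP_NUM_THREADS=${NCPUS}
./omphello
EOF

cat omphello.pbs
```

Changing NCPUS and regenerating the script keeps the two settings from drifting apart, which would either idle requested CPUs or oversubscribe them.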

Page 23: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Sample Serial Job on Robson

Robson is a new 56-processor Power5 system

64-bit Linux

Good for serial work, may be suitable for some parallel processing

Message passing through MPI

More info at:

http://www.westgrid.ca/support/robson

Page 24: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Running Serial Jobs on Robson

This is a minimal serial job submission script for Robson. It runs the executable “hello”

A more elaborate script example is available at:

http://www.westgrid.ca/support/robson

Robson also runs MPI parallel jobs, as described on the above web page

TORQUE drops the Error Out (zero-length in this case) and Standard Out to the directory we submitted from
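A quick check of those two files might look like the sketch below. It assumes TORQUE's default naming, where output goes to a file named after the job with `.o` plus the job number and errors to the same name with `.e`; the job name defaults to the script's file name unless set with `-N`. The helper name `check_job_output` is hypothetical.

```shell
# Report on a finished job's output files, given job name and number.
# Returns 0 if the error file is empty, 1 if it has content,
# 2 if the files are not there yet.
check_job_output() {
    jobname=$1
    jobid=$2
    out="${jobname}.o${jobid}"
    err="${jobname}.e${jobid}"
    if [ ! -e "$out" ] || [ ! -e "$err" ]; then
        echo "output files for job ${jobid} not found"
        return 2
    fi
    if [ -s "$err" ]; then
        echo "errors reported in ${err}"
        return 1
    fi
    echo "no errors; standard output is in ${out}"
}

# Example: check_job_output hello.pbs 12345
```

A zero-length error file, as on the slide, is the normal result of a clean run.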

Page 25: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Shortening HPC Cycle

Try your jobs at different sites

Test your process on small jobs

Give realistic walltimes, memory requirements

Apply for a larger Resource Allocation

http://www.westgrid.ca/manage_rac.html
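One possible way to arrive at a realistic walltime is to time a small test run and add a safety margin, as sketched below. The helper name `walltime_request` and the default 20% margin are assumptions, not WestGrid recommendations.

```shell
# Convert a measured runtime in seconds into an HH:MM:SS walltime
# request, padded by a safety margin in percent (default 20).
walltime_request() {
    secs=$1
    margin_pct=${2:-20}
    total=$(( secs + secs * margin_pct / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

# A test run that took 5 hours (18000 s), padded by 20%
walltime_request 18000    # prints 06:00:00
```

The result can be used directly, for example as `qsub -l walltime=$(walltime_request 18000) script`.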

Page 26: Job Submission on WestGrid Feb 15 2005 on Access Grid.

Summary

HPC jobs have differing requirements

WestGrid provides an increasing variety of tools

Use the system that is best for your job

Start off simple and small

Find out how well your job scales

Getting help
  Because of implementation differences, “man qsub” might not be your best source of help
  Support pages as listed throughout this presentation
  Email [email protected]