Linux Cluster Job Management Systems (SGE)

Job Management Systems: SGE (v1.3)

Author: Anand Vaidya [email protected]

Why use SGE?

Maintain order in a shared resource, like queuing up at a movie ticket counter rather than mobbing it.

Apply different usage policies: for example, PhDs and professors get better treatment than first-year grads.

Everyone gets a fair share of the computing resource.

What is SGE?

SGE (Sun Grid Engine) is distributed resource management software.

It gives users the means to submit computationally demanding tasks to the SGE system, which distributes the associated workload transparently.

How does SGE work?

Users submit jobs to the Grid Engine.

Unless resources are immediately available, non-interactive jobs are held in queues until the resources needed to execute them become free.

Jobs are passed on to the available execution hosts.

Records of each job's progress through the system are kept and reported on request.
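
The submit, queue, run and report cycle above can be followed from a script. A minimal sketch, assuming that qstat -u lists a job for as long as it is pending or running; the helper names job_in_queue and wait_for_job are illustrative and not part of SGE:

```shell
#!/bin/sh
# Succeeds while job id $1 appears in the first column of the qstat listing.
job_in_queue() {
    qstat -u "${USER:-$(id -un)}" |
        awk -v id="$1" '$1 == id { found = 1 } END { exit !found }'
}

# Poll until the job disappears from the listing.
# $1 = job id, $2 = seconds between polls (default 10).
wait_for_job() {
    while job_in_queue "$1"; do
        sleep "${2:-10}"
    done
    echo "job $1 is no longer queued or running"
}
```

For example, once qsub has reported job 742, calling wait_for_job 742 30 would check every 30 seconds. A job that has left the listing may have finished, failed or been deleted, so this only tells you that the job is gone, not how it ended.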

SGE Components

Hosts

Master (coordinate activities, hold queues)

Execution (workers)

Administration (sets up the system, queues, etc.)

Submit (users can submit jobs from these)

Usually the master and administration hosts are the same machine.

Queues (defined by the administrator)

User and Administrator Commands

Daemons: sge_qmaster (Master Daemon), sge_schedd (Scheduler Daemon), sge_execd (Execution Daemon) and sge_commd (Communication Daemon)

SGE Commands - qhost

What is the state of the cluster? How many nodes, type, load? What is my chance of getting a node?

[root@shark ~]# qhost
HOSTNAME   ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global     -           -     -     -       -       -       -
shark-c00  lx24-amd64  2     2.02  3.9G    240.8M  4.0G    0.0
shark-c02  lx24-amd64  2     2.00  3.9G    214.9M  4.0G    0.0
shark-c03  lx24-amd64  2     1.76  3.9G    215.9M  4.0G    0.0
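
To answer "what is my chance of getting a node?" at a glance, the LOAD column can be scanned for the least busy host. A small sketch, assuming the column layout shown in the qhost output above (hostname in column 1, load in column 4); least_loaded_host is a made-up helper name:

```shell
#!/bin/sh
# Read qhost output on stdin and print the host with the lowest load.
least_loaded_host() {
    awk 'NR > 2 && $1 != "global" && $4 ~ /^[0-9.]+$/ {
             # Compare numerically, but keep the load exactly as printed.
             if (best == "" || $4 + 0 < min) { min = $4 + 0; best = $1; load = $4 }
         }
         END { if (best != "") print best, load }'
}
```

Typical use would be qhost | least_loaded_host. Hosts whose load is reported as "-" are skipped.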

SGE Commands - qsub

Create a job script (myjob.sh)

Submit for execution

$ qsub myjob.sh

Your job 742 ("myjob.sh") has been submitted.

Simplest Job:

[vaidya@shark ~]$ cat myjob.sh
#!/bin/sh
sleep 10
date > /tmp/test1.out.txt

Variation: qsub -cwd myjob.sh (run the job from the current working directory)
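
Scripts that submit jobs usually need the job number for later qstat or qdel calls. A minimal sketch that pulls it out of the confirmation message shown above; it assumes the message has exactly the form "Your job N (...) has been submitted", and job_id_from_qsub is a hypothetical helper name:

```shell
#!/bin/sh
# Read qsub's confirmation message on stdin and print only the job id.
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}
```

It could be used as jobid=$(qsub myjob.sh | job_id_from_qsub). Messages in any other form produce no output, so an empty result should be treated as a failed submission.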


SGE Commands - qstat

Check the status of your job:

qstat ; qstat -f

qstat -u username ; qstat -j job_id

[root@shark ~]# qstat
job-ID  prior    name     user   state  submit/start at      queue            slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
639     0.55500  HCPDIV7  test1  r      05/17/2006 10:16:31  all.q@shark-c00  1
658     0.55500  HCPDIV1  test1  r      05/17/2006 13:37:35  all.q@shark-c00  1
694     0.55500  FCCDVI   test1  r      05/17/2006 23:52:19  all.q@shark-c02  1
695     0.55500  FCCDVI1  test1  r      05/17/2006 23:52:19  all.q@shark-c02  1
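
A listing like the one above can be summarised to see where the running jobs sit. A sketch that assumes the column layout shown here (state in column 5 and, for running jobs, the queue instance in column 8); running_jobs_per_queue is an illustrative name:

```shell
#!/bin/sh
# Read qstat output on stdin and count running jobs per queue instance.
running_jobs_per_queue() {
    awk 'NR > 2 && $5 == "r" { n[$8]++ }
         END { for (q in n) print q, n[q] }' | sort
}
```

Run as qstat | running_jobs_per_queue. Pending jobs have no queue column yet, which is why only rows in state r are counted.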

SGE Commands - qstat

Status of the job is indicated by letters:

qw - waiting

t - transferring

r - running

s, S - suspended

R - restarted

T - threshold
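
When reporting job status to users, the letters can be translated back into words. A small sketch covering only the codes listed on this slide; describe_state is a made-up helper, and combined or other codes that qstat may print fall through to "unknown":

```shell
#!/bin/sh
# Map a qstat state code from the list above to its description.
describe_state() {
    case "$1" in
        qw)  echo "waiting" ;;
        t)   echo "transferring" ;;
        r)   echo "running" ;;
        s|S) echo "suspended" ;;
        R)   echo "restarted" ;;
        T)   echo "threshold" ;;
        *)   echo "unknown" ;;
    esac
}
```

For example, describe_state qw prints "waiting".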

SGE Commands - qdel

Delete your job, if you wish

qdel 743

vaidya has deleted job 743
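
qdel needs job numbers, while people tend to remember job names. A sketch that looks up the numbers for a given name in qstat output, assuming the column layout of the qstat listing shown earlier (job id in column 1, name in column 3); job_ids_by_name is a hypothetical helper:

```shell
#!/bin/sh
# Read qstat output on stdin and print the ids of jobs named exactly $1.
job_ids_by_name() {
    awk -v name="$1" 'NR > 2 && $3 == name { print $1 }'
}
```

The ids printed by qstat -u username | job_ids_by_name FCCDVI can then be passed to qdel one by one. The match is exact, so FCCDVI does not select FCCDVI1; check the list before deleting anything.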

SGE Commands - qmon

qmon is an X Window System GUI tool to submit, delete and view jobs and to configure the SGE system.

Example: Submit a job using qmon

Click the Job Submission icon.

Click the Job Script file selection icon to open a file selection box and select your script file. Then, click OK.

Click the Submit button at the bottom of the Job Submission dialog.

After a couple of seconds, you should be able to monitor your job in the Job Control dialog. Click the Job Control icon in the QMON control panel.

You first see it under Pending Jobs, and it quickly moves to Running Jobs after it gets started.

SGE Commands - qsh, qtcsh

Submit an interactive session request:

qlogin

qrsh

Ensure you have a valid X server running on your desktop, and allow remote X clients to display on it.

Submit an interactive session request:

qsh

qtcsh

Note: this feature needs additional configuration and may not work without it.

SGE Commands - jobscript

Sample job script:

#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
date
sleep 10
env
date

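SGE Commands - jobscript

Sample job script (parallel job):

#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#
# NSLOTS and $TMPDIR/machines are set up by SGE and the parallel environment requested with -pe
$MPI_DIR/mpirun -np $NSLOTS -machinefile $TMPDIR/machines myparallelprog.exe {infile.txt outfile.txt}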
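
Before calling mpirun it can help to see how the slots were spread over hosts. A sketch under the assumption that the machines file named in the script above holds one hostname per line, repeated once per slot (this depends on how the parallel environment is set up at your site); slots_per_host is an illustrative name:

```shell
#!/bin/sh
# Count how often each host appears in a machine file.
# $1 = path to the file, e.g. "$TMPDIR/machines"
slots_per_host() {
    awk 'NF { n[$1]++ } END { for (h in n) print h, n[h] }' "$1" | sort
}
```

Inside a parallel job script this would be called as slots_per_host "$TMPDIR/machines", and the counts should add up to $NSLOTS.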

SGE Commands - jobscript

-cwd = run the job from the current working directory

-j y = merge stderr into stdout

-r y = job is re-runnable

-N jname = set the job name

-l h_rt=00:30:00 = run the job for at most 30 minutes

-pe mpich N = request the mpich parallel environment with N slots

-pe mpich-ib N = use the InfiniBand parallel environment

-pe mpich-eth N = use the Ethernet parallel environment

-V = export all current environment variables to the job
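
These options can also be embedded in the script itself as #$ lines, as in the earlier samples. A minimal sketch combining several of them; the job name demo_job and the report helper are made up for illustration, and the ${VAR:-default} fallbacks are there only so the script also runs outside SGE:

```shell
#!/bin/bash
#$ -N demo_job
#$ -cwd
#$ -j y
#$ -r y
#$ -l h_rt=00:30:00
#$ -S /bin/bash

# JOB_NAME, JOB_ID and NSLOTS are set by SGE when the job runs;
# the defaults below apply only when the script is started by hand.
report() {
    echo "job=${JOB_NAME:-demo_job} id=${JOB_ID:-none} slots=${NSLOTS:-1}"
}

report
date
```

Because the #$ lines are ordinary comments to the shell, the same file can be tested locally with bash before it is handed to qsub.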

Admin Commands

The next few slides show commands useful for SGE admins (not users or researchers).

SGE Commands - qconf

Show:

complexes: qconf -sc

queues: qconf -sql

parallel environments (PE): qconf -spl

exec hosts: qconf -sel (list), qconf -se c35 (one host)

submit hosts: qconf -ss

admin hosts: qconf -sh

calendars: qconf -scall

configuration: qconf -sconf

user list: qconf -suserl

scheduler configuration: qconf -ssconf
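
Several of the list commands above can be run in one go to get a quick inventory of a cluster. A minimal sketch using only flags from this slide; sge_inventory is a hypothetical helper name:

```shell
#!/bin/sh
# Print queues, parallel environments, and exec, submit and admin hosts,
# each under a heading that names the qconf flag used.
sge_inventory() {
    for flag in -sql -spl -sel -ss -sh; do
        echo "== qconf $flag =="
        qconf "$flag"
    done
}
```

Run it on a host where qconf is available, for example sge_inventory > cluster-inventory.txt, to keep a snapshot before changing the configuration.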

SGE Commands - qping

[anand@shark-c02 ~]$ qping -info shark-c01 537 execd 1
05/24/2006 21:57:34:
SIRM version: 0.1
SIRM message id: 1
start time: 05/24/2006 21:31:37 (1148477497)
run time [s]: 1768
messages in read buffer: 0
messages in write buffer: 0
nr. of connected clients: 2
status: 0
info: dispatcher: R (0.04) | OK
Monitor: disabled

LSF Commands

bsub - submit a job

bstop - suspend a job

bresume - resume a suspended job

btop - move a job to the top of its queue

bswitch - move jobs between queues

lsgrun - run a task on a set of hosts

bkill - kill a job

LSF Commands

lsmon - monitor load and resource availability

lsid - show LSF details (version etc.)

lshosts - show hosts and their static information

lsload - show load information for hosts

lsinfo - show LSF configuration information

busers - show user information

bacct - show accounting information on finished jobs

bjobs - show information on jobs

bpeek - show stdout/stderr of unfinished jobs

Acknowledgements & Copying

This material is based on my experience as well as material collected from SGE documentation.

This presentation can be redistributed as follows:

No commercial redistribution, e.g. as part of a for-profit CD-ROM or as part of your sales pitch. Seek my permission first.

Must attribute the document creator.

Share alike: if you use this document and enhance or modify it, share the modifications or the modified document.

Which means I apply: Creative Commons License, http://creativecommons.org/licenses/by-nc-sa/2.5/

The End

Thanks for your time. If you have any feedback, corrections or questions please contact me: Anand Vaidya, [email protected]

This document was created with OpenOffice on Linux. Email me if you want the ODP file instead of the PDF.
