Job Management Systems SGE v1.4

  • 8/14/2019 Job Management Systems SGE v1.4

    Job Management Systems

    SGE v1.4

    Author: Anand Vaidya

    [email protected]

    Why use SGE?

    Maintain order in a shared resource, like queuing up at a movie ticket counter rather than mobbing the counter.

    Apply different usage policies: PhDs and professors get better treatment than first-year grads.

    Everyone gets a fair (!) share of the computing resource.

    What is SGE?

    SGE (Sun Grid Engine) is distributed resource management software.

    It gives users the means to submit computationally demanding tasks to the SGE system for transparent distribution of the associated workload.

    What is SGE? Layman Terms

    You have a collection of mostly idle Macs, Windows, Linux and Solaris machines.

    You have plenty of computations or simulations to run.

    Can we just use these machines to run those computations?

    Who will manage this herd? SGE will...

    SGE Overview

    [Diagram: users and their desktops/laptops; SGE (configs, rules); hosts where users' jobs run]

    How does SGE work?

    Users submit jobs to the Grid Engine.

    Unless resources are immediately available, non-interactive jobs are kept in queues until resources to execute them become available.

    Jobs are passed on to the available execution hosts.

    Records of each job's progress through the system are kept and reported when requested.

    [Diagram: SGE master and shadows; four execd hosts; DRMAA client (applications); arrows labeled "Job requests" and "Results, errors"]

    Supported OS

    Linux 32 and 64 bit

    Solaris (SPARC and x64)

    Windows (exec only)

    OS X

    AIX

    HP-UX, IRIX etc.

    SGE Components

    Hosts:

    Master (coordinates activities, holds queues)

    Shadow Master

    Execution (workers)

    Administration (sets up the system, queues etc.)

    Submit (users can submit jobs from these)

    SGE Components

    Usually the master and admin host are the same machine.

    Queues (defined by the administrator)

    User and Administrator Commands

    Daemons:

    sge_qmaster (Master Daemon)

    sge_schedd (Scheduler Daemon)

    sge_execd (Execution Daemon)

    sge_commd (Communication Daemon)

    4 Job Types

    Interactive jobs: the user gets back a shell window.

    Batch jobs: just run once and store the output for review later.

    Array jobs (aka parametric, e.g. image rendering)

    Parallel (MPI) jobs: can't describe in one line :-(
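
An array job can be sketched as a single script that SGE runs once per task, with each task choosing its own input from the task index. The sketch below assumes the script is submitted with something like `qsub -t 1-10 render.sh`; SGE sets `SGE_TASK_ID` inside each task. The script name `render.sh`, the `frame_NNNN.png` naming and the `frame_for_task` helper are hypothetical, chosen only to illustrate the image-rendering case mentioned above.

```shell
#!/bin/sh
#$ -cwd
#$ -j y

# Hypothetical helper: map a task index to the frame file that task handles.
frame_for_task() {
    printf 'frame_%04d.png\n' "$1"
}

# SGE sets SGE_TASK_ID inside each array task; default to 1 so the
# script can also be tried outside the grid.
task_id=${SGE_TASK_ID:-1}

echo "Task $task_id would render $(frame_for_task "$task_id")"
```

Each task only needs its index to find its work, which is why one script can cover the whole parameter sweep.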

    Accessing...

    GUI (qmon)

    Command line / textual (qsub etc.)

    Programmatic (DRMAA)

    DRMAA = Distributed Resource Management Application API, where API = Application Programming Interface.

    Can you see the duplication? DRMA should have been sufficient...

    What is a job?

    Describes:

    What to run (program name)

    What environment is needed?

    What resources are needed (how many CPUs, how much RAM etc.)

    Email on completion?

    Send output of job to another file?

    Queues and Instances

    Queues are logical constructs, shared by all hosts attached to the queue, and cannot run jobs.

    Queue instances actually reside on hosts and contain jobs.

    Queue config is shared by all instances.

    Each instance can have unique properties, different from the queue.
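
Queue instances show up in command output as `queue@host`, for example `all.q@shark-c00` in the qstat listing later in this deck. As a small illustration, the two parts can be separated with plain shell parameter expansion; the helper names `queue_of` and `host_of` are hypothetical.

```shell
# Split a queue instance name of the form queue@host into its two parts
# using POSIX parameter expansion (no external commands needed).
queue_of() { printf '%s\n' "${1%%@*}"; }
host_of()  { printf '%s\n' "${1#*@}"; }

instance="all.q@shark-c00"
echo "queue: $(queue_of "$instance")  host: $(host_of "$instance")"
```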

    Installing...

    Determine the archs you will support and download the appropriate packages.

    Unpack tarballs

    Write auto-install script

    ssh $MASTER ; $SGE_ROOT/inst_sge -m -auto sge-auto.conf ; /etc/init.d/sgemaster start

    ssh $SHADOW ; $SGE_ROOT/inst_sge -sm -auto sge-auto.conf ; /etc/init.d/sgemaster -shadowd start

    $SGE_ROOT/inst_sge -x -auto sge-auto.conf ; psh compute /etc/init.d/sgeexecd start

    Check: qhost

    Done!

    SGE Commands - qhost

    What is the state of the cluster? How many nodes, type, load? What is my chance of getting a node?

    [root@shark ~]# qhost
    HOSTNAME   ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
    -------------------------------------------------------------------------------
    global     -           -     -     -       -       -       -
    shark-c00  lx24-amd64  2     2.02  3.9G    240.8M  4.0G    0.0
    shark-c02  lx24-amd64  2     2.00  3.9G    214.9M  4.0G    0.0
    shark-c03  lx24-amd64  2     1.76  3.9G    215.9M  4.0G    0.0
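
To answer "what is my chance of getting a node?" at a glance, the qhost listing can be filtered for lightly loaded hosts. The following is a minimal sketch, assuming the column layout shown above (load in the fourth column); the function name `hosts_below_load` is hypothetical.

```shell
# Print the names of execution hosts whose load average is below a limit.
# Reads qhost-style output on stdin; the limit is the first argument.
hosts_below_load() {
    awk -v limit="$1" '
        NR <= 2        { next }   # skip the header and the dashed rule
        $1 == "global" { next }   # skip the cluster-wide pseudo host
        $4 == "-"      { next }   # skip hosts that report no load
        $4 + 0 < limit + 0 { print $1 }
    '
}

# Typical use on a submit host (needs a working SGE installation):
#   qhost | hosts_below_load 2.0
```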

    SGE Commands - qsub

    Create a job script (myjob.sh)

    Submit for execution:

    $ qsub myjob.sh
    Your job 742 ("myjob.sh") has been submitted.

    Simplest job:

    [vaidya@shark ~]$ cat myjob.sh
    #!/bin/sh
    sleep 10
    date > /tmp/test1.out.txt

    Variations: qsub -cwd myjob.sh
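
When qsub is called from another script, the job ID is usually needed for later qstat or qdel calls. A minimal sketch, assuming the confirmation message has exactly the form shown above; the function name `job_id_from_qsub` is hypothetical.

```shell
# Extract the numeric job ID from qsub's confirmation message, which has
# the form: Your job 742 ("myjob.sh") has been submitted.
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Typical use (needs a working SGE installation):
#   job_id=$(qsub myjob.sh | job_id_from_qsub)
#   qstat -j "$job_id"
```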

    SGE Commands - qstat

    Check the status of your job:

    qstat ; qstat -f ; qstat -u username ; qstat -j job_id

    [root@shark ~]# qstat
    job-ID  prior    name     user   state  submit/start at      queue            slots  ja-task-ID
    -----------------------------------------------------------------------------------------------
       639  0.55500  HCPDIV7  test1  r      05/17/2006 10:16:31  all.q@shark-c00  1
       658  0.55500  HCPDIV1  test1  r      05/17/2006 13:37:35  all.q@shark-c00  1
       694  0.55500  FCCDVI   test1  r      05/17/2006 23:52:19  all.q@shark-c02  1
       695  0.55500  FCCDVI1  test1  r      05/17/2006 23:52:19  all.q@shark-c02  1

    SGE Commands - qstat

    The status of the job is indicated by letters:

    qw - waiting
    t - transferring
    r - running
    s, S - suspended
    R - restarted
    T - threshold
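
The state letters make it easy to summarize a long qstat listing. The sketch below counts jobs per state, assuming the default qstat layout in which the state is the fifth column and the first two lines are the header and a dashed rule; the function name `count_job_states` is hypothetical.

```shell
# Count jobs per state from qstat-style output read on stdin.
# Column 5 of each job line is the state (qw, r, t, ...).
count_job_states() {
    awk 'NR > 2 { count[$5]++ }
         END    { for (s in count) print s, count[s] }' | sort
}

# Typical use (needs a working SGE installation):
#   qstat -u "$USER" | count_job_states
```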

    SGE Commands - qdel

    Delete your job, if you wish:

    qdel 743
    vaidya has deleted job 743

    SGE Commands - qmon

    qmon is an X Window System GUI tool to submit/delete/view jobs and configure the SGE system.

    Example: Submit a job using qmon

    Click the Job Submission icon.

    Click the Job Script file selection icon to open a file selection box and select your script file. Then, click OK.

    Click the Submit button at the bottom of the Job Submission dialog.

    After a couple of seconds, you should be able to monitor your job in the Job Control dialog. Click the Job Control icon in the QMON control panel.

    You first see it under Pending Jobs, and it quickly moves to Running Jobs after it gets started.

    SGE Commands - qsh, qtcsh

    Submit an interactive session request:

    qlogin
    qrsh

    Ensure you have a valid X server running on your desktop. Allow remote X clients to display on your desktop.

    Submit an interactive session request:

    qsh
    qtcsh

    Note: using this feature needs additional configuration; it may not work otherwise.

    SGE Commands - jobscript

    Sample job script:

    #!/bin/bash
    #
    #$ -cwd
    #$ -j y
    #$ -S /bin/bash
    #$ -V
    date
    sleep 10
    env
    date

    SGE Commands - jobscript

    Sample job script (MPI):

    #!/bin/bash
    #
    #$ -cwd
    #$ -j y
    #$ -S /bin/bash
    #
    $MPI_DIR/mpirun -np $NSLOTS -machinefile $TMPDIR/machines myparallelprog.exe {infile.txt outfile.txt}

    Jobscript useful directives

    -cwd = change to current dir before running job
    -j y = merge error with stdout
    -r y = code is re-runnable
    -N jname = set the job name
    -l h_rt=00:30:00 = run job for max of 30 mins
    -pe mpich <slots> = invoke the mpich parallel environment with the given number of slots
    -pe mpich-ib <slots> = use InfiniBand parallel environment
    -pe mpich-eth <slots> = use Ethernet parallel environment
    -V = carry all env variable settings
    -M [email protected] = send email to this address
    -m bes = send mail at the beginning, end and suspension of the job

    Jobscript useful directives

    -A acctname_to_charge = account to charge the job to
    -a [[CC]yy]MMDDhhmm[.SS] = when to run
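
The `-a` timestamp is easy to get wrong by hand. As a sketch, it can be generated from a Unix epoch time; this assumes GNU `date` (the `-d "@epoch"` form is not available in BSD `date`), and the function name `sge_start_time` is hypothetical.

```shell
# Format a Unix epoch time as the CCyyMMDDhhmm.SS string that -a accepts.
# Assumes GNU date; the result is in the local time zone.
sge_start_time() {
    date -d "@$1" +%Y%m%d%H%M.%S
}

# Typical use: start the job one hour from now (needs SGE to submit):
#   qsub -a "$(sge_start_time $(( $(date +%s) + 3600 )))" myjob.sh
```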

    Admin Commands

    The next few slides show commands useful for SGE admins (not users/researchers).

    Admin Commands - qconf

    In general,

    qconf -s** to show config
    qconf -m** to modify config
    qconf -M** to import config from text file
    qconf -d** to delete config

    SGE Commands - qconf

    Show:

    complexes:       qconf -sc
    queues:          qconf -sql
    PE:              qconf -spl
    exec hosts:      qconf -sel ; qconf -se c35
    submit hosts:    qconf -ss
    admin hosts:     qconf -sh
    calendars:       qconf -scall
    configuration:   qconf -sconf
    user list:       qconf -suserl
    scheduler conf:  qconf -ssconf

    SGE Commands - qping

    [anand@shark-c02 ~]$ qping -info shark-c01 537 execd 1
    05/24/2006 21:57:34:
    SIRM version:             0.1
    SIRM message id:          1
    start time:               05/24/2006 21:31:37 (1148477497)
    run time [s]:             1768
    messages in read buffer:  0
    messages in write buffer: 0
    nr. of connected clients: 2
    status:                   0
    info:                     dispatcher: R (0.04) | OK
    Monitor:                  disabled

    Acknowledgements & Copying

    This material is based on my experience as well as material collected from SGE documentation.

    This presentation can be redistributed as follows:

    No commercial redistribution, e.g. as part of a for-profit CD-ROM or as part of your sales pitch. Seek my permission first.

    Must attribute the document creator.

    Share alike: if you use this document and enhance or modify it, share the modifications or the modified document.

    Which means I apply the Creative Commons License, http://creativecommons.org/licenses/by-nc-sa/2.5/

    The End

    Thanks for your time. If you have any feedback, corrections or questions, please contact me: Anand Vaidya, [email protected]

    This document was created with OpenOffice on Linux. Email me if you want the odp file instead of the pdf.