Introduction to Abel and SLURM
Sabry Razick (Slides by: Katerina Michalickova)
The Research Computing Services Group, USIT
November 02, 2015
Topics
• The Research Computing Services group
• Abel details
• Getting an account & Logging in
• Copying files
• Running a simple job
• Queuing system
• Job administration
• User administration
• Parallel jobs
The Research Computing Services (Seksjon for IT i Forskning)
• The RCS group provides access to IT resources
and high performance computing to
researchers at UiO and to NOTUR users
• http://www.uio.no/english/services/it/research/
• Part of USIT
• Contact: [email protected]
The Research Computing Services
• Operation of Abel - a computer cluster
• User support
• Data storage
• Secure data analysis and storage - TSD
• Portals
• Lifeportal (https://lifeportal.uio.no/)
• Geo (https://geoportal-dev.hpc.uio.no)
• Language (https://lap.hpc.uio.no)
Abel
• Computer cluster
– Similar computers connected by a local area network (LAN); different from a cloud or a grid.
• Enables parallel computing
• Science presents multiple problems of parallel nature
– Sequence database searches
– Genome assembly and annotation
– Simulations
We expect that users know what they are doing!
From a bunch of computers to a cluster
• Hardware
– Powerful computers (nodes)
– High-speed connection between nodes
– Access to a common file system
• Software
– Operating system: Linux, 64-bit CentOS 6 (based on the Rocks Cluster Distribution)
– Identical mass installations.
– Queuing system enables timely execution of many concurrent processes
Abel in numbers
• Nodes - 650+
• Cores - 10000+
• Total memory - 40 TiB (tebibytes)
• Total storage - 400 TiB using BeeGFS
• 96th most powerful in 2012, now 444th (June 2015)
Accessing Abel
• If you are working or studying at UiO, you can have an Abel account directly from us.
• If you are a Norwegian scientist (or need large resources), you can apply through NOTUR – http://www.notur.no
• Write to us for information:
• Read about getting access: http://www.uio.no/hpc/abel/help/access
Connecting to Abel
• Linux - Ubuntu, Red Hat (RHEL)
• Windows - using PuTTY and WinSCP
• Mac OS
PuTTY - http://www.putty.org/
File upload/download - WinSCP
• http://winscp.net/eng/download.php
File transfer on command line
• Unix users can use secure copy or rsync
commands
– Copy myfile.txt from the current directory on your
machine to your home area on Abel:
scp myfile.txt [email protected]:~
– For large files, use the rsync command:
rsync -z myfile.tar [email protected]:~
Software on Abel
• Available on Abel:
http://www.uio.no/hpc/abel/help/software
• Software on Abel is organized in modules.
– List all software (and versions) organized in modules:
module avail
– Load software from a module:
module load module_name
• If you cannot find what you are looking for, ask us
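A short illustrative session (python2 is only an example module name; check the output of module avail for what is actually installed):
module avail          # list all available software modules and versions
module load python2   # load one of them (example name)
module list           # show which modules are currently loaded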
Your own software
• You can install your own software in your home area
• This time we have a special lecture on this by an expert (Bjørn-Helge Mevik) - November 3, 14:15
Using Abel
• When you log into Abel, you land on one of the login nodes, login0 or login1.
• It is not allowed to execute programs (jobs) directly on the login nodes.
• Jobs are submitted to Abel via the queuing system.
• The login nodes are just for logging in, copying files, editing, compiling, running short tests (no more than a couple of minutes), submitting jobs, checking job status, etc.
• For interactive execution use qlogin.
SLURM
● Simple Linux Utility for Resource Management
(workload manager)
● Allocates exclusive and/or non-exclusive
access to resources (computer nodes) to users
for some duration of time
● Provides a framework for starting, executing,
and monitoring work on a set of allocated
nodes.
● Manages a queue of pending work.
Computing on Abel
• Submit a job to the queuing system
– Software that executes jobs on available resources on the cluster (and much more)
• Communicate with the queuing system using a shell (or job) script
• Retrieve results (or errors) when the job is done
• Read tutorial: http://www.uio.no/hpc/abel/help/user-guide
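In practice the cycle looks roughly like this (a sketch; the script name myjob.slurm is just an example, and slurm-<jobid>.out is SLURM's default output file name):
sbatch myjob.slurm        # submit the job script to the queue
squeue -u $USER           # check whether the job is pending or running
less slurm-<jobid>.out    # inspect the output after the job has finished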
Job script
• Job script - a shell script including the command that one needs to execute
• EXTRA comments read by the queuing system – “#SBATCH --xxxx”
• Compulsory values:
#SBATCH --account
#SBATCH --time
#SBATCH --mem-per-cpu
• Setting up a job environment
source /cluster/bin/jobsetup
Project/Account
• Each user belongs to a project on Abel
• Each project has a set of resources
• Learn about your project(s):
– Use: projects
Minimal job script
#!/bin/bash
## Job name:
#SBATCH --job-name=jobname
## Project:
#SBATCH --account=your_account
## Wall time:
#SBATCH --time=hh:mm:ss
## Max memory:
#SBATCH --mem-per-cpu=max_size_in_memory

## Set up environment
source /cluster/bin/jobsetup

## Run command
./executable > outfile
Example job script
#!/bin/bash
#SBATCH --job-name=RCS1115_hello
#SBATCH --account=xxx
#SBATCH --time=00:01:05
#SBATCH --mem-per-cpu=512M
source /cluster/bin/jobsetup
set -o errexit
sleep 1m
python hello.py
The script has three parts: Resources (the #SBATCH lines), Setup (source /cluster/bin/jobsetup), and Job (the commands to run).
Submitting a job - sbatch
• sbatch prints the job ID of the submitted job.
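For example (the script name and the job ID shown here are only illustrative):
sbatch myjob.slurm
Submitted batch job 1111231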
Checking a job - squeue
• squeue -u <USER_NAME>
• scontrol show job 1111231 - show full details about a job, for example:
JobId=1111231 Name=mitehunter
UserId=XXXX(100379) GroupId=users(100) Priority=21479 Nice=0 Account=nn9244k QOS=notur
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=15-16:25:15 TimeLimit=21-00:00:00 TimeMin=N/A
SubmitTime=2015-10-12T20:43:14 EligibleTime=2015-10-12T20:43:14
StartTime=2015-10-12T20:51:26 EndTime=2015-11-02T19:51:27
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=long AllocNode:Sid=login-0-2:14827
ReqNodeList=(null) ExcNodeList=(null)
NodeList=c18-34 BatchHost=c18-34
NumNodes=1 NumCPUs=4 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
MinCPUsNode=4 MinMemoryCPU=3900M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/work/users/xxxxx/gadMor2/mite_hunter_1.0/mite_hunter.slurm
WorkDir=/work/users/xxxxx/gadMor2/mite_hunter_1.0
StdErr=/work/users/xxxxx/../slurm-12780231.out StdIn=/dev/null
StdOut=/work/users/XXX/gadMor2/mite_hunter_1.0/slurm-12780231.out
Use of the SCRATCH area
#!/bin/sh
#SBATCH --job-name=YourJobname
#SBATCH --account=YourProject
#SBATCH --time=hh:mm:ss
#SBATCH --mem-per-cpu=max_size_in_memory
source /cluster/bin/jobsetup

## Copy files to work directory:
cp $SUBMITDIR/YourData $SCRATCH

## Mark outfiles for automatic copying to $SUBMITDIR:
chkfile YourOutput

## Run command
cd $SCRATCH
executable YourData > YourOutput
Some useful commands
• scancel <JOBID> - cancel a job before it ends
• dusage - find out your disk usage
• squeue - list all queued jobs and find out the position of your job
• squeue -t R | more - list only the running jobs
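A quick illustration, reusing the job ID from the sbatch example above:
scancel 1111231        # cancel that job before it ends
dusage                 # show your disk usage
squeue -u $USER -t R   # list only your running jobs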
qlogin
• Reserve some resources for a given time.
• Example - Reserve one node (or 16 cores) on
Abel for your interactive use for 1 hour:
qlogin --account=your_project --ntasks-per-node=16 --time=01:00:00
• Run "source /cluster/bin/jobsetup" after receiving the allocation
http://www.uio.no/english/services/it/research/hpc/abel/help/user-guide/interactive-logins.htm
Interactive use of Abel - qlogin
• Send request for a resource (e.g. 4 cores)
• Work on command line when a node becomes available
• Example - book one node (or 16 cores) on Abel for your interactive use for 1 hour:
qlogin --account=your_project --ntasks-per-node=16 --time=01:00:00
• Run "source /cluster/bin/jobsetup" after receiving the allocation
• For more info, see:
http://www.uio.no/hpc/abel/help/user-guide/interactive-logins.html
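A possible interactive session, here asking for 4 cores (the program name and file names are hypothetical):
qlogin --account=your_project --ntasks-per-node=4 --time=01:00:00
source /cluster/bin/jobsetup
./my_program input.txt > output.txt   # do your interactive work
exit                                  # give the allocation back when done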
Environment variables
• SLURM_JOBID – job-id of the job
• SCRATCH – name of job-specific scratch-area
• SLURM_NPROCS – total number of CPUs requested
• SLURM_CPUS_ON_NODE – number of CPUs allocated on the node
• SUBMITDIR – directory where sbatch was issued
• TASK_ID – task number (for arrayrun jobs)
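A small sketch showing how these variables might be used inside a job script (after source /cluster/bin/jobsetup; input.dat is a hypothetical file):
echo "Job $SLURM_JOBID uses $SLURM_NPROCS CPUs ($SLURM_CPUS_ON_NODE on this node)"
echo "Scratch area: $SCRATCH - submitted from: $SUBMITDIR"
cp $SUBMITDIR/input.dat $SCRATCH      # stage input data to the scratch area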
Arrayrun
● Parallel jobs - executing many instances of
the same executable at the same time.
● Many input datasets
● Simulations with different input parameters.
● Possible to split a large input file into chunks and parallelize your job (see the sketch below).
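A minimal sketch of an arrayrun job, assuming the arrayrun command and the TASK_ID variable listed earlier; the worker script, chunk file names and the range 1-100 are hypothetical (see the Abel user guide for the exact syntax).

Job script:
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --account=your_account
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=512M
source /cluster/bin/jobsetup
## Start 100 instances of the worker script; each instance gets its own TASK_ID
arrayrun 1-100 worker.sh

worker.sh:
#!/bin/bash
source /cluster/bin/jobsetup
## Process the input chunk that belongs to this task
./process_chunk input_${TASK_ID}.txt > result_${TASK_ID}.txt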
MPI
● Message Passing Interface
● MPI is a language-independent communications
protocol used for programming parallel computers.
● We support Open MPI
○ module load openmpi
● Jobs specifying more than one node automatically get
○ #SBATCH --constraint=ib
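A minimal sketch of an Open MPI job script, assuming an already compiled MPI program called my_mpi_program; the node and memory numbers are only examples:
#!/bin/bash
#SBATCH --job-name=mpi_example
#SBATCH --account=your_account
#SBATCH --time=00:30:00
#SBATCH --mem-per-cpu=512M
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
## More than one node, so --constraint=ib is added automatically (see above)
source /cluster/bin/jobsetup
module load openmpi
mpirun ./my_mpi_program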