Introduction to Abel and SLURM
Sabry Razick (Slides by: Katerina Michalickova)
The Research Computing Services Group, USIT
November 02, 2015
Topics
• The Research Computing Services group
• Abel details
• Getting an account & Logging in
• Copying files
• Running a simple job
• Queuing system
• Job administration
• User administration
• Parallel jobs
The Research Computing Services (Seksjon for IT i Forskning)
• The RCS group provides access to IT resources
and high performance computing to
researchers at UiO and to NOTUR users
• http://www.uio.no/english/services/it/research/
• Part of USIT
• Contact: [email protected]
The Research Computing Services
• Operation of Abel - a computer cluster
• User support
• Data storage
• Secure data analysis and storage - TSD
• Portals
• Lifeportal (https://lifeportal.uio.no/)
• Geo (https://geoportal-dev.hpc.uio.no)
• Language (https://lap.hpc.uio.no)
Abel
• Computer cluster
– Similar computers connected by a local area network (LAN); different from a cloud or a grid.
• Enables parallel computing
• Science presents multiple problems of parallel nature
– Sequence database searches
– Genome assembly and annotation
– Simulations
We expect that users know what they are doing!
From a bunch of computers to a cluster
• Hardware
– Powerful computers (nodes)
– High-speed connection between nodes
– Access to a common file system
• Software
– Operating system: Linux, 64-bit CentOS 6 (based on the Rocks Cluster Distribution)
– Identical mass installations.
– Queuing system enables timely execution of many concurrent processes
Abel in numbers
• Nodes - 650+
• Cores - 10000+
• Total memory - 40 TiB (tebibytes)
• Total storage - 400 TiB using BeeGFS
• 96th most powerful in 2012, now 444th (June 2015)
Accessing Abel
• If you are working or studying at UiO, you can have an Abel account directly from us.
• If you are a Norwegian scientist (or need large resources), you can apply through NOTUR – http://www.notur.no
• Write to us for information:
• Read about getting access: http://www.uio.no/hpc/abel/help/access
Connecting to Abel
• Linux - Ubuntu, Red Hat (RHEL)
• Windows - using PuTTY and WinSCP
• Mac OS
PuTTY - http://www.putty.org/
File upload/download - WinSCP
• http://winscp.net/eng/download.php
File transfer on command line
• Unix users can use secure copy or rsync
commands
– Copy myfile.txt from the current directory on your
machine to your home area on Abel:
scp myfile.txt [email protected]:~
– For large files, use the rsync command:
rsync -z myfile.tar [email protected]:~
Software on Abel
• Available on Abel:
http://www.uio.no/hpc/abel/help/software
• Software on Abel is organized in modules.
– List all software (and versions) organized in modules:
module avail
– Load software from a module:
module load module_name
• If you cannot find what you are looking for, ask us
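A short illustrative session (python2 is only an example module name; check the output of module avail for what is actually installed):
module avail          # list all available software modules and versions
module load python2   # load one of them (example name)
module list           # show which modules are currently loaded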
Your own software
• You can install your own software in your home area
• This time we have a special lecture on this by an expert (Bjørn-Helge Mevik) - November 3, 14:15
Using Abel
• When you log into Abel, you land on one of the login nodes, login0 or login1.
• It is not allowed to execute programs (jobs) directly on the login nodes.
• Jobs are submitted to Abel via the queuing system.
• The login nodes are just for logging in, copying files, editing, compiling, running short tests (no more than a couple of minutes), submitting jobs, checking job status, etc.
• For interactive execution use qlogin.
SLURM
● Simple Linux Utility for Resource Management
(workload manager)
● Allocates exclusive and/or non-exclusive
access to resources (computer nodes) to users
for some duration of time
● Provides a framework for starting, executing,
and monitoring work on a set of allocated
nodes.
● Manages a queue of pending work.
Computing on Abel
• Submit a job to the queuing system
– Software that executes jobs on available resources on the cluster (and much more)
• Communicate with the queuing system using a shell (or job) script
• Retrieve results (or errors) when the job is done
• Read tutorial: http://www.uio.no/hpc/abel/help/user-guide
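In practice the cycle looks roughly like this (a sketch; the script name myjob.slurm is just an example, and slurm-<jobid>.out is SLURM's default output file name):
sbatch myjob.slurm        # submit the job script to the queue
squeue -u $USER           # check whether the job is pending or running
less slurm-<jobid>.out    # inspect the output after the job has finished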
Job script
• Job script - a shell script including the command that one needs to execute
• EXTRA comments read by the queuing system – “#SBATCH --xxxx”
• Compulsory values:
#SBATCH --account
#SBATCH --time
#SBATCH --mem-per-cpu
• Setting up a job environment
source /cluster/bin/jobsetup
Project/Account
• Each user belongs to a project on Abel
• Each project has a set of resources
• Learn about your project(s):
– Use: projects
Minimal job script
#!/bin/bash
## Job name:
#SBATCH --job-name=jobname
## Project:
#SBATCH --account=your_account
## Wall time:
#SBATCH --time=hh:mm:ss
## Max memory:
#SBATCH --mem-per-cpu=max_size_in_memory

## Set up environment
source /cluster/bin/jobsetup

## Run command
./executable > outfile
Example job script
#!/bin/bash
#SBATCH --job-name=RCS1115_hello
#SBATCH --account=xxx
#SBATCH --time=00:01:05
#SBATCH --mem-per-cpu=512M
source /cluster/bin/jobsetup
set -o errexit
sleep 1m
python hello.py
The script has three parts: Resources (the #SBATCH lines), Setup (source /cluster/bin/jobsetup), and Job (the commands to run).
Submitting a job - sbatch
• sbatch prints the job ID of the submitted job.
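For example (the script name and the job ID shown here are only illustrative):
sbatch myjob.slurm
Submitted batch job 1111231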
Checking a job - squeue
• squeue -u <USER_NAME>
• scontrol show job 1111231 - show full details about a job, for example:
JobId=1111231 Name=mitehunter
UserId=XXXX(100379) GroupId=users(100) Priority=21479 Nice=0 Account=nn9244k QOS=notur
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=15-16:25:15 TimeLimit=21-00:00:00 TimeMin=N/A
SubmitTime=2015-10-12T20:43:14 EligibleTime=2015-10-12T20:43:14
StartTime=2015-10-12T20:51:26 EndTime=2015-11-02T19:51:27
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=long AllocNode:Sid=login-0-2:14827
ReqNodeList=(null) ExcNodeList=(null)
NodeList=c18-34 BatchHost=c18-34
NumNodes=1 NumCPUs=4 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
MinCPUsNode=4 MinMemoryCPU=3900M MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=0 Contiguous=0 Licenses=(null) Network=(null)
Command=/work/users/xxxxx/gadMor2/mite_hunter_1.0/mite_hunter.slurm
WorkDir=/work/users/xxxxx/gadMor2/mite_hunter_1.0
StdErr=/work/users/xxxxx/../slurm-12780231.out StdIn=/dev/null
StdOut=/work/users/XXX/gadMor2/mite_hunter_1.0/slurm-12780231.out
Use of the SCRATCH area
#!/bin/sh
#SBATCH --job-name=YourJobname
#SBATCH --account=YourProject
#SBATCH --time=hh:mm:ss
#SBATCH --mem-per-cpu=max_size_in_memory
source /cluster/bin/jobsetup

## Copy files to work directory:
cp $SUBMITDIR/YourData $SCRATCH

## Mark outfiles for automatic copying to $SUBMITDIR:
chkfile YourOutput

## Run command
cd $SCRATCH
executable YourData > YourOutput
Some useful commands
• scancel <JOBID> - cancel a job before it ends
• dusage - find out your disk usage
• squeue - list all queued jobs and find out the position of your job
• squeue -t R | more - list only the running jobs
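A quick illustration, reusing the job ID from the sbatch example above:
scancel 1111231        # cancel that job before it ends
dusage                 # show your disk usage
squeue -u $USER -t R   # list only your running jobs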
qlogin
• Reserve some resources for a given time.
• Example - Reserve one node (or 16 cores) on
Abel for your interactive use for 1 hour:
qlogin --account=your_project --ntasks-per-node=16 --time=01:00:00
• Run "source /cluster/bin/jobsetup" after receiving the allocation
http://www.uio.no/english/services/it/research/hpc/abel/help/user-guide/interactive-logins.htm
Interactive use of Abel - qlogin
• Send request for a resource (e.g. 4 cores)
• Work on command line when a node becomes available
• Example - book one node (or 16 cores) on Abel for your interactive use for 1 hour:
qlogin --account=your_project --ntasks-per-node=16 --time=01:00:00
• Run "source /cluster/bin/jobsetup" after receiving the allocation
• For more info, see:
http://www.uio.no/hpc/abel/help/user-guide/interactive-logins.html
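A possible interactive session, here asking for 4 cores (the program name and file names are hypothetical):
qlogin --account=your_project --ntasks-per-node=4 --time=01:00:00
source /cluster/bin/jobsetup
./my_program input.txt > output.txt   # do your interactive work
exit                                  # give the allocation back when done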
Environment variables
• SLURM_JOBID – job-id of the job
• SCRATCH – name of job-specific scratch-area
• SLURM_NPROCS – total number of CPUs requested
• SLURM_CPUS_ON_NODE – number of CPUs allocated on the node
• SUBMITDIR – directory where sbatch was issued
• TASK_ID – task number (for arrayrun jobs)
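A small sketch showing how these variables might be used inside a job script (after source /cluster/bin/jobsetup; input.dat is a hypothetical file):
echo "Job $SLURM_JOBID uses $SLURM_NPROCS CPUs ($SLURM_CPUS_ON_NODE on this node)"
echo "Scratch area: $SCRATCH - submitted from: $SUBMITDIR"
cp $SUBMITDIR/input.dat $SCRATCH      # stage input data to the scratch area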
Arrayrun
● Parallel jobs - executing many instances of
the same executable at the same time.
● Many input datasets
● Simulations with different input parameters.
● Possible to split a large input file into chunks and parallelize your job (see the sketch below).
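A minimal sketch of an arrayrun job, assuming the arrayrun command and the TASK_ID variable listed earlier; the worker script, chunk file names and the range 1-100 are hypothetical (see the Abel user guide for the exact syntax).

Job script:
#!/bin/bash
#SBATCH --job-name=array_example
#SBATCH --account=your_account
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=512M
source /cluster/bin/jobsetup
## Start 100 instances of the worker script; each instance gets its own TASK_ID
arrayrun 1-100 worker.sh

worker.sh:
#!/bin/bash
source /cluster/bin/jobsetup
## Process the input chunk that belongs to this task
./process_chunk input_${TASK_ID}.txt > result_${TASK_ID}.txt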
MPI
● Message Passing Interface
● MPI is a language-independent communications
protocol used for programming parallel computers.
● We support Open MPI
○ module load openmpi
● Jobs specifying more than one node automatically get
○ #SBATCH --constraint=ib
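A minimal sketch of an Open MPI job script, assuming an already compiled MPI program called my_mpi_program; the node and memory numbers are only examples:
#!/bin/bash
#SBATCH --job-name=mpi_example
#SBATCH --account=your_account
#SBATCH --time=00:30:00
#SBATCH --mem-per-cpu=512M
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
## More than one node, so --constraint=ib is added automatically (see above)
source /cluster/bin/jobsetup
module load openmpi
mpirun ./my_mpi_program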