Transcript of WestGrid Town Hall & Cedar Office Hours

Page 1

WestGrid Town Hall & Cedar Office Hours

Patrick Mann, Director of Operations
Alex Razoumov, Visualization Coordinator

Wednesday, July 12, 2017

Page 2

Introduction

1. New Systems
   a. Status and Availability
   b. Known Issues
2. CC Services Overview
3. WestGrid Legacy Systems
4. New System Training and Docs
5. Using the New Systems (Alex Razoumov)
6. "Office Hours" - we'll answer questions

Page 3

Admin

To ask questions:
● Webstream: email [email protected]
● Vidyo: un-mute & ask your question
  (VidyoDesktop users can also type questions in Vidyo Chat - click the chat bubble icon in the Vidyo menu)

Vidyo users: please MUTE yourself when not speaking (click the microphone icon to mute / un-mute).

Page 4

New Systems & Migration

Patrick Mann
Director of Operations

WestGrid

Page 5

Top500

https://www.top500.org/

Rank | Name | Computer | Cores* | Rmax (TFlop/s) | Rpeak (TFlop/s)
1 | Sunway TaihuLight | Sunway MPP, Sunway SW26010 260C 1.45GHz, Sunway | 10,649,600 | 93,014 | 125,436
86 | Cedar | Dell C4130/Dell C6320, Xeon E5-2650 v4/E5-2683 v4 (Intel Broadwell), Intel Omni-Path, NVIDIA Tesla P100 | 59,776 | 1,337 | 3,710
95 | Graham | Huawei X6800 V3, Xeon E5-2683 v4 16C 2.1GHz (Intel Broadwell), InfiniBand EDR/FDR, NVIDIA Tesla P100 | 51,200 | 1,228 | 2,641

* includes GPUs

Page 6

National Compute Systems

System | Details | Current Status
Arbutus (GP1, UVic) | west.cloud.computecanada.ca: 7,640 cores. Storage updates complete, network (100G) in progress. east.cloud.computecanada.ca is part of the cloud network. | IN OPERATION (Sep 2016), RAC 2017
Cedar (GP2, SFU) | 27,696 CPU cores; 146 GPU nodes (4 x NVIDIA P100); running jobs | IN OPERATION (June 30, 2017), RAC 2017
Graham (GP3, Waterloo) | 33,472 CPU cores; 160 GPU nodes (2 x NVIDIA P100); running jobs | IN OPERATION (June 30, 2017), RAC 2017
Niagara (LP1, Toronto) | Last-and-best-offer phase | Late 2017

Page 7

National Data Cyberinfrastructure (NDC)

System | Details | Current Status
Silo Interim | Re-migration to permanent PROJECT + TAPE in planning | Available
NDC-SFU-Project | 10 PB, backed up to tape | Available
NDC-SFU-Nearline | Tape libraries installed and in operation; tape off-lining processes in development | Autumn 2017
NDC-Waterloo-Project | 13 PB, backed up to tape | Available
NDC-Waterloo-Nearline | (identical to SFU) | Autumn 2017
NDC-Object Storage | Object storage (DDN WOS); lots of demand but not allocated; initial prototype for internal testing installed on cloud | Autumn 2017
Attached (Scratch) | High-performance storage attached to clusters | Available

Page 8

Current Status

● Not perfect yet!
● Systems are still in development and may go down.
  ○ But generally with advance warning.
● RAC 2017 compute priorities have been included.
  ○ But not storage (details next slide).
● We expect users will continue to discover problems.
  ○ Scale-out testing: lots of users.
  ○ Diverse-usage testing: users running their favourite apps.
  ○ Documentation testing: errors, inaccuracies, misleading sections, missing information, ...
● Please continue to report issues and problems.
  ○ Known issues will be itemized on the docs wiki.

status.computecanada.ca docs.computecanada.ca [email protected]

Page 9

Known Issues: Storage

See https://docs.computecanada.ca/wiki/Known_issues

1. Nearline is under development.
   a. RAC 2017 Nearline allocations are not yet available.
2. Project space still needs some configuration.
   a. Currently /project/<username>.
   b. Eventually something like /project/projects/<groupname>.
      (which is the basis for the storage allocations)
3. RAC 2017 storage quotas are not implemented.
   a. Quotas are not in place yet, so we are keeping an eye on RAC 2017 allocations.
4. Ask [email protected] if you need your RAC (nearline or project) or want help with group requests, shared storage, ...

Page 10

Known Issues: Jobs

1. Cedar and Graham are running full production Slurm.
   a. Graham was running a prototype until late last week.
2. Email job notifications.
   a. Working on Cedar, still needed on Graham.
3. Slurm syntax is different.
   a. We've had a few tickets about details. Feel free to ask!
4. Scheduling is leaving unused nodes.
   a. (next slide)

Page 11

Known Issues: Scheduling

Nodes are empty!

● The "by-core" partition is small, so "by-core" requests have access to less of the machine.
● "By-node" requests can run on the whole cluster.
● Similarly for run-times: a 3-hour limit gives access to everything.

So, best practice (see the sketch below):
● Request whole nodes (--nodes=...).
● Request short run-times (--time=...).
● The scheduler will choose the best partition for your job.
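A minimal sketch of a whole-node request along these lines (node size taken from the base-node specs later in the deck; the account name and executable are placeholders):

#!/bin/bash
#SBATCH --nodes=1                 # request a whole node rather than individual cores
#SBATCH --ntasks-per-node=32      # all 32 cores of a base node
#SBATCH --time=0-03:00            # short run-time (3 hours)
#SBATCH --mem=120G                # a bit under the 128GB of a base node, leaving room for the OS
#SBATCH --account=def-someuser    # placeholder accounting group
srun ./your_program               # placeholder executable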

Page 12

Known Issues: Accounting

Users with more than one accounting group are asked to specify a group:

sbatch --account=def-pjmann-ab

RAC holders in particular have both RAS (default) and RAC accounts. This is annoying, and the accounting groups have a special format (not the RAPIs):

[def|rrg|rpp]-[PI's user name]-[XX]

Examples: "def-pjmann-ab", "rrg-george-bb"

The account can be set in the job script (#SBATCH --account=...) or via the environment (export SBATCH_ACCOUNT=...), as sketched after the error message below.

pjmann@cedar5:~$ sbatch hello_world_no_account.sh
sbatch: error: ----------------------------------------
You are associated with multiple _cpu allocations...
Please specify one of the following accounts to submit this job:
RAS default accounts: def-pjmann-ab, def-pjmann
RAC accounts: rpp-pjmann-ab
Compute-Burst accounts:
Use the parameter --account=desired_account when submitting your job
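A short sketch of the two ways to set the accounting group described above, reusing the example group name:

# in the job script:
#SBATCH --account=def-pjmann-ab

# or once per shell session, before calling sbatch:
$ export SBATCH_ACCOUNT=def-pjmann-ab
$ sbatch hello_world_no_account.sh    # now submits without the "multiple allocations" error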

Page 13

Known Issues: Development

1. CVMFS vs system conflicts
   a. Shared libraries and applications live on CVMFS (shared between Graham and Cedar, and also on test and dev systems).
   b. But some base libraries and apps come with the system.
   c. So build systems (cmake, autoconf, ...) can get mixed up.
   d. Ask [email protected].
2. srun vs mpiexec (for MPI jobs)
   a. Slurm's "srun" should be faster than "mpiexec" because it knows the node-grouping configuration.
3. GPU visualization and X11
   a. CPU rendering works, but there are still issues with GPU rendering.

Page 14

WestGrid Legacy Systems

Site | System(s) | Defunding / Data Deletion
Victoria | Hermes/Nestor | June 1, 2017 (data available until July 31)
Calgary | Breezy/Lattice | August 31, 2017
Edmonton | Hungabee/Jasper | October 1, 2017

WestGrid Migration Details: https://www.westgrid.ca/migration_process

IMPORTANT: Data on defunded systems will be deleted after the published deletion date. WestGrid will not retain any long-term or backup copies of user data, and, as noted above, users must arrange to migrate their data.

Page 15

WG System Issues

Orcinus
● Filesystem (Lustre) issues July 4/5, 2017. Jobs lost.

Hungabee
● July 2: major hardware failure.
● July 6: another hardware failure.
● Not allocated; scheduled for defunding at the end of September.
● May not be repairable without prohibitive cost.
● Please move to the large-memory nodes on Graham and Cedar (up to 3 TB).

Bugaboo
● Current: no storage left (scratch and project)!
● Please clean up.

Our usual warning: legacy systems are old!
● Minimal support.
● Generally the storage is kept under vendor support, but not the nodes.

Page 16

Training on New Systems

Documentation wiki (* still under development! *): https://docs.computecanada.ca

Getting started on the systems:
● Running Jobs - https://docs.computecanada.ca/wiki/Running_jobs
● Available Software - https://docs.computecanada.ca/wiki/Available_software
● Storage & File Management - https://docs.computecanada.ca/wiki/Storage_and_file_management

Mini-Webinar Video Series: http://tinyurl.com/CCsystemwebinars

Short video demonstrations of how to use the new systems. Topics include:
● Software environment
● File systems
● Managing jobs
● Common mistakes to avoid
● Getting help

Webinar July 19: How jobs are scheduled to run on Graham & Cedar
https://www.westgrid.ca/events

This session will explain and demonstrate how the Slurm scheduler determines how jobs are dispatched to resources. It will also provide recommended best practices when submitting jobs and demonstrate tools for monitoring jobs and the job queue on Graham and Cedar.

Page 17

Alex Razoumov
Visualization Coordinator

WestGrid

Cedar Demonstration

Page 18

Cedar and Graham

feature | Cedar | Graham
purpose | general-purpose cluster for a variety of workloads | general-purpose cluster for a variety of workloads
specs | https://docs.computecanada.ca/wiki/Cedar | https://docs.computecanada.ca/wiki/Graham
processor count | 27,696 CPUs (66,000 in 2018) and 584 GPUs | 32,136 CPUs and 320 GPUs
interconnect | 100 Gbit/s Intel OmniPath, non-blocking to 1024 cores | 56-100 Gb/s Mellanox InfiniBand, non-blocking to 1024 cores
128GB base nodes | 576 nodes: 32 cores/node | 800 nodes: 32 cores/node
256GB large nodes | 128 nodes: 32 cores/node | 56 nodes: 32 cores/node
0.5TB bigmem500 | 24 nodes: 32 cores/node | 24 nodes: 32 cores/node
1.5TB bigmem1500 | 24 nodes: 32 cores/node | -
3TB bigmem3000 | 4 nodes: 32 cores/node | 3 nodes: 56 cores/node
128GB GPU base | 114 nodes: 24 cores/node, 4 NVIDIA P100 Pascal GPUs with 12GB HBM2 memory | 160 nodes: 24 cores/node, 2 NVIDIA P100 Pascal GPUs with 12GB HBM2 memory
256GB GPU large | 32 nodes: 24 cores/node, 4 NVIDIA P100 Pascal GPUs with 16GB HBM2 memory | -

All nodes have on-node SSD storage.

Page 19

File systems
Details at https://docs.computecanada.ca/wiki/Storage_and_file_management

filesystem | quotas | backed up? | purged? | performance | mounted on compute nodes?
/home | 50GB, 5e5 files per user | nightly | no | medium | yes
/scratch | no real quotas, except when full | no | yes | high for large files | yes
/project (long-term disk storage) | 1-10TB, 5e5 files per user; 10TB, 5e6 files per group | nightly | no | medium | generally no
/nearline (tape with disk caching) | 5TB per group via RAC | no | no | medium to low | no
/tmp | none | no | maybe | very high | local

Wide range of options from high-speed temporary storage to different kinds of long-term storage.

Checking disk usage: coming soon to the wiki.
Requesting more storage: see the wiki.

Page 20

Logging into the systems: use your CC account

On Mac or Linux, in a terminal:

$ ssh [email protected]   # Cedar login node

On Windows there are many options:
● MobaXTerm - https://docs.computecanada.ca/wiki/Connecting_with_MobaXTerm
● PuTTY - https://docs.computecanada.ca/wiki/Connecting_with_PuTTY
● bash from the Windows Subsystem for Linux (WSL) - Windows 10 only; need to enable developer mode and then WSL

New Compute Canada systems use CC accounts, while legacy systems use WestGrid accounts.

SSH key pairs are very handy to avoid typing passwords (see the sketch below):
● implies secure handling of private keys and non-empty passphrases
● https://docs.computecanada.ca/wiki/SSH_Keys
● https://docs.computecanada.ca/wiki/Using_SSH_keys_in_Linux
● https://docs.computecanada.ca/wiki/Generating_SSH_keys_in_Windows
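A minimal sketch of setting up key-based login from a Mac or Linux terminal (the key file name, username, and login host are placeholders; see the wiki pages above for details):

$ ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa_cc                # generate a key pair; use a non-empty passphrase
$ ssh-copy-id -i ~/.ssh/id_rsa_cc.pub username@<login node>    # append the public key to ~/.ssh/authorized_keys on the cluster
$ ssh -i ~/.ssh/id_rsa_cc username@<login node>                # subsequent logins use the key instead of the password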

GUI connection: X11 forwarding (through ssh), VNC, x2go – not yet fully set up


Page 21

Cluster software environment at a glance

All systems run Linux (CentOS 7).

Programming languages: C/C++, Fortran 90, Python, R, Java, Chapel - several different versions and flavours for most of these

CPU parallel development support: MPI, OpenMP, Chapel

GPU parallel development support: CUDA, OpenCL, OpenACC

Job scheduler: Slurm open-source scheduler and resource manager

Popular software: installed by staff, listed at https://docs.computecanada.ca/wiki/Available_software
● lower-level, not performance-sensitive packages are installed via the Nix package manager
● general packages are installed via the EasyBuild framework
● everything is located under /cvmfs and loaded via modules (next slide)

Other software:
● email [email protected] with your request, or
● compile it in your own space (feel free to ask staff for help)


Page 22

Software modules

Use appropriate modules to load centrally-installed software (might have to select the right version):

$ module avail <name>      # search for a module
$ module spider <name>     # will give a little bit more info
$ module list              # show currently loaded modules
$ module load moduleName
$ module unload moduleName
$ module show moduleName   # show commands in the module

All associated prerequisite modules will be automatically loaded as well

Modules must be loaded before a job using them is submitted
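For example, a typical sequence before submitting a job that needs the GNU toolchain (module name taken from the compilers slide below; exact versions may differ):

$ module load gcc/5.4.0       # swaps out the default Intel compiler modules
$ module list                 # confirm what is now loaded
$ module show gcc/5.4.0       # inspect what the module sets (PATH, library paths, ...)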


Page 23

File transfer

(1) Use scp to copy individual files and directories:

$ scp filename [email protected]:/path/to
$ scp [email protected]:/path/to/filename localPath

(2) Use rsync to sync files or directories:

$ flags='-av --progress --delete'
$ rsync $flags localPath/*pattern* [email protected]:/path/to
$ rsync $flags [email protected]:/path/to/*pattern* localPath

Or use Globus file transfer: https://docs.computecanada.ca/wiki/Globus
● easy-to-use web interface https://globus.computecanada.ca (log in with your CC account) to automate file transfers between any two endpoints
● fast, reliable, and secure
● runs in the background: initialize a transfer and close the browser; it'll email you the status
● uses the GridFTP transfer protocol: much better performance than scp or rsync
● automatically restarts interrupted transfers, retries failures, checks file integrity, and handles recovery from faults


Page 24

Installed compilers

 | Intel | GNU | PGI
module | intel/2016.4 and openmpi/2.1.1 (loaded by default) | module load gcc/5.4.0 (*) | module load pgi/17.3 (*)
C | icc / mpicc | gcc -O2 / mpicc | pgcc / mpicc
Fortran 90 | ifort / mpifort | gfortran -O2 / mpifort | pgfortran / mpifort
C++ | icpc / mpiCC | g++ -O2 / mpiCC | pgc++ / mpiCC
OpenMP flag | -qopenmp | -fopenmp | -mp

(*) In both cases intel/2016.4 will be unloaded and openmpi/2.0.2 reloaded automatically.

The mpiXX scripts invoke the right compiler and link your code to the correct MPI library.
Use mpiXX --show to view the commands they use to compile and link.
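For example, a short sketch of building the OpenMP and MPI examples from the later slides with the GNU toolchain (source file names taken from those slides; yours will differ):

$ module load gcc/5.4.0                  # (*) intel/2016.4 unloaded, openmpi/2.0.2 reloaded automatically
$ gcc -O2 -fopenmp sharedPi.c -o openmp  # OpenMP build with the GNU flag from the table
$ mpicc -O2 distributedPi.c -o mpi       # MPI build through the wrapper script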


Page 25

Scheduler: submitting serial jobs

$ icc pi.c -o serial
$ sbatch [other flags] job_serial.sh
$ squeue -u username [-t RUNNING] [-t PENDING]       # list all current jobs
$ sacct -j jobID [--format=JobID,MaxRSS,Elapsed]     # list resources used by a completed job

#!/bin/bash
#SBATCH --time=00:05:00            # walltime in d-hh:mm or hh:mm:ss format
#SBATCH --job-name="quick test"
#SBATCH --mem=100                  # 100M
#SBATCH --account=def-razoumov-ac
./serial

It is good practice to put all flags into the job script (and not on the command line).

You can also specify a number of other flags (more on these later).


Page 26

Scheduler: submitting array jobs

Job arrays are a handy tool for submitting many serial jobs that have the same executable and might differ only by the input they receive through a file.

Job arrays are preferred because they don't require as much computation by the scheduling system, since they are evaluated as a group instead of individually.

In the example below we want to run the executable "myprogram" 30 times; it requires an input file, and these files are called input1.dat, input2.dat, ..., input30.dat, respectively.

$ sbatch job_array.sh [other flags]

#!/bin/bash
#SBATCH --array=1-30               # 30 jobs
#SBATCH --job-name=myprog          # single job name for the array
#SBATCH --time=02:00:00            # maximum walltime per job
#SBATCH --mem=100                  # maximum 100M per job
#SBATCH --account=def-razoumov-ac
#SBATCH --output=myprog%A%a.out    # standard output
#SBATCH --error=myprog%A%a.err     # standard error
# in the previous two lines "%A" is replaced by the jobID and "%a" by the array index
./myprogram input$SLURM_ARRAY_TASK_ID.dat


Page 27

Scheduler: submitting OpenMP or threaded jobs

$ icc -qopenmp sharedPi.c -o openmp
$ sbatch job_openmp.sh [other flags]
$ squeue -u username [-t RUNNING] [-t PENDING]       # list all current jobs
$ sacct -j jobID [--format=JobID,MaxRSS,Elapsed]     # list resources used by a completed job

#!/bin/bash
#SBATCH --cpus-per-task=4          # number of cores
#SBATCH --time=0-00:05             # walltime in d-hh:mm or hh:mm:ss format
#SBATCH --mem=100                  # 100M for the whole job (all threads)
#SBATCH --account=def-razoumov-ac
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # passed to the program
echo running on $SLURM_CPUS_PER_TASK cores
./openmp

Did you get any speedup running this calculation on four cores?


Page 28

Scheduler: submitting MPI jobs

$ mpicc distributedPi.c -o mpi
$ sbatch job_mpi.sh [other flags]
$ squeue -u username [-t RUNNING] [-t PENDING]       # list all current jobs
$ sacct -j jobID [--format=JobID,MaxRSS,Elapsed]     # list resources used by a completed job

#!/bin/bash
#SBATCH --ntasks=4                 # number of MPI processes
#SBATCH --time=0-00:05             # walltime in d-hh:mm or hh:mm:ss format
#SBATCH --mem-per-cpu=100          # in MB
#SBATCH --account=def-razoumov-ac
srun ./mpi                         # or mpirun

Did you get any speedup running this calculation on four processors?What is the code’s parallel efficiency? Why is it not 100%?


Page 29

Scheduler: submitting GPU jobs

#!/bin/bash
#SBATCH --nodes=3                  # number of nodes
#SBATCH --gres=gpu:1               # GPUs per node
#SBATCH --mem=4000M                # memory per node
#SBATCH --time=0-05:00             # walltime in d-hh:mm or hh:mm:ss format
#SBATCH --output=%N-%j.out         # %N for node name, %j for jobID
#SBATCH --account=def-razoumov-ac
srun ./gpu_program

Do not run this script: only an example


Page 30

Scheduler: interactive jobs

$ salloc --time=1:0:0 --ntasks=2 --account=def-razoumov-ac   # 2-core interactive job
$ echo $SLURM_...    # check out Slurm environment variables
$ srun ./mpi         # run an MPI code; could also use mpirun/mpiexec
$ exit               # terminate the job

Interactive jobs should automatically go to one of the Slurm interactive partitions:

$ sinfo -a | grep interac

Interactive jobs are useful for debugging or for any other interactive work, e.g. GUI visualization.
● interactive CPU-based ParaView client-server visualization on Cedar and Graham: https://docs.computecanada.ca/wiki/Visualization

Make sure to run the job only on the processors assigned to your job - this will happen automatically if you use srun.


Page 31

Slurm script flags and environment variables

Slurm script flags specify job parameters:
● --nodes=..., --ntasks=..., --time=..., --mem-per-cpu=..., --cpus-per-task=..., --output=..., --error=...
● use them inside a job script (with #SBATCH) or on the command line (after sbatch)
● we already saw many examples in the previous slides
● for the full list of flags, type man sbatch

Slurm environment variables are available inside running jobs:
● SLURM_JOB_ID, SLURM_NNODES, SLURM_NTASKS, SLURM_MEM_PER_CPU, SLURM_JOB_NODELIST, ...
● you can start an interactive job, type echo $SLURM and then hit Tab to see all the variables defined inside your job
● they are often useful inside job scripts to pass parameters to your program at runtime or to print out job information (see the sketch below)
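A minimal sketch of using these variables inside a job script (the account name and the ./mpi executable are reused from the earlier examples; passing the task count as an argument is just an illustration):

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --time=0-00:10
#SBATCH --mem-per-cpu=100
#SBATCH --account=def-razoumov-ac
# print job information using Slurm environment variables
echo "job $SLURM_JOB_ID: $SLURM_NTASKS tasks on nodes $SLURM_JOB_NODELIST"
srun ./mpi $SLURM_NTASKS    # pass the task count to the (placeholder) program at runtime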


Page 32

Slurm jobs and memory

It is very important to specify memory correctly!

If you don't ask for enough and your job uses more, your job will be killed.

If you ask for too much, it will take much longer to schedule the job, and you will be wasting resources.

If you ask for more memory than is available on the cluster, your job will never run; the scheduling system will not stop you from submitting such a job or even warn you.
● always ask for slightly less than the total memory on a node, as some memory is used by the OS; your job will not start until enough memory is available

You can use either #SBATCH --mem=4000 or #SBATCH --mem-per-cpu=2000.

What’s the best way to find your code’s memory usage?


Page 33

Slurm jobs and memory (cont.)

Second-best way: use a Slurm command to estimate your completed code's memory usage:

$ sacct -j jobID [--format=JobID,MaxRSS,Elapsed]    # list resources used by a completed job

Use the measured value with a bit of a cushion, maybe 15-20%.

Be aware of the discrete polling nature of Slurm's measurements:
● sampling at equal time intervals might not always catch spikes in memory usage
● sometimes you'll see that your running process was killed by the Linux kernel (via the kernel's cgroups, https://en.wikipedia.org/wiki/Cgroups) because it exceeded its memory limit, yet Slurm did not poll the process at the right time to see the spike that caused the kill, and so reports lower memory usage
● sometimes the memory field in the sacct output will be empty, because Slurm did not have time to poll the job (the job ran too fast)


Page 34

Getting information about your job

$ squeue -u username [-t RUNNING] [-t PENDING]   # list all current jobs
$ squeue -p partitionName                        # list all jobs in a partition
$ sinfo                                          # view information about Slurm partitions
$ sacct -j jobID --format=JobID,MaxRSS,Elapsed   # resources used by a completed job
$ sacct -u username --format=JobID,JobName,AveCPU,MaxRSS,MaxVMSize,Elapsed
$ scontrol show job jobID                        # produce a very detailed report for the job
$ sprio [-j jobID1,jobID2] [-u username]         # list job priority information
$ sshare                                         # show usage info for user
$ sinfo --states=idle                            # show idle node(s) on the cluster
$ scancel [-t PENDING] [-u username] [jobID]     # kill/cancel jobs

Job states: R = running, PD = pending, CG = completing right now, F = failed


Page 35

Best practices: computing

Production runs: only on compute nodes via the scheduler.
● do not run anything intensive on login nodes or directly on compute nodes

Only request the resources (memory, running time) you need.
● with a bit of a cushion, maybe 115-120% of the measured values
● use the Slurm sacct command to estimate your completed code's memory usage

For faster turnaround, request whole nodes (--nodes=...) and short run-times (--time=...).
● these can run in the "entire cluster" partitions, as opposed to the smaller partitions for longer and "by-core" jobs

Do not run unoptimized codes (use compilation flags -O2 or -O3 as needed).

Be smart in your choice of programming language, and use precompiled libraries.


Page 36

Best practices: file systems

Filesystems in CC are a shared resource and should be used responsibly.

Do not store millions of small files (see the archiving sketch below).
● organize your code's output
● use tar, or even better dar (http://dar.linux.free.fr - supports indexing, differential archives, encryption)

Do not store large data as ASCII (anything bigger than a few MB): it is a waste of disk space and bandwidth.
● use a binary format
● use scientific data formats (NetCDF, HDF5, etc.): portability, headers, binary, compression, parallel
● compress your files

Use the right filesystem.
Learn and use parallel I/O.
If searching inside a file, you might want to read it first.
Regularly clean up your data in /scratch and /project, and possibly archive it elsewhere.
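A minimal sketch of bundling a directory of small output files before archiving it (directory and archive names are placeholders):

$ tar -czvf run_output.tar.gz run_output/    # pack many small files into one compressed archive
$ tar -tzf run_output.tar.gz | head          # verify the archive contents
$ rm -r run_output/                          # then remove the originals from /scratch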


Page 37

Documentation and getting help

All documentation is at https://docs.computecanada.ca/wiki

Getting-started videos: http://bit.ly/2sxGO33 (CC YouTube channel)

Legacy systems are documented at https://www.westgrid.ca

Email support (goes to the ticketing system):
● [email protected] for most questions, or if not sure
● [email protected] for questions about Globus file transfer
● [email protected] for questions about CC accounts
● [email protected] for questions about visualization

Try to include your full name, CC username, institution, and the cluster name, and copy and paste as much detail as you can (error messages, jobID, job script, software version).

Please get to know your local support:
● difficult problems are best dealt with face-to-face
● might be best for new users


Page 38

Upcoming Saskatchewan HPC summer school

Program and registration at http://bit.ly/usaskss

July 24-27 at the University of Saskatchewan

Session levels: beginner, intermediate, expert.

Domain sessions
● bioinformatics
● materials science

Parallel programming
● parallel programming in Chapel
● GPU programming with CUDA

Scientific computing
● introduction to HPC
● scientific computing with PETSc
● scientific visualization
● Globus and research data management
● using GPUs via high-level languages (Python, Matlab) and libraries

Short break in August, more training sessions in the fall.
Planning to run ~2 summer schools each year.


Page 39

Questions?

Webstream viewers: email [email protected]

Vidyo viewers: unmute & ask question or use Vidyo Chat(chat bubble icon in Vidyo menu)
