Guillimin HPC Users Meeting December 15, 2016 guillimin ...

Guillimin HPC Users Meeting - December 2016 Guillimin HPC Users Meeting December 15, 2016 [email protected] McGill University / Calcul Québec / Compute Canada Montréal, QC Canada


Guillimin HPC Users Meeting - December 2016

Guillimin HPC Users Meeting
December 15, 2016

[email protected]

McGill University / Calcul Québec / Compute Canada
Montréal, QC, Canada


• Please be kind to your fellow user meeting attendees
• Please limit yourself to two slices of pizza per person to start
• And please recycle your pop cans
• Thank you!


Outline

• Compute Canada News
• System Status
• Software Updates
• Training News
• Special Topic
  • Singularity as an alternative to Docker for HPC systems


Compute Canada News

• Cedar (GP2) and Graham (GP3) specifications:
  • https://docs.computecanada.ca/wiki/Migration2016:New_Systems
• NDC: National Data Cyberinfrastructure (storage)
  • Cedar-Compute + Cedar-GPU + NDC-SFU
  • Graham-Compute + Graham-GPU + NDC-Waterloo
• 2017 Resource Allocation Competitions
  • RAC (RPP, Fast Tracks and RRG) reviews are under way


Storage and Infiniband Status

• GPFS file system related downtimes:
  • Friday, November 11 - fixed over the following week
  • Friday, November 25 - fixed over the weekend
  • Monday, November 28 - quick recovery on Nov. 29
  • Tuesday, November 29 - recovery on Nov. 30 evening
  • Early December - intermittent slowness, with very quick recovery due to active and sustained monitoring


Storage and Infiniband Status

• Contributing factors include:
  • Hardware issues: faulty Infiniband network cables and switch modules
  • Ethernet core switch downtime
  • GPFS software: long waiters causing nodes to be expelled (temporarily losing access to GPFS), and well-intentioned monitoring scripts that put pressure on the system
• Fixes and remedies applied:
  • Reseated and replaced faulty network cables
  • Made the system more resilient: no more local DNS lookups via Ethernet, and scripts fixed, so that failures stay localized and do not spread to the whole system
• Guillimin's core elements are nearly 6 years old


Storage Status

• Space management
  • /gs is nearly full: 97% used, 124 TB free (as of Dec. 15)
• For better space management, we continue to migrate cold data from disk to tape
  • Metadata remains on disk
  • Users can still access their files through the usual methods, but with increased latency
• Storage space is a precious resource - manage it wisely!
  • Delete temporary files, compress large files that are not frequently accessed, tar many smaller files into collections, …
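
The housekeeping advice in the last bullet can be sketched in shell as follows. This is only an illustration: the directory layout and file names (run01, job.tmp, big_output.dat, small/) are invented, and the demo works in a throwaway directory under /tmp rather than on /gs. Adapt the patterns and double-check them before pointing anything like this at real data.

```shell
#!/bin/sh
# Sketch only: hypothetical file names, demo data created in a temp directory.
set -eu

demo="${TMPDIR:-/tmp}/gs_cleanup_demo"
rm -rf "$demo"
mkdir -p "$demo/run01/small"

# Create demo data: 20 small files, one temporary file, one larger output file.
i=1
while [ "$i" -le 20 ]; do
    echo "sample $i" > "$demo/run01/small/part_$i.txt"
    i=$((i + 1))
done
echo "scratch data" > "$demo/run01/job.tmp"
awk 'BEGIN { for (n = 1; n <= 10000; n++) print n }' > "$demo/run01/big_output.dat"

# 1. Delete temporary files.
find "$demo/run01" -type f -name '*.tmp' -exec rm -f {} +

# 2. Compress a large file that is rarely accessed (gzip replaces it with a .gz).
gzip "$demo/run01/big_output.dat"

# 3. Bundle many small files into one archive, check that the archive is
#    readable, and only then remove the originals.
tar -C "$demo/run01" -czf "$demo/run01/small.tar.gz" small
tar -tzf "$demo/run01/small.tar.gz" > /dev/null
rm -r "$demo/run01/small"

ls "$demo/run01"
```

After the script runs, the demo directory holds only the compressed output file and the single archive in place of the original 22 files.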


New Software Installations

• Matlab R2016b (for users from McGill only)
• Matlab Distributed Computing Server (MDCS) R2016b
• Singularity/2.2
• Stacks 1.44 (Genomic)
• Intel Advisor/2017_update1 (analyzes vectorization and threading in code)
• LAMMPS/20161117 (Molecular Dynamics)
• OpenFOAM/2.4.0 (CFD)
• R-bundle-Bioconductor/3.3-R-3.3.1 (Bioinformatics)
• OligoArrayAux/3.3 (Genomic)
• Xerces-C++/3.1.4 (XML Parser)


Training News

• All upcoming events: calculquebec.eventbrite.ca
  • ---
• Recently completed:
  • Nov. 23 - Programmation en R intermédiaire [Intermediate R Programming] (U. Montreal)
  • Dec. 1 - Advanced and Parallel Python (McGill U.)
  • Dec. 5 - Easy GPU Programming with OpenACC (U. Montreal)
  • Dec. 6 - Introduction à la programmation en Python [Introduction to Programming in Python] (U. Sherb.)
• All materials from previous workshops are available online: wiki.calculquebec.ca/w/Formations/en
• All user meeting presentations are online at www.hpc.mcgill.ca


Other News

• Support level activity during the holiday period
  • December 23 to January 2 inclusive - returning January 3
  • Reduced level of access to general user support
  • All systems and services remain available and will be closely monitored
  • Priority and critical issues will be addressed

Happy Holidays! Joyeuses Fêtes!


User Feedback and Discussion

• Questions? Comments?
• We value your feedback. Contact us at: [email protected]
• Guillimin Operational News for Users
  – Status pages
    • http://www.hpc.mcgill.ca/index.php/guillimin-status
    • http://serveurscq.computecanada.ca (all CQ systems)
  – Follow us on Twitter
    • http://twitter.com/McGillHPC


Singularity as an alternative to Docker for HPC systems
December 15, 2016

[email protected]

McGill University / Calcul Québec / Compute Canada
Montréal, QC, Canada


Outline

• Overview of virtual machines and containers
• What is Singularity?
  • Container solution
  • Project lead: Gregory M. Kurtzer, LBNL
    • (figures in this slide deck are his)
• Why Singularity?
  • Mobility of Compute
  • Reproducibility
  • User Freedom
  • Supports traditional HPC
  • Able to run a newer software stack (that is, a whole Linux distribution except the kernel itself) on an older OS with minimal effort - or vice versa


Emulators, Virtual Machines, Containers

• Emulators
  • The whole machine, including the CPU, is emulated.
  • Examples: Bochs, QEMU (without KVM), OpenMSX
• Virtual Machines
  • Most of the machine is emulated, but CPU code runs mostly natively. There is a guest OS kernel.
  • Examples: VirtualBox, QEMU with KVM, VMware
• Containers
  • Code in a container interfaces directly with the host OS kernel. There is no guest OS kernel.
  • Examples: Docker, Singularity


Virtual Machines


Docker-style Containers


Docker vs Singularity


Docker vs. Singularity

• Docker
  • Runs with a daemon that orchestrates everything
  • Primary use case: network service virtualization
  • More lightweight than VMs
  • But… Docker tries to emulate VMs in many respects:
    • Network isolation and other hardware isolation (using namespaces and cgroups)
    • Virtualized but still dangerous “root” account inside the container
    • “udocker” on Guillimin removes “root”, but is still fairly isolated and relatively heavyweight
• Singularity
  • No daemon, only a launcher; the container runs as normal user-owned processes
  • Only namespaces are virtualized: the file system and, optionally (not by default), PIDs; no cgroups
  • So containers see the host network, Infiniband, GPUs, etc.
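
One way to see these points in practice is to compare what a process reports on the host and inside a container. The following is a minimal sketch: the image name centos7-ompi.img is the one used in the workflow slides, the IMG variable is introduced here purely for illustration, and the container checks are skipped if Singularity or the image is not available.

```shell
#!/bin/sh
# Sketch: compare the host view with the container view.
# IMG is an illustrative variable; the default image name comes from the slides.
img="${IMG:-centos7-ompi.img}"

host_kernel=$(uname -r)
host_uid=$(id -u)
echo "host:      kernel=$host_kernel uid=$host_uid"

if command -v singularity >/dev/null 2>&1 && [ -f "$img" ]; then
    # No guest kernel: uname inside the container reports the host kernel.
    # No virtual root: the process runs with the invoking user's uid.
    echo "container: kernel=$(singularity exec "$img" uname -r) uid=$(singularity exec "$img" id -u)"
    # The userland, however, comes from the image (for example CentOS 7).
    singularity exec "$img" cat /etc/os-release
else
    echo "singularity or $img not found; skipping container checks"
fi
```

On a system with Singularity and the image in place, the kernel and uid lines are expected to match between host and container, while the OS release information reflects the distribution inside the image.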


Singularity workflow


Singularity workflow

On a system with root access (Linux laptop, Linux VM on Windows/Mac), create the image:

    sudo singularity create --size 1024 centos7-ompi.img

Followed by bootstrapping, for example:

    sudo singularity bootstrap centos7-ompi.img centos7-ompi.def

Or importing, for example:

    sudo singularity import tensorflow.img \
        docker://tensorflow/tensorflow:latest

See if it works:

    singularity shell centos7-ompi.img
    ls
    exit
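
The bootstrap step reads a definition file such as centos7-ompi.def, whose contents the slides do not show. The sketch below writes a plausible one, modelled on the CentOS example definition distributed with Singularity 2.x. The mirror URL, the package list, and the %runscript text are all assumptions and should be checked against the Singularity documentation for the installed version.

```shell
#!/bin/sh
# Sketch: write a hypothetical centos7-ompi.def.
# Every value below is an assumption; the original definition file is not shown.
cat > centos7-ompi.def <<'EOF'
BootStrap: yum
OSVersion: 7
MirrorURL: http://mirror.centos.org/centos-%{OSVERSION}/%{OSVERSION}/os/$basearch/
Include: yum

%runscript
    echo "CentOS 7 container with Open MPI"

%post
    # Commands executed inside the new image during bootstrap.
    yum -y install openmpi openmpi-devel
EOF

echo "wrote centos7-ompi.def"
```

The resulting file would then be passed to the bootstrap command exactly as on the slide. Note that this sketch only installs Open MPI packages; a test program such as /usr/bin/mpi-ring would still have to be built and installed, for instance from the %post section.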


Singularity workflow

Next, copy the container to Guillimin:

    scp centos7-ompi.img [email protected]:

And log in to Guillimin:

    ssh [email protected]

Load the Singularity module:

    module load Singularity/2.2

Run a shell; $SCRATCH or other folders can be bound with -B:

    singularity shell centos7-ompi.img
    singularity shell -B $SCRATCH centos7-ompi.img

Run mpirun inside the container:

    singularity exec centos7-ompi.img mpirun -n 2 /usr/bin/mpi-ring

Or run mpirun outside the container:

    module load iomkl/2015b
    mpirun -n 2 singularity exec centos7-ompi.img /usr/bin/mpi-ring
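
The difference between the two MPI launch modes can be made explicit with a small wrapper. This is a sketch under assumptions: IMG and NP are illustrative variables that do not appear in the slides, and the script only prints the commands (a dry run) unless Singularity, a host mpirun, and the image are all present.

```shell
#!/bin/sh
# Sketch: the two MPI launch modes, with a dry-run fallback.
# IMG and NP are illustrative variables, not part of the original slides.
img="${IMG:-centos7-ompi.img}"
prog=/usr/bin/mpi-ring
np="${NP:-2}"

# Mode 1: a single container is started, and the mpirun found inside the
# image launches the ranks within it.
inside="singularity exec $img mpirun -n $np $prog"

# Mode 2: the host mpirun (for example from the iomkl/2015b module) launches
# the ranks, and each rank starts its own container instance.
outside="mpirun -n $np singularity exec $img $prog"

for cmd in "$inside" "$outside"; do
    if command -v singularity >/dev/null 2>&1 \
        && command -v mpirun >/dev/null 2>&1 \
        && [ -f "$img" ]; then
        # Unquoted on purpose so the string splits into command and arguments;
        # this assumes the image path contains no spaces.
        $cmd
    else
        echo "dry run: $cmd"
    fi
done
```

In the second mode the Open MPI on the host and the one inside the image typically need to be compatible, so it is worth testing with a small program such as mpi-ring before running real jobs.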


OpenMPI with Singularity processes


Singularity workflow

Next, copy the container to Guillimin:

    scp centos7-ompi.img [email protected]:

And log in to Guillimin:

    ssh [email protected]

Load the Singularity module:

    module load Singularity/2.2

Run a shell; $SCRATCH or other folders can be bound with -B:

    singularity shell centos7-ompi.img
    singularity shell -B $SCRATCH centos7-ompi.img

Run mpirun inside the container:

    singularity exec centos7-ompi.img mpirun -n 2 /usr/bin/mpi-ring

Or run mpirun outside the container:

    module load iomkl/2015b
    mpirun -n 2 singularity exec centos7-ompi.img /usr/bin/mpi-ring


Singularity use within Guillimin

Early adopter who asked for the installation: NIAK, SIMEXP lab (Dr. Pierre Bellec, Pierre-Olivier Quirion)
http://niak.simexp-lab.org/niak_installation.html
