Guillimin HPC Users Meeting
May 12, 2016

guillimin@calculquebec.ca

McGill University / Calcul Québec / Compute Canada
Montréal, QC, Canada


Outline

• Compute Canada News
• System Status
• Software Updates
• Training News
• Special Topic
  • How to Build Your Own Modules


Compute Canada News

• HPCS 2016 - Edmonton - June 19th to 22nd
  – Registration open: http://canheit-hpcs.ualberta.ca/


System Status

• Completed: Scheduled Downtime for Maintenance
  – From: Friday April 22 at 8:00 AM
  – Target for return to service: evening Saturday April 23
  – Maintenance:
    • On ETS campus electrical network
    • On Guillimin network (network switch firmware updates)
  – System Access Restoration
    • Login Nodes: April 24
    • Storage: April 24 for /sb and /lb, but not /gs (*)
    • Batch System: April 25 with temporary scratch
    • Switch firmware updates have improved stability of network access to various services within our Virtual Machine (VM) hosting environment


Storage Status

• Summary: /gs file system issues (April 14 to May 6)
  – The 3 PB /gs file system is built from 5 individual storage building blocks
  – Storage building blocks provide redundancy at the hardware and software layers so as to handle failures, but it is not possible to protect against 100% of failures
  – Starting April 14:
    • A set of hardware issues triggered an I/O storm that exposed several bugs within the software layer of GPFS
    • This resulted in potential metadata corruption and I/O errors when attempting to access many files on /gs
    • The extent of the damage to metadata and the ability to repair the corruption were initially unknown
    • Although metadata is replicated, there was a risk of data loss


Storage Status

• Timeline of Events (April 14 to May 6)
  – April 14: storm of I/O errors within one building block
  – April 17: hardware I/O errors caused corruption of a portion of /gs metadata; parts of files are unavailable for read and/or write access
  – April 22: Site Maintenance Period begins
  – April 24: /gs kept offline pending root cause analysis and availability of software patches (efixes) that would correctly handle the I/O errors and attempt to fix the metadata
  – May 2: "surgery" to manually repair metadata and apply the efixes - successful!
  – May 3 - 6: stress testing reveals problematic disk drawers and a module, which are subsequently replaced
  – May 6: /gs file system back online


Storage Status

• Current status of /gs (as of May 12)
  – Since May 6, all system elements have been behaving well and as expected.
  – Metadata successfully restored, with no data loss.
  – Monday May 9, 10:00 to 10:06 AM
    • New storm of I/O errors observed in 1 of 5 building blocks
    • GPFS hardware and updated firmware and software cleanly handled these errors
    • Root cause analysis traces the problem source to a faulty server HCA (host channel adapter) card
    • HCA will be replaced in the next few days with no interruption of access to /gs


Storage Status

• Reminder of Storage Policies and Practices
  – All storage is operated with redundancy at the hardware and software layers - but not 100% guaranteed!
  – Guillimin Storage File Systems
    • Home: small files, codes, backed-up nightly
    • Scratch: temporary I/O files, no backup
    • Project: any project files either temporary or longer term, no backup by default without RAC award
  – To mitigate against loss of important data we continue to recommend that, whenever possible, copies of important data are kept off-site


Storage Status

• Reminder about temporary scratch space on /sb:
  – The cleanup of /sb/scratch spaces will start on May 18
  – Make sure to move all necessary data to $SCRATCH (/gs/scratch/$USER) or to a project space
  – Also, make sure your job scripts are using $SCRATCH instead of "/sb/scratch/$USER"
• Partition /gs is almost full:
  – 96% used
  – 129 TB left (as of May 11)
  – We are currently moving some cold data to tapes
    • Metadata remains on disks
    • Users can still access their files, but with a significant latency
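One way to check job scripts before the /sb/scratch cleanup is to search them for the old path. The sketch below is only an illustration: the helper name list_old_scratch_refs and the idea of keeping job scripts under a single directory are assumptions, not part of the Guillimin setup.

```shell
#!/bin/bash
# list_old_scratch_refs DIR: print the files under DIR that still mention
# /sb/scratch. DIR (for example a directory holding your job scripts) is an
# assumption made for this sketch; it defaults to the current directory.
list_old_scratch_refs() {
    local dir="${1:-.}"
    # -r: recurse into subdirectories, -l: print matching file names only
    grep -rl -- '/sb/scratch' "$dir"
}
```

Each file it reports would then need its "/sb/scratch/$USER" references replaced by $SCRATCH.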


Software Update

• About the Lmod/EasyBuild based module structure:
  – Default since March 22, 2016
  – New Lua syntax (presented in special topic)
• New Installations
  – Total Academic Headcount (TAH) MATLAB license file (for McGill users only): R201{2a, 4a, 6a}
  – MATLAB Distributed Computing Server (MDCS) license file (for all users, 64 licenses): R201{2, 3, 4, 5}{a, b}
  – GraphicsMagick/1.3.21 (intel/2015b, iomkl/2015b)
  – argtable/2.13 (foss/2015b, iomkl/2015b)
  – VTK/6.3.0-Python-2.7.10 (iomkl/2015b)


Training News

• See "Training and Outreach" at www.hpc.mcgill.ca for our calendar of training and workshops for 2016 and for links to registration pages
• Upcoming events: calculquebec.eventbrite.ca
  • May 17 - Introduction au nuage de Calcul Canada (U. de Sherbrooke, online training)
  • May 24 - Assemblée générale de Calcul Québec
  • May 26 - Data Intensive Computing
  • June - Suggestions for training? Please let us know!
• All materials from previous workshops are available online: https://wiki.calculquebec.ca/w/Formations/en
• Recently completed:
  • April 28 - Advanced OpenMP
  • April 21 - Introduction to OpenMP


User Feedback and Discussion

• Questions? Comments?
• We value your feedback. Contact us at: guillimin@calculquebec.ca
• Guillimin Operational News for Users
  – Status Pages
    • http://www.hpc.mcgill.ca/index.php/guillimin-status
    • http://serveurscq.computecanada.ca (all CQ systems)
  – Follow us on Twitter
    • http://twitter.com/McGillHPC


How to Build Your Own Modules


How Do You Configure Your Environment?

• If you want to run a specific version of a software package:
    export PATH=$HOME/soft-1.2.3/bin:$PATH
    which soft
• Most applications have library dependencies:
    export LD_LIBRARY_PATH=$HOME/soft-1.2.3/lib:$LD_LIBRARY_PATH
    ldd $(which soft)
• Some applications have manual pages:
    export MANPATH=$HOME/soft-1.2.3/share/man:$MANPATH
    man soft
• For your own code, you have to set:
  • For interpreters: PYTHONPATH, PERL5LIB
  • For C/C++, Fortran: CPATH, FPATH, LIBRARY_PATH


How Do You Configure Your Environment?

How do you set these environment variables?
• From ~/.bashrc?
  • This file is a script that Bash runs when an interactive shell starts, so it is quite convenient
  • When using a different version of a software package, there is a risk of conflicts (e.g., a binary with the wrong libraries)
• From a source file?
  • Very flexible: it can run any low-level Bash command. A source file can load another source file. The name of a source file can describe the environment
  • How to maintain multiple versions? How to load only the minimum needed? How to implement requirements? How to remove values in variables and eliminate conflicts?
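The "source file" approach can be sketched as follows. The file name soft-1.2.3.sh and the helper prepend_path are hypothetical, and $HOME/soft-1.2.3 is simply the example prefix used in these slides. The sketch avoids duplicate entries, but it offers no way to undo its changes, which is one of the questions raised above.

```shell
#!/bin/bash
# soft-1.2.3.sh: hypothetical source file; use it with `source soft-1.2.3.sh`.
# Assumes the software was installed under $HOME/soft-1.2.3 (example prefix).

SOFT_ROOT="$HOME/soft-1.2.3"

# prepend_path VAR DIR: put DIR in front of the colon-separated variable VAR,
# unless DIR is already one of its entries.
prepend_path() {
    local var="$1" dir="$2"
    local current="${!var}"            # indirect expansion: value of $VAR
    case ":$current:" in
        *":$dir:"*) ;;                 # already present, nothing to do
        *) export "$var=$dir${current:+:$current}" ;;
    esac
}

prepend_path PATH            "$SOFT_ROOT/bin"
prepend_path LD_LIBRARY_PATH "$SOFT_ROOT/lib"
```

Sourcing the file twice leaves the variables unchanged the second time, which avoids the ever-growing PATH that repeated `export PATH=...:$PATH` lines produce.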


What Lmod Modules Can Do

Modules modify environment variables. They:
• Set or prepend values to environment variables
  • When unloading a module, the corresponding values are automatically removed from the variable
• Load required modules (or dependencies)
  • Example: a specific OpenMPI module can load the corresponding compiler module
• Prevent conflicts between similar modules
  • Conflicts between themselves and another module, or between two dependencies
• Load new sets of modules
• Provide a description of themselves and a "help" text
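What "automatically removed" means for a path-like variable can be illustrated in plain Bash. This is only a sketch of the idea, not how Lmod is implemented; remove_path and DEMO_PATH are hypothetical names.

```shell
#!/bin/bash
# remove_path VAR DIR: drop every entry equal to DIR from the
# colon-separated variable VAR.
remove_path() {
    local var="$1" dir="$2" entry result=""
    local IFS=':'                      # split the value on colons only
    for entry in ${!var}; do
        [ "$entry" = "$dir" ] && continue
        result="${result:+$result:}$entry"
    done
    export "$var=$result"
}

# Emulate "load" then "unload" on a demo variable.
DEMO_PATH="/usr/bin:/bin"
DEMO_PATH="$HOME/soft-1.2.3/bin:$DEMO_PATH"     # what loading would prepend
remove_path DEMO_PATH "$HOME/soft-1.2.3/bin"    # what unloading would undo
echo "$DEMO_PATH"
```

With a module system this bookkeeping is done for you, for every variable a module touches.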


Default Sets of Modules


Where Do Modules Come From?


The Content and Syntax of a Lua File

    help([[Toolkit ... Intel C/C++ and Fortran compilers, Intel MKL & OpenMPI. - ...]])
    whatis([[Name: iomkl]])
    whatis([[Version: 2015b]])
    whatis([[Description: ... ]])
    add_property("type_","recommended")
    conflict("iomkl")
    load("icc/2015.3.187-GNU-4.9.3-2.25")
    load("ifort/2015.3.187-GNU-4.9.3-2.25")
    load("OpenMPI/1.8.8")
    load("iompi/2015b")
    load("imkl/11.2.3.187")

    local root = "/software/CentOS-6/eb/software/Core/iomkl/2015b"
    prepend_path("MODULEPATH", "/software/CentOS-6/eb/modules/all/Toolchain/iomkl/2015b")
    setenv("EBROOTIOMKL", root)
    setenv("EBVERSIONIOMKL", "2015b")
    setenv("EBDEVELIOMKL", pathJoin(root, "easybuild/Core-iomkl-2015b-easybuild-devel"))

• This could be saved in:
  • $HOME/modulefiles/name/1.0.1.lua
  • /gs/project/abc-123-aa/software/modulefiles/name/1.0.1.lua
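As a minimal sketch of a personal modulefile, the script below writes a Lua file for the hypothetical "soft" 1.2.3 package used as an example earlier in these slides. The module name, version, help text and the MODULE_DIR variable are all assumptions made for illustration; only functions that appear in the Lua example above or are standard Lmod modulefile functions (help, whatis, conflict, prepend_path, pathJoin) are used.

```shell
#!/bin/bash
# Sketch: write a minimal Lua modulefile for the hypothetical "soft" 1.2.3
# installed in $HOME/soft-1.2.3 (the example prefix used earlier).
# MODULE_DIR is an assumption; override it to write somewhere else.
MODULE_DIR="${MODULE_DIR:-$HOME/modulefiles}"

mkdir -p "$MODULE_DIR/soft"

# Unquoted heredoc: $HOME is expanded now, so the file holds an absolute path.
cat > "$MODULE_DIR/soft/1.2.3.lua" <<EOF
help([[Hypothetical example: soft 1.2.3 installed in a home directory]])
whatis([[Name: soft]])
whatis([[Version: 1.2.3]])
conflict("soft")

local root = "$HOME/soft-1.2.3"
prepend_path("PATH", pathJoin(root, "bin"))
prepend_path("LD_LIBRARY_PATH", pathJoin(root, "lib"))
prepend_path("MANPATH", pathJoin(root, "share/man"))
EOF

echo "Wrote $MODULE_DIR/soft/1.2.3.lua"
```

After `module use $HOME/modulefiles`, the module should be loadable as soft/1.2.3, assuming the directory layout above.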


The Content and Syntax of Older Tcl Files

https://wiki.calculquebec.ca/w/Cr%C3%A9er_un_module/en

    #%Module1.0
    ###############################################################
    ## OPENMPI MPI lib
    proc ModulesHelp { } {
        puts stderr "\tAdds the OpenMPI library. "
    }
    module-whatis "(Category_______) mpi"
    module-whatis "(Name___________) OpenMPI"
    module-whatis "(Version________) 1.6.3"
    conflict mpi
    prereq compilers/intel/2013
    set root /software/MPI/openmpi/1.6.3_intel
    prepend-path PATH $root/bin
    prepend-path LD_LIBRARY_PATH $root/lib
    setenv OMPI_MCA_plm_rsh_num_concurrent 960


For Multiple Project Spaces

• If you create or have access to multiple module repositories:
  • module use /gs/project/abc-123-aa/software/modulefiles
  • module load ...
• Then, to use another set of modules:
  • module unload ... # Or: module purge
  • module unuse /gs/project/abc-123-aa/software/modulefiles
  • module use /gs/project/def-456-aa/software/modulefiles
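The switch between two repositories can be wrapped in a small shell function. This is a sketch only: switch_module_repo is a hypothetical helper, it relies on the `module` command provided by the cluster environment, and it uses `module purge`, which unloads every loaded module rather than only those from the old repository.

```shell
#!/bin/bash
# switch_module_repo OLD NEW: hypothetical helper wrapping the steps above.
# It relies on the `module` command provided by the cluster environment.
switch_module_repo() {
    local old="$1" new="$2"
    module purge            # unload all currently loaded modules
    module unuse "$old"     # stop searching the old repository
    module use "$new"       # start searching the new one
}
```

For example: `switch_module_repo /gs/project/abc-123-aa/software/modulefiles /gs/project/def-456-aa/software/modulefiles`, followed by the `module load` commands needed for the second project.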


About the Lmod Cache

• A module system needs to present modules (with avail or spider commands) that are spread across multiple directories and subdirectories
• Listing all these modules is an expensive task that performs lots of small I/O operations
• Typical file systems for clusters are not optimized for this kind of access
• Solution:
  • Cache all the information in a single database file, and use the cache for avail or spider commands
  • Since each user could have access to different sets of modules, each user has a private cache file


About the Lmod Cache

• Problems of a cached module structure:
  • It has to be renewed periodically. A central cache is updated every time we add a new module to the main tree. Then, each user's cache is updated at the next Bash session
  • Updating a cache takes time, so it must be worth it:
    • If building the module list takes less than $LMOD_SHORT_TIME seconds, no cache is needed. By default, the threshold is 10 seconds
• If you write your own modulefiles, watch out for caching:
  • Force new cache via rm -rf ~/.lmod.d/.cache
  • Disable cache via export LMOD_SHORT_TIME=86400
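While developing modulefiles, the two workarounds above can be kept at hand as shell functions. The function names and the LMOD_CACHE_DIR variable are hypothetical, introduced here only so that the cache path can be overridden; the path and the value 86400 are the ones given in the slide.

```shell
#!/bin/bash
# Sketch of the two workarounds above, for use while developing modulefiles.
# LMOD_CACHE_DIR is a hypothetical variable introduced here so the path can
# be overridden; by default it points at the location named in the slides.
LMOD_CACHE_DIR="${LMOD_CACHE_DIR:-$HOME/.lmod.d/.cache}"

# Remove the private cache so that it is rebuilt on the next lookup.
reset_lmod_cache() {
    rm -rf -- "$LMOD_CACHE_DIR"
}

# Raise the threshold so that scanning the module tree always counts as
# "short" and no private cache is written (value taken from the slides).
disable_lmod_cache() {
    export LMOD_SHORT_TIME=86400
}
```

Calling reset_lmod_cache after adding or editing a modulefile should make the change visible to `module avail` and `module spider`.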