Page 1

High performance computing @ RZ CAU

HPC introduction course

11 April 2018

Page 2

Outline

A. HPC services

B. Available HPC systems

C. Short tour through server rooms

D. Working on the local HPC systems

E. Discussion and questions

Page 3


HPC services

Page 4

Contact

HPC support team

User consulting

Dr. Karsten Balzer
Dr. Simone Knief

System administration

Dr. Cebel Kucukkaraca
Dr. Holger Naundorf
Alfred Wagner

E-mail support (ticket system)

[email protected]

Pages 5–6

High performance computing @ RZ CAU

Provisioning of compute resources of different performance classes

• Local compute resources (at RZ)

• 2 x86-based Linux Clusters (rzcluster and caucluster)

• 1 hybrid NEC HPC system, consisting of
  ◦ NEC x86-based HPC Linux cluster
  ◦ NEC SX-ACE vector system

• Non-local compute resources (at HLRN)

• North-German Supercomputing Alliance (HLRN)
  ◦ = Norddeutscher Verbund zur Förderung des Hoch- und Höchstleistungsrechnens, Hochleistungsrechner Nord
  ◦ Component of the national HPC infrastructure
  ◦ Formed by seven North-German states
  ◦ By agreement, the technology is regularly updated

All of the above compute resources are accessible to all CAU and GEOMAR members!

Page 7

High performance computing @ RZ CAU

HPC user consulting

• We provide user support for the aforementioned HPC systems

• In particular:

• Choice of appropriate computer architecture (x86/vector, RZ/HLRN)

• Assistance in software installation and porting

• Assistance in program and code optimization

• Support on parallelization and vectorization issues

• Support on HLRN project proposals

Page 8


High performance computing

Page 9

What is high performance computing?

High performance computing

• = computer-aided computation

• HPC is not a very well-defined term

• But mainly synonymous with “parallel computing”

• Because typical HPC hardware allows for parallel computations

• Even desktop PCs are nowadays multi-core platforms

• Typical ingredients of an HPC machine

• Many compute nodes with multi-core processors

• Fast communication network between nodes

• Powerful I/O file systems

• Special hardware architecture (e.g., vector processors)

Page 10

Parallel compute systems

Two different forms

• MPP
  • = Massively Parallel Processing

• Typical features:

◦ Many compute nodes; moderate number of cores per node
◦ Nodes cannot share main memory, but are coupled by a fast interconnect
◦ Small to moderately large main memory per node
◦ Moderate clock frequency

• SMP

• = Symmetric Multi-Processing

• Typical features:

◦ Single, isolated compute node; many cores
◦ Processors share main memory
◦ (Extremely) large main memory
◦ High clock frequency

Page 11

Vector compute systems

Features

• Uses processors with a special central processing unit (vector CPU)

• A vector CPU executes instructions that operate on one-dimensional arrays (vectors) instead of on single data items.

• On-chip parallel processing

• ≠ scalar processors, whose instructions operate on single data items and where parallel processing is achieved across a number of processors.

• Suitable for vectorizable code, e.g. matrix-vector operations

Page 12

(Modern) scalar vs. vector architecture

Page 13

Parallel compute systems @ RZ CAU

Starting in 1999

1999–2002: Cray T3E (18 CPUs, 256 MB), 22 GFlops

2002–2006: SGI Altix 3700 (128 CPUs, 4 GB), 664 GFlops

• Since 2005: Continuous extension of the Linux RZ-Cluster (rzcluster)

Page 14

Vector systems @ RZ CAU

Long history of vector system provisioning

1987: Cray X-MP 1/8, 470 MFlops
1994: Cray Y-MP/M90, 666 MFlops
1996: Cray T94, 7.2 GFlops
2002: NEC SX-5, 64 GFlops
2005: NEC SX-8, 640 GFlops
2007: NEC SX-8+SX-8R, 1.2 TFlops
2009: NEC SX-9, 8.2 TFlops
2014: NEC SX-ACE, 65.5 TFlops

Pages 15–16

Hybrid NEC HPC system

Hybrid NEC HPC system

• In operation since 2012:

• NEC x86-based HPC Linux cluster
  • NEC SX-ACE vector system

• Uniform system environment:

• Login nodes
  • Shared, global I/O file systems
  • Single batch system to organize compute jobs

Use different CPU architectures on one system without data transfer!

Page 17

Example: LAPACK routine DGEEV

• DGEEV: Computes eigenvalues and eigenvectors of a general real matrix.

• Single-core performance:

[Figure: LAPACK DGEEV single-core runtime in seconds vs. matrix dimension (500 to 50000), comparing Haswell (x86), Skylake (x86), and SX-ACE (vector). x86: Intel/17.0.4 MKL + AVX; vector: MathKeisan/4.0.3. Clock frequencies: Haswell 3.1 GHz, Skylake 3.5 GHz, SX-ACE 1.0 GHz.]

Page 18


Available HPC systems

Page 19


Local HPC systems

1. “rzcluster” and “caucluster”

2. NEC HPC system

Page 20

“rzcluster”

Hardware

• 1 login node (12 cores, 48 GB main memory)

• 142 batch nodes (1948 cores in total)

• Different processors and processor generations (AMD and Intel)
  • Cores per node: 8–32
  • Main memory per node: 32 GB – 1 TB

• Generally usable resources:

• 40 batch nodes (388 cores in total)
  ◦ 23 nodes with 8 cores and 32 GB each
  ◦ 17 nodes with 12 cores and 48 GB each

System environment

• Operating system: Scientific Linux 6.x (partly CentOS 7.x)

• Batch system: Slurm

Page 21

“caucluster”

Hardware

• 1 login node (8 cores, 64 GB main memory)

• 7 batch nodes (240 cores in total)
  • 4-socket Intel Haswell CPUs
  • Cores per node: 40
  • Main memory per node: 256 GB

System environment

• Operating system: CentOS 7.x

• Batch system: Slurm
  • Job priority is calculated via a fair-share algorithm
  • ≠ FIFO (first-in, first-out)

Page 22

Shared resources on “rzcluster” and “caucluster”

I/O file systems

• 3 different file systems

• $HOME directory
  ◦ 13 TB for all users
  ◦ Regular backup

• $WORK directory
  ◦ 350 TB parallel BeeGFS file system

• TAPE library
  ◦ Accessible from the login nodes

Page 23

NEC HPC system

Hardware

• Hybrid HPC system (470 TFlops peak performance)

• Scalar NEC HPC Linux cluster
  • NEC SX-ACE vector system

• 4 login nodes (32 cores, 768 GB main memory)

• HPC Linux cluster:
  • 180 batch nodes:
    ◦ Intel Xeon Gold 6130 [Skylake-SP] with 32 cores and 192 GB main memory (384 GB on 8 of the nodes)
  • 18 batch nodes:
    ◦ Intel Haswell with 24 cores and 128 GB main memory

• Fast EDR InfiniBand network between nodes

• SX-ACE:
  • 256 SX-ACE batch nodes:
    ◦ With 4 vector cores and 64 GB main memory
    ◦ 256 GB/s memory bandwidth

• Fast node interconnect (IXS crossbar switch)

Page 24

NEC HPC system (cont.)

System environment

• Operating system:
  • HPC Linux cluster/login nodes: Red Hat Enterprise Linux 7.4
  • SX-ACE: Super-UX

• One batch system: NQSII

I/O file systems

• Global file systems for $HOME and $WORK

• $HOME:
  ◦ 64 TB (NEC GxFS)
  ◦ Regular backup

• $WORK:
  ◦ 5 PB (NEC ScaTeFS, particularly suitable for parallel I/O)

• Access to the TAPE library:
  • On all login nodes and via a special batch queue (feque)

Page 25


Non-local HPC systems

3. HLRN-III

Page 26

HLRN

Page 27

HLRN: Organizational structure

Page 28

HLRN: Organizational structure

• Administrative board

• One representative from each member state
  • Decides on subjects of fundamental importance

• Technical commission

• Heads of the participating computing centers
  • Advises the administrative board and operation centers on all technical issues

• Scientific council

• Scientists of different fields from the participating member states
  • Decides on project proposals and the granting of compute resources

• Specialist and local consultants

• Assist and advise on technical, operational, organizational, and professional issues
  • Support HLRN users on computational and scientific issues

Page 29

HLRN: Brief history

• HLRN-I

• Duration: May 2002 to July 2008
  • IBM p690 system (IBM Power4 processors)
  • Peak performance: 5.3 TFlops

• HLRN-II

• Duration: July 2008 to November 2013
  • SGI Altix ICE (Intel Harpertown and Nehalem processors)
  • Peak performance: 329 TFlops

• HLRN-III

• Since September/December 2013

• CRAY XC30/XC40 systems
  • Peak performance: 2.6 PFlops

• HLRN-IV

• Planned start/inauguration in autumn 2018

Page 30

HLRN-III: System overview

• Hannover (Gottfried)

• MPP system:
  ◦ Cray XC30, XC40
  ◦ 1680 nodes: IvyBridge and Haswell CPUs; 40320 cores

• SMP system:
  ◦ 64 nodes: SandyBridge + Nvidia K40; 2300 cores
  ◦ 256/512 GB main memory

• Berlin (Konrad)

• MPP system:
  ◦ Cray XC30, XC40
  ◦ 1672 nodes: IvyBridge and Haswell CPUs; 44928 cores

• Test environment:
  ◦ Cray XC40 Many-Integrated-Core
  ◦ 16 nodes with Xeon Phi & Xeon KNL

Page 31


Access to the HPC systems

Page 32

How to get access?

HPC @ RZ CAU

• Access can be requested at any time

• Use form no. 3 from [1]:

“Request for use of a High-Performance Computer”

• Return the completed form to the RZ user administration

• After registration, login details will be sent in written form to the responsible person (via in-house CAU mail)

• A user account includes subscription to the HPC mailing lists:
  hpc [email protected], [email protected], [email protected]

[1] https://www.rz.uni-kiel.de/en/about-us/terms-and-policies/forms/sign-up

Page 33

How to get access?

HPC @ HLRN

• Utilization possible for all university employees in Schleswig-Holstein, incl. universities of applied sciences

• Access in two steps [1]

i. Application for a test account (called “Schnupperkennung”)

ii. Application for a research project (“Großprojekt”)

[1] https://www.hlrn.de/home/view/Service/ApplicationHowTo

Page 34

Access to HLRN

i. Schnupperkennung

• Apply at any time

• Duration: 3 quarters

• Limited computing time: 2500 NPL (increase to max. 10000 NPL possible)

• NPL = “Norddeutsche Parallelrechner-Leistungseinheit”
  1 MPP node for 1 h: 2 or 2.4 NPL
  1 SMP node for 1 h: 4 or 6 NPL

• Use this account to prepare a research project (Großprojekt)

ii. Großprojekt

• Deadlines for proposal submission: 28 January, 28 April, 28 July, 28 October

• Duration: Total NPL are granted for 1 year and distributed on a quarterly basis

• Proposals will be assessed by the scientific council of HLRN

• Prerequisite: a valid HLRN user account

• Minimum requirements for a project proposal: abstract, project description or status report (for project extensions), requested NPL incl. justification, other resource demands (such as main memory, disk space)

• For more details, see [1]

[1] https://www.hlrn.de/home/view/Service/ProjektBeschreibungHinweise

Page 35


Short tour through server rooms

Page 36


Interactive work and batch processing

Page 37

Access to the HPC systems I

• Access only via the login (front end) nodes

• Only from within the networks of CAU, GEOMAR, and UKSH
  • From outside: first establish a VPN connection to one of the above networks
  • Connection protocol: ssh

• How to establish an ssh connection?

• One needs an ssh client
  • Linux and Mac:
    ◦ ssh clients available in the standard distributions

• Windows:
  ◦ PuTTY: https://www.putty.org/
    Command-line based access, no X11 forwarding by default
  ◦ MobaXterm: https://mobaxterm.mobatek.net
  ◦ X-Win32: https://www.rz.uni-kiel.de/de/angebote/software/x-win32/

Page 38

Access to the HPC systems II

• ssh connections to the login nodes

• rzcluster:

ssh -X username@rzcluster.rz.uni-kiel.de

• caucluster:

ssh -X username@caucluster.rz.uni-kiel.de

• NEC HPC system:

ssh -X username@nesh-fe.rz.uni-kiel.de

• Password change

• rzcluster and caucluster: passwd
  • NEC HPC system: yppasswd
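Optionally, a host entry in the local ~/.ssh/config file shortens these commands; a minimal sketch (the alias "nesh" and the username are placeholders, not part of the course material):

# ~/.ssh/config on the desktop machine
Host nesh
    HostName nesh-fe.rz.uni-kiel.de
    User username
    ForwardX11 yes

# afterwards "ssh nesh" is equivalent to "ssh -X username@nesh-fe.rz.uni-kiel.de"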

Page 39

Data transfer to/from HPC systems

• Linux and Mac

• Use the scp command:

scp file username@nesh-fe.rz.uni-kiel.de:/path-to-destination-dir

• Windows

• Use WinSCP, a GUI-based scp client
  • https://winscp.net
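For whole directories, scp -r works; rsync, if installed on both sides (an assumption, but common on Linux systems), transfers only missing or changed files and can therefore be re-run after an interruption. A sketch with placeholder paths:

# copy a directory recursively
scp -r results/ username@nesh-fe.rz.uni-kiel.de:/path-to-destination-dir

# rsync variant: safe to re-run, only transfers what is missing or changed
rsync -av results/ username@nesh-fe.rz.uni-kiel.de:/path-to-destination-dir/results/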

Page 40

HPC systems - editors

• Available editors:

• nedit
  • vi
  • GNU nano
  • GNU Emacs
  • mc

• Additional note:

• Windows-based editors generally put an extra “carriage return” (also referred to as control-M/^M) character at the end of each line of text.

• This will cause problems for most Linux-based applications:

/bin/bash^M: bad interpreter: no such file or directory

• To fix this, run the dos2unix utility on the affected text file:

dos2unix filename
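If dos2unix should be unavailable, the same cleanup can be done with standard tools; a sketch using GNU sed (in-place edit, "filename" is a placeholder):

# strip the trailing carriage return from every line (same effect as dos2unix)
sed -i 's/\r$//' filename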

Page 41


File systems and data integrity

Page 42

File systems on HPC systems

• Four different file systems available:

• HOME directory

• WORK directory

• Local disk space on compute nodes

• Tape library (magnetic tape robot system)

Page 43

HOME directory

• Directory after ssh login

• Accessible via the environment variable $HOME

• Available disk space (for all users)

• rzcluster/caucluster: 13 TB
  • NEC HPC system: 60 TB

• Mounted on all compute nodes

• Regular backup (daily)

• For important data that needs a backup
  • e.g., software, programs, code, scripts, small amounts of results

• Currently no user quotas

• NOT for the execution of production runs (batch jobs)!
  • ... slow access times

Page 44

WORK directory

• Global parallel file system

• Suitable for the execution of production runs (batch jobs)!

• Available disk space (for all users)

• rzcluster/caucluster: 350 TB (BeeGFS)

• NEC HPC system: 5 PB (ScaTeFS)

• Accessibility:
  • rzcluster/caucluster: Path /work_beegfs/username

• NEC HPC system: Environment variable $WORK

• NO backup!

• User quotas for disk space and inodes

Page 45

WORK directory - user quotas

• rzcluster/caucluster

• Disk space:
  ◦ 1 TB

• Inodes:
  ◦ 250000

• Display quota usage:

beegfs-ctl --getquota --uid username

• NEC HPC system

• Disk space:
  ◦ Soft limit: 1.8 TB (CAU), 4.5 TB (GEOMAR)
  ◦ Hard limit: 2.0 TB (CAU), 5.0 TB (GEOMAR)
• Inodes:
  ◦ Soft limit: 225000
  ◦ Hard limit: 250000

• Display quota usage:

workquota

Page 46

Local disk space

• Disk space directly on compute nodes

• Accessible via the environment variable $TMPDIR

• Only temporarily available within a batch job

• Advantage:
  • Fast access times
  • Faster I/O for read-/write-intensive computations (see the sketch below)
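A minimal sketch of the usual staging pattern inside a batch job (input/output file names are placeholders; $TMPDIR and $WORK as described on these slides):

# stage input to the fast local disk, compute there, copy results back
cp $WORK/input.dat $TMPDIR/
cd $TMPDIR
./program.ex input.dat > output.dat
# $TMPDIR disappears when the job ends, so results must be copied back
cp output.dat $WORK/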

Page 47

Tape library

• Disk space with automatic file relocation onto magnetic tapes

• Accessibility:
  • rzcluster/caucluster: Path /nfs/tape_cache/username

• NEC HPC system: Environment variable $TAPE_CACHE

• Tape cache directory only mounted on the login nodes
  • NEC HPC system: Also available in the special batch queue “feque”

• Use the tape library to store non-active data

• Transfer and store only archived data
  • e.g., tar files (max. 1 TB, recommended: 3-50 TB)
  • Do not store many small files on the tape library

• Slow access times

• NEVER work interactively within the tape cache directory!

• The tape library is not an archive system
  • Only a single copy is present
  • Deleted data cannot be recovered
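A typical tape workflow is therefore: pack results into a single tar file on $WORK and copy it into the tape cache. A sketch with placeholder names, for the NEC system ($TAPE_CACHE as described above):

# pack a result directory into one archive (avoids many small files on tape)
cd $WORK
tar czf results-2017.tar.gz results/

# place the archive in the tape cache; relocation to tape happens automatically
cp results-2017.tar.gz $TAPE_CACHE/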

Page 48

Data integrity

• WORK directory:
  • No backup!

• Tape library:
  • No backup!

• Home directory:
  • Daily backup

• Backup history comprises the last 8 weeks

• For the last 2 weeks, files can be recovered on a daily basis; prior to that, on a weekly basis

• Each user can access their own backup directory
  ◦ rzcluster/caucluster: /mybackup-0
  ◦ NEC HPC system: /nfs/mybackup-0 or /nfs/mybackup-1

• Note:
  • User data is only accessible from an active account
  • The computing center does not provide a long-term data archiving service
  • Check your data stock from time to time

Page 49


Software and program compilation

Pages 50–51

Software environment

• User software
  • Python, Perl, R, Matlab, Gaussian, Turbomole, Quantum Espresso, Gnuplot, Xmgrace, ...

• Compilers
  • GNU compilers
  • Intel compilers
  • Portland compilers (rzcluster and caucluster only)
  • SX cross compilers for using the NEC SX-ACE vector system

• MPI environment
  • Intel MPI (all Linux clusters)
  • SX MPI (NEC SX-ACE vector system)

• Libraries
  • NetCDF, HDF5, FFTW, MKL, PETSc, GSL, Boost, Lapack/Blas, ...

Provisioning via a module environment.

Page 52

Module environment

• Simplified access to installed software, programs, libraries, etc.

• After loading a specific module:
  • Software is usable without specifying the full path
    ◦ $PATH
  • Other required search paths and environment variables are set automatically
    ◦ $LD_LIBRARY_PATH, $MANPATH, ...

• Module commands:

module avail Shows all available modules

module load name Loads the module name and performs all required settings

module list Lists all modules which are currently loaded

module unload name Removes the module name, i.e., resets all corresponding settings

module purge Removes all currently loaded modules (module list becomes empty)

module show name Displays the settings which are performed by the module

Page 53

Module environment - examples

• Matlab:

• For CAU users:

module load matlab2017a
matlab
matlab -nodisplay

• For GEOMAR users:

module load matlab2017a_geomar
matlab -nodisplay

• Program compilation with Intel compilers

module load intel17.0.4
ifort ...
icc ...
icpc ...

module load intel17.0.4 intelmpi17.0.4
mpiifort ...
mpiicc ...
mpiicpc ...
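As a concrete sketch, building an MPI program with these modules (hello.f90 is a hypothetical source file); the resulting binary is then started inside a batch job, as shown on the following pages:

# load compiler and MPI environment, then build the program
module load intel17.0.4 intelmpi17.0.4
mpiifort -O2 -o program.ex hello.f90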

Page 54


Interactive work and batch processing

Page 55

Working on HPC systems

• Interactive work

• On login nodes only - never on compute nodes!
  • Data transfer (desktop PC ←→ HPC systems ←→ tape library)
  • Software provisioning and installation
  • Program development and compilation
  • Small test calculations (monitor CPU, memory consumption, and runtime!)
  • Preparation of batch scripts and batch jobs
  • Submission and control of batch jobs
  • Small pre- and postprocessing

• Batch processing

• No job is started directly from the user's command line
  • Instead: prepare a small script (batch script) that contains all necessary information. This script is then submitted to the batch system
  • Typical information provided in a batch script:
    ◦ Required resources (# cores, main memory, walltime, ...)
    ◦ Program call
  • Different batch systems:
    ◦ rzcluster and caucluster: Slurm
    ◦ NEC HPC system: NQSII

Page 56

The “principle of batch processing” I

[Diagram: interactive work vs. batch processing. The user prepares a job script on the login node and submits/controls jobs from there; the batch server manages workload, resources, and priority and executes the jobs on the batch nodes.]

Page 57

The “principle of batch processing” II

• Batch processing

• The batch server takes the batch script, searches for free, appropriate compute resources, and then executes the actual computation or queues the job.

• Advantages
  ◦ Very efficient use of the available compute resources
    → Larger throughput
    → Fair distribution of compute resources
    → Every user can execute multiple jobs in parallel
  ◦ Availability of the requested compute resources is ensured
• Disadvantages
  ◦ At first glance, an unusual workflow

Page 58

NQSII batch script (NEC HPC Linux cluster)

• A serial calculation

#!/bin/bash
#PBS -b 1
#PBS -l cpunum_job=1
#PBS -l elapstim_req=01:00:00
#PBS -l cputim_job=01:00:00
#PBS -l memsz_job=20gb
#PBS -N test
#PBS -o test.out
#PBS -j o
#PBS -q clmedium

# Change into the qsub directory
cd $PBS_O_WORKDIR

# Start of the computation
module load intel17.0.4
time ./program.ex

# Output of used resources (computation time, main memory) after the job
/usr/bin/nqsII/qstat -f ${PBS_JOBID/0:}

Page 59

NQSII batch script (NEC HPC Linux cluster)

• A parallel, multinode MPI calculation

#!/bin/bash
#PBS -T intmpi
#PBS -b 4
#PBS -l cpunum_job=32
#PBS -l elapstim_req=10:00:00
#PBS -l cputim_job=320:00:00
#PBS -l memsz_job=256gb
#PBS -N test
#PBS -o test.out
#PBS -j o
#PBS -q clbigmem

# Change into the qsub directory
cd $PBS_O_WORKDIR

# Start of the computation
module load intel17.0.4 intelmpi17.0.4
time mpirun $NQSII_MPIOPTS -np 128 ./program.ex

# Output of used resources (computation time, main memory) after the job
/usr/bin/nqsII/qstat -f ${PBS_JOBID/0:}

Page 60

NQSII batch script (NEC SX-ACE vector system)

• A multinode MPI calculation on the SX-ACE

#!/bin/bash
#PBS -T mpisx
#PBS -b 2
#PBS -l cpunum_job=4
#PBS -l elapstim_req=10:00:00
#PBS -l cputim_job=40:00:00
#PBS -l memsz_job=64gb
#PBS -N test
#PBS -o test.out
#PBS -j o
#PBS -q smallque

# Change into the qsub directory
cd $PBS_O_WORKDIR

# Gather performance information (for a Fortran-based program)
export F_PROGINF=DETAIL
export MPI_PROGINF=DETAIL

# Start of the computation
mpirun -nn 2 -np 4 ./program.sx

Page 61

Slurm batch script (rzcluster)

• An OpenMP parallel calculation

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=1000
#SBATCH --time=01:00:00
#SBATCH --job-name=test
#SBATCH --output=test.out
#SBATCH --error=test.err
#SBATCH --partition=small

export OMP_NUM_THREADS=4

module load intel16.0.0
time ./program.ex

Page 62

Slurm batch script (rzcluster)

• A hybrid MPI+OpenMP multinode calculation

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --tasks-per-node=3
#SBATCH --cpus-per-task=4
#SBATCH --mem=1000
#SBATCH --time=01:00:00
#SBATCH --job-name=test
#SBATCH --output=test.out
#SBATCH --error=test.err
#SBATCH --partition=small

export OMP_NUM_THREADS=4

module load intel16.0.0 intelmpi16.0.0
time mpirun -np 6 ./program.ex

Page 63

Multinode MPI calculations

• For Linux clusters:

• Batch nodes need to communicate with each other without a password!
  • Thus, a one-off generation of an ssh key pair is required. Two steps:

◦ 1.) Execute the ssh-keygen command

ssh-keygen -t rsa

◦ 2.) Copy the public key into the authorized_keys file

cp $HOME/.ssh/id_rsa.pub $HOME/.ssh/authorized_keys
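If an authorized_keys file already exists (e.g., with keys for external access), copying over it would remove those entries; appending is the safer variant (a sketch, with the restrictive permissions sshd expects):

# append instead of overwrite, then restrict permissions
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
chmod 600 $HOME/.ssh/authorized_keys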

Page 64

Job submission and control

• Important batch commands:

NEC HPC system (NQSII)        rz-/caucluster (Slurm)      Purpose

qsub jobscript                sbatch jobscript            Submission of a new batch job
qstatall                      squeue                      List all jobs currently in the system
qstat / qstatcl / qstatace    squeue -u username          List only your own jobs
qdel jobid                    scancel jobid               Delete or terminate a batch job
qstat -f jobid                scontrol show job jobid     Show details of a specific job
qcl / qace                    sinfo                       Get information about queues/partitions
qstat -J jobid                                            List the nodes on which the job is running

• Interactive work on batch nodes:
  • Possible on rzcluster and caucluster
  • Request a node via the batch system:

srun --pty --nodes=1 --cpus-per-task=1 --time=00:10:00 --partition=small /bin/bash
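Putting the Slurm commands together, a typical session might look as follows (jobscript.sh and the job id 12345 are placeholders):

sbatch jobscript.sh          # submit; prints the job id
squeue -u $USER              # monitor your own jobs
scontrol show job 12345      # inspect one job in detail
scancel 12345                # delete it if something went wrong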

Page 65

Appropriate use of the batch system

• Do not request more nodes and cores than are required by the computation

• Adapt the walltime and memory requests to the needs of the program
  • A more accurate specification can lead to shorter waiting times and increased throughput
  • But do not plan too restrictively

• Try to save intermediate results!
  • Particularly during longer calculations
  • Check if the program has a restart option

• The stdout of a batch job should be kept small; otherwise redirect it to a file on the local disk or $WORK, as sketched below
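A minimal sketch of such a redirection inside a batch script (the log file name is a placeholder):

# keep the job's stdout small: send verbose program output to a file on $WORK
./program.ex > $WORK/program.log 2>&1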

Page 66

Documentation and support

• WWW pages

• HPC in general:
  ◦ https://www.rz.uni-kiel.de/de/angebote/hiperf
• NEC SX-ACE vector system:
  ◦ https://www.rz.uni-kiel.de/de/angebote/hiperf/nec-sx-ace
• NEC HPC Linux Cluster:
  ◦ https://www.rz.uni-kiel.de/de/angebote/hiperf/nec-linux-cluster
• rzcluster:
  ◦ https://www.rz.uni-kiel.de/de/angebote/hiperf/linux-cluster-rzcluster
• caucluster:
  ◦ https://www.rz.uni-kiel.de/de/angebote/hiperf/linux-cluster-caucluster

• E-mail support (ticket system)

[email protected]