WestGrid Town Hall: October 2017


Transcript of WestGrid Town Hall: October 2017

Page 1: WestGrid Town Hall: October 2017

WestGrid Town Hall: October 2017

Patrick Mann, Director of Operations
Greg Newby, CTO, Compute Canada

Friday, October 13, 2017

Page 2: WestGrid Town Hall: October 2017

Admin

To ask questions:
● Webstream: Email [email protected]
● Vidyo: Un-mute & ask your question

(VidyoDesktop users can also type questions in Vidyo Chat - click the chat bubble icon in Vidyo menu)

Vidyo Users: Please MUTE yourself when not speaking (click the microphone icon to mute / un-mute)

Page 3: WestGrid Town Hall: October 2017

Outline

1. New Systems Status
2. Known Issues
3. Legacy Systems Migration Update
4. RAC 2018
   a. Updates
   b. Best Practices
5. Upcoming User Training
6. Future Systems
   a. Status and Availability
7. Consultation for GP4 - 2018 (Greg Newby)

Page 4: WestGrid Town Hall: October 2017

New System Updates

Page 5: WestGrid Town Hall: October 2017

National Compute Systems

System: Cedar (GP2, SFU)
● 27,696 CPU cores
● 146 GPU nodes (4 x NVIDIA P100)
● Major expansion in progress
Status: IN OPERATION (June 30, 2017); RAC 2017

System: Graham (GP3, Waterloo)
● 33,472 CPU cores
● 160 GPU nodes (2 x NVIDIA P100)
Status: IN OPERATION (June 30, 2017); RAC 2017

System: Niagara (LP1, Toronto)
● Large parallel system, ~60,000 cores
● Final contract negotiations with vendor
Status: Expected online early 2018; RAC 2018

System: GP4 (Quebec)
● Early stages
● 2nd half of this presentation!
Status: Later in 2018; not included in RAC 2018

Page 6: WestGrid Town Hall: October 2017

National Cloud Systems

System: Arbutus (UVic)
● 290 nodes - 7,640 cores (including original west.cloud)
● 100G network completed July.
● Additional storage purchase in progress (Ceph for block storage)
  ○ Current ~1.5 PB raw, increase by 2 TB (3x replicated)
● Funding for additional cores is available; ~2,000 cores planned.
Status: IN OPERATION (Sep 2016); RAC 2017

System: Cedar Cloud Partition (SFU)
● 10 nodes with 2x16 cores/node, ~500 TB usable Ceph storage
● Elastic - could expand to 48 nodes as necessary
Status: Under development

System: Graham Cloud Partition (Waterloo)
● 10 nodes with 2x16 cores/node, 256 GB, ~100 TB usable Ceph storage
● Elastic - could expand to 53 nodes as necessary
Status: IN OPERATION (Sep 15, 2017)

System: east.cloud (Sherbrooke)
● 36 nodes with 2x16 cores/node, 128 GB, ~100 TB usable Ceph storage
Status: IN OPERATION (Sep 2016)

RAC 2018 - generic cloud. Please note in the justification if you are already on one of the cloud systems.

Page 7: WestGrid Town Hall: October 2017

National Data Cyberinfrastructure (NDC)

System: NDC-SFU-Project / NDC-Waterloo-Project
● 10 PB at SFU (cedar)
● 13 PB at UWaterloo (graham)
● Backed up to tape
Status: Available; RAC 2018

System: NDC-SFU-Nearline / NDC-Waterloo-Nearline
● Tape libraries installed and in operation
● Tape off-lining processes in development
Status: Autumn 2017?; RAC 2018

System: NDC-Object Storage
● Object Storage. DDN WOS.
● Lots of demand but not allocated.
● Initial installation for special projects should be available shortly.
● Otherwise needs a lot of work.
Status: Delayed; available 2018; not available for RAC

System: Attached (Scratch)
● High performance storage attached to clusters
Status: Available; not available for RAC

National Data Cyberinfrastructure (CC Docs wiki)

Page 8: WestGrid Town Hall: October 2017

Known Issues

See https://docs.computecanada.ca/wiki/Known_issues

Page 9: WestGrid Town Hall: October 2017

Storage

1. Nearline is under development
   ○ RAC 2017 Nearline allocations are not yet available.
   ○ Nearline allocations are being implemented in /project on request.
   ○ May end up being full HSM from /project. Under investigation.

2. Project space was re-configured to allow group sharing.
   ○ /project/<groupid>/<groupmember1, groupmember2, ..>
   ○ Each group member's directory is in the PI's group.
   ○ Shared (writeable) subdirectories can be created (a minimal sketch follows below).
   ○ https://docs.computecanada.ca/wiki/Project_layout
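For illustration only, a minimal sketch of creating a shared, group-writeable subdirectory under /project. The group name def-somepi and the directory name are hypothetical placeholders; ask support if you are unsure of your group's layout.

    # Assumption: your project group is "def-somepi" (hypothetical).
    cd /project/def-somepi
    mkdir shared_data                # create the shared subdirectory
    chgrp def-somepi shared_data     # make sure it belongs to the project group
    chmod g+rwxs shared_data         # group read/write; setgid keeps new files in the group
    ls -ld shared_data               # verify, e.g. drwxrwsr-x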

Ask [email protected] if you want help with group requests, shared storage, Nearline, ...

Page 10: WestGrid Town Hall: October 2017

Scheduling Issues

RAS (default) users can be overwhelmed by a few users with optimal jobs
● Users with small/short jobs can take almost all default resources and have ended up penalizing default users with larger jobs.
● A new configuration decouples default users, but must wait for a major cluster update to implement.

Large groups with many members may not get their group-based fair share (RAC) compared to groups with fewer members.
● A new fair-share configuration and priority algorithm have been implemented (this week).

Not enough small jobs to fill in all the holes!
● It is a good idea to run small/short jobs if you have got them. And we recommend whole-node jobs.

Best Practice
● Request whole nodes when possible
  ○ --nodes=4 --ntasks-per-node=32
● Request short run-times
  ○ --time=3:00 # DD_HH:MM
The scheduler will choose the best partition for your job. (A minimal job-script sketch follows below.)
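As an illustrative sketch of these recommendations in a Slurm job script: the account name def-somepi, the 32-core node size, and the 3-hour run-time are hypothetical placeholders to adapt to your own allocation and cluster.

    #!/bin/bash
    #SBATCH --account=def-somepi       # hypothetical allocation account
    #SBATCH --nodes=4                  # request whole nodes
    #SBATCH --ntasks-per-node=32       # all cores on each (assumed) 32-core node
    #SBATCH --time=0-03:00             # short run-time, D-HH:MM format
    #SBATCH --mem=0                    # request all memory on each node
    srun ./my_parallel_program         # my_parallel_program is a placeholder

Submit with: sbatch whole_node_job.sh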

Kamil's scheduling course, Oct 24: "How to get the most from a Cluster"

● Kamil also gave a SHARCNET scheduling course.

Page 11: WestGrid Town Hall: October 2017

Development

1. CVMFS vs System Conflicts
   a. Shared libraries and applications live on CVMFS (shared between graham and cedar, and also on the test and dev systems). Some base libraries and apps come with the system.
   b. So build systems (cmake, autoconf, ..) can get mixed up. (See the sketch below for one way to check what a build is picking up.)
   c. Ask [email protected]
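As an illustration of how such mix-ups can be diagnosed, here is a rough sketch of checking which toolchain and libraries a module-based build actually resolves to. The module versions (gcc/5.4.0, openmpi/2.1.1) and the binary name my_program are assumptions; check "module avail" for what is currently installed.

    module purge                          # start from a clean module environment
    module load gcc/5.4.0 openmpi/2.1.1   # example versions - may differ on your cluster
    which gcc mpicc                       # confirm the CVMFS-provided toolchain is first in PATH
    echo $CMAKE_PREFIX_PATH               # if set by the loaded modules, cmake searches these first
    ldd ./my_program | grep -v cvmfs      # remaining lines point at system libraries, not CVMFS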

2. VNC/GPU (interactive) nodes are not available yet
   a. Neither graham nor cedar has interactive visualization nodes yet.
      i. Client-server ParaView/VisIt is available (a rough connection sketch follows below).
      ii. See WG data visualization training sessions.
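For readers unfamiliar with client-server mode, a rough sketch of one way it can be used with ParaView; the account, username, module version, node name, and port are assumptions, and the CC visualization documentation and training sessions are the authoritative reference.

    # On the cluster: start an interactive allocation and launch the ParaView server.
    salloc --time=1:00:00 --ntasks=4 --account=def-somepi   # hypothetical account
    module load paraview                                     # module/version name may differ
    srun pvserver --server-port=11111                        # note which compute node it runs on

    # On your workstation: tunnel the port through the login node, then connect
    # the ParaView GUI to localhost:11111 (File -> Connect).
    ssh -L 11111:cdr123:11111 username@cedar.computecanada.ca   # cdr123 is a placeholder node name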

Page 12: WestGrid Town Hall: October 2017

Defunded Systems & Migration

Page 13: WestGrid Town Hall: October 2017

WestGrid DEFUNDED Systems

Site: Victoria - Hermes/Nestor
Defunded: June 1, 2017
Status: Nestor offline. (Virtual) Hermes - local users only.

Site: Calgary - Breezy/Lattice
Defunded: August 31, 2017
Status: Shared parallel storage still available. Breezy/Lattice - local users only.

Site: Edmonton - Hungabee/Jasper
Defunded: October 1, 2017
Status: Not available. File system updates and maintenance for 4-8 weeks; local users only after this.

Local users are defined as those users who are members of the host site institution, or who are affiliated with a Principal Investigator from the site. Researchers are encouraged to contact their local sites directly for further information:

● University of Alberta: [email protected]
● University of Calgary: [email protected]
● University of Victoria: [email protected]

Page 14: WestGrid Town Hall: October 2017

WestGrid Legacy Systems

Site: SFU - Bugaboo (next slide)
Defunding date: December 31, 2017

Site: UBC - Orcinus
Defunding date: March 31, 2018

Site: Manitoba - Grex
Defunding date: March 31, 2018

Site: Calgary - Parallel
Defunding date: March 31, 2018

WestGrid Migration Details: https://www.westgrid.ca/migration_process

IMPORTANT: Data on defunded systems will be deleted after the published deletion date. WestGrid will not retain any long-term or back-up copies of user data and, as noted above, users must arrange for migration of their data.

Page 15: WestGrid Town Hall: October 2017

Bugaboo Migration

Bugaboo storage is going off support at the end of the year.
● SFU is purchasing ~100 nodes (~4,000 cores) ahead of the major expansion.
● These will be added to cedar right away.
● These nodes will be used immediately to move allocations from Bugaboo.

Groups will be moved one-by-one to cedar.
● Each group's storage will be migrated to cedar.
● Details to follow.

Everyone is encouraged to move now if they can!
● Especially those without an allocation: they need to move their data themselves. (A rough transfer sketch follows below.)
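As a rough illustration of moving data yourself (the username, group, and paths below are hypothetical; confirm hostnames and destination directories with your local support team):

    # Run while logged in to the legacy system; adjust the paths to your own data.
    rsync -avP ~/my_results/ \
        username@cedar.computecanada.ca:/project/def-somepi/username/my_results/
    # -a preserves permissions and timestamps, -v is verbose, -P shows progress and
    # keeps partial files so an interrupted transfer can be resumed by re-running
    # the same command.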

Page 16: WestGrid Town Hall: October 2017

Resource Allocation Competition

RAC 2018

Page 17: WestGrid Town Hall: October 2017

RAC 2018

RAC 2018 schedule (Start - Finish):
● Fast Track submission: Oct 3, 2017 - Nov 2, 2017
● RRG & RPP full application submission: Oct 3, 2017 - Nov 16, 2017
● RPP Progress Report submission: Jan 30, 2018 - Mar 14, 2018
● Award letters sent: Mid Mar 2018 - Late Mar 2018
● RAC 2018 allocations implemented: Mid Apr 2018 - Early May 2018

RAC 2018 application forms are online!
● https://www.computecanada.ca/research-portal/accessing-resources/resource-allocation-competitions/

Page 18: WestGrid Town Hall: October 2017

Continuing Demand!

Page 19: WestGrid Town Hall: October 2017

Good & Bad News:

● Niagara (LP): Expected to be in production by early 2018. Included in RAC 2018.
● Cedar stage 2: Expansion to Cedar (compute) plus additional storage by early 2018. Included in RAC 2018.
● Graham stage 2: Additional storage (no additional compute).

Page 20: WestGrid Town Hall: October 2017

RAC 2018 Changes to note:

● The Fast Track eligibility criteria have been relaxed: 296 eligible PIs.

● A Notice of Intent (NOI) is NOT a prerequisite for any RAC.

● PIs can delegate an RPP or RRG application to a sponsored user.
   ○ Fast Track applications cannot be delegated: PIs must complete the application themselves.

● A CCV is required for any RRG or RPP full application,
   ○ but not for Fast Track or continuing RPPs (progress reports).

Page 21: WestGrid Town Hall: October 2017

CCV Info

Applicants with no CCV on CCDB
● No change in process.
● Follow the instructions in the CCV Submission Guide.

Applicants with a CCV on CCDB
● To upload an UPDATED CCV you must:
   a. "Destroy" the existing Compute Canada CCV on CCDB (not on the CCV site), then
   b. upload the NEW version and select the publications enabled by CC.

● NOTE: Only the PI can destroy their CCV (this cannot be delegated to a sponsored user)

● IMPORTANT: Carefully read the instructions in the CCV Submission Guide (section 6)

Page 22: WestGrid Town Hall: October 2017

Niagara and RAC 2018

Niagara is being allocated for RAC 2018. ~60k cores!
● In the technical justification, be very clear about any requirements for large parallel jobs.
   ○ Things like results of scaling tests are much appreciated.
   ○ Would you like the whole machine for some runs?
● Rule-of-thumb:
   ○ >1,024 cores → niagara
   ○ ≤1,024 cores → cedar/graham (1,024-core "islands")
   ○ Large memory (>4 GB/core) should go to cedar/graham.
   ○ No GPUs on niagara.

Page 23: WestGrid Town Hall: October 2017

Competitive Process

● CFI mandates allocations based on excellence

● Peer-review process to determine SCIENCE SCORE
   ○ Based on quality of science, HQP, PI experience, resource plan
● High Score = Less Scaling

***Scaling is required due to insufficient resources, which makes the RAC a very competitive process.***

Page 24: WestGrid Town Hall: October 2017

Applying for a RAC?

Summary of Best Practices
● JUSTIFY YOUR REQUEST (Impact & Resources)
● BE COMPLETE - PROVIDE DETAILS
   ○ Follow the templates
   ○ By project: Resources, HQP
● BE CONCISE
● SHOW HQP IMPACT
● UPDATED CCV & PAST YEAR PROGRESS

Page 25: WestGrid Town Hall: October 2017

Justify your Request

● Provide details on significance and impact of research.

● Citation rates of recent work help justify the science.
● Justify why/how the resources requested will be used to accomplish/support the science.

“Please improve motivation of why the proposed calculations are important, and what is to be learned and/or what other science depends on the results.”

Page 26: WestGrid Town Hall: October 2017

Provide Adequate Details

● Clearly explain WHAT the science is, using specific details.
● Use tables to provide resource details by project.
● If it is difficult to predict usage, then emphasize the areas of uncertainty.

Project | Team Members | Estimated Core-Years | /project Storage | Memory/core | Comments
Project 1 | Student X | 10,000 | 100 TB | 4 GB | ...
Project 2 | Students Y, Z | 5,000 | 50 TB | 32 GB | ...
Totals | | 15,000 | 150 TB | |

“The technical justification did not show a calculation of the computing and storage needs. Providing a table with this information, as suggested in the guidelines, would have made this section stronger.”
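As a purely illustrative example of how a core-year estimate can be derived (these numbers are hypothetical, not taken from the slide): a project planning 500 simulations, each using 64 cores for about 12 days, would need roughly

    500 simulations x 64 cores x 12 days = 384,000 core-days
    384,000 core-days / 365 days per year ≈ 1,050 core-years

Showing the arithmetic like this lets reviewers check that the request matches the work plan.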

Page 27: WestGrid Town Hall: October 2017

Be Concise

● Provide ALL the information asked for, but ONLY what is asked for.
● Answer the specific questions and provide only those details requested.
● Do not re-use submissions from other proposals or competitions.
● Be clear, avoid jargon - reviewers are not experts in all sub-domains.
● Take time to edit and review the application before submitting it.

“The proposal is very long. The very detailed scientific justification is more reminiscent of a NSERC proposal. For next year, I recommend to shorten the science part, and to work out more clearly the purpose of the CC usage, and the results that will be enabled by the CC Allocation.”

Page 28: WestGrid Town Hall: October 2017

HQP Impact

● Ideally provide a table showing the expected HQP at each level (undergraduate, masters, doctorate, etc).

● Highlight the contribution of any excellent graduate students.
● Mention any training opportunities beyond that of normal academic teaching.

“For future allocation requests… describe in detail the involvement of HQP in past projects that utilized Compute Canada resources. The current proposal states that one PDF will be involved and reviewers were wondering if there were plans for this PDF to mentor junior graduate or undergraduate students. If so, mention this in the proposal.”

***“This proposal was very clearly articulated with an integrated HQP training plan. The number of HQPs is impressive.”

Page 29: WestGrid Town Hall: October 2017

Updated CCV & Past Year Progress

● Keep your CCV up to date!
● Reference citations in the proposal (“Progress over the Past Year” section) which were and will continue to be supported by the resources requested.

“For future allocation requests, please consider adding references to published work in the proposal, progress made in the previous year, and clear description of how HQP will be involved in achieving the milestones.”

***“Progress over the past year is missing. Based on the CCV, most of the group's publications were enabled through Compute Canada resources, so that information should have been accessible and would have been relevant to include.”

Page 30: WestGrid Town Hall: October 2017

Training & Workshops

Page 31: WestGrid Town Hall: October 2017

Training on New Systems

Documentation Wiki (*always under development!*)
https://docs.computecanada.ca

Getting started on the systems:
● Running Jobs - https://docs.computecanada.ca/wiki/Running_jobs
● Available Software - https://docs.computecanada.ca/wiki/Available_software
● Storage & File Management - https://docs.computecanada.ca/wiki/Storage_and_file_management

Mini-Webinar Video Series
http://tinyurl.com/CCsystemwebinars

Short video demonstrations of how to use the new systems. Topics include:
● Software environment
● File systems
● Managing jobs
● Common mistakes to avoid
● Getting help

Page 32: WestGrid Town Hall: October 2017

WestGrid Training Sessions

Full details online at www.westgrid.ca/training

DATE | TOPIC | TARGET AUDIENCE
OCT 18 | Intro to ARC Support in Manitoba | Researchers in Manitoba (in person)
OCT 24-26 | Scheduling & Job Management | CC / WG users (online)
OCT 25 | UVic: ARC Support & RAC 2018 | Researchers at UVic (in person)
NOV 1 | Introduction to Classical Molecular Dynamics Simulations | Anyone (online)
NOV 21 | Exploring Containerization with Singularity | Anyone (online)

Page 33: WestGrid Town Hall: October 2017

Greg Newby
Chief Technical Officer (CTO)

Compute Canada

Page 34: WestGrid Town Hall: October 2017

Stage 1 Review & Stage 2 Preview

Stage 1 timeline (2015-2018):
● Award notification: June 2015
● Award finalization: February 2016
● 1st system (Arbutus) operational: Sept 2016
● Cedar, Graham operational: June 2017
● Niagara operational: early 2018

STAGE 1: 3 systems operational. Software services development continues through 2018.
*Note: Niagara purposely delayed by recommendation of CFI expert panel, to benefit from technology improvements.

Stage 2 (2018-19) includes expansions of Stage 1 systems and the new GP4 system:
● Cedar & Graham expansion: 2017-18
● Cloud expansion: 2018
● GP4 operational: early 2019

Page 35: WestGrid Town Hall: October 2017

GP4

CFI Cyberinfrastructure Round 2 funding. A major new General Purpose system at a 5th national hosting site:
● ~$17M for compute
● ~$7M for storage
● Mix of CPUs, GPUs, fat nodes

The lead site is currently working on the RFP. We need input - what should GP4 include?

We have invited Greg Newby, CC's CTO, to review GP4 and lead a discussion.

● Cedar: additional ~$12M for compute + storage
● Graham: additional ~$1M for storage

Page 36: WestGrid Town Hall: October 2017

Feedback needed!

Node Types
● What different node types are most needed?
● Should there be adjustments to the characteristics of the node types compared with Cedar/Graham?
● What about the relative quantity of each node type?
● Memory/node? GPU nodes?
● Interconnects - InfiniBand, Intel Omni-Path - what would be a reasonable limit on parallel jobs?

Workloads
● What workloads are not well-supported by Cedar? How should GP4 be different?
● Should we focus on long/short, thin/fat, serial/parallel jobs?
● Interactive use (bioinformatics pipelines??)

Other considerations?
● Constraints or configuration choices for Cedar and Graham that could be different for GP4?
● Specials - FPGAs, for instance?

Page 37: WestGrid Town Hall: October 2017

GP4 Discussion

● Webstream viewers: Email [email protected]

● Vidyo viewers: Un-mute & ask question or use Vidyo Chat (chat bubble icon in Vidyo menu)