WestGrid Stakeholder Session & Annual General Meeting
October 10, 2018
Vancouver, BC
WestGrid Update
WestGrid Key Activities 2017-18
ARC Leadership
“I would like to thank the Working Group members who have spent the last nine months with me locked in rooms and on countless phone calls - you have been a highly dedicated group of people who have brought tremendous expertise and commitment to this task. It has been such a pleasure to work with you.”
- Robbin Tourangeau, LCDRI
Partnership: Precision Infection Management (PIM)
“Thank you so much Sergei and Lindsay for your contributions on this project! The serious computational muscle gave this project real legitimacy and I am very grateful to both of you for making this a reality! Thanks Lindsay! You are a rockstar, this wouldn't have been possible without your support!”
- Ian Lewis, LSARP Project Lead
Bioinformatics Helpdesk
133 subscribed users
22 question posts since launch in January
79 total questions in database
4 one-on-one sessions
6 questions via contact form
Advocacy
Unique Culture
The West is the Best
● Collaborative
● Social
● Likeminded “get it done” attitude
● Altruistic
Summary of WestGrid Value
★ Leadership
★ HQP development
★ Funding diversification
★ Cost savings to members
★ Education, Outreach & Training
★ Engagement & advocacy
★ Addressing user need
★ Improving user experience
Projects & Outreach
Co-op Students
“I am honored to have worked with so many people who were so passionate about their career, as well as the well-being of our industry. It was truly an amazing experience that I will never forget.”
- Steven Bucholtz, Okanagan College Student, Computer Information Services
“This was the perfect opportunity for us to learn how to set up a site from nothing. That everyone is approachable and willing to help without judgement or any hesitation. This opportunity is something I will never forget so again thank you for everything.”
- Christina Hebert, Okanagan College, Computer Information Sciences
Re-developed Status Page
CANHEIT-TECC
● Largest conference ever held at SFU Burnaby campus
● Almost 700 attendees
● 130 total sessions, over 100 hours of programming, including the Women in Technology Breakfast
● Over 30 staff from the Compute Canada federation presented in nearly 20 sessions over 3 days!
Survey Numbers: WestGrid User Survey & Accounts Renewals
Erin Trifunov
Manager, Projects and Outreach
WestGrid
WestGrid User Survey - 2017
“The services provided by Westgrid are invaluable and our organization would be substantially less
productive without WestGrid's assistance. Some of our research would have been impossible without
WestGrid.” - University of Victoria, Environmental & Earth Sciences Research Staff
GOAL: Awareness of Services
Increase awareness of specialized support services, especially Bioinformatics support.
GOAL: WestGrid User Support
23% increase
Increase awareness of and offerings of support
GOAL: Training & Outreach
Increase awareness of and offerings of support for new users & in-person sessions.
In 2017-18 we delivered:
• 1000s hrs, 42 events
• 800+ RSVPs, 42% new users
• 2 regional summer schools
• 24 Software Carpentry events
• 11 WestGrid Town Halls
• 6 Viz workshops
• SMT visits to all 7 sites
New User Support
2 In-person HPC Summer Schools
New website for training materials - simpler to use
Visualize This!
National competition led by WestGrid’s Visualization and Training Coordinator Alex Razoumov
GOAL: Overall Satisfaction
2018 Compute Canada Renewal Survey: Overall Satisfaction with Compute Canada (8675 respondents, 2581 from within WestGrid, 37% PIs)
2017 WestGrid User Survey: Overall Satisfaction with Compute Canada (283 respondents, 28% PIs)
Provide more targeted user support for RAC.
RAC User Support
● RAC Best Practices presentations from WG Director of Operations (3 hrs+ each)
  ○ 2017: 2 Sites + 1 Virtual
  ○ 2018: 6 Sites + 2 Virtual
● RAC User Guides, Slides and Site Analysis
2018-19 Goals & Opportunities
Build awareness of WestGrid services and value to stakeholders.
User Support
12% (down from 56% in 2016)
● 11% DO NOT have adequate local support (21% from UofA, 16% from UVIC & SFU)
● 20% “didn’t know” (25% from UBC, 15% UofA)
● 35% of less experienced users felt they DO HAVE adequate local support, 38% unsure
GOAL: Increase awareness of and advocate for more local support presence.
User Growth vs Staffing
From 2013-2018, CC users from WestGrid-based institutions INCREASED 580%
WestGrid FTE levels from 2013-2018 increased by only 66%
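To put the two growth figures side by side, the short sketch below estimates how the ratio of users to staff changed over the period. It assumes that "increased 580%" means the 2018 user count is 6.8 times the 2013 count (and 1.66 times for FTEs); the function name is illustrative and does not come from the slides.

```python
def ratio_change(user_growth_pct: float, fte_growth_pct: float) -> float:
    """Return the factor by which users-per-FTE changed over the period."""
    # Assumed reading: a 580% increase means 6.8x the starting value.
    users_factor = 1 + user_growth_pct / 100
    # Same reading for staffing: a 66% increase means 1.66x.
    fte_factor = 1 + fte_growth_pct / 100
    return users_factor / fte_factor


factor = ratio_change(580, 66)
print(f"Users per FTE grew by a factor of about {factor:.1f}")
```

Under that reading, each FTE supported roughly four times as many users in 2018 as in 2013.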
Need for Local Support
“I strongly believe that the usefulness of Compute Canada in general, and of Westgrid in particular, is a
direct function of the strong local support.” - Manitoba Principal Investigator
New Opportunities
Explore opportunities to meet resource needs and users outside CC scope.
Resources Needs
Future needs
Industry Activity
Operations update
Highlights
New Cloud
  Sept 2016: Arbutus (UVictoria)
  Oct 2018: Major additional capacity (cores and storage)
New GP systems
  Summer 2017: Cedar (SFU), Graham (Waterloo)
  Autumn 2017-2018: Major Cedar upgrade: ~2x increase in cores! Planned doubling in storage.
New LP system
  April 2018: Niagara (Toronto)
Defunding old systems (2017/2018)
  Grex (UManitoba, Mar. 31, 2018)
  Parallel/Lattice (UCalgary, Mar. 31, 2018)
  Hungabee/Jasper (UofA, Oct. 1, 2017)
  Silo (USask, Jan. 31, 2017)
Migration (2017-2018)
  Migrate users from legacy to new systems (WestGrid led the CC Migration Working Group)
New system support: software, documentation and training (2017-2018)
  Complete new software distribution system (CVMFS).
  Comprehensive new software suite.
  Full re-write of user docs.
  Led by Research Support National Team: extensive WG involvement.
Support Tickets
April-May 2018: transition from WG OTRS ticketing system to CC central ticketing system.
● Statistics are from June 1, 2018 (CC OTRS ticketing system)
Total CC tickets in all queues | 6,800
Total WG tickets in all queues | 2,266 (33%) | WG users: ~30%
Average WG tickets per month in all queues | 566 | Includes CC accounts and software install tickets.
Average tickets per month in WG queues | 344 | WG OTRS 2017: 397
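The share and monthly average in the ticket statistics can be checked with a few lines of arithmetic. This is a minimal sketch: the four-month window is an assumption based on the June 1 to Oct. 1 period quoted for the monthly mean, and the variable names are illustrative.

```python
total_cc_tickets = 6800   # total CC tickets in all queues
total_wg_tickets = 2266   # total WG tickets in all queues
months = 4                # assumption: June 1 to Oct. 1, 2018

# WestGrid's share of all Compute Canada tickets.
wg_share = total_wg_tickets / total_cc_tickets

# Average WG tickets per month across all queues.
wg_per_month = total_wg_tickets / months

print(f"WG share of CC tickets: {wg_share:.0%}")
print(f"WG tickets per month: {wg_per_month:.1f}")
```

Both results agree with the slide: about a third of all tickets, and roughly 566 per month once the fractional ticket is dropped.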
WG Tickets per Month
March-April-May: WG OTRS to Central CC OTRS
March-April: Account renewals
April: Allocations implemented
Monthly mean (June 1-Oct. 1): 344 (WG OTRS 2017 was 396)
WG Tickets in separate WG OTRS system
Top Responders
1 Ali Kerrache University of Manitoba 889
2 Daniel Stubbs Université de Montréal 658
3 Maxime Boissonneault Université de Laval 518
4 Doug Phillips University of Calgary 290
5 Roman Baranowski UBC 227
6 Erick Giguere Université de Sherbrooke 143
7 Belaid Moa University of Victoria 138
8 Ross Dickson Dalhousie University 133
9 Pier-Luc St-Onge McGill University 129
10 Charles Coulombe Université de Laval 126
15 Grigory Shamov Manitoba 94
18 Martin Siegert SFU 84
19 Dmitri Rozmanov Calgary 79
30 Ataollah Roudgar SFU 51
32 Kamil Marcinkowski Alberta 48
34 John Simpson Alberta 47
40 Malcolm Petch UBCO 36
44 Adam McKenzie Sask 31
48 Robert Fridman Calgary 24
49 Chris Want Alberta 22
Past Events
General | Jan/Feb | Meltdown bug fixes
Arbutus | Feb 16 | Bug in OpenStack High Availability network implementation
  ● HA was removed after about 2 weeks of instability.
Graham | Aug. 21 to Sep. 5 (ongoing recovery) | Facility upgrade to dual 3MW transformers.
  ● Two separate UPS systems, each backed by its own generator.
  ● /project filesystem corruption: 2% of files needed to be restored from tape (8M files!). See https://status.computecanada.ca (Sept. 5) for details. Note that restore has only just completed.
  ● There also seem to be some remaining filesystem issues (/home).
Cedar | Aug 30 | Router and switch firmware upgrade. Vancouver link is now 100 GBit! About 60 minutes. Jobs continued to run.
August | New Atlas Tier 1 is in production. Data migration from old to new is complete.
Arbutus | July 22 | BCNet network maintenance - short disruptions to cloud network access.
Upgrades and Maintenance
Orcinus | October | 2 day power systems maintenance.
Orcinus | Mar. 31, 2019 | Defunding! Last of the WestGrid legacy systems.
Cedar | Oct. 15-19 | 1 week major outage
  ● OS updates (to CentOS 7.5)
  ● Related Lustre and Omnipath upgrades
  ● Intel OPA (Omnipath) cable issues
Cedar | Planning
  1. ownCloud upgrade
  2. Nearline (tape) service (ready for RAC 2019)
  3. Cloud partition
Graham | Oct 9-10 | Electrical work by regional utility with complete outage.
Arbutus cloud | October 3, 2018 (in progress) | Major outage for cloud expansion
  ● Additional ~1,400 cores and 3.5 PB usable storage.
  ● DB service with 2 new DBaaS nodes
  ● New OpenStack version.
WG RAC 2018 Stats
Number of applications from members | 149
Number of applications receiving an allocation | 147 | 99% of WG applications received some allocation.
Average science score | 3.33 | Lower limit was 2.0. CC average was 3.33.
Number of GPU asks | 21 | 451 CY allocated
Descriptor | Score | Patrick’s Comments
Exceptional | 5 | This is almost never given, and is only for the highest quality research and proposal.
Outstanding | 4 | Excellent science, excellent technical approaches with high quality proposals.
Very strong | 3 | Small impressions can make a difference between “Strong” and “Very Strong”.
Strong | 2 | RAC 2018 cutoff was 2.0, so “Strong” proposals are insufficient for an allocation!
Moderate | 1 | A passable proposal with no significant issues.
Insufficient | 0 | Significant issues either scientific or technical.
RAC Resource Shortfall
The major issue by far is the lack of resources. The basic numbers are shortfalls of 2x in computational cores (available is about 50% of the ask), and 5x in GPU cores across Compute Canada. Similar numbers are expected for RAC 2019 after legacy systems are defunded and the new GP4 system is installed.
The overwhelming impression from a detailed re-reading of the RAC 2018 WestGrid proposals is that of well-written, well-justified asks from competent and generally experienced teams. The prevalent complaint from the survey, RAC proposals and informal discussion is that of very long queue times and inability to acquire sufficient resources to carry out a research program.
So a strong recommendation is that WestGrid should consider acquiring additional resources.
Note: institutions are re-purposing legacy systems or acquiring new hardware for institutional ARC.
Under Pressure!
RAC 2018 Results
GPUs: Really Bad
RAC 2018: 21% of asks allocated!
GPUs and Machine Learning
● Major increase in GPU asks!
  ○ Machine Learning is the major use
  ○ Increasing demand for production training
● Both Cedar and Graham have GPUs (P100s)
  ○ Béluga will have lots of them!
  ○ Current estimate on linear extrapolation: <30% of asks can be satisfied!
● Future Compute Canada planning
  ○ Opportune time as CC is working with ISED and Federal Gov’t for DRI.
GPU requests will require very strong justifications.
Machine Learning
Type | # WG asks | Average (median) GY per ask | Description
Component | 17 (11%) | 23 (96) | Projects which use Machine Learning as a component of the proposed methods.
Research | 4 (3%) | 87 (5) | Projects which are actively researching machine learning algorithms.
Platforms and Portals
WestGrid RPP 2018 Applications
Type | # | Share
Portals (large) | 16 | 64%
Platforms (smaller) | 6 | 24%
External portal | 3 | 12%
Total | 25
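The percentage column follows directly from the application counts. The sketch below recomputes it; the dictionary and variable names are illustrative only.

```python
# WestGrid RPP 2018 application counts by type, as listed on the slide.
applications = {
    "Portals (large)": 16,
    "Platforms (smaller)": 6,
    "External portal": 3,
}

total = sum(applications.values())

# Share of each type, rounded to a whole percent.
shares = {name: round(100 * count / total) for name, count in applications.items()}

for name, pct in shares.items():
    print(f"{name}: {applications[name]} of {total} ({pct}%)")
```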
The Research Platforms and Portals (RPP) Competition enables communities to develop research projects that improve access to shared datasets, enhance existing online research tools and facilities, or advance national or international research collaborations.
Up from 0 three years ago!
RPP Development and Management
● NO DIRECT SUPPORT OFFERED BY CC.
● There is a need for such tier 3 support
  ○ Requests for support.
  ○ Tickets for specific sysadmin help.
  ○ Security breaches due to naive sysadmin.
● WG has expertise
  ○ Cloud National Team (WG-centred) provides tier 2 support
  ○ Sysadmin generally
● WG providing some support (0.5 FTE for LSARP)
● CC has Middleware national team developing useful services (like single-sign-on)
WG proposal to hire a junior developer/DevOps
RAC Best Practices
University of Manitoba Monday September 17
University of Victoria Thursday September 20
University of Alberta Monday September 24
University of Calgary Thursday September 27
University of British Columbia Friday October 12
Simon Fraser University Thursday October 11
Online: Compute Canada Public Q&A Thursday October 4
Online recap and office hours: Mid-October
RAC Best Practices on-site presentations from WG Director of Operations
1. Introduction by site teams (½ hour, site-dependent)
2. WestGrid Introduction by Patrick (½ hour)
3. RAC Best Practices by Patrick (1 hour)
Plus Researcher Consultation led by Patrick (1 hour as necessary)
WestGrid Annual General Meeting
Member Confirmation
Institution Voting Member Representative Attending:
UBC Steve Cundy, Director of ARC (Proxy) Remote
SFU No member in attendance
UVic Wency Lum, CIO (Proxy) Remote
UofC No member in attendance
UofA Walter Dixon, AVP(R) Remote
USask Dena McMartin, AVP(R) In person
UofM James Kerr, Director of Technology Services Remote
Agenda
1. Call to Order & Member Representative Confirmation
2. Approval of Agenda
3. Approval of Financial Statements
4. Appointment of Public Accountant
5. Approval of Director Terms
6. Other Business
7. Adjournment
Thank you