Post on 17-Jul-2020
WestGrid Town Hall: September 20, 2019
Patrick Mann, WestGrid Director of Operations
Ikenna Okpala, WestGrid Senior Developer
Alex Razoumov, WestGrid Training and Visualization
Admin
To ask questions during today’s session:
From Webstream: Email info@westgrid.ca
From Vidyo: Use the GROUP CHAT to ask questions.
Please mute your mic unless you have a question.
Afterwards: support@computecanada.ca
Outline
1. WestGrid & Compute Canada Updates
2. 2020 Resource Allocation Competitions (RAC)
   a. Overview & changes from 2019
3. Operations
   a. Outages and maintenance
   b. Updates
4. Web Development for Scientific Gateways
   a. A high-level overview from the WestGrid Development team
5. User Training - Webinars & Workshops
WestGrid and Compute Canada Update
DRI NewOrg
Follow-up from the Spring 2018 announcement of $572 million for DRI.
● Establish a new organization that will encompass:
  ○ Research Data Management
  ○ Advanced Research Computing
  ○ Research Software
● Approved funding from federal Minister of Science of $375 million.
● Applicant Board announced - more info at engagedri.ca
● Inaugural board is to be appointed early in 2020
● Face-to-face consultation sessions starting next week.
  ○ “consultations will build on the feedback received during the development of the proposal in the spring and address governance issues including membership and board composition for the new organization.”
NewOrg Consultations
Toronto Tuesday September 24
Montreal Friday September 27
Ottawa Tuesday October 1
Halifax Thursday October 3
Saskatoon Monday October 7
Vancouver Friday October 11
http://engagedri.ca/
New Organization for Digital Research Infrastructure (DRI)
NewOrg Applicant Board
Dr. Guillaume Bourque: Director of the Canadian Centre for Computational Genomics; Professor, Human Genetics, McGill University
Ms. Barbara Kieley, ICD.D: Retired Partner, Ernst & Young; focus on public sector digital service transformation projects
Mr. Peter MacKinnon, OC, QC: President Emeritus of the University of Saskatchewan, and former interim president of Dalhousie University and Athabasca University
Ms. Lori MacMullen: Executive Director, Canadian University Council of Chief Information Officers (CUCCIO)
● Incorporate the new organization
  ○ Basic bylaws, structure
● Appoint interim CEO
● Appoint inaugural board
WestGrid
CEO Search
● Lindsay Sill, the WestGrid CEO, is moving on after many years at the helm of WestGrid.
● The WG Board is recruiting a CEO - watch for the profile coming out soon from Odgers Berndtson
WestGrid and Compute Canada Annual General Meetings
● CC AGM: Sep 25 Calgary
● WG AGM: Sep 26-27 Calgary
  ○ WG Mission, Vision and Core Values approved by the board
  ○ Work underway to draft the strategy and action plans to align with our new vision.
RAC 2020
Patrick Mann, WestGrid Director of Operations
Review of RAC 2019
2019 RAC had the highest number of applications in history: 507 projects, 8.1% more than 2018!
Due to high demand, the 2019 RAC was able to award:
● 41% of total CPUs requested
● 86% of total storage requested
● 20% of total GPUs requested
● 95% of total virtual CPUs requested
2020 RAC will be another competitive year!
CPU Allocation Trends
As usual we are not able to meet demand by a very wide margin.
GPUs!!!
Year   Supply   Need    Allocated   Allocated/Need
2019   1,644    6,555   1,331       20.3%
2018   976      4,092   840         20.5%
2017   1,420    2,790   1,047       37.5%
GPUs are in very high demand. The new funding will allow for significant increases, but the demand is rocketing up!
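The percentages in the GPU table are consistent with allocated GPUs divided by requested GPUs (need). A minimal Python sketch of that arithmetic follows; it uses only the figures from the table, and the names `gpu_rac` and `allocation_rate` are illustrative, not from the slides.

```python
# GPU figures copied from the table above: year -> (supply, need, allocated).
gpu_rac = {
    2019: (1644, 6555, 1331),
    2018: (976, 4092, 840),
    2017: (1420, 2790, 1047),
}


def allocation_rate(need: int, allocated: int) -> float:
    """Return allocated GPUs as a percentage of requested GPUs."""
    return 100.0 * allocated / need


# Percentage of requested GPUs that were allocated, rounded to one decimal.
gpu_rates = {
    year: round(allocation_rate(need, allocated), 1)
    for year, (_supply, need, allocated) in gpu_rac.items()
}

for year, rate in gpu_rates.items():
    print(f"{year}: {rate}% of requested GPUs allocated")
```

Note that the rate is measured against need rather than supply: in every year listed, less was allocated than the supply figure shown.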
Storage Allocations
The first year that we did not have enough storage!
● Generally storage cannot be automatically scaled.
● New storage at Cedar and Graham helped us.
● Plus some rearrangement and discussions with users.
Type       Supply (TB)   Need (TB)   Allocated (TB)   Allocated/Need   Notes
Project    54,600        44,114      33,673           76%              Project-based, backed up
dCache     9,060         9,459       9,376            99%              Special projects (LHC, ..)
Cloud      5,184         4,802       3,850            80%              Platforms and Portals
Nearline   32,500        30,687      29,400           96%              Tape - 2 replicas
Total      101,344       89,063      76,299           86%
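The Allocated/Need column and the supply and allocated totals in the storage table can be reproduced from the per-type rows. The sketch below is illustrative only: the figures are copied from the table, while the names `storage_tb` and `storage_rates` are assumptions made for the example.

```python
# Storage figures in TB from the table above: type -> (supply, need, allocated).
storage_tb = {
    "Project": (54600, 44114, 33673),
    "dCache": (9060, 9459, 9376),
    "Cloud": (5184, 4802, 3850),
    "Nearline": (32500, 30687, 29400),
}

# Allocated/Need per storage type, as a whole-number percentage.
storage_rates = {
    name: round(100 * allocated / need)
    for name, (_supply, need, allocated) in storage_tb.items()
}

# Column totals for supply and allocated storage.
total_supply = sum(supply for supply, _need, _alloc in storage_tb.values())
total_allocated = sum(alloc for _supply, _need, alloc in storage_tb.values())

print(storage_rates)
print(total_supply, total_allocated)
```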
Good news: Transition Funding
Compute Canada’s request last year for “Transition Funding” was successful!
● CC made a case for transition funding between now and NewOrg (2022)
The Ministry of Innovation, Science and Economic Development Canada (ISED) has announced that it will “Invest $50 million in the immediate expansion of Advanced Research Computing (ARC) capacity at up to five existing national ARC host sites”.
Compute Canada has been working with ISED and expects substantial new resources to be installed in the next six months, hopefully available for RAC 2020. Details have not been announced yet, so keep an eye on the Compute Canada website.
2020 RAC Key Dates
Item                                                        Start           Finish
Fast Track submission (invitations sent before Sept. 24)    Sep 24, 2019    Oct 24, 2019
RRG & RPP full application submission                       Sep 24, 2019    Nov 7, 2019**, 11:59 PM EST
RPP Progress Report submission*                             November 2019   Jan 30, 2020*
Award letters sent*                                         Mid Mar 2020    Late Mar 2020
RAC 2020 allocations implemented*                           Mid Apr 2020    Early May 2020
* Final dates to be confirmed.
** No appeal process!
Compute Canada Q&A Dates
Q&A (English): October 08, 12:00 - 1:30 pm EDT
https://www.eventbrite.ca/e/compute-canada-rac-2020-qa-session-tickets-71690112055
Q&A (French): October 09, 12:00 - 1:30 pm EDT
https://www.eventbrite.ca/e/seance-dinformation-pour-le-concours-dallocation-de-ressources-2020-tickets-71789322797?ref=estw
RAC 2020 Changes
RPP and RRG Evaluation Criteria
The scores are now on a 5-point decimal scale with a new set of categories and descriptions.
Increased emphasis on management and necessary expertise.
Management plan: A management plan identifying the team, required expertise and governance has been recommended.
No explicit HQP section: This has been integrated into the management plan.
Decreased page limits: Generally 10 pages (decreased from 12).
HPC4Health Secure Cloud (maybe!)
New resource for RAC 2020.
Highly secure system currently used for health data.
Operations
Patrick Mann, WestGrid Director of Operations
Outages and Maintenance I
Graham
Aug 21, 2019: Power outage (electrical storm). All nodes rebooted. All jobs lost.
Aug 18, 2019: Scheduled outage for maintenance. Out all weekend.
Aug 15, 2019: /home performance issue. No jobs lost.
Jul 30, 2019: Graham cloud scheduled update. Out for the day.
Cedar
Aug 30, 2019: Short power brownout took cluster out. All jobs lost.
Aug 1, 2019: /project unresponsive.
Jul 21, 2019: Scheduled outage, July 15-21, to replace storage controllers. The new parts were defective, so the work had to await delivery of a new controller.
Jul 3, 2019: /project unresponsive.
Jun 28, 2019: MySQL database down (very large data upload).
Jun 12, 2019: /home and /scratch hardware issue. Controller cards replaced.
Jun 4, 2019: Filesystem issues.
Outages and Maintenance II
Béluga
Sep 3, 2019: /scratch unresponsive. Required whole cluster reboot. About 5 hours. All jobs lost. They’re working on this continuing issue.
Aug 27, 2019: /home and /project unresponsive. Scheduler not allowing new jobs. Needed restart.
Continuing (Aug 13, Jul 24, Jul 17, Jun 19, Jun 12, Jun 8, Jun 4): /scratch and /project intermittently unresponsive. They’re working on it.
Niagara
Aug 1, 2019: Scheduled maintenance shutdown for emergency power generator installation. Out for the day.
Outages and Maintenance III
Arbutus
Sep 25, 2019: Scheduled maintenance. External switch upgrades will result in external network interruption. VMs will continue to run.
Jul 30, 2019: Intermittent networking issues.
Jul 29, 2019: Login problem due to network issue.
Jul 16, 2019: Login problem.
Jun 28, 2019: New storage installed over a week. Some slowdown in performance.
Jun 13, 2019: Network maintenance. External network out. VMs continue running.
Cedar Notes
Disabled job submission from /home.
March/April 2019
/home and /scratch were hanging (both on the DDN SFA14K storage system).
Storage servers rebooting.
Issue was certain jobs running from /home.
● Decision to disable job submission from /home.
Firmware upgrades have been applied and seem to address some of the stability issues.
DDN 14K hardware replacements
July 15-17, extended to July 21
Replace storage components, including a chassis replacement.
Unexpectedly triggered a whole-filesystem recheck: 48 hours!
New parts were defective. Further replacements needed to be shipped.
Still some internal performance problems. Working with vendor.
Sep 19: new /home hardware has been racked!!! In production in October.
Continuing issues with jobs performing very intensive I/O: these can seriously affect the shared filesystems.
● Lots of small-file I/O hammers the metadata servers and can cause significant performance degradation.
● Continuing monitoring - jobs have been terminated.
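One common way to reduce small-file load on a shared filesystem is to pack many small files into a single archive, so the filesystem handles one large file instead of thousands of small ones. This mitigation is not stated on the slide; the sketch below is only an illustration using Python's standard `tarfile` module, and the helper name `bundle_small_files` and all file names are hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path


def bundle_small_files(src_dir: Path, archive_path: Path) -> int:
    """Pack every regular file under src_dir into one uncompressed tar archive.

    Returns the number of files added. Hypothetical helper for illustration.
    """
    count = 0
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(src_dir.rglob("*")):
            # Skip directories, and the archive itself if it lives in src_dir.
            if not path.is_file() or path.resolve() == archive_path.resolve():
                continue
            # Store paths relative to src_dir so the archive is relocatable.
            tar.add(path, arcname=str(path.relative_to(src_dir)))
            count += 1
    return count


# Demo in a throwaway directory: 100 tiny files become one archive.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "results"
    src.mkdir()
    for i in range(100):
        (src / f"part_{i:03d}.txt").write_text(f"value {i}\n")
    archive = Path(tmp) / "results.tar"
    n_bundled = bundle_small_files(src, archive)
    with tarfile.open(archive) as tar:
        n_members = len(tar.getmembers())

print(n_bundled, n_members)
```

Whether this suits a given workflow depends on how the job reads its data; an archive helps most when the files are read or written as a batch.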
Current Singularity issue: when a job running in a Singularity container exceeds its requested memory, the node fails and also hangs the filesystems that serve executables running on that node (all executables, not just those from the Singularity container).
Modern Development Practices
Ikenna Okpala, WestGrid Senior Developer
What we do: Dev Team
Senior Developer: Ikenna Okpala
● Has more than 12 years of experience in sectors ranging from government to medical to e-commerce.
● A specialist in web development, he subscribes to agile, pair programming (where required), behaviour driven (BDD), DRY, progressive enhancement, and open source approaches.
Junior Developer: Steven Bucholtz
● Steven first connected with WestGrid in 2018 (while acquiring a Bachelor’s in Computer Science), through a co-op position via the Canada Summer Jobs (CSJ) program.
● Prior to joining WestGrid, he worked as a network analyst, web developer, and quality assurance engineer.
Together, they are building solutions that create, extend, implement, and maintain scientific gateways and advanced computing software tools and databases for researchers.
Considering a project? Contact us: support@westgrid.ca
Summer/Autumn Intern: Courtney Gosselin
Characteristics
Self-organizing team
- We are self-starters.
- We choose the best ways to accomplish work through consensus.
- Can get work done irrespective of the situation (like absence of a team member).
Iterative development
- Product progresses in a series of two-week-long “sprint” cycles.
- Discovery through user story writing with the subject matter experts and users.
- Requirements are captured as items that form the “backlog”.
Software engineering
- Can develop or work on Monolith or Microservices
- Protocol-based communication (proxy pass or RESTful architecture)
- Continuous integration / delivery
Practices
- Agile (e.g. Scrum / Kanban)
  - Daily Stand-ups
  - Planning Meetings
  - Show and Tells
  - Retrospectives
- Test Driven Development (TDD/BDD)
- Pair programming
- Domain Driven Development
- Automation and Reproducibility
Implementation Tools
- DevOps
  - Terraform
  - Ansible
  - Concourse CI
  - OpenStack
- Frontend
  - Bootstrap
  - ReactJS
  - MomentJS
  - jQuery
  - etc.
- Backend
  - Python/Flask
  - Ruby/Sinatra/Grape/RoR
Projects
CCDB Portal
- Part of the national team
- Get allocated tasks.
CC Status Portal
- Full stack development of the portal
  - Frontend
  - Backend
  - DevOps
- Due to be released in the fall of 2019
ResistanceDB Portal
- Technical support and guidance
- Full stack development of the portal
  - Frontend, Backend, DevOps
CC Status
Compute Canada:
● Alerts notification and management system
Team (started by co-op students):
● Steven Bucholtz
● Christina Herbert
● Courtney Gosselin
● Devon MacNeil
Built as a monolith.
Going live in the fall of 2019.
ResistanceDB
Genome Canada:
● Large-scale advanced research project program
● Precision health, personalized medicine
$11 million
University of Calgary researchers:
● Ian Lewis
● Sergei Noskov
WestGrid contributing 1 FTE developer resource.
https://www.resistancedb.org
ResistanceDB: Pipeline Architecture
[Diagram: pipeline architecture. Labelled elements: Patient Sample, Dataset, CLS Culture process, CLS Data, Growth Data, Metabolomics Pipeline, Proteomics Pipeline, Genomics Pipeline, Merge, Scores etc. to CLS (WestGrid), Portal Display (WestGrid), Management/Logging/Auditing (WestGrid).]
Web Architecture
Attributes:
● Separation of concerns.
● Communication over protocols.
● RESTful endpoints.
● Virtualisation through OpenStack. Docker and other container tech.
● Data transfer format: JSON.
● Monitoring.
● Business layer can be reused.
● Flexibility of the Frontend to change. User flow activities.
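As a rough illustration of several of these attributes together (separation of concerns, a RESTful endpoint, JSON as the transfer format, and a reusable business layer), here is a minimal sketch. It is not the team's actual code: it uses only Python's standard library so it runs without installing anything, whereas the team's stack lists Python/Flask. The `/alerts` route, the sample records, and all function names are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlsplit

# Hypothetical sample records; a real portal would read these from a database.
ALERTS = [
    {"id": 1, "cluster": "cedar", "summary": "/project unresponsive"},
    {"id": 2, "cluster": "graham", "summary": "/home performance issue"},
    {"id": 3, "cluster": "cedar", "summary": "Filesystem issues"},
]


def list_alerts(cluster=None):
    """Business layer: plain Python with no knowledge of HTTP or JSON."""
    if cluster is None:
        return list(ALERTS)
    return [alert for alert in ALERTS if alert["cluster"] == cluster]


def handle_get(raw_path):
    """Protocol layer: map a request path to (HTTP status, JSON body)."""
    url = urlsplit(raw_path)
    if url.path != "/alerts":
        return 404, json.dumps({"error": "not found"})
    cluster = parse_qs(url.query).get("cluster", [None])[0]
    return 200, json.dumps(list_alerts(cluster))


class AlertHandler(BaseHTTPRequestHandler):
    """Thin HTTP adapter that delegates all decisions to handle_get."""

    def do_GET(self):
        status, body = handle_get(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Start the server; not called automatically, run serve() by hand."""
    HTTPServer(("127.0.0.1", port), AlertHandler).serve_forever()
```

Because `list_alerts` knows nothing about HTTP, the same function could sit behind a different frontend or a command-line tool, which is the sense in which a business layer can be reused.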
Architecture
DEMO
What’s Next
- Continue to contribute and help WestGrid community
- Continue with our Compute Canada contributions
- Contributing to existing open source projects
- Open sourcing some of our libraries
Contact us if you need any help with a platform project: patrick.mann@westgrid.ca
Thanks
Any Questions?
Upcoming User Training
Alex Razoumov, WestGrid Visualization & Training Coordinator
What we do: Visualize This
4th annual Visualize This competition
Website: https://computecanada.github.io/visualizeThis
WestGrid News Announcement
2019 Theme: Remote Parallel Rendering
● Participants can use their own research data, or we have a “default” dataset available
● You can use any parallel rendering tool
● Sep-30 default dataset path will be posted
● Nov-30 (midnight Pacific) submissions due
● Register your interest if you want updates in your mailbox
WestGrid Online Sessions
Every second Wednesday
● 10am Pacific
● 11am Mountain
● 11am/12pm Central
WestGrid’s bi-weekly webinars
Working with the Python DASK library: Wednesday, October 16
File access control approaches and best practices: Wednesday, October 30
Building a bioinformatics QC pipeline: Wednesday, November 13
Tips & tools for mining Twitter data for research: Wednesday, November 27
Geospatial analysis with high performance computing (HPC): Wednesday, December 11
http://bit.ly/wg2019b
● details of upcoming webinars
● links to slides and recordings from past webinars
In-person training
UAlberta Research Computing Bootcamp: Sept 23 - Oct 4
UVic Research Computing Winter School: Dec 10 - 13
http://bit.ly/wg2019b
● details and registration links
User Training Archive
https://westgrid.github.io/trainingMaterials
● Recently added:
  ○ Batch visualization on Compute Canada clusters
● Links to other guides, documentation & upcoming events
Support / Follow-up
If you have questions following this session, you can email us anytime:
support@westgrid.ca
or
support@computecanada.ca
We can also advocate on behalf of WestGrid member and user concerns within Compute Canada.