PL-Grid – Status and Plans The first functioning National Grid Initiative in Europe
Operations in PL-Grid
Polish Infrastructure for Supporting Computational Science
in the European Research Space
EUROPEAN UNION
Operations in PL-Grid
M. Radecki, T. Szepieniec, M. Krakowian,
T. Szymocha, M. Zdybek, D. Harezlak, and J. Andrzejewski
ACC CYFRONET AGH
Cracow Grid Workshop
Cracow, 11.10.2010
2
Outline
• Goal of Grid Operations
• PL-Grid services for users
  • User registration and account management – PL-Grid Portal
  • Incident reporting
  • Usage monitoring
• PL-Grid services for Polish NGI
  • service availability monitoring
  • grid usage accounting
  • issue tracking
• High level view on EGI, NGI and PL-Grid Operations
• Incident Management in PL-Grid
• Grid Infrastructure Monitoring
• Operations Communication and Documentation
3
Goal of PL-Grid Operations
• coordinate and fulfill activities and processes required to provide and manage services for PL-Grid users
• manage the technology required to provide and support these services
4
PL-Grid infrastructure services
• Services for users
  • access to computing power and storage space in the 5 largest Polish computing centers
  • scientific software (e.g. Gaussian, Fluent, Povray)
  • user account management system
  • facilities to report problems & service requests
  • resource usage monitoring system
  • application portals and other tools for users (soon)
• PL-Grid, as the Polish NGI, is obliged to provide some services interfaced to EGI
  • service availability monitoring system
  • issue tracking and user support system
  • accounting (resource usage) system
5
User account management
• Motivation: necessity to determine whether a user is entitled to use PL-Grid resources
  • Registration process confirms that a user is a researcher affiliated with a Polish research unit, or a ward (undergraduate or PhD student) authorized by a supervisor
  • Registration must be on-line for the user
• Implementation: PL-Grid Portal based on the Liferay engine
  • Successful user registration results in a Portal account, the PL-Grid "entry point" for the user
  • Easily extended with new functionality using JSR 286 portlets
  • Ability to re-use the rich Liferay component library, e.g. forum, wiki
• PL-Grid specific features
  • Easy personal certificate access: ability to get an X.509 certificate on-line
    • scope limited to PL-Grid services only
  • User account data integrated with PL-Grid tools & services
    • User login used for services allowing login/password authentication/authorization
  • Broadcast tool to contact all users
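The entitlement rule from this slide can be sketched as a small check. This is only an illustration in Python: the `Applicant` record, its field names and the `is_entitled` function are hypothetical and do not describe the actual PL-Grid Portal code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Applicant:
    """Hypothetical registration record; all field names are assumptions."""
    login: str
    affiliated_with_polish_research_unit: bool
    is_ward: bool = False                    # undergraduate or PhD student
    supervisor_login: Optional[str] = None   # supervisor who authorized the ward


def is_entitled(applicant: Applicant) -> bool:
    """Return True if the applicant may use PL-Grid resources.

    Researchers must be affiliated with a Polish research unit;
    wards must be authorized by a supervisor.
    """
    if applicant.is_ward:
        return applicant.supervisor_login is not None
    return applicant.affiliated_with_polish_research_unit
```

For example, `is_entitled(Applicant("student1", False, is_ward=True))` is rejected until a supervisor login is attached to the record.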
6
User account management – 1st year experiences
• PL-Grid user registration opened at last year's CGW
• PL-Grid Portal technology changed from Java Spring through Google Web Toolkit to Liferay
• Agreed formal process description documents are indispensable
  • user registration procedure is important for all PL-Grid computing centers and for security
• User statistics (as of 10.10.2010)
  • Registered users: 204
    • PL-Grid staff: 64
    • independent researchers: 56
    • wards: 84
[Chart: number of registered users, Jan – Oct 2010]
7
PL-Grid Scientific Software & Helpdesk
• PL-Grid offers access to both commercial and free scientific applications
  • NAMD, ADF, Blender, CFour, CPMD, Dalton, Fluent, Gamess, Gaussian, Gromacs, NWChem, Povray, Turbomole
  • Availability of software and current status are monitored and results are fed to the incident management system: higher availability for users
  • Users can check whether a program failed due to their own fault or a computing center problem
  • Issues with monitoring
    • monitoring system designed for site admins; web interface unacceptable for users; consider using the myEGI portal when available
• PL-Grid Helpdesk allows reporting issues, problems and service requests
  • Reporting can be done via phone call, e-mail or the PL-Grid Helpdesk web interface; phone call reports are registered by an operator
  • Report registration returns an incident identifier to the user
    • allows referring to and modifying the incident later on
  • Incident transferred to EGI level if the solution lies beyond the scope of the Polish NGI
    • can still be managed via PL-Grid Helpdesk
8
Resource Usage Monitoring System
• Motivation: PL-Grid grant accounting, daily data reports for users
• In the first available prototype, users can track their resource usage
  • daily status of jobs
  • daily workload (CPU time, walltime) per computing center
• Currently used in parallel with EGI accounting (APEL)
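A daily per-center workload report of the kind mentioned on this slide could be computed roughly as follows. This is a minimal sketch, not the PL-Grid prototype: the `JobRecord` layout, the hour units and the `daily_workload` function are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, NamedTuple, Tuple


class JobRecord(NamedTuple):
    """Hypothetical accounting record for one finished job."""
    user: str
    center: str        # computing center name (placeholder values in examples)
    day: date          # day the job is accounted to
    cpu_hours: float
    wall_hours: float


def daily_workload(
    records: Iterable[JobRecord], user: str
) -> Dict[Tuple[date, str], Tuple[float, float]]:
    """Sum CPU time and walltime per (day, center) for a single user."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for record in records:
        if record.user != user:
            continue  # a user only sees their own usage
        entry = totals[(record.day, record.center)]
        entry[0] += record.cpu_hours
        entry[1] += record.wall_hours
    return {key: (cpu, wall) for key, (cpu, wall) in totals.items()}
```

Keying the result by (day, center) keeps the report aligned with the two views named above: one row per day and per computing center.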
9
EGI, NGI & PL-Grid Operations – high level view
[Diagram: Operations Support Teams (EGI: Central Operator on Duty; NGI: Regional Operator on Duty; Regional Technical Support; Site Administrators) use Operations Support Tools (EGI Operations Dashboard, GGUS, PL-Grid Helpdesk, Monitoring). Other labels in the figure: Web Svc, JMS.]
10
PL-Grid Operations: Incident Management
"The main objective of incident management process is to resume regular state of affairs as quickly as possible and minimize the impact on business processes."
Service Operation based on ITIL(R) V3
• Identification
  • incidents are triggered by the monitoring system, users or technical staff
• Registration
  • issue tracking system (PL-Grid adapted Request Tracker)
  • an incident reported by a user or staff is always registered
  • only long-standing (>24h) problems reported by the monitoring system are registered
• Classification
  • regular middleware services / PL-Grid applications
• Escalation
  • experts are responsible for making sure the problem is solved or reassigned
  • incidents can be escalated to EGI for software problems
• Solution applied & Tested => Issue Closed
  • administrator of the failed resource applies the solution
  • triggers execution of the monitoring system probes
  • check if the user is satisfied => if all OK, close the incident
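The registration and closing rules on this slide can be written down as two small predicates. The sketch below is illustrative only: the `Source` enum and the function names are hypothetical, and the real process runs inside the PL-Grid adapted Request Tracker rather than in code like this.

```python
from datetime import datetime, timedelta
from enum import Enum


class Source(Enum):
    """Who reported the problem (names are assumptions for this sketch)."""
    USER = "user"
    STAFF = "staff"
    MONITORING = "monitoring"


# Threshold taken from the slide: only problems standing longer than 24h
# are registered when they come from the monitoring system.
LONG_STANDING = timedelta(hours=24)


def should_register(source: Source, first_seen: datetime, now: datetime) -> bool:
    """Decide whether a reported problem becomes a registered incident."""
    if source in (Source.USER, Source.STAFF):
        return True  # reports from people are always registered
    return now - first_seen > LONG_STANDING


def can_close(probes_ok: bool, user_satisfied: bool) -> bool:
    """An incident is closed only if re-run probes pass and the user agrees."""
    return probes_ok and user_satisfied
```

With these rules, a monitoring alarm that clears within a day never reaches the issue tracker, while any report from a user or staff member does.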
11
Incident Management – PL-Grid experience
• Pro-active procedures for troubleshooting incidents reported by the monitoring system in their first 24h, involving Regional Technical Support
• Incident solution process can be a useful source of knowledge
  • PL-Grid introduced the Operational Problems Knowledge Base
    • Regional Technical Support team creates entries
    • data to be re-used when a similar problem occurs again
    • publicly available: web pages indexed by search engines
    • entry contains the full error message and a detailed solution procedure; in case of problems, paste your error message in Google Search
    • KB population started in Aug 2009, ~50 entries
    • knowledge base link: https://weblog.plgrid.pl/category/1st-line-support/
• Incident Management Metrics – evaluate performance
  • quantitative, e.g. number of incidents, individual submitters, GGUS share etc.
  • focused on teams' response time
• Issues
  • team reaction time metrics indicate room for improvement; need to promote incident handling procedures among supporters/experts
  • Knowledge Base requires an initial investment, but the more entries, the more it pays off
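The quantitative metrics listed on this slide could be derived from ticket data along these lines. This is a sketch under assumptions: the `Ticket` fields are hypothetical, and "GGUS share" is interpreted here as the fraction of incidents escalated to GGUS, which the slide does not define.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Ticket:
    """Hypothetical helpdesk ticket; field names are assumptions."""
    submitter: str
    created: datetime
    first_response: Optional[datetime]  # None if nobody has reacted yet
    escalated_to_ggus: bool = False


def incident_metrics(tickets: List[Ticket]) -> Dict[str, object]:
    """Compute simple counts plus the mean team response time."""
    delays = [
        t.first_response - t.created
        for t in tickets
        if t.first_response is not None
    ]
    mean_response = sum(delays, timedelta()) / len(delays) if delays else None
    ggus = sum(1 for t in tickets if t.escalated_to_ggus)
    return {
        "incidents": len(tickets),
        "submitters": len({t.submitter for t in tickets}),
        "ggus_share": ggus / len(tickets) if tickets else 0.0,
        "mean_response": mean_response,
    }
```

Unanswered tickets are left out of the mean so that they do not distort it; a real report would probably list them separately.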
12
Grid Infrastructure Monitoring System
• Motivation: not acceptable to wait for a user to report a service problem
• PL-Grid monitoring system is an extended version of the EGI Nagios-based system for grid services availability monitoring
• PL-Grid extensions
  • monitoring PL-Grid scientific software
  • probes for availability of PL-Grid VO (vo.plgrid.pl)
  • other middleware services (being integrated)
• Alarms sent to EGI message bus (based on ActiveMQ JMS implementation) and then displayed in EGI Operations Dashboard (incl. PL-Grid extensions)
• Issues
  • core services poorly or not monitored
  • monitoring system triggers incidents; nice to have the possibility to monitor trends and predict failures
  • no control system; services do not have a management interface (software maturity issue)
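A scientific-software probe for a Nagios-based system might look roughly like the sketch below. It follows the standard Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN), but the check itself is an assumption: it only looks for an executable on PATH, whereas the actual PL-Grid probes are not described in the slides and may well run test jobs instead. The names `check_software` and `main` are hypothetical.

```python
import shutil
from typing import List, Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_software(executable: str) -> Tuple[int, str]:
    """Return (status code, status line) for one software package."""
    if not executable:
        return UNKNOWN, "UNKNOWN - no executable name given"
    path = shutil.which(executable)  # None when not found on PATH
    if path is None:
        return CRITICAL, f"CRITICAL - {executable} not found on PATH"
    return OK, f"OK - {executable} found at {path}"


def main(argv: List[str]) -> int:
    """Print the status line and return the code for the scheduler."""
    name = argv[1] if len(argv) > 1 else ""
    code, message = check_software(name)
    print(message)
    return code
```

A wrapper script would call `sys.exit(main(sys.argv))` so that Nagios receives the status through the process exit code.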
13
Operations Communication & Documentation
• PL-Grid Operations Center is distributed, with resources located in geographically distant centers; this requires means of communication other than face-to-face
  • Solving operational problems requires interactive communication (better than e-mail)
  • Coordination of distributed teams requires procedures, work descriptions and handovers
• PL-Grid uses
  • bi-weekly teleconferences where operations issues can be discussed
  • Jabber service with an automatically generated contact list of all registered PL-Grid staff
  • RTS fills in daily handover reports and a quarterly summary
• Operational Documentation
  • Incident Handling in PL-Grid Helpdesk
    • https://weblog.plgrid.pl/procedura-obslugi-helpdesku/
  • Operational Procedures for ROD, RTS and site admins
    • https://weblog.plgrid.pl/procedury-operacyjne-pl-grid/
14
Questions?