Operations in PL-Grid

Polish Infrastructure for Supporting Computational Science in the European Research Space
EUROPEAN UNION
Operations in PL-Grid
M. Radecki, T. Szepieniec, M. Krakowian, T. Szymocha, M. Zdybek, D. Harezlak, and J. Andrzejewski
ACC CYFRONET AGH
Cracow Grid Workshop, Cracow, 11.10.2010


Transcript of Operations in PL-Grid

Page 1: Operations in PL-Grid

Polish Infrastructure for Supporting Computational Science in the European Research Space

EUROPEAN UNION

Operations in PL-Grid

M. Radecki, T. Szepieniec, M. Krakowian, T. Szymocha, M. Zdybek, D. Harezlak, and J. Andrzejewski

ACC CYFRONET AGH

Cracow Grid Workshop, Cracow, 11.10.2010

Page 2: Operations in PL-Grid

Outline

Goal of Grid Operations
PL-Grid services for users
  • User registration and account management – PL-Grid Portal
  • Incident reporting
  • Usage monitoring
PL-Grid services for Polish NGI
  • service availability monitoring
  • grid usage accounting
  • issue tracking
High level view on EGI, NGI and PL-Grid Operations
Incident Management in PL-Grid
Grid Infrastructure Monitoring
Operations Communication and Documentation

Page 3: Operations in PL-Grid

Goal of PL-Grid Operations

coordinate and fulfill activities and processes required to provide and manage services for PL-Grid users

manage the technology required to provide and support these services

Page 4: Operations in PL-Grid

PL-Grid infrastructure services

Services for users
  • access to computing power and storage space in the 5 largest Polish computing centers
  • scientific software (e.g. Gaussian, Fluent, Povray)
  • user account management system
  • facilities to report problems & service requests
  • resource usage monitoring system
  • application portals and other tools for users (soon)

PL-Grid as Polish NGI is obliged to provide some services interfaced to EGI
  • service availability monitoring system
  • issue tracking and user support system
  • accounting (resource usage) system

Page 5: Operations in PL-Grid

User account management

Motivation: necessity to determine if a user is entitled to use PL-Grid resources
  • Registration process confirms that a user is a researcher affiliated with a Polish research unit, or a ward: undergraduates and PhD students authorized by a supervisor
  • Registration must be on-line for the user

Implementation: PL-Grid Portal based on the Liferay engine
  • Successful user registration results in a Portal account - the PL-Grid “entry point” for the user
  • Easily extended with new functionality using JSR 286 portlets
  • Ability to re-use the rich Liferay components library, e.g. forum, wiki

PL-Grid specific features
  • Easy personal certificate access - ability to get an X.509 certificate on-line
      • scope limited to PL-Grid services only
  • User account data integrated with PL-Grid tools & services
      • User login used for services allowing login/password authentication/authorization
  • Broadcast tool to contact all users
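The entitlement rule from the Motivation item can be sketched in a few lines of Python. This is only an illustration of the rule as stated on the slide, not the Portal's actual code: the `Applicant` class, its field names, and the assumption that a ward's supervisor must already be a registered researcher are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Applicant:
    # All field names are hypothetical; the real Portal data model is not shown in the slides.
    login: str
    affiliated_with_polish_research_unit: bool
    is_ward: bool = False                    # undergraduate or PhD student
    supervisor_login: Optional[str] = None   # supervisor who authorized the ward


def is_entitled(applicant: Applicant, registered_researchers: Set[str]) -> bool:
    """Return True if the applicant may use PL-Grid resources under the slide's rule."""
    if applicant.is_ward:
        # Assumption: a ward is accepted only when authorized by an already registered supervisor.
        return applicant.supervisor_login in registered_researchers
    # A regular applicant must be a researcher affiliated with a Polish research unit.
    return applicant.affiliated_with_polish_research_unit
```

Keeping the rule in one small function makes it easy to audit against the formal registration procedure.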

Page 6: Operations in PL-Grid

User account management – 1st year experiences

PL-Grid user registration opened at last year's CGW

PL-Grid Portal technology changed from Java Spring through Google Web Toolkit to Liferay

Agreed formal process description documents are indispensable

user registration procedure is important for the security of all PL-Grid computing centers

User statistics (as of 10.10.2010)
  • Registered users: 204
      • PL-Grid staff: 64
      • independent researchers: 56
      • wards: 84

[Figure: no. of registered users, Jan – Oct 2010]

Page 7: Operations in PL-Grid

PL-Grid Scientific Software & Helpdesk

PL-Grid offers access to both commercial and free scientific applications
  • NAMD, ADF, Blender, CFour, CPMD, Dalton, Fluent, Gamess, Gaussian, Gromacs, NWChem, Povray, Turbomole
  • Availability of software and current status are monitored and results are fed to the incident management system => higher availability for users
  • Users can check if a program failed due to their own fault or a computing center problem
  • Issues with monitoring
      • monitoring system designed for site admins, web interface unacceptable for users
      • consider possibility of using myEGI portal when available

PL-Grid Helpdesk allows reporting issues, problems and service requests
  • Reporting can be done via phone call, e-mail or the PL-Grid Helpdesk web interface; phone call reports are registered by an operator
  • Report registration returns an incident identifier to the user
      • allows the user to refer to and modify the incident later on
  • Incident is transferred to EGI level if the solution lies beyond the scope of the Polish NGI
      • still can be managed via PL-Grid Helpdesk

Page 8: Operations in PL-Grid

Resource Usage Monitoring System

Motivation: PL-Grid grant accounting, daily data reports for users
In the first available prototype, users can track their resource usage
  • status of jobs daily
  • daily workload (CPU time, walltime) per computing center

Currently used in parallel with EGI accounting - APEL
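A minimal sketch of the daily workload report described above, assuming job records arrive as dictionaries. The record keys (`day`, `center`, `cpu_s`, `wall_s`) and the center names in the test data are hypothetical; the slides do not describe the prototype's data format.

```python
from collections import defaultdict


def daily_workload(job_records):
    """Sum CPU time and walltime (in seconds) per (day, computing center)."""
    totals = defaultdict(lambda: {"cpu_s": 0, "wall_s": 0, "jobs": 0})
    for rec in job_records:
        key = (rec["day"], rec["center"])
        totals[key]["cpu_s"] += rec["cpu_s"]
        totals[key]["wall_s"] += rec["wall_s"]
        totals[key]["jobs"] += 1
    # Return a plain dict so callers do not create entries by accident.
    return dict(totals)
```

Grouping by (day, center) matches the per-center daily view the slide mentions; a per-user or per-grant view would only need a different key.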

Page 9: Operations in PL-Grid

EGI, NGI & PL-Grid Operations – high level view

[Diagram: Operations Support Teams and the Operations Support Tools they use]
Operations Support Teams: EGI: Central Operator on Duty; NGI: Regional Operator on Duty; Regional Technical Support; Site Administrators
Operations Support Tools: EGI Operations Dashboard; GGUS; PL-Grid Helpdesk; Web Svc (x2); Monitoring; JMS

Page 10: Operations in PL-Grid

PL-Grid Operations: Incident Management

“The main objective of incident management process is to resume regular state of affairs as quickly as possible and minimize the impact on business processes.”

Service Operation based on ITIL(R) V3
Identification
  • incidents are triggered by monitoring system, users or technical staff
Registration
  • issue tracking system (PL-Grid adapted Request Tracker)
  • incident reported by user or staff is always registered
  • only long-standing (>24h) problems reported by monitoring system are registered
Classification
  • regular middleware services / PL-Grid applications
Escalation
  • experts are responsible for making sure the problem is solved or reassigned
  • incidents can be escalated to EGI for software problems
Solution applied & Tested => Issue Closed
  • administrator of failed resource applies solution
  • triggers execution of the monitoring system probes
  • check if user is satisfied => if all OK, close incident
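The registration rule above (user and staff reports are always registered, monitoring reports only when the problem has lasted more than 24h) can be written down as a small predicate. This is a sketch under assumptions: the `Report` class and the source labels are hypothetical and do not reflect the actual Request Tracker integration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

LONG_STANDING = timedelta(hours=24)  # threshold taken from the slide (>24h)


@dataclass
class Report:
    source: str           # hypothetical labels: "user", "staff" or "monitoring"
    first_seen: datetime  # when the problem was first observed


def should_register(report: Report, now: datetime) -> bool:
    """Decide whether a report becomes a registered incident."""
    if report.source in ("user", "staff"):
        # Reports from people are always registered.
        return True
    if report.source == "monitoring":
        # Monitoring alarms are registered only once they are long-standing.
        return now - report.first_seen > LONG_STANDING
    raise ValueError(f"unknown report source: {report.source!r}")
```

The strict comparison follows the ">24h" wording, so a problem that is exactly 24 hours old is not yet registered.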

Page 11: Operations in PL-Grid

Incident Management – PL-Grid experience

Pro-active procedures for troubleshooting in first 24h
  • monitoring system reported incidents, involving Regional Technical Support

Incident solution process can be a useful source of knowledge
  • PL-Grid introduced Operational Problems Knowledge Base
  • Regional Technical Support team creates entries
  • data to be re-used when similar problem occurs again
  • publicly available - web pages indexed by search engines
  • entry contains full error message and detailed solution procedure - in case of problems, paste your error message in Google Search
  • KB population started in Aug 2009, ~50 entries
  • knowledge base link: https://weblog.plgrid.pl/category/1st-line-support/

Incident Management Metrics – evaluate performance
  • quantitative e.g. number of incidents, individual submitters, GGUS share etc.
  • focused on teams response time

Issues
  • team reaction time metrics indicate room for improvement, need to promote incident handling procedures among supporters/experts
  • Knowledge Base requires initial investment, but the more entries, the more it pays off
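The quantitative metrics listed above could be computed from exported incident records roughly as follows. This is an illustrative sketch: the record keys are hypothetical, "GGUS share" is assumed to mean the fraction of incidents escalated to GGUS, and the median is just one reasonable way to summarize reaction time.

```python
from datetime import datetime
from statistics import median


def incident_metrics(incidents):
    """Compute simple incident management metrics from a list of incident dicts."""
    n = len(incidents)
    submitters = {inc["submitter"] for inc in incidents}
    escalated = sum(1 for inc in incidents if inc["escalated_to_ggus"])
    # Reaction time: hours between registration and the first team response.
    reaction_h = [
        (inc["first_response"] - inc["registered"]).total_seconds() / 3600
        for inc in incidents
    ]
    return {
        "incidents": n,
        "individual_submitters": len(submitters),
        "ggus_share": escalated / n if n else 0.0,
        "median_reaction_h": median(reaction_h) if reaction_h else None,
    }
```

Tracking the same numbers every quarter would show whether promoting the incident handling procedures actually shortens reaction time.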

Page 12: Operations in PL-Grid

Grid Infrastructure Monitoring System

Motivation: not acceptable to wait for user to notify service problem
PL-Grid monitoring system is an extended version of the EGI Nagios-based system for grid services availability monitoring

PL-Grid extensions
  • monitoring PL-Grid scientific software
  • probes for availability of PL-Grid VO (vo.plgrid.pl)
  • other middleware services (being integrated)

Alarms sent to EGI message bus (based on ActiveMQ JMS implementation) and then displayed in EGI Operations Dashboard (incl. PL-Grid extensions)

Issues
  • core services poorly or not monitored
  • monitoring system triggers incidents, nice to have possibility to monitor trends and predict failures
  • no control system, services do not have a management interface – software maturity issue
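To give a feel for what a scientific software probe might look like, here is a minimal sketch in the Nagios plugin style, where the result is an exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus a one-line status message. The function name and the check itself (looking for an executable on PATH) are assumptions for illustration; the actual PL-Grid probes are not shown in the slides and most likely submit test jobs instead.

```python
import shutil

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_software(executable):
    """Return (exit_code, status_line) for a hypothetical software availability probe."""
    if not executable:
        return UNKNOWN, "UNKNOWN - no executable name given"
    path = shutil.which(executable)  # None when the program is not on PATH
    if path is None:
        return CRITICAL, f"CRITICAL - {executable} not found on PATH"
    return OK, f"OK - {executable} available at {path}"
```

A wrapper script would print the status line and exit with the returned code, which is how Nagios picks up the result.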

Page 13: Operations in PL-Grid

Operations Communication & Documentation

PL-Grid Operations Center is distributed, resources are located in geographically distant centers – requires other than F2F means of communication
  • Solving an operational problem requires interactive communication (better than e-mail)
  • Coordination of distributed teams requires procedures, work descriptions and handovers
  • PL-Grid uses bi-weekly teleconferences where operations issues can be discussed
  • Jabber service with automatically generated contact list of all registered PL-Grid staff
  • RTS fills daily handover reports and quarterly summary

Operational Documentation
  • Incident Handling in PL-Grid Helpdesk
      • https://weblog.plgrid.pl/procedura-obslugi-helpdesku/
  • Operational Procedures for ROD, RTS and site admins
      • https://weblog.plgrid.pl/procedury-operacyjne-pl-grid/

Page 14: Operations in PL-Grid

Questions?