INFSO-RI-508833 Enabling Grids for E-sciencE Operation and management issues in the EGEE/SWE grid...
INFSO-RI-508833
Enabling Grids for E-sciencE
www.eu-egee.org
Operation and management issues in the EGEE/SWE grid infrastructure
G. Barreira, G. Borges, M. David, N. Dias, J. Gomes, J. P. Martins
LIP: Laboratório de Instrumentação em Física Experimental de Partículas
C. Borrego, M. Delfino, G. Merino, K. Neuffer, A. Pacheco
PIC: Port d’Informació Científica
F. Bernabé, J. Fontán, J. Lopez, P. Rey
CESGA: Fundación Centro Tecnológico de Supercomputación de Galicia
R. Marco
IFCA/CSIC: Instituto de Física de Cantabria / Consejo Superior de Investigaciones Científicas
J. Palacios
IFIC/CSIC: Instituto de Física Corpuscular / Consejo Superior de Investigaciones Científicas
Operation and management issues in the EGEE/SWE grid infrastructure CGW’06 2
Enabling Grids for E-sciencE
INFSO-RI-031688
Outline
o The EGEE grid project.
o Main operation activities inside EGEE South-West grid infrastructure:
  – Resources;
  – Activities coordination:
      Certification;
        • Sites and middleware certification;
      Accounting;
        • EGEE View;
        • Participation in the Accounting Enforcement task;
      Monitoring;
        • Interaction with the Grid Operation Centre (GOC);
        • Participation in COD;
      Support;
        • Interaction with the Global Grid User Support (GGUS);
      Authentication and Security;
        • Activities in the EUGridPMA framework;
      Middleware tests and integration.
EGEE project
o The Enabling Grids for E-sciencE project:
  – A European-funded grid project;
  – The largest worldwide grid for multi-disciplinary sciences;
      Integrates several national and regional grids;
      More than 90 partners distributed over 32 countries;
  – Built on top of the infrastructures and software developed in the EDG and LCG grid projects.
o The LHC Computing Grid project:
  – The LHC will be the world’s most powerful particle accelerator;
      Built at CERN and expected to start operating in 2007;
  – LCG aims to build and maintain a data storage and analysis infrastructure for the large LHC physics community:
      15 petabytes of experimental data annually;
      Available during the 15-year lifetime of the LHC machine;
      Fully accessible to ~5000 scientists from more than 500 institutes.
EGEE project
o EGEE concentrates on three core areas:
  – Improve and maintain the middleware;
      Provide a reliable service;
  – Attract new users from industry as well as from science;
      Ensure they receive a high standard of training and support;
  – Combine national, regional and thematic Grid efforts;
      For a seamless Grid infrastructure for scientific research and to build a sustainable Grid for business research and industry.
o EGEE has expanded beyond its original two scientific fields (high energy physics and life sciences) and now integrates applications from other scientific fields:
  – Astrophysics; Biomedical and bioinformatics applications;
  – Computational chemistry; Earth sciences;
  – Finance; Fusion; Geophysics;
  – (...)
o EGEE supports more than 100 virtual organizations.
EGEE Operations: The GOC
o The Grid Operations Centre is responsible for coordinating the overall operation of the EGEE Grid:
  – Devises and manages mechanisms and procedures that encourage optimal operation of the Grid;
  – Acts as a central point of operational information, such as:
      Site local and central services;
      Site resources configuration;
      Contact details;
  – Monitors the operation of the Grid infrastructure as a whole;
      GOC works with the federation local support groups to assist them in providing the best possible service while their infrastructure is connected to the Grid.
EGEE Operations: The ROCs
o The fulfillment of each federation’s key objectives is supervised by its Regional Operation Centre (ROC):
  – Operates essential core services;
      RBs, data management services, information services, VOMS servers;
  – Acts as the interface between VO requests and site resources;
  – Provides monitoring and operational troubleshooting services;
  – Receives, responds to and coordinates the resolution of grid operation problems, from both the sites’ and the users’ point of view.
– South-Western Europe
– France
– UK/Ireland
– Northern Europe
– Germany/Switzerland
– CERN
– Italy
– Central Europe
– South Eastern Europe
– Russia
– Asia/Pacific
South-West federation
o The EGEE South-West federation is part of the European Grid Operation, Support and Management activity (SA1).
o It is responsible for maintaining high-quality grid infrastructure services inside the South-West region:
  – Portuguese: LIP;
  – Spanish: CESGA, CSIC, PIC, CIEMAT, BIFI;
  – PIC is the “Tier 1” centre of the SWE federation.
o The EGEE SWE ROC is shared among the different institutes:
  – This requires a greater coordination effort;
      All operations/management questions are reported weekly to the ROC manager during a VRVS meeting;
      This promotes communication between the different site managers;
      It also promotes the knowledge exchange necessary for a faster resolution of problems.
South-West federation resources
o EGEE South-West federation is presently offering…
  – Core services for the production testbed (13/10/2006):
      8 Resource Brokers;
      8 top BDII machines;
      3 LFC central catalogs;
      1 FTS service.
  – Local services for the production infrastructure:
      18 Computing Elements;
        • 1052 CPUs = 935.2 normalized CPUs
          (Norm = 1000 SpecInt2000 = Pentium IV @ 2.8 GHz);
      18 Storage Elements;
        • 35.4 TB of online storage (disk);
        • 1.5 PB of nearline storage (tape backend).
  – These resources are currently shared, according to the federation’s internal policies, by more than 20 virtual organizations.
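The CPU normalization quoted above can be illustrated with a short calculation. This is only a sketch: it assumes that normalized capacity is the CPU count multiplied by the per-CPU SpecInt2000 rating and divided by the 1000-SpecInt2000 reference. The function names are illustrative; the only figures taken from the slide are 1052 CPUs and 935.2 normalized CPUs.

```python
REFERENCE_SI2K = 1000.0  # one normalized CPU, as defined on the slide


def normalized_cpus(cpu_count, si2k_per_cpu):
    """Capacity in normalized CPUs (assumed rule: count * rating / reference)."""
    return cpu_count * si2k_per_cpu / REFERENCE_SI2K


def implied_average_si2k(cpu_count, normalized_total):
    """Average per-CPU SpecInt2000 rating implied by a normalized total."""
    return normalized_total * REFERENCE_SI2K / cpu_count


# Figures from the slide: 1052 physical CPUs, 935.2 normalized CPUs.
avg = implied_average_si2k(1052, 935.2)
print(f"implied average rating: {avg:.0f} SpecInt2000 per CPU")
```

Under this assumption the federation's average CPU rates at roughly 889 SpecInt2000, slightly below the reference machine.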
SWE ROC tasks: Site certification
o The SWE ROC is responsible for certifying that a site fulfills the requirements to join the grid production infrastructure:
  – Performed by LIP in Portugal;
  – Performed by PIC in Spain;
  – The certification process consists of a set of demanding tests:
      Information system;
      Site configuration;
      Interactions with the central core services.
  – The ROC negotiates service level agreements (SLAs):
      These settle the level of service each Resource Centre (RC) should provide to the infrastructure.
SWE ROC tasks: Accounting
o The EGEE South-West federation was one of the first to widely deploy grid accounting tools;
  – CESGA is responsible, within the South-West federation, for maintaining the accounting portal;
  – The most relevant information is compiled monthly and reported to the ROC and federation members.
o Due to its expertise, CESGA was proposed as the entity responsible for the “Accounting enforcement task”…
  – Monitor the whole EGEE infrastructure;
  – Check that all Resource Centres are publishing correct accounting information, and open tickets if they are not;
  – Help the Resource Centres to deploy the necessary accounting tools;
o … and to take charge of the “EGEE View”:
  – Portal with accounting information from all EGEE sites.
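The enforcement check described above (find Resource Centres that are not publishing accounting data and open tickets for them) could be sketched as follows. The site names, the seven-day threshold and the `open_ticket` helper are hypothetical; the slides do not describe the interfaces of the accounting portal or the ticketing system.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=7)  # assumed staleness threshold, not from the slides


def stale_sites(last_published, today, max_age=MAX_AGE):
    """Return sites whose latest accounting record is missing or too old."""
    stale = []
    for site, last in sorted(last_published.items()):
        if last is None or today - last > max_age:
            stale.append(site)
    return stale


def open_ticket(site):
    """Hypothetical stand-in for opening a ticket against a Resource Centre."""
    return f"Ticket: {site} is not publishing accounting information"


# Invented example data: site name -> date of the last published record.
last_published = {
    "SITE-A": date(2006, 10, 12),
    "SITE-B": date(2006, 9, 20),
    "SITE-C": None,
}
tickets = [open_ticket(s) for s in stale_sites(last_published, date(2006, 10, 13))]
for ticket in tickets:
    print(ticket)
```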
SWE ROC tasks: Accounting
Some SWE accounting charts (figures not reproduced). Totals annotated on the charts: 949658 jobs; 3504204 hours; 2870184 hours.
SWE ROC tasks: Accounting
Some “EGEE View” charts
SWE ROC tasks: Monitoring
o CIC on Duty (COD) is carried out by Telefonica I+D, helped by PIC;
o CODs are grid expert teams which manage the day-to-day operation of the grid:
  – Active monitoring of the infrastructure;
  – Take appropriate action to protect the grid from the effects of failing components and to recover from operational problems.
      Ex: a Resource Centre is causing problems by generating invalid information;
        • COD team opens a ticket to the Resource Centre;
        • COD team contacts the corresponding ROC operations support line;
        • COD team informs a network operations centre of suspected failures;
        • COD may remove the RC from the grid if the RC is unresponsive, until the problem has been fixed;
  – Many of these support and troubleshooting roles are undertaken in conjunction with the Regional Operation Centres;
      It is intended that tools will be developed to automate much of this work.
SWE ROC tasks: Monitoring
o CESGA maintains a GridICE portal for all the SWE RCs.
  – The GridICE server collects information through specific sensors included in the EGEE middleware:
      Job information, grid service and fabric monitoring data.
  – Based on some plugins for Nagios, which:
      Collect the data published by the sites;
      Keep it in a PostgreSQL database;
      Show it on a web page.
  – GridICE also sends e-mail notifications about changes in the status of the sites (hosts, important processes, etc.).
o CESGA is also responsible for the SWE monitoring alert system, based on SFT/SAM results and GStat:
  – Site Availability Monitoring (SAM):
      Collection of comprehensive tests that are run daily on each certified site;
  – GStat monitor:
      A snapshot of the Grid Information System.
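As an illustration of the alert system described above, the sketch below turns a batch of daily test results into per-site alerts. It assumes that each result is a `(site, test, status)` tuple and that any status other than `"ok"` triggers an alert; the record layout, test names and status strings are invented for the example and are not the actual SAM schema.

```python
from collections import defaultdict


def sites_to_alert(results):
    """Map each site with at least one failed test to its sorted failed tests."""
    failed = defaultdict(list)
    for site, test, status in results:
        if status != "ok":
            failed[site].append(test)
    return {site: sorted(tests) for site, tests in failed.items()}


# Invented example results for two sites.
results = [
    ("SITE-A", "job-submit", "ok"),
    ("SITE-A", "replica-mgmt", "error"),
    ("SITE-B", "job-submit", "ok"),
]
alerts = sites_to_alert(results)
for site, tests in alerts.items():
    print(f"ALERT {site}: failed {', '.join(tests)}")
```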
SWE ROC tasks: Support
o The regional EGEE South-West federation help desk portal is maintained by CSIC-IFIC:
  – Users/admins from the SWE federation can open tickets;
o The coordination of the user support services inside the federation is handled by LIP:
  – It is LIP’s responsibility to follow all tickets assigned to the SWE federation;
  – LIP makes sure that they are routed to the correct RC and solved in time;
  – The SWE ROC is automatically warned (and acts accordingly) when:
      Tickets are opened by users or COD staff on federation sites;
      SAM or any other monitoring tool reports failures…
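The follow-up duty described above (route each ticket to the right RC and make sure it is solved in time) might be expressed as in the sketch below. The ticket fields, the site-to-RC mapping, the fallback support unit and the three-day deadline are all hypothetical; the slides do not specify them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

DEADLINE = timedelta(days=3)  # assumed target, not stated in the slides


@dataclass
class Ticket:
    ticket_id: int
    site: str
    opened: datetime
    solved: bool = False


def route(ticket, rc_contacts, fallback="roc-support"):
    """Pick the support unit for a ticket; unknown sites go to the ROC."""
    return rc_contacts.get(ticket.site, fallback)


def overdue(tickets, now, deadline=DEADLINE):
    """IDs of unsolved tickets that have been open longer than the deadline."""
    return [t.ticket_id for t in tickets
            if not t.solved and now - t.opened > deadline]


rc_contacts = {"SITE-A": "site-a-admins"}  # invented mapping
tickets = [
    Ticket(1, "SITE-A", datetime(2006, 10, 9, 9, 0)),
    Ticket(2, "SITE-X", datetime(2006, 10, 12, 9, 0)),
]
now = datetime(2006, 10, 13, 9, 0)
print([route(t, rc_contacts) for t in tickets])
print(overdue(tickets, now))
```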
SWE ROC tasks: Support
o The SWE help desk portal interacts with the EGEE Global Grid User Support (GGUS);
o GGUS is a trouble-ticketing application:
  – Grid users and administrators can open tickets asking for help;
      Users can start a ticket using independent regional portals. Local experts can try to solve the problem or assign it to the central GGUS service;
      A ticket can also be opened directly in GGUS via a web form or email;
  – The first line of support is provided by “Ticket Processing Managers” (TPMs):
      TPM teams are composed of 3 Grid experts, who change on a weekly basis;
      TPMs either provide a solution to a given grid operation problem or assign the issue to a more specialized support unit.
  – Support is provided 5 days a week, 9 hours a day;
  – GGUS is used to start COD trouble tickets when the monitoring jobs fail;
o LIP contributes one “Ticket Processing Manager” team to the general GGUS tasks.
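The support schedule described above (5 days a week, 9 hours a day, with a weekly change of TPM duty) can be sketched in a few lines. Several details are assumptions made only for illustration: the 09:00 start of the daily window, the Monday to Friday working week, the round-robin rule keyed on the ISO week number, and the team names.

```python
from datetime import datetime

SUPPORT_START_HOUR = 9  # assumed; the slides give only the 9-hour length
SUPPORT_HOURS = 9


def in_support_window(moment):
    """True on Monday-Friday within the assumed 9-hour daily window."""
    is_weekday = moment.weekday() < 5
    end_hour = SUPPORT_START_HOUR + SUPPORT_HOURS
    return is_weekday and SUPPORT_START_HOUR <= moment.hour < end_hour


def team_on_duty(teams, moment):
    """Round-robin over teams by ISO week number (assumed rotation rule)."""
    week = moment.isocalendar()[1]
    return teams[week % len(teams)]


teams = ["team-A", "team-B", "team-C"]  # invented team names
friday = datetime(2006, 10, 13, 10, 0)
saturday = datetime(2006, 10, 14, 10, 0)
print(in_support_window(friday), in_support_window(saturday))
print(team_on_duty(teams, friday))
```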
SWE ROC tasks: Support
Regional SWE help-desk
SWE ROC tasks: Authentication and Security
o Valid certificates for EGEE in the SWE region are issued by:
  – LIP, through the LIP Certification Authority (LIPCA), in Portugal;
  – CSIC-IFCA and PK-IRISGRID in Spain.
o These CAs are members of the European Policy Management Authority for Grid Authentication in e-Science (EUGridPMA).
  – EUGridPMA coordinates a Public Key Infrastructure (PKI) used to issue X.509 certificates;
o SWE CAs participate in the EUGridPMA body and in the revision of the CP/CPS (Certificate Policy/Certification Practice Statement).
o LIP (in Portugal) and RED.ES (in Spain) are responsible for security coordination and for handling security incidents.
SWE ROC tasks: Middleware integration
o gLite is the middleware layer developed by EGEE.
  – Extends the use of the grid infrastructure to all fields of science;
  – Follows a Service Oriented Architecture (SOA):
      Decreases the middleware dependence on the user’s applications and interactions with the different services.
o gLite middleware does not support all local resource management systems (LRMS):
  – Only the LSF and Torque/Maui batch schedulers are supported by default;
  – LIP and CESGA, together with IC, are involved in an EGEE task force to provide gLite support for the SGE batch system:
      New jobmanager implementation;
      New infoprovider scripts;
      Upgrade of the yaim installation procedure.
SWE pre-production testbed
o In parallel with the EGEE production testbed, some SWE sites also participate in a pre-production testbed:
  – CESGA, CSIC-IFIC, LIP and PIC;
o Objectives of the pre-production testbed:
  – Test new middleware releases;
      First contact with new services;
      Test all service interactions/interconnections;
      Report bugs to the developers;
      Test bug fixes;
  – Release to the production testbed the middleware packages/patches that were correctly validated;
o The SWE ROC participates in the validation of middleware components and helps with their deployment in the RCs.
Summary & Conclusions
o We have presented the main EGEE SWE federation activities:
  – Its resources for the production testbed;
  – Its operation and regional management procedures;
  – Its responsibilities in some general EGEE tasks:
      Certification; Accounting; Support; Monitoring; Authentication; Middleware tests and integration;
  – Further details regarding EGEE SWE federation activities can be obtained by consulting the SWE portal maintained by CSIC-IFCA.
o This presentation aims to give a better understanding of the EGEE project and its fundamental organization, and to show how the different resources work together to deliver high-quality services to the users.