EGEE-III INFSO-RI-222667
Enabling Grids for E-sciencE
www.eu-egee.org
EGEE and gLite are registered trademarks
Maite Barroso, SA1 activity leader, CERN
EGEE-III First Review, 24-25 June, 2009
Grid Operations: SA1 Status Report
SA1 Activity Overview
Country          Total PM planned at M24 (1)   Total FTE
Austria          37                            1.5
Belgium
Bulgaria         60                            2.5
CERN             420                           17.5
Croatia          47                            2.0
Cyprus           47                            2.0
Czech Republic   58                            2.4
Finland          24                            1.0
France           450                           18.8
Germany          392                           16.3
Greece           131                           5.5
Hungary          38                            1.6
Ireland          36                            1.5
Israel           52                            2.2
Italy            468                           19.5
Netherlands      204                           8.5
Norway
Poland           152                           6.3
Portugal         100                           4.2
Romania          57                            2.4
Russia           424                           17.7
Serbia           55                            2.3
Slovakia         33                            1.4
Slovenia         16                            0.7
Spain            317                           13.2
Sweden           120                           5.0
Switzerland      24                            1.0
Turkey           66                            2.8
UK               372                           15.5
Total            4200                          175.0
28 countries, 175 FTE
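The FTE column in the table above is consistent with dividing the planned person-months by the 24 months implied by "M24". The slides do not state this conversion, so the sketch below treats it as an assumption; the function name `pm_to_fte` is illustrative only.

```python
# Assumption: FTE = planned person-months / 24 project months ("M24").
# The slides do not state the formula; it simply reproduces the table values.
PROJECT_MONTHS = 24


def pm_to_fte(person_months: float, months: int = PROJECT_MONTHS) -> float:
    """Convert planned person-months into average full-time equivalents."""
    return person_months / months


# A few rows taken from the table, plus the stated total.
planned_pm = {"CERN": 420, "France": 450, "Italy": 468, "Slovenia": 16, "Total": 4200}

for name, pm in planned_pm.items():
    print(f"{name}: {pm} PM is about {pm_to_fte(pm):.1f} FTE")
```

Under this assumption the total of 4200 PM corresponds to the 175 FTE quoted on the slide.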
Pie chart, percentage per activity: NA1 2%, NA2 5%, NA3 8%, NA4 19%, NA5 1%, SA1 49%, SA2 2%, SA3 9%, JRA1 6%
Grid Operations
• Reliable, multi-VO, large scale production infrastructure
• Uninterrupted service
• Operational processes, tools and documentation
• Worldwide collaboration between ROCs and sites
Size of the infrastructure
Figure: Number of EGEE-III certified sites
Figure: Number of EGEE-III certified sites per region
Computing resources:
• 155 MSI2k at the end of January 2009
• already more than the 124 MSI2k planned for the end of the project!
Storage resources:
• Currently deployed information providers have known issues, unreliable data
• Ongoing initiative, started by WLCG, to review and fix them
• Foreseen for Y2
Usage of the infrastructure (I)
Figure: Monthly production normalized CPU time by VO
Number of EGEE-III certified sites per region
• Remarkable increase in the usage of the grid resources
Figure: Monthly production normalized CPU time by ROC
• Steady increase in the usage of the grid resources by most VOs
• Some of the larger VOs show considerable fluctuations, due to specific challenges
• Substantial increase for some VOs: ATLAS, LHCb and CMS
Usage of the infrastructure (II)
Figure: Number of jobs
• Steadily increasing until October 2008, stable since then
• 10 million jobs per month
• 370,000 jobs/day (188,000 last year; it has roughly doubled since then)
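The job figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: the 30-day month is an assumption, and the slide values are rounded, so the results are approximate.

```python
# Figures quoted on the slide (rounded values).
jobs_per_day_now = 370_000
jobs_per_day_last_year = 188_000

# Assumption: a 30-day month, used only for a rough monthly estimate.
DAYS_PER_MONTH = 30

growth_factor = jobs_per_day_now / jobs_per_day_last_year
jobs_per_month_estimate = jobs_per_day_now * DAYS_PER_MONTH

print(f"Growth factor year on year: {growth_factor:.2f}")
print(f"Estimated jobs per month: {jobs_per_month_estimate / 1e6:.1f} million")
```

The growth factor comes out just under 2, matching the "doubled" claim, and the monthly estimate is of the same order as the 10 million jobs per month quoted.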
Usage of the infrastructure (III)
Figure: Data transfers
• The bulk of the data transported can be credited to the four LHC VOs
• Peaks of data transfer activity in spring and summer 2008 correspond to WLCG service challenges and stress tests in preparation for the start of the operational phase of the LHC
• Slowly increasing in the last months
• Sustained data rates of more than 0.9 GB/s with peaks up to 1 GB/s
Seed resources
• Pool of compute and storage resources made available to new VOs to ease the process of becoming a user of the EGEE e-Infrastructure (with dedicated funding)
• Resources (257 cores and 27 TB of disk space) allocated to 4 sites, with well defined usage policies, up and running since January 2009
Metric: Value (VOs)
– Number of VOs allocated to seed resources: 2 (na4.vo.eu-egee.org, eticsproject.eu)
– Number of requests for seed-resource allocation: 1 (Climate-G VO)
– Number of jobs submitted from seed-resource VOs: 30150 (na4.vo.eu-egee.org, eticsproject.eu)
– Computing power consumed within seed resources pool (KSI2K): 61420 (na4.vo.eu-egee.org, eticsproject.eu)
– Disk storage used within seed resources (GB): 350 (na4.vo.eu-egee.org, eticsproject.eu)
– Services operated by the VOs themselves: WMS = 1, LFC = 1, CE = 2, SE = 2 (na4.vo.eu-egee.org); WMS = 0, LFC = 1, CE = 0, SE = 0 (eticsproject.eu)
SLA Roll-out
• SLAs facilitate the establishment of a partnership between infrastructure management structures and resource centres (sites) to provide a defined quality of service to the users of resources.
• Slow but steady progress in all regions
• 127 sites out of 264 (48%) have signed the SLA:
  – Some ROCs sign with the national grid organizations (UKI, Italy)
  – Others consider the signature of the WLCG MoU equivalent (France)
• Complete set of metrics defined
  – Site availability/reliability is gathered automatically every month
  – All others are gathered quarterly, from different sources, some of them not automated
  – Ongoing work at CESGA to provide an operations metrics portal collecting all metric results
Site Availability / Reliability
• Availability and reliability targets are defined in the EGEE ROC-Site SLA (70% Availability, 75% Reliability)
• Results published monthly as the EGEE League Table
  – https://edms.cern.ch/document/963325/
• Systematic review of results by ROCs and SA1 management
• Since May 2008, steady, albeit irregular, improvement of overall site availability.
• Discovering limitations of weighting by CPU count due to server consolidation
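The slides do not define how availability and reliability are computed. The sketch below assumes the commonly used definitions (availability = time up / total time; reliability = time up / (total time minus scheduled downtime)) and checks them against the SLA targets quoted above. The site names, hours and CPU counts are invented for illustration.

```python
from dataclasses import dataclass
from typing import Sequence

# Targets quoted on the slide for the EGEE ROC-Site SLA.
AVAILABILITY_TARGET = 0.70
RELIABILITY_TARGET = 0.75


@dataclass
class SiteMonth:
    """One site's monthly figures (all values here are hypothetical)."""
    name: str
    hours_total: float
    hours_up: float
    hours_scheduled_down: float
    cpus: int

    @property
    def availability(self) -> float:
        # Assumed definition: fraction of the whole period the site was up.
        return self.hours_up / self.hours_total

    @property
    def reliability(self) -> float:
        # Assumed definition: scheduled downtime is excluded from the period.
        return self.hours_up / (self.hours_total - self.hours_scheduled_down)

    def meets_sla(self) -> bool:
        return (self.availability >= AVAILABILITY_TARGET
                and self.reliability >= RELIABILITY_TARGET)


def cpu_weighted_availability(sites: Sequence[SiteMonth]) -> float:
    """Average availability with each site weighted by its CPU count."""
    total_cpus = sum(site.cpus for site in sites)
    return sum(site.availability * site.cpus for site in sites) / total_cpus


sites = [
    SiteMonth("large-site", hours_total=720, hours_up=540, hours_scheduled_down=36, cpus=1000),
    SiteMonth("small-site", hours_total=720, hours_up=684, hours_scheduled_down=0, cpus=100),
]

simple_mean = sum(site.availability for site in sites) / len(sites)
weighted_mean = cpu_weighted_availability(sites)

for site in sites:
    print(site.name, f"{site.availability:.2%}", f"{site.reliability:.2%}", site.meets_sla())
print(f"simple mean {simple_mean:.2%}, CPU-weighted mean {weighted_mean:.2%}")
```

With these invented numbers the CPU-weighted mean is pulled towards the larger site, which illustrates how strongly the result depends on the CPU count each site reports.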
Site Availability Improvements
Figure panels: May 2008, April 2009
The figures show that regular monitoring of the SAM test results and the associated follow-up activity contributed to improving both the overall and the regional Availability and Reliability.
Site Availability evolution
Figure: EGEE Regional Availability Figures, monthly from May 2008 to April 2009 (vertical axis 50% to 100%). Series: AP, CERN, CE, France, DECH, Italy, NE, Russia, SEE, SWE, UKI, Average
Release and deployment management
• Releases of new middleware must not disrupt the operational state of the production infrastructure:
  – incremental updates of the middleware have proved to be effective
  – there were nevertheless a few incidents affecting the production system during the deployment of some updates:
      – post-mortems carried out with SA3 for these incidents
      – standard mechanism to roll back a middleware upgrade
      – staged roll-out at selected sites, to detect critical incidents as early as possible
• This goes in the direction of the future model that SA1 is putting in place, including staged roll-outs, fine grained versioning of the grid services, and a reliable production repository
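The idea of a staged roll-out with a roll-back mechanism can be sketched as follows. This is an illustration under stated assumptions, not the SA1 procedure: the site names and the `deploy`, `healthy` and `rollback` callbacks are hypothetical stand-ins for whatever tooling a site actually uses.

```python
from typing import Callable, Dict, List, Sequence


def staged_rollout(
    early_sites: Sequence[str],
    remaining_sites: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Dict[str, List[str]]:
    """Upgrade a few selected sites first; on any failure, undo everything."""
    upgraded: List[str] = []
    for stage in (early_sites, remaining_sites):
        for site in stage:
            deploy(site)
            upgraded.append(site)
        failed = [site for site in stage if not healthy(site)]
        if failed:
            # Roll back in reverse order and stop before the next stage.
            for site in reversed(upgraded):
                rollback(site)
            return {"upgraded": [], "failed": failed}
    return {"upgraded": upgraded, "failed": []}


# Hypothetical example: the update breaks "site-b", one of the early sites.
versions = {"site-a": "old", "site-b": "old", "site-c": "old"}
broken_on_new = {"site-b"}


def deploy(site: str) -> None:
    versions[site] = "new"


def rollback(site: str) -> None:
    versions[site] = "old"


def healthy(site: str) -> bool:
    return not (versions[site] == "new" and site in broken_on_new)


result = staged_rollout(["site-a", "site-b"], ["site-c"], deploy, healthy, rollback)
print(result, versions)
```

In this example the problem is caught at the early sites, so the remaining site is never touched and the early sites return to the previous version.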
Pre-Production Service
• Pilot services:
  – New service: on-demand previews of new middleware functionalities to interested users
  – 5 pilot services (WMS 3.1, Site Central Authorization Service (SCAS), CREAM CE, VOMS and SLC5 Worker Nodes)
  – very successful, valuable to the user and operations community
  – Community effort based on common interests can work, given a thin layer for planning, coordination and tracking.
• Deployment testbed:
  – due to improvements in certification, focus is changing
  – many regions undertake their own rollout tests before wide-scale release
  – Will evolve into a ‘staged rollout’ composed of representative sites from the regions that undertake the deployment of new certified software releases in a timely manner
Operational security
• Day-to-day operations focused on security incidents and vulnerabilities reported
  – None involved the middleware as an infection vector
  – No significant impact on the infrastructure
• Security "drills" (early 2009 Tier1 campaign): clear overall improvement from the sites
• Cooperation with the OAT for most of the security monitoring
• Collaboration with the NRENs identified as a priority by the ROCs
  – Appropriate contact points identified and appointed on both sides
  – Local and global cooperation being improved
• Security training and dissemination
  – Full scale security training event organised at EGEE 08
• Additional gLite-specific security recommendations published
Operational security
• Software vulnerabilities
  – 28 new security vulnerabilities handled by the team
  – Comprehensive vulnerability handling process published
• Joint Security Policy Group
  – New mandate adopted
      – Clarified the stake-holders of the group
      – Confirmed the aim of preparing general policies for use on many Grids.
  – Four policy documents were approved
      – Approval of Certification Authorities
      – Grid Security Traceability and Logging Policy
      – VO Operations Policy
      – Policy on Grid Multi-User Pilot Jobs
• International Grid Trust Federation
  – Significant progress was made on policies for operation of authorization services
Global Grid User Support
• Regional support with central coordination
• GGUS is the central integration platform, connected to other support structures (regional helpdesk, VO support infrastructures, etc)
• Users can choose to submit a support request to the central GGUS, to their Regional Operations Centre (ROC), or to their Virtual Organisation (VO) support service
• Support procedures are continuously updated and improved.
• Best practices are shared between supporters, and documented in a knowledge base for all grid-related problems and their solutions.
Global Grid User Support
• Number of trouble tickets has been almost constant over time
• Not particularly affected by the increasing size of the EGEE e-Infrastructure and the number of users
• Most tickets belong to ENOC and CIC Support Units
Grid Operator on duty
• Role of oversight and 1st level support for grid production infrastructure
  – Critical activity in maintaining usability and stability of sites
  – First-line support model based on a central group of operators on duty (COD) opening tickets to sites in case of grid monitoring alarms
• Work in EGEE III to define a new model, based on the devolution to regions
  – First-line support done by each region, plus common layer for procedures, tools, escalation
  – New procedures and organizational scheme have been identified according to the requirements from existing COD teams, ROCs and sites, together with a migration work plan
  – Four pilot federations have been identified: Central Europe, Northern Europe, Asia-Pacific and South West Europe.
• Expected advantages:
  – improvement in terms of number of tickets handled and response time
  – preparation for a sustainable infrastructure based on the distribution of responsibilities to federations.
Grid Operations Automation
• Aims
  – Improve reliability and availability of sites via improved operational tools
  – Increase automation of operations infrastructure
  – Prepare operational tools for use in an EGI/NGI structure
• Operations Automation Team (OAT) with representatives from ROCs, sites, all operation tools, and related infrastructure projects
  – Strategy document at PM1 outlining technical architecture to achieve these aims
  – New regional operation monitoring and ticketing flows defined by the COD team, and implemented by the OAT tools (Nagios, Regional Dashboard, GGUS)
Operations Automation Team
Focus:
• Site Monitoring via Nagios, a commodity open-source monitoring framework
• Integration of operational tools via ActiveMQ, an open-source enterprise messaging system
Achievements:
• providing sites with a ready-to-deploy Nagios monitoring solution, which configures itself automatically and includes a reference set of grid probes
• Nagios couples grid service monitoring with local fabric monitoring
• 120 sites monitored at site
• 174 sites monitored at ROCs
Next Steps:
• Phased release of updated operational tools to meet the issues of a regional deployment
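To give a feel for what a Nagios probe reports, the sketch below follows the standard Nagios plugin convention: a state code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN), plus one line of status text with optional performance data after a "|". It is not one of the grid probes mentioned above; the check itself (a response-time threshold) and the function name `evaluate` are hypothetical.

```python
from typing import Optional, Tuple

# Standard Nagios plugin state codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(response_s: Optional[float],
             warn_s: float = 5.0,
             crit_s: float = 15.0) -> Tuple[int, str]:
    """Map a measured response time (hypothetical check) to a Nagios state.

    Returns the state code and the single status line a plugin would print.
    A real plugin would also exit with the state code.
    """
    if response_s is None:
        return UNKNOWN, "UNKNOWN - no measurement available"
    if response_s >= crit_s:
        state = CRITICAL
    elif response_s >= warn_s:
        state = WARNING
    else:
        state = OK
    line = f"{LABELS[state]} - response time {response_s:.1f}s | time={response_s:.1f}s"
    return state, line


for sample in (1.2, 7.0, 20.0, None):
    code, line = evaluate(sample)
    print(code, line)
```

Keeping the decision logic in a pure function like this makes a probe easy to test separately from the measurement it wraps.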
Regionalized operations tools
• Architecture and design phase now finished
• All tools have provided plans with functionality and milestones for delivery
• A set of milestone deliverables which give a complete functionality
  – 3 month intervals, starting April 2009
• If timescales slip, we can stop at any of the milestones and have a functional solution
  – Sacrificing functionality or distribution
Plans for Y2
• Main goal is to transition to the operating model and infrastructure proposed by the EGI Blueprint, for all SA1 tasks, with no disturbance to the reliable EGEE production infrastructure
  – Define which other tasks/roles will be regionalized, and make a plan to achieve it
  – Finalize the regionalization for the tasks already identified (COD, user support)
  – Finalize operation tool developments necessary to enable regionalization, and deploy them transparently in production
  – Revise the software release and deployment procedure to use a ‘staged rollout’ instead of the Deployment Testbed in the current PPS
Summary
• EGEE Infrastructure has continued to increase in size, scale, usage and reliability
• Distribution and automation are the driving forces
• Distribution: We are gradually evolving the operations model to move responsibility to the regions; this has an impact on effort, tools and procedures
  – Intense program of work for Y2
  – Preserving the collaboration is essential for this and for the future EGI/NGI model
• Automation: by devolving a complete solution for grid monitoring to sites/ROCs, and a complete operations toolkit integrated through well defined interfaces and using messaging