Maite Barroso: Grid Operations LHCC review, CERN, 25th September 2006
Enabling Grids for E-sciencE
EGEE-II INFSO-RI-031688 OSG-doc-498
Operations EGEE and OSG
Maite Barroso, CERN
Ruth Pordes, Fermilab
LHCC Comprehensive Review
25th September, 2006
Outline
• EGEE operations
• OSG operations
• EGEE – OSG interoperations
EGEE Infrastructure: size
• > 190 sites in 40 countries
• ~ 155 sites certified and in production
• > 28,000 processors, ~ 26 PB storage
EGEE Infrastructure: usage
[Figure: Jobs per day by VO (LCG, BioMed, Other), May-05 to Aug-06; vertical axis 0 to 60,000]
[Figure: Normalised CPU in kSI2k hours by VO (LCG, BioMed, Other), May-05 to Aug-06; vertical axis 0 to 5,000,000]
~6000 cpu-months/month
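As a rough cross-check of the "~6000 cpu-months/month" figure, the conversion to CPU-hours can be sketched as follows. The 30-day month and the idea that one CPU delivers roughly 1 kSI2k are illustrative assumptions, not values stated on the slide.

```python
# Assumptions (not from the slide): a 30-day month, and roughly 1 kSI2k per CPU,
# so that CPU-hours and kSI2k-hours are comparable in order of magnitude.
HOURS_PER_MONTH = 24 * 30


def cpu_months_to_cpu_hours(cpu_months: float) -> float:
    """Convert CPU-months of usage into CPU-hours."""
    return cpu_months * HOURS_PER_MONTH


cpu_hours = cpu_months_to_cpu_hours(6000)
print(f"{cpu_hours:,.0f} CPU-hours per month")
```

Under these assumptions the result is about 4.3 million CPU-hours per month, which falls within the 0 to 5,000,000 range of the normalised CPU chart axis.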
EGEE operation: Key objectives
• Grid management
– ROCs, relations with resource providers through negotiation of service-level agreements (SLAs)
• Middleware deployment and introducing new resources
• Operate a set of essential core infrastructure services
• Grid monitoring and control
• Resource and user support
• International collaboration
– to drive collaboration with peer organisations in the Americas and the Asia-Pacific region to ensure the interoperability of Grid infrastructures and services so that the EGEE-II user communities
• Capture and provide middleware requirements
• Grid security and incident response
• Long term sustainability of the infrastructure
– to work both within the project and with the other related infrastructure projects and embryonic National Grid Infrastructures to put in place the necessary structures and organisation to ensure a long term sustainable infrastructure
Grid management: structure
• Operations Coordination Centre (OCC)
– responsible for the overall activity management, oversight of all operational and support activities
• Regional Operations Centres (ROC)
– providing the core of the support infrastructure, each supporting a number of resource centres within its region
• Resource centres
– providing resources (computing, storage, network, etc.)
• Grid User Support (GGUS)
– coordination and management of user support activities, single point of contact (portal) for users
Operations coordination
• ROC managers meeting
– Biweekly
– Discuss inter-ROC issues, general coordination, interfaces with other activities
• WLCG-EGEE-OSG Operations meeting
– Weekly, Mondays at 16:00 (Swiss time)
– WLCG/OSG/EGEE
– Pre-reports from sites, ROCs and VOs through CIC portal
– Discuss, track and solve operation related issues from the previous week
• Operation Workshops
– Twice per year. Some joint between WLCG/OSG/EGEE
– Last one: June 2006, http://agenda.cern.ch/fullAgenda.php?ida=a062031
– Next one: Spring 2007
Middleware deployment
[Diagram: gLite middleware release flow. Labels recoverable from the figure:]
• Development teams 1, 2, 3 → Integration (tagged RPMs)
• Build is ready → Certification (Certification APT repository)
• Software passes certification → Pre-production Service, PPS (PPS APT repository)
• Software OK in PPS → Production service (Production APT repository)
• Bugs are tracked in the gLite Middleware Savannah
• Technical Coordination Group (TCG): longer term strategy
• EMT: steer next release
Grid monitoring and control
The goal is to proactively monitor the operational state of the Grid and its performance, initiating corrective action to remedy problems arising with either core infrastructure or Grid resources
[Diagram: Grid Operator on Duty (COD), OSCT, Regional Operations Centres and their Resource Centres; trigger: "Monitoring shows a problem"]
Grid Operator on Duty
• Role:
– Watch the problems detected by the grid monitoring tools
– Problem diagnosis
– Report these problems (GGUS tickets)
– Follow and escalate them if needed (well defined procedure)
– Provide help, propose solutions
– Build and maintain a central knowledge database (WIKI)
• Who does it?
– 9 ROC teams working in pairs (one lead and one backup) on a weekly rotation
– CERN, France, Italy, UK, Russia, Asia-Pacific, Southeastern-Europe, Central-Europe, Germany-Switzerland
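The weekly lead/backup rotation can be illustrated with a small sketch. The slides only say that the 9 ROC teams work in pairs on a weekly rotation; the rule used here (this week's backup becomes next week's lead) and the function name `cod_rota` are assumptions made for illustration.

```python
# The nine ROC teams listed on the slide.
ROC_TEAMS = [
    "CERN", "France", "Italy", "UK", "Russia", "Asia-Pacific",
    "Southeastern-Europe", "Central-Europe", "Germany-Switzerland",
]


def cod_rota(teams, weeks):
    """Return (week, lead, backup) tuples for a simple round-robin rota.

    Assumption: the backup team in one week is the lead team in the next.
    """
    rota = []
    for week in range(weeks):
        lead = teams[week % len(teams)]
        backup = teams[(week + 1) % len(teams)]
        rota.append((week + 1, lead, backup))
    return rota


for week, lead, backup in cod_rota(ROC_TEAMS, 4):
    print(f"week {week}: lead={lead}, backup={backup}")
```

With nine teams this pairing repeats every nine weeks, so each team leads once and backs up once per cycle.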
Grid monitoring tools
• Tools used by the Grid Operator on Duty team to detect problems
• Distributed responsibility
• CIC portal
– Single entry point
– Integrated view of monitoring tools
• Site Functional Tests (SFT) -> Service Availability Monitoring (SAM)
• Grid Operations Centre Core Database (GOCDB)
• GIIS monitor (Gstat)
• GOC certificate lifetime
• GOC job monitor
• Others
Site Functional Tests
• Site Functional Tests (SFT)
– Framework to test (sample) services at all sites
– Shows results matrix
– Detailed test log available for troubleshooting and debugging
– History of individual tests is kept
– Can include VO-specific tests (e.g. sw environment)
– Normally >80% of sites pass SFTs (NB: of 180 sites, some are not well managed)
• Very important in stabilising sites:
– Apps use only good sites
– Bad sites are automatically excluded
– Sites work hard to fix problems
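A results matrix of this kind, and the exclusion of failing sites, can be sketched as below. The site names, test names and the choice of which tests count as critical are hypothetical; the slides do not list the actual SFT tests.

```python
# Hypothetical set of tests that a site must pass to be considered "good".
CRITICAL_TESTS = {"job_submission", "replica_management"}

# Hypothetical results matrix: site -> test -> passed?
results = {
    "site-a": {"job_submission": True, "replica_management": True, "sw_environment": True},
    "site-b": {"job_submission": False, "replica_management": True, "sw_environment": True},
    "site-c": {"job_submission": True, "replica_management": True, "sw_environment": False},
}


def good_sites(results, critical_tests):
    """Sites that pass every critical test; a missing result counts as a failure."""
    return sorted(
        site
        for site, tests in results.items()
        if all(tests.get(test, False) for test in critical_tests)
    )


def pass_rate(results, critical_tests):
    """Fraction of sites that pass all critical tests."""
    return len(good_sites(results, critical_tests)) / len(results)


print(good_sites(results, CRITICAL_TESTS))
print(f"{pass_rate(results, CRITICAL_TESTS):.0%} of sites pass")
```

In this sketch a failed non-critical (for example VO-specific) test does not exclude a site, which is one possible policy rather than the documented one.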
Service Availability Monitoring
• Service Availability Monitoring (SAM)
– Will cover all core grid services
– measure availability by service, site, VO
– each service has associated service class defining required availability (Critical, highly available, etc.)
– Will be used to generate alarms
– to generate trouble tickets
– to call out support staff
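Measuring availability by service, site or VO and comparing it with the level required by a service class could look roughly like this. The sample records, the class names as dictionary keys and the numeric thresholds are all assumptions; the slides name the classes but give no figures.

```python
from collections import defaultdict

# Hypothetical required availability per service class.
REQUIRED = {"critical": 0.99, "highly_available": 0.95}

# Hypothetical assignment of services to classes.
SERVICE_CLASS = {"CE": "critical", "SE": "highly_available"}

# Hypothetical test samples: (service, site, vo, passed).
samples = [
    ("CE", "site-a", "atlas", True),
    ("CE", "site-a", "cms", True),
    ("CE", "site-b", "atlas", False),
    ("SE", "site-a", "atlas", True),
    ("SE", "site-b", "cms", True),
]


def availability(samples, key_index):
    """Fraction of passed samples grouped by one field (0=service, 1=site, 2=VO)."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for sample in samples:
        key = sample[key_index]
        total[key] += 1
        if sample[3]:
            passed[key] += 1
    return {key: passed[key] / total[key] for key in total}


def alarms(by_service, service_class):
    """Services whose measured availability is below their class requirement."""
    return sorted(
        service
        for service, value in by_service.items()
        if value < REQUIRED[service_class[service]]
    )


by_service = availability(samples, 0)
print(by_service)
print("alarms:", alarms(by_service, SERVICE_CLASS))
```

The same `availability` helper gives the per-site and per-VO views by changing `key_index`; an alarm list like this is where ticket generation or staff call-out would hook in.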
Site availability
Operational procedures
• Described in the operations manual: https://edms.cern.ch/document/701575
• Introducing new resources
• Resource registration and contact information
– Stored in GOCDB
• Site downtime scheduling
• Broadcast of planned and unplanned interventions
– EGEE broadcast tool
• Site suspension
– The site is then removed from the top-level BDII and monitoring is turned off
• Escalation procedures
Operational security
From the EGEE Operational Security Coordination Team (OSCT)
• Recent security incident:
– Many HEP sites affected by the recent incident
– Local root compromises (on up to date machines)
– Many compromised accounts (password sniffers)
– Not a Grid attack as such but involved many LCG sites
• What went well?
– Many people worked very hard
– Collaboration was excellent
– Sharing of necessary information was good
– The Grid csirts list (and HEPIX security list) kept people informed
• What did not go so well? (matters for OSCT)
– UK site decided (on the basis of following guidance) not to inform the Grid csirts
– No incident handling team created (but CERN took the lead)
– Private information leaked out on to several public mail lists and Google-searchable archives and web sites
– Discussion supposed to happen on “contacts” list not “csirts” list – much activity on csirts list
– Concern that sites who said they were not involved had not looked carefully enough
– Need to strive for the correct balance in Open vs Closed communication
– But must encourage sites to report
Open Science Grid and WLCG
The Open Science Grid contributes to the WLCG as the US distributed facility infrastructure.
OSG delivers accountable resources and cycles for LHC experiment production and analysis.
OSG federates with other infrastructures and interoperates with managerial, operational and technical activities.
OSG cooperates with the EGEE to ensure an effective and transparent system for the experiments.
Current OSG deployment
• 96 Resources across production & integration infrastructures
• 27 Virtual Organizations including operations and monitoring groups
• >15,000 CPUs
• ~6 PB MSS
• ~4 PB disk
August OSG Usage: 3 largest VOs
[Figure: August usage for ATLAS, CDF and CMS; annotation: 50K & 90K CPU Hours/day]
Running Jobs of Rest of the VOs
OSG jobs are “jobs submitted via OSG interfaces or services”
3 large VOs had ~3500 simultaneous jobs in same period
[Figure: running jobs for the remaining VOs; axis marker: 1000 jobs]
Software Release & Patches
• OSG releases are subsets of the VDT, tailored to OSG.
• 2 OSG major releases a year.
• >4 minor releases a year.
• Development releases for testing.
• Critical patches have separate path.
Site and Service Validation
• Validation services are being packaged for use by any VO.
• Grid Operations also runs the validations:
– Site-Verify executed by Operations under the operations VO.
– Job execution and file transfer tests executed under the GridEx VO.
• GridCat displays the validation results as a “red”/“green” status display.
• Integration Grid provides a system for application validation of releases and patches to the software and new services.
Support Model in OSG
• Distributed set of Support Centers covers all aspects of OSG
– VO, Resources, Services, Middleware, Community
– A support center may support multiple activities.
• The goal of the OSG support model is to provide OSG users and resources with rapid responses to reported issues.
• Each VO supports their own users and resources.
• There is an OSG Grid Operations Center for coordination and routing of issues along with critical infrastructure components.
• OSG GOC has final responsibility for releases of the OSG software stack (including patches).
OSG Grid Operations Center
• Supports Centralized Grid Services
– Monitoring Tools (MonALISA, GridCat)
– Resource Information Tools (VORS, BDII)
– Centralized Trouble Ticketing
– Interaction with Peering Grids (EGEE/TeraGrid)
– Communication Hub
– Software Packaging
– Documentation of Operations Information
– Security Response
– Keeps Definitive Contact Directory for VOs, Resources, and Support Centers
– Releasing Critical Patches/Upgrades to OSG
• And supports the OSG VO
Support Mechanisms in OSG
• Distributed set of Support Centers for all production activities in OSG
– VO, Resources, Services, Middleware, Community
– A support center may support multiple activities.
• When VOs, Resources, or Services are registered they identify a Support Center (may be Community Support).
• All Support Centers participate in OSG Operations.
Example Support Services
• Middleware
– VDT is the core-middleware support center. Other direct middleware support contacts exist, e.g. MonALISA.
– VOs and other support centers are provided with a path to the middleware representatives
– VDT has weekly office hours and an independent trouble ticket system
• Community Support
– Open support for Users and Resources not covered by a specific support center.
– Voluntary participation on mail lists & Community Chat Room
• User Support
– VO Users contact their VO support center to begin the troubleshooting process
– Problems are routed by the OSG-GOC to the responsible Support Center if the problem moves outside the VO
– Support Documents should be made available from VO Support Center and recorded on the OSG Twiki along with VO policy
– Local Ticketing Systems for some VOs
• Application Support
– Application questions go directly to the VO Support Center for routing/troubleshooting.
Security Operations
• Security Officer plans and coordinates Integrated Security Management consisting of Risk Assessment of vulnerabilities resulting in Management, Operations and Technical controls.
• Equivalence of Site and VO responsibilities and procedures.
• Incident Response includes identified security contacts of all OSG organizations.
EGEE – OSG interoperations
• Coordination
– WLCG-EGEE-OSG operations meeting
– Operations workshop: the focus of the last one was OSG-EGEE interoperations, much progress achieved
– Regular phone calls to make progress on specific areas
• Operations tools: common and/or interoperable
– Global BDII extracted from EGEE and OSG registration DBs
– GGUS interfaced to OSG FootPrints
– Site/service monitoring tools interfacing being discussed
• Security: work is underway to share security contact information and incident information
– Cross population of mail lists: EGEE sites in the OSG lists and vice-versa
– Technical details still to be agreed (read access to GOC-DB etc.)
– Ensure consistent (and many times common) policies through joint working groups.
Problem Reports
• 3 WLCG ROCs in the US: US-ATLAS, US-CMS, OSG-GOC.
• All tickets routed from WLCG through OSG-GOC. OSG GOC and EGEE GGUS exchange and automatically route tickets.
• OSG-GOC automatically routes tickets to US-CMS-ROC and, currently, manually routes tickets to US-ATLAS-ROC
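The routing described above can be summarised in a small sketch. The handoff modes come from the slide; the function name `route_ticket`, the dictionary layout and the returned fields are hypothetical and do not reflect the actual GGUS or OSG-GOC interfaces.

```python
# Handoff mode from OSG-GOC to each US ROC, as stated on the slide.
HANDOFF = {"US-CMS-ROC": "automatic", "US-ATLAS-ROC": "manual"}


def route_ticket(destination: str) -> dict:
    """Return the hops and final handoff mode for a WLCG ticket sent to a US ROC.

    All tickets enter through OSG-GOC; GGUS and OSG-GOC exchange tickets automatically.
    """
    if destination == "OSG-GOC":
        return {"path": ["GGUS", "OSG-GOC"], "handoff": "automatic"}
    if destination not in HANDOFF:
        raise ValueError(f"unknown ROC: {destination}")
    return {
        "path": ["GGUS", "OSG-GOC", destination],
        "handoff": HANDOFF[destination],
    }


for roc in ("OSG-GOC", "US-CMS-ROC", "US-ATLAS-ROC"):
    print(roc, route_ticket(roc))
```

Keeping the handoff mode in a table makes the one manual step explicit, so switching US-ATLAS-ROC to automatic routing would be a one-line change in this sketch.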
EGEE OSG Activities
• Completed
– Interoperation of information published in BDII for use by WLCG Resource Brokers.
• In progress
– Operations VO, “Ops”, on EGEE and OSG for common tests and validations.
– Programmatic interface to trouble ticket system which allows retrieval of EGEE - OSG resource scheduled downtimes.
• To watch for
– How do we communicate and test interoperability of changes (interfaces and capabilities) before they get to production?
– How do we communicate about new s/w developments in time to have common approaches & avoid duplication & divergence?
– How do we manage ourselves to not give in to “panic mode” responses & give ourselves time to not organize “just in time”?
– How do we prioritize support for our non-WLCG stakeholders during data taking?
Summary
• WLCG Operations is a focus of EGEE and OSG Operations.
• The 2 grid infrastructures are working together to ensure smooth, scalable, and effective production support.