Post on 18-Jan-2016
INFSO-RI-508833
Enabling Grids for E-sciencE
www.eu-egee.org
EGEE Operations
EGEE/LCG II OPERATION WORKSHOP – 26th May 2005
Operation WG Wrap up
C. Vistoli
Operations issues covered
• Operation:
  – GOC DB and site registration
  – Site Management Workflow
  – Interaction with OSG
• VO management
  – Freedom of choice
  – GridAT
  – CIC web portal
  – Resource allocation
• Deployment procedure
• Metrics
• Accounting
• Monitoring (out of time)
GOCDB
• GOCDB 2: presentation by P. Strange
• A lot of discussion and decisions about the meaning and type of information to add or insert in the DB
New features of GOCDB2
• Main feature is a new roles-based authentication system
  – Roles are granted to contacts to grant permissions
  – Roles system is expandable to contain new roles and permissions
• Extra functions now exposed for ROC staff and COD/CIC
  – Creating new sites
  – Managing status of existing sites (production status, monitoring status, etc.)
• Other improvements
  – Many bugfixes!
  – SQL transaction support
  – UI improvements
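The roles-based model on this slide can be pictured with a small sketch. It is an illustration only, not GOCDB2 code (GOCDB2 has a PHP front end): the role names, permission names and contacts below are hypothetical.

```python
# Hypothetical sketch of a roles-based permission check.
# Role, permission and contact names are invented; they are not GOCDB2's.
ROLE_PERMISSIONS = {
    "site_admin": {"edit_site"},
    "roc_staff": {"edit_site", "create_site", "set_site_status"},
}

# Roles are granted to contacts; a contact may hold several roles.
contact_roles = {
    "alice": {"site_admin"},
    "bob": {"roc_staff"},
}


def add_role(role, permissions):
    """Extend the role system with a new role and its permissions."""
    ROLE_PERMISSIONS[role] = set(permissions)


def is_allowed(contact, permission):
    """Return True if any role held by the contact grants the permission."""
    roles = contact_roles.get(contact, set())
    return any(permission in ROLE_PERMISSIONS.get(r, set()) for r in roles)
```

Keeping permissions attached to roles rather than to contacts is what makes the system expandable: a new role is one new dictionary entry.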
System architecture
• MySQL 4.x database with transaction support
  – GOCDB2 now fully supports transactions for enhanced data integrity
  – Contact GOC team if you want your tool to link directly to the MySQL database
• PHP 4 front end interface
• Apache 2.x web server
• Gridsite security layer
  – Grants HTTPS access only to EGEE recognised certificate holders
  – Does not map users to roles; this is a function provided by GOCDB
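As a rough illustration of why transaction support helps data integrity, the sketch below updates a site and logs the change as one unit, so a failure in either statement undoes both. It uses Python's built-in sqlite3 as a stand-in for MySQL, and the table and column names are assumptions, not the GOCDB2 schema.

```python
import sqlite3


def update_site_status(conn, site, status, contact):
    """Update a site and log the change atomically; roll back both on error."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("UPDATE sites SET status = ? WHERE name = ?", (status, site))
        conn.execute(
            "INSERT INTO changes (site, status, contact) VALUES (?, ?, ?)",
            (site, status, contact),
        )


# Hypothetical schema: every status change must name the contact who made it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (name TEXT PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE changes (site TEXT, status TEXT, contact TEXT NOT NULL)")
conn.execute("INSERT INTO sites VALUES ('SITE-A', 'candidate')")
conn.commit()
```

If the log insert fails (for example, a missing contact violates the NOT NULL constraint), the status update is rolled back with it, so the two tables cannot drift apart.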
GOC How-to
GOC requirements
- Grid-wide needs and features from a centralized source of info
- « How-to » to be settled and followed up
- GOC/COD to set up an identified process, to be validated by ROCs
Variable status: Organisation/Group/Production/Type status « LCG2, PPS » needs « closer to reality » variable definitions, so that ROCs and sites fill in coherent information that does not lead to inconsistencies, e.g. in monitoring tools. To be followed up with ROC managers and COD based on the following.
*****
Needs: how to state PPS? Laurence: how to set their IS? Look at another site. MS: refer to the GLUE schema names. MT: LCG2 nodes « group » monitored.
« Type » = accepted with regard to a given grid and a given ROC. Cf. JDL attribute; GLUE schema CE status is not at the site level: confusing as it is.
« Type gluevalueperce » = variable dynamically enforced onto the IS, similar to the GLUE CE site status – COD and CIC – open/closed. BDII notification to site admin on daily operations.
« Status » = ROC decision + COD decision: candidate, certified, uncertified, suspended, gLite pre-production; at the site level.
« Group Gridscope » = many-to-many. Site level.
« Organisation » (project organisation or funding partners) = hosting organisation. Legal body.
Operations issues
• Open issue about GOCDB and downtimes
• ROC: how they collect and certify sites
• Escalation procedure
• Core services management
• SA1 requirements escalation
Scheduled Downtimes and GOC DB
- No roll-out list mailing
- EGEE Broadcast to the affected VO managers and grid services users – coordination with user support?
- GOC DB filling
****
GOC DB requirement: CE and SE need to be separated. It was silently assumed that scheduled downtimes meant « CE downtimes ».
Detailed scheduled downtimes (DT) for all nodes at the service level.
Production, inventory and follow-up of CIS from the GIIS tool.
If this proves not to be sufficient, then requirements on IS developers.
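The requirement to record downtimes per service rather than per site could look like the sketch below. This is one possible data shape, not the GOC DB schema; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Downtime:
    """One scheduled downtime for a single service at a site (hypothetical shape)."""
    site: str
    service: str  # e.g. "CE" or "SE"
    start: datetime
    end: datetime


def services_down(downtimes, site, when):
    """Return the set of services at `site` that are in scheduled downtime at `when`."""
    return {
        d.service
        for d in downtimes
        if d.site == site and d.start <= when < d.end
    }
```

With one record per service, an SE-only intervention no longer has to be read as a « CE downtime ».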
Adding a site + site monitoring
- ROC management negotiation – status « candidate »
- Site is established – status « uncertified »
- GOC DB « Site form » complete
- SFT local instance by ROCs, or the EGEE one, OK for 1 week
*****
- ROCs are to exchange experience with their own local certification processes and tests, to come up with suggestions to put in common.
- Several SFT test instances need to run concurrently. To be taken care of by SFT developers. Documentation to install a local instance to be available to willing ROCs.
****
Input: certified means « not harmful for the structure »
SEE: local instance of SFT: good practice. To become a suggestion.
TWN: same.
PN: site has to be registered in a local BDII – ROC responsibility. Then registered into the general BDII automatically; no rollout published.
CE: same, for 3 days.
UK: same as CE.
It: certification BDII for It sites. Specific set of tests and SFT.
NE:
D+CH: regular project SFT. Sites for the time being in « pre-production ». Suggests setting a regional SFT set of tests. Needs documentation for PN for this installation.
Cern: a.k.a. PN
France: TBD
Russia:
SWE:
******
OSG: will provide the GIP and collect info the same way as EGEE. E.g.: OSG could enforce the EGEE dteam registration.
(OSG EGEE: could run native tests and OSG would appear as a given entity in the monitoring system – something similar to a ROC)
(EGEE OSG: need to pack libraries).
How-to: follow-up of tickets between footprint/remedy //GGUS can be done manually in the first phase.
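The registration steps at the top of this slide can be read as a small state progression. The sketch below is one interpretation of those notes, not an agreed procedure: the function name and parameters are hypothetical, and the move to « certified » after a clean SFT period is inferred from the "OK for 1 week" item.

```python
def next_status(status, site_established=False, form_complete=False,
                sft_ok_days=0, required_days=7):
    """Return the next site status under the assumed rules.

    candidate   -> uncertified  once the site is established
    uncertified -> certified    once the GOC DB site form is complete and the
                                SFT has been OK for `required_days` days
    Any other combination leaves the status unchanged.
    """
    if status == "candidate" and site_established:
        return "uncertified"
    if status == "uncertified" and form_complete and sft_ok_days >= required_days:
        return "certified"
    return status
```

The `required_days` parameter reflects the regional input above: the default of 7 matches the one-week figure, while a region that certifies after 3 days would pass `required_days=3`.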
Site Quarantine and Escalation
Deadline to be dependent on site size.
Prioritization of COD work to be dependent on site size.
Need for a proposal from the COD to be handed to the ROC managers and back to the COD: « >100 CPU site: deadline 1 day » to be changed to « top 10 sites: deadline 1 day ».
-----
How to take care of non-European sites – non-existing EGEE ROCs.
GD Team to give feedback to ROC managers.
-----
Escalation:
1st mail to site and ROC
2nd mail to ROC
Phone call to ROC
------
OSG: contact the registration manager to put pressure on the ROC manager.
ROC to close the ticket. Need to update deadline for ROC manually through COD – OK.
COD to put a specific site under observation – for 3 days, into quarantine, when recurrent problems.
OSG: to publicize the « bad reputation sites » on a web page.
ROCs' agreement on metrics to be agreed upon, URGENT.
-----
Large sites' organisation to deal with these constraints and ensure 24/7-like behaviour.
---
Deadline before CA upgrade becomes « a critical test » is 1 week and needs COD action.
RGMA: no deadline.
Need for accounting: important. However, registry failures cannot be blamed on sites: refinement of SFT tests needed by SFT developers. Contact with RGMA team needed.
RGMA tutorial link to be sent to the ROC managers.
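The escalation order and the size-dependent deadline noted on this slide could be encoded roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and `default_days` is a placeholder because the notes only give the 1-day deadline proposed for the top 10 sites.

```python
# Escalation order as listed in the notes.
ESCALATION_STEPS = [
    "mail to site and ROC",
    "mail to ROC",
    "phone call to ROC",
]


def escalation_step(round_number):
    """Return the action for a 1-based escalation round, capped at the last step."""
    if round_number < 1:
        raise ValueError("round_number starts at 1")
    index = min(round_number, len(ESCALATION_STEPS)) - 1
    return ESCALATION_STEPS[index]


def deadline_days(is_top10_site, default_days=3):
    """Deadline in days: 1 for a top-10 site, else a placeholder default."""
    return 1 if is_top10_site else default_days
```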
Collecting requirements through SA1
• CL – almost no official requirements existing from SA1
• SZ – involving ROC managers to collect; how to “legitimate” requirements
• JT – getting experts to meet
-------------------------------------------------
STARTING NOW: Issues that have arisen in this workshop
1/ M/W security policy – traceability and data management
2/ VO fair share and site implementation
3/ Requirement on environment variables on the batch systems
-------
Input:
The process of JRA1 enforcement is “frozen” or ineffective, and for EGEE2 it is unclear.
Send them to SZ to eradicate duplicates. To get feedback from a mailing list and take them to the PTF. SZ to rewrite them and to do the follow-up.
1/ depends on JSG – 2/ irrelevant – site level relevancy
VO management
1/CIC web
2/Freedom of Choice
Atlas is used in production,
The sites should be aware that they are blacklisted –
to be implemented –medium sized sites
VO should be able to define their customized set of tests
3/ GridAT
Could have been run as dteam VO,and allow better
development effort (e.g adding history features) of
monitoring and alarm tool
CIC portal: Support a new VO
[Workflow diagram. Actors: CIC Portal, Site, ROC, OAG, VO Manager. Elements: Request; Validation (Validated? yes/no); Broadcast; mail notifications (with Cc); contact and dialog initialization.]
CIC portal: Publish DC
[Workflow diagram. Actors: CIC Portal, Sites, Other VOs, OAG, VO Manager. Elements: Authentication; Infos on DC; Publication form; Validation; Publication; Broadcast; mail notifications; Calendar; NEWS.]
Freedom of choice - VO Page – 1/3
Freedom of choice - Final List - 2/3
Freedom of choice - CIC Page- 3/3
GridAT
GridAT (Grid Application Test) definitions:
• GridAT aims to simplify the addition of new tests for new or existing applications.
• GridAT can be used to validate a grid site from the VO software viewpoint, by submitting a test job and evaluating whether its output matches the expected results.
• GridAT is designed to certify, on demand, installed grid applications.
GridAT Web Interface
The web portal gives an overview of the Italian Grid from the VO viewpoint.
The summary table contains, for each site, the results of the latest tests, grouped by Virtual Organisation.
More details can be obtained by clicking on the test date.
Resource Allocation Process
• Resource allocation policy
  – Overview of status and requirements from VOs at the OAG: scratch WN space + MPI + licensed software + secure data access
****
OAG contact point to set up!!
How to check actual resource allocation
  – Inventory of actual services from GOC and GIIS
  – Workflow backbone
Resource negotiation: Problems
• Only general percentages by region available
• Interpretation of numbers not always the same
• No indication on availability of specific resources (MPI, licensed software)
• Allocation has to be done site by site
• Gap between OAG and sites, to be filled by ROCs
• ROCs don’t decide on scientific priorities
• Exact workflow description missing
Resource negotiation: Implementation
• Implementation via CIC portal, “OAG view” (and VO/RC)
  – Readable to everybody
  – Role-specific actions reserved to authorized people
    Sites: support yes/no, free cycles only / more detailed description
    OAG/ROCs: (re-)trigger requests to sites/ROCs (specific broadcast)
    VOs: contact point for discussions with specific sites
  – Shows site status by region: Solicited, Answered (+ answer details)
• Automate statistics, summaries, steps of the procedure
Deployment and Process
• LCG-2_4_0 first release using the new process (5 days late)
  – Release was picked up at a slow pace (~2.5 sites/day)
  – Differences between regions
    Repacking and adaptation takes time and is needed
  – Release was not sufficiently tested
    2 test sites for a deployment test are not sufficient
• Release Preparation:
• More, early involvement of sites required
  – Have to see the list of potential components very early
  – Regular progress reports to the ROC managers' telephone conference
• Very early announcement of new releases needed
  – 3 weeks before: complete list of components and changes
    Problematic, because this means certification has to be finished
  – 2 weeks before a new release, the release has to go to ROC-IT, ROC-SE, ROC-UK for:
    • Test deployment (1 week)
    • Testing on ROCs' testbeds
    • Fixing bugs will take 1 week
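The lead times listed above can be turned into calendar dates by counting back from a release day. The sketch below is only an illustration of that arithmetic; the function name and milestone labels are hypothetical, and it assumes the 1-week test deployment and 1-week bug fixing run back to back inside the 2-week window.

```python
from datetime import date, timedelta


def release_milestones(release_day):
    """Count back from the release day using the lead times on the slide."""
    return {
        "component list announced": release_day - timedelta(weeks=3),
        "release handed to test ROCs": release_day - timedelta(weeks=2),
        "test deployment finished": release_day - timedelta(weeks=1),
        "bug fixing finished": release_day,
    }


# Example only: a release on 1 July 2005.
example = release_milestones(date(2005, 7, 1))
```

For that example date, the component list would be due on 10 June 2005 and the handover to the test ROCs on 17 June 2005.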
Deployment and Process
• Deployment of new releases:
• The ROCs will drive the deployment
  – Announcement of releases through the ROC managers
  – Sites that are (much) too late will be excluded by their ROCs
• Next Releases:
• By mid June an extra release is needed for the SC3
  – FTS, LFC service, VOMS (RFC compliant proxy extensions), bug fixes
  – Tier1s and Tier2s participating in SC3 have to upgrade quickly
• Regular release 1st July
  – Like mid June + updates
  – All sites
  – (may contain gLite WLM for voluntary parallel deployment)
• Transition to gLite will require changes to the process
  – More frequent releases
  – Step by step introduction of new components
Metric
• Two sets needed
• Complex, detailed set
  – Used for pinpointing problems
  – Used by ROCs, CICs and site admins (experts)
• Coarse Summary
  – Measure overall performance
  – Small, easy to understand set
  – Hierarchical (Grid, ROCs, CICs, RCs)
  – Targeted at users to show progress (or lack of)
Metric
• General agreement on the concept
  – Detailed discussions on:
    time windows
      • Sliding windows (week, month, 3 months)
    quantities to watch for (RCs, ROCs, CICs…)
      • ROCs based on RCs
      • CICs based on services
• Release quality has to be measured
• To make progress: workgroup to define quantities
  – Organized by: Ognjen Prnjat (oprnjat@admin.grnet.gr)
  – Small (~5): Ognjen, Markus, Helene, Jeff T. and ???
  – Ognjen will collect input
  – ROCs, CICs and OMC have to agree on ONE set of quantities
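To make the sliding-window idea concrete, here is a minimal sketch. The quantities themselves were left to the workgroup, so everything here is an assumption: the daily value is taken to be a per-RC score between 0 and 1 (for instance the fraction of tests passed), and a ROC's figure is taken to be the plain average over its RCs.

```python
def sliding_mean(daily_values, window_days):
    """Mean of the most recent `window_days` values (fewer if history is shorter)."""
    if window_days < 1:
        raise ValueError("window_days must be at least 1")
    recent = daily_values[-window_days:]
    if not recent:
        raise ValueError("no data")
    return sum(recent) / len(recent)


def roc_metric(rc_daily_values, window_days):
    """ROC figure based on its RCs: average of the per-RC sliding means."""
    per_rc = [sliding_mean(v, window_days) for v in rc_daily_values.values()]
    if not per_rc:
        raise ValueError("no RCs")
    return sum(per_rc) / len(per_rc)
```

Calling the same functions with `window_days` set to 7, 30 or 90 gives the week, month and 3-month views mentioned above.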
Metering: Gianduia
DGAS deployment
[Deployment diagram. Labels: VO1 to VO3, site1 to site3, several CEs, HLR 1 to HLR 5, and APEL (aggregate site accounting).]
How APEL Works?
• PBS/LSF log processed daily on site CE to extract required data, filter acts as R-GMA DBProducer -> PbsRecords table
• Gatekeeper log processed daily on site CE to extract required data, filter acts as R-GMA DBProducer -> GkRecords table
• Message log processed daily on site CE to extract required data, filter acts as R-GMA DBProducer -> MessageRecords table
• Site GIIS interrogated daily on site CE to obtain SpecInt and SpecFloat values for CE, acts as DBProducer -> SpecRecords table, one dated record per day
• These three tables joined daily on MON to produce LcgRecords table. As each record is produced program acts as StreamProducer to send the entries to the LcgRecords table on the GOC site.
• Site now has table containing its own accounting data; GOC has aggregated table over whole of LCG.
• Interactive and regular reports produced by site or at GOC site as required.
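The daily join step can be illustrated with a toy example. This is not APEL code and does not use R-GMA: it uses Python's built-in sqlite3, and the columns and join keys are hypothetical, chosen only to show how per-log tables might be combined into one accounting table.

```python
import sqlite3

# Toy stand-ins for the per-log tables; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PbsRecords (local_job_id TEXT, cpu_seconds INTEGER);
CREATE TABLE MessageRecords (local_job_id TEXT, gk_job_id TEXT);
CREATE TABLE GkRecords (gk_job_id TEXT, user_dn TEXT);
INSERT INTO PbsRecords VALUES ('101', 3600), ('102', 120);
INSERT INTO MessageRecords VALUES ('101', 'gk-1');
INSERT INTO GkRecords VALUES ('gk-1', '/O=Example/CN=Some User');
""")


def build_lcg_records(conn):
    """Join the three log tables into LcgRecords and return its rows."""
    conn.execute("""
        CREATE TABLE LcgRecords AS
        SELECT p.local_job_id, g.user_dn, p.cpu_seconds
        FROM PbsRecords p
        JOIN MessageRecords m ON m.local_job_id = p.local_job_id
        JOIN GkRecords g ON g.gk_job_id = m.gk_job_id
    """)
    return conn.execute("SELECT * FROM LcgRecords").fetchall()
```

In this toy data, batch job 102 has no matching gatekeeper entry, so the inner join leaves it out of LcgRecords; only job 101 is carried over.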
APEL and gLite
– Is APEL integrated in gLite?
  Work currently in progress.
  We have ported the APEL code into the gLite CVS repository but need to understand functional differences, e.g. WMS and use of Condor
  3 components: Core + PBS plugin + LSF plugin
  Sent our requirements to Erwin Laure… waiting for information.
– What about its deployment plan?
  As soon as possible… but would also like to add some new features
  • Global Job ID to link with L&B
  • DN to VO mapping
GridICE Architecture
[Architecture diagram: GridICE on LCG 2. Logical components and roles: sensor, event collector, event provider, publisher, consumer. Labels: Resource, Site Publisher, GridICE Server, LAN, WAN, Lemon agent, Lemon server, scripts, MDS GRIS, LDAP client, browser, application consumers, HTTP (HTML/XML), NS. Data delivery models: pull, periodic, unicast; push, periodic, unicast; XML: pull, aperiodic, unicast; NS: push, aperiodic, unicast.]
GridICE and DGAS Common Metering for Grid jobs
• DGAS is an accounting system, therefore it is interested in knowing the usage-related parameters of a job after its execution
• GridICE is a monitoring system, therefore it is interested in knowing the job-related information from the moment the job is created in the queue
  – The information should be updated frequently and provided to users respecting the security concerns
[Diagram: job states (queued, running, aborted, deleted, executed) and the parts of the job life cycle of interest to GridICE and to DGAS.]
GridICE on gLite
[Architecture diagram: GridICE on gLite. Logical components and roles: sensor, event collector, event provider, publisher, consumer. Labels: Resource, Site Publisher, GridICE Server, LAN, WAN, Lemon agent, Lemon server, scripts, CEMon, RGMA, MDS2 consumers, browser, application consumers, HTTP (HTML/XML), NS. Data delivery models: pull, periodic, unicast; push, periodic, unicast; XML: pull, aperiodic, unicast; NS: push, aperiodic, unicast.]
Summary
                             GridICE in LCG 2.x             GridICE in gLite
Schema                       GLUE 1.1++                     GLUE 1.1++/GLUE 1.2++
Local Area Distribution      Lemon, UDP/TCP                 Lemon, UDP/TCP
Site Publisher               MDS GRIS (LDAP), no security   CEMon/R-GMA, http+gsi+proxy+voms ext.
Discovery                    BDII (LDAP)                    Service Discovery API
Wide Area Data Distribution  LDAP/Pull                      SOAP/pull (push)
Notification                 fixed number of events         content-based subscription
NPM Architecture
• JRA4/NPM provides uniform access to network performance information from a heterogeneous set of monitoring frameworks
We need your help
• We have some idea of requirements from networking experts within JRA4
• Draft requirements document available here:
  – https://edms.cern.ch/document/593620/1
• Draft use case document available here:
  – https://edms.cern.ch/document/591777/1
• We’re looking for more input from NOCs and GOCs
• If you have requirements, use cases or opinions on interfaces or needed metrics, please send them to us
• Even if you don’t have ideas at the moment, but would like to be involved in the process, please get in contact
• Contact details are at the end of the talk
Operations Summary
• CIC On Duty is now well established
  – COD is just 6 months old!!!!!
  – Tools have evolved at a dramatic pace
    Portal, SFT, …
      • Many rapid iterations
    Truly distributed effort
    Integration of new COD partner (Russia) went smoothly
  – Tuning of procedures is an ongoing process
    No dramatic changes (take resource size more into account)
Operations Summary
• Accounting
  – Last November still an area of concern
  – APEL now well established
    Support for batch systems is improving
    Several privacy related problems have been understood and solved
  – gLite Accounting: DGAS
    Some concerns about amount of information published
      • Can be handled by proper authorization?
    Collaboration with APEL on batch sensors (BBQS, Condor, ..)
      • DGAS agreed to provide them
    Will be introduced initially on a voluntary basis
      • Sites will give feedback (including privacy issues)
Operations Summary
• Tools!!!!!!!!!!!
  – GOC-DB, monitoring, monitoring, testing, testing…
  – Many impressive tools
    Lots of overlap, we should focus and fuse some of them
    R-GMA based “monitoring bus” emerging
• Releases, Deployment
  – ROCs will drive the deployment
  – ROCs will contribute to the release preparation
    Testing
    Reviewing the proposed contents
• Performance Metric
  – Measure service quality (RC, ROCs, CICs, …)
  – Ognjen organizes small workgroup to define details
Operations Summary
OSG
• Similar problems, interesting tools
• Linking of operations between LCG (EGEE) and Grid3 (OSG)
  – Concept: OSG treated like a ROC, LCG like a SC
    Details will be worked out during interoperation tests
• Worries
• Resources
  – Activities:
  – Service Challenges, LCG production, gLite pre-production, gLite transition, scaling up for LHC scale…
• Duplication of effort
  – Especially pronounced in the area of tools
Conclusions
• GOCDB2 FEATURES AND LINK WITH OPERATIONS
• OPERATIONS PROCEDURE – CODS
  Monitoring tools useful: current development effort to be provided to the COD management.
• DEPLOYMENT
• PERFORMANCE MEASUREMENT
• ACCOUNTING
  Development coordinated with It and UK for accounting purposes. DGAS deployed as an optional component for the time being. Current operations: R-GMA and APEL. APEL specificities not covered. Next items on the agenda could not be covered.
• OSG interoperability: underlying the topics covered
  – Possibility to integrate site verification of OSG and site certification in EGEE – SFT developers
  – Interfacing needed somehow between the respective operations support tools – Footprint and Remedy