10. January 2011

Tutorial Structure

‣ Today :

‣ Brief Introduction to CMS Computing

‣ General Description of Computing Shift Procedure

‣ Subscription to the CMS Computing E-Log

‣ Organization of Vidyo access from local CMS center

‣ Questions

‣ After this tutorial and >= 2 months prior to 1st shift :

‣ New shifters go through the Shift Procedure and shadow an experienced CSP by taking “passive” shifts (only E-log reports, NO alarms)

‣ After 2 “passive” shifts :

‣ Sign off by Peter/Oli

‣ Full participation as CSP

‣ Possibility to sign-up via the WEB

Brief Introduction to CMS Computing

Overview of the CMS Distributed Computing System


‣ Multi-tiered distributed computing infrastructure based on GRID technologies for resource access and data movement

‣ Many new challenges compared to established HEP experiments:

‣ Data distribution, user localization, site monitoring, support responsibilities

Overview of the CMS Distributed Computing System

Tier-0 / CAF

‣ Data archival (cold copy)

‣ Prompt reconstruction

‣ Time critical calibration & alignment

Overview of the CMS Distributed Computing System

‣ Data archival (hot copy)

‣ Reprocessing, skimming, MC production

‣ Data serving



Overview of the CMS Distributed Computing System

‣ Centralized Simulation

‣ Distributed Data Analysis



Overview of the CMS Distributed Computing System

‣ Transfer rates

‣ Processing resources


Tier-1 level: ~35k jobs/day

Tier-2 level: ~100k


300 MB/s

600 MB/s

Down: 50-500 MB/s bursts

Up: 20 MB/s sustained


Overview of the CMS Distributed Computing System

In total: 7 Tier-1 across 3 continents, ~50 Tier-2 across 4 continents


CSP introduction


CSP Role and Required expertise

‣ The CSP mainly monitors systems and raises alarms

‣ monitor computing infrastructure and services at checkpoint hours by going through a set of checklists

‣ identify problems

‣ Create E-Log reports

‣ trigger actions

‣ open Savannah tickets, in particular to CMS Sites

‣ contact CRC, Core Computing Operators & Experts, Computing Experts On Call

➞ We are working on making the CSP role even more active in problem trouble-shooting

‣ Required expertise of the CSP

‣ Fair understanding of CMS distributed computing infrastructure + services required for data processing, transfers and analysis

‣ Physicist or technician from a collaborating CMS institute

‣ Tutorial + 2-3 assisted “passive” shifts


CMS policy for Computing Shifts

‣ The Computing Shifts are accounted within standard MoA service work defined by CMS (as Central CMS Shifts), see

‣ Standard requirement : 8 points per author per institute

‣ 1 CSP shift == 0.75 points on a weekday / 1.25 points on a weekend

‣ no extra credit for night shifts, since the shifters cover all time zones

‣ (special arrangements not excluded)

‣ During Data taking computing shifts are carried out :

‣ From Main CMS Centres : CMS CC or FNAL/ROC

‣ From Remote CMS Centres : see

‣ In 8-hour shifts (09-17/17-01/01-09), with 1 CSP per shift

‣ With the support of a Computing Run Coordinator who is on duty at CERN during 1 week periods

‣ With the support of CMS Core Computing Operators & Experts
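The credit and slot rules above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the hour is assumed to be given in whatever clock the shift schedule uses (the slide does not say which time zone that is), and whether the full 8-point requirement can be met with CSP shifts alone is not stated here.

```python
import math

# Values quoted on the slide.
WEEKDAY_POINTS = 0.75
WEEKEND_POINTS = 1.25
REQUIRED_POINTS = 8.0

# (start_hour, end_hour) of the three 8-hour slots; "17-01" wraps past midnight.
SLOTS = {"09-17": (9, 17), "17-01": (17, 1), "01-09": (1, 9)}


def shift_points(weekday_shifts, weekend_shifts):
    """Service-work points earned for a number of CSP shifts."""
    return weekday_shifts * WEEKDAY_POINTS + weekend_shifts * WEEKEND_POINTS


def weekday_shifts_needed(points=REQUIRED_POINTS):
    """Weekday shifts needed to reach a number of points (hypothetical helper)."""
    return math.ceil(points / WEEKDAY_POINTS)


def slot_for_hour(hour):
    """Name of the shift slot covering a given hour (0-23) of the schedule clock."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    for name, (start, end) in SLOTS.items():
        if start < end:
            if start <= hour < end:
                return name
        elif hour >= start or hour < end:  # slot wraps past midnight
            return name
    raise ValueError("no slot covers this hour")


if __name__ == "__main__":
    print(shift_points(4, 2))
    print(weekday_shifts_needed())
    print(slot_for_hour(0))
```

Night slots carry no multiplier in `shift_points`, matching the "no extra credit for night shifts" rule.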


Other Roles & interactions with CSP

‣ Computing Run Coordinator (CRC)

‣ Subscribes to all CSP E-log sub-sections

‣ Assists CSP in raising alarms/tickets for complex cases

‣ Calls EOC during off-working hours (see below)

‣ Core Computing Operator or Expert (FacOps, DataOps, AnaOps)

‣ Subscribes to relevant CSP E-Log sub-sections

‣ Supports CSP during working hours

‣ Computing Expert On Call (EOC)

‣ Responsible for a particular service

‣ Alarmed by CSP via Email/IM/Tel during working hours

‣ Alarmed by CRC, if really needed, outside working hours

‣ CMS Site Contact Person

‣ Responds to alarms (e.g. Savannah, GGUS tickets)

‣ Other shifters (DQM, Online, Detector, …)

‣ In temporary absence of CRC, the CSP is the Core Computing contact for any shifter at P5/CMS Center/FNAL ROC

‣ CSP procedure responsible

‣ Assigns CSP shifts


CSP tools


Prerequisites

‣ The CSP should

‣ be a CMS member

‣ if you are not, please fill in the WEB registration form

‣ After the form has been submitted, an email is sent to your Institute Representative (Team Leader) for approval

‣ If you have never been to CERN, it is necessary to send a copy of your passport to Anastasia Dolya, CMS Secretariat, CERN - PH Department, CH -1211 Geneva 23, Switzerland

‣ have a CMS Computer account

‣ for the Computer account, please contact [email protected]

‣ a Hypernews account

‣ a GRID certificate + CMS VO registration

‣ Please follow the link for a guideline on how to proceed


Most important CSP tools

‣ Main CSP Shift Instructions


‣ Vidyo connection to the Tandberg system (other CMS Centres)


‣ Shift Sign-Up tool


‣ Instant Messenger under “FacOpsShifter” account


‣ Computing Plan of the Day


‣ Account in the CSP E-log


‣ Savannah account ( “cmscompinfrasup” member) for opening tickets


‣ Membership in e-group [email protected]

‣ subscribe via


Shift Subscription tool


‣ Shift selection : Blue == available on any slot that day / Green == available on a particular slot that day

‣ Preferably, please always check the Green box corresponding to your time zone slot to avoid being approved for other time zones

‣ Warning : when you select Green, Blue gets selected automatically as well, so please deselect Blue to avoid confusion


Shift Subscription policies

By end 2010, we actually have more demand for shifts than available slots (95 potential shifters !), so approvals need to follow stricter policies :

➡ shift requests can be made anytime for any open shift period

➡ shift approvals will follow a monthly schedule, where shifts are approved two months in advance to allow for a reasonable planning horizon for all shifters

- example : all shift requests for January are reviewed at the beginning of November, the shift requests are balanced between the different groups/regions and shifts are approved

➡ In the monthly approval process, we would like to follow this procedure:

- shift requests from shifters in their own time zone have priority

- within a time zone, balance shift requests first on group/institute level, then on the level of individual shifters

➡ We are also regularly publishing the CSP shift planning and accounting tables, per time zone, per group and per shifter, see next slide.
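One possible reading of the approval policy, as a minimal Python sketch. It is not the actual approval tool: the `Request` fields, the slot identifiers and the greedy one-slot-at-a-time strategy are all assumptions made for illustration. Only the priority order (own time zone first, then balance per group, then per shifter) and the two-month lead time come from the policy above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    shifter: str
    group: str       # group/institute of the shifter
    home_zone: str   # time zone the shifter works in
    slot_zone: str   # time zone the requested slot belongs to
    slot: str        # hypothetical slot identifier, e.g. a date


def review_month(year, month):
    """(year, month) in which requests for the given shift month are reviewed."""
    index = year * 12 + (month - 1) - 2   # two months in advance
    review_year, month_index = divmod(index, 12)
    return review_year, month_index + 1


def approve(requests):
    """Greedy sketch: one approved request per slot, following the stated priorities."""
    by_slot = {}
    for request in requests:
        by_slot.setdefault(request.slot, []).append(request)

    group_load = Counter()     # shifts already approved per group
    shifter_load = Counter()   # shifts already approved per shifter
    approved = {}
    for slot in sorted(by_slot):
        best = min(
            by_slot[slot],
            key=lambda r: (
                r.home_zone != r.slot_zone,  # own time zone first
                group_load[r.group],         # then the least-served group
                shifter_load[r.shifter],     # then the least-served shifter
            ),
        )
        approved[slot] = best
        group_load[best.group] += 1
        shifter_load[best.shifter] += 1
    return approved


if __name__ == "__main__":
    print(review_month(2011, 1))
```

A real assignment would also have to respect the Blue/Green availability choices and manual overrides, which this sketch ignores.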


CSP Planning and Accounting

Example for European time zone :


The CMS Computing Logbook

‣ 2 (unpleasant) features :

‣ need to enter your E-log password the first time you access a given section

‣ need to regularly reload your browser to see updates


The Savannah ticketing tool

‣ main tool to communicate with sites and DataOps/FacOps/AnaOps to solve infrastructure problems

‣ Savannah Instructions for CSP :


Submit a ticket


Savannah

Category: mostly SAM tests, Job Robot, Data transfers, ...

Severity: You judge !

Privacy: “Public”

Assigned to: either DataOps, FacOps, AnaOps or T1/T2 site squad

Use GGUS: YES for T1s, NO for T2s

Site: T1/T2 site squad

‣ Subject: if connected to a specific site, begin with [SITE]

‣ Example: [T1_US_FNAL]

‣ For Tier-1, please systematically bridge to GGUS (WLCG ticketing) via Use GGUS: Yes

‣ More information about that here :
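The ticket conventions on this slide (subject starting with [SITE], Privacy “Public”, GGUS bridge for Tier-1 only) can be captured in a small helper. This is a sketch only: it does not talk to Savannah, the function name and the dictionary keys are hypothetical, and the tier is assumed to be readable from the site name prefix as in T1_US_FNAL.

```python
def ticket_draft(site, problem):
    """Build the fields of a site ticket following the CSP conventions."""
    tier = site.split("_", 1)[0]
    if tier not in ("T1", "T2"):
        raise ValueError("expected a site name starting with T1_ or T2_")
    if not problem.strip():
        raise ValueError("the subject needs a short description of the problem")
    return {
        "subject": f"[{site}] {problem.strip()}",
        "privacy": "Public",
        "site": site,
        "use_ggus": tier == "T1",  # bridge to GGUS for Tier-1s, not for Tier-2s
    }


if __name__ == "__main__":
    print(ticket_draft("T1_US_FNAL", "failures in SAM test"))
```

Category, Severity and Assigned to are left out on purpose, since the slide leaves them to the judgment of the shifter.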


The Vidyo interface

‣ We have set up a permanent Vidyo ➞ MCU video bridge

‣ Connects to the permanent video feed between the main CMS Centers and P5

‣ Remote shifters can be in direct contact with the shifters at the CMS CC, P5 and FNAL ROC

‣ To avoid having too many connections, only one CSP shifter is allowed to be connected at any time

‣ CSP has to log on at the beginning of shift and log off at end

‣ Every remote CMS Center needs a Remote Video Admin (to connect to MCU) :

‣ Responsible to check that system is used properly and holding the connection details

‣ Vidyo-capable PC (Windows and Mac clients OK, Linux client still in beta)

‣ Sites with existing “Tandberg” or “Polycom” devices will be connected to MCU directly


CSP procedures


Checklist 1 : Core

‣ CERN/Core infrastructure monitoring :

‣ Main checks: CERN/IT SSB, CMS Service Gridmaps, CMS Services scheduled upgrade, CASTORCMS instances


Checklist 2 : Tier-0

‣ Tier-0 workflows monitoring :

‣ Main checks: Storage Manager, T0Mon, tier0export pool, networking, batch/LSF farm, jobs


Checklist 3 : CAF

‣ CAF workflows monitoring :

‣ Main checks: free space/usage per CAF stakeholder on cmscaf pool, networking, batch/LSF farm, jobs


Checklist 4 : Data Transfers

‣ Distributed Data Transfer monitoring :

‣ Main checks: Queue-based monitoring for Tier-1s (not for T2s), Status of PhEDEx agents at sites



New Checklist 4 : Data Transfers

‣ Distributed Data Transfer monitoring. Main checks :

‣ Status of PhEDEx agents at sites

‣ Queue-based monitoring for Tier-1s (not for T2s)

‣ This new tool will be tested with shifters during November and deployed by end of 2010, replacing the existing tool.


Checklist 5 : Grid Sites

‣ Distributed Grid sites monitoring :

‣ Main checks: SAM, JobRobot, Downtimes, Commissioning links, Savannah


Checklist 5 : Grid Sites

‣ Important

‣ CSP is asked to investigate the problem in as much detail as possible

‣ This helps the admin who receives the Savannah ticket to solve the problem quickly and easily


‣ Report that site x shows failures in the <to be filled> SAM test

‣ In the body, describe what the problem is: click through the information provided until you reach the detailed error report



Checklists 6&7 : T1/T2 workflows

‣ Tier-1 workflows monitoring :

‣ Main checks: not covered so far, currently relying on T1 admins, T1 coordinators, DataOps

‣ Plan to add ProdMon/Dashboard monitoring + GlideIn Fabric monitoring

‣ Tier-2 workflows monitoring :

‣ Main checks: not covered so far, currently relying on T2 admins, T2 coordinators and CRAB support team

‣ Plan to collaborate with AnalysisOps monitoring

‣ Plan to add ProdMon/Dashboard monitoring


Some real examples


CAF monitoring

• Free space on CMS CAF disk starts to shrink, due to an unexpected


• CSP instructions (CAF) : If the fraction of free space on cmscaf as shown in URL1 goes below 10% and if this was not already mentioned in the Computing Plan of the Day and there is no already opened Savannah ticket, open an ELOG in the "CAF" category


• If no detection/alarm by CSP, the free space might shrink to 0, with the consequence that the critical Tier-0 to CAF data flow breaks

• This really happened ! …and some uncontrolled emergency data flushing on the CAF had to be done ➞ WORST CASE
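The CAF instruction quoted above is a simple three-condition rule. A minimal Python sketch, assuming the free-space fraction has already been read off the monitoring page (URL1) by hand; the function name is hypothetical:

```python
FREE_SPACE_THRESHOLD = 0.10  # 10% free space on the cmscaf pool


def should_open_caf_elog(free_fraction, in_plan_of_day, ticket_already_open):
    """True if the CSP should open an ELOG entry in the "CAF" category."""
    if not 0.0 <= free_fraction <= 1.0:
        raise ValueError("free_fraction must be between 0 and 1")
    return (
        free_fraction < FREE_SPACE_THRESHOLD  # goes below 10%
        and not in_plan_of_day                # not in the Computing Plan of the Day
        and not ticket_already_open           # no Savannah ticket opened yet
    )


if __name__ == "__main__":
    print(should_open_caf_elog(0.08, False, False))
```

All three conditions must hold: low free space that is already announced or already ticketed does not need a second report.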



Computing Plan of the Day

• Note : 3 Russian sites in downtime !


Grid Site Monitoring

• Example CMS Site Status Board :

JINR in Scheduled downtime

Ignore Waiting Room

T2_CN_Beijing shows a red ball ! Known by Comp. Plan of Day? No ! So what to do ?


Grid Site Monitoring

‣ Investigate further:

‣ Click on link next to “red ball”

‣ Check the different problem categories and even drill further down to check for the real problem

‣ Report in E-log

‣ Advanced CSP can open Savannah ticket to site

‣ Subject should include: [SITE] and a short description of the problem that is as specific as possible

‣ Do not only mention that the site has a “red ball” !!!

‣ Ticket should contain as many details as found out during investigation


Other news on GRID site monitoring

• “lens symbol” == already known issue. NO Elog/ticket needed (but check that it is still the same problem)

• “At work symbol” == Site scheduled downtime. NO Elog/ticket needed

Note : Unscheduled downtimes are not yet marked with the “At work symbol”, so double-check with the Computing Plan of the Day and with CMS Google Downtime Calendar (see next slide) before opening Elog/ticket.

• If T1 red, small ball, CSP should open Elog/Savannah almost immediately (1-2h)

• If T2, follow the instructions on when/how to open Elog/Savannah
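The symbol rules above can be written as one decision function. This is a sketch under assumptions: the symbol names ("lens", "at_work", "red") and the returned action codes are made up for the example, and a red site that already appears in the Computing Plan of the Day or in the downtime calendar is assumed to need no new Elog/ticket.

```python
def site_action(tier, symbol, known_in_plan=False, in_downtime_calendar=False):
    """Suggested CSP action for one site on the Site Status Board."""
    if symbol == "lens":
        # Already known issue: no report, but confirm it is the same problem.
        return "NO_REPORT_CHECK_SAME_PROBLEM"
    if symbol == "at_work":
        # Scheduled downtime.
        return "NO_REPORT"
    if symbol == "red":
        # Unscheduled downtimes are not marked, so cross-check first.
        if known_in_plan or in_downtime_calendar:
            return "NO_REPORT"
        if tier == "T1":
            return "OPEN_ELOG_SAVANNAH_WITHIN_1_2H"
        if tier == "T2":
            return "FOLLOW_T2_INSTRUCTIONS"
        raise ValueError("tier must be 'T1' or 'T2'")
    return "NO_REPORT"  # anything else is treated as healthy in this sketch


if __name__ == "__main__":
    print(site_action("T2", "red"))
```

For the T2_CN_Beijing example, a red ball that is not in the Plan of the Day falls into the T2 branch: investigate further and report as described in the T2 instructions.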


Other news on GRID site monitoring

CMS Google Downtime Calendar


PhEDEx Components Status Page

All Russian T2s have had their PhEDEx components down for ~3h. What to do ?

Check Computing Plan of the Day!


Evolution of CSP procedure


Where we stand and where we go

‣ Summer 08: CMS Computing shift procedures created

‣ Fall 08: introduced the concept of Computing Shift Person (CSP) and Computing Run Coordinator (CRC)

‣ Winter 08: ~100 shifts done by pool of ~30 computing experts at CMS Centre@CERN & FNAL/ROC

‣ 2009: CSP shifts covered by CMS collaborators at remote CMS Centres

‣ Pool of 45 CSPs from 3 time-zones (Asia, America, Europe)

‣ CMS Centres : Beijing, Rio, Sao Paulo, Texas Tech, Univ. of Florida, Aachen, DESY, FNAL, CERN

‣ 2010: extend above philosophy

‣ Pool of 70 CSPs (new remote Centres: GridKa, INFN Bologna, ... )

‣ Encourage strong remote teams who can provide local CSP support

‣ Strengthen role of CSP in trouble-shooting issues

‣ Enforce 24/7 coverage of critical services in shift procedures

‣ Move away from “Twiki” to DQM-like monitoring (in progress)


Critical Services and Sites

• We are currently revising the Criticality Level of all CMS services

• CSP instructions will be adapted accordingly

– Frequency of checks

– List of experts to contact

– Type of alarm : Elog, Savannah, telephone to CRC (who might raise GGUS alarm or call Expert on Call)

• As a general rule : the closer you are to the detector data stream, the more critical :

– Tier-0 : processing and storage

– CAF : processing and storage

– Central Services at CERN (Core) : DBS, PhEDEx, …

– Tier-0 – Tier-1 transfers

– Tier-1 Site Availability

➞ Please pay special attention to these workflows

• And always read the Computing Plan of the Day



24/7 Critical Services & Sites Coverage (II)

Service/Facilities Monitoring

CSP checks every 2 hours

Status Green ?

E-LogBook & Ticketing tool

Expert answer within 1 hour ?

Service/Site Alarm Procedure

Expert Computing Operations

Problem solved ? No

Core System Alarming

Computing Run Coordinator (CRC) reachable 24/7 for :

- Critical Service recovery procedure

- Priority (GGUS-Team) ticket to site

CMS Core Computing experts / CMS Site admins (*) :

- Apply routine service / infrastructure operations and monitoring

- Respond as On-Call Experts to Alarms

(*) CMS has dedicated site-contacts and site-admins

(**) highly critical alarms to Tier-0/1s are sent via GGUS-Alarm tickets and can trigger phone calls

(***) CRC, Service Expert or Site Admin actions are systematically reported back to the E-LogBook or Savannah or GGUS, for transparency purposes.
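A tentative sketch of the escalation logic behind this slide. Only the 2-hour check interval and the 1-hour expert response time are taken from the slide; the order of the steps, the function name, the parameters and the returned step names are assumptions, since the arrows of the original flowchart are not reproduced here.

```python
CHECK_INTERVAL_HOURS = 2   # CSP checks every 2 hours
EXPERT_ANSWER_HOURS = 1    # expected expert response time


def next_step(status_green, reported=False, hours_since_report=0.0,
              expert_answered=False, problem_solved=False):
    """Suggested next step for one critical service or site (assumed flow)."""
    if status_green or problem_solved:
        return "RECHECK_AT_NEXT_CHECKPOINT"
    if not reported:
        return "REPORT_IN_ELOG_AND_TICKET"
    if expert_answered:
        # Unsolved but being handled; follow the E-log for updates.
        return "FOLLOW_UP_WITH_EXPERT"
    if hours_since_report >= EXPERT_ANSWER_HOURS:
        # No expert answer in time: start the alarm procedure via the CRC.
        return "ALARM_PROCEDURE_VIA_CRC"
    return "WAIT_FOR_EXPERT_ANSWER"


if __name__ == "__main__":
    print(next_step(status_green=False, reported=True, hours_since_report=1.5))
```

Whatever the branch, the outcome is expected to end up in the E-LogBook, Savannah or GGUS, as stated in footnote (***).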





What CSP should always do ?

• Subscribe to CSP shifts well in advance (> 1 week). If you need to cancel, consult P.Kreuzer/O.Gutsche AND remove the shift subscription

• Carefully read the Computing Plan of the Day and keep an eye on it during the whole shift. If Plan missing, read report by previous shifter and complain via AIM or email to CRC

• Always connect to the instant messenger CSP account “FacOpsShifter”. When leaving the shift desk, inform outside world by changing status of messenger (e.g. to “away for lunch”)

• When reporting an issue in the proper Elog section, provide details of the observed problem (not just the link)

• Regularly read Elog responses or announcements by CRC or Computing Experts, in all Elog sections (reload browser !)

• Write detailed final shift reports in Elog; even if nothing new has occurred during shift, report on main open issues

• Once trained (2-3 passive shifts), open Savannah tickets in case of well identified site issue, by carefully following the instructions


What CSP should never do ?

• Ignore a suspicious problem because it is too complex to understand. Solution : inform CRC or Computing experts via Elog

• Open a Savannah ticket without following the CSP instructions to identify a site problem (PhEDEx Component, SAM), or when confused about an observed problem. Solution : consult CRC, Computing Experts via Elog

• Cancel a shift or be replaced without reporting. Solution : inform the shift responsible in advance and cancel the subscription in the shiftlist


Last steps


Passive shifts

‣ Passive shifts

‣ Go through already signed up shifts and determine CSP time slot for doing passive shifts

‣ Contact CSP shifter and check if she/he is willing to act as passive shift host

‣ Confirm with O.Gutsche/P.Kreuzer

‣ Shift Subscription

‣ Once passive shifts done, subscribe to shifts (ideally 2 months in advance) via


Subscriptions

‣ Assumption:

‣ Shifter already has CERN account and HyperNews account

‣ Sign up for elog access:


‣ Sign up for e-group [email protected]


‣ Sign up for correct Savannah access to write tickets:

‣ Login to Savannah (CERN afs login)


‣ under "Request for inclusion" type "CMS" and "search", this will display all groups, then click on "CMS Computing Infrastructure Support"

‣ Peter & Oli will approve the request

‣ Get a valid Grid Certificate and CMS VO registration



