Overview of day-to-day operations Suzanne Poulat.
![Page 1: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/1.jpg)
Overview of day-to-day operations
Suzanne Poulat
![Page 2: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/2.jpg)
Overview
Human resources
Operation group role
Services during Out of Office Hours
Monitoring
Tickets and alarms
Documentation
Conclusion
CMS meeting 2
![Page 3: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/3.jpg)
Human resources in Operation group
8 people in the Operation group (7.8 FTE)
– 2 for contributing to operations of global and national grids
– 4 for site operation (3.8 FTE)
– From June 2009: 2 technicians instead of 2.8 FTE operators
![Page 4: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/4.jpg)
Operation group role
Operation
– Monitor the availability of all services (storage, batch, grid services…)
– Configure and regulate the system to optimize usage
– Coordinate actions during service interruptions
– Overall coordination of the « on call » service
– Other tasks: monitoring and day-to-day management of tape libraries, creating and managing user accounts and AFS disk space…
Tight cooperation between operation and support groups
![Page 5: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/5.jpg)
Services during Out of Office Hours
On-site night security guard from 6 PM to 8 AM and on weekends
– No computing actions: alerting and messaging
1 on-duty person for facilities management
1 on-duty computing engineer (evenings, weekends)
– Corrective actions if possible (documentation, training)
– Else call an expert … if available
Result is a « Best effort » coverage
![Page 6: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/6.jpg)
Monitoring
NAGIOS
Monitoring tools for BQS’s farms
Dashboard of storage services
– HPSS
– dCache
– Tape and drive monitoring with StorSentry
![Page 7: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/7.jpg)
Monitoring with NAGIOS
NAGIOS replaced NGOP in March 2009 to monitor the services:
– Batch
– Storage: HPSS, dCache, AFS, Xrootd, SRB
– Grid: CE, SRM, TOP BDII
– Databases
– Others: tape libraries, Saphir (privileges and location of services)
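As an illustration of how a service check plugs into Nagios, the sketch below follows the standard plugin convention (one status line on standard output, exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). It is not one of the checks used on site: the TCP latency probe, the thresholds and the names `classify`, `probe` and `main` are assumptions chosen for the example.

```python
import socket
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(latency_s, warn_s, crit_s):
    """Map a connect latency in seconds (None = no connection) to a state."""
    if latency_s is None or latency_s >= crit_s:
        return CRITICAL
    if latency_s >= warn_s:
        return WARNING
    return OK


def probe(host, port, timeout_s):
    """Return the TCP connect time in seconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return time.monotonic() - start
    except OSError:
        return None


def main(argv):
    """Print one status line and return the matching Nagios exit code."""
    if len(argv) != 3:
        print("UNKNOWN - usage: check_tcp_latency.py HOST PORT")
        return UNKNOWN
    host, port = argv[1], int(argv[2])
    latency = probe(host, port, timeout_s=5.0)
    # Thresholds are hypothetical; a real check would take them as options.
    state = classify(latency, warn_s=1.0, crit_s=3.0)
    detail = "no connection" if latency is None else f"{latency:.3f}s"
    print(f"{LABELS[state]} - {host}:{port} {detail}")
    return state
```

When run as a script, the return value of `main(sys.argv)` would be passed to `sys.exit` so that Nagios reads the state from the process exit code.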
![Page 8: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/8.jpg)
Nagios monitoring: web site interface
![Page 9: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/9.jpg)
Monitoring tools for BQS farms (1)
Detect slow and stalled jobs, show the overload of a service, a problem with users’ code or a malfunctioning worker. Users are automatically notified and their jobs are deleted after a delay. For CMS jobs, an alarm is sent to CMS support staff when the number of « slow » jobs is too high.
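The slow-job alarm described above can be sketched as follows. The slides do not give the actual criterion, so everything here is an assumption made for illustration: a job counts as « slow » when its CPU/wall-clock ratio stays under `min_efficiency` after a grace period, and the alarm fires when a group has more than `max_slow` such jobs. The `Job` record and all function names are hypothetical, not part of BQS.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    group: str      # e.g. "cms"
    cpu_s: float    # CPU time consumed so far, in seconds
    wall_s: float   # wall-clock time since the job started, in seconds


def is_slow(job, min_wall_s=3600.0, min_efficiency=0.1):
    """Assumed criterion: low CPU/wall ratio once the grace period is over."""
    if job.wall_s <= 0 or job.wall_s < min_wall_s:
        return False
    return job.cpu_s / job.wall_s < min_efficiency


def slow_jobs(jobs, group):
    """Return the slow jobs belonging to one group."""
    return [j for j in jobs if j.group == group and is_slow(j)]


def needs_alarm(jobs, group="cms", max_slow=2):
    """True when the group has more slow jobs than the assumed limit."""
    return len(slow_jobs(jobs, group)) > max_slow
```

Keeping the criterion in `is_slow` separate from the counting in `needs_alarm` makes it possible to tune the per-job rule and the alarm threshold independently.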
![Page 10: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/10.jpg)
Monitoring tools for BQS farms (2)
CMS jobs
![Page 11: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/11.jpg)
Remarks (in October)
Short jobs (CPU < 30 s):
– T1 prod: 25%
– T2 prod: 30%
Use of memory (Long class):
– 0.6% used more than 2 GB
– 8.5% used more than 1.5 GB
Slow jobs (all CMS users)
Problem with WMS
Blue = running jobs, green = slow jobs
![Page 12: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/12.jpg)
Storage services (1): HPSS dashboard
Number of connections on the rfio server dedicated to dCache
T10K-B tape drive usage
Treqs activity for files on T10K-B tapes
![Page 13: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/13.jpg)
Storage services (2): dCache monitoring
![Page 14: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/14.jpg)
Tape drives monitoring
Done with StorSentry
Regular reports on tape and drive quality
Drives can be changed before damaging tapes
Tapes can be changed before losing data
![Page 15: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/15.jpg)
Tickets and Alarms
Integration of GGUS tickets into the XHELP interface: all processing is done in XHELP and synchronised with the GGUS interface.
Only one Team ticket for CMS in August
Procedure set up for fast answers to the LHC Alarm tickets:
– Ticket is acknowledged by a robot
– Mail to Operation team and on-duty person
– SMS to on-duty person
– Answer to this mail automatically updates the ticket
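The four steps of the alarm procedure can be sketched as below. This is only an illustration of the flow, not the XHELP or GGUS implementation: the `AlarmTicket` record, the status values, the recipient names and the `send_mail` / `send_sms` callables are all hypothetical and would be supplied by the real ticketing and messaging systems.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmTicket:
    ticket_id: str
    summary: str
    status: str = "new"
    history: list = field(default_factory=list)


def handle_alarm(ticket, send_mail, send_sms, on_duty):
    """Run the automatic steps for a newly received alarm ticket."""
    # Step 1: the robot acknowledges the ticket.
    ticket.status = "acknowledged"
    ticket.history.append("acknowledged by robot")
    subject = f"[ALARM {ticket.ticket_id}] {ticket.summary}"
    # Step 2: mail to the Operation team and to the on-duty person.
    for recipient in ("operation-team", on_duty):
        send_mail(recipient, subject)
    # Step 3: SMS to the on-duty person.
    send_sms(on_duty, subject)
    return subject


def handle_reply(ticket, sender, body):
    """Step 4: a reply to the notification mail is recorded on the ticket."""
    ticket.history.append(f"{sender}: {body}")
    ticket.status = "in progress"
```

Passing the mail and SMS senders as arguments keeps the sketch independent of any particular gateway and lets the flow be exercised with simple stand-ins.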
![Page 16: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/16.jpg)
CMS Alarm tickets (Tests)
[Chart: CMS alarm tickets, time to solve (min, axis from 0 to 30) by date, axis ticks from 05/03/2009 to 01/10/2009. Labelled tickets: 5/3/09 9:35, 1/4/09 19:05, 2/10/09 17:55, 9/10/09 10:25.]
![Page 17: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/17.jpg)
Documentation
Documentation system includes:
– Documentation on services
– Actions on services in case of interruptions
– Communication procedures
– All Grid services integrated into the general documentation
Each document has several levels:
– Experts
– Recipes for first-level repair or restart of services by on-duty people
Generalization of the use of electronic logbooks to follow up the activities
![Page 18: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/18.jpg)
Electronic Logbooks
![Page 19: Overview of day-to-day operations Suzanne Poulat.](https://reader035.fdocuments.us/reader035/viewer/2022062719/56649ec55503460f94bd0612/html5/thumbnails/19.jpg)
Conclusion
During the past year:
– Big improvements on procedures and monitoring tools
Future:
– Cross-monitoring with SYMOD
– Training for Operation people in Quality procedures (Information Technology Infrastructure Library)