LCG Monitoring and Accounting
Dave Kant, CCLRC e-Science Centre, UK
HEPSYSMAN April 2005
Introduction
• Overview of some of the monitoring tools in action in the LHC Computing Grid
– GOCDB
– GPPMON
– GRIDICE
– GSTAT
– CERTIFICATION TESTING
– REAL TIME GRID MONITOR
• Accounting Use Case
• Future Plans
Monitoring the LCG Grid is a Challenge!
Number of participating sites is growing every day:
August 2003 => 12 sites; ~100 CPUs
October 2004 => 83 sites; ~8000 CPUs
April 2005 => 138 sites; ~14000 CPUs; 4TB Disk
Grid Operations Centre
Monitor the operational status of sites;
Fault detection
Problem Management
Identify problems; escalate; track;
Monitoring Challenges
• With so many sites participating, operational information is required in order to manage a grid environment.
• What are the core grid services (e.g. RBs/SEs/BDIIs) that the VOs are using for data challenges?
• Who do we contact when there is a security incident?
• We require a toolkit to test specific core services.
• We have to concentrate on the functional behaviour of services, e.g. if an RB sends your job to a CE, then we must assume the RB is working fine. Is this the only test of an RB?
• Not all the tests that we perform are effective at finding problems.
• We must develop tests which simulate the life cycle of real applications in a Grid environment.
• …and lots more
GOC Configuration Database
GOC GridSite MySQL
Resource Centre: Resources & Site Information
EDG, LCG-1, LCG-2, …
ce
se
bdii
rb
Monitoring Services
• Operations Maps
• Configure other Tools
• Organisation Structure
- People/Institutes/Projects
• Secure services
- News
- Self Certification
- Accounting
http://goc.grid-support.ac.uk/gridsite/gocdb
Secure Database Management via HTTPS / X.509
Store a Subset of the Grid Information system
People, Contact Information, Resources
Scheduled Maintenance
RC
SQL
https
SERVER
GOC DB can also contain information that is not present in the IS, such as: scheduled maintenance; news; organisational structures; geographic coordinates for maps.
Operations Map – Job Submission Tests
GPPMON
Displays the results of tests against sites.
Test: Job Submission
Job is a simple test of the grid middleware components e.g. Gatekeeper service, RB service, and the Information System via JDL requirements.
This kind of test deals with the functional behaviour of core grid services: do simple jobs run? These are lightweight tests which run hourly. However, they have certain limitations, e.g. dteam VO; WN reach (specialised monitoring queues).
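As a rough illustration of a job submission test steered "via JDL requirements", the sketch below builds the JDL text for a trivial job pinned to one Computing Element. The CE identifier and the helper name `build_test_jdl` are hypothetical, and the slides do not show the JDL that GPPMON actually submits; this only assumes the usual JDL attributes (`Executable`, `Requirements`, and so on).

```python
def build_test_jdl(ce_id, vo="dteam"):
    """Return JDL text for a trivial job that must run on one given CE.

    ce_id is a hypothetical CE identifier, for example
    "ce.example.org:2119/jobmanager-pbs-short".
    """
    lines = [
        'Executable = "/bin/hostname";',          # simplest possible payload
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        'OutputSandbox = {"std.out", "std.err"};',
        f'VirtualOrganisation = "{vo}";',
        # The requirement forces the broker to match exactly this CE,
        # so the RB, the information system and the gatekeeper are all exercised.
        f'Requirements = other.GlueCEUniqueID == "{ce_id}";',
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_test_jdl("ce.example.org:2119/jobmanager-pbs-short"))
```

Generating one such file per site and submitting it hourly would give the kind of lightweight pass/fail signal described above, with the same limitation: it says nothing about other VOs or about worker nodes outside the monitoring queue.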
GRIDICE – Architecture
A different kind of monitoring tool – processes / low level metrics / grid metrics
Developed by the INFN-GRID Team http://infnforge.cnaf.infn.it/gridice
Data harvest via discovery service (PostgreSQL)
Measurement service
monitoring sensor agents probe process table, memory, cpu
Publication service
GRIDICE – Global View
Display shows the processes belonging to the Broker service. Problems are flagged
List of Sites
Resource Usage CPU#, Load, Storage, Job Info
Different Views of the data: Site / VO / Geographic
GRIDICE – Expert View
Display shows the processes belonging to the Broker service. Problems are flagged
Node
Processes
Ganglia Monitoring
• http://gridpp.ac.uk/ganglia
• Can use Ganglia to monitor a cluster
Scalable distributed monitoring system for clusters and grids using RRD for storage and visualisation.
RAL Tier-1 Centre
LCG PBS Server displays Job status for each VO
Get a lot for little effort
Federating Cluster Information
• Can also use Ganglia to monitor clusters of clusters
Ganglia/R-GMA integration through Ranglia.
GIIS Monitor
• Developed by Min Tsai (GOC Taipei)
• Tool to display and check information published by the site GIIS (sanity checks, fault detection)
• http://goc.grid.sinica.edu.tw/gstat/
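The slides do not list which sanity checks the GIIS monitor applies, so the following is only a sketch of the idea: take the attributes a CE publishes and flag values that cannot be right. The attribute names follow the Glue schema as published by LCG sites (`GlueCEInfoTotalCPUs`, `GlueCEStateFreeCPUs`, `GlueCEStateWaitingJobs`); the specific rules and the function name `sanity_check` are assumptions made for illustration.

```python
def sanity_check(ce):
    """Return a list of problems found in one CE's published attributes.

    ce is a plain dict of attribute name -> integer value, standing in
    for the result of an LDAP query against the site GIIS.
    """
    problems = []
    total = ce.get("GlueCEInfoTotalCPUs")
    free = ce.get("GlueCEStateFreeCPUs")
    if total is None or free is None:
        problems.append("missing CPU attributes")
    elif total <= 0:
        problems.append("no CPUs published")
    elif free > total:
        problems.append("more free CPUs than total CPUs")
    if ce.get("GlueCEStateWaitingJobs", 0) < 0:
        problems.append("negative waiting jobs")
    return problems


if __name__ == "__main__":
    published = {"GlueCEInfoTotalCPUs": 10, "GlueCEStateFreeCPUs": 12}
    print(sanity_check(published))
```

A site with an empty result would be shown as healthy; any non-empty list is a candidate for fault detection and follow-up.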
Regional Monitoring
• EGEE is made up of regions.
• Each region contains many computing centres.
• Regional Operations Centres (ROCs) are a focus for operations.
USA
Dealing with the complexities of managing a grid.
http://goc.grid-support.ac.uk/roc_map/map.php
Provide ROCs with a package to monitor the resources in the region
• Tailored Monitoring
• GUIs to create organisations and populate them with sites
Hierarchical view of Resources
• Example UK Particle Physics GridPP
• Materialised Path encoding
Regional Monitoring Maps
EGEE (1)
France (1.1) UK/I (1.2) S.E.E (1.3)
GridPP (1.2.1)
LondonT2
ScotGrid
IMPERIAL
QMUL
Edinburgh
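The materialised path encoding in the tree above stores, for every node, the full path from the root as a dotted string, so a whole subtree can be selected with a simple prefix match. The sketch below illustrates this in Python. Only the paths 1, 1.1, 1.2, 1.3 and 1.2.1 come from the slide; the deeper paths and the placement of sites under LondonT2 and ScotGrid are assumptions for the example.

```python
# path -> name; paths below 1.2.1 are assumed for illustration
TREE = {
    "1": "EGEE",
    "1.1": "France",
    "1.2": "UK/I",
    "1.3": "S.E.E",
    "1.2.1": "GridPP",
    "1.2.1.1": "LondonT2",
    "1.2.1.2": "ScotGrid",
    "1.2.1.1.1": "IMPERIAL",
    "1.2.1.1.2": "QMUL",
    "1.2.1.2.1": "Edinburgh",
}


def descendants(tree, path):
    """Names of all nodes strictly below `path`, found by prefix match."""
    prefix = path + "."  # the trailing dot stops "1.2" from matching "1.20"
    return sorted(name for p, name in tree.items() if p.startswith(prefix))


def ancestors(tree, path):
    """Names of all nodes above `path`, from the root downwards."""
    parts = path.split(".")
    return [tree[".".join(parts[:i])] for i in range(1, len(parts))]


if __name__ == "__main__":
    print(descendants(TREE, "1.2.1.1"))  # sites in LondonT2
    print(ancestors(TREE, "1.2.1.1.2"))  # organisations above QMUL
```

In a relational database the same subtree query becomes a single `LIKE '1.2.1.%'` condition on the path column, which is what makes this encoding convenient for regional views.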
Site Functional Tests (SFT)
• In terms of middleware, the installation and configuration of a site is quite a complicated procedure.
– When there is a new release, sites don’t upgrade at the same time
– Some upgrades don’t always go smoothly
– Unexpected things happen (who turned off the power?)
– Day-to-day problems; robustness of service under load?
• It’s necessary to actively hunt for problems
• Site certification testing is run by the CERN deployment team on a daily basis. The first step toward providing this service involves running a series of replica manager tests which register files onto the grid, move them around, delete them, and perform 3rd party copies from a remote SE.
• Unlike the simple job submission tests implemented in GPPMON, these tests are more heavyweight and attempt to simulate the life cycle of real applications.
Certification Test Results
http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/listreports.cgi
Aggregator RSSReader (Windows Client)
GOC generates RSS feeds which clients can pull using an RSS aggregator.
How can we integrate feeds and ticketing systems?
Syndication of Monitoring Information
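One way to approach the question of integrating feeds with a ticketing system is to pull the feed programmatically rather than through a desktop aggregator. The sketch below parses an RSS 2.0 document with the Python standard library and returns the title and link of each item. The sample feed content is invented for the example; the slides do not show what the GOC feeds contain.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, standing in for a real GOC feed
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>GOC monitoring (sample)</title>
    <item>
      <title>SITE-A: job submission test failed</title>
      <link>http://example.org/report/1</link>
    </item>
    <item>
      <title>SITE-B: scheduled maintenance</title>
      <link>http://example.org/report/2</link>
    </item>
  </channel>
</rss>"""


def feed_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


if __name__ == "__main__":
    for title, link in feed_items(SAMPLE_FEED):
        print(title, "->", link)
```

Each pair could then be turned into a ticket or matched against existing ones; how that mapping should work is exactly the open question raised on the slide.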
Real Time Grid Monitor
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html
A visualisation tool to track jobs currently running on the grid.
Applet queries the logging and bookkeeping service to get information about grid jobs.
Why are jobs failing?
Why are jobs queued at sites while others are empty?
Problems with existing tools
• Lots of monitoring tools have been described. They have a few things in common:
- all the information which they generate is hidden away or difficult to access
- limited interfaces: the data can only be accessed in specific ways
• Therefore, it’s difficult to build “on-demand” services to allow communities (“Players”) to interact with the data.
• Examples include
a) Job Accounting service: to allow an Organisation to compare resource usage for each VO
b) Certification Testing service: secure service to allow a site administrator to run the certification test suite against their site through an RB of their choice
• The idea is for the services to collect information and put it into a common repository such as an R-GMA Archiver. In this way, the information can be shared and made accessible to all.
• Services (EGEE parlance: ROC and CIC services) munch the data and present it to the community.
• Example: the problem with the GIIS is that it’s hard to drill down to the information you want, e.g. How much CPU in GridPP today? How much disk in the UKI ROC? The new paradigm solves this problem by allowing the data to be aggregated in different ways.
Monitoring Paradigm
A better way to unify monitoring information.
GOC Services collect information and publish into an archiver.
ROC/CIC Services provide a means for the community to interact with this information on-demand. GOC provides services tailored to the requirements of the community.
Information Repository (RGMA)
Accounting
Monitoring
GSTAT
Testing
ROC Services
Self Certification
CIC Services
Communities
VOs
ROCs
EGEE
Sites
Organisations
GOC Services
GOC Use Case: Job Accounting
• An accounting package for LCG has been developed by the GOC at RAL
• There are two main parts
– the accounting data-gathering infrastructure based on R-GMA, which brings the data to a central point
– a web portal to allow on-demand reports for a variety of players.
Requirements
• A historical record of grid usage to identify the use of individual sites by VOs as a function of time
• To demonstrate the total delivery of resources by that site to the Grid
• Aggregated views of the collected data by:
– VO
– Country – a requirement of LCG which has a country-based structure
– EGEE Region – for use by EGEE Regional Operations Centre (ROC)
• A presentation front-end to the data to allow the selection on-demand of the views described above for different VOs and periods of time.
• To present the data as
– A graphical view for interpretation
– A tabular view for precision
• To support sites that already had their own methods of data collection by allowing arbitrary data collection techniques and insertion of the data in the standard schema into the central database.
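The aggregated views required above amount to grouping job records by different combinations of columns. The sketch below shows this with an in-memory SQLite table; the table layout, the column names and the sample records are all made up for illustration and are not the schema used by the GOC accounting database.

```python
import sqlite3

# (site, vo, month, cpu_seconds): invented sample records
RECORDS = [
    ("SITE-A", "atlas", "2005-03", 3600),
    ("SITE-A", "atlas", "2005-03", 1800),
    ("SITE-A", "cms", "2005-03", 600),
    ("SITE-B", "atlas", "2005-04", 7200),
]

ALLOWED = ("site", "vo", "month")


def usage_by(records, *columns):
    """Return (columns..., job_count, cpu_seconds) rows grouped by `columns`."""
    if not columns or any(c not in ALLOWED for c in columns):
        raise ValueError("columns must be chosen from %s" % (ALLOWED,))
    cols = ", ".join(columns)  # safe: every name was checked against ALLOWED
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE job (site TEXT, vo TEXT, month TEXT, cpu_seconds INTEGER)")
    con.executemany("INSERT INTO job VALUES (?, ?, ?, ?)", records)
    rows = con.execute(
        f"SELECT {cols}, COUNT(*), SUM(cpu_seconds) FROM job "
        f"GROUP BY {cols} ORDER BY {cols}").fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    for row in usage_by(RECORDS, "site", "vo", "month"):
        print(row)
```

The same function serves the per-VO, per-site and per-month views simply by changing the grouping columns, which is the sense in which one repository can feed several on-demand reports. A country or region view would need an extra column or a join against the site hierarchy.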
APEL – Accounting Processor for Event Logs
65 Sites publishing data to GOC (April 2005)
Over 1.3 Million Job records
~ 50K records per week
http://goc.grid-support.ac.uk//
GOC Accounting Services
http://goc.grid-support.ac.uk/gridsite/accounting/index.html
BaseCpuSeconds Aggregated across EGEE
Each Site, per VO, per Month
Simple interface to customise views of data: VO, time frame and Region (default = EGEE)
Each Region, per VO, per Month
On Demand Services to EGEE Community
Other Distributions
Normalised CPU
# Jobs
Provide Interface to the Data Driven by User Requirements
Materialised Path Library
Tier-1 View
Regional View
Country View
… Including Graphing Features
Number of Sites per Country Publishing Accounting Records to GOC
GridPP Accounting Status – April 2005
Sites that have never published or have not published recently:
• CAVENDISH-LCG2 – never published
• Dublin-CSTCDIE – never published
• DURHAM – last published 18th Feb 2005
• IC-LCG2 – last published 9th April 2005
• RAL-LCG2 – last published 16th March 2005
• HP-Bristol – never published
• Lancs-LCG2 – never published
• LivHep-LCG2 – never published
• QMUL-eScience – never published
• RHUL-LCG2 – never published
• ScotGrid-Glas – last published 17th Jan 2005
• UCL-CCC – last published 12th Feb 2005
• UCL-HEP – never published
Contact Dave if you need advice about installing APEL: [email protected], Tel: 01235 778178
Batch System Support
• APEL supports PBS (Released) and LSF (Testing)
• Implementations are separate and independent of one another. Currently LCG2_4_0 has PBS support only.
• Re-factoring to a single package with plug-in batch specific components is currently in progress.
• What is the current status of LSF support?
• LSF currently comes in three flavours (version 4, 5 and 6), each has a different usage record format
• New RPM edg-rgma-apel-lsf has been released to CERN for testing.
• Expect a release in the 2_4_1 tag next month.
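The re-factoring towards a single package with plug-in, batch-specific components can be pictured as a registry of parsers keyed by batch system. The sketch below is not APEL code: the registry, the function names and the sample line are assumptions, and the PBS record layout (`date;type;jobid;key=value ...` with `resources_used.cput` on end-of-job "E" records) reflects the usual PBS accounting log format rather than anything shown in the slides. No LSF parser is registered because the slides do not describe its record formats.

```python
PARSERS = {}  # batch system name -> function(line) -> dict or None


def register(batch_system):
    """Decorator that plugs a parser into the registry under a name."""
    def wrap(func):
        PARSERS[batch_system] = func
        return func
    return wrap


def hms_to_seconds(text):
    """Convert 'HH:MM:SS' to a number of seconds."""
    hours, minutes, seconds = (int(x) for x in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


@register("pbs")
def parse_pbs(line):
    """Parse one PBS accounting line; only end-of-job ('E') records count."""
    _timestamp, record_type, job_id, message = line.rstrip("\n").split(";", 3)
    if record_type != "E":
        return None
    fields = dict(pair.split("=", 1) for pair in message.split() if "=" in pair)
    return {
        "job_id": job_id,
        "group": fields.get("group"),  # later assumed to equal the VO name
        "cpu_seconds": hms_to_seconds(fields["resources_used.cput"]),
    }


def parse(batch_system, line):
    """Dispatch a log line to the parser registered for `batch_system`."""
    if batch_system not in PARSERS:
        raise ValueError("no parser registered for %r" % batch_system)
    return PARSERS[batch_system](line)


# Hypothetical sample record
SAMPLE = ("04/06/2005 10:15:00;E;1234.ce.example.org;user=dteam001 group=dteam "
          "queue=short resources_used.cput=00:01:40 resources_used.walltime=00:02:05")

if __name__ == "__main__":
    print(parse("pbs", SAMPLE))
```

With this shape, supporting each LSF version would mean registering one more function per record format, leaving the common code that joins, normalises and publishes records untouched.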
Issues
1. Which RPM Version?
Latest version on http://goc.grid-support.ac.uk/gridsite/accounting
• 3.4.44 for LCG2_4_0
• Change Log 3.4.37 to 3.4.43
o Apel 3.4.43 (April 6th): Startup script modified for RGMA 2_4_0 s/w release
o Apel 3.4.42 (Mar 20th): Improved core functionality
  - Better handling of dn suppression
  - Check flexible archiver on-line before attempting to send job records
o Apel 3.4.41 (Feb 2nd): Minor fix to SQL script
o Apel 3.4.40 (Jan 17th): Normalisation issue (see later); CatchAll specInt/specFloat set to value in GIIS rather than 0
o Apel 3.4.39 (Dec 16th): Current PBS log excluded from archive
o Apel 3.4.38 (Nov 19th): Bug in “reprocess” option during Join; added “cleanAll” option
o Apel 3.4.37 (Oct 14th): grant mechanism to allow GK and CE to connect to MySQL database
Issues
5. siteName Changes
– Recent problem with presenting data from the French ROC, where CCIN2P3 was renamed to IN2P3-CC via the GOCDB portal
– All records associated with the site are updated in order for SQL queries to match the new siteName.
6. Namespace Convention?
– Naming scheme to identify data belonging to large sites which provide services for different communities etc.
– NIKHEF: lcgprod.nikhef.nl, lcg2prod.nikhef.nl, edgapptb.nikhef.nl
– *SiteName* is a bad choice because we get multiple hits
o *IC-LCG2* gives multiple matches: PIC-LCG2 and IFIC-LCG2
– Request sites stick to the convention *.SiteName
o h1.desy.de, zeus.desy.de
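The wildcard problem described above is easy to reproduce. The sketch below uses shell-style patterns from the Python standard library as a stand-in for whatever matching the accounting portal performs (the slides do not say, though an SQL `LIKE` would behave the same way). The names come from the slide; the helper `matches` is hypothetical.

```python
from fnmatch import fnmatchcase

SITES = ["IC-LCG2", "PIC-LCG2", "IFIC-LCG2"]
HOSTS = ["h1.desy.de", "zeus.desy.de", "lcgprod.nikhef.nl"]


def matches(names, pattern):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    # A wildcard on both sides picks up every name containing the site name.
    print(matches(SITES, "*IC-LCG2*"))
    # A leading wildcard plus a dot only picks up sub-names of one domain.
    print(matches(HOSTS, "*.desy.de"))
```

The first pattern returns all three sites, which is the multiple-hit problem; the second returns only the two DESY entries, which is why the `*.SiteName` convention is requested.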
Future Plans
1. Integration into gLite Framework?
• Work started
2. “Apel” for Storage
• Capturing billing information for dcache
• Cron runs, publish recent data into R-GMA
• SE snapshot, e.g. “df” of filesystem
o Use of disk and tape
o A cron job on the SE runs a script, with the script tailored for each kind of SE, e.g. dcache, tapestore etc.
3. Web Services Interface to accounting data
• How would such a thing work?
• Any Use Cases?
Accounting Issues
1. A stable release of the accounting package has been certified and tested at CERN. Should sites wait for the official release or press ahead independently?
2. Package supports PBS only; initial implementation for LSF. 80 sites advertising 313 Job managers:
- 300 PBS (91% of sites)
- 3 CONDOR (KFKI, FNAL, TRIUMF)
- 7 LSF (GSI, LNL, CERN).
3. Accounting requires the R-GMA infrastructure to be deployed at the site.
4. The VO associated with a user’s DN is not available in the batch or gatekeeper logs. It will be assumed that the group ID used to execute user jobs, which is available, is the same as the VO name.
5. The global jobID assigned by the Resource Broker is not available in the batch or gatekeeper logs. This global jobID cannot therefore appear in the accounting reports. The RB Events Database contains this, but that is not accessible nor is it designed to be easily processed. [Andrea Guarise: JRA1 proposal]
6. Most sites keep GK/Batch logs but throw away message log files after 9 weeks due to default log rotation.
7. At present the logs provide no means of distinguishing sub-clusters of a CE which have nodes of differing processing power. Changes to the information logged by the batch system will be required before such heterogeneous sites can be accounted properly. At present it is believed all sites are homogeneous.
Summary
• Accounting Information gathering infrastructure has been developed
• It has been through the C&T cycle and should be deployed in the next release.
• A web portal for display of this information has been developed (work in progress)
• This is an EGEE deliverable (DSA1.3)
• The display infrastructure can be deployed for other monitoring information.
• Development towards on-demand services to provide the community with up-to-date information, aggregated at different levels.
• Development of Visualisation tools to enhance our understanding of the grid.