Overview of monitoring tools for Grid Systems Varenna , 12 May 2008 Antonio Pierro
May 12, 2008 Overview on monitoring tools for Grid Systems - Antonio Pierro (INFN-BARI) 1
Overview of monitoring tools
for Grid Systems
Varenna, 12 May 2008
Antonio Pierro, INFN-BARI (Italy)
Antonio.pierro <at> ba.infn.it
Outline
Overview of EGEE monitoring tools:
SAM (Service Availability Monitoring)
GridMap
GStat (Global Grid Information Monitoring System)
GridView
GridICE (infrastructure and application monitoring)
Why do we need monitoring?
Grid resources and services are subject to failures
Resource utilization and performance evaluation
Resource observability is needed to optimize Grid utilization
Management decisions
To reduce the time spent waiting for resource availability
To be always aware of what is happening
Debugging purposes
To help the operations team locate and troubleshoot problems
Requirements for a Grid Monitoring tool
Scalable
Dynamic
Robust
Should be integrated with other Grid technologies and middleware (security infrastructure, resource brokers, schedulers, ...)
SAM (introduction)
Service Availability Monitoring (SAM) framework:
Monitors all grid services and nodes, not only the CE
It is used in the validation process of sites and services
SAM wiki : http://goc.grid.sinica.edu.tw/gocwiki/SAM
SAM portal : https://lcg-sam.cern.ch:8443/sam/sam.py
Service and site status are recorded (several snapshots per day)
Daily, weekly and monthly availability is calculated by integrating (averaging) over the given period
Used for the official evaluation of T0, T1 and T2 sites
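The availability calculation mentioned above can be sketched in Python. This is only an illustration, assuming that each snapshot reduces to a simple "up" or "down" flag and that availability is the plain average over the period; the `Snapshot` type, the `availability` function and the sample data are hypothetical and are not taken from SAM itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Snapshot:
    """One recorded status sample for a site or service (hypothetical type)."""
    taken_at: datetime
    is_up: bool


def availability(snapshots, start, end):
    """Return the fraction of snapshots in [start, end) that were 'up'.

    Returns None when the period contains no snapshots, so that
    "no data" is not confused with "0% available".
    """
    in_period = [s for s in snapshots if start <= s.taken_at < end]
    if not in_period:
        return None
    return sum(s.is_up for s in in_period) / len(in_period)


# Illustrative data: four snapshots in one day, one of them failing.
day = datetime(2008, 5, 12)
samples = [
    Snapshot(day + timedelta(hours=0), True),
    Snapshot(day + timedelta(hours=6), True),
    Snapshot(day + timedelta(hours=12), False),
    Snapshot(day + timedelta(hours=18), True),
]
daily = availability(samples, day, day + timedelta(days=1))
print(f"daily availability: {daily:.2f}")
```

The same function gives weekly or monthly figures by widening the `start` and `end` bounds.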
SAM (performed tests) 1/2
CE
job submission - UI -> RB -> CE -> WN chain
version of CA certificates installed (on WN!) and middleware software version (on WN!)
replica management tests - using lcg-utils, the default SE defined on the WN and a selected “central” SE
accessibility of the experiment software directory - environment variable, directory existence
accessibility of VO tag management tools
other tests: R-GMA client check, APEL accounting records
SE, SRM
storing a file from the UI - using the lcg-cr command with LFC registration
getting the file back to the UI - using the lcg-cp command
removing the file - using the lcg-del command with LFC de-registration
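The store, get back, remove sequence of the SE test can be sketched as a small round-trip check. This is a minimal sketch, not the SAM sensor code: the three callables are hypothetical stand-ins for the lcg-cr, lcg-cp and lcg-del invocations, and the in-memory `fake_se` dictionary only exists to make the example runnable without a Grid.

```python
STEP_NAMES = ["store", "fetch", "remove"]


def run_se_round_trip(store, fetch, remove, payload):
    """Run store -> fetch -> remove and report the first failing step.

    `store`, `fetch` and `remove` are caller-supplied callables standing in
    for the lcg-cr, lcg-cp and lcg-del steps; they are placeholders, not
    wrappers around the real commands.
    """
    done = []
    try:
        handle = store(payload)
        done.append("store")
        if fetch(handle) != payload:
            return {"status": "ERROR", "failed_step": "fetch",
                    "detail": "content mismatch"}
        done.append("fetch")
        remove(handle)
        done.append("remove")
    except Exception as exc:
        # The step that raised is the one following the completed steps.
        return {"status": "ERROR", "failed_step": STEP_NAMES[len(done)],
                "detail": str(exc)}
    return {"status": "OK", "failed_step": None, "detail": ""}


# In-memory stand-in for a storage element, for illustration only.
fake_se = {}


def fake_store(data):
    fake_se["test-file"] = data
    return "test-file"


def fake_fetch(name):
    return fake_se[name]


def fake_remove(name):
    del fake_se[name]


result = run_se_round_trip(fake_store, fake_fetch, fake_remove, b"sam test payload")
print(result["status"])
```

Reporting the failing step, rather than a bare pass or fail, is what lets an operator see whether the problem is in storing, reading back or de-registering.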
SAM (performed tests) 2/2
LFC
directory listing - using the lfc-ls command on /grid
creating a file entry in the /grid/<VO> area
FTS
checking if FTS is published correctly in the BDII
channel listing - using the glite-transfer-channel-list command with the ChannelManagement service
transfer test (in development)
Standalone tests
GStat, RB
VO-specific tests as well
SAM - CE sensor tests: France Region, VO OPS
OK: normal status
Error: subject has failed and the problem is localized
*** Running R-GMA client test on alifarm57.ct.infn.it ***
Inserting tuple: ERROR: Could not contact R-GMA server at grid005.ct.infn.it:8443 – (104, 'Connection reset by peer')
ERROR: Could not contact R-GMA server at grid005.ct.infn.it:8443 – (104, 'Connection reset by peer')
Failed Timeout when executing test CE-sft-rgma after 600 seconds!
subject may fail soon
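The log above ends with a test being abandoned after 600 seconds. A time-limited test run of that kind could look roughly like the following sketch, which uses Python's standard `subprocess` module. The `run_probe` function, the two-value status and the demo commands are assumptions made for illustration; they do not reproduce how SAM schedules or reports its sensors.

```python
import subprocess
import sys


def run_probe(command, timeout_seconds):
    """Run one test command and map its outcome to (status, detail).

    A non-zero exit code or an expired timeout is reported as "ERROR".
    """
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return ("ERROR",
                f"Timeout when executing test after {timeout_seconds} seconds")
    if completed.returncode == 0:
        return ("OK", completed.stdout.strip())
    return ("ERROR", completed.stderr.strip() or completed.stdout.strip())


# Demo commands: a probe that succeeds, and one that exceeds its time limit.
quick = run_probe([sys.executable, "-c", "print('probe ran')"], 600)
slow = run_probe([sys.executable, "-c", "import time; time.sleep(5)"], 0.5)
print(quick)
print(slow[0])
```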
GridMap
It presents the same data as SAM in a different way
It is a simple, interactive and user-friendly interface for viewing the state of the Grid
Sites or services of the Grid are represented by rectangles of different size and colour, allowing two dimensions of data to be visualized simultaneously
This representation of monitoring data requires much less space than conventional sorted tables or bar charts
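The idea of encoding two dimensions of data in rectangle size and colour can be sketched as follows. This is a deliberately simplified single-row layout, not GridMap's actual treemap algorithm; the choice of a size metric, the colour thresholds and the site names are all assumptions made for the example.

```python
def colour_for(availability):
    """Map an availability fraction to a colour (thresholds are illustrative)."""
    if availability >= 0.9:
        return "green"
    if availability >= 0.5:
        return "orange"
    return "red"


def layout_row(sites, total_width=100.0):
    """Lay out sites side by side, one rectangle per site.

    `sites` is a list of (name, size, availability) tuples. Width encodes
    the size metric, colour encodes availability.
    """
    total_size = sum(size for _, size, _ in sites)
    if total_size <= 0:
        raise ValueError("total size must be positive")
    rectangles = []
    x = 0.0
    for name, size, avail in sites:
        width = total_width * size / total_size
        rectangles.append(
            {"name": name, "x": x, "width": width, "colour": colour_for(avail)}
        )
        x += width
    return rectangles


# Hypothetical sites: (name, size metric, availability).
rects = layout_row([("site-a", 600, 0.98), ("site-b", 300, 0.70), ("site-c", 100, 0.20)])
for r in rects:
    print(r["name"], r["width"], r["colour"])
```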
GridMap
GridMap Prototype – visualizing the state of the grid
[Figure: GridMap view of the state of the grid based on SAM tests, daily availability]
GridView 1/2
It is a visualization system for viewing monitoring information
Approach:
Collects monitoring information from different sources, e.g. SAM, GridFTP monitor, RB logs
Stores the monitoring records in a central Oracle database at CERN
Visualizes summary data through a Web interface
Target: Grid operators, site administrators, VO managers
GridView (web page) 2/2
Statistics of data transfers
Jobs running
Service availability
GStat 1/2
GStat is built using Python scripts that generate web-based reports, used by Grid site administrators to troubleshoot Information System (IS) issues or to access usage information.
GStat scripts are executed periodically to query and collect the information published by each site in the Grid infrastructure.
The published information is then processed by an extensible analysis framework that checks for IS failures and errors.
Target:
Grid operators
Site administrators
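An extensible analysis step of the kind described for GStat can be sketched as a registry of small check functions applied to the information a site publishes. This is a hypothetical illustration and not GStat code: the `check` decorator, the attribute names (`total_cpus`, `free_cpus`) and the two sample checks are invented for the example.

```python
CHECKS = []


def check(func):
    """Register a check; adding a new check requires no other change."""
    CHECKS.append(func)
    return func


@check
def has_required_attributes(site):
    required = ("name", "total_cpus", "free_cpus")
    return [f"missing attribute: {key}" for key in required if key not in site]


@check
def free_not_above_total(site):
    if "total_cpus" in site and "free_cpus" in site:
        if site["free_cpus"] > site["total_cpus"]:
            return ["free_cpus exceeds total_cpus"]
    return []


def analyse(site):
    """Run every registered check and collect the problems found."""
    problems = []
    for run_check in CHECKS:
        problems.extend(run_check(site))
    return problems


# Hypothetical published data for two sites.
print(analyse({"name": "site-a", "total_cpus": 10, "free_cpus": 12}))
print(analyse({"name": "site-b"}))
```

Each check returns a list of problem descriptions, so an empty list means the site passed.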
GStat 2/2
The main page of GStat shows the overall status and usage statistics for each site.
GStat site detailed report
GStat site resource status
GridICE: Overview
It is a distributed monitoring tool for Grid systems
It is evolving in the context of EU-EGEE and many other EU Grid projects
It is fully integrated with the gLite-3.x middleware
Self-configurable collection and presentation
just give the URL of the root Grid Information Service (GIS)
Installed servers are monitoring Grid resources in the scope of:
EGEE, EGEE-SWE, RDIG, EGEE-SEE, Grid.it, GILDA, CMS, ATLAS, EUMedGrid, EUChinaGrid, EUIndiaGrid, BalticGrid, LIBI, BioinfoGRID, EELA, OMII, BeGrid
Recent evolution of GridICE: lightweight sensor + VOMS information
Attributes measured by the Job Monitoring sensor
To reduce its intrusiveness in terms of resource consumption:
Two daemons run continuously and a probe is executed periodically
The daemons listen to a set of log files and collect the relevant information
Only a few LRMS commands are needed to retrieve the job status
The status of all jobs is stored in a cache (stateful behaviour)
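The stateful behaviour described above, where log events update a cache holding the last known status of every job, can be sketched as follows. This is an illustrative sketch only: the `JobStatusCache` class and the `<job_id> <status>` line format are made up, and real LRMS log formats would need their own parsers.

```python
from collections import Counter


class JobStatusCache:
    """Keeps the last known status of every job (stateful behaviour)."""

    def __init__(self):
        self._status_by_job = {}

    def apply_log_line(self, line):
        """Update the cache from one log line.

        The '<job_id> <status>' format is a made-up placeholder.
        Returns True if the line was understood, False otherwise.
        """
        fields = line.split()
        if len(fields) != 2:
            return False
        job_id, status = fields
        self._status_by_job[job_id] = status
        return True

    def status_of(self, job_id):
        """Return the cached status, or None for an unknown job."""
        return self._status_by_job.get(job_id)

    def summary(self):
        """Count jobs per status without querying the LRMS again."""
        return Counter(self._status_by_job.values())


cache = JobStatusCache()
for log_line in ["job-1 queued", "job-2 queued", "job-1 running", "unparsable line here"]:
    cache.apply_log_line(log_line)
print(dict(cache.summary()))
```

Because the cache already holds every job's status, a periodic probe can report from it instead of issuing many LRMS queries, which is the point of the reduced intrusiveness.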
Integration with local monitoring systems (LEMON)
Grid monitoring integrated with local monitoring
The latest server version is very simple to install
The client installation can be enabled in the standard LCG middleware installation (no additional operations are needed)
The LEMON monitoring system and alarm management are integrated in the new version of the GridICE server
The local sensors currently used for farm monitoring can be interfaced with GridICE to collect all the available data
The back-end is implemented with LEMON
Local farm monitoring systems that use LEMON can be integrated with GridICE
LRMSinfo
The LRMS Info sensor provides aggregated information about the Local Resource Management System (LRMS)
We focus on the following categories of users:
VO manager
actual set of resources accessible to VO members: “How many jobs
submitted by my users are running or queued?” (with details of the
VOMS groups and/or single user)
Grid operator
all resources under responsibility of a Grid Operator Center (“How many
resources are available?”)
Site administrator
site resources offered to a Grid (“Is there any service down?”)
Grid users
The status of their jobs on a grid.
How do we identify the user/role?
Users are identified by the digital certificate installed in their browser
a valid CA certificate
server based on the HTTPS protocol
The new sensors are able to retrieve the VOMS information
VOMS information: groups and roles of the users submitting the jobs
The related role (e.g., site manager, VO manager) can be retrieved from the GridICE database
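The mapping from an authenticated certificate to what a user may see can be sketched as a lookup followed by a filter. This is a hypothetical sketch: the `roles_by_dn` dictionary stands in for the GridICE database, and the DNs, role names and job fields are invented for the example.

```python
def visible_jobs(user_dn, jobs, roles_by_dn):
    """Return the jobs a user may see, based on certificate DN and role.

    A user with no registered role is treated as a standard user and
    sees only his or her own jobs.
    """
    role = roles_by_dn.get(user_dn)
    if role is not None:
        kind, scope = role
        if kind == "vo_manager":
            return [job for job in jobs if job["vo"] == scope]
        if kind == "site_manager":
            return [job for job in jobs if job["site"] == scope]
    return [job for job in jobs if job["owner_dn"] == user_dn]


# Hypothetical certificate subjects and roles.
ALICE = "/C=IT/O=Example/CN=Alice"
BOB = "/C=IT/O=Example/CN=Bob"
CAROL = "/C=IT/O=Example/CN=Carol"
roles = {BOB: ("vo_manager", "cms"), CAROL: ("site_manager", "site-a")}

jobs = [
    {"id": 1, "owner_dn": ALICE, "vo": "cms", "site": "site-a"},
    {"id": 2, "owner_dn": ALICE, "vo": "atlas", "site": "site-b"},
    {"id": 3, "owner_dn": BOB, "vo": "cms", "site": "site-b"},
]

for dn in (ALICE, BOB, CAROL):
    print(dn, [job["id"] for job in visible_jobs(dn, jobs, roles)])
```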
“Standard user” monitoring (1)
A user who has no jobs submitted and no role registered
“Standard user” monitoring (2)
An authenticated user sees only his/her own jobs
“Standard user” monitoring (3)
An authenticated user sees only his/her own jobs
exit status = 0 => successful jobs
exit status <> 0 => failed jobs
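The exit-status rule above amounts to a simple partition of a user's jobs. A minimal sketch, with invented job records and a hypothetical `split_by_exit_status` helper:

```python
def split_by_exit_status(jobs):
    """Partition finished jobs: exit status 0 is success, anything else failure."""
    succeeded = [job for job in jobs if job["exit_status"] == 0]
    failed = [job for job in jobs if job["exit_status"] != 0]
    return succeeded, failed


# Hypothetical finished jobs of one user.
finished = [
    {"id": "job-1", "exit_status": 0},
    {"id": "job-2", "exit_status": 1},
    {"id": "job-3", "exit_status": 0},
]
ok_jobs, failed_jobs = split_by_exit_status(finished)
print(len(ok_jobs), "successful,", len(failed_jobs), "failed")
```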
Grid monitoring from the VO Manager perspective
Grid monitoring from the Site Manager perspective
Acronyms and Abbreviations (1):
ACL - Access Control List
APEL - Accounting Processor for Event Logs
API - Application Programming Interface
BDII - Berkeley Database Information Index
CA - Certificate Authority
CE - Computing Element: a Grid-enabled computing resource
CERN - European Organisation for Nuclear Research
dCache - disk pool management system
DN - Distinguished Name (X.500, LDAP)
EGEE - Enabling Grids for E-sciencE
FTS - File Transfer Service (EGEE)
GARR - Gruppo per l'Armonizzazione delle Reti della Ricerca
GGUS - Global Grid User Support
GIIS - Grid Index Information Service (also given as Grid Information Index Server). MDS index node. Aggregates information.
GILDA - Grid INFN Laboratory for Dissemination Activities
GRIS - Grid Resource Information Service. Collects information for MDS.
IN2P3 - Institut National de Physique Nucléaire et de Physique des Particules
INFN - Istituto Nazionale di Fisica Nucleare (in Italy)
ISO - International Organization for Standardization
JDL - Job Description Language
LB - Logging and Bookkeeping service
LEMON - LHC Era Monitoring
Acronyms and Abbreviations (2):
LCG - LHC Computing Grid
LDAP - Lightweight Directory Access Protocol
LDIF - LDAP Data Interchange Format
LDN - Logical Dataset Name
LFC - LCG File Catalog
LFN - Logical File Name
LHC - Large Hadron Collider. Under construction. Hosts CMS, ATLAS, and other experiments.
LRMS - Local Resource Management System
MDS - Meta Directory Service, or Monitoring and Discovery Service (Globus)
MPI - Message Passing Interface (Globus)
PhEDEx - Physics Experiment Data Export (CMS)
RFIO - Remote File I/O
R-GMA - Relational Grid Monitoring Architecture (EGEE). A monitoring system similar to MDS.
ROC - Regional Operations Centre
RLS - Replica Location Service
SE - Storage Element
SOAP - Simple Object Access Protocol
SRM - Storage Resource Management
VO - Virtual Organization, e.g., an experiment
VOBOX - VO box
VOMRS - Virtual Organization Management Registration Service
VOMS - Virtual Organization Membership Service
X.509 - ITU-T standard for public-key and attribute certificate frameworks
References
SAM:
http://goc.grid.sinica.edu.tw/gocwiki/SAME_Planning
https://lcg-sam.cern.ch:8443/sam/sam.py?sensors=CE&regions=
GridMap:
http://gridmap.cern.ch/gm/
http://cerncourier.com/cws/article/cnl/31986
GStat:
http://goc.grid.sinica.edu.tw/gstat/
GridView:
Portal: http://gridview.cern.ch/
TWiki: https://twiki.cern.ch/twiki/bin/view/LCG/GridView
GridICE:
http://gridice.forge.cnaf.infn.it/
Conclusions
There are several monitoring tools available for the Grid
Which tool should you use?
It depends on your role in the Grid
Sometimes you may need to use several tools at the same time to satisfy your needs
Thank You