A lightweight Monitoring and Accounting system for LHCb DC'04 production
V. Garonne, R. Graciani Díaz, J. J. Saborido Silva, M. Sánchez García, R. Vizcaya Carrillo



Outline

Manifesto
Monitoring
  Web interface
  Internals
Accounting
  Web interface
  Internals
Outlook
URLs


Manifesto

Monitoring and Accounting are tasks in DIRAC
  DIRAC is a production grid for LHCb

The Monitoring reports the status of jobs while in the WMS (Workload Management System)
  Instantaneous snapshot of the system
  No historic records

The Accounting records the status of jobs after leaving the WMS
  Provides historic record, accumulated statistics and evolution of recorded variables with time

Main users: production and site managers


Design choices

Monitoring
  Job information stored centrally in the WMS
  Info provided directly by the job and the WMS
  Passive services: no pushing of information
  No need for a common consumer API
  Job and Application state stored together

Accounting
  Separate infrastructure from the monitoring
  A job is never in both the Accounting and the Monitoring
  Domain specific: LHCb production jobs


Information Flow

[Diagram. Labels: DIRAC WMS; Users; Backend Services & Agents; Job; Job Heart-beat; Job Database; Monitoring (Read, Write); Web interface; Cleaner Agent; Accounting Database; Accounting (Write, Read); Web interface]


Monitoring Web Interface 1

Interface to query the monitoring service

Clicking a JobId pops up a window with the job details


Monitoring Web Interface 2

The overview shows predefined plots on the production
  Generated every few minutes

PyChart used as graphics engine
  100% Python
  Supports SVG

Plot: Running jobs by site


Monitoring Web Interface 3

Job status by site and production id


Monitoring Internals

It consists of an XML-RPC service exposing whatever parameters are known to DIRAC

Job parameters stored internally by DIRAC
  Primary parameters
    Execution site, job status, job owner, etc.
    Fixed, centrally defined: fast access
    Can query on them
  Secondary parameters
    Number of steps, internal job state, etc.
    Defined by the production job itself
    Stored as key-value pairs
    Slower access; cannot query on them
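The split between primary and secondary parameters can be illustrated with a small sketch. The slides do not describe the actual DIRAC schema, so the table names, column names and helper functions below are hypothetical, and SQLite is used only to keep the example self-contained. Primary parameters are fixed, indexed columns that can be searched; secondary parameters are free-form key-value rows that can only be read back for a known job.

```python
import sqlite3

# Hypothetical set of primary parameters (fixed, centrally defined).
PRIMARY = ("site", "status", "owner")


def create_store():
    """Create an in-memory store with one table per parameter kind."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE jobs (
            job_id INTEGER PRIMARY KEY,
            site   TEXT NOT NULL,
            status TEXT NOT NULL,
            owner  TEXT NOT NULL
        );
        CREATE INDEX idx_jobs_site_status ON jobs (site, status);
        CREATE TABLE job_parameters (
            job_id INTEGER NOT NULL REFERENCES jobs (job_id),
            name   TEXT NOT NULL,
            value  TEXT NOT NULL,
            PRIMARY KEY (job_id, name)
        );
    """)
    return db


def add_job(db, job_id, site, status, owner, **secondary):
    """Store primary parameters as columns, secondary ones as key-value rows."""
    db.execute("INSERT INTO jobs VALUES (?, ?, ?, ?)",
               (job_id, site, status, owner))
    db.executemany("INSERT INTO job_parameters VALUES (?, ?, ?)",
                   [(job_id, name, str(value))
                    for name, value in secondary.items()])


def select_jobs(db, **conditions):
    """Select job ids by primary parameters only."""
    unknown = set(conditions) - set(PRIMARY)
    if unknown:
        raise ValueError("cannot query on %s" % ", ".join(sorted(unknown)))
    # Column names come from the PRIMARY whitelist, values are bound.
    where = " AND ".join("%s = ?" % name for name in conditions) or "1"
    rows = db.execute(
        "SELECT job_id FROM jobs WHERE " + where + " ORDER BY job_id",
        tuple(conditions.values()))
    return [row[0] for row in rows]


def get_parameter(db, job_id, name):
    """Read one secondary parameter of a known job, or None if unset."""
    row = db.execute(
        "SELECT value FROM job_parameters WHERE job_id = ? AND name = ?",
        (job_id, name)).fetchone()
    return row[0] if row else None
```

In this sketch, select_jobs(db, site="DIRAC.CERN.ch", status="running") is accepted, while a search on a job-defined key such as Steps is rejected, mirroring the "cannot query on them" restriction.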


JMS basic API example

    # Python 2 client; monitoring_url is the URL of the monitoring service
    from xmlrpclib import ServerProxy
    server = ServerProxy(monitoring_url)

    # Retrieve list of jobs verifying some conditions
    conditions = {'Status': 'running', 'Site': 'DIRAC.CERN.ch'}
    jobreq = server.getJobs(conditions)

    # Print some parameters for each job
    if jobreq['Status']:
        for jobid in jobreq['Value']:
            print server.getJobSite(jobid)
            print server.getJobParameter(jobid, 'LocalBatchId')

    # Bulk operations
    summary = server.getJobsPrimarySummary(jobreq['Value'])

Timing annotations on the slide:
  ~3 s to select 95 out of 50k jobs
  ~0.7 s
  ~40 s
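The same sequence of calls can be tried without access to the real service. The sketch below is written for Python 3 (where xmlrpclib became xmlrpc.client) and talks to a local mock server. Only the four method names come from the slide; the mock implementations, the job table and the run_demo helper are assumptions made for illustration and say nothing about how DIRAC implements them.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical in-memory job table standing in for the job database.
JOBS = {
    "1": {"Status": "running", "Site": "DIRAC.CERN.ch", "LocalBatchId": "b-17"},
    "2": {"Status": "done", "Site": "DIRAC.CERN.ch", "LocalBatchId": "b-18"},
    "3": {"Status": "running", "Site": "DIRAC.USC.es", "LocalBatchId": "b-19"},
}


def getJobs(conditions):
    """Mock: return ids of the jobs matching every condition."""
    ids = [jobid for jobid, job in sorted(JOBS.items())
           if all(job.get(key) == value for key, value in conditions.items())]
    return {"Status": True, "Value": ids}


def getJobSite(jobid):
    """Mock: return the execution site of one job."""
    return JOBS[jobid]["Site"]


def getJobParameter(jobid, name):
    """Mock: return one named parameter of one job."""
    return JOBS[jobid][name]


def getJobsPrimarySummary(jobids):
    """Mock bulk call: one round trip for many jobs."""
    return {jobid: {key: JOBS[jobid][key] for key in ("Status", "Site")}
            for jobid in jobids}


def run_demo():
    """Start the mock server on a free local port and replay the slide's calls."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    for function in (getJobs, getJobSite, getJobParameter, getJobsPrimarySummary):
        server.register_function(function)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    proxy = ServerProxy("http://%s:%d" % (host, port))
    try:
        jobreq = proxy.getJobs({"Status": "running", "Site": "DIRAC.CERN.ch"})
        per_job = []
        if jobreq["Status"]:
            # One request per job and per parameter.
            for jobid in jobreq["Value"]:
                per_job.append((jobid,
                                proxy.getJobSite(jobid),
                                proxy.getJobParameter(jobid, "LocalBatchId")))
        # A single request covering all selected jobs.
        summary = proxy.getJobsPrimarySummary(jobreq["Value"])
        return per_job, summary
    finally:
        server.shutdown()
        server.server_close()
```

Calling run_demo() returns the per-job results and the bulk summary. The per-job loop issues two requests for every selected job, whereas the bulk call needs a single request regardless of how many jobs were selected.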


Accounting Web Interface 1

GUI for querying the Accounting

Shows results
  As graphics
  As table
  As Excel sheet

Several types of report
  Only a few shown here


Accounting Web Interface 2

Used resources by site


Accounting Web Interface 3

Used resources by event type
  MB/job
  CPU/job
  Failed jobs
  CPU vs. Exec time
  Input and Output data vs. CPU


Accounting Web Interface 4

Produced data by production ID
  Rates
  Cumulative
  Number of events
  GB of output


Accounting Web Interface 5

WMS statistics on DIRAC's performance

Plots
  Job execution time vs. WMS waiting time
  Job execution time vs. WMS matching time

Granularity
  Per site
  Per production
  Integral

Allows assessment of DIRAC's performance


Accounting Internals

Job and DIRAC statistics kept in a database
  Site contribution
  Data produced and used by jobs and steps
  Timing for jobs, steps and DIRAC internals

Separate XML-RPC interfaces to populate and query the accounting tables
  Both interfaces have restricted access

Jobs are moved to the accounting system by a cleaner agent after being validated
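One pass of such a cleaner agent might look like the sketch below. It is an illustration under stated assumptions, not the DIRAC implementation: both databases are in-memory SQLite stand-ins accessed directly rather than through the XML-RPC interfaces, and the table layout, the set of final states and the validation rule are all hypothetical.

```python
import sqlite3

# Assumed final states; the slides do not list the real ones.
FINAL_STATES = ("done", "failed")


def make_databases():
    """Create stand-ins for the job database and the accounting database."""
    job_db = sqlite3.connect(":memory:")
    job_db.execute("CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, "
                   "site TEXT, status TEXT, cpu_seconds REAL)")
    acc_db = sqlite3.connect(":memory:")
    acc_db.execute("CREATE TABLE job_records (job_id INTEGER PRIMARY KEY, "
                   "site TEXT, status TEXT, cpu_seconds REAL)")
    return job_db, acc_db


def is_valid(row):
    """Placeholder validation: a known site and a non-negative CPU time."""
    _job_id, site, _status, cpu_seconds = row
    return bool(site) and cpu_seconds is not None and cpu_seconds >= 0


def clean_once(job_db, acc_db):
    """Move validated, finished jobs to the accounting; return the moved ids."""
    rows = job_db.execute(
        "SELECT job_id, site, status, cpu_seconds FROM jobs "
        "WHERE status IN (?, ?) ORDER BY job_id", FINAL_STATES).fetchall()
    moved = []
    for row in rows:
        if not is_valid(row):
            continue  # leave the job in the job database for inspection
        # Write the accounting record first, so a crash cannot lose the job;
        # INSERT OR REPLACE keeps a retried pass from duplicating records.
        acc_db.execute("INSERT OR REPLACE INTO job_records VALUES (?, ?, ?, ?)",
                       row)
        acc_db.commit()
        job_db.execute("DELETE FROM jobs WHERE job_id = ?", (row[0],))
        job_db.commit()
        moved.append(row[0])
    return moved
```

Because the two stores are separate, this sketch cannot make the move atomic: between the two commits the record exists on both sides. It orders the operations so that a failure leaves a duplicate to be cleaned on the next pass rather than a lost job.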


Accounting Usage

About 10 hits per day

Time to generate daily static reports: 8 min
  60-70% of the time querying the database
  30-40% of the time in the drawing package

Server load < 0.2

Total: 169 kjobs
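As a worked reading of these figures, the 8 minutes can be split into absolute times. The helper name below is made up for the example; the inputs are the numbers quoted on the slide.

```python
def split_report_time(total_minutes, query_share):
    """Split a report-generation time into (query, drawing) minutes."""
    query_minutes = total_minutes * query_share
    return query_minutes, total_minutes - query_minutes


# Lower and upper bound of the quoted 60-70% database share.
for share in (0.60, 0.70):
    query, drawing = split_report_time(8.0, share)
    print("query share %.0f%%: %.1f min querying, %.1f min drawing"
          % (share * 100, query, drawing))
```

This gives roughly 4.8 to 5.6 minutes spent querying the database and 2.4 to 3.2 minutes in the drawing package per daily report run.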


Outlook

Monitoring page
  Transactions in monitoring updates
  Further optimisation (bulk operations...)
  Search for a faster rendering package
  Make the web page dynamic: fewer reloads

Accounting
  New report types
    Normalized CPU
    Contribution by country
    Rate by site, country, etc.


URLs

Monitoring page
  http://fpegaes1.usc.es/dmon/DC04/joblist.html
  Mirror on: http://lhcb02.usc.cesga.es/dmon/DC04/joblist.html

Direct link to overview pages
  http://lhcb.ecm.ub.es/DC04/Monitoring

Accounting page
  http://lhcb.ecm.ub.es/DC04/Accounting/