Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Page 1: Global ADC Job Monitoring Laura Sargsyan (YerPhI).

Global ADC Job Monitoring

Laura Sargsyan (YerPhI)

Page 2: Global ADC Job Monitoring Laura Sargsyan (YerPhI).

30/11/2010 ATLAS Software & Computing Workshop


Motivation

Provide an overview of job processing across ATLAS, regardless of the submission tool or execution backend.

Adapt CMS Job monitoring for the analysis users, preserving the content but improving visualization.

Page 3: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Architecture of Dashboard Job monitoring

Components shown in the diagram:

PanDA DB

Instrumented GANGA UI

Instrumented jobs

MSG (ActiveMQ)

Dashboard Collector

Dashboard Job Repository (Dashboard DB)

Historical view UI

Job summary UI

Analysis Job Monitoring UI

Page 4: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Main components (1)

Dashboard GANGA reporting: a GANGA plugin that publishes master job and subjob statuses and meta information to MSG

MSG: carries the published and consumed messages

Data collectors:
• msg-consume2db: listens to messages from MSG
• PanDA collector: retrieves data from the PanDA DB

Watchdog scripts: scheduled procedures that send alarms by SMS or e-mail when a problem occurs

DB scheduled services collector alarms cronjob scripts
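
The publish/consume path described above can be illustrated with a minimal sketch. It is not the Dashboard code: an in-process queue stands in for MSG (ActiveMQ), SQLite stands in for the Dashboard DB, and the names `publish_status`, `consume_to_db`, the `job` table and the message fields are all assumptions made for the example.

```python
import json
import queue
import sqlite3

# Stand-in for the MSG broker: an in-process queue (assumption, not ActiveMQ).
msg_queue = queue.Queue()


def publish_status(job_id, status, site):
    """Hypothetical reporter: serialise one job status update and publish it."""
    msg_queue.put(json.dumps({"job_id": job_id, "status": status, "site": site}))


def consume_to_db(conn):
    """Hypothetical collector: drain the queue, keeping the latest status per job."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job "
        "(job_id TEXT PRIMARY KEY, status TEXT, site TEXT)"
    )
    stored = 0
    while True:
        try:
            raw = msg_queue.get_nowait()
        except queue.Empty:
            break  # nothing left to consume
        msg = json.loads(raw)
        # A later message for the same job overwrites the earlier row.
        conn.execute(
            "INSERT OR REPLACE INTO job (job_id, status, site) VALUES (?, ?, ?)",
            (msg["job_id"], msg["status"], msg["site"]),
        )
        stored += 1
    conn.commit()
    return stored


conn = sqlite3.connect(":memory:")
publish_status("1234.0", "submitted", "SITE_A")
publish_status("1234.0", "running", "SITE_A")
publish_status("1234.1", "submitted", "SITE_B")
n_stored = consume_to_db(conn)
rows = conn.execute("SELECT job_id, status FROM job ORDER BY job_id").fetchall()
```

Keeping one row per job and overwriting it on each status message is only one possible design; a real collector might instead keep the full status history.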

Page 5: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Main components (2)

Data repository: DB triggers populate the database tables with data from GANGA job reporting and the PanDA DB

Web Application layer: provides the HTTP entry point to the data and exposes it in different formats (JSON, XML, CSV)

User interfaces: provide a user-centric view
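
As an illustration of exposing the same data in several formats, the sketch below renders one set of rows as JSON, CSV or XML using only the Python standard library. The function name `render`, the element names `jobs`/`job` and the sample fields are assumptions for the example, not the Dashboard web layer.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def render(rows, fields, fmt):
    """Serialise a list of dict rows in the requested format (hypothetical helper)."""
    if fmt == "json":
        return json.dumps(rows)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    if fmt == "xml":
        root = ET.Element("jobs")
        for row in rows:
            job = ET.SubElement(root, "job")
            for name in fields:
                ET.SubElement(job, name).text = str(row[name])
        return ET.tostring(root, encoding="unicode")
    raise ValueError("unsupported format: " + fmt)


fields = ["site", "status", "jobs"]
sample = [{"site": "SITE_A", "status": "finished", "jobs": 10}]
as_json = render(sample, fields, "json")
as_csv = render(sample, fields, "csv")
as_xml = render(sample, fields, "xml")
```

In a web application the format would typically be chosen from the request (for example an `Accept` header or a query parameter); how the Dashboard selects it is not stated in the slides.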

Page 6: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Implementation

Importing PanDA data into the schema, which is validated and tuned for monitoring purposes

Instrumentation of GANGA jobs submitted via WMS, local submission and CREAM CE for MSG reporting

Data from MSG is collected in the monitoring repository

As a result, all ATLAS job monitoring data, for both analysis and production, is collected in a common monitoring schema

Setting up aggregation procedures for data accounting

Adapting the CMS dashboard interactive and accounting user interfaces for ATLAS (adding sorting and filtering by cloud)
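
A minimal sketch of what an aggregation step for accounting could look like: job records are counted per time bucket, cloud and status. The record fields (`end_time`, `cloud`, `status`), the cloud name and the function names are assumptions for the example; the real procedures run inside the database and are not shown in the slides.

```python
from collections import Counter
from datetime import datetime


def bucket_start(ts, granularity):
    """Truncate a timestamp to the start of its hourly or daily bucket."""
    if granularity == "hourly":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "daily":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError("unsupported granularity: " + granularity)


def summarise(jobs, granularity="hourly"):
    """Count jobs per (bucket, cloud, status); the keys are hypothetical."""
    counts = Counter()
    for job in jobs:
        key = (bucket_start(job["end_time"], granularity), job["cloud"], job["status"])
        counts[key] += 1
    return counts


jobs = [
    {"end_time": datetime(2010, 11, 30, 10, 5), "cloud": "CLOUD_A", "status": "finished"},
    {"end_time": datetime(2010, 11, 30, 10, 55), "cloud": "CLOUD_A", "status": "finished"},
    {"end_time": datetime(2010, 11, 30, 11, 1), "cloud": "CLOUD_A", "status": "failed"},
]
hourly = summarise(jobs, "hourly")
daily = summarise(jobs, "daily")
```

Pre-computing such summaries is what lets an accounting UI filter by cloud without scanning individual job rows on every request.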

Page 7: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Job Summary (1)

Interactive view: what is going on right now in ATLAS job processing

Aimed at different types of users:
• individual scientists using the Grid for data analysis
• user support teams
• site admins
• VO managers
• managers of different computing projects

Job Summary enables very flexible access to recent monitoring data and shows the job processing of a VO at run-time

Page 8: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Job Summary (2)

http://dashb-atlas-jobdev.cern.ch/dashboard/request.py/jobsummary

Page 9: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Analysis Job Monitoring (1)

Collects and exposes a user-centric set of information to the user regarding submitted tasks.

Focused on the user's perspective.

Offers a wide selection of graphical plots.

User-driven development.

Provides a consistent way of following a user’s analysis jobs regardless of the submission tool.

Detailed information on twiki:

https://twiki.cern.ch/twiki/bin/view/ArdaGrid/TaskMonitoringWebUI

The “Analysis Job Monitoring” web interface will be presented today in the “Distributed Analysis” session by Jakub Moscicki

Page 10: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Analysis Job monitoring (2)

meta information

http://dashb-atlas-jobdev.cern.ch/templates/client/index.html

Includes:
• Full bookmarking capability
• Working 'refresh' capability
• “Breadcrumbs” navigation element
• Easy search
• History support

“Time period” selection: either a from-till interval or a time range

Page 11: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Analysis Job monitoring (3)

Resubmission history

Link to the PanDA monitoring page for each (PanDA) job

Page 12: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Historical views

Functionality:
• Number of terminated, submitted, pending, running jobs
• Distribution of failed jobs by failure codes/reasons/categories
• CPU/wall-clock consumption, efficiency as CPU versus wall-clock time
• Processed events: number of processed events as a function of time, CPU/wall-clock time spent on a single event
• Resource utilization, number of used slots, efficiency of site usage compared to pledges
• Activities at the site: single-site view with job processing metrics. Data transfer distributions will be added soon.

All data can be filtered by site, activity, cloud

Any time range can be selected

Available granularities are hourly/daily/weekly/monthly

All data is available in machine-readable format

All plots are available via direct link
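
The efficiency and per-event metrics listed above reduce to simple ratios. The sketch below shows one way to compute them over a set of jobs; the field names (`cpu_s`, `wall_s`, `events`) and the choice to sum over jobs before dividing are assumptions for the example, not the Dashboard's definitions.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is not positive."""
    if denominator <= 0:
        return None
    return numerator / denominator


def job_metrics(jobs):
    """Aggregate CPU efficiency and time per event over a list of job records."""
    cpu = sum(job["cpu_s"] for job in jobs)
    wall = sum(job["wall_s"] for job in jobs)
    events = sum(job["events"] for job in jobs)
    return {
        "efficiency": ratio(cpu, wall),          # CPU time versus wall-clock time
        "cpu_per_event": ratio(cpu, events),     # CPU seconds spent per event
        "wall_per_event": ratio(wall, events),   # wall-clock seconds per event
    }


jobs = [
    {"cpu_s": 1800, "wall_s": 3600, "events": 100},
    {"cpu_s": 900, "wall_s": 1800, "events": 50},
]
metrics = job_metrics(jobs)
```

Summing before dividing weights long jobs more heavily than averaging per-job efficiencies would; which convention the historical views use is not stated in the slides.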

Page 13: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Historical views

Terminated, Submitted, Pending, Running Jobs

Click the appropriate button to create a plot

http://dashb-atlas-job-dev.cern.ch/dashboard/request.py/dailysummary

Granularity: Hourly, Daily, Monthly

Time Range: 24 h, 48 h, week, month, custom

Click on the plot to zoom in and out

Page 14: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Historical views

Status of Terminated jobs

Chosen parameters: All T1 + T0; Time Range: 48 hours; Granularity: hourly

Chosen parameters: All T1 + T0; Time Range: 48 hours; Granularity: hourly; sorted by activities (production jobs)

Click on the header of the plot to get links to the machine-readable format or a direct link to the plot

Page 15: Global ADC Job Monitoring Laura Sargsyan (YerPhI).

Historical views

Failed/Aborted jobs. Error codes

Chosen parameters: All T1 + T0; Time Range: 48 hours; Granularity: hourly

2 kinds of failure:
• application (transExitCode)
• GRID (pilot, brokerage, ddm, jobDispatcher, supervisor, execution, taskBuffer)

Application failures should be grouped by component (e.g. site, user, application)
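
The split into application and GRID failures can be sketched as a classification over per-job error-code fields. The dictionary keys below reuse the names listed on the slide, but the record layout, the rule that a non-zero code means failure, and the precedence of GRID codes over `transExitCode` are all assumptions made for the example.

```python
from collections import Counter

# GRID-side error sources as named on the slide; used here as hypothetical dict keys.
GRID_CODE_FIELDS = (
    "pilot", "brokerage", "ddm", "jobDispatcher",
    "supervisor", "execution", "taskBuffer",
)


def failure_kind(job):
    """Return ("grid", source), ("application", "transExitCode") or None."""
    for field in GRID_CODE_FIELDS:
        if job.get(field, 0):
            return ("grid", field)  # assumption: GRID codes take precedence
    if job.get("transExitCode", 0):
        return ("application", "transExitCode")
    return None  # no non-zero code found


def application_failures_by(jobs, component):
    """Group application failures by a component such as "site" or "user"."""
    counts = Counter()
    for job in jobs:
        kind = failure_kind(job)
        if kind is not None and kind[0] == "application":
            counts[job[component]] += 1
    return counts


jobs = [
    {"site": "SITE_A", "transExitCode": 65},
    {"site": "SITE_A", "transExitCode": 40},
    {"site": "SITE_B", "pilot": 1099, "transExitCode": 0},
    {"site": "SITE_B"},
]
kinds = [failure_kind(job) for job in jobs]
by_site = application_failures_by(jobs, "site")
```

The numeric codes in the sample are placeholders and carry no meaning here.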

Page 16: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Milestones and achievements

The monitoring plugin for ATLAS GANGA users has been publishing job information to MSG since 17/05/2010

All ATLAS job monitoring data has been collected in the common schema in real time since 26/09/2010

Aggregation procedures feeding the summary DB tables have been in place since 8/10/2010

Interactive and accounting UIs have been available to the ATLAS community since 11/10/2010

Historical views now contain data imported from the PanDA archive, starting from 1/01/2010.

Page 17: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Historical views

Job monitoring data starting from 1/01/2010

Page 18: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Page 19: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Plans for the next year

Migrate the DB to the production server after validation by ATLAS (January 2011)

Improve performance of the Interactive UI (January 2011)

Add user interface for the production shifters (February-March 2011)

Page 20: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Effort and sustainability

Developers:
• Laura Sargsyan (ATLAS)
• Julia Andreeva (IT)
• Edward Karavakis (IT)
• Lukasz Kokoszkiewicz (IT)
• Jakub Moscicki (IT)

All applications (apart from the data collectors and the analysis users' web interface) are shared with CMS. CERN IT ES provides support for these applications.

Page 21: Global ADC Job Monitoring Laura Sargsyan (YerPhI).


Waiting for your feedback

E-mail:[email protected]@cern.ch