Http://monalisa.caltech.edu Monitoring, Accounting and Automated Decision Support for the ALICE...

22
http://monalisa.caltech.ed http://alien.cern.ch Monitoring, Accounting and Automated Decision Support for the ALICE Experiment Based on the MonALISA Framework Catalin Cirstoiu, Costin Grigoras, Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Latchezar Betev, Alexandru Costan, Iosif Legrand Iosif Legrand 25/06/2007 25/06/2007 HPDC 2007 Workshop on Grid Monitoring HPDC 2007 Workshop on Grid Monitoring Monterey, California Monterey, California

Transcript of Http://monalisa.caltech.edu Monitoring, Accounting and Automated Decision Support for the ALICE...

http://monalisa.caltech.edu

http://alien.cern.ch

Monitoring, Accounting and Automated Decision Support

for the ALICE Experiment Based on the MonALISA Framework

Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Catalin Cirstoiu, Costin Grigoras, Latchezar Betev, Alexandru Costan, Iosif LegrandAlexandru Costan, Iosif Legrand

25/06/200725/06/2007

HPDC 2007 Workshop on Grid MonitoringHPDC 2007 Workshop on Grid Monitoring

Monterey, CaliforniaMonterey, California

25/06/2007 [email protected] 2

Contents

Monitoring requirements MonALISA overview Application monitoring Monitoring architecture in AliEn

Jobs monitoring Traffic monitoring Services monitoring Nodes monitoring

Actions framework Feature snapshots

25/06/2007 [email protected] 3

Monitoring Requirements

Global view of the entire distributed system Least-intrusive As accurate as possible Best-effort data transport Minimizing the requirements for open ports

Providing Near real-time information Long-term history of aggregated data

On key parameters like System status Resource usage

Helping with Correlating events System debugging Generating reports

Taking automated actions based on the monitored data

25/06/2007 [email protected] 4

Data Store

MonALISA Overview MonALISA is a dynamic distributed framework Collects any type of information from different systems Aggregates and analyzes it in near-real time Provides support for automated control decisions and global

optimization of workflows in complex distributed systems.

Data CacheService & DB

Configuration Control (SSL)

Predicates & Agents

Data (via ML Proxy)

Applications Java Client(other service)

Agents Filters DataModules

WS Client(other service)

WebService

WSDLSOAP

LookupService

LookupService

Registration

Discovery

Postgres MySQL

25/06/2007 [email protected] 5

ML Discovery System & Services Hierarchical structure of loosely coupled services Independent & autonomous entities able to

Publish their existence Discover other available Jini-enabled services Use a dynamic set of proxies to cooperate with them

Network of JINI-LUSs Secure & Public

MonALISA services

Proxies

Clients, Repositories, HL services

Global Services orGlobal Services orClientsClients

Dynamic load balancing Dynamic load balancing Scalability & ReplicationScalability & ReplicationSecurity AAA for ClientsSecurity AAA for Clients

Distributed System Distributed System for gathering and for gathering and Analyzing InformationAnalyzing Information

Distributed Dynamic Distributed Dynamic Discovery-based on a lease Discovery-based on a lease Mechanism and RENMechanism and REN

Agents

25/06/2007 [email protected] 6

ApMon – Application Monitoring

MonALISAService

MonALISAService

ApMon

ApMon

APPLICATION

APPLICATION

MonitoringData

UDP/XDR

Mbps_out: 0.52 Status: reading

App. Monitoring

MB_inout: 562.4

ApMonConfig

parameter1: value parameter2: value

App. Monitoring

...

Time;IP;procIDMonitoring

Data

UDP/XDR

MonitoringData

UDP/XDR

load1: 0.24 processes: 97

System Monitoring

pages_in: 83

MonALISA

hosts

Config Servlet dynamic reloading

ApMon configuration generated automatically by a servlet / CGI script

0

10

20

30

40

50

60

70

0 1000 2000 3000 4000 5000 6000

Messages per second

Mo

nA

LIS

A C

PU

Usa

ge

(%)

No Lost Packages

Lightweight library of APIs (C, C++, Java, Perl, Python) that can be used to send any information to MonALISA Services

High comm. performance Flexible Accounting Sys Mon

25/06/2007 [email protected] 7

Long HistoryDB

Monitoring architecture in AliEn

http://pcalimonitor.cern.ch:8889/LCG Tools

MonALISA @Site

ApMon

AliEn Job Agent

ApMon

AliEn Job Agent

ApMon

AliEn Job Agent

MonALISA @CERN

MonALISA

LCG Site

ApMon

AliEn CE

ApMon

AliEn SE

ApMon

ClusterMonitor

ApMon

AliEn TQ

ApMon

AliEn Job Agent

ApMon

AliEn Job Agent

ApMon

AliEn Job Agent

ApMon

AliEn CE

ApMon

AliEn SE

ApMon

ClusterMonitor

ApMon

AliEn IS

ApMon

AliEn Optimizers

ApMon

AliEn Brokers

ApMon

MySQLServers

ApMon

CastorGridScripts

ApMon

APIServices

MonaLisaMonaLisaRepositoryRepository

Aggregated Data

rss

vsz

cpu

time

run

tim

e

job

slots

free

spac

e

nr.

of

file

s

op

en

files

Queued

JobAgents

cpu

ksi2k

jobstatus

disk

used

pro

cesses

loadn

etIn

/ou

t

jobsstatussockets

migratedmbytes

active

sessions

MyP

roxy

status

Alerts

Actions

25/06/2007 [email protected] 8

Job status monitoring

Global summaries For each/all conditions For each/all sites For each/all users Running & cumulative

Error status From job agents From central services Real-time map view Integrated pie charts History plots

25/06/2007 [email protected] 9

Real-time Map

25/06/2007 [email protected] 10

Integrated Pie Charts

25/06/2007 [email protected] 11

History Plots, Annotations

25/06/2007 [email protected] 12

Job Resource Usage

Cumulative parameters CPU Time & CPU KSI2K Wall time & Wall KSI2K Read & written files Input & output traffic (xrootd)

Running parameters Resident memory Virtual memory Open files Workdir size Disk usage CPU usage

Aggregated per site

25/06/2007 [email protected] 13

Job Network Traffic

Based on the xrootd transfer from every job

Aggregated statistics for Sites (incoming, outgoing,

site to site, internal) Storage Elements

(incoming, outgoing) Of

Read and written files Transferred MB/s

25/06/2007 [email protected] 14

Individual job tracking

Based on AliEn shell cmds. top, ps, spy, jobinfo, masterjob

Using the GUI ML Client Status, resource usage, per job

25/06/2007 [email protected] 15

AliEn & LCG Services monitoring

AliEn services Periodically checked PID check + SOAP call Simple functional tests SE space usage Efficiency

LCG environment and tools Integrating the VoBOX tests previously run by ML within the SAM framework

Proxy lifetime, gsiscp, LCG CE/SE, Job submission, BDII, Local catalog, software area etc. Error messages in case of failure Efficiency ML Alerts are used for problems notification

.

25/06/2007 [email protected] 16

FTD/FTS Monitoring

Status of the transfers Transfer rates Success/failures Efficiency via ARDA

Experiment Dashboard

25/06/2007 [email protected] 17

VOBox/Head node monitoring

Machine parameters, real-time & history Load, memory & swap usage, processes, sockets

25/06/2007 [email protected] 18

Actions framework

Based on monitoring information, actions can be taken in ML Service ML Repository

Actions can be triggered by Values above/below given

thresholds Absence/presence of values Correlation between multiple

values Possible actions types

Alerts e-mail Instant messaging RSS Feeds

External commands Event logging

ML ML RepositoryRepository

ML ServiceML Service

ML ServiceML Service

Actions based onActions based onglobal informationglobal information

Actions based onActions based onlocal informationlocal information

• Traffic• Jobs• Hosts• Apps

• Temperature• Humidity• A/C Power• …

SensorsSensors Local Local decisionsdecisions

Global Global decisionsdecisions

25/06/2007 [email protected] 19

Alerts and actions

MySQL daemon is automatically restartedwhen it runs out of memoryTrigger: threshold on VSZ memory usage

ALICE Production jobs queue is automaticallykept full by the automatic resubmissionTrigger: threshold on the number of aliprod waiting jobs

Administrators are kept up-to-date on the services’ statusTrigger: presence/absence of monitored information

25/06/2007 [email protected] 20

Fact figures Raw parameters when running 4K Jobs

Unique data series: 300K with frequency 1-15 minutes Message rate: 16K / minute

Site aggregated parameters Message rate: 2K / minute Bandwidth rate: 300Kbps

Repository DB Size: 70 GB ~ 700M records Data Reduction Schema

2 Months with 2 minutes bins 1 Year with 30 minutes bins ~Forever with 2 hours bins

Response time App -> ML Service – network speed ML Service -> ML Clients

Subscribed parameters – network speed One shot requests (history requests) – ~5 seconds

Repository dynamic history requests – ~300ms / page No incoming ports are required

25/06/2007 [email protected] 21

Summary

The MonALISA framework is used as a primary monitoring tool for the ALICE Grid since 2004

Presently the system is used for monitoring of all (identified) services, jobs and network parameters necessary for the Grid operation and debugging

The add-on tools for automatic events notification allow for more efficient reaction to problems

The framework design and flexibility answers all requirements for a monitoring system

The accumulated information allows to construct and implement automated decision making algorithms, thus increasing further the efficiency of the Grid operations

25/06/2007 [email protected] 22

Thank you!

Questions?

http://alien.cern.ch http://monalisa.caltech.edu