Consolidation of Grid operations

Costin Grigoras, ALICE Offline

Preamble

In the period of steady LHC operation, Grid usage is constant and high and, as foreseen, the Grid is used for massive RAW and MC production and also (quite successfully) for end-user analysis

To help Grid users and administrators, many applications were developed in the early years of the Grid. ALICE has made an effort to consolidate all of these into a coherent set of monitoring and control tools

The following presentation is a quick overview of some of them

2010-10-19

Central production management - LPM

Speed is of the essence: RAW reconstruction promptly follows data taking, allowing for immediate QA and physics analysis

LPM (Lightweight Production Manager)

Several triggers to ensure RAW and conditions data integrity

Fully automatic

Also handles replication of RAW data to T1

Manages not only Pass1, but all central RAW and MC productions and the organized analysis trains

Up to now, 360 production cycles have been handled by LPM

Dependent tasks - LPM chains

Data processing jobs which must be launched only when a previous process has successfully completed

For example, the QA tasks are ‘cascaded’ after the Pass1 RAW reco. is completed

Same for AOD production, data merging

The depth of cascading is unlimited

Considerably speeds up data production!
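
The cascading of dependent tasks can be pictured as a small dependency scheduler. The following is only a minimal sketch: the `Task` class, the `ready_tasks` and `run_chain` helpers and the task names are hypothetical and are not taken from LPM itself.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One step of a production chain (hypothetical model, not LPM's own)."""
    name: str
    depends_on: list = field(default_factory=list)  # names of prerequisite tasks
    state: str = "PENDING"  # PENDING -> DONE in this simplified sketch


def ready_tasks(tasks):
    """Return pending tasks whose prerequisites have all completed."""
    done = {t.name for t in tasks if t.state == "DONE"}
    return [t for t in tasks
            if t.state == "PENDING" and all(d in done for d in t.depends_on)]


def run_chain(tasks):
    """Launch tasks wave by wave; a wave starts only after its parents finish."""
    waves = []
    while True:
        wave = ready_tasks(tasks)
        if not wave:
            break
        for t in wave:
            t.state = "DONE"  # a real system would submit the job and wait here
        waves.append([t.name for t in wave])
    return waves


chain = [
    Task("pass1_reco"),
    Task("qa", depends_on=["pass1_reco"]),
    Task("aod", depends_on=["pass1_reco"]),
    Task("qa_merging", depends_on=["qa"]),
    Task("aod_merging", depends_on=["aod"]),
]
waves = run_chain(chain)
print(waves)
```

Tasks in the same wave have no dependency on each other, so they could be started in parallel; the depth of the chain is limited only by the task list.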

LPM chains logic

[Diagram: Reco. (1 job/chunk); when complete, start in parallel: QA (1 job/chunk), QA merging, Delete partial output; Merge ROOT tags; AOD (1 job/chunk), AOD merging, Delete partial output; Resubmit error jobs]

The same mechanism is also used for Monte Carlo productions and analysis trains on MC and RAW data

LPM chains logic: example

Parallel productions are possible

With different weights / priorities

Branches can be temporarily disabled

Tasks can be simple JDLs or more complex code that prepares the execution (creating collections, checking conditions)
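
One way to picture weighted parallel productions is a proportional split of the free job slots among the enabled branches. The slides do not describe how LPM applies weights, so the proportional rule, the `share_slots` function and the branch names below are assumptions made purely for illustration.

```python
def share_slots(free_slots, branches):
    """Split free job slots among enabled branches in proportion to weight.

    `branches` maps a branch name to (weight, enabled). Both the data layout
    and the proportional rule are assumptions made for this sketch.
    """
    active = {name: w for name, (w, enabled) in branches.items()
              if enabled and w > 0}
    total = sum(active.values())
    if total == 0:
        return {}
    shares = {name: free_slots * w // total for name, w in active.items()}
    # Hand out the slots lost to integer rounding, heaviest branch first.
    leftover = free_slots - sum(shares.values())
    for name in sorted(active, key=active.get, reverse=True)[:leftover]:
        shares[name] += 1
    return shares


branches = {
    "raw_pass1": (3, True),
    "mc_production": (1, True),
    "analysis_train": (2, False),  # temporarily disabled branch
}
print(share_slots(100, branches))
```

A disabled branch simply receives no slots until it is re-enabled; its weight is kept so that it resumes with the same share.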

Integration of Grid status monitoring

Monitoring data (MonALISA) is used to trigger the LPM activity

New jobs are submitted when the number of waiting tasks falls below a threshold

Pre-staging of data from tape is triggered before the reconstruction jobs are submitted

Running jobs are tracked individually for resource usage

Automatic alerts in case of unreasonable disk/memory/CPU consumption; jobs can be terminated…
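
The threshold-driven submission and the pre-staging step can be sketched as one decision taken per monitoring cycle. This is an illustration under stated assumptions: `plan_cycle`, its arguments and the chunk names are hypothetical, and the real trigger logic is not shown in the slides.

```python
def plan_cycle(waiting_jobs, threshold, pending, staged):
    """Decide what to do on one monitoring cycle (hypothetical sketch).

    waiting_jobs: number of tasks currently waiting in the queue
    threshold:    submit only when waiting_jobs is below this value
    pending:      ordered list of chunks still to be reconstructed
    staged:       set of chunks already recalled from tape to disk
    Returns (to_stage, to_submit).
    """
    if waiting_jobs >= threshold:
        return [], []  # queue is full enough, nothing to do
    budget = threshold - waiting_jobs
    candidates = pending[:budget]
    # Chunks already on disk can be submitted now; the others must be
    # pre-staged first and are picked up in a later cycle.
    to_submit = [c for c in candidates if c in staged]
    to_stage = [c for c in candidates if c not in staged]
    return to_stage, to_submit


pending = ["chunk1", "chunk2", "chunk3", "chunk4"]
staged = {"chunk1", "chunk3"}
to_stage, to_submit = plan_cycle(7, 10, pending, staged)
print("stage:", to_stage, "submit:", to_submit)
```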

Resource usage alerts

Trigger now at 2 GB RSS

Mail sent to both admins and the user
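
A check of this kind could look like the sketch below. Only the 2 GB RSS trigger and the recipients (admins and the user) come from the slide; the function name, the job record layout, the placeholder mail addresses and the reading of 2 GB as 2 × 1024³ bytes are assumptions.

```python
RSS_LIMIT_BYTES = 2 * 1024**3  # 2 GB trigger; binary units are an assumption
ADMINS = ["grid-admins@example.org"]  # placeholder address


def rss_alert(job, limit=RSS_LIMIT_BYTES, admins=ADMINS):
    """Return an alert message for a job above the RSS limit, else None.

    `job` is a dict with hypothetical keys: id, owner, rss_bytes.
    """
    if job["rss_bytes"] <= limit:
        return None
    used_gb = job["rss_bytes"] / 1024**3
    limit_gb = limit / 1024**3
    return {
        "to": list(admins) + [job["owner"]],  # both admins and the user
        "subject": f"Job {job['id']} exceeds the RSS limit",
        "body": f"RSS {used_gb:.2f} GB > {limit_gb:.2f} GB",
    }


heavy = {"id": 1001, "owner": "user@example.org", "rss_bytes": 3 * 1024**3}
light = {"id": 1002, "owner": "user@example.org", "rss_bytes": 512 * 1024**2}
print(rss_alert(heavy))
print(rss_alert(light))
```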

Opportunistic storage discovery

A client-to-storage metric allows the automatic discovery of the closest (working) storage elements from every job

Based on the network topology information collected by MonALISA

Continuous functional tests of storages

SE occupancy status

Users specify the number of output replicas and type of storage (disk, custodial), but not the SEs
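
The selection described above can be sketched as filtering and ranking storage elements for a given job. The slides do not give the actual metric, so the ordering used here (network distance first, then free space), the `StorageElement` fields and the SE names are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class StorageElement:
    name: str
    kind: str             # "disk" or "custodial"
    distance: float       # network distance from the job (smaller is closer)
    working: bool         # outcome of the latest functional test
    free_fraction: float  # 0.0 (full) to 1.0 (empty)


def pick_storages(elements, replicas, kind):
    """Pick the closest working SEs of the requested type (hypothetical rule)."""
    usable = [se for se in elements
              if se.working and se.kind == kind and se.free_fraction > 0]
    # Closest first; among equally close SEs prefer the emptier one.
    usable.sort(key=lambda se: (se.distance, -se.free_fraction))
    return [se.name for se in usable[:replicas]]


elements = [
    StorageElement("SE_A", "disk", 1.0, True, 0.40),
    StorageElement("SE_B", "disk", 0.5, False, 0.90),  # failing functional tests
    StorageElement("SE_C", "disk", 2.0, True, 0.70),
    StorageElement("SE_D", "custodial", 0.8, True, 0.50),
    StorageElement("SE_E", "disk", 1.0, True, 0.80),
]
print(pick_storages(elements, replicas=2, kind="disk"))
```

The user supplies only `replicas` and `kind`; the SE names are resolved at job time, so a storage that stops working drops out of the candidates without any change on the user side.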

[Figure panels: France, Italy, Nordic Countries, Russia, USA]

User catalogue and job management

Web-based access to the AliEn catalogue (with certificate authentication)

Insert your favorite plugin (ROOT) here

Catalogue browser – view and edit

Viewer with syntax highlighting and catalogue links

SE discovery syntax is highlighted

Jobs management

Full job tracking, with submission and resubmission capabilities

Jobs management

Detailed view of a particular masterjob

All trace logs can be accessed online

Summary

The Grid has been in full production mode for almost one year

Its operation is very successful, providing millions of CPU days and PBs of storage

To use these resources efficiently: consolidated tools

Thanks a lot for your attention!

http://alimonitor.cern.ch/

Questions please?