Costin Grigoras, ALICE Offline
Consolidation of Grid operations
Preamble
In the period of steady LHC operation, Grid usage is constant and high and, as foreseen, the Grid is used for massive RAW and MC production and also (quite successfully) for end-user analysis.
To help Grid users and administrators, many applications were developed in the early years of the Grid. ALICE has made an effort to consolidate all of these into a coherent set of monitoring and control tools.
The following presentation is a quick overview of some of them.
2010-10-19
Central production management - LPM
Speed is of the essence: RAW reconstruction promptly follows data taking, allowing for immediate QA and physics analysis
LPM (Lightweight Production Manager)
Several triggers assure RAW and conditions data integrity
Fully automatic
Also handles replication of RAW to T1
Manages not only Pass1, but all central RAW and MC productions and the organized analysis trains
Up to now, 360 production cycles have been handled by LPM
Dependent tasks - LPM chains
Data processing jobs which must be launched only when a previous process has successfully completed
For example, the QA tasks are 'cascaded' after Pass1 RAW reco. is completed
Same for AOD production and data merging
The depth of cascading is unlimited
Speeds up the data production considerably!
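The cascading described above can be illustrated with a minimal sketch. This is not LPM code: the `Chain` class, its method names and the task names are hypothetical, and the only behaviour modelled is that a dependent task is launched when its parent completes successfully.

```python
from collections import defaultdict


class Chain:
    """Toy model of cascaded tasks: a task starts once its parent succeeds."""

    def __init__(self):
        self.children = defaultdict(list)  # parent task -> dependent tasks
        self.launched = []                 # launch order, kept for inspection

    def add(self, task, after=None):
        """Register `task`; if `after` is given, it waits for that parent."""
        if after is None:
            self.launched.append(task)     # root tasks start immediately
        else:
            self.children[after].append(task)

    def on_done(self, task, ok=True):
        """Called when `task` finishes; on success, launch its dependents."""
        if not ok:
            return []                      # a failed parent triggers nothing
        started = self.children.pop(task, [])
        self.launched.extend(started)
        return started


# Hypothetical chain: QA and AOD follow reconstruction, merging follows QA.
chain = Chain()
chain.add("reco_pass1")
chain.add("qa", after="reco_pass1")
chain.add("aod", after="reco_pass1")
chain.add("qa_merging", after="qa")
```

Because each task only knows its parent, chains of any depth can be built by repeated calls to `add`.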
LPM chains logic
[Diagram of a task chain. Node labels: Reco. (1 job/chunk); QA (1 job/chunk); QA merging; Delete partial output; Merge ROOT tags; AOD (1 job/chunk); AOD merging; Delete partial output; Resubmit error jobs. Edge label: "When complete, start in parallel".]
The same mechanism is also used for Monte Carlo productions and analysis trains on MC and RAW data
LPM chains logic: example
Parallel productions are possible
With different weights / priorities
Branches can be temporarily disabled
Tasks can be simple JDLs or more complex code that prepares the execution (creating collections, checking conditions)
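The slide does not say how weights are applied, so the following is only one plausible reading: job slots are split among the enabled productions in proportion to their weight, and disabled branches receive nothing. The function name, the dictionary layout and the production names are all assumptions made for illustration.

```python
def share_slots(productions, slots):
    """Split `slots` among enabled productions in proportion to their weight.

    Integer division is used, so a few slots may remain unassigned.
    """
    active = {
        name: p["weight"] for name, p in productions.items() if p["enabled"]
    }
    total = sum(active.values())
    if total == 0:
        return {}                          # nothing enabled, nothing to share
    return {name: slots * w // total for name, w in active.items()}


# Hypothetical productions; the analysis train is temporarily disabled.
productions = {
    "raw_pass1": {"weight": 3, "enabled": True},
    "mc_cycle": {"weight": 1, "enabled": True},
    "analysis_train": {"weight": 2, "enabled": False},
}
shares = share_slots(productions, 100)
```

Re-enabling a branch only requires flipping its flag; the shares are recomputed on the next call.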
Integration of Grid status monitoring
Monitoring data (MonALISA) is used to trigger the LPM activity
New jobs are submitted when the number of waiting tasks passes below a threshold
Pre-staging of data from tape is triggered before the reconstruction jobs are submitted
Running jobs are tracked individually for resource usage
Automatic alerts in case of unreasonable disk/memory/CPU consumption; jobs can be terminated…
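The threshold rule for submitting new jobs can be sketched as follows. The function and parameter names are hypothetical, and the idea of topping the queue up to a `target` level is an assumption; the slides only state that submission is triggered when the number of waiting tasks passes below a threshold.

```python
def jobs_to_submit(waiting, threshold, target):
    """Return how many jobs to submit for the current waiting-task count.

    Nothing is submitted while `waiting` is at or above `threshold`;
    below it, the queue is topped up to `target` (assumed >= threshold).
    """
    if waiting >= threshold:
        return 0
    return target - waiting


# Hypothetical numbers: refill to 1000 waiting tasks once below 200.
idle = jobs_to_submit(waiting=500, threshold=200, target=1000)
refill = jobs_to_submit(waiting=150, threshold=200, target=1000)
```

In a real setup the `waiting` count would come from the monitoring data rather than being passed in by hand.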
Resource usage alerts
Trigger now at 2 GB RSS
Mail sent to both admins and the user
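A minimal sketch of such an alert rule is shown below. The job record layout, the function name and the addresses are hypothetical, and interpreting 2 GB as 2 × 1024³ bytes is an assumption.

```python
RSS_LIMIT_BYTES = 2 * 1024**3  # 2 GB trigger level (binary GB is assumed)


def alert_recipients(job, admins, limit=RSS_LIMIT_BYTES):
    """Return who should be mailed about `job`, or [] if it is within limits."""
    if job["rss_bytes"] <= limit:
        return []
    # Both the administrators and the job owner are notified.
    return list(admins) + [job["owner"]]


# Hypothetical job records.
heavy = {"owner": "user@example.org", "rss_bytes": 3 * 1024**3}
light = {"owner": "user@example.org", "rss_bytes": 512 * 1024**2}
recipients = alert_recipients(heavy, ["admin@example.org"])
```

Keeping the limit as a parameter makes it easy to change the trigger level without touching the rule itself.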
Opportunistic storage discovery
A client-to-storage metric allows the automatic discovery of the closest (working) storage elements from every job
Based on the network topology information collected by MonALISA
Continuous functional tests of storages
SE occupancy status
Users specify the number of output replicas and type of storage (disk, custodial), but not the SEs
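One way to picture the discovery step is shown below, as a sketch only. How the real metric combines topology, functional tests and occupancy is not given in the slides. Here the test result and free space act as filters and the network distance as the sort key, which is an assumption. All names and values are hypothetical.

```python
def pick_storage(elements, distance, replicas, kind):
    """Pick the `replicas` closest working SEs of the requested `kind`."""
    candidates = [
        se for se in elements
        if se["working"] and se["kind"] == kind and se["free_fraction"] > 0
    ]
    # Smaller distance means closer to the job, as seen from the client.
    candidates.sort(key=lambda se: distance[se["name"]])
    return [se["name"] for se in candidates[:replicas]]


# Hypothetical storage elements and client-to-storage distances.
elements = [
    {"name": "SE_A", "kind": "disk", "working": True, "free_fraction": 0.4},
    {"name": "SE_B", "kind": "disk", "working": False, "free_fraction": 0.9},
    {"name": "SE_C", "kind": "disk", "working": True, "free_fraction": 0.1},
    {"name": "SE_D", "kind": "custodial", "working": True, "free_fraction": 0.7},
]
distance = {"SE_A": 2.0, "SE_B": 0.5, "SE_C": 1.0, "SE_D": 0.2}
chosen = pick_storage(elements, distance, replicas=2, kind="disk")
```

The caller supplies only the replica count and storage type, mirroring the point that users do not name the SEs themselves.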
User catalogue and job management
Web-based access to the AliEn catalogue (with certificate authentication)
Insert your favorite plugin (ROOT) here
Catalogue browser – view and edit
Viewer with syntax highlighting and catalogue links
SE discovery syntax is highlighted
Job management
Full job tracking, with submission and resubmission capabilities
Summary
The Grid has been in full production mode for almost one year
Its operation is very successful, providing millions of CPU days and PBs of storage
To efficiently use these resources, consolidated tools