The LEGO Train Framework

Posted on 12-Sep-2021




Andrei Gheata

Costin Grigoras

Jan Fiete Grosse-Oetringhaus

The LEGO Framework - Jan Fiete Grosse-Oetringhaus 2

Idea

• Manage trains using MonALISA

– Users register wagons

– Train operators compose trains

• Automatic testing per wagon

• Train file generation

• Submission managed by MonALISA (ML) using the existing LPM infrastructure

• Merging managed by LPM

• Aim: let operators run analysis trains easily (~weekly), with output available within 1-2 days


Configuration & Testing

• Train configuration

– New class AliAnalysisTaskCfg

– Contains a description of the wagons (AddTask macro, libraries, dependencies)

– See talk by Andrei on Monday

• Testing

– Uses the alientest04 machine

– Downloads AliEn packages (ROOT, AliRoot)

– Copies part of the input dataset to the local machine

– Runs tests per wagon

– Uses syswatch to extract memory/CPU information

– Also tests an empty "base line" task

[Figure: a train of wagons: Base line, Phys Sel (physics selection), Centr Sel (centrality selection), User A, User B, User C]
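A train can be pictured as an ordered list of wagons in which each wagon runs after the wagons it depends on, while the empty base-line task gives a reference for per-wagon resource usage. The Python sketch below illustrates both ideas under stated assumptions: the `Wagon` fields only mirror the description above (AddTask macro, libraries, dependencies) and are not the AliAnalysisTaskCfg interface; the macro file names, the dependency of user wagons on the selection wagons, and the measurement keys (`mem_mb`, `cpu_s`) are hypothetical; and subtracting the base line is one plausible way to use it, not necessarily what the test machine does.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Wagon:
    # Hypothetical stand-in for a wagon description; NOT the AliAnalysisTaskCfg API.
    name: str
    add_task_macro: str
    libraries: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)


def train_order(wagons: list[Wagon]) -> list[str]:
    """Order wagons so that each one comes after the wagons it depends on."""
    graph = {w.name: set(w.dependencies) for w in wagons}
    # static_order() raises graphlib.CycleError if dependencies are circular.
    return list(TopologicalSorter(graph).static_order())


def overhead_vs_baseline(measurements: dict[str, dict[str, float]],
                         baseline: str = "Base line") -> dict[str, dict[str, float]]:
    """Subtract the empty base-line task's usage from every other wagon's usage."""
    base = measurements[baseline]
    return {
        name: {key: value - base[key] for key, value in usage.items()}
        for name, usage in measurements.items()
        if name != baseline
    }


if __name__ == "__main__":
    wagons = [
        # Hypothetical macro names, for illustration only.
        Wagon("User A", "AddTaskUserA.C", dependencies=["Centr Sel"]),
        Wagon("Centr Sel", "AddTaskCentrSel.C", dependencies=["Phys Sel"]),
        Wagon("Phys Sel", "AddTaskPhysSel.C"),
    ]
    # A simple chain, so the only valid order is Phys Sel, Centr Sel, User A.
    print(train_order(wagons))
```

Using a topological sort lets wagons be registered in any order while still rejecting circular dependencies.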


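Workflow

[Diagram: User, Train operator, MonALISA, Test machine, LPM and AliEn, exchanging config, test results and train files]

1. User adds wagons

2. Train operator composes the train

3. MonALISA generates test files and executes the test on the test machine

4. Train operator recomposes the train after the test

5. MonALISA generates the train JDL and scripts

6. LPM runs the train on AliEn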
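The workflow steps form a loop: a train can be recomposed and retested any number of times before its JDL and scripts are generated. A minimal Python sketch of that lifecycle as a state machine follows; the state names and the rule that a recomposed train must be tested again before files are generated are assumptions read off the diagram, not MonALISA's actual implementation.

```python
from enum import Enum, auto


class TrainState(Enum):
    # Hypothetical states, numbered after the workflow steps.
    WAGONS_ADDED = auto()     # 1. users add wagons
    COMPOSED = auto()         # 2. operator composes the train (4. recomposing returns here)
    TESTED = auto()           # 3. test files generated and test executed
    FILES_GENERATED = auto()  # 5. train JDL and scripts generated
    RUNNING = auto()          # 6. train runs


# Allowed transitions: after a test the operator either recomposes or proceeds.
ALLOWED = {
    TrainState.WAGONS_ADDED: {TrainState.COMPOSED},
    TrainState.COMPOSED: {TrainState.TESTED},
    TrainState.TESTED: {TrainState.COMPOSED, TrainState.FILES_GENERATED},
    TrainState.FILES_GENERATED: {TrainState.RUNNING},
    TrainState.RUNNING: set(),
}


def advance(current: TrainState, nxt: TrainState) -> TrainState:
    """Move to the next state, refusing transitions the workflow does not allow."""
    if nxt not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt


def replay(path: list[TrainState]) -> TrainState:
    """Apply a sequence of states starting from path[0]; return the final state."""
    state = path[0]
    for nxt in path[1:]:
        state = advance(state, nxt)
    return state
```

For example, replaying WAGONS_ADDED, COMPOSED, TESTED, COMPOSED, TESTED, FILES_GENERATED, RUNNING succeeds, while jumping from COMPOSED straight to FILES_GENERATED raises `ValueError`, which encodes the rule that every composition is tested before submission.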


Screenshot

[Annotated screenshot with panels: handler configuration, wagon configuration, data configuration, testing and running status]


Handler


Wagon


Dataset


Run


Syswatch


Operator Workflow

1. Select dataset

2. Select wagon

3. Start testing

4. Inspect output


Operator Workflow (2)

– Status of analysis

– Status of merging

– Intermediate merging steps

– Submit final merge job (to be automated)

– Final merging status

– Check output
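Intermediate merging steps followed by a single final merge job amount to a staged, tree-like reduction. The Python sketch below assumes that each job output can be represented as a mapping from histogram bin to count and that merging means adding counts; the chunk size and the function names are hypothetical and are not LPM parameters.

```python
from collections import Counter


def merge(outputs):
    """Merge outputs by adding bin contents (a stand-in for merging histograms)."""
    total = Counter()
    for out in outputs:
        total.update(out)  # update() adds counts and keeps zero-count bins
    return dict(total)


def staged_merge(outputs, chunk_size=10):
    """Intermediate merging steps followed by one final merge.

    Returns the merged output and the number of intermediate stages.
    """
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2")
    level = list(outputs)
    stages = 0
    # Intermediate steps: merge groups of chunk_size until few enough remain.
    while len(level) > chunk_size:
        level = [merge(level[i:i + chunk_size]) for i in range(0, len(level), chunk_size)]
        stages += 1
    # Final merge job: combine the remaining partial results.
    return merge(level), stages
```

With 25 outputs and `chunk_size=3`, for instance, the intermediate stages reduce 25 partial results to 9 and then to 3 before the final merge; the result is the same as merging everything at once.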


Demo…

• Enough theory, let's do some clicking…

http://alimonitor.cern.ch/trains


Some More Details

• Train runs with an analysis tag

– All code and the "AddTask" macro have to be in the tag (no par file!)

• Output per run is stored in the input data directory (like the AOD and QA trains), e.g.: /alice/data/2010/LHC10h/000137366/ESDs/pass2/PWG4/CorrelationTrain/7_20111117_1350

• All merged runs are found in /alice/cern.ch/user/a/alitrain/PWG4/CorrelationTrain/7_20111117_1350/merge
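The two example paths follow a visible pattern. The Python sketch below rebuilds them from their components, assuming (from this single example only) that the run number is zero-padded to nine digits and that the last directory, 7_20111117_1350, is a train run number followed by the start date and time (YYYYMMDD_HHMM). The function names are hypothetical, and the real LPM naming rules may differ.

```python
from datetime import datetime


def train_run_tag(run_number: int, started: datetime) -> str:
    # Assumed format: <train run number>_<YYYYMMDD>_<HHMM>
    return f"{run_number}_{started:%Y%m%d_%H%M}"


def per_run_output_dir(year: int, period: str, run: int, pass_name: str,
                       pwg: str, train: str, tag: str) -> str:
    # Output sits next to the input ESDs of the run; run number padded to 9 digits.
    return f"/alice/data/{year}/{period}/{run:09d}/ESDs/{pass_name}/{pwg}/{train}/{tag}"


def merged_output_dir(pwg: str, train: str, tag: str) -> str:
    # Merged results of all runs live under the alitrain user directory.
    return f"/alice/cern.ch/user/a/alitrain/{pwg}/{train}/{tag}/merge"


if __name__ == "__main__":
    tag = train_run_tag(7, datetime(2011, 11, 17, 13, 50))
    print(per_run_output_dir(2010, "LHC10h", 137366, "pass2", "PWG4", "CorrelationTrain", tag))
    print(merged_output_dir("PWG4", "CorrelationTrain", tag))
```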


Operations

• After 10-12 h most jobs are done (~90-98%)

– A few are still running, a few are waiting

– This situation can persist for days, which is a killer for merging the output

– Solutions

• Kill jobs that have waited longer than X (being tested at the LPM level; would be better as a JDL tag)

• Remove the CE requirement after a certain time (to be implemented; thanks to Latchezar for the idea)

• Merge jobs have the same tail of a few jobs that wait a long time

– Ideas: same as above, or run them on any CE (problem with splitting; Pablo is investigating)

• Output available after ~2 days

– 25% of the (real) time is spent running

– 75% in merging

– I believe this can still be improved!
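The two proposed remedies for the long tail of waiting jobs can be combined into one per-job decision. The Python sketch below is hypothetical: it assumes a job record carrying its status and its waiting time, and uses two made-up thresholds (`relax_after_h` for removing the CE requirement, `kill_after_h` for the slide's X). Neither the names nor the ordering of the two actions come from the LPM or the JDL.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    DROP_CE_REQUIREMENT = "drop CE requirement"
    KILL = "kill"


@dataclass
class Job:
    status: str           # hypothetical status strings, e.g. "WAITING", "RUNNING", "DONE"
    waiting_hours: float  # time the job has spent waiting so far


def tail_action(job: Job, relax_after_h: float, kill_after_h: float) -> Action:
    """Decide what to do with a job in the tail of a train (hypothetical policy)."""
    if relax_after_h >= kill_after_h:
        raise ValueError("relax_after_h must be smaller than kill_after_h")
    if job.status != "WAITING":
        return Action.KEEP
    if job.waiting_hours > kill_after_h:
        return Action.KILL
    if job.waiting_hours > relax_after_h:
        # First allow the job to run on any CE before giving up on it.
        return Action.DROP_CE_REQUIREMENT
    return Action.KEEP
```

Relaxing before killing means a job is only dropped after it has had a chance to start elsewhere.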


Operations (visually…)

[Figure: three panels (analysis jobs, merging jobs, analysis jobs) showing the numbers of waiting, running, done and error jobs vs. hours since submission; annotations: "here we kill the remaining ones", "80% done in 4 hours"]
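A figure such as "80% done in 4 hours" can be read off the per-job completion times. The Python sketch below assumes that, for each finished job, the hours between submission and completion are known, together with the total number of submitted jobs; the function name and its inputs are hypothetical. Percentages are passed as integers so that the number of required jobs is computed without floating-point rounding.

```python
from typing import Optional


def hours_to_percent_done(done_after_h: list[float], n_jobs: int,
                          percent: int) -> Optional[float]:
    """Hours since submission at which `percent`% of the n_jobs jobs were done.

    done_after_h holds completion times (hours since submission) of finished jobs
    only; jobs that are still waiting, running, or in error are simply absent.
    Returns None if that percentage was never reached.
    """
    if not 0 <= percent <= 100 or n_jobs <= 0:
        raise ValueError("need 0 <= percent <= 100 and n_jobs > 0")
    needed = -(-percent * n_jobs // 100)  # integer ceiling of percent * n_jobs / 100
    if needed == 0:
        return 0.0
    times = sorted(done_after_h)
    if needed > len(times):
        return None
    return times[needed - 1]
```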


Current Trains

• Four active beta testers

– Jets (Christian KB)

– D2H (Zaida)

– Correlations in pp (Eva)

– Correlations in PbPb (JF)

• We got a lot of feedback and improved the system


TODO

• Graphs of CPU time, wall time and memory consumption of user tasks as a function of the AliRoot tag

• Some improvements in the web interface

• Automatic launching of the final merge job
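The planned CPU/wall/memory graphs need the per-wagon measurements grouped by AliRoot tag. A minimal Python sketch of that grouping follows; the `Measurement` record, its units and the tag strings used in testing are hypothetical, and plotting is left out.

```python
from collections import defaultdict
from statistics import mean
from typing import NamedTuple


class Measurement(NamedTuple):
    # Hypothetical record of one test or train measurement for one wagon.
    wagon: str
    aliroot_tag: str
    cpu_s: float
    wall_s: float
    mem_mb: float


def consumption_by_tag(measurements):
    """Average CPU time, wall time and memory per wagon and AliRoot tag.

    Returns {wagon: {tag: {"cpu_s": ..., "wall_s": ..., "mem_mb": ...}}}, with tags
    in the order they first appear in the input.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for m in measurements:
        grouped[m.wagon][m.aliroot_tag].append(m)
    return {
        wagon: {
            tag: {
                "cpu_s": mean(x.cpu_s for x in ms),
                "wall_s": mean(x.wall_s for x in ms),
                "mem_mb": mean(x.mem_mb for x in ms),
            }
            for tag, ms in by_tag.items()
        }
        for wagon, by_tag in grouped.items()
    }
```

Tags are kept in arrival order rather than sorted, since tag strings do not necessarily sort chronologically.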


Documentation

• Mailing list (for operators)

– alice-analysis-train-operators@cern.ch

• TWiki (users + operators)

– https://twiki.cern.ch/twiki/bin/viewauth/ALICE/AnalysisTrains