CREAM: ALICE Experience
WLCG GDB Meeting, CERN, 11th November 2009
Stefano Bagnasco (INFN-Torino), Jean-Michel Barbet (Subatech), Latchezar Betev (ALICE), Catalin Condurache (RAL), Sergio Fantinel (INFN-Legnaro), Stefano Lusso (INFN-Torino), Patricia Méndez Lorenzo (CERN, IT/GS), Francesco Noferini (INFN-CNAF), Derek Ross (RAL) and Massimo Sgaravatto (CREAM development team, INFN-Padova)



Thanks

This talk includes feedback and contributions from:
  Subatech: Jean-Michel Barbet
  INFN-Torino: Stefano Lusso and Stefano Bagnasco
  INFN-CNAF: Francesco Noferini
  RAL-LCG2: Catalin Condurache and Derek Ross
  INFN-Legnaro: Sergio Fantinel
  INFN-Padova and CREAM CE developers team: Massimo Sgaravatto


CREAM-CE: Deployment status

Current CREAM-CE service
  Production version: CREAM1.5 (glite-CREAM-3.1.20)
  Deployed in production by the 6th of October (patch #3259 for SLC4/i386)
  https://savannah.cern.ch/patch/?func=detailitem&item_id=3259#options

Features: important bug and security fixes (pointed out by the GSVG)
  http://www.gridpp.ac.uk/gsvg/advisories/advisory-55615.txt
  http://www.gridpp.ac.uk/gsvg/advisories/advisory-55616.txt

Migration of sites to CREAM1.5 was strongly encouraged at that time, and ALICE fully supports it for all sites providing this service to the experiment.

Outlook of this talk:
  At the last GDB (14/10/09) we listed all the CREAM-CE issues reported by the site admins, based on the experience gained with the ALICE production.
  Now, one month of operations later, we have collected feedback from several sites already using CREAM1.5.


CREAM-CE: Future version

Future CREAM-CE service
  Production version: CREAM1.6
  Status: ready for certification/certified (expected) by December 2009
  TASK #9734: https://savannah.cern.ch/task/?9734
  PATCHES #3179 and #3209
    https://savannah.cern.ch/patch/?3179 Release 1.6 of CREAM CE for sl5_x86_64
    https://savannah.cern.ch/patch/?3209 YAIM-CREAM-CE for release 1.6 of CREAM CE

Features: many of the issues reported at the last GDB (and not included in CREAM1.5) will now be solved.


CREAM-CE: site admins and developers reports (I)

Purge issues:
  ALICE REPORT: Wrong report of job status. CREAM's view of the running jobs becomes desynchronized.
  ALICE REQUIREMENT: A method to purge jobs in a non-terminal status.
  CREAM STATUS: CREAM job status can be wrongly reported because of misconfigurations or because of these two bugs in the BLAH BLParser, both candidates for CREAM1.6:
    BUG #55078: « Possible final state not considered in BLParserPBS and BUpdaterPBS »
      CURRENT STATUS: Integration candidate, included in patch #3179
    BUG #54949: « Some job can remain in running state when BLParser is restarted for both lsf and pbs »
      CURRENT STATUS: Integration candidate, included in patch #3179
  A specific bug covers the ALICE requirement:
    BUG #55420: « Allow admin to purge CREAM jobs in a non terminal status » (Solution Status: in progress)
      CURRENT STATUS: Integration candidate, included in patch #3179
  CURRENT RISK FOR ALICE: Low now that the developers have provided site admins with the corresponding purge script (very high before).


CREAM-CE: site admins and developers feedback (I)

Purge issues: site admin reports
  Desynchronization issues have not been observed recently at sites running CREAM1.5.
  Several sites have used the script created by the CREAM developers to purge the CREAM DB manually.
    Very good feedback on this tool.
    It requires, however, a manual operation, and the purge criteria vary from site to site.
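
Since the purge criteria differ from site to site, it may help to see what such a criterion could look like. The following is a minimal sketch, not the CREAM developers' purge script: it selects jobs that CREAM still reports in a non-terminal status, that the local batch system no longer knows, and that are older than a site-chosen age. The `CreamJob` record, the set of statuses treated as terminal, and `select_purge_candidates` are assumptions made for illustration; how the two job lists are obtained from CREAM and from the batch system is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: statuses treated as terminal; adjust to the site's CREAM version.
TERMINAL_STATUSES = {"DONE-OK", "DONE-FAILED", "CANCELLED", "ABORTED"}


@dataclass(frozen=True)
class CreamJob:
    """Hypothetical record of one job as reported by CREAM."""
    cream_id: str
    batch_id: str
    status: str
    last_update: datetime


def select_purge_candidates(cream_jobs, batch_ids, now, min_age):
    """Return jobs CREAM reports as active but the batch system does not know.

    cream_jobs: iterable of CreamJob
    batch_ids:  set of job IDs currently known to the local batch system
    min_age:    site-specific criterion; younger jobs are never selected
    """
    candidates = []
    for job in cream_jobs:
        if job.status in TERMINAL_STATUSES:
            continue  # terminal jobs are left to the normal purge
        if job.batch_id in batch_ids:
            continue  # still known to the batch system: consistent
        if now - job.last_update < min_age:
            continue  # too recent; the status may simply not be updated yet
        candidates.append(job)
    return candidates


if __name__ == "__main__":
    now = datetime(2009, 11, 11, 12, 0)
    jobs = [
        CreamJob("CREAM001", "1001.batch", "RUNNING", now - timedelta(days=3)),
        CreamJob("CREAM002", "1002.batch", "RUNNING", now - timedelta(hours=1)),
        CreamJob("CREAM003", "1003.batch", "DONE-OK", now - timedelta(days=5)),
    ]
    known_to_batch = {"1002.batch"}
    for job in select_purge_candidates(jobs, known_to_batch, now, timedelta(days=1)):
        print(job.cream_id, job.status)
```

The function only produces a list of candidates; whether and how they are actually purged stays a manual decision of the site admin.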


CREAM-CE: site admins and developers report (II)

DISK SPACE issues: areas to monitor and purge or clean
  ALICE REPORT: The local MySQL DB grew to 2.5 GB.
  CREAM STATUS: Issue associated with the MySQL engine. When entries are deleted from the DB, the corresponding disk space is not released (so the CREAM DB does not shrink on disk), but the space is reused when new data are added to the DB.
  CURRENT RISK FOR ALICE: low

  ALICE REPORT: Purge of the input sandboxes in /opt/glite/var/cream_sandbox
  CREAM STATUS: Solved in CREAM1.5
    #48144: « Problems with purge in CREAM when the mapped group name is different than the VO name »
  RISK FOR ALICE: none once sites upgrade to CREAM1.5


CREAM-CE: site admins and developers feedback (II)

Disk space issues: site admins report
  Growth of the local MySQL DB
    Some tables in the DB are still growing.
  Purge of the input sandboxes in the /opt/glite/var/cream_sandbox area
    The sandbox auto-purge procedure included in CREAM1.5 is now working fine (outputs are purged after 10 days).
    No further issues with the sandbox purge have been observed by the site admins after the migration to CREAM1.5.
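
A site that wants to double-check the 10-day sandbox purge could list what is still lying around after that period. The sketch below is only an illustration under stated assumptions: it is not CREAM's auto-purge procedure, it makes no assumption about the internal layout of the sandbox area and looks only at the entries directly under the given directory, and `find_old_entries` is a hypothetical helper name. It reports entries and deletes nothing.

```python
import os
import time

SECONDS_PER_DAY = 24 * 60 * 60


def find_old_entries(root, max_age_days, now=None):
    """Return paths directly under `root` not modified for more than `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    old = []
    with os.scandir(root) as entries:
        for entry in entries:
            # follow_symlinks=False: judge the link itself, not its target.
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                old.append(entry.path)
    return sorted(old)


if __name__ == "__main__":
    sandbox_root = "/opt/glite/var/cream_sandbox"  # path quoted in the slides
    if os.path.isdir(sandbox_root):
        for path in find_old_entries(sandbox_root, max_age_days=10):
            print("older than 10 days:", path)
```

An empty listing suggests the auto-purge is keeping up; a long one is a hint to look at the purge configuration rather than a reason to delete by hand.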


CREAM-CE: site admins and developers report (III)

DISK SPACE issues (cont.)
  ALICE REPORT: Issues regarding /opt/glite/var/log and /var/log
  ALICE REQUIREMENT: A cleaning policy is required for these files; otherwise they can grow forever.
  CREAM STATUS: Policies exist for all these files and can be customized file by file:
    Only the blah accounting log files are out of the CREAM developers' control (files cannot be deleted before they have been processed by the accounting system).
    For /opt/glite/var/log/glite-ce-cream.log and /opt/glite/var/log/glite-ce-monitor.log, the policy is defined in /var/lib/tomcat5/webapps/ce-cream/WEB-INF/classes/log4j.properties and the default values can be changed. Relevant info at: http://grid.pd.infn.it/cream/field.php?n=Main.KnownIssues
    For /opt/glite/var/log/glite-xxxparser.log, the policy is available in /opt/logrotate.d/glite-xxxparser
    /etc/logrotate.d/globus-gridftp manages the gridftp log files under /var/log
  RISK FOR ALICE: low, since the size is manageable by site admins
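
Whatever rotation policy is chosen, a site may still want a simple check that the log areas stay within bounds. The following is a rough sketch of such a check, assuming nothing beyond the two directory names quoted above; the helper names and the 1 GiB limit are example choices, not values recommended by the CREAM developers.

```python
import os


def directory_size_bytes(root):
    """Sum the sizes of all regular files below `root` (symlinks not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                pass  # file rotated or removed while scanning
    return total


def over_limit(sizes, limit_bytes):
    """Return the paths whose size exceeds `limit_bytes`, largest first."""
    big = [(size, path) for path, size in sizes.items() if size > limit_bytes]
    return [path for size, path in sorted(big, reverse=True)]


if __name__ == "__main__":
    # Paths quoted in the slides; the 1 GiB limit is an arbitrary example value.
    watched = ["/opt/glite/var/log", "/var/log"]
    sizes = {p: directory_size_bytes(p) for p in watched if os.path.isdir(p)}
    for path in over_limit(sizes, limit_bytes=1024 ** 3):
        print("over limit:", path, sizes[path])
```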


CREAM-CE: site admins and developers report (IV)

DISK SPACE issues (cont.)
  ALICE REPORT: Issues regarding /opt/glite/var/cream/user_proxy
  CREAM STATUS: Bug reported and accepted; the fix is not available in CREAM1.5
    #49497: « User proxies on CREAM do not get cleaned up »
  CURRENT STATUS: Already solved; it will be included in CREAM1.6 (bug fix implementation still pending)
  CREAM developers could increase the priority of this bug if needed.

DISK SPACE issues: site admins report
  No issues observed by the sites in the last month.


CREAM-CE: site admins and developers report (V)

LOAD issues (reported by Subatech):
  ALICE REPORT: UNIX load going up to 5 (during start-up or at high submission rates)
  CREAM STATUS: Problem reported by GRNET; its origin was a missing index in the CREAM DB
    #52876: « The extra attribute table in the CREAM DB has no key/indexes defined »
  CURRENT STATUS: solved in CREAM1.5
  RISK FOR ALICE: low once the CREAM version is upgraded

  ALICE REPORT: When tomcat is restarted, the system can take up to 15 min before submitting new jobs
  CREAM STATUS: The slow start of CREAM is also due to the problems caused by jobs reported in a wrong status
    #51978: « CREAM can be slow to start » (bug in progress)
  CURRENT STATUS: not included in CREAM1.5 but will be released in CREAM1.6
  RISK FOR ALICE: Purge actions should speed up this start-up and therefore decrease the risk for the experiment


CREAM-CE: site admins and developers feedback (V)

Load issues: site admins report
  Increase of the UNIX load
    Reported by Subatech, still visible at the site.
    Load increases during automatic purge operations. Also visible during high job submission rates.
    Site admin report: At this site CREAM is running in a VMware VM, and the load might be due to poor MySQL performance in such an environment. A MySQL slowdown could increase the UNIX load.
    CREAM-CE developers report: Issue tracked in bug #58103. According to the GRNET report "CREAM performance report", very heavy queries are performed during purge operations.
      CURRENT STATUS: Fix already committed to CVS; it will be released with the next CREAM1.6. The developers have not yet assessed how much this fix reduces the load.
    Report from Legnaro: After the queues were closed, the load increased without saturating the CPU (60% CPU load) for about 12h. The issue seems to come from the ALICE submissions, which continued although the queues were closed.
  Tomcat restart slows down the submission of jobs
    Solved in CREAM1.6
    No further reports from the site admins


Some other interesting feedback

We asked the site admins for: requirements for the system maintenance
  Since the last update, sites spend much less time monitoring CREAM-CE.
  What remains is basically keeping control of the disk space and of the consistency between the jobs reported by CREAM and the local batch system (Subatech).
  In some cases, the baby-sitting needed at the site is almost negligible (Legnaro and CNAF).
  Issues observed at RAL before the upgrade seem to be gone after the deployment of CREAM1.5, i.e., the Tomcat-related issues are solved with this new version.


Some other interesting feedback (II)

We also asked the site admins for: monitoring applied to the system at the sites
  In some cases (Subatech and RAL) the site uses Nagios, also with specific probes:
    gLite-LB-logd and tomcat daemons
    User_nbfiles: number of files used for ALICE production
    Inactive_jobs: jobs not consuming CPU
    Open_file_desc: number of file descriptors used
  Standard fabric monitoring (Ganglia) for Legnaro
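
To give an idea of what a probe such as Open_file_desc might look like, here is a minimal sketch in the usual Nagios plugin style (exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN, plus a one-line message). It is not the probe used at Subatech or RAL: the function names, the thresholds in the usage note, and the choice of counting descriptors through /proc (Linux only) are all assumptions made for illustration.

```python
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measured value to a Nagios exit code using two thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def count_open_fds(pid):
    """Count open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir("/proc/%d/fd" % pid))


def run_probe(pid, warn, crit):
    """Return (exit_code, one-line message) in the usual plugin output style."""
    try:
        used = count_open_fds(pid)
    except OSError as exc:
        return UNKNOWN, "UNKNOWN - cannot read descriptors of pid %d: %s" % (pid, exc)
    code = classify(used, warn, crit)
    message = "%s - %d file descriptors in use | open_fds=%d" % (LABELS[code], used, used)
    return code, message
```

A wrapper script would call, for example, `run_probe(pid, warn=800, crit=950)` with the PID of the daemon to watch, print the message and exit with the returned code; the threshold values here are placeholders to be tuned per site.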


Some other interesting feedback (III)

In addition, CNAF reported:
  blparser is not automatically restarted at boot time (only tomcat is). Blparser has to be restarted by hand in order to recover the queue info.
  Developers feedback: issue included in bug #56518
  CURRENT STATUS: Fix already committed; it will be provided in CREAM1.6

Finally, INFN-Torino feedback:
  Running CREAM-CE for one day
  Stefano Lusso has reported the added value of the script CheckCreamConf.pl, used at the site to set variables: http://grid.pd.infn.it/cream/field.php?n=Main.CheckYourCREAMCEConfiguration


Summary

ALICE stresses again its strong interest in the general deployment of CREAM-CE.
A vibrant and very involved user community provides helpful feedback.
Fantastic quality of developer support and advice.
ALICE and the sites involved want a fast version certification and deployment cycle.
Time is very short, data is coming.
