CREAM: ALICE Experience
WLCG GDB Meeting, CERN, 11th November 2009
Stefano Bagnasco (INFN-Torino), Jean-Michel Barbet (Subatech), Latchezar Betev (ALICE), Catalin Condurache (RAL), Sergio Fantinel
(INFN-Legnaro), Stefano Lusso (INFN-Torino), Patricia Méndez Lorenzo (CERN, IT/GS), Francesco Noferini (INFN-CNAF), Derek Ross
(RAL) and Massimo Sgaravatto (CREAM development team, INFN-Padova)
Thanks
This talk includes feedback and contributions from:
  Subatech: Jean-Michel Barbet
  INFN-Torino: Stefano Lusso and Stefano Bagnasco
  INFN-CNAF: Francesco Noferini
  RAL-LCG2: Catalin Condurache and Derek Ross
  INFN-Legnaro: Sergio Fantinel
  INFN-Padova and CREAM CE developers team: Massimo Sgaravatto
11/11/09CREAM: ALICE Experience 2
CREAM-CE: Deployment status
Current CREAM-CE service
  Production version: CREAM1.5 (glite-CREAM-3.1.20)
  Deployed in production by the 6th of October (patch #3259 for SLC4/i386)
    https://savannah.cern.ch/patch/?func=detailitem&item_id=3259#options
  Features: important bug and security fixes (pointed out by the GSVG)
    http://www.gridpp.ac.uk/gsvg/advisories/advisory-55615.txt
    http://www.gridpp.ac.uk/gsvg/advisories/advisory-55616.txt
  Migration of sites to CREAM1.5 was strongly encouraged at that time, and ALICE fully supports it at all sites providing this service for the experiment
Outline of this talk
  During the last GDB (14/10/09) we listed all the CREAM-CE issues reported by the site admins, based on the experience gained with the ALICE production
  Now, one month of operations later, we have collected feedback from several sites already using CREAM1.5
CREAM-CE: Future version
Future CREAM-CE service
  Production version: CREAM1.6
  Status: ready for certification/certified (expected) by December 2009
  TASK #9734: https://savannah.cern.ch/task/?9734
  PATCHES #3179 and #3209
    https://savannah.cern.ch/patch/?3179 Release 1.6 of CREAM CE for sl5_x86_64
    https://savannah.cern.ch/patch/?3209 YAIM-CREAM-CE for release 1.6 of CREAM CE
  Features: many of the issues reported during the last GDB (and not included in CREAM1.5) will now be solved
CREAM-CE: site admins and developers reports (I)
Purge issues
  ALICE REPORT: Wrong report of job status. CREAM's view of the running jobs gets de-synchronized
  ALICE REQUIREMENT: A method to purge jobs in a non-terminal status
  CREAM STATUS: CREAM job status can be wrongly reported because of misconfigurations or because of two bugs in the BLAH BLParser, whose fixes are candidates for CREAM1.6:
    BUG #55078: « Possible final state not considered in BLParserPBS and BUpdaterPBS »
      CURRENT STATUS: Integration candidate included in patch #3179
    BUG #54949: « Some job can remain in running state when BLParser is restarted for both lsf and pbs »
      CURRENT STATUS: Integration candidate included in patch #3179
  A specific bug covers the ALICE requirement:
    BUG #55420: « Allow admin to purge CREAM jobs in a non terminal status » (Solution Status: in progress)
      CURRENT STATUS: Integration candidate included in patch #3179
  CURRENT RISK FOR ALICE: Low now that the developers have provided site admins with the corresponding purge script (very high before)
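The purge requirement above can be illustrated with a small selection rule. This is only a sketch and not the script distributed by the CREAM developers: the record layout, the `purge_candidates` helper, the job identifiers and the one-day threshold are all hypothetical, and the set of terminal state names is an assumption to be checked against the deployed CREAM version.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed terminal states; adjust to the states the deployed CREAM version reports.
TERMINAL_STATES = {"DONE-OK", "DONE-FAILED", "CANCELLED", "ABORTED"}


@dataclass
class JobRecord:
    job_id: str            # CREAM job identifier (hypothetical format)
    batch_id: str          # identifier of the same job in the local batch system
    status: str            # status as recorded by CREAM
    last_update: datetime  # time of the last status change seen by CREAM


def purge_candidates(jobs, batch_ids, now, min_age=timedelta(days=1)):
    """Select jobs that CREAM still holds in a non-terminal status although the
    batch system no longer knows them and nothing has changed for min_age."""
    return [
        job for job in jobs
        if job.status not in TERMINAL_STATES
        and job.batch_id not in batch_ids
        and now - job.last_update >= min_age
    ]


if __name__ == "__main__":
    now = datetime(2009, 11, 11, 12, 0)
    jobs = [
        JobRecord("CREAM001", "101.pbs", "RUNNING", now - timedelta(days=3)),
        JobRecord("CREAM002", "102.pbs", "RUNNING", now - timedelta(hours=2)),
        JobRecord("CREAM003", "103.pbs", "DONE-OK", now - timedelta(days=5)),
    ]
    # Only 102.pbs is still known to the batch system in this made-up example.
    for job in purge_candidates(jobs, batch_ids={"102.pbs"}, now=now):
        print(job.job_id)
```

Since the purge criteria differ from site to site, the age threshold and the list of states are the two parameters a site would most likely want to tune.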
CREAM-CE: site admins and developers feedback (I)
Purge issues: site admins report
  Desynchronization issues have not been observed recently at sites running CREAM1.5
  Several sites have used the script created by the CREAM developers to purge the CREAM DB manually
  Very good feedback on this tool
  It requires a manual operation, however, and the purge criteria vary from site to site
CREAM-CE: site admins and developers report (II)
DISK SPACE issues: areas to monitor and purge or clean
  ALICE REPORT: The local MySQL DB grew up to 2.5 GB
  CREAM STATUS: Issue associated with the MySQL engine. When entries are deleted from the DB, the corresponding disk space is not released (so the CREAM DB does not shrink), but the space is reused when new data are added to the DB
  CURRENT RISK FOR ALICE: low
  ALICE REPORT: Purge of the input sandboxes in /opt/glite/var/cream_sandbox
  CREAM STATUS: Solved in CREAM1.5
    #48144: « Problems with purge in CREAM when the mapped group name is different than the VO name »
  RISK FOR ALICE: none once sites upgrade to CREAM1.5
CREAM-CE: site admins and developers feedback (II)
Disk space issues: site admins report
  Growth of the local MySQL DB
    Some tables in the DB are still growing
  Purge of the input sandboxes in the /opt/glite/var/cream_sandbox area
    The sandbox auto-purge procedure included in CREAM1.5 is now working fine (outputs are purged after 10 days)
    No further issues with the sandbox purge observed by the site admins after the migration to CREAM1.5
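A site admin who wants to cross-check the 10-day auto-purge could list what is still older than that in the sandbox area. The following is a minimal dry-run sketch, assuming that age can be judged from the modification time of the top-level entries; the `stale_entries` helper is hypothetical and is not part of CREAM, and nothing is deleted.

```python
import os
import time


def stale_entries(root, max_age_days=10, now=None):
    """Return the entries directly under root whose modification time is
    older than max_age_days (sorted, full paths)."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for entry in os.scandir(root):
        # Do not follow symlinks: we want the age of the entry itself.
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            stale.append(entry.path)
    return sorted(stale)


if __name__ == "__main__":
    # Dry run only: print what looks stale, delete nothing.
    sandbox_root = "/opt/glite/var/cream_sandbox"
    if os.path.isdir(sandbox_root):
        for path in stale_entries(sandbox_root):
            print(path)
```

If the auto-purge works as reported, the listing should stay empty or very short.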
CREAM-CE: site admins and developers report (III)
DISK SPACE issues (cont.)
  ALICE REPORT: Issues regarding /opt/glite/var/log and /var/log
  ALICE REQUIREMENT: A cleaning policy is required for these files, otherwise they can grow forever
  CREAM STATUS: Policies exist for all these files and can be customized file by file:
    Only the BLAH accounting log files are out of the CREAM developers' control (files cannot be deleted before having been processed by the accounting system)
    For /opt/glite/var/log/glite-ce-cream.log and /opt/glite/var/log/glite-ce-monitor.log, the policy is defined in /var/lib/tomcat5/webapps/ce-cream/WEB-INF/classes/log4j.properties and the default values can be changed
      Relevant info under: http://grid.pd.infn.it/cream/field.php?n=Main.KnownIssues
    For /opt/glite/var/log/glite-xxxparser.log the policy is available under /opt/logrotate.d/glite-xxxparser
    /etc/logrotate.d/globus-gridftp manages the gridftp log files under /var/log
  RISK FOR ALICE: low, since the size is manageable by site admins
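To see what a given log4j policy means in terms of disk space, one can compute the upper bound implied by a rolling file appender. The sketch below assumes log4j 1.x `RollingFileAppender` settings (`MaxFileSize`, `MaxBackupIndex`); the appender name `fileout` and the values in `SAMPLE` are illustrative only and are not the CREAM defaults.

```python
import re

_UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}

# Illustrative fragment; appender name and values are assumptions.
SAMPLE = """
log4j.appender.fileout=org.apache.log4j.RollingFileAppender
log4j.appender.fileout.File=/opt/glite/var/log/glite-ce-cream.log
log4j.appender.fileout.MaxFileSize=1000KB
log4j.appender.fileout.MaxBackupIndex=20
"""


def parse_size(text):
    """Convert a log4j-style size such as '1000KB' to bytes."""
    match = re.fullmatch(r"(\d+)\s*(KB|MB|GB)?", text.strip(), re.IGNORECASE)
    if not match:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(number) * (_UNITS[unit.upper()] if unit else 1)


def max_footprint(properties_text):
    """Return {appender name: maximum bytes on disk} for every appender that
    sets both MaxFileSize and MaxBackupIndex."""
    settings = {}
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        parts = key.split(".")
        # Keep only keys of the form log4j.appender.<name>.<option>
        if len(parts) == 4 and parts[:2] == ["log4j", "appender"]:
            settings.setdefault(parts[2], {})[parts[3]] = value
    result = {}
    for name, opts in settings.items():
        if "MaxFileSize" in opts and "MaxBackupIndex" in opts:
            # One active file plus MaxBackupIndex rotated copies.
            copies = int(opts["MaxBackupIndex"]) + 1
            result[name] = parse_size(opts["MaxFileSize"]) * copies
    return result


if __name__ == "__main__":
    for name, size in max_footprint(SAMPLE).items():
        print(f"{name}: at most {size / 1024 ** 2:.1f} MiB")
```

The bound is one active file plus the rotated copies, so lowering either setting shrinks the worst case proportionally.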
CREAM-CE: site admins and developers report (IV)
DISK SPACE issues (cont.)
  ALICE REPORT: Issues regarding /opt/glite/var/cream/user_proxy
  CREAM STATUS: Bug reported and accepted; fix not available in CREAM1.5
    #49497: « User proxies on CREAM do not get cleaned up »
  CURRENT STATUS: Already solved; it will be included in CREAM1.6 (bug fix implementation still pending)
  CREAM developers could increase the priority of this bug if needed
DISK SPACE issues: site admins report
  No issues observed by the sites in the last month
CREAM-CE: site admins and developers report (V)
LOAD issues (reported by Subatech)
  ALICE REPORT: UNIX load going up to 5 (during start-up or at high submission rates)
  CREAM STATUS: Problem reported by GRNET; its origin was a missing index in the CREAM DB
    #52876: « The extra attribute table in the CREAM DB has no key/indexes defined »
  CURRENT STATUS: solved in CREAM1.5
  RISK FOR ALICE: low once the CREAM version is upgraded
  ALICE REPORT: When tomcat is restarted, the system can take up to 15 min before submitting new jobs
  CREAM STATUS: The slow start of CREAM is also due to the problems coming from jobs reported in a wrong status
    #51978: « CREAM can be slow to start » (bug in progress)
  CURRENT STATUS: not included in CREAM1.5, but will be released in CREAM1.6
  RISK FOR ALICE: Purge actions should speed up the start and therefore decrease the risk for the experiment
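Why a missing index translates into load can be shown with a toy table. The sketch below uses SQLite from the Python standard library in place of the MySQL engine behind CREAM, and the table and column names are hypothetical stand-ins for the real schema; it only shows that the same lookup goes from a full table scan to an index search once the index exists.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in for an attribute table looked up by job id.
    conn.execute("CREATE TABLE extra_attribute (job_id TEXT, name TEXT, value TEXT)")
    conn.executemany(
        "INSERT INTO extra_attribute VALUES (?, ?, ?)",
        [(f"job{i}", "attr", str(i)) for i in range(1000)],
    )
    sql = "SELECT value FROM extra_attribute WHERE job_id = ?"
    before = query_plan(conn, sql, ("job42",))   # no index: full scan
    conn.execute("CREATE INDEX idx_extra_job ON extra_attribute (job_id)")
    after = query_plan(conn, sql, ("job42",))    # index available
    conn.close()
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print("without index:", before)
    print("with index:   ", after)
```

With many jobs in the DB, every lookup without the index reads the whole table, which is consistent with the load peaks seen at start-up and at high submission rates.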
CREAM-CE: site admins and developers feedback (V)
Load issues: site admins report
  Increase of the UNIX load
    Reported by Subatech, still visible at the site
    The load increases during automatic purge operations. Also visible during high job submission rates
    Site admin report: At this site CREAM is running in a VMware VM, and the load might be due to poor MySQL performance in such an environment. A slowdown of MySQL could increase the UNIX load
    CREAM-CE developers report: Issue tracked in bug #58103. According to the GRNET report "CREAM performance report", very heavy queries are performed during purge operations
    CURRENT STATUS: Fix already committed to CVS; it will be released with the next CREAM1.6. The developers have not yet assessed how much this fix reduces the load
  Report from Legnaro
    After closing the queues, the load increased without saturating the CPU (60% CPU load) for about 12h. The issue seems to come from the ALICE submissions, which continued although the queues were closed
  Tomcat restart slows down the submission of jobs
    Solved in CREAM1.6
    No further reports from the site admins
Some other interesting feedback
We asked the site admins for: requirements for the system maintenance
  Since the last update, sites spend much less time monitoring the CREAM-CE
  The effort goes mainly into keeping the disk space under control and checking the consistency between the jobs reported by CREAM and the local batch system (Subatech)
  In some cases, the baby-sitting of the service is almost negligible (Legnaro and CNAF)
  The issues observed at RAL before the upgrade seem to be gone after the deployment of CREAM1.5, i.e. the Tomcat-related issues are solved with this new version
Some other interesting feedback (II)
We also asked the site admins for: monitoring applied to the system at the sites
  In some cases (Subatech and RAL) the site is using Nagios, also with specific probes:
    gLite-LB-logd and tomcat daemons
    User_nbfiles: number of files used for ALICE production
    Inactive_jobs: jobs not consuming CPU
    Open_file_desc: number of file descriptors used
  Standard fabric monitoring (Ganglia) for Legnaro
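A probe such as Open_file_desc can be sketched along the usual Nagios plugin convention (exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). This is not the probe used at Subatech or RAL: the function names and the thresholds are hypothetical, and counting descriptors through /proc works on Linux only.

```python
import os

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(value, warn, crit):
    """Map a measured value to a Nagios state using two thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def count_open_fds(pid):
    """Count the open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def run_probe(pid, warn=800, crit=950):
    """Print one status line with performance data and return the exit code."""
    try:
        n = count_open_fds(pid)
    except OSError as exc:
        print(f"UNKNOWN - cannot read descriptors of pid {pid}: {exc}")
        return UNKNOWN
    code = classify(n, warn, crit)
    print(f"{LABELS[code]} - {n} open file descriptors | open_fds={n};{warn};{crit}")
    return code
```

A wrapper script would pass the tomcat process id to `run_probe` and hand the returned value to `sys.exit`, so that Nagios can read the state from the exit code.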
Some other interesting feedback (III)
In addition, CNAF reported:
  The blparser is not automatically restarted at boot time (only tomcat is)
  The blparser has to be restarted by hand in order to recover the queue info
  Developers feedback: issue included in bug #56518
  CURRENT STATUS: Fix already committed; it will be provided in CREAM1.6
Finally, INFN-Torino feedback
  Running CREAM-CE for one day
  Stefano Lusso has reported the added value of the script CheckCreamConf.pl, used at the site to set variables:
    http://grid.pd.infn.it/cream/field.php?n=Main.CheckYourCREAMCEConfiguration
Summary
  ALICE stresses again its strong interest in the general deployment of the CREAM-CE
  A vibrant and very involved user community provides helpful feedback
  Fantastic quality of developer support and advice
  ALICE and the sites involved want a fast version certification and deployment cycle
  Time is very short, data is coming