Post on 14-Jan-2016
LHCb computing model
➟ CERN (Tier-0) is the hub of all activity
  Full copy at CERN of all raw data and DSTs
  All T1s have a full copy of DSTs
➟ Simulation at all possible sites (CERN, T1, T2)
  LHCb has used about 120 sites on 5 continents so far
➟ Reconstruction, stripping and analysis at T0 / T1 sites only
  Some analysis may be possible at “large” T2 sites in the future
➟ Almost all the computing (except for development / tests) will be run on the grid
  Large productions: production team
  Ganga (DIRAC) grid user interface
LHCb on the grid
Small amount of activity over the past year
  ▓ DIRAC3 has been under development
  ▓ Physics groups have not asked for new productions
  ▓ Situation has changed recently...
LHCb on the grid
➟ DIRAC3
  Nearing stable production release
    ▓ Extensive experience with CCRC08 and follow-up exercises
    ▓ Used as THE production system for LHCb
  Now testing of the interfaces by Ganga developers
➟ Generic pilot agent framework
  Critical problems found with the gLite WMS 3.0, 3.1
    ▓ Mixing of VOMS roles under certain reasonably common conditions
      Cannot have people with different VOMS roles!
    ▓ Savannah bug #39641
    ▓ Being worked on by developers
  Waiting for this to be solved before restarting tests
LHCb storage at RAL
➟ LHCb storage primarily on the Tier-1s and CERN
➟ CASTOR used as storage system at RAL
  Fully moved out of dCache in May 2008
    ▓ One tape damaged and file on it marked lost
  Was stable (more or less) until 20 Aug 2008
    ▓ Not been able to take great load on servers
      Low upper limit (8) on LSF job slots on various CASTOR diskservers
      Too many jobs (>500) can come into the batch system
      The concerned service class then hangs
      Temporarily fixed for now. Needs to be monitored (probably by the shifter on duty?)
        » Increase limit to >100 rfio jobs per server
        » Not all hardware can handle a limit of 200 jobs (start using swap space)
      Problem seen many times now over the last few months
    ▓ CASTOR now in downtime
    ▓ This is worrying given how close we are to data taking
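The monitoring suggested for the shifter could be sketched roughly as follows. This is only an illustration under stated assumptions: the `DiskServer` record, the server names and the per-server numbers are hypothetical, and only the slot limit (8) and the backlog figure (>500) come from the slide. It is not how CASTOR or LSF expose this information.

```python
from dataclasses import dataclass


@dataclass
class DiskServer:
    """Hypothetical snapshot of one diskserver in a service class."""
    name: str
    slot_limit: int  # maximum concurrent rfio jobs (LSF job slots)
    running: int     # jobs currently holding a slot
    pending: int     # jobs queued for this server


def free_slots(servers):
    """Total number of unused job slots across the service class."""
    return sum(max(s.slot_limit - s.running, 0) for s in servers)


def service_class_at_risk(servers, max_pending=500):
    """Flag the situation on the slide: no free slots and a large backlog."""
    pending = sum(s.pending for s in servers)
    return free_slots(servers) == 0 and pending > max_pending


# Illustrative numbers only: two servers at the old limit of 8 slots.
saturated = [DiskServer("ds-a", 8, 8, 300), DiskServer("ds-b", 8, 8, 260)]
# The same load with the limit raised to 100 slots per server.
relaxed = [DiskServer("ds-a", 100, 8, 300), DiskServer("ds-b", 100, 8, 260)]
```

With these made-up numbers the first configuration is flagged and the second is not, which is the effect the proposed limit increase is aiming for.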
LHCb at RAL
➟ Move to srm-v2 by LHCb
  Needed to retire srm-v1 endpoints, hardware for RAL
  When DIRAC3 becomes baseline for user analysis
    ▓ Already used for almost all production
    ▓ Ganga working on submitting through DIRAC3
    ▓ Needs LHCb also to rename files in the LFC
  All space tokens, etc. have been set up
  Target: Turn off srm-v1 access by end September
➟ Currently use srm-v1 for user analysis
    ▓ DIRAC2 does not support srm-v2
➟ Batch system: Pausing of jobs during downtime?
    ▓ Not clear about the status of this
  For now, stop the batch system from accepting LHCb jobs a few hours before scheduled downtimes
    ▓ No LHCb job should run for >24 hours
  Announce beginning and end of downtimes
    ▓ Problems with broadcast tools
    ▓ GGUS ticket opened by Derek Ross
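The drain rule for scheduled downtimes can be written down as a small sketch. The function names, the six-hour drain window and the dates are assumptions made for illustration; only the 24-hour job limit and the "a few hours before" rule come from the slide.

```python
from datetime import datetime, timedelta

# From the slide: no LHCb job should run for more than 24 hours.
MAX_JOB_RUNTIME = timedelta(hours=24)


def accepts_new_jobs(now, downtime_start, drain_window):
    """True while the batch system should still accept LHCb jobs."""
    return now < downtime_start - drain_window


def may_still_be_running(job_start, downtime_start, max_runtime=MAX_JOB_RUNTIME):
    """True if a job started at job_start could still run when the downtime begins."""
    return job_start + max_runtime > downtime_start


# Hypothetical downtime and drain window ("a few hours" taken as 6 here).
downtime = datetime(2008, 9, 30, 8, 0)
drain = timedelta(hours=6)

early = accepts_new_jobs(datetime(2008, 9, 30, 1, 0), downtime, drain)  # before the window
late = accepts_new_jobs(datetime(2008, 9, 30, 3, 0), downtime, drain)   # inside the window
at_risk = may_still_be_running(datetime(2008, 9, 29, 9, 0), downtime)
```

Under these assumptions a job accepted just before the drain window opens can still be running when the downtime starts, which is presumably why pausing of jobs remains an open question.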
LHCb and CCRC08
➟ Planned tasks: Test the LHCb computing model
  Raw data distribution from pit to T0 centre
    ▓ Use of rfcp into CASTOR from pit - T1D0
  Raw data distribution from T0 to T1 centres
    ▓ Use of FTS - T1D0
  Reconstruction of raw data at CERN & T1 centres
    ▓ Production of rDST data - T1D0
    ▓ Use of SRM 2.2
  Stripping of data at CERN & T1 centres
    ▓ Input data: RAW & rDST - T1D0
    ▓ Output data: DST - T1D1
    ▓ Use SRM 2.2
  Distribution of DST data to all other centres
    ▓ Use of FTS
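The storage class labels above follow the usual TxDy convention (x copies on tape, y copies kept on disk), so T1D0 is tape-only custodial storage and T1D1 adds a permanent disk copy. A minimal sketch of reading those labels, with the planned steps transcribed from the slide; the `StorageClass` type, the step names and the helper functions are illustrative, not part of DIRAC or SRM.

```python
import re
from typing import NamedTuple


class StorageClass(NamedTuple):
    tape_copies: int
    disk_copies: int


def parse_storage_class(label):
    """Parse a label such as 'T1D0' into tape and disk copy counts."""
    match = re.fullmatch(r"T(\d)D(\d)", label)
    if match is None:
        raise ValueError(f"not a TxDy storage class: {label!r}")
    return StorageClass(int(match.group(1)), int(match.group(2)))


# (output data, tool named on the slide, storage class of the output).
# The final DST distribution step is omitted because the slide gives no class for it.
CCRC08_STEPS = [
    ("RAW, pit to T0", "rfcp", "T1D0"),
    ("RAW, T0 to T1", "FTS", "T1D0"),
    ("rDST from reconstruction", "SRM 2.2", "T1D0"),
    ("DST from stripping", "SRM 2.2", "T1D1"),
]


def outputs_kept_on_disk(steps):
    """Names of the outputs whose storage class keeps a disk copy."""
    return [name for name, _tool, label in steps
            if parse_storage_class(label).disk_copies > 0]
```

Only the stripped DSTs, the data users actually analyse, are kept permanently on disk in this plan.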
LHCb CCRC08 Problems
➟ CCRC08 highlighted areas to be improved
  File access problems
    ▓ Random or permanent failure to open files using gsidcap
      Request IN2P3 and NL-T1 to allow dcap protocol for local read access
      Now using xroot at IN2P3 – appears to be successful
    ▓ Wrong file status returned by dCache SRM after a put
      bringOnline was not doing anything
  Software area access problems
    ▓ Site banned for a while until problem is fixed
  Application crashes
    ▓ Fixed with new SW release and deployment
  Major issues with LHCb bookkeeping
    ▓ Especially for stripping
➟ Lessons learned
  Better error reporting in pilot logs and workflow
  Alternative forms of data access needed in emergencies
    ▓ Downloading of files to WN (used at IN2P3, RAL)
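The two lessons (fallback data access and better error reporting) can be illustrated together in a short sketch. Everything here is hypothetical: `open_with_fallback`, the opener callables and the file name are stand-ins, not DIRAC or Ganga functions.

```python
def open_with_fallback(lfn, openers):
    """Try each (name, opener) pair in order and return the first success.

    Each opener takes the logical file name and either returns a handle or
    raises OSError. All failures are collected so that the final error
    reports every method that was tried.
    """
    errors = []
    for name, opener in openers:
        try:
            return name, opener(lfn)
        except OSError as exc:
            errors.append(f"{name}: {exc}")
    raise OSError(f"all access methods failed for {lfn}: " + "; ".join(errors))


# Stand-in openers: remote protocol access fails, the local download works.
def open_via_protocol(lfn):
    raise OSError("cannot open file")


def download_to_worker_node(lfn):
    return "local copy of " + lfn


method, handle = open_with_fallback(
    "/lhcb/example.dst",  # hypothetical file name
    [("gsidcap", open_via_protocol), ("download", download_to_worker_node)],
)
```

Here the download to the worker node is the last resort, matching its use as an emergency measure at IN2P3 and RAL.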
Communications
➟ LHCb sites
  Grid operations team keep track of problems
  Report to sites via GGUS and eLogger
    ▓ All posts are reported on lhcb-production@cern.ch
    ▓ Please subscribe if you want to know what is going on
➟ LHCb users
  Mailing lists
    ▓ lhcb-distributed-analysis@cern.ch
      All problems directed here
    ▓ Specific lists for each LHCb application and Ganga
  Ticketing systems (Savannah, GGUS) for DIRAC, Ganga, apps
    ▓ Used by developers and “power” users
  Software weeks provide training sessions for using Grid tools
  Weekly distributed analysis meetings (starts Friday)
    ▓ DIRAC, Ganga, core software developers along with some users
    ▓ Aims to identify needs and coordinate release plans
http://lblogbook.cern.ch/Operations
RSS feed available
Summary
➟ Concerned about CASTOR stability close to data taking
➟ DIRAC3 workload and data management system now online
  Has been extensively tested when running LHCb productions
  Now moving it into the user analysis system
    ▓ Ganga needs some additional development
➟ Grid operations team working with sites, users and devs to identify and resolve problems quickly and efficiently
➟ LHCb looking forward to imminent switch on of the LHC!