CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it
Some Hints for “Best Practice” Regarding VO Boxes Running Critical Services and Real Use-cases
Véronique Lefébure
CERN-IT-FIO/FS
These slides are about:
• Nothing new
• Common sense
  – Sometimes good to be repeated
• There are some little details that can make a big difference
VOBOX: the CERN-IT-FIO definition:
• A box dedicated to a VO, running one (or more) VO service(s)
• IT-FIO “VOBOX Service” handles:
  – Choice of hardware according to user specifications
  – Base OS installation & software upgrades
  – Hardware monitoring & maintenance
  – Installation & monitoring of common services
    • E.g. apache
  SLA document in preparation
• User-specific service installation & configuration managed by the VO
  – in compliance with the SLA
VOBOX Hardware:
• Resource requirements and planning
  – It is not always easy to have an additional disk on demand because “/data” becomes full
• Hardware warranty
  – Plan for hardware renewal
  – Check warranty duration before moving to production
• Hardware naming and labeling
  – Make use of aliases to facilitate hardware replacement
  – Have a “good” name on the sticker
    • E.g. all lxbiiii machines may be switched off by hand in case of a cooling problem
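The warranty check above can be sketched as a small date comparison. This is only an illustration: the function names and the example dates are assumptions, and the day of month is clamped to 28 because planning-level precision is enough here.

```python
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` months after `start` (day clamped to 28)."""
    index = start.year * 12 + (start.month - 1) + months
    year, month0 = divmod(index, 12)
    return date(year, month0 + 1, min(start.day, 28))


def warranty_covers(warranty_end: date, production_start: date, planned_months: int) -> bool:
    """True if the warranty lasts at least until the planned end of production."""
    return add_months(production_start, planned_months) <= warranty_end


# Illustrative values only: a warranty ending in October 2009 (assumed day: 31st)
# and a service assumed to enter production on 1 February 2008.
warranty_end = date(2009, 10, 31)
for months in (12, 24):
    covered = warranty_covers(warranty_end, date(2008, 2, 1), months)
    print(f"{months}-month service covered by warranty: {covered}")
```

If the check fails, the hardware renewal has to be planned before the service moves to production rather than after.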
VOBOX software:
• Be informed of coming software upgrades
  – Register on the appropriate announcement mailing lists
• Test software upgrades
  – Have a “test” machine
  – Check that there are no package conflicts
  – Test that your applications are not broken
• Be ready for a reboot
  – Scheduled reboot: kernel upgrades, … (see SLA)
  – Unscheduled reboot: power cut, human mistake, …
  – Use init scripts
User Software and Data:
• Be ready for a full OS reinstallation
  – Hardware replacement
  – Security incident
  Have important data and configuration files regularly backed up
• Use central configuration database as much as possible
  – At CERN CC: Quattor/CDB
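A regular backup of important data and configuration files could look roughly like the sketch below. It is not the CERN backup service: the function name, the archive naming scheme and the idea of writing a timestamped tar archive to a separate destination are all assumptions made for illustration.

```python
import tarfile
import time
from pathlib import Path


def backup_paths(paths, dest_dir, label="vobox"):
    """Archive the given files/directories into a timestamped .tar.gz.

    Returns (archive_path, missing), where `missing` lists the requested
    paths that did not exist and were therefore skipped.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{label}-{stamp}.tar.gz"
    missing = []
    with tarfile.open(archive, "w:gz") as tar:
        for raw in paths:
            path = Path(raw)
            if not path.exists():
                missing.append(str(path))
                continue
            # Store paths relative to "/" so the archive can be unpacked anywhere.
            tar.add(path, arcname=str(path).lstrip("/"))
    return archive, missing
```

The destination should be on storage that survives a reinstallation of the box; run from a scheduler, a call such as `backup_paths(["/path/to/service.conf"], "/path/to/backup/area")` (both paths hypothetical) would produce one archive per run.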
Monitoring:
• Have your daemons monitored
  – E.g. with LEMON
    • Automatic restart of daemons
    • Automatic notification by email, by SMS
• Use the CC Operator service
  – The operator reacts to alarms
    • Check that your machine is alarmed (i.e. not in “maintenance” state)
  – Provide your procedures and exact contact information
  – Use “hot-line” mailing list for emergency (one per VO)
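The restart-then-notify behaviour described above can be sketched generically as follows. This does not use or imitate the LEMON interface: `supervise_once` and its callable parameters are hypothetical names, and the actual checks (process lookup, init script call, email or SMS gateway) are left to the caller.

```python
from typing import Callable


def supervise_once(
    name: str,
    is_running: Callable[[], bool],
    restart: Callable[[], None],
    notify: Callable[[str], None],
    max_attempts: int = 3,
) -> bool:
    """Check one daemon, try to restart it if down, and notify about the outcome.

    Returns True if the daemon is running when the function exits.
    """
    if is_running():
        return True  # healthy: nothing to do, nothing to report
    for attempt in range(1, max_attempts + 1):
        restart()
        if is_running():
            notify(f"{name}: was down, restarted after {attempt} attempt(s)")
            return True
    # Automatic recovery failed: escalate so that a human can follow the procedure.
    notify(f"{name}: still down after {max_attempts} restart attempts, operator action needed")
    return False
```

Bounding the number of restart attempts keeps a crashing daemon from being restarted forever without anyone being told, which is the point at which the operator procedures and contact information come into play.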
Service Reliability:
• Where needed, have a fail-over system
  – Use Load-balancing alias, …
    • TEST the fail-over mechanism
    • Make sure that no other machine is introduced under that alias
  – Have it on a different network switch
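The two alias checks above (no foreign machine under the alias, members on different switches) can be expressed as a small consistency check. The sketch assumes the alias membership and the host-to-switch mapping have already been obtained by other means; `check_alias` and its parameters are hypothetical names, and no DNS or network database is queried.

```python
def check_alias(expected_hosts, alias_members, switch_of):
    """Return a list of human-readable problems; an empty list means the alias looks sane.

    expected_hosts: hosts that are supposed to serve the alias
    alias_members:  hosts currently found under the alias
    switch_of:      mapping host -> network switch name
    """
    problems = []
    expected, members = set(expected_hosts), set(alias_members)
    for host in sorted(members - expected):
        problems.append(f"unexpected machine under alias: {host}")
    for host in sorted(expected - members):
        problems.append(f"expected machine missing from alias: {host}")
    # Group members by switch; hosts with no known switch are grouped under
    # "unknown" and therefore flagged together, which errs on the safe side.
    by_switch = {}
    for host in sorted(members):
        by_switch.setdefault(switch_of.get(host, "unknown"), []).append(host)
    for switch, hosts in sorted(by_switch.items()):
        if len(hosts) > 1:
            problems.append(f"machines share switch {switch}: {', '.join(hosts)}")
    return problems
```

A check like this only validates the configuration; the fail-over itself still has to be tested by actually taking one member out of service.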
Communication:
• Regularly meet in person (or at least use the phone from time to time)
  – Improved communication
  – Clarifications
  – Collaboration
Use-Cases (1/6)
• CMS DBS: Criticality level ‘10’
boxes: vocms02 + vocms05
Hw warranty till Oct 2009
IP switches: S513-C-IP218 and S513-C-IP216
Load-balanced alias: “cmsdbsprod”
Load-balanced alias name defined at profile level
Contact information: [email protected]
Importance = “50”
Piquet Call if needed
? Monitoring
Use-Cases (2/6)
• CMS “Cessy->T0 transfer system”: Criticality level ‘10’ (lxgate39)
Importance = “45”
NO Piquet Call if needed
Only ONE machine
? Monitoring (xrootd monitored by LEMON)
• CMS considerations
  • machine essential for us, somehow part of the online system
  • software can't be load-balanced
    – why? What if the machine breaks? Would a spare and test machine be useful?
  • once real data operations start, machine needs to be up whenever there is detector activity (beam, cosmics, calibration).
  • We have buffer spaces to bridge downtime of components and machines and there are provisions to shut down and restart our software.
  • But we design for steady-state operations and everything that gets us out of steady-state is a very big deal as it causes ripple effects through the rest of the system.
Use-Cases (3/6)
• CMS “PHEDEX”: Criticality level ‘9’
boxes: vocms01 + vocms20
Hw warranty till Oct 2009
IP switches: S513-C-IP217 and S513-C-IP305
vocms20 = hot spare
Contact information: [email protected] [email protected]
Importance = “50”
Piquet Call if needed
? Monitoring. “Phedex Monitoring” currently runs on a CMS machine (not in CC yet)
Use-Cases (4/6)
• LHCb: volhcb01 & volhcb02 “will be the critical boxes for CCRC08”
  – But not yet really in production; “these two machines will be different and they will run the various DIRAC3 services: WMS, Bookkeeping, transfer agent”
HW warranty till May and Oct 2009
Network switches: S513-C-IP36 & S513-C-IP218
Use-Cases (5/6)
• ATLAS “voatlas10”: Criticality level ‘high’
DDM (Distributed Data Management)
HW warranty till May 2009
  – “During LHC period ATLAS will have
    • 4 computers + 2 spares (hot backup) to run DDM central services,
    • 10 computers + 3 spares to run site services (VO boxes)”
Note: dependency on many ARDA boxes still named “lxb7iii” etc …
Use-Cases (6/6)
• ALICE Voalice0i: Criticality level ‘10’
Box functionality not specified in CDB (except for the 2 xrootd control nodes)
User contact: one person only? Use of special procedures?
• recent experience: a kernel upgrade broke ALICE applications
Usefulness of a test machine! Now ALICE has 2 machines in our Preprod name space