Some Hints for “Best Practice” Regarding VO Boxes Running Critical Services and Real Use-cases
Véronique Lefébure, CERN-IT-FIO/FS
CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it

These slides are about:
• Nothing new
• Common sense
  – Sometimes good to be repeated
• There are some little details that can make a big difference

VOBOX: the CERN-IT-FIO definition:
• A box dedicated to a VO, running one (or more) VO service(s)
• IT-FIO “VOBOX Service” handles:
  – Choice of hardware according to user specifications
  – Base OS installation & software upgrades
  – Hardware monitoring & maintenance
  – Installation & monitoring of common services
    • Eg: apache
  SLA document in preparation
• User-specific Service installation & configuration managed by the VO
  – in compliance with the SLA

VOBOX Hardware:
• Resource requirements and planning
  – It is not always easy to get an additional disk on demand when “/data” becomes full
• Hardware warranty
  – Plan for hardware renewal
  – Check warranty duration before moving to production
• Hardware naming and labeling
  – Make use of aliases to facilitate hardware replacement
  – Have a “good” name on the sticker
    • Eg. All lxbiiii machines may be switched off by hand in case of a cooling problem
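
The warranty check above can be turned into a small routine review. The following is a minimal sketch, assuming the warranty end dates are kept in a simple mapping; the host names, the dates and the 180-day margin are hypothetical and only illustrate the idea.

```python
from datetime import date, timedelta


def expiring_soon(warranty_end, today, margin_days=180):
    """Return hosts whose warranty ends within margin_days of today.

    warranty_end maps a host name to its warranty end date.
    The result is sorted by end date, earliest first.
    """
    limit = today + timedelta(days=margin_days)
    due = [(end, host) for host, end in warranty_end.items() if end <= limit]
    return [host for end, host in sorted(due)]


# Hypothetical inventory, for illustration only.
inventory = {
    "voexample01": date(2009, 5, 31),
    "voexample02": date(2009, 10, 31),
    "voexample03": date(2011, 1, 31),
}

if __name__ == "__main__":
    for host in expiring_soon(inventory, today=date(2009, 1, 15)):
        print(f"{host}: plan hardware renewal, warranty ends {inventory[host]}")
```

Running such a check before a box moves to production, and again at each review, makes it harder to be surprised by an expired warranty.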

VOBOX software:
• Be informed of coming software upgrades
  – Register on the appropriate announcement mailing lists
• Test software upgrades
  – Have a “test” machine
  – Check that there are no package conflicts
  – Test that your applications are not broken
• Be ready for a reboot
  – Scheduled reboot: kernel upgrades, … (see SLA)
  – Unscheduled reboot: power cut, human mistake, …
  – Use init scripts
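
One way to make use of a “test” machine is to compare its package list with the production box, so that it is clear which upgrades have already been exercised. The sketch below assumes each machine can dump its packages as plain “name version” lines (for example from the package manager's query command); the package names and versions are hypothetical.

```python
def parse_package_list(text):
    """Parse lines of the form '<name> <version>' into a dict."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, version = line.split(None, 1)
        packages[name] = version
    return packages


def package_differences(test_pkgs, prod_pkgs):
    """Map each differing package to (version on test, version on production).

    A version of None means the package is absent on that machine.
    """
    diffs = {}
    for name in sorted(set(test_pkgs) | set(prod_pkgs)):
        on_test = test_pkgs.get(name)
        on_prod = prod_pkgs.get(name)
        if on_test != on_prod:
            diffs[name] = (on_test, on_prod)
    return diffs


# Hypothetical package dumps, for illustration only.
TEST_DUMP = """
kernel 2.0-2
webserver 1.3-5
"""
PROD_DUMP = """
kernel 2.0-1
webserver 1.3-5
myvo-agent 1.4-1
"""

if __name__ == "__main__":
    diffs = package_differences(parse_package_list(TEST_DUMP),
                                parse_package_list(PROD_DUMP))
    for name, (on_test, on_prod) in diffs.items():
        print(f"{name}: test={on_test} production={on_prod}")
```

A package that appears only on production (here the hypothetical myvo-agent) is a hint that the test machine does not yet exercise the real application.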

User Software and Data:
• Be ready for a full OS reinstallation
  – Hardware replacement
  – Security incident
  Have important data and configuration files regularly backed up
• Use central configuration database as much as possible
  – At CERN CC: Quattor/CDB
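
For files that cannot live in the central configuration database, a regular backup can be as simple as a timestamped archive copied off the box. This is a minimal sketch using only the Python standard library; the function name, the archive naming scheme and the sample file are assumptions made for illustration, and it does not replace a site backup service.

```python
import tarfile
import tempfile
import time
from pathlib import Path


def backup_paths(paths, dest_dir, stamp=None):
    """Archive the existing paths into dest_dir and return the archive path.

    Missing paths are skipped silently; stamp defaults to the current time.
    """
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / f"vobox-backup-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for path in map(Path, paths):
            if path.exists():
                tar.add(str(path))  # directories are added recursively
    return archive


if __name__ == "__main__":
    # Demonstration in a throwaway directory, so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        conf = Path(tmp) / "service.conf"  # hypothetical configuration file
        conf.write_text("port = 8443\n")
        print(backup_paths([conf], Path(tmp) / "backups"))
```

The archive should be stored on a different machine or service; a backup kept on the same disk does not survive a reinstallation.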

Monitoring:
• Have your daemons monitored
  – Eg: with LEMON
    • Automatic restart of daemons
    • Automatic notification by email, by SMS
• Use the CC Operator service
  – The operator reacts to alarms
    • Check that your machine is alarmed (i.e. not in “maintenance” state)
  – Provide your procedures and exact contact information
  – Use “hot-line” mailing list for emergency (one per VO)
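
The restart-then-notify logic behind daemon monitoring can be sketched in a few lines. This is not how LEMON is configured; it is a generic illustration in which the liveness check, the restart action and the notification channel are passed in as plain callables, and every name is hypothetical.

```python
def supervise(name, is_alive, restart, notify, max_restarts=1):
    """Check a daemon once; restart it if it is down and report what happened.

    Returns "ok", "restarted" or "failed".
    """
    if is_alive():
        return "ok"
    for attempt in range(1, max_restarts + 1):
        restart()
        if is_alive():
            notify(f"{name}: was down, restarted on attempt {attempt}")
            return "restarted"
    notify(f"{name}: down, {max_restarts} restart attempt(s) failed")
    return "failed"


class FakeDaemon:
    """Stand-in for a real service so the sketch runs anywhere."""

    def __init__(self, running, restartable=True):
        self.running = running
        self.restartable = restartable

    def is_alive(self):
        return self.running

    def restart(self):
        # A restart succeeds only if the fake daemon is restartable.
        self.running = self.restartable


if __name__ == "__main__":
    daemon = FakeDaemon(running=False)
    print(supervise("example-daemon", daemon.is_alive, daemon.restart, print))
```

In practice the "failed" outcome is the one that should reach a person (email, SMS or the operator), which is why exact contact information matters.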

Service Reliability:
• Where needed, have a fail-over system
  – Use Load-balancing alias, …
    • TEST the fail-over mechanism
    • Make sure that no other machine is introduced under that alias
  – Have it on a different network switch
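
Checking that no other machine has been introduced under an alias can be automated by comparing what the alias resolves to with the list of machines that are supposed to be behind it. The sketch below uses the standard `socket.gethostbyname_ex` call; the alias name and addresses in the demonstration are hypothetical (documentation address range), and a load-balanced alias may return only a subset of its members at any given time, so a single empty result is not proof on its own.

```python
import socket


def resolve_alias(alias):
    """Return the set of IPv4 addresses currently published for a DNS alias."""
    _name, _aliases, addresses = socket.gethostbyname_ex(alias)
    return set(addresses)


def unexpected_members(alias, expected, resolver=resolve_alias):
    """Return the addresses behind the alias that are not in the expected set."""
    return resolver(alias) - set(expected)


if __name__ == "__main__":
    # A fake resolver keeps the demonstration independent of any real DNS.
    def fake_resolver(alias):
        return {"192.0.2.10", "192.0.2.11", "192.0.2.99"}

    intruders = unexpected_members(
        "example-alias",                # hypothetical alias name
        {"192.0.2.10", "192.0.2.11"},   # machines meant to serve the alias
        resolver=fake_resolver,
    )
    print(sorted(intruders))
```

Dropping the `resolver` argument makes the function query the real DNS, which is what a periodic check on the production alias would do.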

Communication:
• Regularly meet in person (or at least use the phone from time to time)
  – Improved communication
  – Clarifications
  – Collaboration

Use-Cases (1/6)

• CMS DBS: Criticality level ‘10’
  boxes: vocms02 + vocms05
  Hw warranty till Oct 2009
  IP switches: S513-C-IP218 and S513-C-IP216
  Load-balanced alias: “cmsdbsprod”
  Load-balanced alias name defined at profile level
  Contact information: [email protected]
  Importance = “50”, Piquet Call if needed
  ? Monitoring

Use-Cases (2/6)

• CMS “Cessy->T0 transfer system”: Criticality level ‘10’ (lxgate39)
  Importance = “45”, NO Piquet Call if needed
  Only ONE machine
  ? Monitoring (xrootd monitored by LEMON)
• CMS considerations
  • machine essential for us, somehow part of the online system
  • software can't be load-balanced
    – why? What if the machine breaks? Would a spare and test machine be useful?
  • once real data operations start, machine needs to be up whenever there is detector activity (beam, cosmics, calibration).
  • We have buffer spaces to bridge downtime of components and machines, and there are provisions to shut down and restart our software.
  • But we design for steady-state operations, and everything that gets us out of steady-state is a very big deal as it causes ripple effects through the rest of the system.

Use-Cases (3/6)

• CMS “PHEDEX”: Criticality level ‘9’
  boxes: vocms01 + vocms20
  Hw warranty till Oct 2009
  IP switches: S513-C-IP217 and S513-C-IP305
  vocms20 = hot spare
  Contact information: [email protected] [email protected]
  Importance = “50”, Piquet Call if needed
  ? Monitoring. “Phedex Monitoring” currently runs on a CMS machine (not in CC yet)

Use-Cases (4/6)

• LHCb: volhcb01 & volhcb02 “will be the critical boxes for CCRC08”
  – But not yet really in production; “these two machines will be different and they will run the various DIRAC3 services: WMS, Bookkeeping, transfer agent”
  HW warranty till May and Oct 2009
  Network switches: S513-C-IP36 & S513-C-IP218

Use-Cases (5/6)

• ATLAS “voatlas10”: Criticality level ‘high’
  DDM (Distributed Data Management)
  HW warranty till May 2009
  – “During LHC period ATLAS will have
    • 4 computers + 2 spares (hot backup) to run DDM central services,
    • 10 computers + 3 spares to run site services (VO boxes)”
  Note: dependency on many ARDA boxes still named “lxb7iii” etc …

Use-Cases (6/6)

• ALICE voalice0i: Criticality level ‘10’
  Box functionality not specified in CDB (except for the 2 xrootd control nodes)
  User contact: one person only? Use of special procedures?
  • recent experience: a kernel upgrade broke ALICE applications
    Usefulness of a test machine! Now ALICE has 2 machines in our Preprod name space

Conclusions:
• Think of reviewing
  – Configuration
  – Procedures
  – Hardware warranty
  regularly, with the IT Service Manager
• Foresee and use test machines