WLCG Service Requirements
WLCG Workshop, Mumbai
Tim Bell CERN/IT/FIO
11th February 2006 Service Checklist [email protected] 2
Agenda
LCG Memorandum of Understanding
Defining what needs to be delivered
Checking the plan
Tracking delivery using a dashboard
What the MoU provides
A high level definition of the service
Basis for estimating Tier investments
Tier responsibilities
Overall capacity
Basic support structure
Implementation schedule
Governance
Roles *B
Tier0 service levels
Delay columns give the maximum delay in responding to operational problems; availability columns give the average availability.
Service | Delay: service interruption | Delay: capacity degraded by more than 50% | Delay: capacity degraded by more than 20% | Availability: during accelerator operation | Availability: at all other times
Raw data recording | 4 hours | 6 hours | 6 hours | 99% | n/a
Event reconstruction or distribution of data to Tier-1 Centres during accelerator operation | 6 hours | 6 hours | 12 hours | 99% | n/a
Networking service to Tier-1 Centres during accelerator operation | 6 hours | 6 hours | 12 hours | 99% | n/a
All other Tier-0 services | 12 hours | 24 hours | 48 hours | 98% | 98%
All other services – prime service hours | 1 hour | 1 hour | 4 hours | 98% | 98%
All other services – outwith prime service hours | 12 hours | 24 hours | 48 hours | 97% | 97%
Tier1 service levels
Delay columns give the maximum delay in responding to operational problems; availability columns give the average availability measured on an annual basis.
Service | Delay: service interruption | Delay: capacity degraded by more than 50% | Delay: capacity degraded by more than 20% | Availability: during accelerator operation | Availability: at all other times
Acceptance of data from the Tier-0 Centre during accelerator operation | 12 hours | 12 hours | 24 hours | 99% | n/a
Networking service to the Tier-0 Centre during accelerator operation | 12 hours | 24 hours | 48 hours | 98% | n/a
Data-intensive analysis services, including networking to Tier-0, Tier-1 Centres outwith accelerator operation | 24 hours | 48 hours | 48 hours | n/a | 98%
All other services – prime service hours | 2 hours | 2 hours | 4 hours | 98% | 98%
All other services – outwith prime service hours | 24 hours | 48 hours | 48 hours | 97% | 97%
The MoU is not …
An implementation bible: what grid services at which site, how to run the services, how to deploy
A magic recipe for service delivery
Application: 99% = 1.5 hours down / week
Administrator: 40 hours/week = 24% up
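The two figures contrasted on this slide follow from arithmetic on a 168-hour week. A minimal Python sketch of that arithmetic is given here; the function names are illustrative and do not come from the MoU.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_WEEK):
    """Hours a service may be down in the period and still meet the target."""
    return period_hours * (100.0 - availability_percent) / 100.0


def coverage_percent(staffed_hours, period_hours=HOURS_PER_WEEK):
    """Share of the period during which an administrator is on duty."""
    return 100.0 * staffed_hours / period_hours


if __name__ == "__main__":
    print(f"99% availability allows {allowed_downtime_hours(99):.2f} h down per week")
    print(f"A 40 h working week covers {coverage_percent(40):.1f}% of the week")
```

A 99% target leaves about 1.7 hours of downtime per week (the slide quotes roughly 1.5), while one administrator working 40 hours covers about 24% of the week.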
What is your quest ?
We seek the holy grail !
A stable and functional Grid
Define the site services
What services do we provide?
Who is responsible?
What level of service is required?
What capacity of service?
What is the support structure?
Who pays for what?
Service catalog approach
A service catalog consists of:
Service Class – Criticality
Calendar – Variation with time
Product – What application
Customer – Which VO
Service = Service Class x Calendar x Product x Customer
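One way to picture "Service = Service Class x Calendar x Product x Customer" is as a record holding a product, a customer, and one service class per calendar slot. The sketch below is an assumption about how such a record might look, not the format used at CERN; the class and field names are hypothetical, and the example values are taken from the Grid View row of the catalog shown later in these slides.

```python
from dataclasses import dataclass
from typing import Dict

CALENDAR_SLOTS = ("AP", "AS", "OP", "OS")


@dataclass
class CatalogEntry:
    service: str                   # service short name, e.g. "GRVWP"
    product: str                   # product short code, e.g. "GRVW"
    customer: str                  # a VO name, or "SH" as used on the catalog slides
    class_by_slot: Dict[str, str]  # calendar slot -> service class code

    def service_class(self, slot: str) -> str:
        """Return the service class that applies in the given calendar slot."""
        return self.class_by_slot[slot]


# Grid View is Medium in prime shift and Low in second shift.
grid_view = CatalogEntry(
    service="GRVWP",
    product="GRVW",
    customer="SH",
    class_by_slot={"AP": "M", "AS": "L", "OP": "M", "OS": "L"},
)

if __name__ == "__main__":
    for slot in CALENDAR_SLOTS:
        print(slot, grid_view.service_class(slot))
```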
Service class
https://uimon.cern.ch/twiki/bin/view/LCG/ScFourServiceDefinition
Class | Description | Downtime | Reduced | Degraded | Availability
C | Critical | 1 hour | 1 hour | 4 hours | 99%
H | High | 4 hours | 6 hours | 6 hours | 99%
M | Medium | 6 hours | 6 hours | 12 hours | 99%
L | Low | 12 hours | 24 hours | 48 hours | 98%
U | Unmanaged | None | None | None | None
Class notes
Downtime defines the time between the start of the problem and the restoration of service at minimal capacity (i.e. basic function but capacity < 50%)
Reduced defines the time between the start of the problem and the restoration of a reduced capacity service (i.e. >50%)
Degraded defines the time between the start of the problem and the restoration of a degraded capacity service (i.e. >80%)
Availability compares the sum of the time that the service is down with the total time in the calendar period for the service. Site-wide failures are not considered as part of the availability calculations.
None means the service is running unattended
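The class table and its notes can be read as a set of restoration deadlines measured from the start of a problem. The following Python sketch encodes the table and checks one incident against it; the names `ServiceClass`, `SERVICE_CLASSES` and `meets_targets` are hypothetical, and treating the Unmanaged class as "always within target" is an assumption based on its None entries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceClass:
    code: str
    description: str
    downtime_h: Optional[float]        # max hours until minimal service is back
    reduced_h: Optional[float]         # max hours until capacity is above 50%
    degraded_h: Optional[float]        # max hours until capacity is above 80%
    availability_pct: Optional[float]  # availability target


SERVICE_CLASSES = {
    "C": ServiceClass("C", "Critical", 1, 1, 4, 99),
    "H": ServiceClass("H", "High", 4, 6, 6, 99),
    "M": ServiceClass("M", "Medium", 6, 6, 12, 99),
    "L": ServiceClass("L", "Low", 12, 24, 48, 98),
    "U": ServiceClass("U", "Unmanaged", None, None, None, None),
}


def meets_targets(code, minimal_h, reduced_h, degraded_h):
    """True if the measured restoration times (hours since the problem
    started) are all within the limits of the given service class."""
    sc = SERVICE_CLASSES[code]
    if sc.downtime_h is None:
        return True  # unmanaged: no targets to miss (assumption)
    return (minimal_h <= sc.downtime_h
            and reduced_h <= sc.reduced_h
            and degraded_h <= sc.degraded_h)


if __name__ == "__main__":
    # Minimal service after 3 h, >50% after 5 h, >80% after 6 h.
    print(meets_targets("H", 3, 5, 6))
    print(meets_targets("C", 3, 5, 6))
```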
Service calendar
Calendar | Description | AccOn | Prime
AP | Accelerator operating, prime shift | Y | Y
AS | Accelerator operating, second shift | Y | N
OP | Accelerator off, prime shift | N | Y
OS | Accelerator off, second shift | N | N
Some services are critical only during accelerator shift
Other services are less critical outside working hours
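The four calendar codes are simply the combinations of the AccOn and Prime flags. As a small illustration (the function name is hypothetical), the code can be derived from the two flags:

```python
def calendar_slot(accelerator_on: bool, prime_shift: bool) -> str:
    """Map the AccOn and Prime flags to a calendar code (AP, AS, OP, OS)."""
    first = "A" if accelerator_on else "O"
    second = "P" if prime_shift else "S"
    return first + second


if __name__ == "__main__":
    # Accelerator running, outside prime shift.
    print(calendar_slot(True, False))
```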
Products
Product Name | Short Code | Description
Resource Broker | RB | Farms out jobs to sites + logging and book-keeping
MyProxy | PX | Renew/acquire credentials
BDII | BDII | Grid Information System
Compute Element | CE | Gateway to local batch systems
Mon Box | MONB | Grid Monitoring including archiver
Grid View | GRVW | Monitoring of Grid activity
Site Functional Test | SFT | Regular test of components per site
Grid Peek | GRPK | Storage of outputs of running jobs
VOMS | VOMS | Manage user/roles for VOs
Products (cont)
Product Name | Short Code | Description
LCG File Catalog | LFC | Maps file names to storage locations
File Transfer Service | FTS | Reliable file transfer delivery
Storage Element | SE | SRM Compatible Storage Service
Products notes
Provides 1st level breakdown of the grid to smaller units
Surprisingly dynamic list. New products arrive weekly.
Short codes provide basis for naming conventions
Service catalog
Service | Instance | Product | Cst | AP | AS | OP | OS
RBP | Production Resource Broker | RB | SH | C | C | C | C
PXP | Production MyProxy | PX | SH | C | C | C | C
BDIIP | Production Global BDII | BDII | SH | C | C | C | C
BDIIS | Production Site BDII | BDII | SH | H | H | H | H
CEP | Production Compute Element | CE | SH | C | C | C | C
MONBP | Production Monbox | MONB | SH | M | M | M | M
GRVWP | Production Grid View | GRVW | SH | M | L | M | L
SFTP | Production Site Func Test | SFT | SH | M | M | M | M
GRPKP | Production Grid Peek Service | GRPK | SH | M | M | M | M
VOMSP | Production VOMS | VOMS | SH | C | C | C | C
Match product with customer and service class in each calendar slot
Multiple services (e.g. production, test, site…) for a single product
Service catalog (cont)
Service | Instance | Product | Cst | AP | AS | OP | OS
LFCP-ALICE | Alice Production LCG File Catalog | LFC | Alice | H | H | H | H
LFCP-ATLAS | Atlas Production LCG File Catalog | LFC | Atlas | H | H | H | H
LFCP-CMS | CMS Production LCG File Catalog | LFC | CMS | H | H | H | H
LFCP-LHCB | LHCb Production LCG File Catalog | LFC | LHCb | C | C | C | C
FTSP | Production file transfer service | FTS | SH | C | C | C | C
CSTRP | Production Castor + SRM | SE | SH | C | C | C | C
Questionnaire
Simple questions to assess readiness for production
It is not actually necessary to fill out the answers but the questions should be asked
Focus is on the infrastructure
Service questions
What service levels are required for each calendar period?
Who is providing support for the application?
Who supports the infrastructure?
How should the support be contacted?
What support service do they provide?
Configuration questions
What are the application interfaces?
What server does the application run on ?
Is there a picture of the configuration?
What are the application parameters and how are they set up?
Facilities questions
Are all systems in a machine room?
Is the room access controlled?
Is there good power provision? UPS? Batteries?
What is the response time for facilities problems?
Hardware questions
What kind of machine is required? CPU, RAM, disk
Do we need redundancy? Power supply, disk, ...
Do maintenance contracts match the service?
Currently, there are no capacity guides for each application. These are required to avoid the purchase of inappropriate machines.
Sample RB disk calculation
Parameter | Value
Size of input sandbox (MB) | 10
Size of output sandbox (MB) | 10
Jobs / day currently | 21000
Estimated factor for LHC | 3
Sandbox purge time (days) | 14
Jobs in queue | 35000
Total disk space required (MB) | 17,640,000
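The total in the sample RB disk calculation can be reproduced as (input + output sandbox size) x jobs per day x LHC factor x purge time. The slide does not spell out the formula, so this reading is an inference from the numbers; under it, the "jobs in queue" figure does not enter the total. A minimal Python sketch with a hypothetical function name:

```python
def rb_sandbox_disk_mb(input_mb, output_mb, jobs_per_day, lhc_factor, purge_days):
    """Disk needed to keep every job's sandboxes until they are purged."""
    jobs_retained = jobs_per_day * lhc_factor * purge_days
    return jobs_retained * (input_mb + output_mb)


total_mb = rb_sandbox_disk_mb(
    input_mb=10,
    output_mb=10,
    jobs_per_day=21000,
    lhc_factor=3,
    purge_days=14,
)

if __name__ == "__main__":
    print(f"{total_mb:,} MB")  # 17,640,000 MB
```

At 1 TB = 1,000,000 MB this is about 17.6 TB of sandbox space for a single Resource Broker.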
Network questions
What network capacity? OPN connectivity? Bandwidth?
Firewall ports?
Currently, there is no connectivity guide for each application. This is required for secure set-up and appropriate network configuration.
Sample CE ports sheet
Function | Direction | Port
Globus Job Manager | Outgoing | 20000-21000
GridFTP | Incoming | 2811
GRIS BDII | Incoming | 2135
EDG Log Daemon | Incoming | 9002
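A ports sheet like the sample for the CE is easy to keep in machine-readable form, so that a firewall request can be checked against it. The sketch below uses only the four rows from the slide; the `PortRule` type and helper names are hypothetical, and it does not generate rules for any particular firewall product.

```python
from typing import List, NamedTuple, Tuple


class PortRule(NamedTuple):
    function: str
    direction: str  # "Incoming" or "Outgoing"
    first_port: int
    last_port: int


def parse_ports(spec: str) -> Tuple[int, int]:
    """Turn "2811" or "20000-21000" into an inclusive (first, last) pair."""
    first, _, last = spec.partition("-")
    return int(first), int(last or first)


CE_PORTS = [
    PortRule("Globus Job Manager", "Outgoing", *parse_ports("20000-21000")),
    PortRule("GridFTP", "Incoming", *parse_ports("2811")),
    PortRule("GRIS BDII", "Incoming", *parse_ports("2135")),
    PortRule("EDG Log Daemon", "Incoming", *parse_ports("9002")),
]


def functions_using(port: int, direction: str) -> List[str]:
    """List the CE functions that need the given port in the given direction."""
    return [
        rule.function
        for rule in CE_PORTS
        if rule.direction == direction and rule.first_port <= port <= rule.last_port
    ]


if __name__ == "__main__":
    print(functions_using(2811, "Incoming"))
    print(functions_using(20500, "Outgoing"))
```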
Database questions
What is your site's preferred database?
What are the options for each application?
Expected database size / growth?
High Availability options?
Backup / Restore questions
What needs to be backed up for each service?
How do we ensure consistency in the event of a restore? e.g. RB / CE.
Does the software corruption risk differ by application? e.g. LFC/SE vs Proxy
Has a restore test been done?
There is currently no list of critical state data for each application, or of the steps to be executed after a restore.
Operations questions
How are problems identified? Local console? Grid Monitoring?
Who should be contacted to resolve the problem?
Who should be informed of the problem?
What new procedures / operations guides are required?
What is the local coverage for nights / weekends?
How do local and Grid operations interwork?
Validation
Check that the service class matches the answers
A critical service cannot have the server in an office
Check the dependencies so that no critical services depend on non-critical services
FTS, critical, requires MyProxy, therefore the MyProxy service must be critical
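The dependency check described on this slide can be automated once each service has a class and a list of the services it requires. The Python sketch below generalises the rule slightly, flagging any dependency that is less critical than the service relying on it; that generalisation, the ranking table and the function name are assumptions, and the "M" class given to MyProxy in the first example is hypothetical, chosen only to show a violation.

```python
# Lower rank means more critical.
CLASS_RANK = {"C": 0, "H": 1, "M": 2, "L": 3, "U": 4}


def dependency_violations(classes, depends_on):
    """Return (service, dependency) pairs where the dependency has a
    less critical class than the service that relies on it."""
    violations = []
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            if CLASS_RANK[classes[dependency]] > CLASS_RANK[classes[service]]:
                violations.append((service, dependency))
    return violations


depends_on = {"FTS": ["MyProxy"]}

# Hypothetical misconfiguration: MyProxy classed as Medium.
bad = dependency_violations({"FTS": "C", "MyProxy": "M"}, depends_on)

# As the slide requires: MyProxy is critical because FTS is.
good = dependency_violations({"FTS": "C", "MyProxy": "C"}, depends_on)

if __name__ == "__main__":
    print(bad)
    print(good)
```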
Implementation Tracking at CERN
A dashboard approach on the Wiki
Service | Area | Class | Status | Req | Dvl | HW | Ops
RB | WMS | C | WlcgScDashRb | Green | Green | Green | Red
CE | WMS | C | WlcgScDashCe | Green | Green | Green | Yellow
GRPK | WMS | M | WlcgScDashGrpk | Green | Yellow | Green | Red
FTS | DMS | H | WlcgScDashFts | Green | Green | Green | Green
LFC | DMS | C | WlcgScDashLfc | Green | Green | Yellow | Green
BDII | IS | C | WlcgScDashBdii | Green | Green | Green | Green
MYPX | AAS | C | WlcgScDashPx | Green | Green | Green | Yellow
VOMS | AAS | C | WlcgScDashVOMS | Green | Red | Green | Red
MONB | IS | M | WlcgScDashMon | Green | Green | Green | Red
GRVW | IS | M | WlcgScDashGrvw | Green | Green | Green | Yellow
SFT | IS | M | WlcgScDashSft | Green | Green | Yellow | Red
UI | WMS | C | WlcgScDashUi? | Green | Green | Green | Green
SE | DMS | C | WlcgScDashSe? | Green | Yellow | Green | Yellow
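A dashboard of this shape can be summarised by taking the worst of the four assessments (Req, Dvl, HW, Ops) for each service. The Python sketch below copies the rows from the slide as whitespace-separated text; the severity ordering and the helper name are assumptions, and the real dashboard is a set of Wiki pages rather than a script.

```python
ROWS = """\
RB WMS C WlcgScDashRb Green Green Green Red
CE WMS C WlcgScDashCe Green Green Green Yellow
GRPK WMS M WlcgScDashGrpk Green Yellow Green Red
FTS DMS H WlcgScDashFts Green Green Green Green
LFC DMS C WlcgScDashLfc Green Green Yellow Green
BDII IS C WlcgScDashBdii Green Green Green Green
MYPX AAS C WlcgScDashPx Green Green Green Yellow
VOMS AAS C WlcgScDashVOMS Green Red Green Red
MONB IS M WlcgScDashMon Green Green Green Red
GRVW IS M WlcgScDashGrvw Green Green Green Yellow
SFT IS M WlcgScDashSft Green Green Yellow Red
UI WMS C WlcgScDashUi? Green Green Green Green
SE DMS C WlcgScDashSe? Green Yellow Green Yellow
"""

SEVERITY = {"Green": 0, "Yellow": 1, "Red": 2}


def worst_status(line):
    """Return (service, worst assessment) for one dashboard row."""
    service, _area, _cls, _page, *statuses = line.split()
    # capitalize() makes the lookup tolerant of "RED" as written on the slide.
    worst = max(statuses, key=lambda s: SEVERITY[s.capitalize()])
    return service, worst.capitalize()


summary = dict(worst_status(line) for line in ROWS.splitlines())
reds = [service for service, worst in summary.items() if worst == "Red"]

if __name__ == "__main__":
    print("Services with at least one red:", ", ".join(reds))
```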
Common Themes
But it's all green? What's the problem?
Green does not mean no problems. We are often generous with assessments since red/yellow everywhere does not highlight issues.
Operations: No operations or problem determination guides. Limited administration guides. Support call-tree unclear. Backup/Restore details are missing.
Hardware: Limited or no capacity planning information leads to incorrect server sizing. 'Forgot a box' problems, e.g. one per VO, not one per site.
Development: Difficult to match the user expectations (e.g. a critical service) with implementation (e.g. stateful).
Summary
Complete a service catalog for your sites
Check the questions and prepare an action plan to address items under your control
Assess the status by service and concentrate on getting the reds to yellows
More Information
LCG MoU http://lcg.web.cern.ch/lcg/C-RRB/MoU/WLCGMoU.pdf
SC4 Service Definitions for CERN https://uimon.cern.ch/twiki/bin/view/LCG/ScFourServiceDefinition
SC4 CERN Dashboard https://uimon.cern.ch/twiki/bin/view/LCG/WlcgScDash