A Service-Based SLA Model

HEPIX -- CERN, May 6, 2008

Tony Chan -- BNL



Overview

Facility operations is a manpower-intensive activity at the RACF.

Sub-groups are responsible for systems within the facility (tape storage, disk storage, Linux farm, grid computing, network, etc.):

  Software upgrades
  Hardware lifecycle management
  Integrity of facility services
  User account lifecycle management
  Cyber-security

Experience with RHIC operations for the past 9 years.

Support for ATLAS Tier 1 facility operations.

Experience with RHIC Operations

24x7 year-round operations since 2000.

Facility systems classified into 3 categories: non-essential, essential and critical.

Response to system failure depends on component classification:

  Critical components are covered 24x7 year-round. Immediate response is expected from on-call staff.
  Essential components have built-in redundancy/duplication and are addressed the next business day. They are escalated to "critical" if a large number of essential components fail and compromise service availability.
  Non-essential components are addressed the next business day.

Staff provides primary coverage during normal business hours.

Operators contact on-call person during off-hours and weekends.
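The classification rules on this slide can be sketched as a small lookup. This is only an illustration, not the RACF implementation: the function name and the numeric threshold are assumptions, since the slide says only "large number" of essential failures.

```python
from enum import Enum


class Category(Enum):
    CRITICAL = "critical"
    ESSENTIAL = "essential"
    NON_ESSENTIAL = "non-essential"


# Hypothetical value: the slide does not give a number for "large number".
ESSENTIAL_ESCALATION_THRESHOLD = 5


def response_policy(category: Category, failed_essential: int = 0) -> str:
    """Return 'immediate' (24x7 on-call) or 'next-business-day'."""
    if category is Category.CRITICAL:
        return "immediate"
    if (category is Category.ESSENTIAL
            and failed_essential >= ESSENTIAL_ESCALATION_THRESHOLD):
        # Many essential failures compromise the service: treat as critical.
        return "immediate"
    return "next-business-day"
```

A single failed essential component therefore waits for the next business day, while a widespread failure of essential components is handled like a critical one.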

Experience with RHIC Operations (cont.)

Users report problems via ticket system, pagers and/or phone.

Monitoring software instrumented with alarm system:

  Alarm system connected to selected pagers and cell phones.
  Limited alarm escalation procedure (i.e., contact back-up if primary is not available) during off-hours and weekends.
  Periodic rotation of primary and back-up on-call list for each subsystem.
  Automatic response to alarm conditions in certain cases (e.g., shutdown of Linux Farm cluster in case of cooling failure).
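A periodic rotation of primary and back-up on-call staff could look like the sketch below. The weekly period, the reference date and the roster names are assumptions made for illustration; the slide does not say how often the list rotates.

```python
from datetime import date


def on_call(roster, day, period_days=7, epoch=date(2008, 1, 7)):
    """Return (primary, backup) for `day`.

    `epoch` is an arbitrary Monday used as the start of the first period;
    `period_days` is a hypothetical rotation length.
    """
    if len(roster) < 2:
        raise ValueError("need at least a primary and a back-up")
    slot = ((day - epoch).days // period_days) % len(roster)
    return roster[slot], roster[(slot + 1) % len(roster)]
```

With this scheme each period's back-up becomes the primary of the following period.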

Facility operations for RHIC have worked well over the past 8 years.

Service Level Agreement

Service               | Server           | Rank | Comments
----------------------|------------------|------|--------------------------
Network to Ring       |                  | 1    |
Internal Network      |                  | 1    |
External Network      |                  | 1    | ITD handles
RCF firewall          |                  | 1    | ITD handles
HPSS                  | rmdsXX           | 1    |
AFS Server            | rafsXX           | 1    |
AFS File systems      |                  | 1    |
NFS Server            |                  | 1    |
NFS home directories  | rmineXX          | 1    |
CRS Management        | rcrsfm, rcras    | 1    | rcrsfm is 1, rcras is 2
Web server (internet) | www.rhic.bnl.gov | 1    |
Web server (intranet) | www.rcf.bnl.gov  | 1    |
NFS data disks        | rmineXX          | 1    |
Instrumentation       |                  | 2    |
SAMBA                 | rsmb00           |      |
DNS                   | rnisXX           | 2    | Should fail over
NIS                   | rnisXX           | 2    | Should fail over
NTP                   | rnisXX           | 2    | Should fail over
RCF gateways          |                  | 2    | Multiple gateway machines
ADSM backup           |                  | 2    |
Wincenter             | rnts00           | 2/3  |
CRS Farm              |                  | 2    |
LSF                   | rlsf00           | 2    |
CAS Farm              |                  | 2    |
rftp                  |                  | 2    |
Oracle                |                  | 2    |
Objectivity           |                  | 2    |
MySQL                 |                  | 2    |
Email                 |                  | 2/3  |
Printers              |                  | 3    |

A New Operational Model for the RACF

RHIC facility operations follow a system-based approach.

Some systems support more than one service, and some services depend on multiple systems – unclear lines of responsibilities.

Service-based operational approach better suited for distributed computing environment in ATLAS.

Tighter integration of monitoring, alarm mechanism and problem tracking – automate where possible.

Define a system and service dependency matrix.

Service/System Dependency Matrix
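One simple way to represent a service/system dependency matrix is a mapping from each service to the systems it relies on. The sketch below is illustrative only: the system names come from the Overview slide, but the service names and their dependencies are made up and do not reproduce the actual RACF matrix.

```python
# Hypothetical matrix: service -> set of systems it depends on.
DEPENDS_ON = {
    "batch processing": {"linux farm", "disk storage", "network"},
    "data archiving": {"tape storage", "network"},
    "grid job submission": {"grid computing", "linux farm", "network"},
}


def affected_services(failed_system):
    """List the services that depend on a failed system."""
    return sorted(
        service
        for service, systems in DEPENDS_ON.items()
        if failed_system in systems
    )
```

Looking up a failed system then shows directly which services, and so which service owners, are affected.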

Monitoring in the new SLA

Monitor service and system availability, system performance and facility infrastructure (power, cooling, network).

Mixture of open-source and RACF-written components:

  Nagios
  Infrastructure
  Condor
  RT

Choices guided by desired features: historical logs, ease of integration with other software, support from open-source community, ease of configuration, etc.

Nagios

Monitor service availability.

Host-based daemons configured to use externally-supplied “plugins” to obtain service status.

Host-based alarm response customized (e-mail notification, system reboot, etc).

Connected to RT ticketing system for alarm logging and escalation.
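A Nagios plugin is just a program that prints one line of status text and exits with code 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The sketch below checks disk usage as an example; the thresholds and function names are assumptions and this is not one of the RACF plugins.

```python
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_pct, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios status code."""
    if used_pct >= crit:
        return CRITICAL
    if used_pct >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (status code, status line) for the file system at `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - zero-size file system at {path}"
    used_pct = 100.0 * usage.used / usage.total
    status = evaluate(used_pct, warn, crit)
    return status, f"DISK {LABELS[status]} - {used_pct:.1f}% used on {path}"


def main(argv):
    """Print the status line and return the exit code for Nagios."""
    code, message = check_disk(argv[1] if len(argv) > 1 else "/")
    print(message)
    return code

# A real plugin script would end with: sys.exit(main(sys.argv))
```

The daemon reads only the exit code and the printed line, so any check can be wrapped this way.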

Nagios (cont.)

Infrastructure (Cooling)

The growth of the RACF has put considerable strain on power and cooling.

UPS back-up power for RACF equipment.

Custom RACF-written script to monitor power and cooling issues.

Alarm logging and escalation through RT ticketing system.

Controlled automatic shutdown of Linux Farm during cooling or power failures.
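The decision made by such a monitoring script can be sketched as follows. The temperature thresholds, field names and the rule that running on UPS power triggers a shutdown are all assumptions for illustration; the talk does not describe the actual RACF script.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    room_temp_c: float  # machine-room temperature in Celsius
    on_ups: bool        # True when equipment is running on UPS battery power


# Hypothetical thresholds, not taken from the talk.
WARN_TEMP_C = 27.0
SHUTDOWN_TEMP_C = 32.0


def decide(reading: Reading) -> str:
    """Return 'shutdown', 'alarm' or 'ok' for one sensor reading."""
    if reading.on_ups or reading.room_temp_c >= SHUTDOWN_TEMP_C:
        # Power or cooling failure: start a controlled farm shutdown.
        return "shutdown"
    if reading.room_temp_c >= WARN_TEMP_C:
        # Log the alarm (for example as a ticket) but keep running.
        return "alarm"
    return "ok"
```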

Infrastructure (Network)

Use of Cacti to monitor network traffic and performance.

Can be used at switch or system level.

Historical information and logs.

To be instrumented with alarms and integrated into alarm logging and escalation.

Condor

Condor does not have a native monitoring interface.

RACF created its own web-based monitoring interface.

Interface used by staff for performance tuning.

Connected to RT for alarm logging and escalation.

Monitoring functions: Throughput, Service Availability, Configuration Optimization

RT

Flexible ticketing system.

Historical records available.

Coupled to monitoring software for alarm logging and escalation.

Integrated in service-based SLA.

Implementing new SLA

Create Alarm Management Layer (AML) to interface monitoring to RT.

Alarm conditions configurable via custom-written rule engine.

Clearer lines of responsibilities for creating, maintaining and responding to alarms.

AML creates RT ticket in appropriate category and keeps track of responses.

AML escalates the alarm when the RT ticket is not addressed within a configurable amount of time.

Service Coordinators oversee management of service alarms.
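The escalation step can be sketched as below. This is a minimal illustration under stated assumptions, not the AML code: it treats a ticket still in RT status "new" as not addressed, gives each contact one paging window, and uses made-up class and function names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alarm:
    ticket_id: int
    created: datetime
    ticket_status: str  # RT status: "new", "open" or "resolved"
    contacts: list      # ordered: primary first, then back-ups
    level: int = 0      # index of the contact paged most recently


def escalate_if_stale(alarm, now, page_time):
    """Return the next contact to page, or None if no escalation is due."""
    if alarm.ticket_status != "new":
        return None  # someone has taken or resolved the ticket
    deadline = alarm.created + page_time * (alarm.level + 1)
    if now < deadline or alarm.level + 1 >= len(alarm.contacts):
        return None  # still within the window, or nobody left to page
    alarm.level += 1
    return alarm.contacts[alarm.level]
```

Calling this function periodically for every unresolved alarm gives the time-based escalation described above.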

How It Works

What data is logged?

Host, service, host group, and service group

Alarm timestamp

NRPE (Nagios) message content

Alarm status

Notification status

RT ticket status (new, open, resolved)

Timestamp of latest RT update

Due date

RT ticket information (number, queue, owner, priority, etc)
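The logged fields listed above map naturally onto one record per alarm. The sketch below is one possible layout; the class name, field names and the is_overdue helper are illustrative assumptions, not the actual AML schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class AlarmRecord:
    host: str
    service: str
    host_group: str
    service_group: str
    alarm_time: datetime
    nrpe_message: str
    alarm_status: str
    notification_status: str
    rt_status: str                       # "new", "open" or "resolved"
    rt_last_update: Optional[datetime] = None
    due: Optional[datetime] = None
    rt_ticket: dict = field(default_factory=dict)  # number, queue, owner, priority, ...

    def is_overdue(self, now: datetime) -> bool:
        """True if the ticket is unresolved and past its due date."""
        return (self.due is not None
                and self.rt_status != "resolved"
                and now > self.due)
```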

Example Configuration (rule) File

[linuxfarm-testrule]
host: testhost(\d) (Regular expression compatible)
service: condorq, condor
hostgroup: any
queue: Test
after_hours_PageTime: 30
work_hours_PageTime: 60
work_hours_response_time: 120 (When does the problem need to be resolved by)
after_hours_response_time: 720 (When does the problem need to be resolved by)
auto_up: 1 (Page people)
down_hosts: 2 (Number of down hosts to be a real problem)
firstContact: test-person@pager
secondContact: [email protected]
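A rule file in this style can be read with Python's standard configparser, which accepts "key: value" lines. The sketch below is not the RACF rule engine: the helper names are made up, the parenthetical annotations from the slide are left out of the sample text, and it assumes the host pattern must match the whole host name.

```python
import configparser
import re

# Subset of the example rule, without the explanatory annotations.
RULE_TEXT = r"""
[linuxfarm-testrule]
host: testhost(\d)
service: condorq, condor
hostgroup: any
queue: Test
after_hours_PageTime: 30
work_hours_PageTime: 60
down_hosts: 2
firstContact: test-person@pager
"""


def load_rules(text):
    """Parse rule text into {rule name: {key: value}}."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the original key capitalization
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


def matches(rule, host, service):
    """True if an alarm for (host, service) falls under this rule."""
    services = [s.strip() for s in rule["service"].split(",")]
    return re.fullmatch(rule["host"], host) is not None and service in services
```

For each incoming alarm, the first rule that matches would then supply the queue, paging times and contacts.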

New Response Mechanism

Summary

Well-established procedures from RHIC operational experience.

Need service-based SLA for distributed computing environment.

Create Alarm Management Layer (AML) to integrate RT with monitoring tools and create clearer lines of responsibilities for staff.

Some features already functional.

Expect full implementation by late summer 2008.