ITSM Guide - Extract Chapter 3.2 - Service Operations

INNOTRAIN IT

IT Service Management

QUICK – SIMPLE - CLEAR

Preview

Extract

Chapter 3.2

2011

Authors

Dr. Mariusz Grabowski, Universität der Wirtschaft Krakau

Dr. Claus Hoffmann, Beatrix Lang GmbH

Philipp Küller, Hochschule Heilbronn

Elena-Teodora Miron, Universität Wien

Dr. Dariusz Put, Universität der Wirtschaft Krakau

Dr. Piotr Soja, Universität der Wirtschaft Krakau

Dr. Janusz Stal, Universität der Wirtschaft Krakau

Marcus Vogt, Hochschule Heilbronn

Dr. Eng. Tadeusz Wilusz, Universität der Wirtschaft Krakau

Dr. Agnieszka Zając, Universität der Wirtschaft Krakau

3.2 Service operations

3.2.1 Service and infrastructure operations

"Help, my screen went black!" - We are all familiar with this call from users. Every day, we

encounter a wide variety of these incidents in system operation, i.e. deviations from the plans. This

chapter explains how to handle these incidents in a structured manner.

But first, let's briefly explain the relevant terminology: The service desk is the central contact for all

inquiries from users, according to the principle of "one face to the customer." This provides the

customer with a single point of contact for all IT-related inquiries (e.g. hotline, ticket system). In a

few cases, companies even expand their service desk to handle other inquiries (such as event

management). The purpose of this single point of contact is to handle service requests, create

incidents or submit requests for change. The service desk can be understood as a kind of funnel

that collects all messages and then steers them to the correct processes. Initially, enquiries of all

types are treated as incidents.

Service desk

The service desk is the central function of ITSM. It is the link between the IT service and business

operations. This function is used to transact all enquiries from, and support provided to, employees.

The incident is an unplanned interruption (such as a workstation computer that won't work) or

reduction of the quality of a service (like a slow Internet connection). Failures of elements of the

configuration (Configuration Items) can be treated as an incident without a direct effect on the

service. In this case, one example would be the failure of a mirrored hard drive, where the server is

still up and running. In contrast, Service Requests are enquiries from users (with regard to

information, consultation, standard changes, access) that do not have an effect on the service.

They are one way to satisfy customers' needs. One example is a request for more printer toner

when the printer indicates that it will run out soon.

Disruption or incident?

An incident is defined as an IT disruption or IT service enquiry. Examples of an incident could be:

"My Excel is crashing" or "I need to create a PDF from Excel. How do I do that?". All incidents

should be processed by the central service desk and the status updated to enable later evaluation.

The process for managing and processing incidents is usually known as incident management. It

passes through various phases:

1. Entering and classifying the incident

2. Diagnosing the incident

3. Escalating the incident

4. Closing out the incident

The current Version 3 of the IT Infrastructure Library (ITIL) provides a sample process as a best

practice. This can provide the basis for a company to develop its own processes, as the

requirements differ enormously from company to company. For smaller companies, it is surely

good enough to just pick up the phone without any overarching workflow, but in medium to large

enterprises, this results in repeated interruptions of employees' core tasks. In this case, it is

worthwhile to take the more formal route.

In both cases, however, it makes sense to document the incidents and analyse them as part of an

improvement process. Ideally, the incidents should be entered into a ticket system, but a simple list

might be good enough at the beginning.

ITIL recommends the following procedure:

Figure 9 - Incident process based on ITIL V3

Phase I - Identifying, entering and categorising the disruption

The identification of incidents is normally shared equally among the large number of users. Most

deviations can be identified easily, and because of the high relevance for users, they gladly accept

the effort required to report these incidents.

In many cases, however, a proactive configuration is responsible for identifying an incident or

imminent incident. For example, deviations can be identified on the hardware level (such as the

failure of a hard drive in a RAID array) and reported. On a system level, the use of monitoring

solutions has become widespread. This allows, for example, the function of an e-mail server to be

checked by having the monitoring application send an e-mail and measuring how long it takes to

receive a response. If a threshold value is exceeded during this process, this is designated as an

event and an alarm is triggered.
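
The threshold check described above can be sketched in a few lines. This is only an illustration: the 60-second limit, the check name and the `Event` structure are assumptions, and sending the test e-mail and measuring the round trip are left to whatever monitoring tool is in use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """A monitoring event that should trigger an alarm."""
    check: str
    message: str


def evaluate_round_trip(seconds: Optional[float], threshold: float = 60.0) -> Optional[Event]:
    """Compare a measured e-mail round-trip time with a threshold.

    `seconds` is None when no response arrived at all.
    Returns an Event if the threshold is exceeded, otherwise None.
    """
    if seconds is None:
        return Event("mail-round-trip", "no reply received")
    if seconds > threshold:
        return Event("mail-round-trip", f"reply took {seconds:.0f}s (limit {threshold:.0f}s)")
    return None
```

A returned `Event` would then be passed on to the alarm mechanism, for example by creating a ticket automatically.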

Depending on the systems used and the corporate culture in place, actually entering the disruption

could be the responsibility of the user (in a ticket system, for example) or the service desk agent

(for phone calls or e-mails). In large companies, correctly categorising the ticket ensures that it will

be forwarded to exactly the right specialist. However, even in small companies, it provides the

ability—beginning at a critical mass of disruptions—to identify weak points and areas for

improvement. For example, if a particularly large number of problems with an office application are

reported, it may be worthwhile to provide the users with better training or replace the application.

Like the entry of the disruption, its categorisation can also be carried out by the user or an IT

employee. Based on the collected data, a decision can be made as to whether the entry pertains to

a disruption or service request, which triggers a separate process.
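
As a rough sketch, categorisation can drive both the forwarding of a ticket and the later search for weak points. The routing table, the team names and the "critical mass" of five reports are hypothetical values chosen for illustration only.

```python
from collections import Counter

# Hypothetical routing table: category -> responsible team.
ROUTING = {
    "office application": "application support",
    "network": "network team",
    "printer": "desktop support",
}


def route(category: str) -> str:
    """Return the team for a category; unknown categories stay with the service desk."""
    return ROUTING.get(category, "service desk")


def weak_points(categories: list[str], critical_mass: int = 5) -> list[str]:
    """Return the categories reported at least `critical_mass` times, most frequent first."""
    counts = Counter(categories)
    return [category for category, n in counts.most_common() if n >= critical_mass]
```

Even a simple list of tickets with a category column is enough to run this kind of evaluation.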

Phase II - Prioritising

The priority assigned to an incident specifies

how the incident is handled by the employees

and tools of the service desk. The prioritising

process often holds a large potential for conflict

between users and service providers, as by

nature users will always assign top priority to

their own incidents. Experience has shown that

in many cases in which users determine the

priority themselves, the priority differs

significantly from reality. Therefore, it is better

to have the incident classified by the service

desk employee, as only he or she has the

necessary overview of the current situation in the company. The definition of the priority is based

on the effect on the supported business processes and the urgency until this effect takes hold.

Figure 10 – Example of prioritising incidents

Based on the priority assigned to the incident, response times (time until troubleshooting begins)

and solution times (time until regular operation) can be defined.
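
A minimal sketch of how a priority could be derived from effect (impact) and urgency, and how response and solution times could be attached to it. The three levels, the five priority classes and all hour values are assumptions for illustration; they are not taken from the figure or from ITIL.

```python
# Hypothetical levels for impact and urgency.
LEVELS = {"high": 0, "medium": 1, "low": 2}

# Hypothetical targets per priority: (response time, solution time) in hours.
TARGETS = {1: (0.5, 4), 2: (1, 8), 3: (4, 24), 4: (8, 72), 5: (24, 120)}


def priority(impact: str, urgency: str) -> int:
    """Derive the priority (1 = highest, 5 = lowest) from impact and urgency."""
    return LEVELS[impact] + LEVELS[urgency] + 1


def targets(impact: str, urgency: str) -> tuple[float, float]:
    """Return the (response time, solution time) targets in hours for an incident."""
    return TARGETS[priority(impact, urgency)]
```

Because the service desk employee sets both inputs, the resulting priority does not depend on how urgent the individual user considers the incident.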

Phase III – Diagnosing and possibly escalating

The objective of the initial diagnostics is to gather all relevant facts (environment data, symptoms

etc.). In many cases, this takes place in direct communication between the service desk employee

and user. If the problem is simple or known, the employee will try to resolve it immediately. If this is

not possible because there is not enough time or the necessary detailed technical knowledge is

lacking, the incident has to be escalated for further handling. Two types of escalation can be

distinguished:

Functional escalation is passing it on to another authority (person or team) with greater

experience. The forwarding can be either internal (to in-house IT employees) or external (e.g. to a

vendor's support staff). Nowadays, this is often referred to as “second-level support”.

Hierarchical escalation refers to notifying and involving higher management levels to support the

escalation. In this process, the higher-level manager is called upon to overcome organisational

hurdles or mobilise additional resources to solve the problem in a timely manner.

The appropriate specialists now have to create the diagnosis or escalate it further until the final

diagnosis is reached. Regardless of the escalation level, the service desk is responsible for the

incident, co-ordinates the activities and provides users with regular updates about the progress of

their incidents.

Phase IV – Remedying the disruption

Once a diagnosis for the disruption has been identified, it can be remedied and the normal state

restored. The solutions should always undergo corresponding testing. For example, printing a test

page after removing a paper jam can provide immediate information as to whether other problems

exist. When applications are adapted in what are known as hot fixes or patches, the possible

interactions should be examined before making them available on a large scale.

After a successful resolution and restore, the incident can be closed out. In doing so, the service

desk should ensure that the user is satisfied with the solution. Many ticket systems support this:

the service desk sets the status of the incident to resolved, and the user then confirms the solution

by closing out the incident. Often, the next step is a brief survey with a few questions (3-5)

to evaluate the quality of the service desk.
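
The split between "resolved" (set by the service desk) and "closed" (confirmed by the user) can be sketched as a small table of allowed status changes. The status names and the option for the user to reopen a ticket are assumptions; real ticket systems define their own workflows.

```python
# Hypothetical workflow: (current status, actor) -> allowed next statuses.
TRANSITIONS = {
    ("open", "service desk"): {"in progress"},
    ("in progress", "service desk"): {"resolved"},
    ("resolved", "user"): {"closed", "in progress"},  # confirm the solution or reopen
}


def change_status(current: str, new: str, actor: str) -> str:
    """Return the new status, or raise ValueError if the actor may not make this change."""
    if new not in TRANSITIONS.get((current, actor), set()):
        raise ValueError(f"{actor} cannot move a ticket from {current!r} to {new!r}")
    return new
```

In this sketch, an attempt by the service desk to close a resolved ticket is rejected, so the user's confirmation cannot be skipped.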

Problem Management

Incident resolved – is that all there is to it? Of course not. In many cases, the incident is

resolved quickly using the process we just described, but the cause is not eliminated and may

result in further problems. For example, if paper jams occur frequently in a certain type of printer,

this type could have a manufacturing defect or be incompatible with the paper being used.

Problem management is concerned with just these kinds of root causes.

Problem

A problem exists when multiple incidents indicate a pattern. Central management of the incidents

by the Help Desk allows recurring problems to be identified (e.g. Excel always crashes for user XY

whenever he or she has Word open at the same time) and long-term solutions can be found.
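
Spotting such a pattern in centrally recorded incidents can be sketched as a simple grouping. The tuple layout and the threshold of three incidents are assumptions made for this example.

```python
from collections import defaultdict


def problem_candidates(incidents, min_incidents=3):
    """Group incidents by (configuration item, symptom) and return the groups that recur.

    `incidents` is an iterable of (incident_id, configuration_item, symptom) tuples.
    Returns a dict mapping (configuration_item, symptom) to the linked incident ids.
    """
    groups = defaultdict(list)
    for incident_id, item, symptom in incidents:
        groups[(item, symptom)].append(incident_id)
    return {key: ids for key, ids in groups.items() if len(ids) >= min_incidents}
```

Each returned group is only a candidate for a problem record; the linked incident ids give problem management a starting point for the diagnosis.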

A problem, i.e. a root cause of one or more incidents, is handled by the problem management

process in multiple steps. Again, ITIL provides an adequate reference process:

1. Identifying the problem; this is done by the employees of the service desk, technical

support team or event management.

2. Entering the problem, providing links to the corresponding malfunctions, including a

categorisation for later reporting and the prioritisation of the problem, in a way similar to

incident management.

3. Diagnosing the problem with the objective of identifying the root cause. If the cause has

been identified but no solution is yet available, a workaround (e.g. restart printer) has to

be defined. This is entered as a known error and made available to the service desk so

that it can remedy the corresponding disruption more quickly.

4. Finding a solution with the objective of implementing it as quickly as possible. However, if

a change is necessary for final resolution, this should be done using the procedure defined

in the change management system. This structured procedure reduces and acts as a

check on the possible effects (for more information, refer to Chapter 5).

Both incident management and problem management are based on identical concepts with regard

to personnel and tools. In larger organisations, it is recommended to establish a separate team that

runs the service desk function. These organisations can consider concepts such as the centralised

or decentralised service desk, virtual service desk (e.g. in collaboration with a supplier) or even

corresponding time zone concepts for international companies (follow-the-sun principle). In small IT

organisations, the function can also be entrusted to an employee who is responsible for the service

desk and is supported by his or her colleagues. Ideally, this should implement the concept of "one

face to the customer" or, in other words, one contact person for the user in all matters. It makes it

easier for users to communicate with IT, intercepts trivial enquiries directly and enables the

remaining employees (e.g. developers or administrators) to concentrate on their core topics.

On the tool side, numerous commercial and open source solutions are available today. Ideally, the

service desk should have the following applications available, which are integrated into one

solution or linked to each other via logical interfaces:

1. Ticket system that manages and documents a disruption or problem over its entire life

cycle. It should also enable communication with the user (e.g. via a Web interface or by e-

mail).

2. Database for collecting known errors and solutions (known error database, KEDB). This

does not always need to be an elaborate solution. For smaller organisations, a simple list is

usually sufficient.

3. A configuration management database (CMDB) is a tool that supports many areas. The

database supplies data and information about the entire IT landscape and thus helps to

identify the context and identify problems more easily. For example, you can read which

employee uses which type of printer at his or her workstation. For more information on this

topic, refer to Chapter 4.
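
To illustrate how small such tools can start, the following sketch keeps known errors and a CMDB extract in plain Python structures. All entries, keywords and names are hypothetical, and the keyword matching is deliberately naive (it splits on whitespace and ignores punctuation handling).

```python
# Hypothetical known error entries: symptom keywords and the documented workaround.
KNOWN_ERRORS = [
    {"keywords": {"printer", "jam"}, "workaround": "restart printer"},
    {"keywords": {"excel", "crash"}, "workaround": "close Word before opening Excel"},
]

# Hypothetical CMDB extract: user -> printer type at his or her workstation.
CMDB_PRINTERS = {"user_xy": "Model A"}


def find_workarounds(description: str) -> list[str]:
    """Return the workarounds whose keywords all occur in the incident description."""
    words = set(description.lower().split())
    return [entry["workaround"] for entry in KNOWN_ERRORS if entry["keywords"] <= words]
```

A service desk agent could look up the workaround while the user is still on the phone and check in the CMDB extract which device is affected.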

3.2.1 Systems & outsourced services

Hands on the keyboard: is the heart of your IT still beating? This chapter is all about the heart of

information technology – the applications, systems, networks and hardware. However, a wide

variety of activities are required in order to set up, maintain and operate this complex configuration.

Have you already outsourced everything? Even if you have, this chapter provides valuable

information.

Before we really get down to business, let's stick with the subject of management for a bit. Many IT

folks consider managing availability and capacity to be a strategic or tactical task. In smaller IT

organisations, however, the usual scenario is that the specialist knows his or her systems in detail

while also providing them with conceptual support; both topics are shifted to the operational level.

Availability management is responsible for all aspects that pertain to the availability of a service.

Generally speaking: when required by the customer, a service provides the needed and planned

function as set forth in the SLA [Service Level Agreement]. Concretely put: when the user wants to

retrieve his or her e-mail, the corresponding e-mail server has to be working. Therefore, availability

management serves as a monitor to ensure adherence to the objectives defined in the SLA and

provides the necessary and possible improvements in terms of availability. In doing so, availability

management can make use of reactive and proactive means:

Reactive:

• Monitoring, measuring, analysing, reporting and verifying the availability

• Examining the non-availability

Proactive:

• Risk assessment and management

• Implementing cost-appropriate countermeasures

• Planning, designing and testing new or changed services

• Testing the availability and failure mechanisms

Service providers often attempt to attract customers by promising 99% availability of the service.

This availability in percent is calculated by dividing the actual availability of the service by the

agreed service time:

\[
\text{Availability in \%} = \frac{\text{agreed service time} - \text{time of service unavailability}}{\text{agreed service time}} \times 100
\]

At first glance, the value of 99 percent availability seems very high. However, let's convert this to

minutes and days and see what the results are. Relative to one day, 99-percent availability means

up to 14.4 minutes of downtime. Calculated over the entire year, these daily periods add up

to roughly 3.65 days. On this basis, a decision can be made as to whether 99 percent is a realistic level or

not. In retrospect, it is worthwhile to check that the promise has actually been kept.
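
The conversion can be reproduced with a short calculation. The sketch assumes a service agreed for 24 hours a day, 365 days a year; the function names are chosen for this example.

```python
def availability_percent(agreed_minutes: float, downtime_minutes: float) -> float:
    """Availability in percent: available time divided by agreed service time."""
    return (agreed_minutes - downtime_minutes) / agreed_minutes * 100


def allowed_downtime_minutes(agreed_minutes: float, availability: float) -> float:
    """Downtime in minutes that a promised availability (in percent) still permits."""
    return agreed_minutes * (1 - availability / 100)


MINUTES_PER_DAY = 24 * 60
per_day = allowed_downtime_minutes(MINUTES_PER_DAY, 99)          # about 14.4 minutes
per_year = allowed_downtime_minutes(365 * MINUTES_PER_DAY, 99)   # about 5,256 minutes
per_year_days = per_year / MINUTES_PER_DAY                       # about 3.65 days
```

The first function can also be used in retrospect: feeding in the measured downtime shows whether the promised value was actually kept.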

Another critical point for orientation is service availability (regardless of whether in-house or

outsourced) from provision of the service to its consumption (end-to-end). For example, if we

measure provision of a business application based on server uptime, other circumstances (e.g.

failure of the network) between server and user can cause a downtime, which, however, is not

taken into consideration. Accordingly, the measurement should be carried out as close to the

receiver as possible in order to take all eventualities into account.

If a failure occurs despite all preventive measures, availability management provides two additional

metrics:

• Response time – Time between the report of a disruption and the beginning of troubleshooting.

• Restore time – Time between the report of an incident and restoration of the service.

If service management is outsourced, the most important aspect to be considered when selecting

the service provider is the restore time. Otherwise, the following case can occur: After a hardware

defect, the provider responds within a few minutes and orders the spare part.

However, if the spare part is not in stock and a week passes until delivery, the service remains

unavailable for that entire week.
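
Both metrics are simple differences between timestamps, as the following sketch shows. The dates are invented to mirror the spare-part case: a quick response, but a long restore time.

```python
from datetime import datetime, timedelta


def response_time(reported: datetime, work_started: datetime) -> timedelta:
    """Time between the report of a disruption and the beginning of troubleshooting."""
    return work_started - reported


def restore_time(reported: datetime, restored: datetime) -> timedelta:
    """Time between the report of an incident and restoration of the service."""
    return restored - reported


# Hypothetical example: response after ten minutes, service restored a week later.
reported = datetime(2011, 5, 2, 9, 0)
quick_response = response_time(reported, datetime(2011, 5, 2, 9, 10))
slow_restore = restore_time(reported, datetime(2011, 5, 9, 9, 0))
```

A contract that only guarantees the first value would be fulfilled here, although the service was unavailable for seven days.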

Another management topic is managing the available capacity and the needed capacity in the

future (capacity management). The Capacity Manager acts as the "fortune teller" of corporate IT.

He or she does not look into a crystal ball, but instead analyses the current demand, monitors the

company's development and, based on the corporate strategy, derives the future demand for

services and the underlying infrastructure. He or she must ensure that the needed capacity is

available in the planned quality at all times.

Capacity management consists of three subareas:

• Business capacity management includes all activities intended to identify future business

requirements and reflect them in the capacity plan.

• Service capacity management refers to the activities that provide insight as to the

capacities of the IT services required in the future.

• Component capacity management includes all activities that monitor the capacity,

performance and utilisation of the individual configuration elements (e.g. PC, printer,

telephone, server).

We can put it most simply by saying that the future requirements of business for the services, and

the demand of the services for the resources, have to be taken into account and reflected in the

capacity planning. Based on this plan, actions are possible to ensure that the goals of the SLA are

also met in the future. For example, the growth of the amount of disk space needed can be

documented, a forecast derived from this and additional disk space purchased in a timely manner.

This ensures that a cost-appropriate IT capacity can be maintained.
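
The disk space example can be sketched as a very simple forecast. It assumes one documented usage value per month and linear growth, which is a simplification; real demand may grow in steps or accelerate.

```python
from typing import Optional


def months_until_full(usage_gb: list[float], capacity_gb: float) -> Optional[float]:
    """Estimate the months left until the disk is full, assuming linear growth.

    `usage_gb` holds one documented usage value per month, oldest first.
    Returns None if usage is not growing.
    """
    if len(usage_gb) < 2:
        raise ValueError("at least two monthly values are needed")
    growth_per_month = (usage_gb[-1] - usage_gb[0]) / (len(usage_gb) - 1)
    if growth_per_month <= 0:
        return None
    return (capacity_gb - usage_gb[-1]) / growth_per_month
```

If the result is shorter than the usual delivery time for new hardware, the purchase should be initiated now.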

Up to this point, we have only talked about management of IT and IT services. However, we must

not forget the specialists who install and maintain the applications and systems. Depending on the

size of the company, these technical operations are divided into various teams and responsibilities.

The common differentiation is between responsibility for systems and applications.

In system support, the company's administrators are concerned with all hardware-related topics. In

the ITSM environment, this task is often given the title IT operations management and includes

management of the physical IT infrastructure (typically in data centres or computer rooms). The

foremost goal is safeguarding and optimising the current, stable condition of the infrastructure.

Examples of the tasks of IT operations management include:

• System administration and running operational activities and events

• Console management and job scheduling of the servers

• Backup and restore

• Print management

• Performance measurement and optimisation

• Maintenance activities

• IT facility management (climate control system, power supply etc.)

Application management, on the other hand, is responsible for designing, developing, testing and

improving business applications. The areas of responsibility can vary greatly from company to

company. If the software is developed in-house, the range of application management

responsibilities widens. The other option is to outsource application development. Of course, there

are many increments between these two solutions (e.g. standard software with in-house

adaptations). The tasks of application management are defined as follows:

• Supporting the company's applications

• In some cases, designing, developing, testing and improving applications

• Supporting IT operations management

• Training employees

3.2.2 IT procurement

The rapid development of information technology poses constant challenges to the IT departments

of small and medium-sized enterprises: there are new kinds of technologies, changed services and

innovative products. Do these have the potential to add value to the business or are they merely

self-serving? Many calculation options are available for answering this question:

• Total cost of ownership (TCO)

• Total benefits of ownership (TBO) / Total value of ownership (TVO)

• Static or dynamic investment calculation

• Return on investment (ROI)

It is, in fact, true that these options provide the company with correct results in subareas. Viewed

separately, however, they do not provide valid results in the majority of cases. For example, there

is no correct comparison of all costs and benefits, or only purely monetary variables are used.

Ultimately, it is necessary to clarify whether the total benefit (TBO/TVO) to be expected justifies the

total costs (TCO) to be expected over the service life or even creates a profit situation. In other

words: a return on investment consideration, which is not limited to the investment costs and the

monetary benefits, but considers all costs and benefits.
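
A minimal sketch of such an overall consideration: all expected costs and all expected benefits over the service life are summed and compared. The cost and benefit items and their amounts are invented for illustration, and non-monetary benefits are assumed to have been given a monetary estimate beforehand.

```python
def total_roi(costs: dict[str, float], benefits: dict[str, float]) -> float:
    """Return (total benefit - total cost) / total cost over the service life."""
    tco = sum(costs.values())       # total cost of ownership
    tbo = sum(benefits.values())    # total benefit of ownership
    return (tbo - tco) / tco


# Hypothetical figures over the expected service life.
costs = {"purchase": 10_000, "operation": 6_000, "training": 4_000}
benefits = {"time saved": 18_000, "fewer failures": 7_000}
roi = total_roi(costs, benefits)    # 0.25, i.e. the benefits exceed the costs by 25 %
```

A negative result would indicate that the expected total benefit does not justify the expected total costs.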

Once the investment decision has been made, "all" that is left is to purchase the new IT

components. All too frequently, however, this plan proves to be extremely complex.

Not without reason, as IT procurement processes affect multiple areas of an organization –

including those outside IT – and include services of external providers, such as suppliers.

Accordingly, close co-operation should be pursued and open communication maintained.

Supplier management within the IT organization has the following objectives:

• Regularly observing the procurement market and monitoring trends and innovations

• Selecting suppliers, taking into account the strategic significance for the company's business processes

• Negotiating contracts and agreeing on a fixed scope of services with the suppliers

• Ensuring and continuously increasing the quality of the purchased service

• Managing relationships with suppliers

• Documenting all suppliers, contracts and relationships

In many cases, the tasks are also divided up

between the purchasing department as such and

the IT organization. In doing so, IT co-ordinates

all technical aspects in the cycle, while

purchasing handles structuring the contracts and

pricing.

The greater a supplier's strategic significance for

the company, the more long-term the business

relationships should be. The significance can be

defined based on two variables:

• Value contribution and importance

• Risk and influence

In most cases, a long-term, close-knit co-operation pays off. For example, blanket purchase

agreements often allow more favourable terms and conditions when buying components (e.g. for

the expected quantity of desktop computers in one year), while also allowing optimisations and

reducing the workload in the procurement process. Over the medium term, consistent standardisation

can achieve additional economies of scale.

Figure 11 - Classifying suppliers

Is IT procurement not a relevant topic to companies that have outsourced all services? Even when

full outsourcing is used, it is important for there to be a responsible contact person in the company;

here, too, the customer–supplier relationship has to be maintained, quality monitored and the

market observed regularly.

3.2.3 Security and environment

"Sony says sorry - the Playstation manufacturer has apologised for the massive data theft in its

networks and promised free games as compensation and better security measures. (...)". These or

similar words were used by many daily newspapers to relate the story in spring 2011. Criminality in

the IT environment is nothing new. As a small company, one could surely ask: who could profit

from my data anyway? However, the topic of IT security is more varied than one might think, and

certainly also relevant for smaller companies:

• First names, car marques, birthdays or the favourite football club: many people use easy-to-remember

terms as passwords. Is a corresponding guideline in place in the company?

• Is the company's administrator password securely stored with the administrator's supervisor in

case the admin is absent?

• Are virus scanners installed and are they updated on a regular basis?

• Are hard drives securely deleted (wiped) before being disposed of?

• Are important servers stored where they are safe from water or heat damage?

• What happens to e-mails when an employee is on vacation?

• Who is permitted to use his or her personal mobile phone in the company?
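
As an illustration of the first question, a password guideline could be checked automatically along the following lines. The minimum length of ten characters and the idea of a per-user list of personal terms are assumptions; an actual guideline has to be defined by the company itself.

```python
def violates_guideline(password: str, personal_terms: set[str], min_length: int = 10) -> bool:
    """Return True if the password is too short or contains an easy-to-guess personal term."""
    lowered = password.lower()
    too_short = len(password) < min_length
    contains_term = any(term.lower() in lowered for term in personal_terms)
    return too_short or contains_term
```

Such a check only covers a small part of a guideline; it says nothing about how passwords are stored or how often they are changed.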

Numerous statistics prove that approximately half of security-related incidents are triggered not by

external parties, but by the company's own employees. In almost all cases, this is accidental,

usually out of ignorance, a lack of training or carelessness. Accordingly, SMEs should also analyse

the possible hazards and take countermeasures.

In doing so, all possible risks should be taken into account:

• Protecting the information from unauthorised access and malware (e.g. viruses, hacker

attacks, espionage)

• Provisioning the information to authorised persons (Access Management)

• Securing the infrastructure against influences from the area surrounding the IT (e.g.

overvoltage in the power supply network or power failure, flood, heat or even fire)

The measures taken (e.g. providing a firewall, using a climate control system) are to be considered

preventive. The measures taken should be in proportion to the possible harm. Operating a server in

the supply closet next to chemicals and moist rags is surely negligent. However, an autonomous,

earthquake-proof data centre is surely also not the right choice for a small company. One hundred

percent protection is possible in rare cases only or associated with high costs that are justified in

only a few application areas. However, the possible risks should be specified accordingly and the

measures planned in case the risks do occur. This is done in what is known as an IT recovery plan

for various scenarios. The objective is to restore normal operation of the disrupted service(s) as

quickly as possible. If, for example, the servers have to be shut down during a long-term power

failure, the recovery plan should describe the systematic procedure for restarting them so that all

dependencies between the systems are taken into account and no further delay or even damage

occurs.
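
A start-up sequence that respects such dependencies can be derived mechanically, as this sketch shows. The systems and their dependencies are hypothetical; the ordering uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: system -> systems that must be running first.
DEPENDS_ON = {
    "storage": set(),
    "database": {"storage"},
    "mail server": {"storage"},
    "business application": {"database"},
}


def start_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return a start-up sequence in which every system follows its prerequisites."""
    return list(TopologicalSorter(depends_on).static_order())


order = start_order(DEPENDS_ON)
```

The order of independent systems (here the database and the mail server) is not fixed, and a circular dependency makes `static_order()` raise `graphlib.CycleError`, which is itself a useful finding for the recovery plan.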
