ITSM Guide - Extract Chapter 3.2 - Service Operations
INNOTRAIN IT
IT Service Management
QUICK – SIMPLE - CLEAR
Preview
Extract
Chapter 3.2
2011
Authors
Dr. Mariusz Grabowski, Universität der Wirtschaft Krakau
Dr. Claus Hoffmann, Beatrix Lang GmbH
Philipp Küller, Hochschule Heilbronn
Elena-Teodora Miron, Universität Wien
Dr. Dariusz Put, Universität der Wirtschaft Krakau
Dr. Piotr Soja, Universität der Wirtschaft Krakau
Dr. Janusz Stal, Universität der Wirtschaft Krakau
Marcus Vogt, Hochschule Heilbronn
Dr. Eng. Tadeusz Wilusz, Universität der Wirtschaft Krakau
Dr. Agnieszka Zając, Universität der Wirtschaft Krakau
3.2 Service operations
3.2.1 Service and infrastructure operations
"Help, my screen went black!" - We are all familiar with this call from users. Every day, we
encounter a wide variety of these incidents in system operation, i.e. deviations from the plans. This
chapter explains how to handle these incidents in a structured manner.
But first, let's briefly explain the relevant terminology: The service desk is the central contact for all
inquiries from users, according to the principle of "one face to the customer." This provides the
customer with a single point of contact for all IT-related inquiries (e.g. hotline, ticket system). In a
few cases, companies even expand their service desk to handle other inquiries (such as event
management). The purpose of this single point of contact is to handle service requests, record
incidents and accept requests for change. The service desk can be understood as a kind of funnel
that collects all messages and then steers them to the correct processes. Initially, enquiries of all
types are treated as incidents.
Service desk
The service desk is the central function of ITSM. It is the link between the IT service and business
operations. This function is used to transact all enquiries from, and support provided to, employees.
The incident is an unplanned interruption (such as a workstation computer that won't work) or
reduction of the quality of a service (like a slow Internet connection). Failures of elements of the
configuration (Configuration Items) can be treated as an incident without a direct effect on the
service. In this case, one example would be the failure of a mirrored hard drive, where the server is
still up and running. In contrast, Service Requests are enquiries from users (with regard to
information, consultation, standard changes, access) that do not have an effect on the service.
They are one way to satisfy customers' needs. One example is a request for more printer toner
when the printer indicates that it will run out soon.
Disruption or incident?
An incident is defined as an IT disruption or IT service enquiry. Examples of an incident could be:
"My Excel is crashing" or "I need to create a PDF from Excel. How do I do that?". All incidents
should be processed by the central service desk and the status updated to enable later evaluation.
The process for managing and processing incidents is usually known as incident management. It
passes through various phases:
1. Entering and classifying the incident
2. Diagnosing the incident
3. Escalating the incident
4. Closing out the incident
The current Version 3 of the IT Infrastructure Library (ITIL) provides a sample process as a best
practice. This can provide the basis for a company to develop its own processes, as the
requirements differ enormously from company to company. For smaller companies, it is surely
good enough to just pick up the phone without any overarching workflow, but in medium to large
enterprises, this results in repeated interruptions of employees' core tasks. In this case, it is
worthwhile to take the more formal route.
In both cases, however, it makes sense to document the incidents and analyse them as part of an
improvement process. Ideally, the incidents should be entered into a ticket system, but a simple list
might be good enough at the beginning.
ITIL recommends the following procedure:
Figure 9 - Incident process based on ITIL V3
Phase I - Identifying, entering and categorising the disruption
The identification of incidents is normally shared equally among the large number of users. Most
deviations can be identified easily, and because of the high relevance for users, they gladly accept
the effort required to report these incidents.
In many cases, however, a proactive configuration is responsible for identifying an incident or
imminent incident. For example, deviations can be identified on the hardware level (such as the
failure of a hard drive in a RAID array) and reported. On a system level, the use of monitoring
solutions has become widespread. This allows, for example, the function of an e-mail server to be
checked by having the monitoring application send an e-mail and measuring how long it takes to
receive a response. If a threshold value is exceeded during this process, this is designated as an
event and an alarm is triggered.
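The e-mail round-trip check described above can be sketched as follows. This is a minimal illustration, not a monitoring product: `check_round_trip`, `fake_mail_probe` and the threshold values are hypothetical names and numbers chosen for the example, and the probe only simulates the delay of a real send-and-receive test.

```python
import time
from typing import Callable, Optional


def check_round_trip(probe: Callable[[], None], threshold_s: float) -> Optional[str]:
    """Run one probe and return an alarm message if it was too slow, else None."""
    start = time.monotonic()
    probe()  # e.g. send a test e-mail and wait for the reply
    elapsed = time.monotonic() - start
    if elapsed > threshold_s:
        # Threshold exceeded: this is the "event" that triggers the alarm.
        return f"EVENT: round trip took {elapsed:.2f}s (threshold {threshold_s:.2f}s)"
    return None


def fake_mail_probe() -> None:
    """Hypothetical stand-in for a real mail round trip (assumption: takes ~50 ms)."""
    time.sleep(0.05)


ok_result = check_round_trip(fake_mail_probe, threshold_s=1.0)
alarm_result = check_round_trip(fake_mail_probe, threshold_s=0.01)
```

In practice the probe would be run on a schedule, and the alarm message would be handed to the service desk or event management rather than just returned.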
Depending on the systems used and the corporate culture in place, actually entering the disruption
could be the responsibility of the user (in a ticket system, for example) or the service desk agent
(for phone calls or e-mails). In large companies, correctly categorising the ticket ensures that it will
be forwarded to exactly the right specialist. However, even in small companies, categorisation
makes it possible, once a critical mass of disruptions has been recorded, to identify weak points and areas for
reported, it may be worthwhile to provide the users with better training or replace the application.
Like the entry of the disruption, its categorisation can also be carried out by the user or an IT
employee. Based on the collected data, a decision can be made as to whether the entry pertains to
a disruption or service request, which triggers a separate process.
Phase II - Prioritising
The priority assigned to an incident specifies how the incident is handled by the employees and
tools of the service desk. The prioritising process often holds a large potential for conflict
between users and service providers, as by nature users will always assign top priority to their
own incidents. Experience has shown that in many cases in which users determine the priority
themselves, the priority differs significantly from reality. Therefore, it is better to have the
incident classified by the service desk employee, as only he or she has the necessary overview of
the current situation in the company. The priority is defined by the effect on the supported
business processes (impact) and by how quickly this effect takes hold (urgency).
Figure 10 – Example of prioritising incidents
Based on the priority assigned to the incident, response times (time until troubleshooting begins)
and solution times (time until regular operation) can be defined.
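As a rough illustration of how impact and urgency can be combined into a priority with response and solution times, consider the sketch below. The matrix values and the target times are assumptions made for this example; they are not taken from Figure 10 and would have to be agreed in each company's SLA.

```python
# Assumed matrix: (impact, urgency) -> priority, where 1 is the most urgent.
PRIORITY = {
    ("high", "high"): 1, ("high", "medium"): 2, ("high", "low"): 3,
    ("medium", "high"): 2, ("medium", "medium"): 3, ("medium", "low"): 4,
    ("low", "high"): 3, ("low", "medium"): 4, ("low", "low"): 5,
}

# Assumed targets in hours per priority: (response time, solution time).
TARGETS_H = {1: (0.25, 4), 2: (1, 8), 3: (4, 24), 4: (8, 72), 5: (24, 120)}


def prioritise(impact: str, urgency: str):
    """Return (priority, response time in hours, solution time in hours)."""
    priority = PRIORITY[(impact, urgency)]
    response_h, solution_h = TARGETS_H[priority]
    return priority, response_h, solution_h


mail_server_down = prioritise("high", "high")
toner_running_low = prioritise("low", "medium")
```

Keeping the matrix in one table makes the rules visible to users, which can defuse some of the conflict over who decides the priority.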
Phase III – Diagnosing and possibly escalating
The objective of the initial diagnostics is to gather all relevant facts (environment data, symptoms
etc.). In many cases, this takes place in direct communication between the service desk employee
and user. If the problem is simple or known, the employee will try to resolve it immediately. If this is
not possible because there is not enough time or the necessary detailed technical knowledge is
lacking, the incident has to be escalated for further handling. There are two
types of escalation:
Functional escalation is passing it on to another authority (person or team) with greater
experience. The forwarding can be either internal (to in-house IT employees) or external (e.g. to a
vendor's support staff). Nowadays, this is often referred to as “second-level support”.
Hierarchical escalation refers to notifying and involving higher management levels to support the
escalation. In this process, the higher-level manager is called upon to overcome organisational
hurdles or mobilise additional resources to solve the problem in a timely manner.
The appropriate specialists now have to create the diagnosis or escalate it further until the final
diagnosis is reached. Regardless of the escalation level, the service desk is responsible for the
incident, co-ordinates the activities and provides users with regular updates about the progress of
their incidents.
Phase IV – Remedying the disruption
Once a diagnosis for the disruption has been identified, it can be remedied and the normal state
restored. The solutions should always undergo corresponding testing. For example, printing a test
page after removing a paper jam can provide immediate information as to whether other problems
exist. When applications are adapted in what are known as hot fixes or patches, the possible
interactions should be examined before making them available on a large scale.
After a successful resolution and restore, the incident can be closed out. In doing so, the service
desk should ensure that the user is satisfied with the solution. In many cases, this is implemented
by the system in that the service desk changes the status of the incident to resolved, but the user
can close out the incident. In many cases, the next step is a brief survey with a few questions (3-5)
to evaluate the quality of the service desk.
Problem Management
Incident resolved – is that all there is to it? Of course not. In many cases, the incident is
resolved quickly using the process we just described, but the cause is not eliminated and may
result in further incidents. For example, if paper jams occur frequently in a certain type of printer,
this type could have a manufacturing defect or be incompatible with the paper being used.
Problem management is concerned with just these kinds of root causes.
Problem
A problem exists when multiple incidents indicate a pattern. Central management of the incidents
by the Help Desk allows recurring problems to be identified (e.g. Excel always crashes for user XY
whenever he or she has Word open at the same time) and long-term solutions to be found.
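Spotting such a pattern in the recorded incidents can be as simple as counting. The sketch below assumes that each incident record carries a configuration item (`ci`) and a `category`; these field names, the sample data and the threshold of three incidents are assumptions for illustration only.

```python
from collections import Counter


def find_problem_candidates(incidents, min_count=3):
    """Return (configuration item, category) pairs reported at least min_count times."""
    counts = Counter((i["ci"], i["category"]) for i in incidents)
    return [pair for pair, n in counts.most_common() if n >= min_count]


# Hypothetical incident records as they might come out of a ticket system or list.
incidents = [
    {"ci": "printer-3F", "category": "paper jam"},
    {"ci": "pc-0412", "category": "Excel crash"},
    {"ci": "printer-3F", "category": "paper jam"},
    {"ci": "printer-3F", "category": "paper jam"},
]
candidates = find_problem_candidates(incidents)
```

Each candidate is only a hint that a common cause may exist; whether it really is a problem still has to be decided by a person.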
A problem, i.e. a root cause of one or more incidents, is handled by the problem management
process in multiple steps. Again, ITIL provides an adequate reference process:
1. Identifying the problem; this is done by the employees of the service desk, technical
support team or event management.
2. Entering the problem, providing links to the corresponding malfunctions, including a
categorisation for later reporting and the prioritisation of the problem, in a way similar to
incident management.
3. Diagnosing the problem with the objective of identifying the root cause. If the cause has
been identified but no solution is yet available, a workaround (e.g. restart printer) has to
be defined. This is entered as a known error and made available to the service desk so
that it can remedy the corresponding disruption more quickly.
4. Finding a solution with the objective of implementing it as quickly as possible. However, if
a change is necessary for final resolution, this should be done using the procedure defined
in the change management system. This structured procedure reduces and acts as a
check on the possible effects (for more information, refer to Chapter 5).
Both incident management and problem management are based on identical concepts with regard
to personnel and tools. In larger organisations, it is recommended to establish a separate team that
runs the service desk function. These organisations can consider concepts such as the centralised
or decentralised service desk, virtual service desk (e.g. in collaboration with a supplier) or even
corresponding time zone concepts for international companies (follow-the-sun principle). In small IT
organisations, the function can also be entrusted to an employee who is responsible for the service
desk and is supported by his or her colleagues. Ideally, this should implement the concept of "one
face to the customer" or, in other words, one contact person for the user in all matters. It makes it
easier for users to communicate with IT, intercepts trivial enquiries directly and enables the
remaining employees (e.g. developers or administrators) to concentrate on their core topics.
On the tool side, numerous commercial and open source solutions are available today. Ideally, the
service desk should have the following applications available, which are integrated into one
solution or linked to each other via logical interfaces:
1. Ticket system that manages and documents a disruption or problem over its entire life
cycle. It should also enable communication with the user (e.g. via a Web interface or by
e-mail).
2. Database for collecting known errors and solutions (known error database, KEDB). This
does not always need to be a lofty solution. For smaller organisations, a simple list is
usually sufficient.
3. A configuration management database (CMDB) is a tool that supports many areas. The
database supplies data and information about the entire IT landscape and thus helps to
identify the context and identify problems more easily. For example, you can read which
employee uses which type of printer at his or her workstation. For more information on this
topic, refer to Chapter 4.
3.2.1 Systems & outsourced services
Hands on the keyboard: is the heart of your IT still beating? This chapter is all about the heart of
information technology – the applications, systems, networks and hardware. However, a wide
variety of activities are required in order to set up, maintain and operate this complex configuration.
Have you already outsourced everything? Even if you have, this chapter provides valuable
information.
Before we really get down to business, let's stick with the subject of management for a bit. Many IT
folks consider managing availability and capacity to be a strategic or tactical task. In smaller IT
organisations, however, the usual scenario is that the specialist knows his or her systems in detail
while also providing them with conceptual support; both topics are shifted to the operational level.
Availability management is responsible for all aspects that pertain to the availability of a service.
Generally speaking: when required by the customer, a service provides the needed and planned
function as set forth in the SLA [Service Level Agreement]. Concretely put: when the user wants to
retrieve his or her e-mail, the corresponding e-mail server has to be working. Therefore, availability
management serves as a monitor to ensure adherence to the objectives defined in the SLA and
provides the necessary and possible improvements in terms of availability. In doing so, availability
management can make use of reactive and proactive means:
Reactive:
- Monitoring, measuring, analysing, reporting and verifying the availability
- Examining the non-availability
Proactive:
- Risk assessment and management
- Implementing cost-appropriate countermeasures
- Planning, designing and testing new or changed services
- Testing the availability and failure mechanisms
Service providers often attempt to attract customers by promising 99% availability of the service.
This availability in percent is calculated by dividing the actual availability of the service by the
agreed service time:
$$\text{Availability (in \%)} = \frac{\text{Agreed service time} - \text{Downtime}}{\text{Agreed service time}} \times 100$$
At first glance, the value of 99 percent availability seems very high. However, let's convert this to
minutes and days and see what the results are. Relative to one day, 99-percent availability means
just under 15 minutes of downtime (1 percent of 1,440 minutes is 14.4 minutes). Calculated over the
entire year, these periods add up to roughly 3.65 days. On this basis, a decision can be made as to
whether 99 percent is a realistic level or not. In retrospect, it is worthwhile to check that the
promise has actually been kept.
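The conversion of an availability percentage into downtime can be checked with a few lines of code. The sketch assumes an agreed service time of 24 hours a day, 365 days a year; the function names are made up for this example.

```python
MINUTES_PER_DAY = 24 * 60


def availability_pct(agreed_minutes: float, downtime_minutes: float) -> float:
    """Availability in percent: (agreed service time - downtime) / agreed service time."""
    return (agreed_minutes - downtime_minutes) / agreed_minutes * 100


def allowed_downtime_minutes(target_pct: float, agreed_minutes: float) -> float:
    """Downtime in minutes that a target availability still permits over a period."""
    return agreed_minutes * (100 - target_pct) / 100


per_day = allowed_downtime_minutes(99, MINUTES_PER_DAY)         # about 14.4 minutes
per_year = allowed_downtime_minutes(99, 365 * MINUTES_PER_DAY)  # about 5,256 minutes
per_year_days = per_year / MINUTES_PER_DAY                      # about 3.65 days
```

If the service is only agreed for office hours, the agreed service time shrinks and the same percentage permits correspondingly less downtime.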
Another critical point for orientation is service availability (regardless of whether in-house or
outsourced) from provision of the service to its consumption (end-to-end). For example, if we
measure provision of a business application based on server uptime, other circumstances (e.g.
failure of the network) between server and user can cause a downtime, which, however, is not
taken into consideration. Accordingly, the measurement should be carried out as close to the
receiver as possible in order to take all eventualities into account.
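Why server uptime alone overstates what the user experiences can be shown with a small calculation. The sketch assumes that the components between provider and user form a simple chain and fail independently of each other, so that their availabilities multiply; the component names and figures are invented for the example.

```python
from math import prod


def end_to_end_availability(component_pcts):
    """Availability in percent of a chain in which every component must be working."""
    return prod(p / 100 for p in component_pcts) * 100


# Hypothetical chain: server, company network, user's workstation.
chain = {"server": 99.9, "network": 99.5, "workstation": 99.0}
end_to_end = end_to_end_availability(chain.values())
```

The end-to-end figure is lower than that of the weakest single component, which is why measuring close to the user gives the more honest number.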
If a failure occurs despite all preventive measures, availability management provides two additional
metrics:
- Response time – Time between the report of a disruption and the beginning of
troubleshooting.
- Restore time – Time between the report of an incident and restoration of the service.
If service management is outsourced, the most important aspect to be considered when selecting
the service provider is the restore time. Otherwise, the following case can occur: After a hardware
defect, the provider responds within a few minutes and orders the spare part.
However, if the spare part is not in stock and a week passes until delivery, the service remains
unavailable for that entire week, even though the response time was excellent.
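The difference between the two metrics becomes obvious when they are computed from the timestamps of a ticket. The following sketch uses invented timestamps that mirror the spare-part example; the function and variable names are assumptions.

```python
from datetime import datetime, timedelta


def response_and_restore(reported, work_started, restored):
    """Return (response time, restore time), both measured from the report."""
    response_time = work_started - reported
    restore_time = restored - reported
    return response_time, restore_time


reported = datetime(2011, 5, 2, 9, 0)       # user reports the hardware defect
work_started = datetime(2011, 5, 2, 9, 10)  # provider starts troubleshooting
restored = datetime(2011, 5, 9, 11, 0)      # spare part arrives, service is back

response_time, restore_time = response_and_restore(reported, work_started, restored)
```

Here the response time is ten minutes, while the restore time is more than a week; only the second figure tells the customer how long the service was actually missing.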
Another management topic is managing the available capacity and the needed capacity in the
future (capacity management). The Capacity Manager acts as the "fortune teller" of corporate IT.
He or she does not look into a crystal ball, but instead analyses the current demand, monitors the
company's development and, based on the corporate strategy, derives the future demand for
services and the underlying infrastructure. He or she must ensure that the needed capacity is
available in the planned quality at all times.
Capacity management consists of three subareas:
- Business capacity management includes all activities intended to identify future business
requirements and reflect them in the capacity plan.
- Service capacity management refers to the activities that provide insight as to the
capacities of the IT services required in the future.
- Component capacity management includes all activities that monitor the capacity,
performance and utilisation of the individual configuration elements (e.g. PC, printer,
telephone, server).
We can put it most simply by saying that the future requirements of business for the services, and
the demand of the services for the resources, have to be taken into account and reflected in the
capacity planning. Based on this plan, actions are possible to ensure that the goals of the SLA are
also met in the future. For example, the growth of the amount of disk space needed can be
documented, a forecast derived from this and additional disk space purchased in a timely manner.
This ensures that a cost-appropriate IT capacity can be maintained.
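The disk space example can be turned into a very simple forecast. The sketch assumes one usage measurement per month and linear growth, which is a deliberate simplification; the figures and the function name are invented.

```python
def months_until_full(usage_gb, capacity_gb):
    """Estimate the months left until the disk is full, assuming linear growth.

    usage_gb holds one measurement per month, oldest first.
    Returns None if usage is not growing.
    """
    growth_per_month = (usage_gb[-1] - usage_gb[0]) / (len(usage_gb) - 1)
    if growth_per_month <= 0:
        return None
    return (capacity_gb - usage_gb[-1]) / growth_per_month


# Hypothetical monthly measurements of a file server with 1,000 GB capacity.
history = [400, 450, 500, 550]
months_left = months_until_full(history, capacity_gb=1000)
```

With about nine months left in this example, the purchase can be planned well before the lead time of the supplier becomes critical.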
Up to this point, we have only talked about management of IT and IT services. However, we must
not forget the specialists who install and maintain the applications and systems. Depending on the
size of the company, these technical operations are divided into various teams and responsibilities.
The common differentiation is between responsibility for systems and applications.
System support: the company's administrators are concerned with all hardware-related topics. In
the ITSM environment, this task is often given the title IT operations management and includes
management of the physical IT infrastructure (typically in data centres or computer rooms). The
foremost goal is safeguarding and optimising the current, stable condition of the infrastructure.
Examples of the tasks of IT operations management include:
- System administration and running operational activities and events
- Console management and job scheduling of the servers
- Backup and restore
- Print management
- Performance measurement and optimisation
- Maintenance activities
- IT facility management (climate control system, power supply etc.)
Application management, on the other hand, is responsible for designing, developing, testing and
improving business applications. The areas of responsibility can vary greatly from company to
company. If the software is developed in-house, the range of application management
responsibilities widens. The other option is to outsource application development. Of course, there
are many increments between these two solutions (e.g. standard software with in-house
adaptations). The tasks of application management are defined as follows:
- Supporting the company's applications
- In some cases, designing, developing, testing and improving applications
- Supporting IT operations management
- Training employees
3.2.2 IT procurement
The rapid development of information technology poses constant challenges to the IT departments
of small and medium-sized enterprises: there are new kinds of technologies, changed services and
innovative products. Do these have the potential to add value to the business or are they merely
self-serving? Many calculation options are available for answering this question:
- Total cost of ownership (TCO)
- Total benefits of ownership (TBO) / Total value of ownership (TVO)
- Static or dynamic investment calculation
- Return on investment (ROI)
It is, in fact, true that these options provide the company with correct results in subareas. Viewed
separately, however, they do not provide valid results in the majority of cases. For example, there
is no correct comparison of all costs and benefits, or only purely monetary variables are used.
Ultimately, it is necessary to clarify whether the total benefit (TBO/TVO) to be expected justifies the
total costs (TCO) to be expected over the service life or even creates a profit situation. In other
words: a return on investment consideration, which is not limited to the investment costs and the
monetary benefits, but considers all costs and benefits.
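A return on investment consideration of this kind can be sketched as follows. All cost and benefit items and their amounts are invented, and the sketch assumes that non-monetary benefits (such as fewer outages) have already been given an estimated monetary value, which is the difficult part in practice.

```python
def extended_roi(costs, benefits):
    """Return (TCO, TBO, ROI) where ROI = (TBO - TCO) / TCO over the service life."""
    tco = sum(costs.values())
    tbo = sum(benefits.values())
    return tco, tbo, (tbo - tco) / tco


# Hypothetical figures in euros over the planned service life.
costs = {"purchase": 20000, "operation": 12000, "training": 3000}
benefits = {"time saved by staff": 30000, "fewer outages": 12000}

tco, tbo, roi = extended_roi(costs, benefits)
```

In this example the total benefit of 42,000 exceeds the total cost of 35,000, giving a return of 20 percent; leaving out the operating and training costs would have made the investment look far better than it is.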
Once the investment decision has been made, "all" that is left is to purchase the new IT
components. All too frequently, however, this plan proves to be extremely complex.
Not without reason, as IT procurement processes affect multiple areas of an organization –
including those outside IT – and include services of external providers, such as suppliers.
Accordingly, close co-operation should be pursued and open communication maintained.
Supplier management within the IT organization has the following objectives:
- Regularly observing the procurement market and monitoring trends and innovations
- Selecting suppliers, taking into account the strategic significance for the company's business
processes
- Negotiating contracts and agreeing on a fixed scope of services with the suppliers
- Ensuring and continuously increasing the quality of the purchased service
- Managing relationships with suppliers
- Documenting all suppliers, contracts and relationships
In many cases, the tasks are also divided up between the purchasing department as such and the IT
organization. In doing so, IT co-ordinates all technical aspects in the cycle, while purchasing
handles structuring the contracts and pricing.
The greater a supplier's strategic significance for the company, the more long-term the business
relationships should be. The significance can be defined based on two variables:
- Value contribution and importance
- Risk and influence
In most cases, a long-term, close-knit co-operation pays off. For example, blanket purchase
agreements often allow more favourable terms and conditions when buying components (e.g. for
the expected quantity of desktop computers in one year), while also allowing optimisations and
relieving workload in the procurement process. Over the medium term, consistent standardisation
can achieve additional economies of scale.
Diagram 11 - Classifying suppliers
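A classification along the two variables named above, value contribution and risk, might look like the sketch below. The scoring scale from 0 to 1, the threshold and the four category names are assumptions for illustration and are not taken from the diagram.

```python
def classify_supplier(value: float, risk: float, threshold: float = 0.5) -> str:
    """Place a supplier in one of four assumed categories.

    value and risk are scores between 0 and 1 (assumed scale).
    """
    high_value = value >= threshold
    high_risk = risk >= threshold
    if high_value and high_risk:
        return "strategic"
    if high_value:
        return "tactical"
    if high_risk:
        return "operational"
    return "commodity"


# Hypothetical suppliers with estimated scores.
erp_vendor = classify_supplier(value=0.9, risk=0.8)
toner_supplier = classify_supplier(value=0.1, risk=0.1)
```

The more a supplier moves towards the strategic corner, the more effort a long-term relationship and regular reviews deserve.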
Is IT procurement not a relevant topic to companies that have outsourced all services? Even when
full outsourcing is used, it is important for there to be a responsible contact person in the company;
here, too, the customer–supplier relationship has to be maintained, quality monitored and the
market observed regularly.
3.2.3 Security and environment
"Sony says sorry - the Playstation manufacturer has apologised for the massive data theft in its
networks and promised free games as compensation and better security measures. (...)". These or
similar words were used by many daily newspapers to relate the story in spring 2011. Criminality in
the IT environment is nothing new. As a small company, one could surely ask: who could profit
from my data anyway? However, the topic of IT security is more varied than one might think, and
certainly also relevant for smaller companies:
- First names, car marques, birthdays or the favourite football club: many people use
easy-to-remember terms to recall a password. Is a corresponding guideline in place in the company?
- Is the company's administrator password securely stored with the administrator's supervisor in
case the admin is absent?
- Are virus scanners installed and are they updated on a regular basis?
- Are hard drives securely deleted (wiped) before being disposed of?
- Are important servers stored where they are safe from water or heat damage?
- What happens to e-mails when an employee is on vacation?
- Who is permitted to use his or her personal mobile phone in the company?
Numerous statistics prove that approximately half of security-related incidents are triggered not by
external parties, but by the company's own employees. In almost all cases, this is accidental,
usually out of ignorance, a lack of training or carelessness. Accordingly, SMEs should also analyse
the possible hazards and take countermeasures.
In doing so, all possible risks should be taken into account:
- Protecting the information from unauthorised access and malware (e.g. viruses, hacker
attacks, espionage)
- Provisioning the information to authorised persons (Access Management)
- Securing the infrastructure against influences from the area surrounding the IT (e.g.
overvoltage in the power supply network or power failure, flood, heat or even fire)
The measures taken (e.g. providing a firewall, using a climate control system) are to be considered
preventive. The measures taken should be in proportion to the possible harm. Operating a server in
the supply closet next to chemicals and moist rags is surely negligent. However, an autonomous,
earthquake-proof data centre is surely also not the right choice for a small company. One hundred
percent protection is possible in rare cases only or associated with high costs that are justified in
only a few application areas. However, the possible risks should be specified accordingly and the
measures planned in case the risks do occur. This is done in what is known as an IT recovery plan
for various scenarios. The objective is to restore normal operation of the disrupted service(s) as
quickly as possible. If, for example, the servers have to be shut down during a long-term power
failure, the recovery plan should describe the systematic procedure for starting them up again so
that all dependencies between the systems are taken into account and no further delay or even
damage occurs.
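A start-up sequence that respects such dependencies can be derived automatically once the dependencies are written down. The sketch below uses the `graphlib` module from the Python standard library (Python 3.9 or later); the systems and their dependencies are invented for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical systems; each one maps to the systems it depends on.
DEPENDS_ON = {
    "storage": set(),
    "network": set(),
    "database": {"storage", "network"},
    "mail": {"storage", "network"},
    "erp": {"database"},
}


def start_order(depends_on):
    """Return the systems in an order in which every dependency is started first."""
    return list(TopologicalSorter(depends_on).static_order())


order = start_order(DEPENDS_ON)
```

If the documented dependencies contain a cycle, `static_order` raises a `CycleError`, which is itself a useful finding when the recovery plan is written.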