Every cloud has a silver lining

EVERY CLOUD HAS A SILVER LINING

A WHITEPAPER ON ITSM INCIDENT MANAGEMENT PROCESS FOR CLOUD ENVIRONMENT

Cloud Computing has changed the dynamics of the IT Services business, but organizations have not foreseen the changes that ITSM processes and procedures need in order to adopt it. In this publication, I explore the procedure- and process-level changes that the ITIL Incident Management process needs in order to work smoothly in a Cloud Environment.

Published by: Aditya Dashora

© Conceptualized and Published by Aditya Dashora


About the Author

Aditya Dashora, a senior consultant at Infosys Limited, is an IT enthusiast with around 9 years of experience delivering IT Service Management consulting projects for large enterprises across the globe.

Aditya is passionate about helping CIOs and CTOs improve their IT strategy to meet current and future demands. He is also instrumental in exploring and defining new ways of working for organizations by leveraging technology.

Aditya is based in Bangalore, India.

Contact Information:

[email protected]

https://www.linkedin.com/in/adityadashora







CONTENTS

1. Executive Summary
2. A sneak peek into the world of “Cloud”
3. Incident Management process for Cloud
4. Procedural Level Changes
5. Key Performance Indicators
6. Key Policies
7. Technology Considerations
References



1. EXECUTIVE SUMMARY

With adoption growing rapidly, it is widely expected that within the next 5-6 years Cloud Computing will change the rules of the game for established IT Service Providers across the world. Firms doing business in the IT infrastructure space have started to feel nervous about the growing acceptance of IaaS and PaaS services provided by Cloud Vendors. IT Service Management, the instrument that IT Service Providers and IT Support Organizations use to meet the challenges of delivering IT Services to customers, and something of a style statement within the IT service industry, is going to play a vital role in the Cloud IT Shop. However, ITSM concepts will require some restructuring and renovation before they can support a Cloud-based IT Shop.

In this article, I explain the operational-level changes that a traditional Incident Management process needs in order to react accurately and quickly to Incidents/Issues/Events in a Cloud Environment.



2. A SNEAK PEEK INTO THE WORLD OF “CLOUD”

2.1. CLOUD ENVIRONMENT OVERVIEW

The NIST definition states that cloud computing is a model for enabling ubiquitous, convenient,

on-demand network access to a shared pool of configurable computing resources (e.g.,

networks, servers, storage, applications, and services) that can be rapidly provisioned and

released with minimal management effort or service provider interaction. This cloud model is

composed of five essential characteristics, three service models, and four deployment models.

The three Cloud service models are 1) Software as a Service, 2) Platform as a Service and 3) Infrastructure as a Service. Similarly, there are four Cloud deployment models: Private Cloud, Public Cloud, Community Cloud and Hybrid Cloud.

The five essential characteristics of Cloud Computing defined by NIST are 1) On-demand self-service, 2) Broad network access, 3) Resource pooling, 4) Rapid elasticity and 5) Measured service.

Traditionally, in an IT organization, the IT support function (including managed service providers) is responsible for procurement, implementation, support and maintenance of IT Services and components such as Critical Business Applications, Enterprise/Corporate Applications, Messaging Services, Databases, Batch Processing Services, Servers, Middleware, Storage and Back-up infrastructure, Network and IT Security Management. In a cloud implementation, some of these IT components are provided and supported by a Cloud Service Provider on a pay-per-use basis. In the case of a Private Cloud, the Technology Management function becomes the Cloud Provider, while in a Public Cloud organizations obtain services from providers like AWS, Rackspace, Google Compute, MS Azure, Salesforce etc.

The focus of any Cloud implementation is to reduce the cost of IT and ensure high availability. To achieve that, it is important to identify and analyze “IT Services” and “Critical Business Applications” and to define a Cloud implementation strategy.

Some organizations choose to retain some of their critical IT Service components on-premise and move the remainder to the cloud. For example, a manufacturing company can choose to retain its “Order Management System” applications and supporting infrastructure on-premise and offload supporting services like Collaboration Portal, Messaging, CRM, HR Portal etc. to the cloud. This setup is commonly known as Hybrid Cloud or IT Mix.

A common Cloud adoption approach is to move the entire non-production environment into the cloud, which can yield significant cost savings. Applications that require unpredictable capacity during peak load hours are also good candidates for cloud services.



2.2. HYBRID CLOUD – THE REALITY OF THE FUTURE

The Hybrid Cloud Environment is said to be the reality of the future of Cloud Computing. In their cloud adoption journey, enterprises will on the one hand transform their data centers into a private cloud and on the other engage multiple Cloud Providers to enjoy the benefits of Public Cloud. For this white paper, I have considered the case of a big enterprise with a hybrid cloud environment that uses SaaS and IaaS from the Public Cloud along with its Private Cloud. In the next section, I elaborate on the changes required in the Incident Management process to manage a hybrid cloud environment.

The reason for choosing this scenario is that the majority of organizations will opt for this path. Organizations have already invested heavily in their IT environment and own IT assets worth millions of dollars. Also, many organizations would choose to retain some of the IT Services related to their critical business processes. The Hybrid Cloud deployment model provides enough control, governance and flexibility for enterprises to enjoy the best of both worlds.



3. INCIDENT MANAGEMENT PROCESS FOR CLOUD

3.1. INCIDENT MANAGEMENT PROCESS OVERVIEW

ITIL defines an Incident as “An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted service is also an incident, for example failure of one disk from a mirror set.”

ITIL describes Incident Management (referred to as IM hereafter) as “the process for dealing with all incidents; this can include failures, questions or queries reported by the users (usually via a telephone call to the Service Desk), by technical staff, or automatically detected and reported by event monitoring tools.”

These definitions remain fully relevant in a cloud environment. The only update is that the Incident definition can be made more specific so that it covers all the dependencies of IT Services: “An unplanned interruption to an IT service or reduction in the quality of an IT service, or degradation or failure of a configuration item or of any enabler technology, e.g. orchestration, hypervisor, self-service module or monitoring platform.”

Another important aspect to keep in mind is the dynamic nature of a cloud environment: because of it, many “Incidents” would end up becoming minor Change Requests. So, one needs to be specific about what qualifies as an Incident in a cloud environment.

INCIDENT MANAGEMENT (IM) HIGH LEVEL PROCESS DIAGRAM:

Figure 1

This process has worked well for traditional IT support functions and would be instrumental in a cloud environment as well. Some activities may need more emphasis than others. Also, cloud includes a significant amount of automation and self-service; therefore, some of the procedures or activities would be performed automatically or would overlap with subsequent activities (marked with red dotted circles).



SUGGESTED INCIDENT MANAGEMENT PROCESS FOR CLOUD:

Figure 2



4. PROCEDURAL LEVEL CHANGES

4.1. INCIDENT IDENTIFICATION

Incidents are identified in two ways: 1) through the Event Management platform and 2) through user reports. In a cloud environment, there is a higher degree of dependency on Event Monitoring systems, and therefore a mature Event Management process is a prerequisite. Incidents related to enabler technologies like Hypervisors, the Orchestration Engine, Load Balancers, Network or Domain Controllers will be identified by the monitoring tools based on pre-defined thresholds. IT Security related Incidents will be identified by the IT Security Monitoring tools, which will share the information with the central Event Management system or Manager of Managers (MoM) layer. It is important to understand that in a Public Cloud there is a potential risk of data leakage or security breach. Therefore, an IT organization must pay close attention to the security risk measures taken by Public Cloud vendors and should try to establish real-time monitoring of security issues.

For all user-reported Incidents, it is important to determine the single point of failure using the network topology or the CMDB (Configuration Management Database). Traditionally, this activity was performed by L1/L2 support teams, but in a cloud environment the Service Desk or first point of contact should be able to do so in most cases. A cloud implementation brings a good amount of automation and transparency, which enables support staff to determine the single point of failure.

A mature CMDB can provide CI dependency details, but in a cloud environment identifying faulty CIs might not be relevant because of the dynamic nature of the technology. Rather, it would be more meaningful to trace the failed service, its dependencies on other services and its failover plans. Also, the orchestration engine and self-service module can be configured to display ongoing major incidents to users, which can keep the incident queue from filling up with duplicates.

For Incidents related to the public cloud, most issues will be identified by the Cloud Vendor and the remainder will be reported by business users/end users. Ideally, the orchestration engine and service management platform should be capable of fetching real-time data from the public cloud vendor and displaying ongoing outages/Incidents. That would suppress related Incidents. Apart from that, issues related to Network Connectivity, Application Functionality (SaaS) and Application Deployment (PaaS) would be identified and reported by users as usual.
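The suppression of related Incidents can be sketched as a simple lookup against the outages that the vendor currently reports. This is only an illustration: the function name, the outage identifiers and the shape of the outage feed are assumptions, not the interface of any real cloud vendor or service management tool.

```python
from typing import Dict, Optional, Set

# Hypothetical snapshot of ongoing vendor outages: outage id -> affected services.
ONGOING_OUTAGES: Dict[str, Set[str]] = {
    "VENDOR-OUT-1": {"CRM", "Messaging"},
}


def matching_outage(reported_service: str,
                    ongoing_outages: Dict[str, Set[str]]) -> Optional[str]:
    """Return the id of a known outage covering the reported service, if any."""
    for outage_id, services in ongoing_outages.items():
        if reported_service in services:
            return outage_id
    return None


# A new user report for "CRM" can be linked to the known outage
# instead of opening a fresh Incident.
linked = matching_outage("CRM", ONGOING_OUTAGES)
```

A report that matches no known outage returns `None` and would follow the normal logging path.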

In the Hybrid Cloud model, issues related to the Storage Gateway or the connector between on-premise infrastructure and the Public Cloud will be identified and logged by both the cloud service consumer and the vendor (e.g. AWS). However, for better governance, the ownership of the ticket must remain with the Service Desk/L1/L2 support and not with the Cloud Vendor. The same policy applies to tickets created by the vendor's monitoring tools.

4.2. INCIDENT LOGGING

Incident Logging is the second activity in the IM lifecycle, and it is as relevant in a cloud environment as in a traditional IT setup. As stated in the ITIL v3 Service Operation book, “All relevant information relating to the nature of the incident must be logged so that a full historical record is maintained.”

Popular Service Management tools like ServiceNow, Remedy, HPSM etc. provide multiple fields for logging an Incident ticket, and all of them remain applicable in a cloud environment. In addition, a few extra fields are needed to support accurate classification and unique identification of an Incident ticket in a cloud environment.

For example:

- A field identifying the cloud provider would be very helpful in reducing overall ticket handling time. It can be a dropdown with values like private cloud, public cloud etc., or more specific values such as NJ Datacenter, Singapore Datacenter, AWS, Rackspace etc.

- A field for the associated hardware location/country can be helpful in case of security issues. (Tip: every country has different laws for data security.)

- A field for affected Services or business processes would be helpful in communication.

- A check-box for hypervisor-related issues.
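The additional fields listed above could be modelled roughly as follows. This is a minimal sketch: the class and field names are illustrative assumptions and do not correspond to the schema of ServiceNow, Remedy, HPSM or any other tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CloudIncidentTicket:
    """Incident record extended with cloud-specific fields (names are hypothetical)."""
    ticket_id: str
    summary: str
    # Dropdown-style value, e.g. "NJ Datacenter", "Singapore Datacenter", "AWS", "Rackspace"
    cloud_provider: str
    # Country hosting the underlying hardware (data security laws differ by country)
    hardware_country: Optional[str] = None
    # Affected services or business processes, used for stakeholder communication
    affected_services: List[str] = field(default_factory=list)
    # Check-box: is the issue related to the hypervisor layer?
    hypervisor_related: bool = False


ticket = CloudIncidentTicket(
    ticket_id="INC0001",
    summary="CRM portal not reachable",
    cloud_provider="AWS",
    hardware_country="Singapore",
    affected_services=["CRM"],
)
```

Keeping the provider as a mandatory field and the others as optional reflects the idea that routing depends first on knowing which provider is involved.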

A hardware failure can impact multiple services and thousands of users, so Incident Logging becomes a crucial activity for triggering the resolution and recovery work. Such a hardware failure must be treated as a Sev-1 or Critical Incident, and all dependent service owners/business process owners must be notified in real time. Therefore, the Incident Ticket is expected to provide information about all the upstream and downstream dependencies of the failed CI.
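One way to derive the downstream dependencies of a failed CI is a breadth-first walk over a dependency map. The sketch below assumes such a map is available (for example exported from a CMDB); the CI names and the map itself are made up for illustration.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical dependency map: each key lists the items that depend on it directly.
DEPENDENTS: Dict[str, List[str]] = {
    "host-01": ["hypervisor-01"],
    "hypervisor-01": ["vm-mail-01", "vm-crm-01"],
    "vm-mail-01": ["Messaging Service"],
    "vm-crm-01": ["CRM Service"],
}


def impacted_items(failed_ci: str, dependents: Dict[str, List[str]]) -> Set[str]:
    """Return everything downstream of failed_ci using breadth-first search."""
    seen: Set[str] = set()
    queue = deque([failed_ci])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:  # guard against revisiting shared dependents
                seen.add(child)
                queue.append(child)
    return seen


# Everything that should appear on the ticket when the physical host fails.
impacted = impacted_items("host-01", DEPENDENTS)
```

The owners of the services in `impacted` are the ones who would need real-time notification.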

For incidents related to the Public Cloud, the information flow from the vendor’s monitoring and ticketing tools to the host systems is essential, and therefore automation and integration tools will play a critical role.

4.3. INCIDENT CATEGORIZATION

Incident Categorization is performed by the Service Desk staff/IT Support Staff to ensure that appropriate categorization codes are assigned to each Incident. With the help of automation, Event Monitoring tools can also populate Incident Categorization codes while creating an Incident ticket from an Event.

In a cloud environment, Incident Categorization overlaps with Incident Logging; nevertheless, the Incident Categorization metadata must be designed to yield meaningful information for rapid routing of Incidents, Problem Identification and Supplier Management.

A traditional multilevel categorization example:

Category   Tier-1     Tier-2      Tier-3
Incident   Hardware   Server      Memory Board
Incident   Software   Microsoft   Exchange

Table 1

Another popular approach is to categorize by CI and by Service. Example:



CI Name: NN150B12Win2k8A01

Service: Collaboration Service

In a cloud environment, we need to ensure that Incident Categorization provides details on the service provider, the service, the name of the application/service/server, the criticality index etc. For example:

AWS -> Infra -> ABCAWSUSEC001 -> Criticality Index: 1 -> Not Accessible
Salesforce -> Application -> CRM -> Criticality Index: 2 -> Functionality Issue
Private Cloud -> Application -> Exchange Server -> Criticality Index: 2 -> Slow Response
Private Cloud -> Intranet -> Connectivity -> Not Accessible
ATT -> Internet -> Connectivity
AWS -> Security -> Unauthorized Access
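Category strings of this shape can be split into structured parts so that routing rules can key on the provider and the criticality index. The parser below is a sketch under the assumption that levels are separated by "->" and that the criticality index, when present, is written as "Criticality Index: N"; the type and function names are hypothetical.

```python
from typing import List, NamedTuple, Optional


class CloudCategory(NamedTuple):
    provider: str
    tiers: List[str]            # remaining category levels, in order
    criticality: Optional[int]  # None when no criticality index is given


def parse_category(path: str) -> CloudCategory:
    """Split a 'Provider -> ... -> ...' category string into its parts."""
    parts = [p.strip() for p in path.split("->")]
    criticality = None
    tiers = []
    for part in parts[1:]:
        if part.lower().startswith("criticality index:"):
            criticality = int(part.split(":", 1)[1])
        else:
            tiers.append(part)
    return CloudCategory(parts[0], tiers, criticality)


category = parse_category(
    "AWS -> Infra -> ABCAWSUSEC001 -> Criticality Index: 1 -> Not Accessible"
)
```

Here `category.provider` is "AWS" and `category.criticality` is 1, which is enough to route the ticket to the right vendor queue at the right priority.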



4.4. INCIDENT PRIORITIZATION

Incident Prioritization is one of the most critical aspects not only of the IM process but of the whole lifecycle of IT Services. It means allocating an appropriate priority to an Incident based on pre-defined criteria. The allocated priority codes help support staff give each Incident appropriate attention. Most IT Outsourcing Contracts are driven by SLAs that are defined based on Incident Priority Guidelines.

In a cloud environment, Incident Prioritization becomes all the more important because a) there are multiple service providers who may have to work towards Incident Resolution, b) a single hardware or hypervisor failure can affect multiple users and services, and c) due to the heavy dependency on the network (WAN & LAN), any network-related issue must be treated as high priority.

Typically, the priority of an Incident is determined by two factors, “Impact” and “Urgency”, where Impact is how much damage the Incident causes and Urgency is how quickly it needs to be resolved. Some organizations use a questionnaire to determine the impact and urgency. For user-reported Incidents, the user can be asked to provide inputs for determining the urgency.

Incident Priority data or logs are analyzed further for defining and negotiating SLAs (Service Level Agreements), OLAs (Operational Level Agreements) and UCs (Underpinning Contracts). Therefore, in a cloud environment, where there is significant dependency on vendors/service providers, proper Incident Prioritization plays a major role in SLA definition and negotiation. It also helps in identifying good candidates (Apps or Infra) for migration to the public cloud based on impact/urgency analysis.

An example of Incident Prioritization in a Cloud Environment:

                         Urgency
Impact                   High        Medium      Low
Extensive/Widespread     Critical    High        Medium
Significant/Large        High        High        Medium
Moderate/Medium          Medium      Medium      Medium
Localized/Minor          Medium      Low         Low

Urgency Determination Questionnaire (example):

- Revenue Generating Service/Application?
- Brand Exposure?
- Safety Exposure?
- Business Hours?
- CIA Rating of the Service/Application?
- VIP User Profile?
- Orchestration Engine related?

Impact Determination Questionnaire (example):

- Number of Instances/virtual devices?
- Number of Services/Applications?
- Number of Geographical locations?
- BCP Available?
- Network Issue?
- Number of Users?

Table 2
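The impact/urgency matrix in Table 2 lends itself to a direct lookup. The sketch below encodes the table as it appears above; the dictionary keys and the function name are shortened, hypothetical labels, and how the questionnaires map to an impact or urgency level is left out because the paper does not define a scoring rule.

```python
from typing import Dict

# Impact level -> urgency level -> priority, transcribed from Table 2.
PRIORITY_MATRIX: Dict[str, Dict[str, str]] = {
    "Extensive":   {"High": "Critical", "Medium": "High",   "Low": "Medium"},
    "Significant": {"High": "High",     "Medium": "High",   "Low": "Medium"},
    "Moderate":    {"High": "Medium",   "Medium": "Medium", "Low": "Medium"},
    "Localized":   {"High": "Medium",   "Medium": "Low",    "Low": "Low"},
}


def incident_priority(impact: str, urgency: str) -> str:
    """Look up the priority for a given impact and urgency level."""
    return PRIORITY_MATRIX[impact][urgency]


# A widespread outage of a revenue-generating service during business hours.
priority = incident_priority("Extensive", "High")
```

Keeping the matrix as data rather than nested conditions makes it easy to renegotiate individual cells during SLA discussions.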



4.5. INCIDENT ESCALATION

In the traditional IM process, there are two types of Incident Escalation procedures: 1) Functional Escalation and 2) Hierarchical Escalation. Functional Escalation defines the routing model between groups/teams, for example: Service Desk to Wintel Support; Wintel to DBA; DBA to Network; Network to Third Party and so on. Hierarchical Escalation, on the other hand, provides a mechanism to involve senior management or the leadership team in case of a Sev-1 incident or any challenging situation such as ambiguity about Incident Ownership, involvement of a third party on warranty issues, customer dissatisfaction etc.

In a cloud environment, multiple parties are involved in or associated with a Service, and therefore any Service degradation (Incident) requires all the stakeholders to come together in an online forum. For that purpose, Functional and Hierarchical Escalations should run hand in hand. The only difference is that the business might not be interested in knowing the details of an Incident, but would be interested in knowing the impact on their work. So, the communication has to be designed in such a way that it sends out relevant details to each stakeholder.

In the suggested Incident Escalation model for cloud, an Incident is assigned to one support group and, at the same time, the other groups that have any relationship with the Incident are notified. Later, after Incident resolution, one of the affected support groups may be engaged to give a sign-off. Social networking features in the Service Management tool can play a role in this kind of escalation. In case of vendor-related Incidents, the vendor must be informed at the beginning of the Incident lifecycle. Once the Incident is assigned to the vendor, a parallel communication must be sent to the Problem Manager, IT Manager, Vendor Manager and Account Manager (vendor).
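The suggested escalation model (assign to one group, notify related groups in parallel, and add the management roles once the vendor holds the ticket) might be sketched like this. The function and the plan structure are assumptions made for illustration; only the role names come from the text above.

```python
from typing import Dict, List

# Roles that receive a parallel communication once the Incident is assigned to the vendor.
VENDOR_ESCALATION_ROLES: List[str] = [
    "Problem Manager",
    "IT Manager",
    "Vendor Manager",
    "Account Manager (vendor)",
]


def plan_notifications(assigned_group: str,
                       related_groups: List[str],
                       assigned_to_vendor: bool) -> Dict[str, List[str]]:
    """Return who owns the Incident and who is notified in parallel."""
    plan = {
        "assign": [assigned_group],
        # Related groups are informed but do not own the ticket.
        "notify": [g for g in related_groups if g != assigned_group],
    }
    if assigned_to_vendor:
        plan["notify"].extend(VENDOR_ESCALATION_ROLES)
    return plan


plan = plan_notifications("Cloud Vendor", ["Wintel Support", "Network"],
                          assigned_to_vendor=True)
```

In this example the vendor works the ticket while Wintel Support, Network and the four management roles are kept informed.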

SLA BREACH NOTIFICATIONS

In case of an SLA breach warning, a notification must be sent to the group manager, IT manager, IT Director etc. In an actual SLA breach, stakeholders from the business and finance must be involved in addition to the IT leadership team. Some vendors have service-based (non-negotiable) SLAs, and in that case clear expectations must be set with the business. During the Service Design phase, the business should get the option to choose components from the catalog based on an SLA vs. cost analysis. Example:

Server Type              Baseline SLA (turn-around)   Hourly Downtime Cost (post the Baseline SLA)
HPC Windows (Private)    2 Hours                      $7000
HPC Unix (Private)       2 Hours                      $6000
HPC Windows (Public)     Best efforts                 $1500
HPC Unix (Public)        Best efforts                 $1100

Table 3
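To make the SLA vs. cost comparison concrete, the figures in Table 3 can be turned into a small calculation. The sketch below reads "Hourly Downtime Cost (post the Baseline SLA)" as a cost that accrues for each hour of outage beyond the baseline turn-around. That reading, and the treatment of "Best efforts" as having no baseline window, are assumptions rather than rules stated in the paper.

```python
from typing import Dict, Optional, Tuple

# Server type -> (baseline SLA in hours, hourly downtime cost in dollars).
# None means "Best efforts", i.e. no baseline turn-around is committed.
CATALOG: Dict[str, Tuple[Optional[float], float]] = {
    "HPC Windows (Private)": (2.0, 7000.0),
    "HPC Unix (Private)": (2.0, 6000.0),
    "HPC Windows (Public)": (None, 1500.0),
    "HPC Unix (Public)": (None, 1100.0),
}


def downtime_cost(server_type: str, outage_hours: float) -> float:
    """Cost of an outage, counting only hours beyond the baseline SLA."""
    baseline, hourly_cost = CATALOG[server_type]
    # Assumption: with a best-efforts SLA, cost accrues from the first hour.
    billable_hours = outage_hours - (baseline or 0.0)
    return max(billable_hours, 0.0) * hourly_cost


# A five-hour outage on a private HPC Windows server: 3 billable hours.
cost = downtime_cost("HPC Windows (Private)", 5)
```

Under these assumptions the five-hour outage costs $21,000 on the private Windows server, while the same outage on the public Unix option costs $5,500 but comes with no committed turn-around.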



ROLE OF SERVICE DESK

In a traditional enterprise, the Service Desk is responsible for determining the Incident Category, performing an initial investigation based on the knowledge base or Runbook, and finally escalating the ticket to the appropriate support group. Considering the complexity and nature of Incidents in a cloud environment, a traditional service desk function might not be able to do the initial diagnosis and may end up routing the ticket to the wrong support group. Hence it becomes important to upgrade the traditional service desk by merging it with the monitoring team or command center. Combining the two teams forms a function known as an Integrated Command Center (ICC) or IT Operations Center (ITOC), which will have the technical competency to perform initial investigation and escalation in a cloud environment.

We have to keep in mind that the majority of common Incidents related to availability, accessibility, device failure etc. are expected to be eliminated in a cloud environment because of the high-performance compute design. Hence, it makes sense to combine the Service Desk and Command Center and improve productivity.

4.6. INVESTIGATION, RESOLUTION AND RECOVERY

In the traditional IM lifecycle, Incident Investigation and Incident Resolution are defined as sequential activities. In a cloud environment, we should go a step further and combine them for faster turnaround. This is a logical step because, in the previous section, I proposed merging the Service Desk and Monitoring teams for better initial investigation and diagnosis. Unwanted Incident hopping (escalation to wrong groups) should therefore be eliminated, and resolution and recovery should come right after the escalation.

Incident Resolution in cloud should be faster and better than in a traditional IT environment. There must be a higher degree of proactive detection, fault tolerance, redundancy to avoid downtime, auto-correction, and intelligent systems to analyze and detect Incidents proactively.

In a white paper on “Proactive Incident and Problem Management”, VMware defines three Cloud Capability Levels: 1) Reactive, 2) Proactive and 3) Innovative, where Reactive is the lowest maturity level for a cloud provider and Innovative is the highest. The Reactive model is the natural approach, but it is not sustainable in a cloud environment for various reasons, including virtualization, orchestration and lack of clarity on assets/CIs/managed objects. So, it becomes important to develop intelligent systems that analyze event monitoring data, historical ticket data, maintenance tasks, business growth patterns, the IT needs of business processes and other IT drivers, and to move from Reactive capability to Innovative capability.

Incidents in a cloud environment require highly skilled professionals, but at the same time a cloud environment provides enough redundancy to avoid or reduce downtime. Initially there might be some limitations in establishing an approach based on SOPs (Standard Operating Procedures) or Run-books, but in the longer run cloud provides enough opportunities to reduce Incidents and automate resolution tasks. In a cloud environment, IT support staff should work towards ensuring that repetitive Incidents do not recur.

Once the Incident is resolved, it can be retained by the support team itself or passed to another group for validation/sign-off. For user-reported Incidents, a user sign-off must be obtained.



4.7. INCIDENT CLOSURE

Once the Incident is resolved, it enters the final activity of its lifecycle, Incident Closure. Incident Closure is important for ensuring that the required solution has been provided and implemented.

In the Incident Escalation section, I described the Incident or Service Failure notification sent to all stakeholders. Likewise, before closing an Incident, the system needs to ensure that all the stakeholders have given their sign-off. This task can be automated by making it a time-bound forced closure. In the case of public cloud, the closure must be performed only after obtaining the required confirmation from the Cloud Provider.
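The closure rule described here (close when every stakeholder has signed off, force closure after a time limit, and never close a public cloud Incident without provider confirmation) can be sketched as a single check. The three-day waiting period and all names below are assumptions; the paper does not specify a duration.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional

# Assumed waiting period before forced closure; not specified in the paper.
FORCE_CLOSE_AFTER = timedelta(days=3)


def can_close(resolved_at: datetime,
              signoffs: Dict[str, bool],
              now: datetime,
              provider_confirmed: Optional[bool] = None) -> bool:
    """Decide whether a resolved Incident may be closed.

    provider_confirmed is None for Incidents with no public cloud provider.
    """
    if provider_confirmed is False:
        # Public cloud: never close without the provider's confirmation.
        return False
    if all(signoffs.values()):
        return True
    # Time-bound forced closure when some sign-offs are still pending.
    return now - resolved_at >= FORCE_CLOSE_AFTER


resolved = datetime(2024, 1, 1, 9, 0)
pending = {"Service Owner": True, "Network": False}
closable_early = can_close(resolved, pending, resolved + timedelta(hours=1))
closable_late = can_close(resolved, pending, resolved + timedelta(days=4))
```

In this example the Incident stays open one hour after resolution because a sign-off is pending, and becomes closable once the assumed three-day window has passed.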

Most Service Management tools provide Closure Categorization Codes (similar to Incident Categorization), and it would be helpful in a Cloud Environment to use those codes properly.

If the solution provided by the support groups does not completely solve the issue, the stakeholders or end user may choose to re-open the incident. Any re-opened Incident would trigger hierarchical escalation and involve senior management in the lifecycle for better governance.



5. KEY PERFORMANCE INDICATORS

Key Performance Indicators (KPIs) are also known as process performance measurement criteria. As the name indicates, the purpose of KPIs is to evaluate process performance against process goals and objectives. Some mature organizations have tightly coupled their KPIs with Business CSFs (Critical Success Factors).

As stated in the ITIL v3 Guidelines, “A KPI refers to a specific, agreed level of performance that will be used to measure the effectiveness of an organization or process.”

A standard way to define KPIs is the GQM approach, where G is Goal, Q is Question and M is Metric. The goal is very clear here: to ensure that Incidents are resolved at the earliest. The questions we may ask are “What does it take to resolve incidents rapidly?”, “What can cause delay?” and “What are the dependencies?”

When we start thinking along these lines, we come across multiple KPIs related to the Incident Management process. Most of these KPIs are already being used in the industry. In this section, we explore the need to revise the existing KPIs for a Cloud Environment.

Let’s take a look at some of the KPIs:

- Percentage reduction in number of Incidents (month-on-month)

- Percentage reduction in Incident backlog (weekly)

- Percentage increase in SLA compliance (daily/weekly)

- Percentage reduction in incorrectly assigned Incidents (weekly/monthly)

In the case of Cloud, we also need to consider the performance of the “vendor” or partner. Therefore, additional KPIs are needed to ensure the required coverage.

Some examples of additional KPIs for Cloud Incident Management Process:

- Ratio of auto-generated tickets to user-reported tickets

- Percentage reduction in issues escalated to Cloud Service Provider

- Percentage reduction in incorrect escalations to Cloud Service Provider

- Percentage reduction in the Incident Diagnosis time

- Percentage reduction in incorrectly categorized incidents

- Percentage reduction in number of major Incidents

- Percentage reduction in average turn-around time from vendor

- Increase in proactive detection rate
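Most of the KPIs above reduce to two simple calculations: a percentage reduction between two periods and a ratio between two ticket counts. The helpers below are a sketch with hypothetical names and made-up sample counts.

```python
def percentage_reduction(previous: float, current: float) -> float:
    """Percentage reduction from the previous period to the current one.

    A negative result means the figure went up instead of down.
    """
    if previous == 0:
        raise ValueError("previous period value must be non-zero")
    return (previous - current) / previous * 100.0


def auto_to_user_ratio(auto_generated: int, user_reported: int) -> float:
    """Ratio of auto-generated tickets to user-reported tickets."""
    if user_reported == 0:
        raise ValueError("no user-reported tickets in the period")
    return auto_generated / user_reported


# Sample month-on-month figures (illustrative only).
escalation_reduction = percentage_reduction(previous=200, current=150)
ticket_ratio = auto_to_user_ratio(auto_generated=300, user_reported=100)
```

With these sample figures, escalations to the Cloud Service Provider fell by 25 percent and monitoring raised three tickets for every one reported by a user.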



6. KEY POLICIES

Ticket Ownership Policy

Ticket ownership should always remain with the cloud consumer. Having said that, we must account for situations that are controlled internally by the cloud vendor, in which the cloud consumer has no role to play. For those instances, we can consider joint ownership and ensure that the cloud consumer gets real-time updates on the issues.

Escalation Policy

Any escalation to the cloud vendor must be approved or supervised by the L3 support team or the Incident Manager. The team must ensure that incorrect escalations to the cloud vendor are kept to a minimum. For issues related to internal infrastructure or applications, the escalation guidelines are the same as those in the ITIL book.



7. TECHNOLOGY CONSIDERATIONS

As mentioned earlier, technology is going to play a critical role in supporting and managing a cloud environment, and therefore the ITSM Processes must be integrated and orchestrated in such a way that they enable a seamless flow of information between processes, tools and teams. There are four key technology considerations that are critical for running the Incident Management process in Cloud:

- Service Catalog
- Self Service
- Orchestration
- Analytics

Below is a reference high level architecture of Integrated ITSM Processes to support future

technology:

Figure 3



REFERENCES

1. ITIL 2011 Guidelines (https://www.axelos.com/itil)

2. Wikipedia (http://en.wikipedia.org/wiki/Cloud_computing)

3. ServiceNow (http://www.servicenow.com)

4. NIST Cloud Definition
