EVERY CLOUD HAS A SILVER LINING
A WHITEPAPER ON ITSM INCIDENT MANAGEMENT PROCESS FOR CLOUD ENVIRONMENT
Cloud Computing has changed the dynamics of the IT Services business, but organizations have
not foreseen the changes required in ITSM processes and procedures to adopt it. In this
publication, I explore the procedure- and process-level changes needed in the ITIL Incident
Management process for it to work smoothly in a Cloud Environment.
Published by: Aditya Dashora
© Conceptualized and Published by Aditya Dashora
About the Author
Aditya Dashora, a senior consultant from Infosys Limited, is an IT enthusiast with around 9 years
of experience in delivering IT Service Management consulting projects for large enterprises
across the globe.
Aditya is passionate about helping CIOs and CTOs improve their IT strategy to meet current and
future demands. He is also instrumental in exploring and defining new ways of working for
organizations by leveraging technology. Aditya is based out of Bangalore, India.
Contact Information: https://www.linkedin.com/in/adityadashora
CONTENTS
1. Executive Summary
2. A sneak peek into the world of “Cloud”
3. Incident Management process for Cloud
4. Procedural Level Changes
5. Key Performance Indicators
6. Key Policies
7. Technology Considerations
References
1. EXECUTIVE SUMMARY
With its rapidly growing adoption rate, Cloud Computing is widely expected to change the rules of
the game for established IT Service Providers across the world within the next 5-6 years. Firms
doing business in the IT infrastructure space have started to feel nervous about the growing
acceptance of IaaS and PaaS services provided by Cloud Vendors. IT Service Management (ITSM), the
instrument that IT Service Providers and IT support organizations use to tackle the challenges of
delivering IT services to customers, and often treated as a style statement within the IT
services industry, is going to play a vital role in the Cloud IT shop. However, ITSM concepts
will require some restructuring and renovation to attain the capabilities needed to support a
cloud-based IT shop.
In this article, I explain the operational-level changes needed in a traditional Incident
Management process to ensure an accurate and speedy reaction to Incidents/Issues/Events in a
Cloud Environment.
2. A SNEAK PEEK INTO THE WORLD OF “CLOUD”
2.1. CLOUD ENVIRONMENT OVERVIEW
The NIST definition of Cloud says that Cloud computing is a model for enabling ubiquitous,
convenient, on-demand network access to a shared pool of configurable computing resources (e.g.,
networks, servers, storage, applications, and services) that can be rapidly provisioned and
released with minimal management effort or service provider interaction. This cloud model is
composed of five essential characteristics, three service models, and four deployment models.
The three Cloud service models are 1) Software as a Service (SaaS), 2) Platform as a Service
(PaaS) and 3) Infrastructure as a Service (IaaS). Similarly, there are four Cloud deployment
models: Private Cloud, Public Cloud, Community Cloud and Hybrid Cloud.
The five essential characteristics of Cloud Computing defined by NIST are 1) On-demand
self-service, 2) Broad network access, 3) Resource pooling, 4) Rapid elasticity and 5) Measured
service.
Traditionally, the IT support function of an IT organization (including managed service
providers) is responsible for the procurement, implementation, support and maintenance of IT
services and components such as critical business applications, enterprise/corporate
applications, messaging services, databases, batch processing services, servers, middleware,
storage and backup infrastructure, network, and IT security management. In a cloud
implementation, some of these IT components are provided and supported by a Cloud Service
Provider on a pay-per-use basis. In a Private Cloud, the Technology Management function becomes
the Cloud Provider, while in a Public Cloud organizations avail services from providers like AWS,
Rackspace, Google Compute, MS Azure and Salesforce.
The focus of any Cloud implementation is to reduce the cost of IT and ensure high availability.
To achieve that, it is important to identify and analyze “IT Services” and “Critical Business
Applications” and define a Cloud implementation strategy.
Some organizations choose to retain some of their critical IT Service components on-premise and
move the remainder to the cloud. For example, a manufacturing company can choose to retain its
“Order Management System” applications and supporting infrastructure on-premise and offload
supporting services like Collaboration Portal, Messaging, CRM and HR Portal to the cloud. This
setup is commonly known as Hybrid Cloud or IT Mix.
A common Cloud adoption approach is to move the entire non-production environment into the cloud,
which can yield significant cost savings. Applications that require unpredictable capacity during
peak load hours are also good candidates for cloud services.
2.2. HYBRID CLOUD – THE REALITY OF THE FUTURE
The Hybrid Cloud environment is said to be the future reality of Cloud Computing. In the cloud
adoption journey, enterprises will on one hand transform their data centers into a private cloud
and on the other engage multiple Cloud Providers to enjoy the benefits of the Public Cloud. For
this white paper, I have considered the case of a big enterprise with a hybrid cloud environment
that uses SaaS and IaaS from the Public Cloud along with its own Private Cloud. In the next
section, I elaborate the changes required in the Incident Management process to manage a hybrid
cloud environment.
The reason for choosing this scenario is that the majority of organizations will opt to walk this
path. Organizations have already invested heavily in their IT environment and own IT assets worth
millions of dollars. Also, many organizations would choose to retain some of the IT services
related to their critical business processes. The Hybrid Cloud deployment model provides enough
control, governance and flexibility for enterprises to enjoy the best of both worlds.
3. INCIDENT MANAGEMENT PROCESS FOR CLOUD
3.1. INCIDENT MANAGEMENT PROCESS OVERVIEW
ITIL defines an Incident as “An unplanned interruption to an IT service or reduction in the
quality of an IT service. Failure of a configuration item that has not yet impacted service is
also an incident, for example failure of one disk from a mirror set.”
“Incident Management (referred to as IM hereafter) is the process for dealing with all incidents;
this can include failures, questions or queries reported by the users (usually via a telephone
call to the Service Desk), by technical staff, or automatically detected and reported by event
monitoring tools.”
These definitions remain relevant in a cloud environment. The only update is that we can be more
specific so that the definition covers all the dependencies of IT services: “An unplanned
interruption to an IT service or reduction in the quality of an IT service, or service
degradation/failure of a configuration item or any enabler technology, e.g. orchestration,
hypervisor, self-service module or monitoring platform.”
Another important aspect to keep in mind is the dynamic nature of the cloud environment, because
of which many “Incidents” would end up becoming minor Change Requests. So, one needs to be
specific about what qualifies as an Incident in a cloud environment.
INCIDENT MANAGEMENT (IM) HIGH LEVEL PROCESS DIAGRAM:
Figure 1
This process has worked well for IT support functions and would be instrumental in a cloud
environment as well. Some activities may need more emphasis than others. Also, cloud includes a
significant amount of automation and self-service, so some of the procedures or activities would
be performed automatically or would overlap with subsequent activities (red dotted circles).
SUGGESTED INCIDENT MANAGEMENT PROCESS FOR CLOUD:
Figure 2
4. PROCEDURAL LEVEL CHANGES
4.1. INCIDENT IDENTIFICATION
Incident Identification happens in two ways: 1) identification through the Event Management
platform and 2) user-reported Incidents. In a cloud environment, there is a higher degree of
dependency on the event monitoring systems, and therefore a mature Event Management process is a
prerequisite. Incidents related to enabler technologies like hypervisors, orchestration engine,
load balancers, network or domain controllers will be identified by the monitoring tools based on
pre-defined thresholds. IT security related Incidents will be identified by the IT security
monitoring tools, which will share the information with the central Event Management system or
Manager of Managers (MoM) layer. It is important to understand that in a Public Cloud there is a
potential risk of data leakage or security breach. Therefore, an IT organization must pay close
attention to the security measures taken by Public Cloud vendors and should try to establish
real-time monitoring of security issues.
For all user-reported Incidents, it is important to determine the single point of failure using
the network topology or the CMDB (Configuration Management Database). Traditionally, this
activity was performed by L1/L2 support teams, but in a cloud environment the Service Desk or
first point of contact should be able to detect it in most cases. A cloud implementation brings a
good amount of automation and transparency, which enables support staff to determine the single
point of failure.
A mature CMDB can provide CI dependency details, but in a cloud environment it might not be
relevant to identify faulty CIs because of the dynamic nature of the technology. Rather, it would
be more meaningful to trace the failed service, its dependencies on other services and its
failover plans. Also, the orchestration engine and self-service module can be configured to
display ongoing major incidents to users, which helps avoid a queue of duplicate incidents.
For Incidents related to the public cloud, most issues will be identified by the Cloud Vendor and
the remainder will be reported by business users/end users. Ideally, the orchestration engine and
service management platform should be capable of fetching real-time data from the public cloud
vendor and displaying ongoing outages/Incidents. That would suppress the related Incidents. Apart
from that, issues related to network connectivity, application functionality (SaaS) and
application deployment (PaaS) would be identified and reported by users as usual.
In the Hybrid Cloud model, issues related to the storage gateway or connector between on-premise
infrastructure and the Public Cloud will be identified and logged by both the cloud service
consumer and the vendor (e.g. AWS). However, for better governance, ownership of the ticket must
remain with Service Desk/L1/L2 support and not with the Cloud Vendor. The same policy applies to
tickets created by monitoring tools at the vendor.
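The suppression idea described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Outage` record, the function names and the provider/service labels are hypothetical, and a real platform would pull outage data from the vendor's status feed rather than from a hard-coded list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Outage:
    """An ongoing outage published by a cloud vendor (hypothetical shape)."""
    provider: str   # e.g. "AWS"
    service: str    # e.g. "storage-gateway"

def match_known_outage(provider, service, outages):
    """Return the ongoing outage that covers this report, or None."""
    for outage in outages:
        if outage.provider == provider and outage.service == service:
            return outage
    return None

def handle_user_report(provider, service, outages):
    """Link a user report to a known outage instead of opening a new ticket.

    Ownership stays with the Service Desk in both cases, following the
    ticket ownership policy described in the text.
    """
    outage = match_known_outage(provider, service, outages)
    if outage is not None:
        return {"action": "link_to_outage", "owner": "Service Desk", "outage": outage}
    return {"action": "create_incident", "owner": "Service Desk", "outage": None}
```

A report that matches a published outage is linked to it, and anything else becomes a new Incident; either way the ticket owner is the consumer-side Service Desk.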
4.2. INCIDENT LOGGING
Incident Logging is the second activity in the IM lifecycle, and it is as relevant in a cloud
environment as in a traditional IT setup. As stated in the ITIL v3 Service Operation book, “All
relevant information relating to the nature of the incident must be logged so that a full
historical record is maintained.”
Popular Service Management tools like ServiceNow, Remedy and HPSM provide multiple fields for
logging an Incident ticket, and all of them remain applicable in a cloud environment. In
addition, a few extra fields are required to support accurate classification and uniqueness of an
Incident ticket in a cloud environment.
For example:
- A field identifying the cloud provider would help reduce overall ticket handling time. It can
be a dropdown with values like private cloud, public cloud etc., or more specifically NJ
Datacenter, Singapore Datacenter, AWS, Rackspace etc.
- A field for the associated hardware location/country can be helpful in case of security issues
(tip: every country has different laws for data security).
- A field for affected services or business processes would be helpful in communication.
- A check-box for hypervisor-related issues.
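As a rough sketch, the additional fields listed above could be modelled as follows. The class and field names are assumptions chosen for illustration and do not correspond to the schema of ServiceNow, Remedy or any other tool.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CloudIncident:
    """Incident ticket with the extra cloud-specific fields (hypothetical names)."""
    summary: str
    cloud_provider: str            # e.g. "Private Cloud", "AWS", "Rackspace"
    hardware_location: str         # country or datacenter hosting the hardware
    affected_services: List[str] = field(default_factory=list)
    hypervisor_related: bool = False   # the check-box from the list above

    def missing_fields(self) -> List[str]:
        """List the cloud-specific fields that were left empty at logging time."""
        missing = []
        if not self.cloud_provider.strip():
            missing.append("cloud_provider")
        if not self.hardware_location.strip():
            missing.append("hardware_location")
        if not self.affected_services:
            missing.append("affected_services")
        return missing
```

A logging form could call `missing_fields()` before saving the ticket and prompt the Service Desk agent for whatever is still empty.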
In case of a hardware failure that can impact multiple services and thousands of users, Incident
Logging becomes the crucial activity that triggers resolution and recovery work. Such a hardware
failure must be treated as a Sev-1 or Critical Incident, and all dependent service owners/business
process owners must be notified in real time. Therefore, the Incident ticket is expected to
provide information about all the upstream and downstream dependencies of the failed CI.
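One way to derive the list of impacted items for a ticket is a breadth-first walk over a dependency map. The sketch below covers only one direction (everything that depends on the failed CI), and the map and CI names in the usage note are hypothetical; a real implementation would read relationships from the CMDB or service model.

```python
from collections import deque

def impacted_items(failed_ci, dependents):
    """Breadth-first walk over everything that depends on failed_ci.

    `dependents` maps an item to the items that directly depend on it.
    The failed CI itself is not included in the result.
    """
    seen = set()
    queue = deque([failed_ci])
    while queue:
        current = queue.popleft()
        for item in dependents.get(current, []):
            if item not in seen and item != failed_ci:
                seen.add(item)
                queue.append(item)
    return sorted(seen)
```

For example, with `{"host-01": ["vm-mail", "vm-crm"], "vm-crm": ["CRM Service"]}`, a failure of `host-01` yields both virtual machines and the CRM Service, which gives the list of owners to notify.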
For incidents related to the Public Cloud, the information flow from the vendor’s monitoring and
ticketing tools to the host systems is essential, and therefore automation and integration tools
will play a critical role.
4.3. INCIDENT CATEGORIZATION
The Incident Categorization activity is performed by Service Desk/IT support staff to ensure that
appropriate categorization codes are assigned to each Incident. With the help of automation,
event monitoring tools can also populate Incident Categorization codes while creating an Incident
ticket from an Event.
In a cloud environment, Incident Categorization overlaps with Incident Logging; even so, the
categorization metadata must be designed to yield meaningful information for rapid routing of
Incidents, Problem identification and Supplier Management.
A traditional multilevel categorization example:
Category    Tier-1      Tier-2      Tier-3
Incident    Hardware    Server      Memory Board
Incident    Software    Microsoft   Exchange
Table 1
Another popular approach is to categorize by CI Category and Service Category. Example:
CI Name: NN150B12Win2k8A01
Service: Collaboration Service
In a cloud environment, we need to ensure that Incident Categorization provides details on the
service provider, the service, the name of the application/service/server, the criticality index
etc. For example:
AWS -> Infra -> ABCAWSUSEC001 -> Criticality Index: 1 -> Not Accessible
Salesforce -> Application -> CRM -> Criticality Index: 2 -> Functionality Issue
Private Cloud -> Application -> Exchange Server -> Criticality Index: 2 -> Slow Response
Private Cloud -> Intranet -> Connectivity -> Not Accessible
ATT -> Internet -> Connectivity
AWS -> Security -> Unauthorized Access
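A minimal sketch of how such categorization paths could be built from structured fields is shown below. The `Categorization` tuple and its field names are assumptions made for illustration; the rendered strings mirror the arrow-separated examples.

```python
from typing import NamedTuple, Optional

class Categorization(NamedTuple):
    """Structured form of a cloud categorization path (hypothetical field names)."""
    provider: str                      # e.g. "AWS", "Private Cloud"
    service_type: str                  # e.g. "Infra", "Application"
    item: Optional[str] = None         # application, service or server name
    criticality: Optional[int] = None  # criticality index, if assigned
    symptom: Optional[str] = None      # e.g. "Not Accessible"

def format_category(cat: Categorization) -> str:
    """Render a categorization as an arrow-separated path, skipping empty tiers."""
    parts = [cat.provider, cat.service_type]
    if cat.item:
        parts.append(cat.item)
    if cat.criticality is not None:
        parts.append(f"Criticality Index: {cat.criticality}")
    if cat.symptom:
        parts.append(cat.symptom)
    return " -> ".join(parts)
```

Keeping the tiers as separate fields, rather than a free-text string, is what makes routing rules and supplier reports easy to derive from the same data.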
4.4. INCIDENT PRIORITIZATION
Incident Prioritization is one of the most critical aspects not only of the IM process but of the
whole lifecycle of IT services. Incident Prioritization means allocating an appropriate priority
to an Incident based on pre-defined criteria. The allocated priority codes help support staff give
appropriate attention to the Incident. Most IT outsourcing contracts are driven by SLAs that are
defined based on Incident priority guidelines.
In a cloud environment, Incident Prioritization becomes all the more important because a) there
are multiple service providers who may have to work towards Incident resolution, b) a single
hardware or hypervisor failure can affect multiple users and services and c) due to the heavy
dependency on the network (WAN & LAN), any network-related issue must be treated as high priority.
Typically, the priority of an Incident is determined by two factors, “Impact” and “Urgency”,
where Impact is how much damage an Incident causes and Urgency is how quickly it needs to be
resolved. Some organizations use a questionnaire to determine impact and urgency. For
user-reported Incidents, the user can be asked to provide inputs for determining the urgency.
Incident priority data or logs are analyzed further for defining and negotiating SLAs (Service
Level Agreements), OLAs (Operational Level Agreements) and UCs (Underpinning Contracts).
Therefore, in a cloud environment, where there is significant dependency on vendors/service
providers, proper Incident Prioritization plays a major role in SLA definition and negotiation
activities. It also helps in determining good candidates (apps or infrastructure) for migration to
the public cloud based on impact/urgency analysis.
An example of Incident Prioritization in Cloud Environment:
Urgency Determination Questionnaire (example):
- Revenue Generating Service/Application?
- Brand Exposure?
- Safety Exposure?
- Business Hours?
- CIA Rating of the Service/Application?
- VIP User Profile?
- Orchestration Engine related?
Impact Determination Questionnaire (example):
- Number of Instances/virtual devices?
- Number of Services/Applications?
- Number of Geographical locations?
- BCP Available?
- Network Issue?
- Number of Users?
Impact \ Urgency        High        Medium      Low
Extensive/Widespread    Critical    High        Medium
Significant/Large       High        High        Medium
Moderate/Medium         Medium      Medium      Medium
Localized/Minor         Medium      Low         Low
Table 2
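The matrix in Table 2 lends itself to a simple lookup. The sketch below assumes the rows are impact levels and the columns are urgency levels, as the table suggests; the dictionary and function names are illustrative only.

```python
# Priority matrix transcribed from Table 2: outer key = impact, inner key = urgency.
PRIORITY_MATRIX = {
    "Extensive/Widespread": {"High": "Critical", "Medium": "High", "Low": "Medium"},
    "Significant/Large": {"High": "High", "Medium": "High", "Low": "Medium"},
    "Moderate/Medium": {"High": "Medium", "Medium": "Medium", "Low": "Medium"},
    "Localized/Minor": {"High": "Medium", "Medium": "Low", "Low": "Low"},
}

def incident_priority(impact: str, urgency: str) -> str:
    """Look up the priority for an impact/urgency pair from the matrix."""
    try:
        return PRIORITY_MATRIX[impact][urgency]
    except KeyError:
        raise ValueError(f"unknown impact/urgency: {impact!r}/{urgency!r}") from None
```

The answers to the two questionnaires would first be scored into an impact level and an urgency level; that scoring is organization-specific and is not shown here.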
4.5. INCIDENT ESCALATION
In the traditional IM process, there are two types of Incident Escalation procedures: 1)
Functional Escalation and 2) Hierarchical Escalation. Functional Escalation defines the routing
model between groups/teams. Example: Service Desk to Wintel Support; Wintel to DBA; DBA to
Network; Network to Third Party and so on. Hierarchical Escalation, on the other hand, provides a
mechanism to involve senior management or the leadership team in case of a Sev-1 incident or any
challenging situation such as ambiguity on Incident ownership, involving a third party on warranty
issues, customer dissatisfaction etc.
In a cloud environment, there are multiple parties involved in or associated with a Service, and
therefore any Service degradation (Incident) requires all the stakeholders to come together in an
online forum. For that purpose, Functional and Hierarchical Escalations should run hand in hand.
The only difference is that the business might not be interested in knowing the details of
Incidents, while it would be interested in knowing the impact on its work. So, the communication
has to be designed to send each stakeholder the details relevant to them.
In the suggested Incident Escalation model for cloud, an Incident should be assigned to a support
group and, at the same time, other groups that have any relationship with the Incident should also
be notified. Later, after the Incident resolution activity, one of the affected support groups may
be engaged to give a sign-off. Social networking features in the Service Management tool can play
a role in this kind of escalation. In case of vendor-related Incidents, the vendor must be
informed at the beginning of the Incident lifecycle. Once the Incident is assigned to the vendor,
a parallel communication must be sent to the Problem Manager, IT Manager, Vendor Manager and
Account Manager (vendor).
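The notification fan-out of the suggested escalation model can be sketched as below. The function name, parameters and reason labels are assumptions made for illustration; only the recipient roles come from the text.

```python
def escalation_notifications(assigned_group, related_groups,
                             vendor=None, assigned_to_vendor=False):
    """Return (recipient, reason) pairs for one Incident assignment.

    - the assigned group and every related group are notified together;
    - a vendor, if involved, is informed at the start of the lifecycle;
    - once the Incident is assigned to the vendor, the four roles named
      in the text receive a parallel communication.
    """
    notices = [(assigned_group, "assigned")]
    for group in related_groups:
        if group != assigned_group:
            notices.append((group, "related"))
    if vendor is not None:
        notices.append((vendor, "vendor informed"))
        if assigned_to_vendor:
            for role in ("Problem Manager", "IT Manager",
                         "Vendor Manager", "Account Manager (vendor)"):
                notices.append((role, "parallel communication"))
    return notices
```

In practice each reason would map to a different message template, so that business stakeholders see impact while support groups see technical detail.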
SLA BREACH NOTIFICATIONS
In case of an SLA breach warning, a communication/notification must be sent to the group manager,
IT manager, IT Director etc. In an SLA breach situation, stakeholders from the business and
finance must be involved in addition to the IT leadership team. Some vendors have service-based
(non-negotiable) SLAs, and in that case clear expectations must be set with the business. During
the Service Design phase, the business should get the option to choose components from the catalog
based on an SLA vs. cost analysis. Example:
Server Type              Baseline SLA (turn-around)    Hourly Downtime Cost (post the Baseline SLA)
HPC Windows (Private)    2 Hours                       $7000
HPC Unix (Private)       2 Hours                       $6000
HPC Windows (Public)     Best Efforts                  $1500
HPC Unix (Public)        Best Efforts                  $1100
Table 3
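To make the SLA vs. cost comparison concrete, the following sketch computes the downtime cost of an outage from the figures in Table 3. Two points are assumptions rather than statements from the table: the cost is taken to accrue linearly per hour once the baseline is exceeded, and "Best Efforts" is treated as a baseline of zero hours.

```python
# (baseline SLA in hours, hourly downtime cost in USD), transcribed from Table 3.
# None marks a "Best Efforts" SLA with no committed turn-around time.
CATALOG = {
    "HPC Windows (Private)": (2.0, 7000),
    "HPC Unix (Private)": (2.0, 6000),
    "HPC Windows (Public)": (None, 1500),
    "HPC Unix (Public)": (None, 1100),
}

def downtime_cost(server_type, outage_hours):
    """Cost of an outage: hours beyond the baseline SLA times the hourly cost."""
    baseline_hours, hourly_cost = CATALOG[server_type]
    if baseline_hours is None:
        # ASSUMPTION: with a best-efforts SLA, cost accrues from the first hour.
        baseline_hours = 0.0
    billable_hours = max(0.0, outage_hours - baseline_hours)
    return billable_hours * hourly_cost
```

Under these assumptions, a five-hour outage on a private HPC Windows server costs three billable hours at $7000, while the same outage on the public equivalent costs five hours at $1500.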
ROLE OF SERVICE DESK
In a traditional enterprise, the Service Desk is responsible for determining the Incident
category, performing an initial investigation based on the knowledge base or runbook, and finally
escalating the ticket to the appropriate support group. Considering the complexity and nature of
Incidents in a cloud environment, there is a chance that a traditional service desk function will
not be able to do the initial diagnosis and will end up routing the ticket to the wrong support
group. Hence it becomes important to upgrade the traditional service desk by marrying it to the
monitoring team or command center. Combining the two teams forms a function known as an
Integrated Command Center (ICC) or IT Operations Center (ITOC), which will have the technical
competency to perform initial investigation and escalation in a cloud environment.
We have to keep in mind that the majority of common Incidents related to availability,
accessibility, device failure etc. should be eliminated in a cloud environment because of the
high-performance compute design. Hence, it makes sense to combine the Service Desk and Command
Center and enhance productivity.
4.6. INVESTIGATION, RESOLUTION AND RECOVERY
In the traditional IM lifecycle, Incident Investigation and Incident Resolution are defined as
sequential activities. In a cloud environment, we should go a step further and combine them for
faster turnaround. This is a logical step because, in the previous section, I proposed merging the
Service Desk and monitoring teams for better initial investigation and diagnosis. Unwanted
Incident hopping (escalation to wrong groups) should therefore be eliminated, and resolution and
recovery should come right after the escalation.
Incident Resolution in cloud should be faster and better than in a traditional IT environment.
There must be a higher degree of proactive detection, fault tolerance, redundancy to avoid
downtime, auto-correction, and intelligent systems to analyze and detect Incidents proactively.
In a white paper on “Proactive Incident and Problem Management”, VMware defines three Cloud
Capability Levels: 1) Reactive, 2) Proactive and 3) Innovative, where Reactive is the lowest
maturity level for a cloud provider and Innovative is the highest. The Reactive model is the
natural approach, but it is not sustainable in a cloud environment for various reasons, including
virtualization, orchestration and the lack of clarity on assets/CIs/managed objects. So, it
becomes important to develop intelligent systems that analyze event monitoring data, historical
ticket data, maintenance tasks, business growth patterns, the IT needs of a business process and
other IT drivers, and to move from Reactive capability to Innovative capability.
Incidents in a cloud environment require highly skilled professionals, but at the same time a
cloud environment provides enough redundancy to avoid or reduce downtime. So initially there might
be some limitations in establishing an SOP (Standard Operating Procedure)/runbook based approach,
but in the longer run cloud provides enough opportunities to reduce Incidents and automate
resolution tasks. In a cloud environment, IT support staff should work towards ensuring that
repetitive Incidents do not recur.
Once the Incident is resolved, it can be owned by the support team itself or passed to another
group for validation/sign-off. For user-reported Incidents, a user sign-off must be obtained.
4.7. INCIDENT CLOSURE
Once the Incident is resolved, it enters the final activity of its lifecycle, Incident Closure.
Incident Closure is an important activity for ensuring that the required solution has been
provided and implemented.
In the Incident Escalation section, I mentioned Incident or Service Failure notifications to all
the stakeholders. Likewise, before closing an Incident, the system needs to ensure that all the
stakeholders have given their sign-off. This task can be automated by making it a time-bound
forced closure. In the case of public cloud, closure must be performed only after obtaining the
required confirmation from the Cloud Provider.
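The closure rule described above can be expressed as a small decision function. This is a sketch under assumptions: the three-day window is an arbitrary placeholder, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta

def can_close(resolved_at, signoffs, now,
              window=timedelta(days=3), provider_confirmed=True):
    """Decide whether a resolved Incident may be closed.

    `signoffs` maps each stakeholder to True once they have signed off.
    `window` is the time-bound forced-closure period (3 days is an assumption).
    For public cloud Incidents, pass provider_confirmed=False until the
    Cloud Provider has confirmed the resolution.
    """
    if not provider_confirmed:
        return False                    # never close without provider confirmation
    if all(signoffs.values()):
        return True                     # every stakeholder has signed off
    return now - resolved_at >= window  # otherwise wait for the forced-closure window
```

The provider confirmation is deliberately checked first, so the time-bound closure can never override it.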
Most Service Management tools provide Closure Categorization Codes (similar to Incident
Categorization), and it would be helpful in a cloud environment to use those codes properly.
If the solution provided by the support groups does not completely solve the issue, stakeholders
or the end user may choose to re-open the Incident. Any re-opened Incident would trigger
hierarchical escalation and involve senior management in the lifecycle for better governance.
5. KEY PERFORMANCE INDICATORS
Key Performance Indicators (KPIs) are also known as process performance measurement criteria. As
the name indicates, the purpose of KPIs is to evaluate process performance against process goals
and objectives. Some mature organizations have tightly coupled KPIs with business CSFs (Critical
Success Factors).
As stated in the ITIL v3 guidelines, “A KPI refers to a specific, agreed level of performance that
will be used to measure the effectiveness of an organization or process.”
A standard way to define KPIs is the GQM approach, where G is Goal, Q is Question and M is Metric.
The goal here is very clear: to ensure that Incidents are resolved at the earliest. The questions
we may ask are “What does it take to resolve incidents rapidly?”, “What can cause delay?” and
“What are the dependencies?”
When we start thinking along these lines, we come across multiple KPIs related to the Incident
Management process. Most of these KPIs are already used in the industry. In this section, we
explore the need to revise the existing KPIs for the Cloud Environment.
Let’s take a look at some of the KPIs:
- Percentage reduction in number of Incidents (month-on-month)
- Percentage reduction in weekly Incident backlog (weekly)
- Percentage increase in SLA compliance (daily/weekly)
- Percentage reduction in incorrectly assigned Incidents (weekly/monthly)
In case of Cloud, we need to consider the performance of the “vendor” or partner. Therefore,
additional KPIs are needed to ensure the required coverage.
Some examples of additional KPIs for Cloud Incident Management Process:
- Ratio of auto generated tickets and user reported tickets
- Percentage reduction in issues escalated to Cloud Service Provider
- Percentage reduction in incorrect escalations to Cloud Service Provider
- Percentage reduction in the Incident Diagnosis time
- Percentage reduction in incorrectly categorized incidents
- Percentage reduction in number of major Incidents
- Percentage reduction in average turn-around time from vendor
- Increase in proactive detection rate
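Most of the KPIs above reduce to two calculations, a percentage reduction between two periods and a ratio of two counts. A minimal sketch, with hypothetical function names and made-up figures in the usage note:

```python
def percentage_reduction(previous, current):
    """Percentage reduction from one period to the next.

    Positive means the figure went down; negative means it went up.
    """
    if previous <= 0:
        raise ValueError("previous period value must be positive")
    return (previous - current) * 100.0 / previous

def auto_to_user_ratio(auto_generated, user_reported):
    """Ratio of auto-generated tickets to user-reported tickets."""
    if user_reported == 0:
        raise ValueError("no user-reported tickets in the period")
    return auto_generated / user_reported
```

For instance, if escalations to the Cloud Service Provider fall from 200 to 150 in a month, `percentage_reduction(200, 150)` gives 25.0, and 300 auto-generated tickets against 100 user-reported tickets gives a ratio of 3.0.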
6. KEY POLICIES
Ticket Ownership Policy
Ticket ownership should always remain with the cloud consumer. Having said that, we must account
for certain situations that are controlled internally by the cloud vendor and in which the cloud
consumer has no role to play. For those instances, we can consider joint ownership and ensure that
the cloud consumer gets real-time updates on the issues.
Escalation Policy
Any escalation to the cloud vendor must be approved or supervised by the L3 support team or the
Incident Manager. The team must ensure that incorrect escalations to the cloud vendor are kept to
a minimum. For issues related to internal infrastructure or applications, the escalation
guidelines are the same as described in the ITIL book.
7. TECHNOLOGY CONSIDERATIONS
As mentioned earlier, technology is going to play a critical role in supporting and managing the
cloud environment, and therefore the ITSM processes must be integrated and orchestrated in such a
way that they enable a seamless flow of information between processes, tools and teams. There are
four key technology considerations that are critical for running the Incident Management process
in Cloud:
- Service Catalog
- Self Service
- Orchestration
- Analytics
Below is a reference high-level architecture of integrated ITSM processes to support future
technology:
Figure 3
REFERENCES
1. ITIL 2011 Guidelines (https://www.axelos.com/itil)
2. Wikipedia (http://en.wikipedia.org/wiki/Cloud_computing)
3. ServiceNow (http://www.servicenow.com)
4. The NIST Definition of Cloud Computing (NIST Special Publication 800-145)