Event Management Best Practices

Overview
BSM and Event Management
Incident Management and Problem Management
Principles of Event Management Best Practices
  Data Acquisition
  Normalization
  Enrichment
  Correlation and Root Cause Analysis
  Console Consolidation
  Automation
  Notification
  Escalation
  Reporting
Benefits
Customer Examples
BMC Software Solutions

2003 BMC Software

Event Management is part of Business Service Management
Service Impact Management
  Deals with the relationships between Services and IT (dependency and impact)
  Service model definition
  Service impact
  Service availability
  Automate and visualize events
  Requires sophisticated Event Management
Event Management
  Deals with receiving and manipulating the IT events
  Collect and process events
  Define and manipulate events
  Perform notifications
  Requires well-developed Infrastructure and Application Management

Incident Management and Problem Management
Event Management is comprised of
  Incident management
  Problem management
The primary objective of incident management is to restore service as soon as possible to minimize any negative effect on business services.
The objective of problem management is to take a proactive approach in defining preventative measures so that service disruptions do not occur.

Best Practices Principles
Consistent process
Event management tools that achieve
  Data Acquisition
  Normalization
  Enrichment
  Correlation and Root Cause Analysis
  Consolidation
  Automation
  Notification
  Escalation
  Reporting

Data Acquisition
Data acquisition encompasses all the methods by which event management information is collected. These methods can be
  Push
  Pull
  Publish/Subscribe

Normalization
Normalization is the process of homogenizing event data into a common event format.

Specific values are always located in a specific field and called a specific name
Regardless of the source of an event, a standard set of data is associated with all events
Reporting is made consistent and efficient
Event data has a common definition

Enrichment
Enrichment is the process of adding value to the original event data for the purpose of streamlining incident management and facilitating service management.

Provides details that supplement trouble tickets and repair actions
Extends event data for correlation, automation, notification, and reporting functions
Associates service data or other data, such as blackout periods, to an event that can be used in incident or problem management

Correlation and Root Cause Analysis
Correlation and root cause analysis are processes that determine the source of a problem, identify sympathetic events, and relate associated events.

Focuses repair action and speeds service restoration for incidents
Streamlines the event data presented to operators by suppressing event clutter
Targets problem management efforts

Console Consolidation
Consolidation is the process of delivering all events from across the enterprise to a single pane of glass.

Manage more with less
Reduce complexity
Leverage existing infrastructure management tools
Gain a bird's-eye view
Facilitate service management and problem management

Automation
Automation is the process of removing or reducing the need for human intervention in event management while promoting service resiliency.

Allows problems to be fixed at machine speed
Supports escalation of events when repair criteria are exceeded
Streamlines event management workload on staff
Enforces policies and processes

Notification
Notification is the process of presenting incident information to the right person in the right form at the right time, and verifying their receipt of the information.

Speeds information about an incident to the trouble ticket system or repair expert
Contacts a repair expert in a suitable manner
Frees operators from another rote task
Supports multiple forms of communication

Escalation
Escalation is the process of heightening the severity of an incident, or broadening awareness of it, if it is not being addressed in an appropriate and timely manner.

Prevents incidents from falling through the cracks
Ensures attention to the most critical problems
Elevates the status of an event if the problem persists

Reporting
Reporting is the process of disseminating information that reflects the measurement of service level agreements, historical usage, or the performance of service delivery components.

Summarizes event management details
Measures problem resolution effectiveness
Provides consistent service delivery data

Best Practices Benefits
Identifies affected services and prioritizes repair actions through enrichment and correlation
Pinpoints the exact problem condition through acquisition and correlation
Presents repair details through enrichment, such as
  Physical device location or specific application/database
  Responsible department/staff expert
  Repair action to be taken
Reduces incident management workload through automated repair action
Delivers consistent service reporting through acquisition and normalization

Best Practices Benefits
Speeds notification to the right person at the right time in the right form through notification and escalation
Reduces mean time to repair through escalation, notification, and automation
Reflects, through reporting, the problem areas that problem management can address

Customer Example #1

Consolidation, Notification, Correlation, and Root Cause Analysis at a National Bank

Customer Example #2

Normalization, Enrichment, and Automation at a Telco Vendor

Customer Example #3

Enrichment and Notification at a Large Hospital

BMC Software Event Management Solutions
Products
  BMC Event Manager
  PATROL Enterprise Manager
  PATROL KM for Event Management
    PATROL agent domain only
  AlarmPoint by Invoq Systems

Services
  Architectural Assessment
  Professional Services

Thank you!
Questions?

You can't get to BSM without EM and SIM.

Event Management is comprised of incident management and problem management.

Incident management is essentially reactive event management. It involves problem situations that are disrupting service delivery and require corrective action immediately! For example, when a customer contacts the help desk, they are typically calling because they are experiencing a service disruption at that moment.

Problem management is essentially proactive event management. It involves analyzing problem situations that have occurred over time and establishing management tools to monitor and respond to those types of situations before they become incidents.

For example, a batch job runs every night at 1am on a particular server and requires 30% of the disk space to be available. The disk itself is backed up daily at 2am, and usually one backup a day is sufficient. One day, an incident occurred at 1am during batch processing because the disk was more than 70% full at batch initialization, so the job could not complete without human intervention. By applying a problem management methodology, reporting tools were used to identify additional and related information about problems with that server. Through analysis, IT operations determined that the problem condition had occurred a couple of times in the past. The problem management solution was to define system management tools to monitor the server disk space and generate a notification (a trouble ticket) if the disk was more than 60% full at any time of the day, except between 1am and 2am. If the condition was detected, automation would be triggered to initiate an immediate backup of the server.
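The monitoring rule from this example can be sketched in a few lines. This is only an illustration: the function names and action names are hypothetical, and treating the 1am to 2am window as inclusive of 1am and exclusive of 2am is an assumption.

```python
from datetime import time

THRESHOLD_PERCENT = 60      # disk usage level from the example above
QUIET_START = time(1, 0)    # batch job starts at 1am
QUIET_END = time(2, 0)      # daily backup starts at 2am


def in_quiet_window(now: time) -> bool:
    """True from 1am (inclusive) to 2am (exclusive); boundaries are assumed."""
    return QUIET_START <= now < QUIET_END


def check_disk(used_percent: float, now: time) -> list[str]:
    """Return the actions to take for one disk-usage sample."""
    if in_quiet_window(now) or used_percent <= THRESHOLD_PERCENT:
        return []
    # Hypothetical action names; a real system would call its own
    # trouble-ticket and backup integrations here.
    return ["open_trouble_ticket", "start_backup"]


print(check_disk(65.0, time(14, 30)))  # ['open_trouble_ticket', 'start_backup']
print(check_disk(65.0, time(1, 15)))   # []
```

Keeping the threshold and the quiet window as named constants makes the agreed policy visible in one place.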

As a best practices doctrine, enterprise event management is a basic building block for achieving service delivery. Delivering IT services is the basic product output of the IT enterprise. Event management best practices encompass defining methodologies and tooling your environment correctly.

Best practices are not just about event management tools; they are also about the people and processes.

In order to put an effective event management solution in place across the enterprise, you need to ensure that all of the separate departments, with all of their different politics and ways of working, agree that event management practices should occur in a specific way. It's your event management tools that offer you the flexibility in determining how standardized these processes have to get.

No matter how complex the actual enterprise, best practices application of your event management tools is where low-value raw event information is transformed into valued guidance for repair actions and service continuity. The Event Management system can be considered the tool for adding value to event data.

Some goals of effective event management best practices include:
  A well-defined process
  A properly tooled event management system
  Meaningful event information that enables better directed and more focused IT decisions
  Understanding which parts of your enterprise are critical to your business services, and understanding the priority of each business service

Data acquisition facilitates downstream processes like correlation, root cause analysis, and reporting. With consolidated event data, reporting, analyzing, and relating problems are made easier because there is one place from which to manipulate the event data.

Your network team may use NNM; the Oracle, Unix, Windows, and Siebel teams may use PATROL; the mainframe group may use MAINVIEW; or these teams could be using non-BMC Software products. Perhaps you have multiple data center locations and one location uses HP Operations while another uses PATROL Enterprise Manager. An effective event management tool should not care where it's getting its data, but it should be a tool that can collect all valuable raw data. It should also allow that data to be collected only once. So if an infrastructure management tool, such as NNM monitoring the network, is already collecting SNMP traps on all network devices, your event management system should be able to get the data from NNM rather than collect the same data redundantly.

Event-driven sources and standard SNMP are examples of data that is pushed into an event management product
Polling technologies pull data into an event manager
The PATROL 7 architecture would be an example of a publish/subscribe event source
An API integration can deploy any of these methods, depending on the capabilities of the API
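The three acquisition methods can be contrasted with a small in-memory sketch. The `Broker` and `Collector` classes and the event fields are hypothetical and do not represent any product's API; they only show where control sits in each method.

```python
from collections import defaultdict
from typing import Callable

Event = dict
Handler = Callable[[Event], None]


class Broker:
    """Tiny in-memory publish/subscribe hub (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        for handler in self._subscribers[topic]:
            handler(event)


class Collector:
    """Gathers raw events regardless of how they arrive."""

    def __init__(self) -> None:
        self.events: list[Event] = []

    def receive(self, event: Event) -> None:
        # Push: the source calls this when something happens.
        self.events.append(event)

    def poll(self, read_source: Callable[[], list[Event]]) -> None:
        # Pull: the collector asks the source for its current events.
        self.events.extend(read_source())


collector = Collector()
collector.receive({"source": "trap", "msg": "link down"})           # push
collector.poll(lambda: [{"source": "poller", "msg": "cpu high"}])   # pull
broker = Broker()
broker.subscribe("alerts", collector.receive)                       # subscribe
broker.publish("alerts", {"source": "agent", "msg": "disk full"})   # publish
print([e["source"] for e in collector.events])  # ['trap', 'poller', 'agent']
```

Whatever the arrival path, every event ends up in the same list, which is the point of acquiring data once and in one place.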

Input: raw alert.
Output: raw alert.

When you have to cope with existing information sources (existing element management systems, applications) or information sources made available by external companies (hardware vendors, standard software packages, etc.), you end up with a lot of different data formats, and most of that data will not carry valuable service information.

To provide value, data should be normalized. This is simply a process of converting multiple event data formats into one normalized event format.

Normalization provides efficiencies to downstream event management processes. Activities such as reporting, correlation, and enrichment are made more efficient and straightforward by starting with data normalization. Insignificant parts of event data can be discarded during this process, keeping only the essential data for event management.

Normalization also scales down the number of different possibilities a certain event field can have. For example, different software packages use different status terminology to tell the user that there is a mismatch with a desired state (e.g., "Backup Failed", "Transaction Abended", "System Crashed", "File Full", etc.). With normalization, although those different values continue to exist, they can all be identified by a common normalized field called "status", for example.
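A minimal sketch of this idea follows. The source names and the raw field names (`result`, `state`, `condition`) are invented for illustration; the point is only that every source's state text ends up in one field called `status`, with the original value preserved.

```python
from dataclasses import dataclass


@dataclass
class NormalizedEvent:
    source: str
    host: str
    status: str  # common field name, whatever the source called it


# Which raw key each (hypothetical) source uses for its state text.
RAW_STATUS_KEY = {
    "backup_tool": "result",
    "tx_monitor": "state",
    "os_agent": "condition",
}


def normalize(source: str, raw: dict) -> NormalizedEvent:
    """Map one raw alert onto the common event format."""
    return NormalizedEvent(
        source=source,
        host=raw.get("host", "unknown"),
        status=raw[RAW_STATUS_KEY[source]],  # original value is kept as-is
    )


events = [
    normalize("backup_tool", {"host": "srv1", "result": "Backup Failed"}),
    normalize("tx_monitor", {"host": "srv2", "state": "Transaction Abended"}),
    normalize("os_agent", {"condition": "File Full"}),
]
for event in events:
    print(event.host, "|", event.status)
```

Fields that are not mapped are simply dropped here, which mirrors discarding insignificant parts of the raw data.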

Over time, normalization helps your IT staff have a consistent understanding of each field of data. This makes human interaction across the enterprise more efficient and effective, because the definition of event data can be discussed in a common manner.

Input: raw alert from data acquisition.
Output: normalized alert.

Devices, systems, applications, etc., are not going to issue event messages that say "Contact George at x69154 to fix my problem! And by the way, I'm used by the order entry system, so no orders can be processed and no money can be made by this company until I am fixed."

Enrichment allows pertinent specific information about your own business services, the event impact, and workflow processes to be added to a specific event.

Enrichment can give perspective to an event. Even with normalized event data, we still don't know to which part of the organization (or service) this data belongs, nor do we know the service impact of the event, and we may not know technical details like physical location or IP address. In order to add this value, we need to enrich the original data with contact information, service impact, device location, data needed by a trouble ticket system, or whatever data will be of value to the operator, the technician, business managers, and IT managers, or for historical reporting purposes.

Some examples include:
  A network specialist receives enriched event information from a trouble ticket that identifies the physical location of a faulty switch. The specialist knows exactly where to go to fix the problem.
  A Voice Responding Unit (VRU) fails, but service is not impacted because the workload is balanced across other VRUs. Enrichment can be used to identify that the managed object is surrounded by backup devices. During downstream correlation and root cause analysis processes, this type of data helps put the priority of an event into greater perspective.
  A server is unavailable due to scheduled maintenance. Enrichment can add scheduled maintenance data so that operators know not to take action on these events.

Enrichment can be used in many ways to add value to an event.
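As a rough sketch, enrichment can be thought of as a lookup that merges business data into the event. The host names, the asset table, and the maintenance set below are all made-up sample data, not taken from any real environment.

```python
# Hypothetical asset data keyed by host name.
ASSET_INFO = {
    "switch-07": {
        "location": "Building 2, rack 14",
        "service": "order entry",
        "contact": "network on-call",
    },
}

# Hosts currently in a scheduled maintenance (blackout) period.
MAINTENANCE = {"db-03"}


def enrich(event: dict) -> dict:
    """Return a copy of the event with service and asset data added."""
    enriched = dict(event)  # leave the normalized event untouched
    enriched.update(ASSET_INFO.get(event["host"], {}))
    enriched["in_maintenance"] = event["host"] in MAINTENANCE
    return enriched


print(enrich({"host": "switch-07", "status": "Port Down"}))
print(enrich({"host": "db-03", "status": "Unreachable"}))
```

An operator or a downstream rule can then ignore events where `in_maintenance` is true, or route by `contact` and `location`.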

Input: normalized alert.
Output: enriched alert.

Correlation and root cause analysis are interrelated. Correlation identifies whether events have relationships, and root cause analysis determines how events are or are not related. Essentially, correlation makes root cause analysis possible.

Root cause analysis is extremely helpful in enterprise event management because it pinpoints the real issue versus sympathetic events that have only occurred because of the real issue. By identifying the root cause of an incident, the mean time to repair can be dramatically reduced. The correct technician is more easily notified and repair actions are directed to a specific problem.

Correlation and root cause analysis are also critical to determining whether an event is a critical incident or can be handled with lower priority as a problem.

For example, server A is vital to a business service. It communicates through router 1 and uses router 2 as a backup router. If router 1 goes down, correlation and RCA can evaluate that event and determine that the alert, trouble ticket, or page can be given a lesser severity. Although it's important to maintain a backup router for server A, the service is not being impacted at that moment. This makes it possible to prioritize an issue that is impacting service over one that merely leaves a service a little more at risk, by weighing the severity of problems against the importance of a service to the business.
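The router example can be expressed as a small rule. The topology table, the host names, and the three severity labels are assumptions made for illustration; real correlation engines work from much richer models.

```python
# Hypothetical topology: each server and the routers it can reach through.
SERVICE_PATHS = {"server-a": ["router-1", "router-2"]}


def assess(server: str, down: set[str]) -> str:
    """Rate the impact on a server given the set of routers that are down."""
    paths = SERVICE_PATHS[server]
    failed = [router for router in paths if router in down]
    if not failed:
        return "ok"
    if len(failed) < len(paths):
        return "warning"   # a path remains: service at risk, not impacted
    return "critical"      # no path left: service is impacted


print(assess("server-a", {"router-1"}))              # warning
print(assess("server-a", {"router-1", "router-2"}))  # critical
```

The same router-down event therefore produces a different severity depending on what else is known about the paths to the service.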

Event consolidation is a best practice because it supports an efficient use of human resources. If you consolidate all event messages into a single pane of glass, you need fewer people to manage the enterprise operations. This frees individual departments from continuous basic operations efforts and allows them to focus on more technical repair activities.

With centralized, consolidated management, the workflow processes for IT management across all departments become standardized and simplified, at least to the extent of identifying and notifying the right person about a problem, triggering automation, or integrating with trouble ticket applications or phone notification systems.

For example, with a consolidated display of all of the different pieces of the enterprise used in delivering a specific service, you can view the status of the service more easily because all of the affecting events are located in one display.

When an outage or slowdown occurs, a local infrastructure management tool, an Enterprise Event Management System, or a human being can take action.

An event management tool that has the ability to launch automation removes human intervention from the event/repair/restore process and achieves machine-speed recovery. Common repair tasks can be triggered as automatic event-driven actions, such as initiating a router reboot. Some examples of automation could include
  Restarting a server
  A remote IPL
  Initializing a batch job
  Resetting a terminal ID

Many rote operator tasks can be automated to ease the amount of human interaction required in a fast-paced service-driven IT environment. At the other extreme, very specific and complex repair tasks can also be automated to ensure speed and accuracy in addressing that issue with consistency. With complex automation, you protect the IT environment from operators invoking inappropriate repair actions.

Automation also takes thought and planning because it can be a multi-faceted aspect of best practices. Some automation actions can result in new input for the event management system, such as performing a lookup for the name of something in an active database. Sometimes this new data may require another action to be taken, which then results in more automation being triggered. Examples include getting the IP address of a device and plugging it into an automation script, or getting the content of an application log file and parsing through it.
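The chaining described here, where one action produces new input that triggers another, can be sketched with a simple queue. The event types, the two actions, and the fixed address are hypothetical stand-ins; the step limit is an assumed safeguard against automation that keeps re-triggering itself.

```python
from collections import deque
from typing import Optional


def lookup_ip(event: dict) -> Optional[dict]:
    # Stand-in for a real lookup; returns a follow-up event with new data.
    return {"type": "ip_resolved", "host": event["host"], "ip": "192.0.2.10"}


def restart_service(event: dict) -> Optional[dict]:
    # Stand-in for a real repair script that would use event["ip"].
    return None  # nothing further to do


# Which action each (hypothetical) event type triggers.
ACTIONS = {"service_down": lookup_ip, "ip_resolved": restart_service}


def run(event: dict, max_steps: int = 10) -> list[str]:
    """Process an event and any follow-up events; return the actions run."""
    queue = deque([event])
    handled: list[str] = []
    while queue and len(handled) < max_steps:
        current = queue.popleft()
        action = ACTIONS.get(current["type"])
        if action is None:
            continue  # no automation defined for this event type
        handled.append(action.__name__)
        follow_up = action(current)
        if follow_up is not None:
            queue.append(follow_up)
    return handled


print(run({"type": "service_down", "host": "web-1"}))
# ['lookup_ip', 'restart_service']
```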

Input: enriched alerts relevant to the decision maker.
Output: executed action.

A critical task of an Event Management system is to deliver the right information to the right people at the right time.

Notification involves
  Identifying who to notify
  Identifying how to notify them
  Understanding the relevant information that a person needs
  Verifying their receipt of the information

Notification can take many forms, depending on the workflow processes deployed at an organization. Some departments may deploy pagers or handhelds into their incident management processes. Some departments may work exclusively from a visual event console. A lot of departments work solely from trouble tickets. Outside of the IT division, business managers may require notification about incidents that impact the services they rely on.

The purpose of notification activities is to determine who needs what information, deliver it to them on time and in a way that they can obtain it, and to be able to verify that the data is received.
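These four steps can be sketched as one function. The contact list, the names, and the `send` callback are hypothetical; `send` is assumed to return `True` only when the recipient acknowledges the message.

```python
from typing import Callable, Optional

# Hypothetical on-call list per team: (person, channel) in contact order.
CONTACTS = {"network": [("alice", "pager"), ("bob", "phone")]}

Send = Callable[[str, str, str], bool]


def notify(event: dict, send: Send, contacts: dict = CONTACTS) -> Optional[str]:
    """Try each contact in order; return who acknowledged, or None."""
    message = f"{event['host']}: {event['summary']}"       # what they need
    for person, channel in contacts.get(event["team"], []):  # who and how
        if send(person, channel, message):                 # verify receipt
            return person
    return None


def demo_send(person: str, channel: str, message: str) -> bool:
    # Stub transport: pretend only phone calls get acknowledged.
    return channel == "phone"


event = {"team": "network", "host": "switch-07", "summary": "port down"}
print(notify(event, demo_send))  # bob
```

A `None` result means nobody confirmed receipt, which is the natural hand-off point to escalation.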

Input: enriched alert.
Output: enriched alert sent to the relevant decision makers (Helpdesk, Operations, On-Call staff, etc.).

An effective event management system needs to have some way of changing the inherent information that currently exists in an event.

For example, if a technician is not responding to a page or automation does not recover the problem, the service continues to be impacted and you need to be able to add more significance to an incident.

Conversely, there could be cases where you need to downgrade the significance of incidents.
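A time-based escalation rule might look like the following sketch. The severity names, the event fields, and the 300-second timeout are illustrative assumptions, and only upgrading is shown; downgrading would be a similar rule in the other direction.

```python
SEVERITIES = ["minor", "major", "critical"]  # assumed scale, lowest first


def escalate(event: dict, now: float, timeout_s: float = 300.0) -> dict:
    """Raise severity one level if unacknowledged for timeout_s seconds."""
    if event["acknowledged"] or now - event["opened_at"] < timeout_s:
        return event
    level = SEVERITIES.index(event["severity"])
    bumped = SEVERITIES[min(level + 1, len(SEVERITIES) - 1)]
    return {**event, "severity": bumped}  # copy with the new severity


page = {"severity": "minor", "opened_at": 0.0, "acknowledged": False}
print(escalate(page, now=120.0)["severity"])  # minor
print(escalate(page, now=301.0)["severity"])  # major
```

Running the rule repeatedly on a schedule would keep raising an ignored event until it reaches the top of the scale.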

Escalation basically provides a second level of defense in protecting your services. Automation can be applied to escalation actions, such as escalating after 5 minutes of continuous service disruption.

The purpose of reporting is to combine information from different sources and transform it into a form that provides clear guidance to IT managers and business managers. Reporting is important in pinpointing problem management tasks and reviewing the details of service disruptions against service level agreements. Reporting helps IT managers plan for growth and change in people, processes, and technology.

Because the form in which information is stored often determines the type of reporting that can be achieved, normalization and event enrichment are paramount to providing data consistency and supporting clear and concise reporting capabilities.

The specific information to be reported and the format of those reports will vary from function to function and user to user throughout an enterprise. Business managers may require high-level service impact summaries on a monthly basis, while IT managers may require high-level summaries of mean time to respond, mean time to repair, historical events over time, service level impact data, or other data on a shift, daily, weekly, or monthly basis.
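As an illustration of the summaries mentioned here, the sketch below computes mean time to respond and mean time to repair from a few made-up incident records. Timestamps are in minutes, and measuring both intervals from the moment the incident was opened is an assumption.

```python
from statistics import mean

# Sample data only: minutes since the start of the shift.
incidents = [
    {"opened": 0, "responded": 5, "repaired": 35},
    {"opened": 100, "responded": 115, "repaired": 145},
]


def summarize(records: list[dict]) -> dict:
    """Build a simple summary; raises StatisticsError if records is empty."""
    return {
        "count": len(records),
        "mean_time_to_respond": mean(r["responded"] - r["opened"] for r in records),
        "mean_time_to_repair": mean(r["repaired"] - r["opened"] for r in records),
    }


print(summarize(incidents))
```

Because every record uses the same field names, the same function works for any shift, day, or month of data, which is where normalization pays off for reporting.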

Input: enriched alert for the relevant decision maker.
Output: presentation as needed by the decision maker.

Best practices should provide these types of benefits.

Notification, Correlation, and Root Cause Analysis
A bank in Peru had over 100 branch offices nationwide. They had a diverse environment, including Tandem, OS/390, AS/400, PATROL, Tivoli NetView, etc., and 144 consoles in their command center. The staffing needed to monitor all the consoles was too high. In phase 1 of their consolidation effort, they reduced their consoles from 144 to 25; phase 2 calls for getting to a single console.

With all of those consoles, centralized operators spent most of their time trying to pinpoint the exact root cause of a problem and contacting a branch technical expert to repair the problem. This was particularly true for their geographically dispersed ATMs. By incorporating correlation and root cause analysis processes into their consolidated event management system, they can now quickly identify the source of a problem. With automated notification to remote branch technical experts, their ATM service availability has improved dramatically. Optimizing the use of their computing and human resources through notification, correlation, and root cause analysis allows the centralized operators to spend a lot less time on reactive IT management and more time on proactive management.

Enrichment and Notification
A customer had a large number of infrastructure management solutions installed throughout their enterprise to monitor their Unix and Windows servers, their Oracle databases, their specific medical billing and customer management applications, and other pieces of the enterprise. They also had a trouble ticket system, and their own phone system had the ability to schedule pages and phone calls based on a time schedule for their 24x7 operations. Most of their infrastructure management people used the trouble ticket system to track their efforts, but this company had no solution that could put each event into the perspective of the service that was being impacted. They needed to consolidate all of their infrastructure alerts, normalize the data from all the various tools, enrich it with service and priority data, and integrate it with their existing trouble-ticketing and phone systems.

Their event management solution pulls all the enterprise event data into a single location and enriches the data with a service name and a value indicating the importance of that business service, as well as other asset management information. The system then goes one step further and calculates a weighting factor formula that is tied to service level agreements and business priorities. The result is a numeric priority for fixing a problem. When the event management solution notifies technicians via the phone system and opens trouble tickets for events, the information presented to the specialist identifies what is broken, where it's located, what the IP address is, etc., plus what service is being impacted, how important that service is to the business, and whether the repair priority of the problem is level 1, 2, or 3.

This type of best practice implementation required not just event management tooling, but also a simple agreement across the IT department on what the response time should be for level 1, 2, and 3 tickets (i.e., 15 minutes, 30 minutes, and 2 hours).
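The customer's actual weighting formula is not described, so the sketch below is purely an assumed stand-in: it multiplies a service importance value by an event severity value (both on an invented 1 to 5 scale) and maps the score to a repair level using arbitrary cut-offs. The response times are the ones quoted above, assumed to correspond to levels 1, 2, and 3 in the order given.

```python
# Agreed response times in minutes, assumed to map to levels 1, 2, 3 in order.
RESPONSE_MINUTES = {1: 15, 2: 30, 3: 120}


def repair_level(service_importance: int, severity: int) -> int:
    """Return repair level 1-3 from two 1-5 inputs (higher = more urgent).

    The product and the cut-offs are illustrative assumptions, not the
    customer's real formula.
    """
    score = service_importance * severity  # ranges from 1 to 25
    if score >= 15:
        return 1
    if score >= 6:
        return 2
    return 3


level = repair_level(service_importance=5, severity=4)
print(level, RESPONSE_MINUTES[level])  # 1 15
```

Whatever formula is chosen, the important part is that every department agrees on what each resulting level means for response time.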
