Modern IT Operations Management with Moogsoft AIOps

Real-World Use Cases For Algorithms And Machine Learning

Dominic Wellington | Moogsoft

Confidential and Proprietary Information for Moogsoft Inc.

What it’s like to work in IT

Different Tools

• ~200 Apps
• Key input to MIM
• Front line Support

Incident Management Workflow Today: Reactive

Customer

I have an issue!

• Major Incident Mgmt
• Root Cause Analysis

All-hands Bridge

• KPI Tracking
• Root Cause Analysis
• Multiple People
• Research, document issue further to prevent recurrence

ECC: Eyes on Glass

EMIM: Escalation/Restoration

NOC: Network-centric

~10-15 per day

SO: Process & Problem Mgmt

L1: Major Incident Mgmt

Avg. MTTB: 8min

60% of Issues

Severity & area of issue determined/guessed

Biz Infra L2

• No access to ECC data
• Diff view of universe
• Blamed 70-80% of time
• Routes, switches, ISP, VPN, Firewall, etc.

Proactive Bridge

Avg. MTTD: 15-20 mins

Avg. MTTR 107 mins

Monitoring

Reactive, rules-based event management

GSAM IBM: Run bridge

Internet

40% of Issues

Your Incident Workflow Today

TOOLS

2 Million Events

WORKFLOW

TEAMS

XMatters

NOTIFY

New Relic

Riverbed

Splunk

Nagios XI

Oracle

AWS

DBA, Storage, Net

L3 App Dev

L2 App Support

SWAT Team

Exec / App Owner

TROUBLESHOOT

Nagios, Dynatrace, Extrahop

L1 Operators

ServiceNow

Customer Reported 65%

Monitoring Reported 35%

TICKET

Duplicate 50%

CORRELATE

L1 Operators

ServiceNow

RESOLVE

No Action 75%

L1 Operators

Webex

BRIDGE

Mean-Time-To-Detect 60 mins

Mean-Time-To-Resolve 120 mins

ANALYZE

Current vs. Future State with Moogsoft AIOps

TROUBLESHOOT

DETECT SITUATION

AUTO-TICKET

AUTO-NOTIFY

ANALYZE NOTIFY RESOLVE TICKET CORRELATE BRIDGE

Current State: MTTD: 15 mins, MTTR: 104 mins

With Moogsoft: MTTD: secs, MTTR: < 60 mins

Moogsoft Business Value $$$

Min. reduction in downtime by 25%

Min. reduction in MTTD by 25%

Min. reduction in MTTR by 25%

Min. reduction in tickets by 25%

RESOLVE

LEARN

KNOWLEDGE CYCLE

TROUBLESHOOT

Algorithms

Humans

SITUATION ROOM
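
The MTTD and MTTR figures on this slide can be reproduced from incident timestamps. The sketch below is illustrative only: the incident records are hypothetical, and it assumes both metrics are measured from the moment the fault starts, which the slide does not specify. The helper names (`mttd`, `mttr`, `after_reduction`) are made up for this example.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (fault started, fault detected, service restored).
incidents = [
    (datetime(2017, 1, 7, 10, 0), datetime(2017, 1, 7, 10, 15), datetime(2017, 1, 7, 11, 45)),
    (datetime(2017, 1, 7, 14, 0), datetime(2017, 1, 7, 14, 5), datetime(2017, 1, 7, 15, 0)),
]


def minutes(delta):
    """Convert a timedelta to minutes."""
    return delta.total_seconds() / 60


def mttd(records):
    """Mean time from fault start to detection, in minutes (assumed definition)."""
    return mean(minutes(detected - started) for started, detected, _ in records)


def mttr(records):
    """Mean time from fault start to restoration, in minutes (assumed definition)."""
    return mean(minutes(restored - started) for started, _, restored in records)


def after_reduction(value, percent):
    """Value that remains after reducing it by the given percentage."""
    return value * (1 - percent / 100)


print(mttd(incidents), mttr(incidents))
print(after_reduction(104, 25))  # the slide's 104-minute MTTR cut by the minimum 25%
```

Under these assumptions, a 25% reduction of the 104-minute MTTR quoted for the current state gives 78 minutes, so the "< 60 mins" target on the slide implies a larger cut than the stated minimum.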

Moogsoft AIOps at RBC

Before Moogsoft

“Operators could take hours to realize that they were investigating the same tickets”

–Adam Frank, Director of Alarm & Event Management Systems, RBC

With Moogsoft

▪ 50% reduction in operational noise
▪ 35% reduction in Mean-Time-To-Detect
▪ 43% reduction in Mean-Time-To-Restore
▪ 4x RoI in first year

Moogsoft AIOps in Deutsche Telekom Group

Moogsoft AIOps detected early warnings 10 hours before Netcool

▪ Progression across multiple NOCs, domains and countries

▪ All stakeholders aware and push-notified

Machine learning reduced 360,000 raw events to thousands of Situations

▪ 99.7% noise reduction

▪ No service-affecting incidents missed

▪ CMDB not required

Moogsoft AIOps rapidly ingests modern event feeds

▪ SDN and NFV ready
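
The noise-reduction figure above is simple arithmetic on event counts. As a rough check, assuming "noise reduction" means the share of raw events that do not surface as Situations (the slide does not define it), the calculation can be sketched as follows. Both function names are hypothetical.

```python
def noise_reduction_percent(raw_events, situations):
    """Percentage of raw events that did not surface as Situations."""
    return 100 * (1 - situations / raw_events)


def situations_for(raw_events, reduction_percent):
    """Number of Situations left after the given percentage reduction."""
    return round(raw_events * (1 - reduction_percent / 100))


remaining = situations_for(360_000, 99.7)
print(remaining)
print(noise_reduction_percent(360_000, remaining))
```

With the slide's numbers, a 99.7% reduction of 360,000 raw events leaves roughly a thousand Situations.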

HCL – Benefits of AIOps with Moogsoft

“We needed to automate our ‘catch and dispatch’ process without the need for rules”

–Navin Sabharwal, Fellow & Chief Architect, HCL Technologies

Operational Efficiency

• 62% Fewer Tickets
• 33% Shorter MTTR

Agility & Flexibility

• Integrated with existing and future tools

Moogsoft AIOps Process Overview

Application Outage Occurs

Millions of Events and Alerts

Real-Time Situation Insight

Situation

Algorithmic Noise Reduction & Clustering

Algorithmic Knowledge

Ecosystem Integration

Collaborative Team-Based Workflow

Network, L1, DBA, L2, Storage, Sys Admins, Dev

Workflow Automation

Moogsoft AIOps Algorithms

Adapted ML for Real-Time IT Ops & DevOps

• Time: Detects patterns in timestamps
• Linguistic: Detects linguistic relationships in events
• TopologyOps: Detects patterns in network proximity
• Template: Blueprints past faults for future detection & remediation
• Neural Feedback: Automatically learns behavior from IT Ops users
• Knowledge: Knowledge reuse for past Situations
• ACE: Algorithmic Clustering Engine
• Teams: Intelligent team notifications
• Entropy: Alert ranking & noise reduction
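
The slide names entropy-based alert ranking without describing how it works. One generic way to illustrate the idea, and not a description of Moogsoft's actual algorithm, is to score each alert by how rare its words are across the stream: words that appear in almost every alert carry little information, while rare words carry more. The function name and sample alerts below are hypothetical.

```python
import math
from collections import Counter


def rank_by_information(alerts):
    """Return alerts sorted from most to least informative."""
    tokenized = [alert.lower().split() for alert in alerts]
    # Document frequency: the number of alerts in which each token appears.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    total = len(alerts)

    def score(tokens):
        # Mean self-information (-log2 p) of the alert's tokens.
        if not tokens:
            return 0.0
        return sum(-math.log2(doc_freq[t] / total) for t in tokens) / len(tokens)

    scored = [(score(tokens), alert) for tokens, alert in zip(tokenized, alerts)]
    return [alert for _, alert in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Hypothetical alert stream: three routine heartbeats and one unusual message.
sample_alerts = [
    "heartbeat ok host-a",
    "heartbeat ok host-b",
    "heartbeat ok host-c",
    "disk failure host-b",
]
ranked = rank_by_information(sample_alerts)
print(ranked[0])
```

In this toy stream the repetitive heartbeat messages score low and the disk failure ranks first, which is the kind of separation between noise and signal the slide alludes to.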

Algorithmic Clustering Engine (ACE)

• Graph Entropy
• Time Occurrence
• Whitelisting
• Blacklisting
• Informational Entropy
• Network Proximity
• Textual Similarity
• Soft Fuzzy Matching

Monitoring Tools: AppDynamics, Splunk, Solarwinds, Nagios, Oracle

Streaming Events, ACE, Situations

Lightweight ACE Definitions route data to appropriate algorithms

Firewall Incident

01/07/17 10:14:21 AM

CRM, Website and Order Services Impacted

Database Incident

01/07/17 11:19:37 AM

BI Service Impacted

Storage Incident

01/07/17 12:14:06 AM

Payment Service Impacted

Real-Time Algorithms Cluster Events
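
Clustering by time of occurrence, one of the techniques listed on this slide, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ACE itself: events are grouped whenever the gap to the previous event is at most five minutes, and both the threshold and the sample events are hypothetical.

```python
from datetime import datetime, timedelta


def cluster_by_time(events, max_gap=timedelta(minutes=5)):
    """Group (timestamp, message) events; a gap above max_gap starts a new cluster."""
    clusters = []
    for event in sorted(events, key=lambda e: e[0]):
        # Compare against the timestamp of the last event in the latest cluster.
        if clusters and event[0] - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append(event)
        else:
            clusters.append([event])
    return clusters


# Hypothetical events from several monitoring tools.
events = [
    (datetime(2017, 1, 7, 10, 14, 21), "firewall: link down"),
    (datetime(2017, 1, 7, 10, 15, 2), "crm: connection refused"),
    (datetime(2017, 1, 7, 10, 17, 40), "website: 503 errors"),
    (datetime(2017, 1, 7, 11, 19, 37), "database: replication lag"),
]
clusters = cluster_by_time(events)
print([len(cluster) for cluster in clusters])
```

Time alone groups unrelated events that happen to coincide, which is presumably why the slide pairs it with the other listed techniques such as network proximity and textual similarity.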

Clustering Techniques – Precision vs. Recall

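[Chart: clustering techniques plotted by Precision (Quality) against Recall (Quantity), each axis running from Low to High. Techniques shown: Rules, ACE, Time, Linguistic, Topology. Legend: *Low Effort, *High Effort.]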
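
Precision and recall for a clustering result can be measured in several ways. The sketch below uses pairwise counting, which is an assumption made for illustration and not necessarily the measure behind the chart: a pair of events counts as correct when it is placed in the same cluster both by the algorithm and by a human-labelled ground truth. The function names and event ids are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Set of unordered event-id pairs that share a cluster."""
    return {
        frozenset(pair)
        for cluster in clusters
        for pair in combinations(sorted(cluster), 2)
    }


def pairwise_precision_recall(predicted, truth):
    """Precision: share of predicted pairs that are right. Recall: share of true pairs found."""
    pred_pairs = co_clustered_pairs(predicted)
    true_pairs = co_clustered_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Hypothetical ground truth: events 1-3 belong to one incident, 4-5 to another.
truth = [{1, 2, 3}, {4, 5}]
cautious = [{1, 2}, {3}, {4, 5}]   # splits too much: precise but incomplete
greedy = [{1, 2, 3, 4, 5}]         # merges everything: complete but imprecise

print(pairwise_precision_recall(cautious, truth))
print(pairwise_precision_recall(greedy, truth))
```

The two toy results show the trade-off the chart describes: the cautious grouping has high precision and low recall, and the greedy grouping has the reverse.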

Moogsoft AIOps Architecture

• LAMs: Event Ingestion
• Sigalizers: Machine Learning
• Situation Room: UI & Collaboration
• MooBots: Workflow, Notifications & Remediation
• MOOG
• Knowledge
• Real-time Bus
• External Knowledge

Events, Alerts, Situations

• Event Feeds: SNMP, Netcool, BMC BEM, CA Spectrum, HP NNM/OM
• Log Events: Splunk, Log Files, syslog
• Change Events: Jira, Chef, Puppet
• Monitoring Events: Extrahop, Dynatrace, Nagios
• IT Service Desk: ServiceNow, Cherwell, BMC Remedy, JIRA Service Desk
• CMDB: BMC Atrium, HP/IBM/CA CMDB, AMDOCS, File, any database, etc.
• IRC/Chat/Chatbots: Slack, Skype, Google+, etc.
• Script and Process etc.: CLI, Java, JavaScript, C++, ObjC, SQL, PERL, etc.
• Notifications: PagerDuty, OpsGenie, xMatters
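
The architecture shows data moving from Events to Alerts to Situations. The first step can be pictured as deduplication: repeated events about the same condition collapse into one alert with a counter. The sketch below illustrates that idea only; the slide does not spell out how alerts are formed, and the `Alert` fields, the `events_to_alerts` function and the sample events are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    source: str
    host: str
    description: str
    count: int = 1


def events_to_alerts(events):
    """Collapse repeated (source, host, description) events into one alert each."""
    alerts = {}
    for source, host, description in events:
        key = (source, host, description)
        if key in alerts:
            alerts[key].count += 1  # same condition seen again
        else:
            alerts[key] = Alert(source, host, description)
    return list(alerts.values())


# Hypothetical raw events as (source, host, description) tuples.
raw_events = [
    ("monitoring", "db-01", "cpu above threshold"),
    ("monitoring", "db-01", "cpu above threshold"),
    ("log", "web-02", "connection timeout"),
    ("monitoring", "db-01", "cpu above threshold"),
]
alerts = events_to_alerts(raw_events)
print([(alert.host, alert.count) for alert in alerts])
```

Here four raw events become two alerts, and a later clustering stage would then group related alerts into Situations.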

Thank you for your attention