Cognitive IT Operations Analytics: Leveraging machine learning to reduce system outages
Dan Wiegand
IBM
November 2019
Session OK
The market is entering a new chapter in cloud and digital
Chapter 1
Consumer-driven innovation
Digital/AI experimentation
“User applications”
Public cloud
Chapter 2
Enterprise-driven innovation
Digital/AI embedded in the business at scale
“Mission critical” workloads
Hybrid cloud
IBM Z Operations Analytics – Cognitive IT Operations Analytics / GSE UK, November 2019 / © 2019 IBM Corporation
The cost of an outage
Regulatory fines can cost > $70M
33% of outages cost > $1M per hour
Single incidents can cost > $100M
14% of outages cost > $5M per hour
Reactive operations
Issues identified by internal or external users
55% find out from an executive or non-IT member at their company who alerts the IT department
38% find out from users posting on social networks
58% find out from users calling or emailing their organization’s help desk
Reactive Operations: The War Room
IT Ops: Oh no! I’m getting reports that some users are having issues with my application
CICS SME: CICS is looking good. Must be Db2….
Db2 SME: Uh-oh! That transaction is waiting on a lock impacting the response time.
Network: No unusual behavior with the network
DBA: No issues with my databases
MQ SME: MQ looks ok to me. I’ll check with the CICS zSME.
Linux platforms: All our Linux systems look fine
Cloud support: Our cloud services seem fine
Best practice for Enterprise Operations – Proactive Resolution
Application Performance Management: End-to-end transaction visibility in application-specific views
Operations Analytics: Ops metrics from across the enterprise in a single location
Deep dive tooling: Domain-specific tooling for the SMEs
1. Deep dive, powerful and familiar tooling for visibility at the control block level
2. Integration between monitoring, automation & scheduling
Proactive Operations
Issues identified by internal operations
IT Ops: That’s weird. I saw some application problems but they seem to have stopped. Hope it doesn’t happen again!
CICS SME: I’m starting to see some storage violations…. I need to fix what’s happening in the CICS application quickly…. I hope this isn’t affecting any of the LOB apps!
Best practice for Enterprise Operations – Hybrid Cloud Visibility
Application Performance Management: End-to-end transaction visibility in application-specific views
Operations Analytics: Ops metrics from across the enterprise in a single location
Deep dive tooling: Domain-specific tooling for the SMEs
1. Avoid war rooms with swift component isolation
2. Eliminate black boxes with full enterprise visibility
3. Appropriate data for SME hand over
Proactive Operations
Issues identified by internal operations through enterprise visibility
IT Ops: I can see a slowdown happening and not as many CICS transactions are being processed. I better call our CICS SME!
CICS SME: Thanks for the notification. I can see some storage violations…. I need to fix what’s happening in the CICS application quickly before affecting our customers using the LOB apps!
2 Sessions Tomorrow
• IBM And AppDynamics – Empowering Greater Agility
• IBM Z Operations Analytics and Splunk
Proactive operations through AI Ops
Issues identified through integration with Machine Learning
Best practice for Enterprise Operations – AI Ops
Application Performance Management: End-to-end transaction visibility in application-specific views
Operations Analytics: Ops metrics from across the enterprise in a single location
Deep dive tooling: Domain-specific tooling for the SMEs
1. Deep dive, powerful and familiar tooling for visibility at the control block level
2. Integration between monitoring, automation & scheduling
3. Machine Learning technology for forecasting of potential future problems
Proactive Operations with AI Ops
Issues identified by internal operations through AI Ops
IT Ops: Everything is running great and our customers are happy.
CICS SME: I was just alerted that we MAY run short on storage soon…. I need to fix what’s happening in the CICS application before this affects any of the LOB apps!
Autonomous Operations
IT Ops: Everything is running great and our customers are happy.
CICS SME: CICS is running great
Db2 SME: Db2 is running smoothly
Network: No unusual behavior with the network
DBA: No issues with my databases
MQ SME: MQ has no issues.
Linux platforms: All our Linux systems look fine
Cloud support: Our cloud services seem fine
I’ve learned what is normal and how to respond when it isn’t.
IBM Z Operations Analytics AI Operations Journey
Learn “Normal” from history
Use machine learning on historical normal system behavior to establish a baseline
Early Detection of Departure from “Normal”
Use machine learning to detect changes in system behavior, together with predictive analytics to determine likely outcomes, to alert operations for early intervention
Goal: Eliminate / reduce impact from service disruptions
Deliver Guidance
Use machine learning on tribal knowledge together with prescriptive analytics to provide guidance on actions to take
Goal: Provide insight and recommend remediation
Autonomous Operations
Preemptive action on problems through proven automation
Goal: Allow the system to manage its “normal” health
IBM Z Operations Analytics zAware at a Major Airline
Major airline implemented the zAware component in IZOA
In 12 months, identified 4 anomalies that would have led to significant impacts if left unresolved
Example: Anomaly reported for a spike in data set allocation error messages, ~800 in 10 minutes
Storage team quickly identified the issue and restored service before customers were impacted
Minor impact for end users; a much more significant outage was avoided due to rapid resolution
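The airline example describes a burst of roughly 800 error messages in 10 minutes. The slides do not say how zAware detects such bursts, so the following is only a simplified stand-in: a fixed-threshold count over a trailing time window. The function name `find_spikes` and the default threshold of 800 (borrowed from the observed spike, not from any product setting) are assumptions for illustration.

```python
from collections import deque
from datetime import datetime, timedelta


def find_spikes(timestamps, window=timedelta(minutes=10), threshold=800):
    """Return the timestamps at which the number of messages in the
    trailing window first reaches the threshold (hypothetical helper)."""
    recent = deque()   # timestamps still inside the trailing window
    spikes = []
    in_spike = False
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop messages that have fallen out of the trailing window.
        while ts - recent[0] >= window:
            recent.popleft()
        if len(recent) >= threshold:
            if not in_spike:      # report each burst only once
                spikes.append(ts)
                in_spike = True
        else:
            in_spike = False
    return spikes


# Illustrative data: one message a minute for an hour, then 800 in ~7 minutes.
start = datetime(2019, 11, 1, 0, 0)
quiet = [start + timedelta(minutes=i) for i in range(60)]
burst = [start + timedelta(hours=1, seconds=i * 0.5) for i in range(800)]
spikes = find_spikes(quiet + burst)
```

A learned baseline would replace the fixed threshold in practice; the fixed count is used here only to keep the sketch short.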
Statement of Direction: September 2018
Enhanced Cognitive Operations Through Embedding IBM Watson Machine Learning for z/OS into IBM Z Operations Analytics
… IBM intends to continue enhancing the Problem Insight dashboard available in Z Operations Analytics to help organizations ensure their IT operations meet their business goals ...
… New insights are expected to advance the existing IBM System z Advanced Workload Analysis Reporter (IBM zAware) anomaly detection by leveraging new IBM Watson Machine Learning for z/OS models and problem signatures based on IBM data science and expertise to forecast when system behavior may lead to broader user impacts and system outages.
Users can expect to be alerted to system behavior changes based on historical data trend analysis across multiple subsystems.
ibm.biz/IZOAAnnounce
Now Available - IZOA 4.1
Announced September 3, 2019
IZOA Machine Learning Data and Analytic Flow: Training, Scoring, and Anomaly Detection
1. Collect Data (diagram labels: SMF, IZOA ML Db2 DB, CDP batch load for training, CDP streaming for scoring)
2. Training: Normal metrics, data that represents normal system behavior, are used by Watson Machine Learning for z/OS to create a model of expected behavior (the trained model).
2. Scoring: Watson Machine Learning for z/OS compares the trained model with live metrics. Scoring runs unattended to generate live anomaly scores.
3. Visualize Anomaly Scores: Visualize anomaly scores for each KPI in the subsystem to help diagnose potential issues.
Db2 SME: I can see that the Db2 active connections are showing as anomalous and have been slowly trending upward. We just made some changes that could be an issue.
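The slides do not describe the model that Watson Machine Learning for z/OS builds. As a rough sketch only, the train-then-score flow can be illustrated with a much simpler stand-in: a per-hour mean and standard deviation learned from normal history, and a live value scored by how many standard deviations it sits from that baseline. The names `train_baseline` and `anomaly_score` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def train_baseline(history):
    """Training step: history is an iterable of (hour_of_day, value) pairs
    taken from normal operation. Returns {hour: (mean, stdev)}."""
    by_hour = {}
    for hour, value in history:
        by_hour.setdefault(hour, []).append(value)
    return {h: (mean(v), pstdev(v)) for h, v in by_hour.items()}


def anomaly_score(baseline, hour, value, min_stdev=1e-9):
    """Scoring step: distance of a live value from the baseline mean,
    in standard deviations. min_stdev avoids division by zero."""
    mu, sigma = baseline[hour]
    return abs(value - mu) / max(sigma, min_stdev)


# Illustrative values for one KPI at 02:00 on five normal days.
normal_history = [(2, 10), (2, 12), (2, 11), (2, 9), (2, 13)]
baseline = train_baseline(normal_history)

typical = anomaly_score(baseline, 2, 11)   # value equal to the mean
unusual = anomaly_score(baseline, 2, 25)   # far above anything seen
```

Keying the baseline by hour of day is one way to let a value that is normal at midday still score as anomalous at 2 AM; a real model would likely account for more than the hour.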
IBM Z Operations Analytics
Visualize Subsystem KPIs Scoring
• Real time scoring shows an anomaly has been detected …
• The impacted metrics do not match a known problem, but indicate a broader impact may occur
• Z Operations can look at the specific anomalous metrics and contact the appropriate Z SME for investigation
Dynamic Problem Insights
Subsystem Scorecards
DDF_TCB_TIME KPI – Normal Day
DDF_TCB_TIME KPI – Anomalous From 12-4 AM
IBM Z / DOC ID / July 17, 2017 / © 2017 IBM Corporation
[Screenshot: insights listed for systems SYSD, SYSDF and SYSF]
DFHSM0102: A storage violation was detected by the module named in the message.
DSNT375I: A plan has been denied an IRLM lock because of a detected deadlock.
DSNL030I: A requesting conversation was terminated because of DDF processing…
MAXDBAT: Db2 external database connections are trending upward…
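The "trending upward" insight suggests a forecast of when a metric will reach a limit. The slides do not say how IZOA forecasts this, so the sketch below uses an ordinary least-squares line as a stand-in. Treating a configured connection limit as the target is an assumption for illustration, as are the function names and sample values.

```python
def fit_line(ys):
    """Least-squares slope and intercept for evenly spaced samples
    ys[0], ys[1], ... taken at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def intervals_until_limit(ys, limit):
    """Estimated number of sampling intervals after the last sample until
    the fitted line reaches limit; None if the trend is flat or falling."""
    slope, intercept = fit_line(ys)
    if slope <= 0:
        return None
    last_x = len(ys) - 1
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - last_x)


# Illustrative connection counts, one sample per interval.
connections = [100, 110, 120, 130, 140]
remaining = intervals_until_limit(connections, limit=200)
```

A straight line is the simplest possible trend model; it is shown only to make the idea of "alert before the limit is reached" concrete.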
Subsystem Scorecards
Problem Insights
[Screenshot: entries for systems SYSD and SYSA, labeled CICS020AT, CICSA17, CICST4 and CICSIL3]
Statement of Direction: September 2019
Problem Signatures and Additional Subsystem Support
… IBM intends to expand IBM Z Operations Analytics usage of machine learning technologies to help
organizations proactively ensure their IT operations meet their business goals.
Enhancements may include additional single and multiple variable insights across an additional set of
subsystems such as IMS, MQ, and WebSphere. In addition, problem signatures will be created to identify
anomalies that are forecasted and focus on specific system issues that could lead to broader user impacts
and system outages. Users can expect to be alerted to system behavior changes based on historical data
trend analysis and problem signatures across multiple subsystems with guidance on how to remediate
potential system issues.
ibm.biz/IZOAAnnounce
IZOA Machine Learning Statement of Direction: Availability of Problem Signatures
1. Collect Data (diagram labels: SMF, IZOA ML Db2 DB, CDP streaming for scoring)
2. Scoring: Watson Machine Learning for z/OS compares the trained model with live metrics. Scoring runs unattended to generate live anomaly scores.
3. Visualize Anomaly Scores: Visualize anomaly scores for each KPI in the subsystem to help diagnose potential issues.
4. Problem Signature Analysis: Anomalies are matched to known problems, generating alerts with details and recommended actions when they are forecast to become a problem.
IBM Z Operations Analytics
Detecting a Known Problem
• Real time scoring shows an anomaly(s) has been detected …
• The impacted metric(s) match a known problem signature, a specific condition that may lead to broader system impacts
• Alert generated to your event management system
• The IZOA Problem Insight details the specific problem signature, and recommends actions to take to remediate
Anomaly(s) Matching a Known Problem Signature
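The matching step can be pictured as a lookup of the currently anomalous metrics against a catalog of known signatures. The slides do not define how a problem signature is represented, so this is a minimal sketch under the assumption that a signature is a set of KPIs that must all be anomalous at once. The class, the signature names, the recommendation text and every KPI name other than DDF_TCB_TIME are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemSignature:
    name: str
    required_kpis: frozenset   # KPIs that must all be anomalous
    recommendation: str        # guidance shown with the alert


def match_signatures(anomalous_kpis, signatures):
    """Return the signatures whose required KPIs are all anomalous.
    Signatures with no required KPIs never match."""
    anomalous = set(anomalous_kpis)
    return [
        s for s in signatures
        if s.required_kpis and s.required_kpis <= anomalous
    ]


# Hypothetical catalog for illustration only.
catalog = [
    ProblemSignature(
        "ddf_connection_pressure",
        frozenset({"DDF_TCB_TIME", "DB2_ACTIVE_CONNECTIONS"}),
        "Review recent changes to remote connection handling.",
    ),
    ProblemSignature(
        "cics_storage_shortage",
        frozenset({"CICS_STORAGE_USED"}),
        "Check the CICS application for storage growth.",
    ),
]

known = match_signatures({"DDF_TCB_TIME", "DB2_ACTIVE_CONNECTIONS"}, catalog)
unknown = match_signatures({"DDF_TCB_TIME"}, catalog)
```

An empty result corresponds to the earlier case in the deck where the anomaly matches no known problem and operations hands the anomalous metrics to the appropriate SME.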
IBM Z Operations Analytics
Leverage machine learning to detect operational anomalies that could lead to broader business impacts.
Gain hybrid multicloud visibility, breaking down operational silos by integrating IBM Z data with data from the rest of your enterprise.
Access to your operational data with the included IBM Z Common Data Provider.
More Information: IBM Z Operations Analytics
IBM Z Common Data Provider product page: ibm.biz/CDPzInfo
IBM Z Operations Analytics product page: ibm.biz/IZOAInfo
IBM Z Operations Analytics Announcement (with SOD): ibm.biz/IZOAAnnounce
Z Operations Analytics Community (product announcements, info, forums): ibm.biz/IOAzCommunity
Rabobank Splunk .conf 2018 Presentation: ibm.biz/IBMZVisibilityConf2018
IBM Z ITSM Newsletter: ibm.biz/zITSMNewsletterSubscribe
Please submit your session feedback!
• Do it online at http://conferences.gse.org.uk/2019/feedback/ok
• This session is OK
Only IBM Z Operations Insight Suite provides the Key Capabilities Needed to Maximize Use of Z Operational Data
Visualize on the platform of your choice: IT Operational Analytics platforms
IBM Z Operations Insight Suite
Data Streaming: Data where and when you need it
Problem Identification: Rapid operational root cause analysis
Anomaly Prediction: Prevent outages with Machine Learning
Performance Analysis: Gain insight for critical decisions
Capacity Forecasting: Predictive resource usage & optimization
Cost Management: Optimize Tailored Fit Pricing strategies & enable chargeback
IBM Z Operations Analytics – Access and Insight / GSE UK, November 2019 / © 2019 IBM Corporation