Anatomy of an IT Outage (and ways to prevent them)
IT on Wall Street Seminar, Sep 22, 2016
Doron Pinhas, CTO
Or it can look like this:
“System outage grounds
2300 Delta flights”
» Always unexpected
» Usually takes too long to resume service
» Way too often…
• Root cause remains unknown
• Same problems reoccur
Anatomy of an IT outage
» Technology evolution and environment complexity don’t make it any easier
*Based on Continuity Software annual surveys
Are we getting better?
When was your last downtime event?
Can this be the real reason?
A closer look into recent events
Where               | Impact                                       | Reason given
Delta Air Lines     | Reported damages reach $150M (est.)          | Power switch failed
Worldpay            | A million Etsy transactions affected (est.)  | Server update
Southwest Airlines  | 2300 flights canceled                        | Network router failed
Deutsche Bank       | Unable to report swap data for five days     | ???
“The issue arose from a hardware failure in the main database of the equities’ trading system”
Why does it keep happening?
Hint: Not [necessarily] for lack of trying…
Possible reasons
Design issues
Implementation issues
Testing issues
Measurement (of Quality and Risk)
• Setting up a resilient infrastructure is costly and complex
– Cloud orchestration, HA (virtual HA/FT, clustering, App LB), redundant / active-active storage, multi-pathing, teaming, Network LB, replication, Geo-HA (SRM, Stretch/Metro compute & storage), …
• The result: multiple technologies, vendors and teams
The resilient datacenter – blueprint
[Figure: Site 1 through Site N, active-active / active-passive]
The challenge? Simple math…
[Figure: Site 1 through Site N, active-active / active-passive. Labels: OS configuration, application configuration, HA configuration, SAN configuration, LB session configuration, VMware HA, VMware SRM, snapshot configuration, mirror / replica configuration, Puppet manifest]
• Some changes slip through the cracks…
• IT stability & quality cannot be fully tested following every change
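One way to catch changes that slip through the cracks is to compare the configuration recorded for each site, layer by layer. The sketch below is a minimal, hypothetical illustration: the layer names loosely echo the figure labels, while the settings, the sample values, and the `find_drift` helper are assumptions made up for this example and do not describe any particular product.

```python
from typing import Dict, List, Tuple

# Hypothetical per-site configuration: layer -> setting -> value.
Config = Dict[str, Dict[str, str]]


def find_drift(site_a: Config, site_b: Config) -> List[Tuple[str, str, str, str]]:
    """Return (layer, setting, value_a, value_b) for every mismatch."""
    drift = []
    for layer in sorted(set(site_a) | set(site_b)):
        settings_a = site_a.get(layer, {})
        settings_b = site_b.get(layer, {})
        for key in sorted(set(settings_a) | set(settings_b)):
            a = settings_a.get(key, "<missing>")
            b = settings_b.get(key, "<missing>")
            if a != b:
                drift.append((layer, key, a, b))
    return drift


# Made-up sample data for two sites that are supposed to be identical.
site_1: Config = {
    "os": {"kernel": "3.10.0-327", "ntp_server": "10.0.0.1"},
    "san": {"multipath_paths": "4"},
}
site_n: Config = {
    "os": {"kernel": "3.10.0-327", "ntp_server": "10.0.0.1"},
    "san": {"multipath_paths": "2"},
    "ha": {"failover_timeout": "30"},
}

for layer, key, a, b in find_drift(site_1, site_n):
    print(f"{layer}.{key}: site 1 = {a}, site N = {b}")
```

Run daily, a comparison like this flags a mismatch soon after it is introduced rather than at the next audit.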
[Chart: risk (y-axis) over time (x-axis), with "New build / Test / Audit" markers]
Every day that goes by…
If disaster strikes tomorrow…
How confident are you that your IT will recover smoothly?
How to prevent your next outage?
Lessons learned from the best run shops
1. Design right
• Manage knowledge
• Rely on community (vendors, other users)
2. Implement right
• Test quality immediately
3. Make sure it stays that way (see next slides!)
Put quality control at the center
Transforms IT operations from reactive to proactive
Making sure your environment stays ready
Transform IT operations from this:
… to that
Must be automated!
[Chart: risk (y-axis) over time (x-axis), with "New build / Test / Audit" and "Daily validation" markers]
About quality & risk automation
» Stamina & focus are more important than pace
• Each small addition goes a long way
• Start with your:
– Existing manual checklists
– Most recurring issues
– Any newly discovered (& significant) risk
» Ways to automate & “shortcuts”
• Use vendor scripts & built-in validation tools
• Create your own scripts
• Use automation / configuration management tools
Fast ROI – quickly frees time to fuel your journey
Limitation: cross-domain issues will not be caught!
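A manual checklist can often be turned into automated checks one item at a time. The following is a minimal sketch, assuming each checklist item can be expressed as a function that returns True on success. The check names, thresholds, and sample host data are hypothetical; they stand in for whatever vendor scripts or tools actually collect the facts.

```python
from typing import Callable, Dict, List, NamedTuple


class CheckResult(NamedTuple):
    name: str
    passed: bool


# Hypothetical facts about one host, as a collection script might gather them.
Host = Dict[str, int]


def has_redundant_paths(host: Host) -> bool:
    # Assumed rule: at least two SAN paths.
    return host.get("san_paths", 0) >= 2


def has_teamed_nics(host: Host) -> bool:
    # Assumed rule: at least two NICs in the team.
    return host.get("teamed_nics", 0) >= 2


def replica_is_fresh(host: Host) -> bool:
    # Assumed rule: replica no older than 60 minutes; unknown age fails.
    return host.get("replica_age_minutes", 10**9) <= 60


# The former manual checklist, one entry per item.
CHECKLIST: Dict[str, Callable[[Host], bool]] = {
    "redundant SAN paths": has_redundant_paths,
    "teamed NICs": has_teamed_nics,
    "replica fresh": replica_is_fresh,
}


def run_checklist(host: Host) -> List[CheckResult]:
    """Run every check against one host and collect the results."""
    return [CheckResult(name, check(host)) for name, check in CHECKLIST.items()]


sample_host: Host = {"san_paths": 1, "teamed_nics": 2, "replica_age_minutes": 15}
results = run_checklist(sample_host)
for r in results:
    print(f"{'PASS' if r.passed else 'FAIL'}  {r.name}")
```

Adding a newly discovered risk means adding one function and one dictionary entry. Because each check looks at a single host in isolation, a script like this shares the limitation noted above: it does not catch cross-domain issues.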
Tracking & enumeration
» Record results over time
• Create score-cards for servers / objects
• Will allow comparing status before and after a change, and examining trends
» Mining the data enables numerous benefits
• Understand what works and what does not
• Benchmark your vendors
• Make it part of your decision-making process
• …
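Recording results over time can be as simple as storing one score card per object per run and comparing runs. The sketch below assumes a score is the percentage of checks that passed; the `ScoreCard` type, the `regressions` helper, the server name, and the sample results are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass
class ScoreCard:
    target: str                # server or other object being scored
    run_date: date
    results: Dict[str, bool]   # check name -> passed?

    @property
    def score(self) -> float:
        """Percentage of checks that passed (0.0 if there are no checks)."""
        if not self.results:
            return 0.0
        return 100.0 * sum(self.results.values()) / len(self.results)


def regressions(before: ScoreCard, after: ScoreCard) -> List[str]:
    """Checks that passed before a change but fail (or are missing) after it."""
    return sorted(
        name
        for name, ok in before.results.items()
        if ok and not after.results.get(name, False)
    )


# Made-up score cards taken before and after a change.
before = ScoreCard(
    "db-server-01",
    date(2016, 9, 1),
    {"redundant SAN paths": True, "teamed NICs": True,
     "replica fresh": False, "cluster quorum": True},
)
after = ScoreCard(
    "db-server-01",
    date(2016, 9, 2),
    {"redundant SAN paths": False, "teamed NICs": True,
     "replica fresh": False, "cluster quorum": True},
)

print(f"{before.target}: {before.score:.0f}% -> {after.score:.0f}%")
print("Regressions:", regressions(before, after))
```

Keeping the per-check results, and not only the overall score, is what makes it possible to say which check regressed after a change.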
Case study – large bank
» Automating checks in design labs, pre-prod and prod
» Store results in a repository correlating:
• What has changed
• Score cards
• Open issues & trending
• Business dependencies
» Customized feeds for:
• IT teams
• Business owners
Case study – results
» 90% reduction in downtime
» 70% reduction in firefighting costs
» Dramatic improvement in predictability & confidence
• Tests and actual workload shifts work
• When gaps do exist, it’s easy to understand what to do
» Better collaboration
• From finger-pointing to constant improvement
» Significant return on investment
• New use-cases uncovered regularly
The right program – key to constant improvement
» Change control process:
• Better decision making through quality & risk measurement
• Shorter cycles
» Stay ahead of trouble
• Automate
• Handle violations immediately
» Cross-team collaboration & visibility
• Single pane of glass for IT configuration quality, health & risk
About us
Helping many of the world’s largest
enterprises prevent outages and data
loss in their critical IT infrastructure.
Our technology & services
Early detection of availability risks and single-points-of-failure
Actionable alerts to relevant teams
Automated, cross-layer configuration validation
Easy measurement and visualization of resiliency metrics
AvailabilityGuard Services
Resilience health checks, best-practice validation
Managed Service Availability Assurance
To learn more
» Come talk to us....
» One-time health check
Thank you! (Questions?)
app.continuitysoftware.com