Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
OSDI ’20
Sebastien Levy†, Randolph Yao†, Youjiang Wu†, Yingnong Dang†, Peng Huang^, Zheng Mu†, Pu Zhao*, Tarun Ramani†, Naga Govindaraju†, Xukun Li†, Qingwei Lin*, Gil Lapid Shafriri†, Murali Chintalapati†
† Microsoft Azure, ^ Johns Hopkins University, * Microsoft Research
Azure global infrastructure
Azure regions
Heterogeneity: all nodes are different
• Different characteristics: # cores, total memory, …
• Different hardware types / vendors
• Different workload patterns
• Different health history
Measuring Customer Experience
• We need to monitor VM availability: down time / up time
• But each VM interruption causes significant impact to the customer:
  • Disrupts the user experience (e.g., gaming)
  • Applications take time to recover
  • Repeated reboots cause customer frustration
  • Two short interruptions are more impactful than one longer one
Impact of Node Failure
[Diagram: a node hosting multiple VMs on shared disks, CPU, and memory suffers a FW/HW failure.]
• A node-level failure ➔ bad impact for every VM it contains
• We need to predict failures and take the appropriate mitigation actions
Traditional Operation Workflow
[Diagram: 1 Detection of a node failure → 2 Diagnosis → 3 Migration of the VMs to a healthy node → offline & repair node.]
All VMs reboot; long VM downtime.
Attempt 1: Mitigation Before Diagnostics
[Diagram: 1 Detection of a node failure → 2 Migration of the VMs to a healthy node → 3 Diagnosis → offline & repair node.]
All VMs still reboot, but VM downtime is short. Mitigating before diagnosing can be better in some cases.
Attempt 2: Prediction of Node Failure
[Diagram: 1 Node predicted to fail → wait → 2 Diagnosis → 3 Migration of the VMs to a healthy node → offline & repair node.]
The prediction is also used to speed up mitigation of OS crashes. All VMs still reboot, but VM downtime is even shorter. However, the mitigation is suboptimal when the prediction is wrong.
Attempt 3: Prediction + Static Mitigation
[Diagram: 1 Node predicted to fail → 2 Block allocation (capacity impact) and live-migrate eligible VMs (VM pause impact) → 3 Diagnosis → offline & repair node.]
Some VMs do not reboot at all; the other VMs reboot with shorter downtime. But false positives now have increased impact. Can we do better?
Mitigation Tradeoff
❖ Can a single mitigation be effective for all kinds of scenarios?
❖ How do we confirm that a mitigation is effective?
❖ Even if a mitigation is effective, can we do better? Can the best choice change over time?
Mitigation Tradeoff: No One-Size-Fits-All
How to mitigate predicted node failures, a reasonable proposal:
Block allocation (capacity impact) → live-migrate eligible VMs (VM pause impact) → wait for 7 days → force migration of the remaining VMs (VM interruption impact) → diagnosis + repair
• If failure is not imminent, do we need to completely block allocation?
• How long should we wait for customers to intentionally move their VMs?
• If the node has still not failed after 7 days, could it now be healthy?
How To Estimate Mitigation Efficacy?
• Heterogeneity: multiple factors impact mitigation efficacy
  • Live-migration success rate depends on available capacity, hardware health, and the workload on the node
  • Allocation relies on complex optimization logic and on stochastic customer demand
  • Prediction false positives are unavoidable and can depend on unobservable signals
Key Insights
“In a heterogeneous and ever-changing cloud system, the effectiveness of a mitigation action is often probabilistic.”
“To select the (near-)optimal mitigation, each possible action needs to be compared at-scale with production workload.”
Solution
Continuous probabilistic online minimization of customer impact:
• Different mitigation actions depending on the type of predicted failure
• Online testing of the different options
• Adapts to the ever-changing behavior of the cloud
• Accounts for the node heterogeneity of Azure's fleet
• Focuses on impactful failures
Narya’s Overview
[Diagram: 1 Prediction (expert rules + ML failure prediction) → 2 Decision (choose action) → 3 Mitigation → 4 Impact Assessment (customer impact, cost labels), with a feedback loop; diagnostics and offline monitoring also feed the pipeline.]
Main Challenges
• Define: define customer impact as a generic metric
• Test: safely test actions in production
• Balance: balance exploring new actions and exploiting the best so far
• Adapt: adapt our decisions to system changes
• Fast: fast and scalable mitigation decisions
Measuring Interruptions
• Interruption types:
  • VM reboot, I/O pause, temporary VM freeze / blackout
• Service availability (time-duration based)
  • Short VM downtime DOES NOT imply short customer-service downtime
  • After each VM downtime, the customer service must rehydrate its state
• Interruption count (event-count based)
  • Each VM interruption impacts the customer service once
  • More realistically reflects the customer pain
Defining Customer Impact
• VM interruptions are the main negative impact to customer
• We define the Annual Interruption Rate (AIR):

  AIR = (# interruptions in T) / (total up time in T) × 365 days × 100

• When scoping it to a contribution per node and removing the constants, we get:

  Cost = # VM interruptions per node
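As a minimal sketch (function and variable names are illustrative, not from the paper), the AIR formula translates directly to code:

```python
def annual_interruption_rate(num_interruptions: int, up_time_days: float) -> float:
    """AIR: interruptions per year of VM up time, scaled by 100."""
    return num_interruptions / up_time_days * 365 * 100

# Example: 2 interruptions over 100 VM-days of up time -> AIR ≈ 730
air = annual_interruption_rate(2, 100.0)
```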
Narya’s Prediction
Expert Rules ML Prediction
Motivation To Use ML Prediction
• Expert rule: a threshold on a single predictive signal (e.g., low disk spare space)
• Limitations of expert rules:
  • High variance of the prediction horizon - the actual failure time also depends on other factors
  • Difficult precision-recall tradeoff - strict rules result in low recall, loose rules yield low precision
• What if we look at all the signals before the actual failure and use them to predict it?
ML Model Highlights
• A comprehensive list of signals covering OS-layer, driver-layer, and device-layer observations
• Predicts server-level failures that have customer impact
• Non-handcrafted features:
  • Spatial attention
  • Temporal convolution
• Goals: longer lead time, higher precision, higher recall
Diversity of Mitigation Actions
Allocation change: allow allocation (no impact) · avoid allocation* (soft capacity impact) · block allocation (hard capacity impact)
VM migration: live migrate VMs* (unallocatable node only) · service heal VMs (unallocatable node only)
Node reboot: soft kernel reboot* (unallocatable node only) · hard reboot · offline + repair (empty node only)
Impacts range from no impact, to VM pause impact, to soft/hard capacity impact, to VM reboot impact.
* Non-deterministic action
Narya’s Mitigation Actions
• Use composite actions (sequences of actions)
  • Avoid illogical scenarios: migration without blocking allocation, repeated reboots, …
  • Add safety constraints to minimize the cost of exploration
  • Easy priority override if several rules overlap
  • Simpler learning
• Example: [diagram of a composite action that branches on conditions such as "if OS crashes" or "if empty"]
• A composite action's duration is typically on the order of days
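One hypothetical way to encode a composite action is as a sequence of (condition, step) pairs; the step and condition names below are illustrative, not taken from the paper:

```python
# A composite action: steps execute in order, each gated by a condition on
# the current node state ("always" steps run unconditionally).
composite_action = [
    ("always", "avoid_allocation"),
    ("always", "live_migrate_eligible_vms"),
    ("if_os_crashes", "soft_kernel_reboot"),
    ("if_empty", "offline_and_repair"),
]

def execute(action, state):
    """Return the steps to run, skipping those whose condition doesn't hold."""
    return [step for cond, step in action
            if cond == "always" or state.get(cond, False)]

steps = execute(composite_action, {"if_empty": True})
# -> ['avoid_allocation', 'live_migrate_eligible_vms', 'offline_and_repair']
```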
Narya's Decision Engine
• Maps each predicted bad node to the best composite mitigation action, to minimize customer pain
• Two algorithms:
  • A/B testing: explore all the different possibilities, observe the customer-pain metric(s), then use the best action if one exists
  • Multi-armed bandit (Thompson sampling): learn the right tradeoff between exploring new actions and exploiting the estimated best one, based on a single customer-pain metric
AB Testing
1. Sample action with preconfigured probabilities
2. Observe customer impact within an observation window
3. Hypothesis test between cost of all actions
4. If the difference is statistically significant, adopt the optimal action
[Diagram: nodes are assigned to Action A or Action B; their statistical impact is compared.]
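The talk does not specify which hypothesis test is used; as a minimal sketch (illustrative data, assuming Welch's t-test on per-node interruption counts), step 3 might look like:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples of per-node interruption counts."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

# Per-node VM-interruption counts observed during the window (made-up data)
action_a = [0, 1, 0, 2, 1, 0, 1, 0]
action_b = [2, 3, 1, 2, 4, 2, 3, 2]
t = welch_t(action_a, action_b)
# A large |t| (roughly > 2 at these sample sizes) suggests a significant
# difference; here action A has the lower cost.
```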
Adapting AB Testing To Our Scenario
• Cost attribution: count VM reboots during the observation window, since knowing which interruption was caused by the action is not possible
• Stickiness: the same node always uses the same action; otherwise we would violate the i.i.d. assumption
• Decision vs. action: we observe the consequence of every decision, even if it is skipped, delayed, or overridden
Bandit Motivation
• A/B testing limitations:
  • Does not leverage early observations before statistical significance is reached
  • Cannot adapt to change after statistical significance is reached
• Multi-armed bandit: dynamically learn the action probabilities based on observations
  ✓ Leverages early observations during exploration
  ✓ Adapts to change during exploitation
Bandit Framework
• Reward = -(customer impact, i.e., cost)
• Machines (arms) = the available composite actions
• Pull / game = a new node-mitigation request
• Each prediction rule is a separate bandit experiment
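As a sketch of the Thompson-sampling idea (the paper uses a reward of minus the customer cost; here we simplify to a Bernoulli reward where success means "no VM interruption observed", and the action names are illustrative):

```python
import random

class ThompsonSampling:
    """Beta-Bernoulli Thompson sampling over composite actions."""
    def __init__(self, actions):
        # One Beta(1, 1) prior (successes, failures) per composite action
        self.params = {a: [1.0, 1.0] for a in actions}

    def choose(self):
        # Sample a plausible success rate per action; pick the best sample
        samples = {a: random.betavariate(s, f) for a, (s, f) in self.params.items()}
        return max(samples, key=samples.get)

    def update(self, action, success):
        self.params[action][0 if success else 1] += 1

bandit = ThompsonSampling(["NoOp", "UA-LM-RH", "SoftReboot"])
action = bandit.choose()
bandit.update(action, success=True)
```

Early on, all arms have wide posteriors and are explored; as evidence accumulates, the sampled maxima concentrate on the best arm, which is the explore/exploit tradeoff the talk describes.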
Bandit Adaptation To Our Scenario
• Accommodate temporal change: exponential decay of past observations
• Delayed rewards
• Bandit stickiness
• Safeguards
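The exponential-decay adaptation can be sketched on Beta pseudo-counts as follows (the decay factor 0.99 is an illustrative value, not from the paper):

```python
def decayed_update(successes, failures, reward, decay=0.99):
    """Exponentially decay past pseudo-counts before adding the new
    observation, so stale evidence fades and the posterior tracks change."""
    successes = successes * decay + (1.0 if reward else 0.0)
    failures = failures * decay + (0.0 if reward else 1.0)
    return successes, failures

s, f = decayed_update(10.0, 2.0, reward=True)  # ≈ (10.9, 1.98)
```

Without decay, a long-established arm could never lose its lead; with decay, the bandit can flip to a different arm when the system's behavior changes, as in the I/O-timeouts case study later.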
Ensuring Safe Exploration
• Only allow relevant composite actions
• Override the logic if other, more severe issues are detected
• Fall back to A/B testing if there is not enough observation data
• Enforce a minimum and maximum probability for each action
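The last safeguard might be implemented as a clamp-and-renormalize step; the bounds below are illustrative, the paper does not give exact values:

```python
def clamp_probabilities(probs, p_min=0.05, p_max=0.90):
    """Enforce per-action min/max probabilities, then renormalize to sum to 1,
    so no action is ever starved of exploration or allowed to dominate."""
    clamped = {a: min(max(p, p_min), p_max) for a, p in probs.items()}
    total = sum(clamped.values())
    return {a: p / total for a, p in clamped.items()}

safe = clamp_probabilities({"NoOp": 0.99, "UA-LM-RH": 0.01})
```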
Narya's System Architecture
[Diagram: the Mitigation Engine (Request Handler, Policy Generator, Action Orchestrator) receives ML predictions over a pub/sub channel and real-time signals from the Node Monitoring Agent; a Learner on the Model Serving Platform consumes VM-impact feedback and produces the mitigation decision. Design properties: scalable, fast, adaptive.]
Overall Savings: 26%
• Compares the use of A/B testing / bandits to the previous static strategy
• Estimated AIR savings: [observed AIR] – [control-group AIR mapped to the whole fleet]
[Bar chart: % improvement per experiment; bandit improvement over A/B testing.]
Prediction Performance
• Precision: 79.5%
• Recall: 50.7%
• ML prediction lead time: 48 hours to failure, on average
Compare AB Testing and Bandits
• Counterfactual estimation of improvement (%) per prediction rule:
  2 Ierr: 18.4 · E500: 7.7 · E11 Tardigrade: 22.3 · E7 30 Threshold: 16.6 · 1 Ierr: 3.8 · 63023 Orange: 7.1 · ML Prediction: 15.3
Case Study: I/O Timeouts
• A/B testing, then a bandit experiment, on the I/O-timeouts prediction rule
• Unexpected system changes automatically switched the probability from favoring NoOp to favoring UA-LM-RH
• Although the change is not understood, the bandit automatically adjusts
[Chart: bandit probabilities (UaProb vs. NoOpProb, on a 0-1 scale) from 2/18/2020 to 4/7/2020.]
UA-LM-RH: Unallocatable + Live Migration + Reset Node Health
Operation Learnings
• Many factors influence efficacy: hence the need for probabilistic approaches
• Decisions may go wrong: closely monitor every component's behavior and use interpretable models
• Data quality is critical: watch for telemetry schema changes
• Customer impact is complex: a human in the loop helps prevent new types of impact
Summary / Takeaway Message
• Both failure prediction and proactive mitigation are critical
• There is no one-size-fits-all mitigation strategy; different mitigation strategies must be adapted in an online fashion
• Narya uses A/B testing and multi-armed bandits to proactively and adaptively mitigate predicted bad nodes
• Narya achieved a 26% improvement over the previous static strategy
Thank you!
• Acknowledgements
  • Haryadi Gunawi, our shepherd, and the anonymous reviewers
  • All our collaborators within Azure
• Contact us: [email protected]