NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen,...
NetPilot: Automating Datacenter Network Failure Mitigation
Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang
Presented by: Chen Li
Failures are Common and Harmful
• Network failures are common due to VERY large datacenters
– 10,000+ switches
• Failures cause long down times
– Six-month failure logs of production datacenters
– 25% of failures take 13+ hours to repair
– Figure: time from detection to repair (minutes)
• Long failure duration → large revenue loss
How to Shorten Failure Recovery Time?
Previous Work
• Conventional failure recovery takes 3 steps: Detection → Diagnosis → Repair
• Failure localization/diagnosis
– [M. K. Aguilera, SOSP’03]
– [M. Y. Chen, NSDI’04]
– [R. R. Kompella, NSDI’05]
– [P. Bahl, SIGCOMM’07]
– [S. Kandula, SIGCOMM’09]
– …
Diagram labels: passive, ping, active
Automating Failure Diagnosis is Challenging
• Root causes are deep in network stack
• Diagnosis involves multiple parties
| Category | Failure types | Diagnosis & Repair | % |
|---|---|---|---|
| Software (21%) | Link layer loop | Find and fix bugs | 19% |
| Software (21%) | Imbalance overload | Find and fix bugs | 2% |
| Hardware (18%) | FCS error | Replace cable | 13% |
| Hardware (18%) | Unstable power | Repair power | 5% |
| Unknown (23%) | Switch stops forwarding | N/A | 9% |
| Unknown (23%) | Imbalance overload | N/A | 7% |
| Unknown (23%) | Lost configuration | N/A | 5% |
| Unknown (23%) | High CPU utilization | N/A | 2% |
| Configuration (38%) | Errors on multiple switches | Update configuration | 32% |
| Configuration (38%) | Errors on one switch | Update configuration | 6% |
• Six-month failure logs from several production DCNs
1. Root causes are deep in the network stack
2. Diagnosis involves multiple parties
Failure Diagnosis Requires Human Intervention!
Can we do something other than failure diagnosis?
NetPilot: Mitigating rather than Diagnosing Failures
• Mitigate failure symptoms ASAP, at the cost of reduced capacity
Diagram: Detection → Diagnosis → Repair, with Automated Mitigation added
NetPilot Benefits
• Short recovery time
• Small network disruption
• Low operation cost
Failure Mitigation is Effective
• Most failures can be mitigated by simple actions
• Mitigation is feasible due to redundancy
| Category | Failure types | Mitigation | Repair | % |
|---|---|---|---|---|
| Software (21%) | Link layer loop | Deactivate port | Find and fix bugs | 19% |
| Software (21%) | Imbalance-triggered overload | Restart switch | Find and fix bugs | 2% |
| Hardware (18%) | FCS error | Deactivate port | Replace cable | 13% |
| Hardware (18%) | Unstable power | Deactivate switch | Repair power | 5% |
| Unknown (23%) | Switch stops forwarding | Restart switch | N/A | 9% |
| Unknown (23%) | Imbalance-triggered overload | Restart switch | N/A | 7% |
| Unknown (23%) | Lost configuration | Restart switch | N/A | 5% |
| Unknown (23%) | High CPU utilization | Restart switch | N/A | 2% |
| Configuration (38%) | Errors on multiple switches | n/a | Update configuration | 32% |
| Configuration (38%) | Errors on single switch | Deactivate switch | Update configuration | 6% |
68% of failures can be mitigated by simple actions
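The 68% figure can be checked against the per-failure-type percentages in the table. The short sketch below copies those percentages and sums every row that lists a mitigation action; the data structure and the function name `mitigable_share` are illustrative assumptions, not part of the slides.

```python
# Rows copied from the table above: (failure type, mitigation action, percent).
# None marks the one row whose mitigation is listed as "n/a".
failure_types = [
    ("Link layer loop", "Deactivate port", 19),
    ("Imbalance-triggered overload (software)", "Restart switch", 2),
    ("FCS error", "Deactivate port", 13),
    ("Unstable power", "Deactivate switch", 5),
    ("Switch stops forwarding", "Restart switch", 9),
    ("Imbalance-triggered overload (unknown)", "Restart switch", 7),
    ("Lost configuration", "Restart switch", 5),
    ("High CPU utilization", "Restart switch", 2),
    ("Errors on multiple switches", None, 32),
    ("Errors on single switch", "Deactivate switch", 6),
]


def mitigable_share(rows):
    """Sum the percentages of all failure types that have a mitigation action."""
    return sum(pct for _, action, pct in rows if action is not None)


total = sum(pct for _, _, pct in failure_types)
mitigable = mitigable_share(failure_types)
print(f"total = {total}%, mitigable by simple actions = {mitigable}%")
```

Only "Errors on multiple switches" (32%) has no listed mitigation, which is why the remainder is 68%.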
Mitigation Made Possible by Redundancy
• Redundancy → deactivation unlikely to partition / overload the network
Diagram: ToR, AGG, and CORE switch layers, connected to the Internet
Outline
• Automating failure diagnosis is challenging
• Failure mitigation is effective
• How to automate mitigation?
• NetPilot evaluations
• Conclusion
A Strawman NetPilot: Trial-and-error
Flowchart: Network failure → Localization → Execute an action → Failure mitigated? (Yes: End; No: roll back if necessary, then try again)
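The strawman flow can be sketched as a simple loop. This is a minimal illustration under stated assumptions: the `Action` type, the `trial_and_error` function, and the toy port set are hypothetical and do not reflect NetPilot's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Action:
    """A hypothetical mitigation action paired with its rollback."""
    name: str
    apply: Callable[[], None]
    undo: Callable[[], None]


def trial_and_error(candidates: Iterable[Action],
                    failure_mitigated: Callable[[], bool]) -> Optional[Action]:
    """Execute candidate actions one by one until the failure is mitigated."""
    for action in candidates:
        action.apply()
        if failure_mitigated():
            return action      # "Yes" branch: end
        action.undo()          # "No" branch: roll back, then try the next one
    return None


# Toy network: three active ports, one of which (assumed) causes the failure.
active_ports = {"p1", "p2", "p3"}
faulty_port = "p2"


def deactivate(port: str) -> Action:
    return Action(
        name=f"deactivate {port}",
        apply=lambda: active_ports.discard(port),
        undo=lambda: active_ports.add(port),
    )


# Localization would normally supply the suspects; here they are hard-coded.
chosen = trial_and_error(
    [deactivate(p) for p in ("p1", "p2", "p3")],
    failure_mitigated=lambda: faulty_port not in active_ports,
)
print(chosen.name if chosen else "no action mitigated the failure")
```

In this toy run the first trial is wasted and rolled back, which is the cost the strawman pays for trying actions blindly.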
NetPilot: Challenges & Solutions
1. Blind trial-and-error takes a long time
– Solution: failure-specific localization
2. Actions may partition/overload the network
– Solution: impact estimation
3. Different actions have different side-effects
– Solution: rank actions based on impact
Flowchart: Network failure → Localization → Estimate impact → Rank actions → Execute an action → Failure mitigated? (Yes: End; No: roll back if necessary, then try again)
Failure Specific Localization
• Limited # of failure types
• Domain knowledge improves accuracy
Failure types
1. Link layer loop
2. Imbalance-triggered overload
3. FCS error
4. Unstable power
5. Switch stops forwarding
6. Imbalance-triggered overload
7. Lost configuration
8. High CPU utilization
9. Errors on multiple switches
10. Errors on single switch
Example: Frame Check Sequence (FCS) Errors
• 13% of all the failures
• Cut-through switching
– Forward frames before checksums are verified
• Increase application latency
Localizing FCS Errors
error frames seen on L = frames corrupted by L + frames corrupted by other links that traverse L
• x_L: corruption rate of link L
• # of variables = # of equations = # of links
• Corrupted links: x_L > 0
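One way to read the relation above is as a linear system with one equation and one unknown per link. The sketch below assumes a first-order model in which the errors seen on link L equal the sum over links K of (frames that crossed K and then L) × x_K, with the K = L term counting all frames on L. The matrix `frames_via`, the chain topology, the traffic counts, and the pure-Python solver are all illustrative assumptions rather than NetPilot's actual formulation.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


links = ["A", "B", "C"]

# Assumed chain A -> B -> C carrying 1000 frames end to end.
# frames_via[l][k] = frames seen on link l that crossed link k at or before l.
frames_via = [
    [1000, 0, 0],        # link A only sees its own frames
    [1000, 1000, 0],     # link B also sees frames that crossed A
    [1000, 1000, 1000],  # link C sees frames that crossed A and B
]

# Assumed observed error-frame counters: B corrupts 1% of frames, and
# cut-through switching forwards the corrupted frames on to C.
errors_seen = [0, 10, 10]

rates = solve_linear(frames_via, errors_seen)
corrupted = [name for name, x in zip(links, rates) if x > 1e-9]
print(dict(zip(links, rates)), corrupted)
```

Although links B and C report the same error count, solving the system attributes all the corruption to B, so only B has x_L > 0.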
NetPilot Overview
Flowchart: Network failure → Localization → Estimate impact → Rank actions → Execute an action → Failure mitigated? (Yes: End; No: roll back if necessary, then try again)
Impact Metrics
• Derived from Service Level Agreement (SLA)
– Availability: online_server_ratio
– Packet loss: total_lost_pkt
– Latency: max_link_utilization
– Small link utilization → small (queuing) delay
• total_lost_pkt & max_link_utilization derived from utilization of individual links
Estimating Link Utilization
• # of flows >> # of redundant paths
– Traffic evenly distributed under ECMP
• Estimate the load contributed by each flow on each link
• Sum up the loads to compute utilization
Diagram: Action, Traffic, and Topology → Impact Estimator → Link utilization
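The estimation steps above can be sketched as follows. The code assumes an idealized ECMP that splits each flow's load equally across all shortest-path next hops at every switch, a uniform link capacity, and a tiny two-ToR topology; the function names, the `CAPACITY` constant, and the topology are hypothetical choices for illustration only.

```python
from collections import defaultdict, deque

CAPACITY = 10.0  # assumed uniform link capacity, in the same unit as flow rates


def hop_distances(adj, dst):
    """BFS hop count from every reachable node to dst."""
    dist = {dst: 0}
    queue = deque([dst])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def estimate_link_loads(adj, flows):
    """Sum, per directed link, the load each flow contributes under even ECMP."""
    loads = defaultdict(float)
    for src, dst, rate in flows:
        dist = hop_distances(adj, dst)
        if src not in dist:
            continue  # destination unreachable: the flow contributes no load
        pending = {src: rate}
        for d in range(dist[src], 0, -1):
            nxt = defaultdict(float)
            for node, load in pending.items():
                hops = [n for n in adj[node] if dist.get(n) == d - 1]
                for n in hops:  # equal split over shortest-path next hops
                    loads[(node, n)] += load / len(hops)
                    nxt[n] += load / len(hops)
            pending = nxt
    return loads


def max_link_utilization(adj, flows):
    loads = estimate_link_loads(adj, flows)
    return max(loads.values(), default=0.0) / CAPACITY


def without_link(adj, a, b):
    """Topology after a hypothetical 'deactivate link a-b' action."""
    return {n: [m for m in nbrs if {n, m} != {a, b}] for n, nbrs in adj.items()}


adj = {
    "tor1": ["agg_a", "agg_b"],
    "tor2": ["agg_a", "agg_b"],
    "agg_a": ["tor1", "tor2"],
    "agg_b": ["tor1", "tor2"],
}
flows = [("tor1", "tor2", 4.0)]

before = max_link_utilization(adj, flows)
after = max_link_utilization(without_link(adj, "tor1", "agg_a"), flows)
print(before, after)
```

Here the action, the traffic, and the topology are the inputs, and the estimated utilization is the output: deactivating one of the two uplinks doubles the load on the remaining path.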
Link Utilization Estimation is Highly Accurate
• 1-month traffic from an 8000-server network
– Log socket events on each server
• Ground truth: SNMP counters
NetPilot Overview
• Rank actions: choose the action with the least impact
Outline
• Automating failure diagnosis is challenging
• Failure mitigation is effective
• How to automate mitigation?
– Localization → impact estimation → ranking
• NetPilot evaluations
– Mitigating load imbalance
– Mitigating FCS errors
– Mitigating overload
• Conclusion
Load Imbalance
• Agg_a stops receiving traffic
• Localize to 4 suspects
Diagram: core_a and core_b connected to Agg_a and Agg_b
Mitigating Load Imbalance
Figure: traffic over time (minutes, 0:00 to 0:25) on the links core_a → Agg_a, core_b → Agg_a, core_a → Agg_b, and core_b → Agg_b
Annotated events: Agg_a stops receiving traffic; detected & reboot core_b; reboot core_a; reboot Agg_a; mitigation confirmed; load evenly split
Fast FCS Error Mitigation
• NetPilot: deactivates 2 links in 1 trial within 15 minutes
• Human operator: after 11 trials in 3.5 hours, 2 out of 28 ports are deactivated
• 3.5 hours → 15 minutes
Mitigating Link Overload
• Mitigate overload by deactivating healthy links
– Many candidate links in production networks
– Choose the link(s) with the least impact
Diagram: agg carries 3 units of traffic over its links to core1 and core2
– Panel 1: 1.5 / 1.5
– Panel 2: 1 / 1.5, lost 0.5
– Panel 3: 0 / 3
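The numbers in the diagram can be reproduced with a small worked example. The sketch assumes that agg offers 3 units split evenly over its active uplinks, that the agg-core1 link can carry only 1 unit, and that the agg-core2 link can carry 3; these capacities, the link names, and the `lost_traffic` helper are assumptions chosen to match the panels, not values stated in the slides.

```python
from typing import Dict


def lost_traffic(demand: float, capacities: Dict[str, float]) -> float:
    """Traffic dropped when demand is split evenly over the active links."""
    if not capacities:
        return demand
    share = demand / len(capacities)
    return sum(max(0.0, share - cap) for cap in capacities.values())


demand = 3.0
capacities = {"agg-core1": 1.0, "agg-core2": 3.0}  # assumed link capacities

# Candidate actions: leave the network as is, or deactivate one uplink.
candidates = {"do nothing": capacities}
for link in capacities:
    remaining = {k: v for k, v in capacities.items() if k != link}
    candidates[f"deactivate {link}"] = remaining

impact = {name: lost_traffic(demand, caps) for name, caps in candidates.items()}
best = min(impact, key=impact.get)

for name, loss in impact.items():
    print(f"{name}: lost {loss}")
print("least impact:", best)
```

Under these assumptions, doing nothing loses 0.5 units (panel 2), while deactivating the constrained agg-core1 link shifts all 3 units to core2 with no loss (panel 3). Deactivating the other link would lose 2 units, which is why candidate actions need to be compared by impact.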
Action Ranking Lowers Link Utilization
• Replay 97 overload incidents due to link failures
Conclusion
• Mitigation shortens failure recovery time
– Simple actions are effective
– Made possible by redundancy
• NetPilot: automating failure mitigation
– Recovery time: hours → minutes
– Several mitigation scenarios deployed in Bing