![Page 1: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/1.jpg)
Daniel Turner
Kirill Levchenko,
Alex C. Snoeren,
Stefan Savage
Failure is a reality for large networks
Achieving high availability requires engineering the network to be robust to failure
Designing mechanisms to effectively mitigate failures requires deep understanding of real failures
Big failures generate news stories
◦ Rarely contain useful details
◦ Most network failures are not catastrophic
Collecting comprehensive failure data is difficult
◦ Lightweight techniques are limited
◦ Special-purpose monitoring is expensive
Access to network data is limited
◦ A few publicly available studies [A. Markopoulou ToN ’08] [C. Cranor SIGMOD ’03]
◦ Many networks consider data proprietary
Some networks can’t invest time or capital
Methodology to reconstruct failure history of a network
◦ Using only commonly available data
◦ No need for additional instrumentation
Analyze a production network
A time series of Layer-3 failure events
◦ I.e., for each link, a set of state transitions between up and down
And, where possible, annotated with:
◦ What caused the failure?
◦ What was the impact of the failure?
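The target output described above can be pictured as a small record type. The sketch below is only illustrative: the class name `LinkFailure`, its field names, and the example date are assumptions and do not come from the slides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class LinkFailure:
    """One down-to-up interval on a Layer-3 link (hypothetical record layout)."""
    link: str                     # link identifier, e.g. a pair of router interfaces
    down_at: datetime             # time of the up -> down transition
    up_at: datetime               # time of the down -> up transition
    cause: Optional[str] = None   # filled in only where an annotation is available
    impact: Optional[str] = None  # filled in only where an annotation is available

    @property
    def duration(self) -> timedelta:
        return self.up_at - self.down_at


# The date is a placeholder; only the times of day mirror the syslog example.
failure = LinkFailure(
    link="x:GigE1/1 <-> Y:GigE2/3",
    down_at=datetime(2006, 9, 2, 2, 40, 5),
    up_at=datetime(2006, 9, 2, 2, 45, 35),
)
print(failure.duration)  # 0:05:30
```

Keeping `cause` and `impact` optional reflects the "where possible" wording: many failures may have no matching annotation.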
interface GigabitEthernet1/1
ip address 137.211.22.8 255.255.255.254
interface GigabitEthernet0/2
ip address 137.211.23.2 255.255.255.254
interface GigabitEthernet1/1
ip address 137.211.22.9 255.255.255.254
interface GigabitEthernet3/2
ip address 137.211.25.9 255.255.255.254
Router x:
Interface 1/1
DOWN
Router Y:
Interface 2/3
DOWN
Router x:
Interface 1/1
UP
Router Y:
Interface 2/3
UP
This message is to alert you that the CENIC network engineering team has scheduled an emergency repair
Start 0001 PDT, FRI 9/02/06
End 0200 PDT, FRI 9/02/06
SCOPE: Shark bites through cable
IMPACT: Loss of redundancy between San Francisco and Los Angeles
COMMENTS: It left behind a tooth
How can we reconstruct a failure 4 years later?
◦ Syslog
  Describes interface state changes
◦ Router configuration files
  Map interfaces to links
◦ Operational announcements
Caveat: data not intended for failure reconstruction
interface GigabitEthernet1/1
ip address 137.211.22.8 255.255.255.254
137.211.22.9
interface GigabitEthernet1/1
ip address 137.211.22.8 255.255.255.254
interface GigabitEthernet0/2
ip address 137.211.23.2 255.255.255.254
137.211.22.9
interface GigabitEthernet1/1
ip address 137.211.22.8 255.255.255.254
interface GigabitEthernet1/1
ip address 137.211.22.9 255.255.255.254
137.211.22.9
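In the example above, the interface configured as 137.211.22.8 with mask 255.255.255.254 is paired with the interface configured as 137.211.22.9. One way to sketch that matching is to group interfaces by their /31 subnet using Python's standard `ipaddress` module. The slides do not spell out the exact procedure, so this is an assumption, and the router names below are placeholders.

```python
import ipaddress

# Interface addresses from the example configs on the slides, keyed by
# (router, interface). The router names are hypothetical placeholders.
configs = {
    ("router-a", "GigabitEthernet1/1"): ("137.211.22.8", "255.255.255.254"),
    ("router-b", "GigabitEthernet0/2"): ("137.211.23.2", "255.255.255.254"),
    ("router-c", "GigabitEthernet1/1"): ("137.211.22.9", "255.255.255.254"),
    ("router-d", "GigabitEthernet3/2"): ("137.211.25.9", "255.255.255.254"),
}


def infer_links(configs):
    """Group interfaces by subnet; a subnet with exactly two members is a link."""
    by_subnet = {}
    for endpoint, (addr, mask) in configs.items():
        subnet = ipaddress.ip_interface(f"{addr}/{mask}").network
        by_subnet.setdefault(subnet, []).append(endpoint)
    return {net: ends for net, ends in by_subnet.items() if len(ends) == 2}


links = infer_links(configs)
for subnet, (a, b) in links.items():
    print(subnet, a, b)
```

Only the 137.211.22.8/31 subnet has two endpoints in this sample, so it is the only link inferred; the other two interfaces stay unmatched until their peers' configs are loaded.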
SYSLOG
02:40:05 x.cenic.net: %LINK-3-UPDOWN: Interface GigE1/1, changed state to down
02:40:05 Y.cenic.net: %LINK-3-UPDOWN: Interface GigE2/3, changed state to down
02:45:35 x.cenic.net: %LINK-3-UPDOWN: Interface GigE1/1, changed state to up
02:45:35 Y.cenic.net: %LINK-3-UPDOWN: Interface GigE2/3, changed state to up
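A minimal sketch of turning such messages into state transitions follows. It assumes the abbreviated line format shown on the slide (time, router, `%LINK-3-UPDOWN`, interface, new state); real syslog lines usually carry a date and other fields, and the function name `parse_transitions` is illustrative.

```python
import re
from datetime import datetime

SYSLOG = """\
02:40:05 x.cenic.net: %LINK-3-UPDOWN: Interface GigE1/1, changed state to down
02:40:05 Y.cenic.net: %LINK-3-UPDOWN: Interface GigE2/3, changed state to down
02:45:35 x.cenic.net: %LINK-3-UPDOWN: Interface GigE1/1, changed state to up
02:45:35 Y.cenic.net: %LINK-3-UPDOWN: Interface GigE2/3, changed state to up
"""

# Matches only the simplified message layout used on the slide.
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) (?P<router>\S+): %LINK-3-UPDOWN: "
    r"Interface (?P<iface>\S+), changed state to (?P<state>up|down)$"
)


def parse_transitions(text):
    """Return (time, router, interface, state) tuples for each UPDOWN line."""
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            t = datetime.strptime(m["time"], "%H:%M:%S").time()
            events.append((t, m["router"], m["iface"], m["state"]))
    return events


events = parse_transitions(SYSLOG)
for event in events:
    print(event)
```

Lines that do not match the pattern are skipped silently, which suits a log that mixes many message types.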
This message is to alert you that the CENIC network engineering team is performing an emergency repair
Start 0001 PDT, FRI 9/02/06
End 0200 PDT, FRI 9/02/06
SCOPE: Shark bites through cable
IMPACT: Loss of redundancy between San Francisco and Los Angeles
COMMENTS: It left behind a tooth
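To line an announcement up with failures, its time window and free-text fields first have to be extracted. The sketch below assumes one field per line in the `LABEL: value` form and a `HHMM TZ, DAY M/DD/YY` timestamp, as in the example; real announcements are written by people and will vary. The function name and the returned dictionary keys are illustrative.

```python
import re
from datetime import datetime

ANNOUNCEMENT = """\
Start 0001 PDT, FRI 9/02/06
End 0200 PDT, FRI 9/02/06
SCOPE: Shark bites through cable
IMPACT: Loss of redundancy between San Francisco and Los Angeles
COMMENTS: It left behind a tooth
"""

WHEN_RE = re.compile(
    r"^(Start|End) (\d{2})(\d{2}) \w+, \w+ (\d{1,2})/(\d{1,2})/(\d{2})$"
)
FIELD_RE = re.compile(r"^(SCOPE|IMPACT|COMMENTS): ?(.*)$")


def parse_announcement(text):
    """Extract the time window and free-text fields from one announcement."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        when = WHEN_RE.match(line)
        if when:
            label, hh, mm, month, day, yy = when.groups()
            # Two-digit years are assumed to be 20xx; the time zone is ignored.
            record[label.lower()] = datetime(
                2000 + int(yy), int(month), int(day), int(hh), int(mm)
            )
            continue
        field = FIELD_RE.match(line)
        if field:
            record[field.group(1).lower()] = field.group(2)
    return record


record = parse_announcement(ANNOUNCEMENT)
print(record["start"], record["end"], record["scope"])
```

With `start` and `end` as `datetime` values, a failure can be matched to an announcement by checking whether its down time falls inside the window.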
Serving California educational institutions
Over 200 routers
5 years of data
[Figure: network map with sites labeled LAX, SLO, SOL, SVL, OAK]
This message is to alert you that the CENIC network engineering team has scheduled maintenance
Start 0001 PDT, FRI 8/17/05
End 0200 PDT, FRI 8/17/05
SCOPE: Routing protocol parameter change
IMPACT: San Francisco PoP
COMMENTS: Other PoPs to follow
This message is to alert you that the CENIC network engineering team has scheduled a repair
Start 1930 PDT, FRI 11/17/06
End 2000 PDT, FRI 11/17/06
SCOPE: Faulty optical amplifier
IMPACT: San Diego PoP
COMMENTS: …
Motivation
Methodology
◦ Limitations
◦ Validation
Findings in the CENIC network
Syslog messages are sent from routers to a central server
◦ Using UDP
◦ Messages can be lost
[Figure: link state (Up/Down) over time (0 to 5), with syslog Down and Up messages marked]
What happened?
◦ Message lost
◦ Spurious message
Exclude time between 2 & 3
Same issue with double UPs
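One reading of the exclusion rule above: when two consecutive messages for a link report the same state, there is no way to tell a lost message from a spurious one, so the time between them is set aside instead of being counted as up or down. The sketch below implements that reading. The message sequence is hypothetical, chosen so that the ambiguous interval falls between times 2 and 3, and the function name is illustrative.

```python
def classify_intervals(messages):
    """Split a link's timeline into known-state and ambiguous intervals.

    `messages` is a time-ordered list of (time, state) pairs, with state
    being "up" or "down". Two consecutive messages with the same state
    leave the interval between them ambiguous, so it is reported separately.
    """
    known, ambiguous = [], []
    for (t0, s0), (t1, s1) in zip(messages, messages[1:]):
        if s0 == s1:
            ambiguous.append((t0, t1))
        else:
            known.append((t0, t1, s0))
    return known, ambiguous


# Hypothetical sequence: a second "down" arrives at time 3 with no "up"
# in between.
messages = [(0, "down"), (1, "up"), (2, "down"), (3, "down"), (4, "up")]
known, ambiguous = classify_intervals(messages)
downtime = sum(t1 - t0 for t0, t1, state in known if state == "down")
print(known)      # [(0, 1, 'down'), (1, 2, 'up'), (3, 4, 'down')]
print(ambiguous)  # [(2, 3)]
print(downtime)   # 2
```

The same loop covers double UPs, since it only compares adjacent states and never assumes which one was repeated.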
Configuration files are logged intermittently
Configuration files do not describe Layer-2 topology
Operational announcements are written by humans
◦ Selection bias
◦ Categorization is subjective
Are there events mentioned in announcements that aren’t in syslog?
◦ Manually checked a random 1% of announcements
◦ 97% of events were confirmed
How do we know syslog is accurate?
CAIDA Skitter project (now Ark)
◦ Traceroutes to every /24 on the Internet
◦ 75 million probes over 6 months traversed CENIC
◦ Confirmed no traffic over any interface that we thought was down
Can we verify links were down?
◦ Routing protocols aim to mask failures
◦ Isolation is externally visible: BGP updates are sent
Route Views project records BGP traffic
◦ Verified 105 out of 147 isolation events
Motivation
Methodology
◦ Limitations
◦ Validation
Findings in the CENIC Network
[Figure: network map with sites labeled LAX, SLO, SOL, SVL, OAK]
Three types of links:
◦ Backbone
◦ Customer Access
◦ High Performance Backbone
99.9%
99.999%
99.99%
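To put the availability levels on this slide in concrete terms, the short sketch below converts an availability percentage into allowed downtime per year. It assumes a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year permitted at a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR

for level in (99.9, 99.99, 99.999):
    print(f"{level}% -> {downtime_minutes_per_year(level):.2f} minutes/year")
```

Three nines allows roughly 8.8 hours of downtime per year, four nines about 53 minutes, and five nines a little over 5 minutes.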
![Page 79: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/79.jpg)
![Page 80: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/80.jpg)
> 60% of failures last less than 1 minute
![Page 81: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/81.jpg)
7,000 email announcements
3,000 events
28% of events describe a failure
18% of observed failures are explained
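The "explained" percentage can be thought of as the share of observed failures that line up with an announced event on the same link. The sketch below is one possible way to compute it, assuming `(link, start, end)` records and a simple time-overlap rule; both are hypothetical, as the slides do not give the matching criteria.

```python
def explained_fraction(failures, announcements):
    """Fraction of failures overlapping an announcement for the same link.

    failures, announcements: lists of (link, start, end) tuples
    (hypothetical record format).
    """
    by_link = {}
    for link, start, end in announcements:
        by_link.setdefault(link, []).append((start, end))

    explained = 0
    for link, start, end in failures:
        # Two intervals overlap when each starts before the other ends.
        if any(a_start <= end and start <= a_end
               for a_start, a_end in by_link.get(link, [])):
            explained += 1
    return explained / len(failures) if failures else 0.0

# From the slide: 28% of roughly 3,000 events describe a failure.
failure_events = round(0.28 * 3000)
```

With the slide's figures, `failure_events` is about 840 announced events that describe a failure.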
![Page 82: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/82.jpg)
![Page 83: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/83.jpg)
Other
* Machine room flooded
* DoS attack
* Construction crews demolished a manhole with active cables
* Or unsolved
![Page 84: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/84.jpg)
![Page 85: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/85.jpg)
![Page 86: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/86.jpg)
![Page 87: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/87.jpg)
![Page 88: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/88.jpg)
![Page 89: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/89.jpg)
Not all downtime is equal
◦ Some failures are unexpected: fiber cuts
◦ Some failures are scheduled: software upgrades
![Page 90: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/90.jpg)
![Page 91: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/91.jpg)
Scheduled vs. Unscheduled
◦ Simple metric to evaluate impact
Difficult to gauge impact of most failures
◦ Only 18% of failures are covered by an email
Customer isolation events have a clear impact
◦ Recall, BGP traffic makes these easy to spot
![Page 92: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/92.jpg)
![Page 93: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/93.jpg)
![Page 94: Daniel Turner Kirill Levchenko, Alex C. Snoeren, Stefan Savageconferences.sigcomm.org/sigcomm/2010/slides/S9Turner.pdf · understanding of real failures ... A time series of Layer-3](https://reader034.fdocuments.us/reader034/viewer/2022050207/5f5a51e18bb6313934289c3d/html5/thumbnails/94.jpg)
Engineering for failure requires real data
◦ Data has historically been difficult to obtain
Methodology to perform historical failure analysis with low-quality data sources
Shared our findings in the CENIC network
◦ Reliability of individual components
◦ Causes of failures
◦ Impact of failures