
MSc System and Network Engineering Research Project 2

Predicting intermittent network device failures based on network metrics from multiple data sources

Chris Kuipers [email protected]

Henk van Doorn [email protected]

July 16, 2018

Supervisors: P. Boers, M. Kaat

Assessor: Prof. Dr. C.T.A.M. de Laat

Abstract

Predicting failures in networking has long been a subject of interest. Time series data and syslog are now often used to monitor the state of the network. SURFnet has been collecting this data in searchable database clusters for over two years. We set out to determine to what extent these two datasets can be used to predict network device failures, which would enable network operators to take pre-emptive measures against an imminent failure. Using experiments, we proved a relation between the dataset containing the syslog device events and the dataset containing the time-series network metrics.


Contents

1 Introduction
  1.1 Research Question
2 Background
  2.1 SURFnet
  2.2 Devices
    2.2.1 Core-routers
    2.2.2 Ciena 5410 switches
  2.3 Monitoring
    2.3.1 InfluxDB
    2.3.2 Syslog
    2.3.3 Jira
3 Related Work
4 Methodology
  4.1 Classification
    4.1.1 Spontaneous line-card reboots
    4.1.2 Permanent line-card failures
    4.1.3 Unusual loss of throughput
  4.2 Process
    4.2.1 Metrics identification
    4.2.2 Event analysis
    4.2.3 Validation
  4.3 Identification
    4.3.1 Spontaneous line-card reboots
    4.3.2 Permanent line-card failures
    4.3.3 Unusual loss of throughput
5 Findings
  5.1 Spontaneous reboots
  5.2 Line-card errors on core-routers
    5.2.1 Failure at Bor on 14th of September 2017
    5.2.2 Failure at Truus on 6th of March 2018
    5.2.3 Analysis
  5.3 Spontaneous throughput loss
    5.3.1 Experiment
6 Discussion
  6.1 Future work
  6.2 Recommendations
7 Conclusion
A Evaluation form 14-09-2017
B Evaluation form 06-03-2018


1 Introduction

Predicting network errors and failures has been a subject of interest for over 30 years[1].

The growing complexity and dynamics of networks have increased the demands on redundancy, uptime and fault tolerance. In most cases, networks are monitored with tools that visualize network performance and show metrics representing the current state of the network. This way of monitoring can be classified as reactive, meaning that network operators react to events once they have happened.

With storage capacity and processing power increasing over time, we are now at a point where it is relatively easy to store and process events logged by hundreds of network devices. This allows manual analysis of errors and events, both in real time and retrospectively. The next step would be to identify anomalous behavior based on the data at hand and to predict anomalies based on patterns in this data, thereby transitioning to a more proactive monitoring strategy.

SURFnet has been monitoring their network in a variety of ways since 1993. For the last five years, SURFnet has been collecting syslog[2] data and network device metrics for monitoring purposes. This data ranges from temperature to link utilization to error messages. Over the course of two years, these syslog messages and device metrics have been stored in searchable databases. During this two-year period, the SURFnet network has had a fair number of outages where customers experienced degraded performance. The impact of these events on SURFnet's customers varied from a momentary loss of resilience for one customer to a complete loss of connectivity for multiple customers.

As SURFnet is storing a significant amount of data, the question arose whether this data could be used to predict failures. Modern techniques such as machine learning are nowadays capable of processing large amounts of structured and unstructured data to predict trends[3]. Part of this research is to analyze the datasets to determine whether they could be used for automated processing such as machine learning. This could be an important aspect of future networking since, according to [1], human network operators are only able to interpret a small set of metrics and events useful for short-term prediction, yet are unable to interpret a complex and large variety of data over an extended period of time to predict intermittent network device failure.

This work sets out to determine to what extent it is possible to predict intermittent hardware failures based on network event data from multiple data sources collected by SURFnet. We will describe the current possibilities, limitations and usability of the datasets.

1.1 Research Question

We pose the following research question:

To what extent is it possible to predict intermittent network device failures based on network metrics from multiple data sources?

To answer the above question, we pose the following sub-questions:

• What metrics are relevant to identify a failure?
• What pattern can be identified between the intermittent failures?
• Is it possible to prove correlation between multiple data sources?


2 Background

SURFnet has around 400 switches and routers which they actively monitor using a variety of tools. They operate two core-routers and offer 1, 10, 40 and 100 Gbit/s network interfaces. This section gives a brief insight into these tools, the network architecture and the datasets we have used to conduct our research.

The overview we have presented here is concise and describes only the basic components our research is based on. SURFnet's infrastructure and monitoring systems are more complex and extensive than we have described here.

2.1 SURFnet

The SURFnet network infrastructure, called SURFnet version 7, consists of roughly 400 switches and routers of various types. The device event logs are stored centrally. Besides collecting these event logs, device metrics are polled once every five minutes using SNMP[4], resulting in around 200,000 data points for every round of polling[5].

A third data source is the Jira[6] ticketing system, where network operators log tickets and keep track of actions during and after incidents and problems.

2.2 Devices

This section briefly describes the two networking devices our research focuses on.

2.2.1 Core-routers

Currently, SURFnet has two Juniper T4000 series routers which are the core-routers within their network. They are located at two separate geographic locations for redundancy purposes and are named as follows:

• Bor
• Truus

These core-routers play a pivotal role within the SURFnet network and are interconnected via two fiber-optic cables.

2.2.2 Ciena 5410 switches

A core component within the SURFnet network is the Ciena 5410 switch. These switches are equipped with either 1 or 10 gigabit fiber-optic interfaces which connect to customers. According to the network operators, these switches can be polled with SNMP at most once every five minutes. Polling at a higher frequency would cause the switch to drop the current SNMP polling process, resulting in data gaps. The SNMP polling method is further described in Section 2.3.1.

At this moment, SURFnet operates roughly 40 of these devices.

2.3 Monitoring

This section describes the network monitoring architecture, which feeds our datasets with real-time data from the SURFnet network.

2.3.1 InfluxDB

Device and link-state information is stored in InfluxDB[7], a time series database. The data is gathered by SNMP[4] polling. Using Grafana, a time series visualization tool, the metrics stored in InfluxDB can be plotted over time. Using these graphs, network operators can monitor the network. The architecture used to gather the device metrics is outlined in Figure 1.


Figure 1: SNMP polling of network devices (SURFnet[5])

The SNMP polling depicted in Figure 1 is implemented as an SNMP walk that retrieves a subtree of management values using SNMP GETNEXT requests[8]. As a result, all available SNMP MIB values of a device are stored in InfluxDB. Telegraf timestamps all metrics gathered with SNMP. At this moment, the polling interval is once every five minutes.
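As a rough illustration of what such a stored data point could look like, the sketch below encodes one round of polled interface counters as InfluxDB line protocol; the measurement and tag names are illustrative assumptions, not SURFnet's actual schema.

import time

def to_line_protocol(device: str, interface: str, counters: dict) -> str:
    """Encode one round of polled counters as a single line-protocol record."""
    tags = f"snmp,agent_host={device},ifName={interface}"   # measurement + tags (assumed names)
    fields = ",".join(f"{name}={value}i" for name, value in counters.items())
    timestamp_ns = time.time_ns()                           # Telegraf timestamps each received metric
    return f"{tags} {fields} {timestamp_ns}"

print(to_line_protocol("bor", "xe-6/0/0", {"ifInErrors": 12, "ifHCInOctets": 48205113}))
# e.g. snmp,agent_host=bor,ifName=xe-6/0/0 ifInErrors=12i,ifHCInOctets=48205113i 1531728000000000000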

2.3.2 Syslog

When a process on a device in the network generates a log message, the message is forwarded to Splunk[9] in the syslog[2] format, a standardized message format for system logs. Splunk is used as a central logging location and offers a range of tools for querying, analyzing and visualizing the collected messages. Currently, the verbosity of the syslog messages is set to informational. The syslog logging level is set by a number which corresponds to a severity level; the levels are listed in Table 1.

Level  Severity
0      Emergency: system is unusable
1      Alert: action must be taken immediately
2      Critical: critical conditions
3      Error: error conditions
4      Warning: warning conditions
5      Notice: normal but significant condition
6      Informational: informational messages
7      Debug: debug-level messages

Table 1: Syslog severity levels according to RFC 5424 - The Syslog Protocol[2]

Setting the syslog logging level to six (informational) results in all messages from level zero (emergency) up to and including level six (informational) being logged. This setting therefore does not generate debug-level messages, which could contain in-depth technical data. The debug level is used in specific cases, for instance to troubleshoot issues.
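To make the severity cut-off concrete, the sketch below filters messages by the severity encoded in an RFC 5424 priority field; the example messages and the fallback behaviour are assumptions.

import re

LOG_LEVEL = 6  # informational

def severity(line: str) -> int:
    """Severity = priority % 8 (priority = facility * 8 + severity, RFC 5424)."""
    match = re.match(r"<(\d+)>", line)
    if not match:
        return 6  # assumption: treat messages without a PRI field as informational
    return int(match.group(1)) % 8

messages = [
    "<165>1 2018-02-22T09:59:39Z gv008a ... THERMAL_STATE_CHANGE ...",  # 165 % 8 = 5 (notice)
    "<167>1 2018-02-22T09:59:40Z gv008a ... debug detail ...",          # 167 % 8 = 7 (debug)
]
kept = [m for m in messages if severity(m) <= LOG_LEVEL]  # the debug message is dropped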


2.3.3 Jira

Besides the data stored in InfluxDB and Splunk, SURFnet's ticketing system Jira[6] contains information about issues and incidents regarding the network. Network operators keep track of issues and create and update trouble tickets. Jira offers an overview of network-related issues and the measures taken to mitigate or solve these issues.

In Section 4 we describe the methods by which we attempt to classify, analyze and validate our findings.

3 Related Work

In recent years, the use of Artificial Intelligence, and Machine Learning in particular, has taken off. The seemingly unparalleled power of Artificial Neural Networks is used in the most diverse fields of IT. Forbes shows what the leading companies are doing with Machine Learning[10].

In order to train these neural networks effectively, labeled datasets are a necessity. Based on the labels, the learning algorithms are capable of identifying recurring patterns or anomalies that are related to the labeled events.

Roughan et al. tried to detect IP forwarding anomalies based on routing and traffic data gathered with SNMP and BGP monitoring[11]. They state that correlation between different sources is a necessity, since data gathered using SNMP is prone to missing data points. Since they faced challenges comparable to ours, their work helped us to correlate SNMP data to syslog messages.

Salfner, Lenk, and Malek performed a survey of online failure prediction methods[1]. They created a taxonomy of different approaches and described major concepts. Furthermore, they specify the datasets necessary for machine learning, stressing the importance of labeled datasets. We have used their proposed methods as input to define our own method.

Aceto et al. discuss the scattered literature on internet outages and the importance of having an overview of the current issues and threats[12]. They defined a terminology of internet outages, including causes and implications. Furthermore, they did extensive literature research to create an overview of the different issues and the measures taken to mitigate them, which include models attempting to predict failures.

Duenas et al. built a system to create training sets suitable for machine learning, using Apache Spark with different learning algorithms[13]. The system analyzed events generated by the management systems and carefully labeled them.

Chalermarrewong, Achalakul, and See attempted to predict failures in data centers[14]. Although not focusing on networking, they used time series data as well. They ran their model in a simulated cluster on Simics and compared the results to real-world failures in the data center. Fault prediction proved accurate in the simulated environment but gave a high false-positive rate in the real-world data center. Their work is comparable to ours, since they conducted research on an enormous scale. Their approach was used as input for our methodology.

Zhang et al. created a framework named Prefix to predict switch failures in data centers[15]. They collected data from 9397 switches of 3 different models. They were able to predict a significant number of failures with a low false-positive rate.

4 Methodology

At the start of this research, SURFnet provided us with initial starting points. We were handed an overview of a few past big outages due to faulty line-cards. During interviews with the SURFnet network operators, we obtained a second starting point: spontaneous line-card reboots without any apparent root cause.

Using these two types of events, we performed an initial analysis of the Jira ticketing data, focusing on past outages. Interviews with SURFnet's network operators, together with the results of our initial analysis, were used to scope the research. We decided to concentrate on line-card related issues, because these issues appear to have happened a number of times over the past two years.

During the literature study we analyzed what others had already done in the field of predicting failures with the use of network metrics and syslog data. We used this as input for our analysis and to fine-tune our analysis process.

First, we classified the line-card failure events based on their cause of failure. Next, we identified metrics to analyze these events. We used this as input to analyze the individual failures. Finally, these findings were validated. The process is detailed below.

4.1 Classification

We categorized the known past outages mentioned above based on the information in SURFnet's ticketing system. Specifically, we focused on line-card related issues, since they have a high impact and seem to have happened a number of times in the past few years. The line-card related outage events are grouped according to the presumed cause of the outage as described in each trouble ticket. We found that the outages fitted into one of two categories: spontaneous line-card reboots or permanent line-card failures. As we explain in our findings, we later identified an additional class: unusual loss of throughput. This resulted in a total of three classes our findings could be divided into:

• Spontaneous line-card reboots;
• Permanent line-card failures;
• Unusual loss of throughput.

4.1.1 Spontaneous line-card reboots

Spontaneous line-card reboots are faults that happen intermittently and without a clear root cause. When the line-card fails and reboots, links connected to this line-card experience loss of connectivity for that period. After a short moment, connectivity is restored. Listing 1 shows a typical Jira trouble ticket.

Subject: Spontaneous reboot LM5 Gv008a_5410_01
Status: Resolved
Type: Incident
Origin: Monitoring
Impact: P3

Description:
|5/1|CR link to Gv001a_5410_01 port|Up|0h34m57s|10G/FD|
|5/2|CR link to Gv002a_3930_01 port|Up|0h34m57s|10G/FD|
|5/3|CR link to Gv001a_5142_03 port|Up|0h34m54s|10G/FD|
|5/4|AVAILABLE|Down|0h0m0s|N/A |

Listing 1: Example of a spontaneous reboot trouble ticket (translated from Dutch) (Jira)

In all seven of the reported cases in Jira, a Ciena 5410 switch was affected. According to Jira, no other switch models have had line-card reboots without an apparent reason.

4.1.2 Permanent line-card failures

The permanent line-card malfunctions occurred on the core-routers. These are faults where the line-cards drop network traffic and the connected links lose connectivity, but oddly enough, customers with a redundant connection do not fail over to their backup path. Normally, a link failure triggers a fail-over mechanism that reroutes the network traffic. This fail-over detects that traffic is being dropped, which is a result of a broken path.

However, in this scenario traffic was being malformed and dropped without the fail-over mechanisms detecting this, defeating the fail-over implementation. Connectivity was only restored when the line-card was physically shut down or disabled and the fail-over mechanisms took over. These events are logged in Jira and were evaluated by network operators; the evaluation forms are included in Appendix A and B.

Permanent line-card failures with these characteristics only occurred on the two Juniper T4000 core-routers.

4.1.3 Unusual loss of throughput

Events related to unusual loss of throughput are not, directly or indirectly, mentioned in the Jira ticketing system. These types of events have been pinpointed by identifying and analyzing a handful of syslog events that were repeated many times without an apparent reason. Looking at the network performance as depicted in Figure 14 in Section 5.3, we can see a significant loss of throughput for a short period, usually followed by a small rebound spike, seemingly compensating for the short loss of throughput.

4.2 Process

We based our analysis on an iterative process of identification, analysis and validation. This process is depicted in Figure 2.

Figure 2: Event analysis process

The failure events we collected during our initial analysis of the Jira ticketing data and the evaluation forms were used as process input.

4.2.1 Metrics identification

On a per-event basis, we collected the device syslog messages and identified metrics that are related to the event. For example, when dealing with a line-card failure, we collected the metrics of all network interfaces of that line-card. The metrics and syslog messages that are analyzed per failure type are outlined in Section 4.3.

4.2.2 Event analysis

Using the collected data, we dove into the details of each failure event. With the use of Grafana, which in turn queries InfluxDB to visualize metrics, we analyzed the previously selected metrics, selecting a time range that includes the moments before the event, the event itself and the moments after the event. Likewise, each syslog message within the aforementioned time frame was analyzed.

Splunk was used to analyze the syslog messages both qualitatively and quantitatively. Using qualitative analysis, we tried to understand the cause and the meaning of the syslog messages, including the relationships between messages. By analyzing the syslog messages quantitatively, we were able to visualize the frequency of the messages over time.
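As a small illustration of the quantitative side, the sketch below counts how often a given message occurs per hour, roughly what a Splunk time chart shows; the events and the filter string are invented examples.

from collections import Counter
from datetime import datetime

def counts_per_hour(events, pattern):
    """events: iterable of (timestamp, message); returns {hour: count} for matching messages."""
    buckets = Counter()
    for ts, message in events:
        if pattern in message:
            buckets[ts.replace(minute=0, second=0, microsecond=0)] += 1
    return dict(sorted(buckets.items()))

events = [
    (datetime(2017, 9, 13, 10, 5), "fpc6 XMCHIP(1): FI: Packet CRC error - Stream 6"),
    (datetime(2017, 9, 13, 10, 40), "fpc6 XMCHIP(1): FI: Packet CRC error - Stream 6"),
    (datetime(2017, 9, 13, 12, 1), "PORT-4-STATE_CHANGE: port(port 5/1): :Port Link up"),
]
print(counts_per_hour(events, "Packet CRC error"))  # {datetime(2017, 9, 13, 10, 0): 2}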

The technical manuals from the device manufacturers were used as input during the qualitative analysis of syslog messages. We based our understanding of the device messages on these documents.

4.2.3 Validation

We validated our results from the event analysis. Where possible, we tried to simulate the event by reproducing a similar setting using a test bed with the same model of network devices. Furthermore, if available, root cause analysis documents were used to verify our findings. Finally, the findings were validated by discussing them with the Team Expert Networkmanagement (SURFnet TEN). They have in-depth knowledge about the SURFnet network and the events that have happened. Their insights into our findings provide feedback which we use in our process iteration.

4.3 Identification

In this section, we describe per type of failure what data we have collected from which data source.

4.3.1 Spontaneous line-card reboots

Using the Jira ticketing system and our classification model, we created a list of these issues, containing the time and date of the event, the device name and the impacted line-card. This resulted in a list of 7 distinct reported cases in the past 2 years, shown in Table 2.

Most interestingly, in all of the reported cases the Ciena 5410 switches were affected. No other networking device was subject to rebooting line-cards without any apparent reason. In a handful of other cases devices would suddenly reboot, but root cause analysis showed that in these cases the power feed was to blame.

In two of the seven cases, the same line-card was affected only two weeks apart. In two other cases, the same chassis was affected twice in an eight-month period, but two individual line-cards rebooted.

Date        Time   Hostname         Line-card
22-02-2018  09:59  Gv008a_5410_01   LM5
06-01-2018  13:55  Asd002A_5410_01  LM6
07-08-2017  19:07  Dt001B_5410_01   LM4
24-07-2017  01:58  Dt001B_5410_01   LM4
14-11-2016  11:19  Gn001A_5410_01   LM1
18-08-2016  21:14  Asd001A_5410_01  LM8
12-12-2015  13:37  Asd001A_5410_01  LM7

Table 2: 7 reported cases of spontaneous rebooting line-cards in Ciena 5410 switches

In syslog, spontaneous reboots present themselves as shown in Listing 2.

Feb 22 09:59:39 active.5410-01.gv008a.dcn.surf.net [Local] 145.145.93.57 00:03:18:9a:1c:5f Gv008A_5410_01 MODULE-6-THERMAL_STATE_CHANGE: module(1-A-LM5): :Thermal state changed from NORMAL to unknown

Feb 22 10:01:02 remote.5410-01.Gv008A.dcn.surf.net [Local] 145.145.91.8 00:03:18:9a:1c:5f Gv008A_5410_01 MODULE-6-OPER_STATE_UPDATE: module(1-A-LM5): :Module operational state changed to Init

Listing 2: Rebooting line-card in Ciena 5410 (Splunk)

4.3.2 Permanent line-card failures

As pointed out by the evaluation forms, syslog messages were generated prior to the failure of the line-cards; these messages are listed in Listing 3.

Sep 11 12:05:04 re1.JNR01.Asd001A.dcn.surf.net 1 2017-09-11T10:05:04.139Z re1-bor4000 - - - - fpc6 XMCHIP(1): FI: Packet CRC error - Stream 6, Count 1

Listing 3: Syslog messages identifying line-card failure (Splunk)


According to the vendor[17], this message indicates that a packet with an incorrect CRC value is about to be sent through the switch fabric. Cyclic Redundancy Checks (CRCs) are used as an error-detecting code to verify that a packet has not been corrupted in transit.

The syslog Packet CRC Error messages and the Interface Input Error metrics stored in InfluxDB are related across different devices. If core-router Bor generates Packet CRC Error syslog messages, then the other core-router Truus will receive packets with an incorrect CRC, and vice versa, thereby increasing the Interface Input Error metric count. Figure 7 in Section 5.2 depicts the occurrence of these errors.
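A hedged sketch of this cross-device check is shown below: for each Packet CRC Error logged by one core-router, it looks for an increase of the peer's input-error counter within a tolerance window. The window length and the sample data are assumptions; the real data lives in Splunk and InfluxDB.

from datetime import datetime, timedelta

def correlate(crc_events, input_error_samples, window=timedelta(minutes=10)):
    """crc_events: [datetime]; input_error_samples: [(datetime, counter_value)] sorted by time."""
    matches = []
    for ts in crc_events:
        for (t0, v0), (t1, v1) in zip(input_error_samples, input_error_samples[1:]):
            if ts <= t1 <= ts + window and v1 > v0:  # peer counter rose shortly after the log line
                matches.append((ts, t1, v1 - v0))
                break
    return matches

crc_on_bor = [datetime(2017, 9, 11, 12, 5)]
errors_on_truus = [(datetime(2017, 9, 11, 12, 0), 0), (datetime(2017, 9, 11, 12, 10), 1800)]
print(correlate(crc_on_bor, errors_on_truus))  # one match: counter increased by 1800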

4.3.3 Unusual loss of throughput

During the course of this research we found significant and infrequent drops in the total throughput of multiple Ciena 5410 switches. The total throughput is a derived metric, consisting of the sum of the inbound traffic for all interfaces, plotted against the sum of the outbound traffic for all interfaces over time. This can be plotted either on a per line-card basis or for the whole switch chassis. The time stamps of the anomalies in the metrics relate to the time stamps of the syslog message listed in Listing 4.

2018 May 24 09:50:33 active.5410-01.Asd001A.dcn.surf.net DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD: chassis(1): :Flood Containment Threshold Event Container LIMIT_2 on l2-ucast EXCEEDED

Listing 4: Flood threshold exceeded (Splunk)

For these types of events, we have used the per-interface throughput metrics and the syslog events.
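The derived total-throughput metric is a plain aggregation; the sketch below sums per-interface bit rates per timestamp, either for one line-card or for the whole chassis. The sample series are invented numbers.

from collections import defaultdict

def total_throughput(per_interface):
    """per_interface: {ifname: [(timestamp, bits_per_second), ...]} -> [(timestamp, total_bps)]."""
    totals = defaultdict(float)
    for samples in per_interface.values():
        for ts, bps in samples:
            totals[ts] += bps
    return sorted(totals.items())

line_card_5 = {
    "5/1": [(0, 3.1e9), (300, 2.9e9)],  # five-minute samples, bits per second
    "5/2": [(0, 1.4e9), (300, 0.2e9)],  # sudden drop on this port
}
print(total_throughput(line_card_5))    # totals per timestamp: ~4.5e9 and ~3.1e9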

5 Findings

5.1 Spontaneous reboots

The data we have identified in Section 4.3.1 is used to analyze the cause of the spontaneous reboots. The syslog messages provide information about the state of the switch prior to the event.

In each of the documented cases, we observed that just before the line-card crashes, two Connectivity Fault Management (CFM) control messages are logged. Connectivity Fault Management is specified in the IEEE 802.1ag standard[18]. The protocol includes a Continuity Check Protocol that provides a means to detect connectivity failures in an Ethernet service[19].

Failing CFM Tunnel

The first control message is displayed in Listing 5 and is immediately followed by the second message, displayed in Listing 6.

Feb 22 09:59:20 active.5410-01.gv008a.dcn.surf.net [Local] 145.145.93.57 00:03:18:9a:1c:5f Gv008A_5410_01 CFM-1-CFM_SERVICE_FAULT_SET: chassis(1): :cfm service fault state changed to RDI for service PBT-000002252-3474 index 4

Listing 5: Ciena 5410 RDI Fault state changed (Splunk)

The message in Listing 5 indicates that one or more remote MEPs have detected service faults[20]. The normal process of CFM RDI messages is shown in Figure 3.


Figure 3: CFM process - detecting a remote defect (Ciena[20])

Feb 22 09:59:20 active.5410-01.gv008a.dcn.surf.net [Local] 145.145.93.57 00:03:18:9a:1c:5f Gv008A_5410_01 CFM-3-CFM_R_MEP_FAULT_SET: chassis(1): :cfm fault status set to RDI for remote mep id 5, service PBT-000002252-3474

Listing 6: Ciena 5410 RDI Fault set (Splunk)

The message in Listing 6 indicates that a CCM heartbeat has not been received from a remote MEP for a certain period[20].

Line-card initialization

Shortly after these two messages, the line-card triggers a thermal warning (Listing 2), indicating that the thermal state of the line-card switched to unknown. A few seconds later, the switch logs that it tries to (re)initialize the card and the interfaces, as shown in Listing 7. After this process completes, the line module returns to its normal operational state.

Feb 22 10:01:02 remote.5410-01.Gv008A.dcn.surf.net [Local] 145.145.91.8 00:03:18:9a:1c:5f Gv008A_5410_01 MODULE-6-OPER_STATE_UPDATE: module(1-A-LM5): :Module operational state changed to Init

Feb 22 10:02:04 remote.5410-01.Gv008A.dcn.surf.net [Local] 145.145.91.8 00:03:18:9a:1c:5f Gv008A_5410_01 MODULE-6-OPER_STATE_UPDATE: module(1-A-LM5): :Module operational state changed to Enabled

Feb 22 10:02:06 remote.5410-01.Gv008A.dcn.surf.net [Local] 145.145.91.8 00:03:18:9a:1c:5f Gv008A_5410_01 XCVR-6-OPTIC_INSERTED: xcvr(port 5/1): :Pluggable transceiver inserted

Feb 22 10:02:06 remote.5410-01.Gv008A.dcn.surf.net [Local] 145.145.91.8 00:03:18:9a:1c:5f Gv008A_5410_01 PORT-4-STATE_CHANGE: port(port 5/1): :Port Link up

Listing 7: Initializing Line-card (Splunk)

The thermal state warning is triggered because the switch is not able to probe the line-card for a thermal reading during the period between the line-card failure and the initialization process. Once the line-card becomes operational, it reports the correct thermal reading. Listing 8 displays these thermal messages.

Feb 22 10:01:03 remote.5410-01.Gv008A.dcn.surf.net [Local] 145.145.91.8 00:03:18:9a:1c:5f Gv008A_5410_01 MODULE-6-THERMAL_STATE_CHANGE: module(1-A-LM5): :Thermal state changed from unknown to COOL

Feb 22 10:02:26 remote.5410-01.Gv008A.dcn.surf.net [Local] 145.145.91.8 00:03:18:9a:1c:5f Gv008A_5410_01 MODULE-6-THERMAL_STATE_CHANGE: module(1-A-LM5): :Thermal state changed from COOL to NORMAL

Listing 8: Thermal states during the initialization of the line-cards in the Ciena 5410 switches (Splunk)

Given these insights and the order in which these messages arrive, it would seem most logical that a tunnel defect at a remote switch causes these messages to be triggered.

Yet, this is not the case. The interconnected switches do not provide any indication that the tunnel is defective and show no messages prior to the failure on the other switch.

Messages on connected switch

The switches that are interconnected to the affected Ciena 5410 line-card show no messages prior to the crash of the line-card. As soon as the line-card on the Ciena 5410 crashes, the interconnected switches start logging that the interfaces and tunnels towards the affected Ciena 5410 switch are going down. These messages are displayed in Listing 9.

Feb 22 09:59:20 remote.5142-03.gv001a.dcn.surf.net [local] 145.145.91.112 2c:39:c1:e8:70:20 Gv001A_5142_03 VCTUNNELING-1-TUNNEL_DOWN: chassis(1): :Tunnel 1B4-005UL0 down

Feb 22 09:59:20 remote.5142-03.gv001a.dcn.surf.net [local] 145.145.91.112 2c:39:c1:e8:70:20 Gv001A_5142_03 VCTUNNELING-1-TUNNEL_ACTIVE_STATE_CHANGED: chassis(1): :Tunnel 1B4-005UL0 active state changed to inactive, b-vid 3472

Feb 22 09:59:20 remote.5142-03.gv001a.dcn.surf.net [local] 145.145.91.112 2c:39:c1:e8:70:20 Gv001A_5142_03 PORT-4-STATE_CHANGE: port(port AGG1): :Port Link down

Listing 9: Syslog messages on switch connected to a failing Ciena 5410 line-card (Splunk)

After the line-card resumes normal operation, the connected switches log that the interfaces and tunnels are up again.

Network metrics

In all the cases we came across, we identified this exact same pattern of two CFM RDI error messages, thermal state warnings and (re)initialization of the card.

In sharp contrast to this clear pattern of anomalies in the syslog messages, we were not able to find a similar pattern in the network metrics of the switch. We would have expected to identify a clear loss of overall throughput during the period that the line-card was rebooting. These expectations proved to be wrong. Both the per-interface metrics and the metrics for the chassis showed no clear extremes in the graphs. The total throughput for the interfaces on that line-card together shows only slightly less traffic, as illustrated in Figure 4.

Figure 4: Switch Gv008a_5410_01 Line Module 5 - Total throughput (InfluxDB)

The total time between the failure of the line-card and its return to normal operation is between two and three minutes.

The class of line-card failures we have analyzed above only occurred on one specific model of switch; only the Ciena 5410 switches are affected. No other switches in the SURFnet network exhibit this specific set of events.

Using the pattern we have identified, we were able to find 73 cases in the last year. Figure 5 quantifies these events over time. Interestingly, most of these cases appear to have gone unnoticed, since Jira only included 7 cases, while we have identified tens of other cases.

Figure 5: Number of cases of spontaneous line-card reboots plotted over time (Splunk)
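A rough sketch of the signature search that such a count could be based on is given below: a CFM RDI fault followed within a short window by a thermal-state change to unknown and a module init on the same switch. The window length and field layout are assumptions, not the exact Splunk search used in this research.

from datetime import timedelta

def find_spontaneous_reboots(events, window=timedelta(minutes=5)):
    """events: time-sorted list of (timestamp, host, message); returns (host, timestamp) hits."""
    hits = []
    for i, (ts, host, msg) in enumerate(events):
        if "CFM_R_MEP_FAULT_SET" not in msg:
            continue
        # collect subsequent messages from the same switch inside the window
        later = [m for t, h, m in events[i + 1:] if h == host and t - ts <= window]
        if any("THERMAL_STATE_CHANGE" in m and "to unknown" in m for m in later) and \
           any("OPER_STATE_UPDATE" in m and "Init" in m for m in later):
            hits.append((host, ts))
    return hits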


5.2 Line-card errors on core-routers

As mentioned in Section 4, we started with the analysis of the generated syslog messages, followed by the metrics stored in InfluxDB. Next, we analyzed the syslog messages and the metrics by comparing when they occur. Findings were validated by checking the evaluation forms, which contain a timeline of the events.

Data overview

The Packet CRC Error messages mentioned earlier are syslog messages stored in Splunk; these messages form the starting point for the research in this section. This section attempts to link these messages to device metrics and vice versa.

The Interface Input Errors mentioned in this section are device metrics stored in InfluxDB. We will show the relation between the messages generated by one core-router and the metrics gathered from the other core-router.

5.2.1 Failure at Bor on 14th of September 2017

The fault was noticed on the 14th of September at 09:30 and identified as a faulty line-card at 10:00. Since the interfaces attached to the line-card dropped traffic but did not trigger the fail-over features, the interfaces on the line-card were manually shut down, triggering fail-over. Notable entries in the ticketing system are listed in Table 3.

Date        Time   Event
14-09-2017  09:30  Resilience loss noticed
14-09-2017  10:00  Problem identified
14-09-2017  10:30  Interfaces manually brought down
14-09-2017  17:30  Line card reset

Table 3: Notable events mentioned in the ticketing system

Syslog messages

The entries in Table 3 are compared to the occurrences in syslog. We plotted the number of Packet CRC Error syslog messages over a timeline to visualize the occurrences.

Figure 6: Number of Packet CRC Error messages (Splunk)

As can be seen in Figure 6, the first Packet CRC Error, which occurred on September 11th, went unnoticed. Surprisingly, most syslog messages occurred on the 13th of September, while connectivity remained unaffected according to the ticketing system. This can also be observed in the throughput metrics, where no apparent drops can be seen. Next, the metrics stored in InfluxDB are analyzed; they are depicted in Figure 7 and Figure 8.

Device metrics

Because packets arrive at the switch fabric of Bor with an incorrect CRC, we expect to see input errors on the other directly connected core-router, Truus. The interface input errors are stored as a counter value in InfluxDB; the counter value represents the number of packets received on that interface with an incorrect CRC. We plotted this information using Grafana, as depicted in Figure 7.

Figure 7: Interface input errors on Truus during the same time frame (InfluxDB)

As can be seen in Figure 7, two major spikes of interface input errors occurred during this time frame on one of the interfaces connecting Bor and Truus. No input errors occurred on any of the other active interfaces within this time frame. This also shows that although Packet CRC Error messages were generated, they do not increase the Interface Input Error counter on Bor itself.
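Because the input errors are stored as a cumulative counter, spikes become visible only after taking per-interval differences; the sketch below does this (comparable to a non-negative difference in InfluxDB) and flags intervals above an illustrative threshold.

def counter_deltas(samples):
    """samples: [(timestamp, counter)] -> [(timestamp, increase)], skipping counter resets."""
    deltas = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if c1 >= c0:                      # a decrease would indicate a counter reset
            deltas.append((t1, c1 - c0))
    return deltas

def spikes(deltas, threshold=1000):
    """Flag polling intervals whose increase exceeds an (illustrative) threshold."""
    return [(ts, d) for ts, d in deltas if d >= threshold]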

First, we zoom in on the first spike, which is depicted in Figure 8.


Figure 8: Graph of first spike in interface input errors on Truus on 12th of September (InfluxDB)

In Figure 8, each data point is clearly visible. Over a period of thirty minutes, there are 6 data points. This is due to the SNMP polling interval as described in the SURFnet monitoring architecture[5] and in Section 2.3.1.

Due to this polling resolution, we are unable to plot the metrics in a more granular manner.

Data comparison

The syslog messages depicted in Figure 6 on September 11th could indicate that connected network devices will receive packets with an incorrect CRC. The specific events are listed in Table 4.

Date        Time      Event
11-09-2017  10:13:25  Packet CRC Error
11-09-2017  10:13:27  Packet CRC Error
11-09-2017  12:05:00  Packet CRC Error

Table 4: Packet CRC messages with timestamp

Both the syslog messages and the gathered metrics occurred within the same time frame; the significant number of Interface Input Errors at core-router Truus gives an indication of a line-card failure at core-router Bor.

The second spike of interface input errors, which occurred on September 13th, is depicted in Figure 9.


Figure 9: Graph of second spike of interface input errors on Truus on 13th of September (InfluxDB)

Figure 9 shows the same limited data resolution of one data point per five minutes within a time frame of thirty minutes as Figure 8. Grafana plots the values using just two non-zero data points stored in InfluxDB.

The number of incorrect CRC messages over time, before the interface input errors start to happen, is depicted in Figure 10.

Figure 10: Syslog messages prior to interface input errors (Splunk)

As can be seen in Figure 10, core-router Bor generates syslog messages prior to the interface input errors seen in Figure 9. As with the first spike of interface input errors, the syslog messages do not occur during the receipt of the faulty packets. Due to the SNMP polling interval we are unable to determine a more detailed trend in the interface input error metrics. Based on this event, it seems that the Packet CRC Error messages generated by router Bor are an omen of line-card failure. An increase of Packet CRC Error messages seems to indicate that other routers will receive packets with an incorrect CRC. Since another line-card failure occurred, we will compare this event with it in Section 5.2.2.


5.2.2 Failure at Truus on 6th of March 2018

As with the results described in Section 5.2.1, we take the syslog message mentioned in Listing 3 as a starting point. We have taken the number of Packet CRC Error syslog message occurrences and plotted them in Figure 11; all of those messages were generated by the core-router Truus. First we will analyze the number of generated syslog messages, and secondly the interface input error metrics will be analyzed and compared to the syslog messages.

Notable events in the ticketing system are listed in Table 5. It seems that the first errors had no impact on connectivity, since no issues were reported by connected customers.

Date        Time   Event
06-03-2018  08:15  First messages in syslog
06-03-2018  17:30  Starts dropping traffic
06-03-2018  17:48  Network issues reported by customers
06-03-2018  18:45  Issues diagnosed, failing line-card identified
06-03-2018  19:00  Failing line-card shut down manually, initiating fail-over and restoring connectivity

Table 5: Notable events mentioned in the ticketing system

Syslog messages

Figure 11: Number of Packet CRC Error messages (Splunk)

Shown in Figure 11 is the number of Packet CRC Error syslog messages generated by Truus. Connectivity loss was reported at 17:48; therefore, the Packet CRC Error messages could be seen as an omen of the failing of the line-card. According to the ticketing system, the line-card was switched off at 19:00, which initiated the fail-over connection and restored connectivity. Shutting down the faulty line-card had an immediate impact on the number of Packet CRC Error syslog messages being generated, which is depicted in Figure 12.


Figure 12: Impact on generated Packet CRC Error (Splunk)

Figure 12 shows the drop in generated messages after the faulty line-card was disabled. This corresponds with the events logged in the ticketing system listed in Table 5. A short recurrence can be seen at 21:30; we were unable to link those messages to an event given the available data, because no mention was made in the ticketing system and no customers reported connectivity loss. Nonetheless, connectivity could have been impacted without being noticed.

Next, the device metrics are plotted, after which we attempt to relate them to the syslog messages.

Device metrics

Figure 13: Interface input errors occurring on Bor (InfluxDB)

As depicted in Figure 13, the interface input errors on the Bor interfaces connected to Truus reach up to 200,000 faulty packets within a timespan of five minutes. This number of errors is clearly abnormal, since the number of errors normally remains well within double digits.

17

Page 19: Predicting intermittent network device failures based on ... · feeds our datasets with real-time data from the SURFnet network. 2.3.1 InfluxDB Device and link-state information is

Data comparison

It is likely that there is a relation between interface input errors and Packet CRC Error syslog messages when a line-card is about to fail or has failed. The advent of a significant number of syslog messages coincides with an immense rise in the Interface Input Error metrics, which only occurred when a line-card failed or malfunctioned.

5.2.3 Analysis

Both line-card events show Packet CRC Error syslog messages before connectivity was impacted and noticed. Furthermore, routers with a failing line-card transmit packets with an incorrect CRC. The impact can be seen by analyzing the Interface Input Error metrics on the other routers, linking the syslog messages and the device metrics. In the evaluation forms, the timeline states when the line-cards were shut down, reset or replaced. Upon these measures, the syslog events ceased and connectivity was restored.

5.3 Spontaneous throughput loss

During the research we found a peculiar type of event where the throughput of a switch suddenly drops significantly, as depicted in Figure 14.

Figure 14: Throughput loss input and output (InfluxDB)

The metrics shown in Figure 14 have been correlated in time to the syslog messages depicted in Listing 10. These syslog messages indicate that the switch exceeds a predefined flooding threshold. This warning is triggered if the number of frames being flooded exceeds a set threshold.

2018 May 24 09:50:33 active.5410-01.Asd001A.dcn.surf.net DATAPLANE-4-FLOOD_CONTAINMENT_THRESHOLD: chassis(1): Flood Containment Threshold Event Container LIMIT_2 on l2-ucast EXCEEDED

Listing 10: Flood threshold exceeded (Splunk)

Using a MAC-table, a switch keeps state about what MAC-addresses are behind an interface. As it learns new MAC-addresses, the table is populated and a countdown timer is started.

When the switch receives a frame, it updates the MAC-table, noting the interface of the source MAC-address, and the timer for that MAC-address is reset. It looks up the interface the destination MAC-address belongs to and sends the frame out of that interface. After the timer times out, the MAC-address is removed from the MAC-table to ensure the switch only keeps the current mapping of MAC-address to interface. If the switch receives a frame with an unknown destination MAC-address, it floods the frame to all interfaces until it learns behind which interface that MAC-address is located. A minimal model of this behavior is sketched below.
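The sketch below is a minimal model of this learning, aging and flushing behaviour; real switches implement it in hardware, so it is only meant to make the flooding condition concrete.

import time

class MacTable:
    def __init__(self, aging_seconds=300):
        self.aging = aging_seconds
        self.entries = {}                              # mac -> (interface, last_seen)

    def learn(self, src_mac, interface):
        self.entries[src_mac] = (interface, time.time())   # (re)start the aging timer

    def lookup(self, dst_mac):
        entry = self.entries.get(dst_mac)
        if entry is None or time.time() - entry[1] > self.aging:
            self.entries.pop(dst_mac, None)
            return None                                # unknown or aged out -> flood to all ports
        return entry[0]

    def flush(self):
        self.entries.clear()                           # what a negative link-state change triggers

table = MacTable()
table.learn("00-01-00-00-06-68", "4/3")
print(table.lookup("00-01-00-00-06-68"))               # "4/3": forward out of one port
table.flush()
print(table.lookup("00-01-00-00-06-68"))               # None: unknown unicast, frame is flooded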

There are only a handful of scenarios in which a switch will flood a frame to all interfaces. One case is when the switch receives unknown unicast frames. Another would be broadcast frames, but the metrics suggest that this was not the case, since there was no significant change in the broadcast traffic.

Looking back at the total throughput metrics in Figure 14, we see that the throughput is severely impacted.

After analyzing the documentation of the Ciena 5410 switches and discussing these findings with the SURFnet TEN team, we found out that the internal flood channel of the switch, a dedicated backplane, is limited to a maximum throughput of 500 Mbit/s. This flood channel is shared between the chassis and all line-cards.

The only remaining logical explanation would be that somehow all unicast packets are being flooded and the internal flood channel is completely congested. For this to be true, the switch would need to lose its MAC-table (or parts of it).

After we identified numerous similar events, we noticed that a large portion of these events had one thing in common: just before the total throughput of the switch would decrease and the flood containment messages would be logged to syslog, one or more interfaces on that switch would go down. Using the information for these events in Jira, we found that in almost all cases where the interfaces went down, there was an explanation. Most of the time, the interface would go down because of work on the fiber-optic cable or maintenance on the other side of the link.

Combining these findings, we hypothesized that the switch loses its MAC-table each time an interface is shut down. Using an experiment, we set out to test our hypothesis.

Therefore, our hypothesis for this experiment is:

Negative interface state changes cause the switch to lose its MAC-table and the flood channel to be congested, resulting in limited throughput.

5.3.1 Experiment

We suspected that changes to an interface influence the MAC-table. Therefore, we executed an experiment where we sent one million protocol data units (PDUs) per second through a Ciena 5410 switch, while monitoring the number of received PDUs using a packet generator and a test bed provided by SURFnet. The size of each PDU is 1.25 kilobytes, which simplifies throughput measurement: one million packets per second equals ten gigabit per second of traffic. The links between the switches and the hosts were all 10 Gbit/s Ethernet fiber-optics. We have measured the following:

1. Transmitted PDUs
2. Received PDUs
3. Opposite traffic

The experimental setup is depicted in Figure 15. The transmitted PDUs (1) are the number of PDUs generated by the packet generator. These are sent towards the switch and measured at Host A. The received PDUs (2) are the number of PDUs which have traversed the switch and are received by Host B. Every 30 seconds, a PDU is returned in the opposite direction to keep the switch's MAC-table populated. The load figures are checked in the short calculation below.


Figure 15: Depiction of experimental setup
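A quick check of the stated load figures (payload size as given above; Ethernet framing overhead is ignored for simplicity):

pdu_size_bytes = 1250        # 1.25 kilobytes per PDU, as stated above
rate_pps = 1_000_000         # one million PDUs per second

throughput_bps = rate_pps * pdu_size_bytes * 8
print(throughput_bps / 1e9)  # 10.0 -> 10 Gbit/s, matching the line rate of the test links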

First, we conducted a baseline experiment to verify our setup and to compare our results against. This experiment is depicted in Figure 16.

Figure 16: Control experiment

During the experiment we triggered a couple of state changes by shutting down an unrelated interface. For the duration of the experiment, the MAC-table and throughput were monitored. The experiment consisted of four stages:

1. Start transmission
2. Introduce state change
3. PDU in opposite direction of traffic
4. Normal state restored

The results of the experiment are shown in Figure 17. The numbered arrows indicate the stages of the experiment.


Figure 17: Throughput measurement of Protocol Data Units while shutting down a remote interface

At the initial start of the experiment, indicated by arrow 1, all transmitted PDUs successfully traversed the switch. When we introduced a change in the link-state of any interface, indicated by arrow 2, the entire MAC-table was dropped. In this experiment, we shut down the interface on the remote side of the link. Listing 11 shows the MAC-table during the period between points 2 and 3. Therefore, if a link-state change were caused by a third party, the MAC-addresses would be dropped as well, impacting all traffic traversing the switch.

Asd001A_5410-01T*> mac-addr show vs pdb-os3
generating MAC table, please wait...
+---------------------------- MAC TABLE ----------------------------+
|                |                 |D| Logical Interface    | Oper  |
| Virtual Switch | MAC             |S| Type     | Name       | LPort |
+----------------+-----------------+-+----------+------------+-------+
| pdb-os3        |00-01-00-00-06-68|D| sub-port | pdb-os3_4/3| 4/3   |
|                                                                    |
| Entries: 1                                                         |
+--------------------------------------------------------------------+

Listing 11: State of the MAC-table during the period between points 2 and 3 of the experiment (Device CLI)

The induced state change caused the entire MAC-table to be dropped and thus the flood channel to become congested. Since this channel only has a maximum throughput of 500 Mbit/s, the total throughput is limited accordingly.

During the period between points 2 and 3, the device logs the same syslog messages as we had observed earlier in Splunk. The messages are displayed in Listing 12.

June 15, 2018 12:03:20.090 [local] Sev:6 port(port 4/4): :Port Link down, operSpeed, operDuplex, operPause, operAneg

June 15, 2018 12:03:21.950 [local] Sev:6 chassis(1): :Flood Containment Threshold Event Container LIMIT_2 on sub-port pdb-os3_4/3, l2-ucast EXCEEDED

June 15, 2018 12:03:37.925 [local] Sev:6 health(port 4/4): :Port Link: State has experienced a negative change in health Normal to Warning

June 15, 2018 12:03:45.046 [local] Sev:6 chassis(1): :Flood Containment Threshold Event Container LIMIT_2 on sub-port pdb-os3_4/3, l2-ucast normal

Listing 12: Syslog messages during the period between points 2 and 3 of the experiment (Device CLI)

Upon relearning the MAC-address, as indicated by point 3, normal operation was restored, as indicated by point 4.


This experiment also shows that the switch does not give any indication that its entire MAC-table has been dropped. Even when accessing the console and setting the highest debugging level of output, nothing directly indicates that the MAC-table has been dropped. Only the flood containment threshold message hints at the fact that the MAC-table was dropped.

We repeated the experiment, this time removing a service, that is, a virtual switch within the switch. Adding and removing services, and thus virtual switches, is an automated process within the SURFnet network. We wanted to validate whether this also affects the switch in any way. The results are shown in Figure 18.

Figure 18: Throughput measurement of Protocol Data Units while removing a virtual switch

6 Discussion

Using the experiment outlined in Section 5.3.1, we proved that negative interface state changes cause the switch to lose its MAC-table and the flood channel to be congested, resulting in limited throughput.

Although in our experiments we simulated an extreme case, we did prove that a negative state change to any interface causes the switch to drop its entire MAC-table. When the configuration or state changes, all traffic travels over the flood channel, triggering the threshold syslog message. Since the flood channel has a limited capacity, a significant drop can be seen in the total switch throughput. Therefore, using the experiments we also proved the correlation between the device metrics dataset and the syslog dataset. We confirmed that events from syslog can be related one-to-one to device metrics, and vice versa.

Under normal circumstances, there will already be traffic flowing between two interfaces. Dropping the MAC-table will have little to no effect on interfaces with a bidirectional data flow: the switch can repopulate the MAC-table near-instantly, resulting in little or no dropped traffic. This issue becomes more troublesome in cases where the traffic is more unidirectional.

Furthermore, the results indicate that external factors, such as faulty customer equipment, fiber cuts and flapping links, can severely impact the throughput of the entire switch, as any of those factors would cause the switch to drop its entire MAC-table.

While we found relations between data sources for the events mentioned in Section 4.2, we were not able to determine whether the current set of data is suitable for failure prediction. We found the verbosity of the data insufficient to give adequate information about events. A higher verbosity of the available data could be helpful, but could also prove to be a burden because it could result in additional noise.

Furthermore, we found that the datasets are currently unsuitable for machine learning. In order to use machine learning algorithms, metrics and device events must be carefully labeled[3]. Carefully labeling the dataset, or generating a new dataset that includes these labels, could provide input for machine learning algorithms. Yet, in comparison with the work of Zhang et al., the number of events SURFnet has collected is rather limited[15]. Nonetheless, if focused on frequent events, the current dataset could be used as a source for machine learning techniques.

Combining the three data sources, being Splunk, the network metrics and the ticketing data in Jira, proved to be very hard, since the data format differs between the sources and the sources contain numerous inconsistencies. In some cases, the timestamps between data sources were off. This makes any automated effort hard to carry out. To a certain extent, humans are capable of accounting for these errors; for algorithms this proves to be a real struggle.
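One way to make the combination at least partially automatic is a tolerant, nearest-neighbour join on the timestamps. The sketch below uses pandas' merge_asof for this; the two-minute tolerance, file names and column names are assumptions.

```python
# Minimal sketch: join syslog events to the nearest metric sample per device,
# despite clock offsets between the sources, using a bounded tolerance.
# The 2-minute tolerance, file names and column names are assumptions.
import pandas as pd

TOLERANCE = pd.Timedelta(minutes=2)

syslog = pd.read_csv("syslog_export.csv", parse_dates=["timestamp"]).sort_values("timestamp")
metrics = pd.read_csv("throughput_export.csv", parse_dates=["timestamp"]).sort_values("timestamp")

# For every syslog event, pick the nearest metric sample of the same host within
# the tolerance; events without a close-enough sample get NaN metric values.
joined = pd.merge_asof(
    syslog, metrics,
    on="timestamp", by="host",
    direction="nearest", tolerance=TOLERANCE,
)
print(joined.head())
```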

6.1 Future work

Labeling datasets

We could not use the data sources in an automated manner to predict failures, but we still see this as a fruitful area of research. Especially regarding the creation of a labeled dataset, a lot of progress could be made. Therefore, as future work we would put significant effort into labeling the dataset. Using such labeled datasets, machine-learning algorithms could prove useful in determining whether omens exist that predict network device failures. These omens could give network operators sufficient time to mitigate or prevent failures beforehand.

Automated labeling

Others have been developing automated systems that attempt to label new metrics and events automatically. As future work, research could be done into applying this approach to this specific situation. Setting aside the limited data quality in the Jira ticketing system for a moment, great potential lies in the use of Natural Language Processing to analyze the events stored in Jira, and to create labels and identify data that way.
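As a rough illustration of this direction, the sketch below clusters Jira ticket descriptions with TF-IDF features and k-means, so that the clusters can be inspected and turned into candidate labels by hand. The export file name, the 'description' field and the number of clusters are assumptions; a production approach would need proper text cleaning and validation.

```python
# Minimal sketch: group Jira ticket descriptions into clusters that could serve
# as candidate labels. Field names and the number of clusters are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

N_CLUSTERS = 8

tickets = pd.read_csv("jira_export.csv")          # assumed to contain a 'description' column
texts = tickets["description"].fillna("").tolist()

vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X = vectorizer.fit_transform(texts)

kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0)
tickets["candidate_label"] = kmeans.fit_predict(X)

# Inspect the most characteristic terms per cluster to name the labels manually.
terms = vectorizer.get_feature_names_out()
for cluster_id, center in enumerate(kmeans.cluster_centers_):
    top_terms = [terms[i] for i in center.argsort()[-5:][::-1]]
    print(cluster_id, top_terms)
```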

Trend Analysis

Lastly, we tried to implement a trend-analysis framework to identify anomalies in the network-metrics data. Due to constraints both in the Grafana and InfluxDB toolset and in time, we were not able to successfully implement such a framework.
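Outside the Grafana/InfluxDB toolset, a simple starting point for such trend analysis would be a rolling z-score over the throughput series, as in the sketch below; the window size, the threshold of three standard deviations and the file and column names are illustrative assumptions.

```python
# Minimal sketch: flag anomalies in a throughput series with a rolling z-score.
# Window size, threshold and file/column names are illustrative assumptions.
import pandas as pd

WINDOW = 24          # number of 5-minute samples, about two hours of history
Z_THRESHOLD = 3.0    # flag samples more than three standard deviations from the rolling mean

metrics = pd.read_csv("throughput_export.csv", parse_dates=["timestamp"])
series = metrics.set_index("timestamp").sort_index()["throughput_bps"]

rolling_mean = series.rolling(WINDOW).mean()
rolling_std = series.rolling(WINDOW).std()
z_scores = (series - rolling_mean) / rolling_std

anomalies = series[z_scores.abs() > Z_THRESHOLD]
print(anomalies)
```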

6.2 Recommendations

The dataset we were provided with proved to contain inconsistencies on multiple levels. In some cases, the timestamps between the two data sources would not match; therefore, we had to manually adjust for the timing offsets.

In other cases, the dataset was incomplete, because the data collector was unable to poll the devices for certain periods of time, ranging from a few minutes to a few hours.

In order to use the dataset in the future, we recommend that SURFnet check and make sure that their dataset is consistent, both between devices and between data sources. Furthermore, as mentioned in the Discussion, datasets should be carefully labeled to make them suitable for machine learning.
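A consistency check of this kind could be largely automated. The sketch below reports, per device, the periods in which no sample arrived within the expected five-minute SNMP polling interval; the file name, column names and jitter allowance are assumptions.

```python
# Minimal sketch of the recommended consistency check: report gaps where a
# device was not polled within the expected 5-minute SNMP interval.
# File name, column names and the jitter allowance are assumptions.
import pandas as pd

EXPECTED_INTERVAL = pd.Timedelta(minutes=5)
GRACE = pd.Timedelta(minutes=1)   # small allowance for polling jitter

metrics = pd.read_csv("throughput_export.csv", parse_dates=["timestamp"])

for host, group in metrics.sort_values("timestamp").groupby("host"):
    gaps = group["timestamp"].diff()
    missing = group[gaps > EXPECTED_INTERVAL + GRACE]
    for _, row in missing.iterrows():
        print(f"{host}: no samples for {gaps.loc[row.name]} before {row['timestamp']}")
```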


7 Conclusion

We set out to determine to what extent it is possible to predict intermittent network device failures based on data from multiple sources. Given the current set of data and event descriptions, we were unable to create a reliable model to predict network device failures.

However, we found metrics and syslog messages relevant to the failure of line-cards. We are capable of identifying three types of line-card failures using patterns between different data sources.

We found numerous instances where the total throughput of a Ciena 5410 switch was severely impacted. We posed a hypothesis to validate our belief. Not only did we prove our hypothesis, we also proved the correlation between the device syslog messages and the network metrics.

Furthermore, we also proved that some events will not generate any messages that could indicate that anything is happening, but we showed that these events can be retraced given the right device metrics. Also, the SNMP polling frequency of once every five minutes limits the data resolution. A more in-depth analysis would be possible with a higher polling frequency.

References

[1] Felix Salfner, Maren Lenk, and Miroslaw Malek. “A Survey of Online Failure Prediction Methods”. In: ACM Comput. Surv. 42.3 (Mar. 2010), 10:1–10:42. issn: 0360-0300. doi: 10.1145/1670679.1670680. url: https://dl.acm.org/citation.cfm?doid=1670679.1670680.

[2] R. Gerhards. The Syslog Protocol. RFC 5424. http://www.rfc-editor.org/rfc/rfc5424.txt. RFC Editor, Mar. 2009. url: http://www.rfc-editor.org/rfc/rfc5424.txt.

[3] S. Ayoubi et al. “Machine Learning for Cognitive Network Management”. In: IEEE Communications Magazine 56.1 (Jan. 2018), pp. 158–165. issn: 0163-6804. doi: 10.1109/MCOM.2018.1700560.

[4] R. Presuhn. Version 2 of the Protocol Operations for the Simple Network Management Protocol (SNMP). STD 62. http://www.rfc-editor.org/rfc/rfc3416.txt. RFC Editor, Dec. 2002. url: http://www.rfc-editor.org/rfc/rfc3416.txt.

[5] Next generation netwerkmonitoring. https://blog.surf.nl/next-generation-netwerkmonitoring-waar-kiest-surfnet-voor/. Accessed: 2018-06-19.

[6] Atlassian. Jira - Ticketing and Issue Tracking Software. July 2018. url: https://www.atlassian.com/software/jira.

[7] InfluxData. InfluxDB - Time Series Database Monitoring Analytics. July 2018. url: https://www.influxdata.com/.

[8] SNMPWALK. http://net-snmp.sourceforge.net/docs/man/snmpwalk.html. Accessed: 2018-07-12.

[9] Splunk. SIEM, AIOps, Application Management, Log Management, Machine Learning, and Compliance | Splunk. July 2018. url: https://splunk.com.

[10] Forbes. Gartner Magic Quadrant: Who’s Winning In The Data And Machine Learning Space. Feb. 2018. url: https://www.forbes.com/sites/ciocentral/2018/02/28/gartner-magic-quadrant-whos-winning-in-the-data-machine-learning-space/#4fc6807dab33.

[11] Matthew Roughan et al. “Combining Routing and Traffic Data for Detection of IP Forwarding Anomalies”. In: Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems. SIGMETRICS ’04/Performance ’04. New York, NY, USA: ACM, 2004, pp. 416–417. isbn: 1-58113-873-3. doi: 10.1145/1005686.1005745. url: http://doi.acm.org/10.1145/1005686.1005745.


[12] Giuseppe Aceto et al. “A comprehensive survey on internet outages”. In: Journal of Network and Computer Applications 113 (2018), pp. 36–63. issn: 1084-8045. doi: https://doi.org/10.1016/j.jnca.2018.03.026. url: http://www.sciencedirect.com/science/article/pii/S1084804518301139.

[13] J. C. Duenas et al. “Applying Event Stream Processing to Network Online Failure Prediction”. In: IEEE Communications Magazine 56.1 (Jan. 2018), pp. 166–170. issn: 0163-6804. doi: 10.1109/MCOM.2018.1601135.

[14] T. Chalermarrewong, T. Achalakul, and S. C. W. See. “Failure Prediction of Data Centers Using Time Series and Fault Tree Analysis”. In: 2012 IEEE 18th International Conference on Parallel and Distributed Systems. Dec. 2012, pp. 794–799. doi: 10.1109/ICPADS.2012.129.

[15] Shenglin Zhang et al. “PreFix: Switch Failure Prediction in Datacenter Networks”. In: Proc. ACM Meas. Anal. Comput. Syst. 2.1 (Apr. 2018), 2:1–2:29. issn: 2476-1249. doi: 10.1145/3179405. url: https://dl.acm.org/citation.cfm?doid=3203302.3179405.

[16] SURFnet protected/redundant fiber paths. https://www.surf.nl/diensten-en-producten/surflichtpaden/aansluitmogelijkheden-surflichtpaden/protected-redundante-lichtpaden/index.html. Accessed: 2018-07-14.

[17] Syslog message: CRC link error detected for FPC.* PFE.* fabric plane.*. https://kb.juniper.net/InfoCenter/index?page=content&id=KB19554&cat=AIS&actp=LIST. Accessed: 2018-06-21.

[18] N. Finn, D. Mohan, and A. Sajassi. 802.1ag - Connectivity Fault Management. STD. http://www.ieee802.org/1/pages/802.1ag.html. IEEE, Dec. 2007. url: http://www.ieee802.org/1/pages/802.1ag.html.

[19] Wikipedia. IEEE 802.1ag —Wikipedia, The Free Encyclopedia. http://en.wikipedia.org/w/index.php?title=IEEE%20802.1ag&oldid=672571099. [Online; accessed 15-July-2018]. 2018.

[20] Ciena. 5410 Service Aggregation Switch, Fault and Performance, SAOS 7.3. Docs. 009-3219-009 Rev B. Dec. 2007.


A Evaluation form 14-09-2017

Evaluation of major outage SURFinternet, 14 September 2017

Outage: The outage was reported to Quanza by telephone around 09:30. At that moment it had not yet been noticed by the NOC or TEN. It took a while (about half an hour) before it became clear that the outage was located on a specific card (FPC6) of Bor. The problem was that the card was not completely defective; as a result, the redundancy did not kick in. It only did once the interfaces on FPC6 were brought down. Nevertheless, several customers kept experiencing an outage afterwards. Presumably this was related to the other outages in Dwingeloo (fire) and Den Haag (fiber break). In addition, supplier CYSO kept using the primary (downed) path, which made them unreachable for the institutions; this was resolved with a reset at CYSO. It turned out that an alarm had already been raised on the card several days earlier; how it was missed is unclear. The alarm could only be cleared with a reboot of the card, which should therefore probably have been done in a maintenance window.

Date: 14 September 2017

Impact: Unreliable behaviour of Bor during the first hour, loss of resilience for the rest of the day for about 10 institutions. Possibly a larger impact on a few connections due to the other outages (fire in Dwingeloo, fiber cut in Den Haag). Traffic to supplier CYSO was disrupted for longer because CYSO did not switch over to its secondary connection.

Affected: SURFinternet

Cause: System failure on card FPC6 of Bor.

Additional problems: Simultaneous other outages and the consequences of the fire in Dwingeloo had not been taken into account. Unclarities in internal communication, staffing (who is available), coordination and problem detection. Many direct contacts with those involved because an overview of progress was missing. No impact analysis of the combination of outages (Bor, Dwingeloo, Den Haag). (SURFnet) engineers were focused on the solution, with less attention for the impact on customers.

Timeline: 09:30 telephone report of the outage to Quanza by a customer. 10:00 problem clear (defect on a card of Bor). 10:30 interfaces brought down, thereby forcing resilience to kick in. 17:30 reset of the card resolves the problem.

Improvement actions: • The list of affected institutions must be available sooner.


B Evaluation form 06-03-2018

Evaluation of major outage SURFinternet, 6 March 2018

Outage: Loss of connectivity due to a fault on a line-card of a core router.

Date: 6 March 2018

Impact: Five institutions lost connectivity, three customers lost resilience, and connectivity with Google, Facebook and the NL-IX was reduced.

Affected: Loss of connectivity: UU, Glaslokaal, Hogeschool Leiden, Gilde College, AOC Oost. Loss of resilience: Saxion, Aeres and Cito.

Cause: Defective line-card in core router Truus; traffic was dropped, but BGP sessions stayed up and so did internal routing traffic, so no failover occurred. As a solution the line-card was replaced.

Additional problems: During the day Truus was already showing error messages. A discussion with Telindus/Juniper about support on the core routers caused this to take longer. Part of SURFcert's tooling was also affected, which meant a complaint from an affected institution was initially interpreted as a DDoS.

Timeline: Around 08:15 Truus was already producing syslog messages about problems on all cards; this had no impact on traffic yet and could not be related to a specific component. A ticket was opened with Telindus; a discussion with Telindus about the maintenance status led Juniper to refuse to provide support. Because there was no user impact yet, this was not escalated. Telindus did monitor along and Quanza tested some of the components. Around 17:30 the card starts dropping traffic; the NOC did not notice this, and BFD detection did not help to detect the problem either. 17:48 first phone call to the NOC by an institution. 18:00 NOC calls in second-line support; UU responded later because they initially thought the problem was on their side; second-line support could only do limited remote troubleshooting. 18:10 Max (TEN) alerted. 18:45 second-line engineer (Quanza) diagnoses the failure on the correct card. 18:50 Max asks the NOC whether the major-outage procedure has been started. 19:00 failing card switched off; as a result the resilience connection kicks in and all institutions have connectivity again. 19:35 Lex is called very late by the NOC; Roland Staring did not receive a notification and neither did the communications team; Lex decides not to start the major-outage communication procedure. There was no communication because the major-outage procedure was started late by the NOC; by that time the impact had been reduced to loss of resilience.

Improvement actions: • Make the major-outage procedure better known at the NOC, in particular clarify what constitutes a major outage and which procedures SURFnet itself will start.
