Continuous fault containment and local stabilization in path-vector routing

April 19, 2023

Hongwei Zhang Anish Arora

Motivation

Study of fault containment has focused largely on cases where faults either stop occurring after a certain moment in time or occur with low frequency

In practice, faults may occur with high frequency, and the interval between faults may be shorter than the time taken for the system to stabilize

E.g., under the Code Red/Nimda attacks (2002), memory overflow causes edge BGP speakers to repeatedly fail-stop and rejoin at a frequency as high as once every minute

the oscillation propagates farther away, in spite of the MRAI timer and RFD

Objectives

Formulate concepts that characterize, and develop mechanisms that achieve, the following properties:

in the presence of high-frequency faults

the impact of faults is always locally contained

once faults stop occurring

the system stabilizes within time that is a function of the degree of fault perturbation

We study these issues in the context of path-vector routing

to simplify the presentation, we first present a solution for continuous fault containment and local stabilization in path-vector routing, then we present the generic concepts

Outline

Fault propagation in path-vector protocols

CPV: design pattern and protocol

Generic concepts for tolerating high-frequency faults

Analytical & simulation results for CPV

Concluding remarks


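[Figure: example network with nodes d, e, f, g, h, i and destination d. Routes to d: e uses [e, d]; f uses [f, e, d]; g uses [g, f, e, d]; h uses [h, g, f, e, d]; i uses [i, h, g, f, e, d]]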

Which nodes are unaffected? None: all are affected

the fresh information (route announcement) always lags behind the obsolete information (route withdrawal)

Design pattern of CPV

Key idea: to design a mechanism that

enables information regarding a new network state to catch up with and stop the propagation of the information regarding the preceding state (which has become obsolete)

works whether or not faults stop occurring

Parallel diffusing waves (with different propagation speeds)
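
The catch-up idea can be illustrated with a small calculation. The sketch below assumes a deliberately simplified timing model that is not taken from the slides: the wave carrying the obsolete state leaves its source at time 0 and advances one hop every `ds` seconds, while the faster wave chasing it leaves `head_start` seconds later and advances one hop every `dc` seconds. Link delays are ignored, and the function name `catch_up_hop` is hypothetical.

```python
import math


def catch_up_hop(head_start: float, ds: float, dc: float) -> int:
    """Return the first hop (>= 1) at which the faster wave has arrived
    no later than the slower wave it is chasing.

    Assumed model (not from the slides): the slower wave reaches hop k at
    time k * ds; the faster wave reaches hop k at time head_start + k * dc.
    """
    if head_start < 0:
        raise ValueError("head_start must be non-negative")
    if not (ds > dc >= 0):
        raise ValueError("the chasing wave must be faster: need ds > dc >= 0")
    # The faster wave has caught up once head_start + k * dc <= k * ds,
    # i.e. k >= head_start / (ds - dc).
    return max(1, math.ceil(head_start / (ds - dc)))


if __name__ == "__main__":
    # ds = 30 s and dc = 10 s are the CPV values quoted in the simulation
    # setup; the head starts are illustrative.
    for head_start in (0, 30, 60, 120):
        print(head_start, catch_up_hop(head_start, ds=30, dc=10))
```

Under this model the catch-up hop grows linearly with the head start, which is one way to read the claim that an obsolete state travels a distance proportional to how long it lasted.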

Outline of CPV

Whenever a node j needs to change state, it engages a containment wave cw0 before engaging a new stabilization wave sw1

so that cw0 stops the previous stabilization wave sw0 from propagating the existing state of j

In the presence of high-frequency faults, another fault f may occur before j executes sw1; then there are two cases

j does not need to change state any more: j engages an undo-containment wave uw0 to stop cw0

j still needs to change state: j lets cw0 propagate
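
The case analysis above can be written down as a small decision function. This is only a sketch of the control flow as described on this slide, not the protocol's actual guarded commands; the names `Action` and `react`, and the boolean `cw_pending` (meaning j has engaged a containment wave but has not yet executed the following stabilization wave), are hypothetical.

```python
from enum import Enum


class Action(Enum):
    START_CONTAINMENT = "engage a containment wave, then a new stabilization wave"
    UNDO_CONTAINMENT = "engage an undo-containment wave to stop the pending containment wave"
    LET_CONTAINMENT_RUN = "let the pending containment wave propagate"
    NONE = "nothing to do"


def react(needs_state_change: bool, cw_pending: bool) -> Action:
    """Decide what node j does when it re-evaluates its state.

    cw_pending is True if j already engaged a containment wave and has not
    yet executed the stabilization wave that should follow it.
    """
    if cw_pending:
        # Another fault arrived before the new stabilization wave ran.
        if needs_state_change:
            return Action.LET_CONTAINMENT_RUN
        return Action.UNDO_CONTAINMENT
    if needs_state_change:
        return Action.START_CONTAINMENT
    return Action.NONE


if __name__ == "__main__":
    for needs_change in (True, False):
        for pending in (True, False):
            print(needs_change, pending, react(needs_change, pending).name)
```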

A little more detail

Containment wave

piggybacks the expected next state of a node to its neighbors, so that a neighbor can decide whether to hold an existing SW

is a one-way diffusing process, by which CW can co-exist with the corresponding SW (which is required to contain continuously-occurring faults)

Stabilization wave

takes into account the predicted state when choosing the next hop

Undo-containment wave

does not introduce new variables

Protocol CPV

ds > α·(dc+U), dc > α·(du+U), du ≥ 0

containment wave
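
A quick way to sanity-check a parameter choice against the constraint above is sketched below. The transcript does not define α and U, so the function treats them as opaque non-negative parameters, and the values passed for them in the example are placeholders rather than values from the slides. The function name `satisfies_cpv_timing` is hypothetical.

```python
def satisfies_cpv_timing(ds: float, dc: float, du: float,
                         alpha: float, u: float) -> bool:
    """Check ds > alpha*(dc+U), dc > alpha*(du+U), du >= 0.

    alpha and u stand for the slide's α and U, which are not defined in
    this transcript; they are passed through unchanged.
    """
    return ds > alpha * (dc + u) and dc > alpha * (du + u) and du >= 0


if __name__ == "__main__":
    # ds, dc, du are the CPV values quoted in the simulation setup;
    # alpha=1.0 and u=5.0 are placeholder assumptions.
    print(satisfies_cpv_timing(ds=30, dc=10, du=1, alpha=1.0, u=5.0))
    # A larger alpha can violate the first inequality.
    print(satisfies_cpv_timing(ds=30, dc=10, du=1, alpha=2.0, u=5.0))
```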

Action SW (contd.)

a node not in CW does not execute SW, if the next-hop has executed CW

loop freedom

1) nodes not involved in any CW rank higher than those involved in a CW

2) consider the expected next route of a neighbor, if available via a CW

CPV (contd.): actions CW and UW

Note: we skip the actions for information synchronization between neighbors here

Example revisited

[Figure: the example network (nodes d, e, f, g, h, i) under CPV, showing waves CW1, SW1, CW2, SW2, and UW1]

Generic concepts

Objective:

to define concepts that capture the desired system properties in the presence of continuously-occurring faults

Key issue:

to differentiate the impact of faults from that of protocol actions

Concepts defined:

Perturbed vs. contaminated node

Perturbation size & contamination range

F-containment & F-stabilization

Preliminaries

A System History H is a sequence q.0, (e.1, t.1), q.1, (e.2, t.2), …, q.(k-1), (e.k, t.k), q.k, …, of alternating system states and events, where

an event is either the execution of a protocol action or the occurrence of a fault

each state transition “q.(k-1), (e.k, t.k), q.k” means that event e.k at time t.k changes the system state from q.(k-1) to q.k

at every moment in time, at most one event can occur at a node

Given a system history H and a state q.k in H, the history prefix H(q.k) = the subsequence of H that is between q.0 and q.k

A computation is a system history (or its suffix) where no fault occurs
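
These definitions map naturally onto a small data structure. The sketch below is one possible encoding, restricted to finite histories for simplicity (the definition above allows them to continue indefinitely); the class and field names `Event`, `History`, `kind`, `node`, and `time` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class Event:
    kind: str    # "action" (protocol action) or "fault"
    node: str    # node at which the event occurs
    time: float  # t.k


@dataclass
class History:
    states: List[Any]    # q.0, q.1, ..., q.n
    events: List[Event]  # events[k-1] is (e.k, t.k), taking q.(k-1) to q.k

    def __post_init__(self) -> None:
        if len(self.states) != len(self.events) + 1:
            raise ValueError("need exactly one more state than events")

    def prefix(self, k: int) -> "History":
        """H(q.k): the part of the history between q.0 and q.k."""
        if not 0 <= k < len(self.states):
            raise IndexError("no such state")
        return History(self.states[:k + 1], self.events[:k])

    def suffix(self, k: int) -> "History":
        """The part of the history from q.k onward."""
        if not 0 <= k < len(self.states):
            raise IndexError("no such state")
        return History(self.states[k:], self.events[k:])

    def is_computation(self) -> bool:
        """A computation is a history (or suffix) in which no fault occurs."""
        return all(e.kind != "fault" for e in self.events)


if __name__ == "__main__":
    h = History(
        states=["q0", "q1", "q2"],
        events=[Event("fault", "e", 1.0), Event("action", "f", 2.0)],
    )
    print(h.is_computation())            # the whole history contains a fault
    print(h.suffix(1).is_computation())  # the suffix after the fault does not
```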

Preliminaries (contd.)

Given a state q.k and H(q.k), a protocol execution E(q.k) is a set of computations

each of which specifies a computation C(q.k, E(q.k)) for a different state q.k’ in H(q.k) that is either the initial state or a state reached immediately after a fault occurs

Given q.k, E(q.k), the stabilization set of q.k, S(q.k, E(q.k)), is the set of nodes that need to change state for the system to stabilize from q.k in the absence of faults

Perturbation vs. contamination

Given “q.(k-1), (e, t), q.k” and E(q.k),

the corruption set of e at t

cpt(e, t, E(q.k)) = S(q.k, E(q.k)) \ S(q.(k-1), E(q.k))

if e is not a state corruption, the correction set of e at t

cct(e, t, E(q.k)) = (S(q.(k-1), E(q.k)) \ S(q.k, E(q.k))) ∩ V.(q.k)

For every node j ∈ cpt(e, t, E(q.k)),

j is perturbed by e if e is a fault

j is contaminated via e if e is the execution of a protocol action

For every node j ∈ cct(e, t, E(q.k)),

j is corrected by e
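
Since both sets are differences of stabilization sets, they can be computed directly once the stabilization sets before and after an event are known. The sketch below takes those sets as inputs rather than deriving them from a protocol execution. It reads the correction set as the nodes that left the stabilization set and also belong to V.(q.k), taken here to be the set of up nodes at q.k; that reading is an assumption, as are the function names.

```python
from typing import Dict, Set


def corruption_set(s_before: Set[str], s_after: Set[str]) -> Set[str]:
    """Nodes that newly need to change state after the event."""
    return s_after - s_before


def correction_set(s_before: Set[str], s_after: Set[str],
                   up_nodes: Set[str]) -> Set[str]:
    """Nodes that no longer need to change state and are still up.

    Only meaningful when the event is not a state corruption; the
    restriction to up_nodes is an assumed reading of V.(q.k).
    """
    return (s_before - s_after) & up_nodes


def classify(event_kind: str, s_before: Set[str], s_after: Set[str],
             up_nodes: Set[str]) -> Dict[str, str]:
    """Label each node touched by an event ("fault" or "action")."""
    labels: Dict[str, str] = {}
    for j in corruption_set(s_before, s_after):
        labels[j] = "perturbed" if event_kind == "fault" else "contaminated"
    for j in correction_set(s_before, s_after, up_nodes):
        labels[j] = "corrected"
    return labels


if __name__ == "__main__":
    # Illustrative sets, not taken from the slides.
    before, after, up = {"e", "f"}, {"f", "g"}, {"e", "f", "g"}
    print(classify("fault", before, after, up))
    print(classify("action", before, after, up))
```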

Perturbed vs. contaminated node

a perturbed node remains perturbed until it is corrected by a fault or the system reaches a legitimate state

a contaminated node remains contaminated until it is corrected by a fault or the execution of a protocol action

Example with existing path-vector protocol

[Figure: the example network (nodes d, e, f, g, h, i), with nodes marked as perturbed, corrected, or contaminated]

Perturbation size & contamination range

Given q.k, H(q.k), and E(q.k), the perturbation size at q.k, P(q.k, H(q.k), E(q.k)), is the number of perturbed nodes at q.k

The contamination range of a perturbed region S’ at q.k, R(S’, q.k), is the maximum hop-distance from the corresponding set of contaminated nodes to S’
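
The contamination range can be computed with a multi-source breadth-first search. The sketch below interprets the hop-distance from a contaminated node to S’ as the distance to the nearest node of S’, and takes the maximum over all contaminated nodes; that interpretation, the function names, and the chain topology in the example are assumptions.

```python
from collections import deque
from typing import Dict, Iterable, Set


def hop_distances(adj: Dict[str, Set[str]],
                  sources: Iterable[str]) -> Dict[str, int]:
    """Multi-source BFS: hops from each reachable node to its nearest source."""
    dist = {s: 0 for s in sources}
    queue = deque(dist)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def contamination_range(adj: Dict[str, Set[str]],
                        perturbed_region: Set[str],
                        contaminated: Set[str]) -> int:
    """Maximum hop-distance from a contaminated node to the perturbed region.

    Returns 0 when nothing is contaminated; raises KeyError if a
    contaminated node is not connected to the region.
    """
    dist = hop_distances(adj, perturbed_region)
    return max((dist[c] for c in contaminated), default=0)


if __name__ == "__main__":
    # Hypothetical chain d - e - f - g - h - i.
    chain = ["d", "e", "f", "g", "h", "i"]
    adj: Dict[str, Set[str]] = {n: set() for n in chain}
    for a, b in zip(chain, chain[1:]):
        adj[a].add(b)
        adj[b].add(a)
    print(contamination_range(adj, {"e"}, {"f", "g"}))
```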

F-containment & F-stabilization

A system is F-containing if and only if

for every perturbed region S’ at an arbitrary state q.k, R(S’, q.k) = O(F(|S’|)), where F is a function

A system is F-stabilizing if and only if

starting at an arbitrary state q.k with an arbitrary H(q.k) and E(q.k), the system computation is guaranteed to reach a legitimate state within O(F(P(q.k, H(q.k), E(q.k)))) time in the absence of faults, where F is a function

Analytical results

L = {q: every up node has found its best route at state q}

Properties of CPV

the contamination range R(S’, q.k) of every perturbed region S’ at any state q.k is O(|S’|)

the distance to which a state of a node i propagates is proportional to the time the state lasts

starting at any state q.k with an arbitrary H(q.k) and E(q.k), the system where CPV is used reaches a legitimate state within O(F(P(q.k, H(q.k), E(q.k)))) time in the absence of faults

F is a function reflecting the routing policies used, and is linear if every node chooses a shortest path

Simulation results

SSFNet, a network simulator with standard-conforming protocol implementations

Simulation setup

parameter setup for CPV and BGP

CPV: ds = 30 sec, dc = 10 sec, du = 1 sec

BGP: with MRAI timers (30 seconds) and RFD

Fault scenario

a node repeatedly fail-stops and then rejoins every 30 seconds

Internet-type network topology

the shortest-path-first policy

Contamination range and the number of nodes affected

Time taken to stabilize

Stability adaptiveness

Concluding remarks

Frequent transient faults do happen (especially when systems work under unexpected conditions)

fault containment and stabilization are desirable as well as possible

Quality of service and system behavior during stabilization

perspectives other than convergence only: time, space, stability, etc.

modeling issues: descriptive, derivative

Low-frequency faults

[Figure panels: destination joins; destination fail-stops]
