Continuous fault containment and local stabilization in path-vector routing
April 19, 2023
Hongwei Zhang, Anish Arora
Motivation
The study of fault containment has focused largely on cases where faults either stop occurring after a certain moment in time or occur with low frequency
In practice, faults may occur with high frequency, and the interval between faults may be shorter than the time taken for the system to stabilize
E.g., under the Code Red/Nimda attack (2002), memory overflow causes edge BGP speakers to repeatedly fail-stop and rejoin at a frequency as high as once every minute
the oscillation propagates farther away, in spite of the MRAI timer and RFD
Objectives
Formulate concepts that characterize, and develop mechanisms that achieve, the following properties:
in the presence of high-frequency faults, the impact of faults is always locally contained
once faults stop occurring, the system stabilizes within time that is a function of the degree of fault perturbation
We study these issues in the context of path-vector routing
To simplify the presentation, we first present a solution for continuous fault containment and local stabilization in path-vector routing, then we present the concepts
Outline
Fault propagation in path-vector protocols
CPV: design pattern and protocol
Generic concepts for tolerating high-frequency faults
Analytical & simulation results for CPV
Concluding remarks
Fault propagation in path-vector protocols
[Figure: nodes d, e, f, g, h, i, with the routes to d shown as [e, d], [f, e, d], [g, f, e, d], [h, g, f, e, d], [i, h, g, f, e, d]]
Which nodes are unaffected? All are affected
the fresh info. (route-announcement) always lags behind the obsolete info (route-withdrawal)
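The lag can be illustrated with a minimal Python sketch. It assumes a chain of nodes e, f, g, h, i behind destination d and an identical per-hop delay for withdrawals and announcements; the names CHAIN, HOP_DELAY, arrival_times and affected_nodes are hypothetical and not taken from the slides. Because both messages travel at the same speed, the announcement never catches up with the withdrawal, so every node first acts on the obsolete information.

```python
# Hypothetical illustration: d fails at t = 0 and rejoins `gap` seconds later.
CHAIN = ["e", "f", "g", "h", "i"]   # nodes ordered by hop distance from d
HOP_DELAY = 30                      # assumed per-hop delay in seconds


def arrival_times(start, hop_delay=HOP_DELAY):
    """Time at which a message sent by d at `start` reaches each node in CHAIN."""
    return {node: start + (k + 1) * hop_delay for k, node in enumerate(CHAIN)}


def affected_nodes(gap):
    """Nodes that receive the withdrawal before the fresh announcement."""
    withdrawal = arrival_times(0)
    announcement = arrival_times(gap)
    return [n for n in CHAIN if withdrawal[n] < announcement[n]]
```

For any positive gap, affected_nodes returns the whole chain, which matches the observation that all nodes are affected.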
Design pattern of CPV
Key idea: to design a mechanism that enables information regarding a new network state to catch up with and stop the propagation of the information regarding the preceding state (which has become obsolete)
works whether or not faults stop occurring
Parallel diffusing waves (with different propagation speeds)
Outline of CPV
Whenever a node j needs to change state, it engages a containment wave cw0 before engaging a new stabilization wave sw1
so that cw0 stops the previous stabilization wave sw0 from propagating the existing state of j
In the presence of high-frequency faults, another fault f may occur before j executes sw1; then there are two cases
j does not need to change state any more: j engages an undo-containment wave uw0 to stop cw0
j still needs to change state: j lets cw0 propagate
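The two cases above can be sketched as a small state holder. This is a simplified illustration and not the CPV protocol itself: the class Node, its methods, and the string-valued states are hypothetical, and wave propagation to neighbors is reduced to entries in a log.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    state: str
    pending: Optional[str] = None            # expected next state carried by cw0
    log: List[str] = field(default_factory=list)

    def need_change(self, new_state: str) -> None:
        """Engage containment wave cw0 before any new stabilization wave."""
        self.pending = new_state
        self.log.append("cw0")

    def fault_before_sw1(self, required_state: str) -> None:
        """Another fault occurs before sw1 has been executed."""
        if required_state == self.state:
            # Case 1: no change needed any more, so undo the containment.
            self.pending = None
            self.log.append("uw0")
        else:
            # Case 2: a change is still needed, so cw0 keeps propagating.
            self.pending = required_state

    def execute_sw1(self) -> None:
        """Execute the new stabilization wave if a change is still pending."""
        if self.pending is not None:
            self.state = self.pending
            self.pending = None
            self.log.append("sw1")
```

For example, a node that calls need_change("B"), then fault_before_sw1 with its current state, ends with the log ["cw0", "uw0"] and an unchanged state.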
A little more detail
Containment wave
piggybacks the expected next state of a node to its neighbors, so that a neighbor can decide whether to hold an existing SW
is a one-way diffusing process, by which CW can co-exist with the corresponding SW (which is required to contain continuously-occurring faults)
Stabilization wave
takes into account the predicted state when choosing the next-hop
Undo-containment wave
does not introduce new variables
Action SW (contd.)
a node not in CW does not execute SW, if the next-hop has executed CW
loop freedom
1) nodes not involved in any CW rank higher than those involved in a CW
2) consider the expected next route of a neighbor, if available via a CW
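The two ranking rules can be sketched as a next-hop selection function. This is a minimal sketch assuming a shortest-path policy; NeighborInfo, effective_route and choose_next_hop are hypothetical names, and the rule that a node not in CW skips SW when its next-hop has executed CW is not modeled here.

```python
from typing import List, NamedTuple, Optional, Tuple

Route = Tuple[str, ...]


class NeighborInfo(NamedTuple):
    name: str
    route: Optional[Route]                  # route currently announced, None if withdrawn
    in_cw: bool                             # True if the neighbor is involved in a CW
    expected_route: Optional[Route] = None  # expected next route, if piggybacked on a CW


def effective_route(n: NeighborInfo) -> Optional[Route]:
    """Rule 2: prefer the expected next route when a CW made it available."""
    if n.in_cw and n.expected_route is not None:
        return n.expected_route
    return n.route


def choose_next_hop(self_name: str, neighbors: List[NeighborInfo]) -> Optional[str]:
    candidates = []
    for n in neighbors:
        route = effective_route(n)
        if route is None or self_name in route:
            continue  # no usable route, or choosing it would create a loop
        # Rule 1: neighbors not in a CW sort first (False < True), then by path length.
        candidates.append((n.in_cw, len(route), n.name))
    return min(candidates)[2] if candidates else None
```

A neighbor outside any CW is chosen even when a neighbor inside a CW currently advertises a shorter path, which is the ordering that rule 1 describes.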
CPV (contd.): actions CW and UW
Note: we skip the actions for information synchronization between neighbors here
Generic concepts
Objective:
to define concepts that capture the desired system properties in the presence of continuously-occurring faults
Key issue:
to differentiate the impact of faults from that of protocol actions
Concepts defined:
Perturbed vs. contaminated node
Perturbation size & contamination range
F-containment & F-stabilization
Preliminaries
A System History H is a sequence q.0, (e.1, t.1), q.1, (e.2, t.2), …, q.(k-1), (e.k, t.k), q.k, …, of alternating system states and events, where
an event is either the execution of a protocol action or the occurrence of a fault
each state transition “q.(k-1), (e.k, t.k), q.k” means that event e.k at time t.k changes the system state from q.(k-1) to q.k
at every moment in time, at most one event can occur at a node
Given a system history H and a state q.k in H, the history prefix H(q.k) = the subsequence of H that is between q.0 and q.k
A computation is a system history (or its suffix) where no fault occurs
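These definitions map onto a simple data structure. The sketch below is one possible encoding, with hypothetical names (Event, History, prefix, suffix, is_computation); states are left as opaque values.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class Event:
    name: str
    time: float
    is_fault: bool       # False means the execution of a protocol action


@dataclass
class History:
    states: List[Any]    # q.0, q.1, ..., q.n
    events: List[Event]  # events[k-1] changes states[k-1] into states[k]

    def prefix(self, k: int) -> "History":
        """H(q.k): the part of the history between q.0 and q.k."""
        return History(self.states[: k + 1], self.events[:k])

    def suffix(self, k: int) -> "History":
        """The part of the history from q.k onward."""
        return History(self.states[k:], self.events[k:])

    def is_computation(self) -> bool:
        """A computation is a history (or suffix) in which no fault occurs."""
        return not any(e.is_fault for e in self.events)
```

A history with events [action, fault] is not a computation, but its prefix up to q.1 and its suffix from q.2 both are.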
Preliminaries (contd.)
Given a state q.k and H(q.k), a protocol execution E(q.k) is a set of computations
each of which specifies a computation C(q.k, E(q.k)) for a different state q.k’ in H(q.k) that is either the initial state or a state reached immediately after a fault occurs
Given q.k, E(q.k), the stabilization set of q.k, S(q.k, E(q.k)), is the set of nodes that need to change state for the system to stabilize from q.k in the absence of faults
Perturbation vs. contamination
Given “q.(k-1), (e, t), q.k” and E(q.k),
the corruption set of e at t
cpt(e, t, E(q.k)) = S(q.k, E(q.k)) \ S(q.(k-1), E(q.k))
if e is not a state corruption, the correction set of e at t
cct(e, t, E(q.k)) = (S(q.(k-1), E(q.k)) \ S(q.k, E(q.k))) ∩ V.(q.k)
For every node j ∈ cpt(e, t, E(q.k)),
j is perturbed by e if e is a fault
j is contaminated via e if e is the execution of a protocol action
For every node j ∈ cct(e, t, E(q.k)),
j is corrected by e
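The corruption and correction sets are plain set differences and can be sketched directly. The sketch assumes that the stabilization sets before and after the event are already known, and that the operator missing before V.(q.k) is an intersection with the set of nodes present at q.k; both the function names and that reading are assumptions. The condition that the correction set applies only when e is not a state corruption is left to the caller.

```python
from typing import Dict, Set


def corruption_set(s_prev: Set[str], s_curr: Set[str]) -> Set[str]:
    """cpt: nodes that must change state after the event but did not before it."""
    return s_curr - s_prev


def correction_set(s_prev: Set[str], s_curr: Set[str], nodes_curr: Set[str]) -> Set[str]:
    """cct: nodes that no longer need to change state, restricted to nodes at q.k (assumed)."""
    return (s_prev - s_curr) & nodes_curr


def classify(event_is_fault: bool, s_prev: Set[str], s_curr: Set[str],
             nodes_curr: Set[str]) -> Dict[str, str]:
    """Label each node touched by the event."""
    label = "perturbed" if event_is_fault else "contaminated"
    result = {j: label for j in corruption_set(s_prev, s_curr)}
    result.update({j: "corrected" for j in correction_set(s_prev, s_curr, nodes_curr)})
    return result
```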
Perturbed vs. contaminated node
a perturbed node remains perturbed until it is corrected by a fault or the system reaches a legitimate state
a contaminated node remains contaminated until it is corrected by a fault or the execution of a protocol action
Perturbation size & contamination range
Given q.k, H(q.k), and E(q.k), the perturbation size at q.k, P(q.k, H(q.k), E(q.k)), is the number of perturbed nodes at q.k
The contamination range of a perturbed region S’ at q.k, R(S’, q.k), is the maximum hop-distance from the corresponding set of contaminated nodes to S’
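Both measures can be computed on a graph once the perturbed and contaminated nodes are known. The sketch below assumes an undirected, connected topology given as an adjacency dictionary and reads the distance from a contaminated node to S’ as its hop distance to the nearest node of S’; the function names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def hop_distances(adj: Dict[str, List[str]], sources: Iterable[str]) -> Dict[str, int]:
    """Multi-source BFS: hop distance from every reachable node to the nearest source."""
    dist = {s: 0 for s in sources}
    queue = deque(dist)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def perturbation_size(perturbed: Set[str]) -> int:
    """P: the number of perturbed nodes."""
    return len(perturbed)


def contamination_range(adj: Dict[str, List[str]], region: Set[str],
                        contaminated: Set[str]) -> int:
    """R(S', q.k): largest hop distance from a contaminated node to the region S'."""
    dist = hop_distances(adj, region)
    # Assumes every contaminated node is reachable from the region.
    return max((dist[c] for c in contaminated), default=0)
```

On a chain d, e, f, g, h, i with region {d} and contaminated nodes {e, f}, the contamination range is 2.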
F-containment & F-stabilization
A system is F-containing if and only if
for every perturbed region S’ at an arbitrary state q.k, R(S’, q.k) = O(F(|S’|)), where F is a function
A system is F-stabilizing if and only if
starting at an arbitrary state q.k with an arbitrary H(q.k) and E(q.k), the system computation is guaranteed to reach a legitimate state within O(F(P(q.k, H(q.k), E(q.k)))) time in the absence of faults, where F is a function
Analytical results
L = {q: every up node has found its best route at state q}
Properties of CPV
the contamination range R(S’, q.k) of every perturbed region S’ at any state q.k is O(|S’|)
the distance to which a state of a node i propagates is proportional to the time the state lasts
starting at any state q.k with an arbitrary H(q.k) and E(q.k), the system where CPV is used reaches a legitimate state within O(F(P(q.k, H(q.k), E(q.k)))) time in the absence of faults
F is a function reflecting the routing policies used, and is linear if every node chooses a shortest path
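The claim that a state propagates to a distance proportional to its lifetime can be illustrated with a simple timing model. The model is an assumption made for this sketch, not the analysis behind the results: the stabilization wave carrying the state advances one hop every sw_hop_delay seconds, the containment wave starts when the state ends and advances one hop every cw_hop_delay seconds (strictly faster), and a tie at a node is resolved in favor of the containment wave. The function name and parameters are hypothetical.

```python
def obsolete_reach(lifetime: int, sw_hop_delay: int, cw_hop_delay: int) -> int:
    """Number of hops reached by the stabilization wave before the containment wave.

    The stabilization wave leaves the origin at time 0; the containment wave
    leaves `lifetime` seconds later, when the state becomes obsolete.
    """
    if cw_hop_delay >= sw_hop_delay:
        raise ValueError("the containment wave must be faster than the stabilization wave")
    hops = 0
    # Hop h is reached first by the stabilization wave while
    # h * sw_hop_delay < lifetime + h * cw_hop_delay.
    while (hops + 1) * sw_hop_delay < lifetime + (hops + 1) * cw_hop_delay:
        hops += 1
    return hops
```

With per-hop delays of 30 and 10 seconds, a state that lasts 60 seconds reaches 2 hops and one that lasts 120 seconds reaches 5 hops, so the reach grows roughly linearly with the lifetime under this model.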
Simulation results
SSFNet, a network simulator with standard-conforming protocol implementations
Simulation setup
parameter setup for CPV and BGP
CPV: ds = 30 sec, dc = 10 sec, du = 1 sec
BGP: with MRAI timers (30 seconds) and RFD
Fault scenario
a node repeatedly fail-stops and then rejoins every 30 seconds
Internet-type network topology
the shortest-path-first policy
Concluding remarks
Frequent transient faults do happen (especially when systems work under unexpected conditions)
fault containment and stabilization are desirable as well as possible
Quality of service and system behavior during stabilization
perspectives other than convergence only: time, space, stability, etc.
modeling issues: descriptive, derivative