Autonomic distributed systems
Think about this
[Figure: human population and computer population (×10^9), 1980 to 2010]
Think about this
Machines will fail from time to time, regardless of how carefully they are designed. But who will manage these systems? Even if everyone joined IT, it would not be enough! Isn’t this a crisis?
Systems have to take care of themselves.
Self-help is the best help.
What does it mean?
There are many such desirable self-* properties that could be added to the wish list. Collectively called self-* properties, they characterize an autonomic system.
Self-help
Self-healing
Self-organizing
Self-optimizing
Self-protecting
Self-managing
Self-stabilizing
Self-healing
The Spirit Mars rover has a
radiation-hardened RAD6000 CPU from
Lockheed-Martin Federal Systems.
One day, while performing a crucial
task, the Spirit rover fell silent,
alone in the emptiness of Mars.
What next?
Courtesy: Jet Propulsion Lab
Self-healing
The problem was eventually remotely detected by ground control.
The operating system tried to allocate more files than the RAM-based directory structure could accommodate. It caused an exception that suspended the task that attempted the allocation. NASA ground control deleted some files, and reformatted the entire flash memory system. On February 6, 2004 the rover was restored to its original working condition, and science activities resumed.
It would have been nice if the detection and repair could be done by the rover itself …
Courtesy: Jet Propulsion Lab
Self-stabilization
• Technique for spontaneous restoration of a system predicate.
• Forward error recovery (memoryless) -- does not bother about
the impact of the failure as long as the recovery is
guaranteed.
• Guarantees eventual safety following failures.
Feasibility demonstrated by Dijkstra (CACM 1974)
Self-stabilizing systems
Starting from any initial configuration, the system is guaranteed
to recover to a legitimate configuration (L is true) in a bounded
number of steps, as long as the program code itself is not corrupted.
Self-stabilizing systems
Transient failures perturb the global state. The ability to spontaneously recover from any initial state implies that no initialization is ever required.
[Figure: the state space, with the legal region marked]
Self-stabilizing systems
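Self-stabilizing systems exhibit non-masking
fault tolerance. They satisfy the following two
criteria:
1. Convergence: starting from any state, the system eventually reaches a state where L holds.
2. Closure: once L holds, it continues to hold in the absence of further faults.
[Figure: a fault takes the system from L to not L; convergence brings it back to L; closure keeps it in L]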
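For a small finite system, both criteria can be checked exhaustively: closure means every step from a legal state stays legal, and convergence means every execution from any state reaches a legal state. The sketch below is one possible way to do this, not part of the original slides. The toy system, the names `check_closure` and `check_convergence`, and the choice of legal states are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List

# Hypothetical toy system: states are the integers 0..5 and L holds in {0, 1}.
STATES: List[int] = list(range(6))


def legal(s: int) -> bool:
    return s in (0, 1)


def successors(s: int) -> List[int]:
    """Legal states toggle between 0 and 1; illegal states count down."""
    return [1 - s] if legal(s) else [s - 1]


def broken_successors(s: int) -> List[int]:
    """Same system, except states 2 and 3 bounce between each other forever."""
    if s == 2:
        return [3]
    if s == 3:
        return [2]
    return successors(s)


def check_closure(states: Iterable[int],
                  succ: Callable[[int], List[int]],
                  is_legal: Callable[[int], bool]) -> bool:
    """Closure: every step taken from a legal state leads to a legal state."""
    return all(is_legal(t) for s in states if is_legal(s) for t in succ(s))


def check_convergence(states: Iterable[int],
                      succ: Callable[[int], List[int]],
                      is_legal: Callable[[int], bool]) -> bool:
    """Convergence: from every state, every execution reaches a legal state.

    Grows the set of states known to reach L on all paths until nothing changes.
    """
    all_states = list(states)
    good = {s for s in all_states if is_legal(s)}
    changed = True
    while changed:
        changed = False
        for s in all_states:
            if s in good:
                continue
            nxt = succ(s)
            # s is good if it can move and every move lands in a good state.
            if nxt and all(t in good for t in nxt):
                good.add(s)
                changed = True
    return good == set(all_states)


print(check_closure(STATES, successors, legal))
print(check_convergence(STATES, successors, legal))
print(check_convergence(STATES, broken_successors, legal))
```

The broken variant still satisfies closure but fails convergence, because states 2 and 3 cycle without ever reaching L.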
Adaptive Distributed Systems
System behavior spontaneously changes when the environment changes
A traffic control system
AM / PM
AM ⇒ L_AM holds; PM ⇒ L_PM holds
L = (AM ⇒ L_AM) ∧ (PM ⇒ L_PM)
defines the system invariant
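The invariant combines two implications: in the AM environment the AM predicate must hold, and in the PM environment the PM predicate must hold. A minimal sketch of that predicate follows. The function names are hypothetical, and the two sub-predicates are passed in as plain booleans because the slides do not say what L_AM and L_PM actually require.

```python
def implies(a: bool, b: bool) -> bool:
    """Logical implication a => b."""
    return (not a) or b


def invariant(is_am: bool, l_am: bool, l_pm: bool) -> bool:
    """L = (AM => L_AM) and (PM => L_PM), assuming PM is simply 'not AM'."""
    is_pm = not is_am
    return implies(is_am, l_am) and implies(is_pm, l_pm)


# In the morning only L_AM matters; in the afternoon only L_PM matters.
print(invariant(is_am=True, l_am=True, l_pm=False))
print(invariant(is_am=False, l_am=True, l_pm=False))
```

When the environment switches from AM to PM, a state that satisfied L a moment ago may no longer satisfy it, and the system has to adapt until L holds again.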
Example 1: Stabilizing mutual exclusion
[Figure: a unidirectional ring of processes 0, 1, 2, ..., N-1]
Consider a unidirectional ring of processes. In the legal configuration, exactly one token will circulate in the network.
A solution
{Process 0} repeat x[0] = x[N-1] → x[0] := (x[0] + 1) mod K forever
{Process j > 0} repeat x[j] ≠ x[j-1] → x[j] := x[j-1] forever
Each rule has the form: guard (or condition) → action.
The state of process j is x[j] ∈ {0, 1, 2, ..., K-1}, and K > N
TOKEN = ENABLED GUARD: a process holds a token exactly when its guard is true.
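One way to get convinced is to simulate the two rules from an arbitrary starting state. The sketch below assumes a scheduler that picks one enabled process at random per step (a central scheduler) and uses N = 5 and K = 6, which are illustrative choices. The function names are hypothetical.

```python
import random
from typing import List


def has_token(x: List[int], j: int) -> bool:
    """Guard of process j. An enabled guard means process j holds a token."""
    if j == 0:
        return x[0] == x[-1]
    return x[j] != x[j - 1]


def fire(x: List[int], j: int, k: int) -> None:
    """Action of process j, meant to be taken only when its guard is enabled."""
    if j == 0:
        x[0] = (x[0] + 1) % k
    else:
        x[j] = x[j - 1]


def token_holders(x: List[int]) -> List[int]:
    return [j for j in range(len(x)) if has_token(x, j)]


def run_until_legal(x: List[int], k: int, rng: random.Random,
                    max_steps: int = 10_000) -> int:
    """Fire randomly chosen enabled processes until exactly one token remains."""
    steps = 0
    while len(token_holders(x)) != 1:
        if steps >= max_steps:
            raise RuntimeError("did not stabilize within max_steps")
        fire(x, rng.choice(token_holders(x)), k)
        steps += 1
    return steps


N, K = 5, 6                                  # illustrative sizes with K > N
rng = random.Random(1)
x = [rng.randrange(K) for _ in range(N)]     # arbitrary, possibly illegal state
steps = run_until_legal(x, K, rng)
print(f"legal after {steps} steps: x = {x}, token at {token_holders(x)}")
```

Once a single token remains, firing the only enabled process passes the token to its successor on the ring, so the configuration stays legal.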
Does it work?
First, be convinced that it works.
Then think about why it will work.
Example 2: Stabilizing spanning tree
• Given a connected graph G = (V,E) and a root r,
design an algorithm for maintaining a spanning
tree in the presence of transient failures that may
corrupt the local states of processes.
• Let n = |V|
A solution
Each process i has two variables, L(i) and P(i):
L(i) = distance from the root via tree edges
P(i) = parent of process i
By definition L(r) = 0, and P(r) is undefined. In a legal state,
∀i ∈ V, i ≠ r : L(i) ≠ n ∧ L(i) = L(P(i)) + 1.
Sample case
[Figure: a sample graph with nodes 0 to 5, before and after the failure]
P(2) is corrupted
The algorithm
The algorithm has three rules R0, R1, R2:
(R0) (L(i) ≠ n) ∧ (L(i) ≠ L(P(i)) + 1) ∧ (L(P(i)) ≠ n) → L(i) := L(P(i)) + 1
(R1) (L(i) ≠ n) ∧ (L(P(i)) = n) → L(i) := n
(R2) (L(i) = n) ∧ (∃k ∈ Neighbors(i) : L(k) < n-1) → L(i) := L(k) + 1; P(i) := k
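The three rules can be run as a small simulation. The sketch below is illustrative only: the 6-node graph is hypothetical (it is not the sample graph from the slides), the scheduler fires one randomly chosen enabled rule per step, and non-root levels are assumed to lie in 1..n.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical connected graph as adjacency lists; node 0 is the root r.
GRAPH: Dict[int, List[int]] = {
    0: [1, 5], 1: [0, 2], 2: [1, 3, 4], 3: [2, 4], 4: [2, 3, 5], 5: [0, 4],
}
ROOT = 0
N = len(GRAPH)


def enabled_rule(i: int, L: Dict[int, int],
                 P: Dict[int, int]) -> Optional[Tuple[str, int, int]]:
    """Return (rule, new L(i), new P(i)) for non-root i, or None if idle."""
    p = P[i]
    if L[i] != N and L[i] != L[p] + 1 and L[p] != N:
        return ("R0", L[p] + 1, p)
    if L[i] != N and L[p] == N:
        return ("R1", N, p)
    if L[i] == N:
        for k in GRAPH[i]:
            if L[k] < N - 1:
                return ("R2", L[k] + 1, k)
    return None


def is_legal(L: Dict[int, int], P: Dict[int, int]) -> bool:
    """Legal state: every non-root i has L(i) != n and L(i) = L(P(i)) + 1."""
    return all(L[i] != N and L[i] == L[P[i]] + 1 for i in P)


def stabilize(L: Dict[int, int], P: Dict[int, int], rng: random.Random,
              max_steps: int = 10_000) -> int:
    """Fire randomly chosen enabled rules until no rule is enabled."""
    steps = 0
    while True:
        moves = [(i, m) for i in P
                 for m in [enabled_rule(i, L, P)] if m is not None]
        if not moves:
            return steps
        if steps >= max_steps:
            raise RuntimeError("did not stabilize within max_steps")
        i, (_rule, new_level, new_parent) = rng.choice(moves)
        L[i], P[i] = new_level, new_parent
        steps += 1


# A legal tree on the hypothetical graph; the root has no entry in P.
L = {0: 0, 1: 1, 2: 2, 3: 3, 4: 3, 5: 1}
P = {1: 0, 2: 1, 3: 2, 4: 2, 5: 0}
P[2] = 4                                   # transient fault corrupts P(2)
steps = stabilize(L, P, random.Random(0))
print(f"stabilized after {steps} steps: L = {L}, P = {P}")
```

The recovered tree need not be the one that existed before the fault; the rules only guarantee that some spanning tree rooted at r is restored.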
Proof of stabilization
Define an edge from i to P(i) to be well-formed
when L(i) ≠ n, L(P(i)) ≠ n, and L(i) = L(P(i)) + 1.
In any configuration, the well-formed edges form
a spanning forest. Delete all edges that are not
well-formed. Designate each tree T(k) in the
forest by the lowest value of L in it.
Example
In the sample graph shown earlier, T(0) = {0, 1} and T(2) = {2, 3, 4, 5}.
Let F(k) denote the number of T(k)’s in the forest.
Define a tuple F= (F(0), F(1), F(2) …, F(n)).
For the sample graph, F = (1, 0, 1, 0, 0, 0) after node 2
had the transient failure that changed P(2) from 2 to 4.
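The tuple F can be computed directly from L and P: follow well-formed edges upward from each node to find the root of its tree, then count the trees by the L value of their roots (the lowest L in each tree). The sketch below uses a hypothetical 6-node configuration, not the sample graph from the slides, and follows the definition F = (F(0), ..., F(n)), so the tuple has n + 1 entries. The function name is an assumption.

```python
from typing import Dict, Tuple


def forest_tuple(L: Dict[int, int], P: Dict[int, int]) -> Tuple[int, ...]:
    """Return F = (F(0), ..., F(n)); the root r is the node with no entry in P."""
    n = len(L)

    def well_formed(i: int) -> bool:
        if i not in P:                      # the root has no parent edge
            return False
        p = P[i]
        return L[i] != n and L[p] != n and L[i] == L[p] + 1

    def tree_root(i: int) -> int:
        # L strictly decreases along well-formed edges, so this terminates.
        while well_formed(i):
            i = P[i]
        return i

    counts = [0] * (n + 1)
    for r in {tree_root(i) for i in L}:
        counts[L[r]] += 1                   # the tree root holds the lowest L
    return tuple(counts)


# Hypothetical legal configuration with root 0.
L_legal = {0: 0, 1: 1, 2: 2, 3: 3, 4: 3, 5: 1}
P_legal = {1: 0, 2: 1, 3: 2, 4: 2, 5: 0}
F_legal = forest_tuple(L_legal, P_legal)

# A transient fault changes P(2) from 1 to 4.
P_fault = dict(P_legal)
P_fault[2] = 4
F_fault = forest_tuple(L_legal, P_fault)

# Node 2 then applies R0: L(2) := L(P(2)) + 1 = 4.
L_after = dict(L_legal)
L_after[2] = 4
F_after = forest_tuple(L_after, P_fault)

print(F_legal, F_fault, F_after)
```

Python compares tuples lexicographically, so `F_after < F_fault` is exactly the kind of decrease the proof relies on.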
Skeleton of the proof
Minimum F = (1,0,0,0,0,0) {legal configuration}
Maximum F = (1, n-1, 0, 0, 0, 0).
With each action, F decreases lexicographically.
Verify the claim!
This proves that eventually F becomes (1,0,0,0,0,0) and
the spanning tree stabilizes.