16th of January, 2007, 3rd ACS Workshop, Garching. ACS Reliability. Klemen Žagar...
3rd ACS Workshop, Garching 2
Overview
• Reliability defined
• General approach
• Single points of failure in ACS
– Manager
– Configuration database
– Notification Service
– Components
• Approach to improve reliability of ACS
– Replication of state
– Fault transparency via redirection
– Replicated components
3rd ACS Workshop, Garching 3
Dependability of distributed systems
• Nodes of a distributed system are like dominoes
– The domino effect: one falls, and all may go down
– This may happen often, and rebuilding takes a long time
• Thus, fault tolerance is important:
– Improved mean-time-to-failure of the system as a whole
– Lower mean-time-to-repair
⇒ Improved availability
⇒ Reduced maintenance effort
3rd ACS Workshop, Garching 4
Reliability defined
• Reliability, R(t), is the probability that a system will perform as specified for a given period of time t.
– Typically exponential: R(t) = e^{-\lambda t}
– Alternative measure is the mean time to failure (MTTF/MTBF): \mathrm{MTTF} = 1/\lambda
[Figure: plot of R(t) versus t, starting at 1 and decaying exponentially; annotated example: 49.7 days, the reliability of the Microsoft Windows 95 operating system]
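A worked example of these formulas (the 49.7-day figure is the one quoted above; the 30-day horizon is an arbitrary illustration):
\lambda = 1/\mathrm{MTTF} = 1/(49.7\ \mathrm{days}) \approx 0.020\ \mathrm{day}^{-1}
R(30\ \mathrm{days}) = e^{-30/49.7} \approx 0.55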
3rd ACS Workshop, Garching 5
Reliability of composed systems
• Weakest link: the reliability of a coupled (serial) composed system is less than the reliability of its least reliable constituent: R_{\mathrm{system}}(t) = \prod_i R_i(t) \le \min_i R_i(t)
• Redundancy: the reliability of a redundant (parallel) subsystem is greater than the reliability of its most reliable constituent: R_{\mathrm{system}}(t) = 1 - \prod_i (1 - R_i(t)) \ge \max_i R_i(t)
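For example (illustrative values, not from the slide), two constituents with R_1 = R_2 = 0.9:
R_{\mathrm{series}} = 0.9 \times 0.9 = 0.81
R_{\mathrm{parallel}} = 1 - (1 - 0.9)^2 = 0.99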
3rd ACS Workshop, Garching 6
Reliability of composed systems
• Maintainability: how long it takes to repair a system after a failure.
– The measure is mean time to repair (MTTR)
• Availability: percentage of time the system is actually available during periods when it should be available.
– Directly experienced by users!
– Expressed in percent. In marketing, also with the number of nines (e.g., 99.999% availability corresponds to roughly 5 minutes of unavailability per year).
• Example: a gas station (working hours 6AM to 10PM – 16 hours)
– Ran out of gas at 10AM (2h outage)
– Pump malfunction at 2PM (2h outage)
– Availability: 12h/16h = 75%
[Figure: timeline from 12AM through 6AM to 10PM with the two 2h outages marked]
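In steady state, availability also follows from the two measures above (a standard relation, not spelled out on the slide):
A = \mathrm{MTTF} / (\mathrm{MTTF} + \mathrm{MTTR})
For example, \mathrm{MTTF} = 1000\ \mathrm{h} and \mathrm{MTTR} = 1\ \mathrm{h} give A = 1000/1001 \approx 99.9\%.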
3rd ACS Workshop, Garching 7
Faults in distributed systems
Node failures
• A host crashes or a process dies
• Volatile state is lost
Link failures
• A network link is broken
• Results in two or more partitions
• Difficult to distinguish from a host crash
[Figure: clients 1–6 connected to three copies of a server; Copy 1 is active and available, Copy 2 has crashed, and a severed network link leaves Copy 3 active but inconsistent and reachable only within its own partition]
3rd ACS Workshop, Garching 8
Fault mitigation mechanisms: improve hardware MTTF
• Reduce the number of mechanical parts:
– Solid-state storage instead of hard disks
– Passive cooling of power supplies and CPUs (no fans)
• High-quality power supplies
• Replication:
– network links
– CPU boards
• Remote reset (e.g., via power cycling)
3rd ACS Workshop, Garching 9
Fault mitigation mechanisms: improve software MTTF
• Ensure that overflows of variables that constantly increase (handle IDs, timers, counters, ...) are properly handled (see the sketch after this list).
• Ensure all resources are properly released when no longer needed (memory leaks, …)
– Use a managed platform (Java, .NET)
– Use auto-pointers (C++)
• Avoid using heap storage on a per-request basis (may result in memory fragmentation); e.g., use free-lists
• Restart a process in a controllable fashion (rejuvenation)
• Isolate processes through inter-process communication
• Recovery:
– Recover state after a crash
– Effective for host and process crashes
– Automated repair
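As an illustration of the overflow point above, a minimal Java sketch of wrap-around-safe handling of a 32-bit millisecond tick counter (the same class of overflow behind the 49.7-day figure earlier); the class and names are hypothetical, not ACS code:

public final class TickCounter {
    // 32-bit millisecond counter; wraps around roughly every 49.7 days.
    private int ticks;

    public void advance(int elapsedMillis) {
        ticks += elapsedMillis;  // intentional wrap-around; well defined for Java int
    }

    // Wrap-around-safe check: instead of "ticks >= deadline" (which breaks after
    // an overflow), compare the signed difference. Valid as long as the two
    // instants are less than ~24.8 days apart.
    public boolean reached(int deadline) {
        return ticks - deadline >= 0;
    }
}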
3rd ACS Workshop, Garching 10
Fault mitigation mechanisms: decrease MTTR
• Foresee failures during design
– The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair. – Douglas Adams: Mostly Harmless
• Provide good diagnostics
– Alarms
– Logs
– Detailed description of where and when an error occurred
– State-dump at failures (e.g., log ADC buffers during beam dumps)
• Automated fail-over
– In combination with redundancy
– Passive replica must have up-to-date state of the primary copy
– Fault detection (network ping, analog signal, …)
3rd ACS Workshop, Garching 11
General approach
• Redundancy via replication of data and services
– Stateless:
• Easy – just copy the service to another server
• Very scalable
– Stateful:
• State update propagation
• Concurrency
– Distributed locking, distributed agreement/coordination
3rd ACS Workshop, Garching 12
Active vs. passive replication of stateful services
• Passive:
– One active service
– Zero or more backup services
– Only the active service can perform mutable operations; updates are propagated to the backups
– In principle, any replica can perform read-only operations (concurrency issues)
– Fail-over: hot/warm/cold standby (how fast the backup takes over)
• Active:
– One or more active services (no distinction between active and backup)
– All replicas can perform both mutable and read-only operations
– No need for fail-over (just find a reachable working replica)
– For best results, requires a fully asynchronous application design
• State diagram (state transitions triggered by receipt of async messages)
• Formal analysis possible
• Writes are slower, reads are faster (load balancing)
3rd ACS Workshop, Garching 13
Passive replication
[Sequence diagram: client 1 and client 2 interact with the active and backup replicas; set() goes to the active replica, which propagates updates to the backup, while get() can be served locally. Two variants are shown: synchronous update propagation (set() returns only after the backup has applied the update) and asynchronous update propagation (set() returns immediately and the update is propagated later).]
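A minimal Java sketch of the passive scheme in the diagram above, for a hypothetical replicated key-value service (the class and method names are illustrative, not ACS code); the active replica applies a write locally and then propagates it synchronously to its backups, while reads can be served by any replica:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical replicated key-value service, for illustration only.
class PassiveReplica {
    private final Map<String, String> state = new ConcurrentHashMap<>();
    private final List<PassiveReplica> backups;   // empty list on a backup replica
    private final boolean active;

    PassiveReplica(boolean active, List<PassiveReplica> backups) {
        this.active = active;
        this.backups = backups;
    }

    // Mutable operation: only allowed on the active replica.
    void set(String key, String value) {
        if (!active)
            throw new IllegalStateException("writes must go to the active replica");
        applyUpdate(key, value);
        for (PassiveReplica backup : backups)
            backup.applyUpdate(key, value);   // synchronous update propagation
    }

    // Read-only operation: any replica can answer (modulo staleness under async propagation).
    String get(String key) {
        return state.get(key);
    }

    void applyUpdate(String key, String value) {
        state.put(key, value);
    }
}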
3rd ACS Workshop, Garching 14
Group Communication/Group Membership Service
• A common tool for active replication
• Delivery of messages to a group of processes
– Guarantees of order (e.g., total order: all processes receive messages in exactly the same order).
– Delivery of membership change events
• Ordered relative to other messages on all nodes
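A sketch of how such a service can look to the application; this is a hypothetical Java interface for illustration only, not the API of Spread or of any concrete GC/GMS product:

import java.util.Set;

// Hypothetical group-communication interface: totally ordered multicast plus membership events.
interface GroupCommunication {
    void join(String groupName, GroupListener listener);
    void multicast(String groupName, byte[] message);   // delivered to all members in total order
}

interface GroupListener {
    // All members receive messages in exactly the same order.
    void messageDelivered(byte[] message);
    // Membership changes are delivered ordered relative to the other messages.
    void membershipChanged(Set<String> currentMembers);
}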
3rd ACS Workshop, Garching 15
Active replication
[Sequence diagram: client1's set() is turned into an update that the GC/GMS delivers to replica1 and replica2; client2 can get() from either replica. On a network partition (e.g., router failure), each side receives a membership event ({replica1} on one side, {replica2} on the other) and keeps applying the updates it can see. When the network is OK again, the merged membership {replica1, replica2} is delivered and the missed updates are propagated so the replicas reconverge.]
3rd ACS Workshop, Garching 16
Transactions
• ACID:
– Atomicity: either all sub-operations succeed, or all are aborted (rolled back)
– Consistency: transactions can proceed simultaneously, but access to shared resources is exclusive (locking), so the system moves from one consistent state to another
– Isolation: two concurrent transactions don't see each other's effects
– Durability: once a transaction is committed, its effects are persisted (e.g., they are still there even if the server crashes immediately afterwards)
• Required to maintain a consistent state
– Not all applications have stringent requirements for consistency
– Does ALMA?
• Without transactions, every step needs explicit compensation; with them, a single rollback suffices:
try { doA(); } catch (...) { undoA(); throw; }
try { doB(); } catch (...) { undoA(); undoB(); throw; }
beginTx();
try { doA(); doB(); commitTx(); } catch (...) { rollbackTx(); throw; }
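For concreteness, the same begin/commit/rollback pattern with a real transactional API; a generic JDBC sketch (not ACS/ALMA code), in which doA() and doB() become SQL statements:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class TransferExample {
    // conn is an open java.sql.Connection to any transactional database.
    static void transfer(Connection conn) throws SQLException {
        conn.setAutoCommit(false);                // beginTx()
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("UPDATE accounts SET balance = balance - 10 WHERE id = 1");  // doA()
            st.executeUpdate("UPDATE accounts SET balance = balance + 10 WHERE id = 2");  // doB()
            conn.commit();                        // commitTx()
        } catch (SQLException e) {
            conn.rollback();                      // rollbackTx()
            throw e;
        }
    }
}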
3rd ACS Workshop, Garching 17
ACS weaknesses: the Manager
• Single point of failure
• Very rich state
– Component/container/client info
– Well-connected object graph
• Currently, crash recovery is implemented (Prevayler)
• Impact of Manager unavailability:
– No component resolutions can take place
– Well-written clients:
• System suspended.
• Keep retrying with the Manager.
• When finally giving up, restore a consistent state.
– Badly written clients:
• Erratic behavior (unhandled exceptions, ...)
• Failure on first attempt (also CORBA::TRANSIENT)
• Incomplete rollback – state corruption.
– In any case: full system unavailability
3rd ACS Workshop, Garching 18
Sidenote: Prevayler
• Resilience to crash failures (recovery)
– Implement atomic commands (extending org.prevayler.command):
public Serializable execute(PrevalentSystem system) throws Exception {
    // release the administrator identified by 'handle' from the prevalent Manager state
    ((ManagerImpl) system).getAdministrators().deallocate(handle);
    return null;
}
– Execute commands:
prevayler.executeCommand(new MyCommand());
• Behind the scenes:
– Prevayler serializes the command to a log file (transaction journal).
– Prevayler executes the command (takes care of synchronization, too).
– Snapshot: every now and then (on command), Prevayler persists the entire object graph (PrevalentSystem) to disk and removes the journal log.
– Recovery: the object graph is loaded from disk (latest snapshot + replay of journal log afterwards).
3rd ACS Workshop, Garching 19
ACS weaknesses: components
• Every component is a single point of failure
– Dynamic components are better off.
• No mechanism in place for:
– Component migration (from failed/unreachable containers to reachable ones)
• Not all components can migrate (e.g., those bound to hardware resources).
– Fail-over
• On failure, clients would have to check if the component had migrated
3rd ACS Workshop, Garching 20
ACS weaknesses: services
• CDB
– Standalone server
– If it fails, no new components can be constructed
– State:
• Change listeners.
• The data itself (in principle static).
• Notification Channel:
– Standalone server
– Many parts of ALMA depend on it!
– State:
• Notification listeners.
• Queues of undelivered messages?!
3rd ACS Workshop, Garching 21
ACS weaknesses: services
• Naming Service
– Standalone server
– Probably not very frequently used
– Could impact the Manager (unable to update the Naming Service)
– Stateful
• But the state can be deduced from the Manager's state.
• Archive?
– Stateful.
– Possible to reduce replication-related load with clever architecture, e.g.:
• Partitioned/distributed data.
• Bulk data transfer?
3rd ACS Workshop, Garching 22
Improving Manager availability: the Manager
• Adjust Prevayler to propagate updates (commands).
• Approaches:
– Passive replication:
• E.g., send via TCP socket (notification service?) to all (pre)configured backups.
• Active replica selection (voting): ordered list of backups
• Not resilient to network partitions.
– Active replication
• Send via Group Communication service (e.g., Spread).
• Upon receipt, update the state.
• Prefers a fully asynchronous implementation.
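A minimal Java sketch of the passive-replication variant above, assuming the backup Managers are reachable at (pre)configured TCP addresses; serialized Prevayler-style commands are simply pushed to every backup (all names here are illustrative, not ACS code):

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

// Hypothetical propagator: forwards each executed command to all configured backups.
class CommandPropagator {
    private final List<InetSocketAddress> backups;   // (pre)configured backup Managers

    CommandPropagator(List<InetSocketAddress> backups) {
        this.backups = backups;
    }

    void propagate(Serializable command) {
        for (InetSocketAddress backup : backups) {
            try (Socket socket = new Socket(backup.getHostName(), backup.getPort());
                 ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream())) {
                out.writeObject(command);   // the backup deserializes and re-executes the command
            } catch (IOException e) {
                // An unreachable backup must not take the active Manager down; log and continue.
                System.err.println("Could not propagate command to " + backup + ": " + e);
            }
        }
    }
}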
3rd ACS Workshop, Garching 23
Improving Manager availability: clients and container services
• Every client of the Manager should keep a list of all backup Managers
– Could be obtained from the Manager itself.
– Passive replication: each Manager replica should know whether it is active or not.
• When contacting the Manager:
– If it fails, retry with the backup (next from the list).
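A sketch of that retry logic on the client side, using a hypothetical ManagerRef handle (in ACS this would be a CORBA reference); the client walks down its list of Manager replicas until one responds:

import java.util.List;

// Hypothetical handle to one Manager replica.
interface ManagerRef {
    boolean ping();                       // cheap liveness check
    Object getComponent(String curl);     // example Manager operation
}

class FailoverManagerClient {
    // Active Manager first, then backups (the list could be obtained from the Manager itself).
    private final List<ManagerRef> managers;

    FailoverManagerClient(List<ManagerRef> managers) {
        this.managers = managers;
    }

    Object getComponent(String curl) {
        for (ManagerRef manager : managers) {
            try {
                if (manager.ping())
                    return manager.getComponent(curl);
            } catch (RuntimeException e) {
                // This replica failed or is unreachable; retry with the next one from the list.
            }
        }
        throw new IllegalStateException("no Manager replica reachable");
    }
}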
3rd ACS Workshop, Garching 24
Improving component availability
• Manager:
– In CDB, assign a list of possible containers to each component.
• Affinity: list is prioritized or ordered.
• Some components may have several replicas, others not.
– If Manager detects failure of a container (e.g., via existing heartbeat mechanism), it relocates all of its components.
• Should it notify clients of these components that the migration occurred? What can clients do?
• Components:
– Stateless: no problems.
– Stateful:
• Container would have to provide state replication services (active/passive?)
– BACI properties – reusable implementation of reliability mechanisms.
• Clients:
– Implement failover. But how?
• CORBA interceptors.
• Adjusting JacORB or IDL generated code
• FT-CORBA?
• Non-Java ORBs?
– Beware of retrying non-idempotent operations!
3rd ACS Workshop, Garching 25
Improving CDB availability
• CDB packaged as a component
– Re-use of mechanisms for component availability.
– (Solvable) chicken-and-egg problem with the Manager.
• List of notification listeners must be propagated
– A general problem, solvable once-and-for-all (BACI?)
• Synchronous update propagation
– With CDB, it is acceptable to deny updates if in degraded mode.
• Is it?
3rd ACS Workshop, Garching 26
Improving service availability
• Naming service:
– Have Manager implement naming service interfaces.
• Eliminates a number of other problems (updating of the Naming Service, ...)
• Notification Service:
– We need to check TAO implementation.
– Some ideas:
• Listener propagation.
• Network storage for persistent messages (?)
• Archive:
– Use of database replication?
3rd ACS Workshop, Garching 27
Testing reliability
• A test suite needs to be prepared:
– Distributed test
• Test environment (how many machines, …)?
– Injection of network failures
• E.g., iptables reconfiguration
– Injection of node crashes
• Or network isolation + process kill
3rd ACS Workshop, Garching 28
Discussion