16th of January, 2007, 3rd ACS Workshop, Garching. ACS Reliability. Klemen Žagar...
3rd ACS Workshop, Garching 2
Overview
• Reliability defined
• General approach
• Single points of failure in ACS
– Manager
– Configuration database
– Notification Service
– Components
• Approach to improve reliability of ACS
– Replication of state
– Fault transparency via redirection
– Replicated components
3rd ACS Workshop, Garching 3
Dependability of distributed systems
• Nodes of a distributed system are like dominoes
– The domino effect: one falls, and all may go down
– This may happen often, and rebuilding takes a long time
• Thus, fault tolerance is important:
– Improved mean-time-to-failure of the system as a whole
– Lower mean-time-to-repair
⇒ Improved availability
⇒ Reduced maintenance effort
3rd ACS Workshop, Garching 4
Reliability defined
• Reliability, R(t), is the probability that a system will perform as specified for a given period of time t.
– Typically exponential: R(t) = e^{-\lambda t}
– Alternative measure is the mean time to failure (MTTF/MTBF): \mathrm{MTTF} = 1/\lambda
[Figure: plot of R(t) versus t, starting at 1 and decaying exponentially; annotated example: 49.7 days, the reliability of the Microsoft Windows 95 operating system]
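A worked example of these formulas (the 49.7-day figure is the one quoted above; the 30-day horizon is an arbitrary illustration):
\lambda = 1/\mathrm{MTTF} = 1/(49.7\ \mathrm{days}) \approx 0.020\ \mathrm{day}^{-1}
R(30\ \mathrm{days}) = e^{-30/49.7} \approx 0.55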
3rd ACS Workshop, Garching 5
Reliability of composed systems
• Weakest link: the reliability of a coupled (serial) composed system is less than the reliability of its least reliable constituent: R_{\mathrm{system}}(t) = \prod_i R_i(t) \le \min_i R_i(t)
• Redundancy: the reliability of a redundant (parallel) subsystem is greater than the reliability of its most reliable constituent: R_{\mathrm{system}}(t) = 1 - \prod_i (1 - R_i(t)) \ge \max_i R_i(t)
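For example (illustrative values, not from the slide), two constituents with R_1 = R_2 = 0.9:
R_{\mathrm{series}} = 0.9 \times 0.9 = 0.81
R_{\mathrm{parallel}} = 1 - (1 - 0.9)^2 = 0.99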
3rd ACS Workshop, Garching 6
Reliability of composed systems
• Maintainability: how long it takes to repair a system after a failure.
– The measure is mean time to repair (MTTR)
• Availability: percentage of time the system is actually available during periods when it should be available.
– Directly experienced by users!
– Expressed in percent. In marketing, also with the number of nines (e.g., 99.999% availability corresponds to roughly 5 minutes of unavailability per year).
• Example: a gas station (working hours 6AM to 10PM – 16 hours)
– Ran out of gas at 10AM (2h outage)
– Pump malfunction at 2PM (2h outage)
– Availability: 12h/16h = 75%
[Figure: timeline from 12AM through 6AM to 10PM with the two 2h outages marked]
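In steady state, availability also follows from the two measures above (a standard relation, not spelled out on the slide):
A = \mathrm{MTTF} / (\mathrm{MTTF} + \mathrm{MTTR})
For example, \mathrm{MTTF} = 1000\ \mathrm{h} and \mathrm{MTTR} = 1\ \mathrm{h} give A = 1000/1001 \approx 99.9\%.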
3rd ACS Workshop, Garching 7
Faults in distributed systems
Node failures
• A host crashes or a process dies
• Volatile state is lost
Link failures
• A network link is broken
• Results in two or more partitions
• Difficult to distinguish from a host crash
[Figure: clients 1–6 connected to three copies of a server; Copy 1 is active and available, Copy 2 has crashed, and a severed network link leaves Copy 3 active but inconsistent and reachable only within its own partition]
3rd ACS Workshop, Garching 8
Fault mitigation mechanisms: improve hardware MTTF
• Reduce the number of mechanical parts:
– Solid-state storage instead of hard disks
– Passive cooling of power supplies and CPUs (no fans)
• High-quality power supplies
• Replication:
– network links
– CPU boards
• Remote reset (e.g., via power cycling)
3rd ACS Workshop, Garching 9
Fault mitigation mechanisms: improve software MTTF
• Ensure that overflows of variables that constantly increase (handle IDs, timers, counters, ...) are properly handled (see the sketch after this list).
• Ensure all resources are properly released when no longer needed (memory leaks, …)
– Use a managed platform (Java, .NET)
– Use auto-pointers (C++)
• Avoid using heap storage on a per-request basis (may result in memory fragmentation); e.g., use free-lists
• Restart a process in a controllable fashion (rejuvenation)
• Isolate processes through inter-process communication
• Recovery:
– Recover state after a crash
– Effective for host and process crashes
– Automated repair
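As an illustration of the overflow point above, a minimal Java sketch of wrap-around-safe handling of a 32-bit millisecond tick counter (the same class of overflow behind the 49.7-day figure earlier); the class and names are hypothetical, not ACS code:

public final class TickCounter {
    // 32-bit millisecond counter; wraps around roughly every 49.7 days.
    private int ticks;

    public void advance(int elapsedMillis) {
        ticks += elapsedMillis;  // intentional wrap-around; well defined for Java int
    }

    // Wrap-around-safe check: instead of "ticks >= deadline" (which breaks after
    // an overflow), compare the signed difference. Valid as long as the two
    // instants are less than ~24.8 days apart.
    public boolean reached(int deadline) {
        return ticks - deadline >= 0;
    }
}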
3rd ACS Workshop, Garching 10
Fault mitigation mechanisms: decrease MTTR
• Foresee failures during design
– The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair. – Douglas Adams: Mostly Harmless
• Provide good diagnostics
– Alarms
– Logs
– Detailed description of where and when an error occurred
– State-dump at failures (e.g., log ADC buffers during beam dumps)
• Automated fail-over
– In combination with redundancy
– Passive replica must have up-to-date state of the primary copy
– Fault detection (network ping, analog signal, …)
3rd ACS Workshop, Garching 11
General approach
• Redundancy via replication of data and services
– Stateless:
• Easy – just copy the service to another server
• Very scalable
– Stateful:
• State update propagation
• Concurrency
– Distributed locking, distributed agreement/coordination
3rd ACS Workshop, Garching 12
Active vs. passive replication of stateful services
• Passive:
– One active service
– Zero or more backup services
– Only the active service can perform mutable operations; updates are propagated to the backups
– In principle, any replica can perform read-only operations (concurrency issues)
– Fail-over: hot/warm/cold standby (how fast the backup takes over)
• Active:
– One or more active services (no distinction between active and backup)
– All replicas can perform both mutable and read-only operations
– No need for fail-over (just find a reachable working replica)
– For best results, requires a fully asynchronous application design
• State diagram (state transitions triggered by receipt of async messages)
• Formal analysis possible
• Writes are slower, reads are faster (load balancing)
3rd ACS Workshop, Garching 13
Passive replication
[Sequence diagram: client 1 and client 2 interact with the active and backup replicas; set() goes to the active replica, which propagates updates to the backup, while get() can be served locally. Two variants are shown: synchronous update propagation (set() returns only after the backup has applied the update) and asynchronous update propagation (set() returns immediately and the update is propagated later).]
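A minimal Java sketch of the passive scheme in the diagram above, for a hypothetical replicated key-value service (the class and method names are illustrative, not ACS code); the active replica applies a write locally and then propagates it synchronously to its backups, while reads can be served by any replica:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical replicated key-value service, for illustration only.
class PassiveReplica {
    private final Map<String, String> state = new ConcurrentHashMap<>();
    private final List<PassiveReplica> backups;   // empty list on a backup replica
    private final boolean active;

    PassiveReplica(boolean active, List<PassiveReplica> backups) {
        this.active = active;
        this.backups = backups;
    }

    // Mutable operation: only allowed on the active replica.
    void set(String key, String value) {
        if (!active)
            throw new IllegalStateException("writes must go to the active replica");
        applyUpdate(key, value);
        for (PassiveReplica backup : backups)
            backup.applyUpdate(key, value);   // synchronous update propagation
    }

    // Read-only operation: any replica can answer (modulo staleness under async propagation).
    String get(String key) {
        return state.get(key);
    }

    void applyUpdate(String key, String value) {
        state.put(key, value);
    }
}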
3rd ACS Workshop, Garching 14
Group Communication/Group Membership Service
• A common tool for active replication
• Delivery of messages to a group of processes
– Guarantees of order (e.g., total order: all processes receive messages in exactly the same order).
– Delivery of membership change events
• Ordered relative to other messages on all nodes
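A sketch of how such a service can look to the application; this is a hypothetical Java interface for illustration only, not the API of Spread or of any concrete GC/GMS product:

import java.util.Set;

// Hypothetical group-communication interface: totally ordered multicast plus membership events.
interface GroupCommunication {
    void join(String groupName, GroupListener listener);
    void multicast(String groupName, byte[] message);   // delivered to all members in total order
}

interface GroupListener {
    // All members receive messages in exactly the same order.
    void messageDelivered(byte[] message);
    // Membership changes are delivered ordered relative to the other messages.
    void membershipChanged(Set<String> currentMembers);
}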
3rd ACS Workshop, Garching 15
Active replication
[Sequence diagram: client1's set() is turned into an update that the GC/GMS delivers to replica1 and replica2; client2 can get() from either replica. On a network partition (e.g., router failure), each side receives a membership event ({replica1} on one side, {replica2} on the other) and keeps applying the updates it can see. When the network is OK again, the merged membership {replica1, replica2} is delivered and the missed updates are propagated so the replicas reconverge.]
3rd ACS Workshop, Garching 16
Transactions
• ACID:
– Atomicity: either all sub-operations succeed, or all are aborted (rolled back)
– Consistency: transactions can proceed simultaneously, but access to shared resources is exclusive (locking), so the system moves from one consistent state to another
– Isolation: two concurrent transactions don't see each other's effects
– Durability: once a transaction is committed, its effects are persisted (e.g., they are still there even if the server crashes immediately afterwards)
• Required to maintain a consistent state
– Not all applications have stringent requirements for consistency
– Does ALMA?
• Without transactions, every step needs explicit compensation; with them, a single rollback suffices:
try { doA(); } catch (...) { undoA(); throw; }
try { doB(); } catch (...) { undoA(); undoB(); throw; }
beginTx();
try { doA(); doB(); commitTx(); } catch (...) { rollbackTx(); throw; }
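For concreteness, the same begin/commit/rollback pattern with a real transactional API; a generic JDBC sketch (not ACS/ALMA code), in which doA() and doB() become SQL statements:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class TransferExample {
    // conn is an open java.sql.Connection to any transactional database.
    static void transfer(Connection conn) throws SQLException {
        conn.setAutoCommit(false);                // beginTx()
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("UPDATE accounts SET balance = balance - 10 WHERE id = 1");  // doA()
            st.executeUpdate("UPDATE accounts SET balance = balance + 10 WHERE id = 2");  // doB()
            conn.commit();                        // commitTx()
        } catch (SQLException e) {
            conn.rollback();                      // rollbackTx()
            throw e;
        }
    }
}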
3rd ACS Workshop, Garching 17
ACS weaknesses: the Manager
• Single point of failure
• Very rich state
– Component/container/client info
– Well-connected object graph
• Currently, crash recovery is implemented (Prevayler)
• Impact of Manager unavailability:
– No component resolutions can take place
– Well-written clients:
• System suspended.
• Keep retrying with the Manager.
• When finally giving up, restore a consistent state.
– Badly written clients:
• Erratic behavior (unhandled exceptions, ...)
• Failure on first attempt (also CORBA::TRANSIENT)
• Incomplete rollback – state corruption.
– In any case: full system unavailability
3rd ACS Workshop, Garching 18
Sidenote: Prevayler
• Resilience to crash failures (recovery)
– Implement atomic commands (extending org.prevayler.command):
public Serializable execute(PrevalentSystem system) throws Exception {
    // release the administrator identified by 'handle' from the prevalent Manager state
    ((ManagerImpl) system).getAdministrators().deallocate(handle);
    return null;
}
– Execute commands:
prevayler.executeCommand(new MyCommand());
• Behind the scenes:
– Prevayler serializes the command to a log file (transaction journal).
– Prevayler executes the command (takes care of synchronization, too).
– Snapshot: every now and then (on command), Prevayler persists the entire object graph (PrevalentSystem) to disk and removes the journal log.
– Recovery: the object graph is loaded from disk (latest snapshot + replay of journal log afterwards).
3rd ACS Workshop, Garching 19
ACS weaknesses: components
• Every component is a single point of failure
– Dynamic components are better off.
• No mechanism in place for:
– Component migration (from failed/unreachable containers to reachable ones)
• Not all components can migrate (e.g., those bound to hardware resources).
– Fail-over
• On failure, clients would have to check if the component had migrated
3rd ACS Workshop, Garching 20
ACS weaknesses: services
• CDB
– Standalone server
– If it fails, no new components can be constructed
– State:
• Change listeners.
• The data itself (in principle static).
• Notification Channel:
– Standalone server
– Many parts of ALMA depend on it!
– State:
• Notification listeners.
• Queues of undelivered messages?!
3rd ACS Workshop, Garching 21
ACS weaknesses: services
• Naming Service
– Standalone server
– Probably not very frequently used
– Could impact the Manager (unable to update the Naming Service)
– Stateful
• But the state can be deduced from the Manager's state.
• Archive?
– Stateful.
– Possible to reduce replication-related load with clever architecture, e.g.:
• Partitioned/distributed data.
• Bulk data transfer?
3rd ACS Workshop, Garching 22
Improving Manager availability: the Manager
• Adjust Prevayler to propagate updates (commands).
• Approaches:
– Passive replication:
• E.g., send via TCP socket (notification service?) to all (pre)configured backups.
• Active replica selection (voting): ordered list of backups
• Not resilient to network partitions.
– Active replication
• Send via Group Communication service (e.g., Spread).
• Upon receipt, update the state.
• Prefers a fully asynchronous implementation.
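A minimal Java sketch of the passive-replication variant above, assuming the backup Managers are reachable at (pre)configured TCP addresses; serialized Prevayler-style commands are simply pushed to every backup (all names here are illustrative, not ACS code):

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

// Hypothetical propagator: forwards each executed command to all configured backups.
class CommandPropagator {
    private final List<InetSocketAddress> backups;   // (pre)configured backup Managers

    CommandPropagator(List<InetSocketAddress> backups) {
        this.backups = backups;
    }

    void propagate(Serializable command) {
        for (InetSocketAddress backup : backups) {
            try (Socket socket = new Socket(backup.getHostName(), backup.getPort());
                 ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream())) {
                out.writeObject(command);   // the backup deserializes and re-executes the command
            } catch (IOException e) {
                // An unreachable backup must not take the active Manager down; log and continue.
                System.err.println("Could not propagate command to " + backup + ": " + e);
            }
        }
    }
}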
3rd ACS Workshop, Garching 23
Improving Manager availability: clients and container services
• Every client of the Manager should keep a list of all backup Managers
– Could be obtained from the Manager itself.
– Passive replication: each Manager replica should know whether it is active or not.
• When contacting the Manager:
– If it fails, retry with the backup (next from the list).
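A sketch of that retry logic on the client side, using a hypothetical ManagerRef handle (in ACS this would be a CORBA reference); the client walks down its list of Manager replicas until one responds:

import java.util.List;

// Hypothetical handle to one Manager replica.
interface ManagerRef {
    boolean ping();                       // cheap liveness check
    Object getComponent(String curl);     // example Manager operation
}

class FailoverManagerClient {
    // Active Manager first, then backups (the list could be obtained from the Manager itself).
    private final List<ManagerRef> managers;

    FailoverManagerClient(List<ManagerRef> managers) {
        this.managers = managers;
    }

    Object getComponent(String curl) {
        for (ManagerRef manager : managers) {
            try {
                if (manager.ping())
                    return manager.getComponent(curl);
            } catch (RuntimeException e) {
                // This replica failed or is unreachable; retry with the next one from the list.
            }
        }
        throw new IllegalStateException("no Manager replica reachable");
    }
}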
3rd ACS Workshop, Garching 24
Improving component availability
• Manager:
– In CDB, assign a list of possible containers to each component.
• Affinity: list is prioritized or ordered.
• Some components may have several replicas, others not.
– If Manager detects failure of a container (e.g., via existing heartbeat mechanism), it relocates all of its components.
• Should it notify clients of these components that the migration occurred? What can clients do?
• Components:
– Stateless: no problems.
– Stateful:
• Container would have to provide state replication services (active/passive?)
– BACI properties – reusable implementation of reliability mechanisms.
• Clients:
– Implement failover. But how?
• CORBA interceptors.
• Adjusting JacORB or IDL generated code
• FT-CORBA?
• Non-Java ORBs?
– Beware of retrying non-idempotent operations!
3rd ACS Workshop, Garching 25
Improving CDB availability
• CDB packaged as a component
– Re-use of mechanisms for component availability.
– (Solvable) chicken-and-egg problem with the Manager.
• List of notification listeners must be propagated
– A general problem, solvable once-and-for-all (BACI?)
• Synchronous update propagation
– With CDB, it is acceptable to deny updates if in degraded mode.
• Is it?
3rd ACS Workshop, Garching 26
Improving service availability
• Naming service:
– Have Manager implement naming service interfaces.
• Eliminates a number of other problems (updating of the Naming Service, ...)
• Notification Service:
– We need to check TAO implementation.
– Some ideas:
• Listener propagation.
• Network storage for persistent messages (?)
• Archive:
– Use of database replication?
3rd ACS Workshop, Garching 27
Testing reliability
• A test suite needs to be prepared:
– Distributed test
• Test environment (how many machines, …)?
– Injection of network failures
• E.g., iptables reconfiguration
– Injection of node crashes
• Or network isolation + process kill
3rd ACS Workshop, Garching 28
Discussion