Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François...
-
Upload
maria-needham -
Category
Documents
-
view
219 -
download
4
Transcript of Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François...
![Page 1: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/1.jpg)
Fundamentals Stream Session 8:Fault Tolerance &Dependability I
CSC 253
Gordon Blair, François Taïani
Distributed Systems
![Page 2: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/2.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 2
Overview of the dayToday: Fault-Tolerance in DS / Guest Lecture
9-10am: Fundamental Stream / Fault Tolerance (a)
10-11am: Guest Lecturer Prof. Jean-Charles FabreHow to assess the robustness of OS and middleware
12-1pm: Fundamental Stream / Fault Tolerance (b)
Next week: Advanced Fault-Tolerance
![Page 3: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/3.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 3
Overview of the session A famous example (video)
Basic dependability concepts Also covered in CSc 365 (Year B) "Safety Critical Systems”
Fault-tolerance at the node level: Replication
Fault-tolerance at the network level: TCP
Fault-tolerance at the environment level: Data centres
Associated Reading: Chapter 7. Fault Tolerance of Tanenbaum & van Steen
![Page 4: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/4.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 4
Video?A Famous Example
April 14, 1912
![Page 5: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/5.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 5
A Major Disaster The worst peacetime maritime disaster
more than 1500 dead only 706 survivors
The wreck only discovered in 1985 2 1/2 miles down on the ocean floor
![Page 6: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/6.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 6
The Titanic and Fault-Tolerance How fault-tolerant was the Titanic?
best technology of the time: deemed “practicably” unsinkable 16 watertight compartments Titanic could float with its first 4 compartments flooded
Some flaws in the design: poor turning ability bulkheads only went as high as E-deck not enough lifeboats!
![Page 7: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/7.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 7
What does It Teach Us? Good to plan ahead for problems
Like collision and hull damage for a ship
Replication / redundancy is the key to survival Titanic contained 16 watertight compartments Could still float with the first four flooded (& other combinations) But always a limit to what you can tolerate
But you also need diversity / independence Compartments are watertight Critical issue: prevent water propagation (deck E issue)
The case of the titanic can be applied to many other systems Generic concepts have been developed to discuss them
![Page 8: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/8.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 8
Dependability Concepts Failure: what we want to avoid
Titanic: sinking and killing people Distributed System: stop functioning correctly
Fault: a problem that could cause a failure Titanic: the iceberg, design flaws Distributed System: hardware crash, software bug
Errors: an erroneous internal state that can lead to a failure Titanic: damaged hull, water in compartment DS: erroneous data, inconsistent internal
behaviour
![Page 9: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/9.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 9
System and Sub-Systems Faults, Errors, Failures are recursive notions
A systems usually made of components e.g. sub-systems Titanic: hull, decks, compartments, smoke stacks Distributed system: nodes, network links, routers, hubs,
services Nodes contain CPUs, hard disks, network card, ... Network comprises routers, hubs, gateways, name servers ...
Failure of internal component: a fault for enclosing system
![Page 10: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/10.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 10
Means to Dependability Fault prevention:
Do not go through the Artic Ocean (no icebergs) Buy good hardware (less faulty hardware) Hire skilled programmers (less bugs)
Fault removal Blast the iceberg (not practicable as external fault) Test and replace faulty hardware before it is used Test and remove bugs
Fault tolerance Use replication / redundancy and diversity Prevent water / error propagation
![Page 11: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/11.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 11
Quality of Service Notion of 'failure' is not black or white
Consider a web server taking 5 minute to deliver each request It fulfils its functional specification (to deliver web pages) It works better than a server that would not reply at all But can it be considered to be functioning correctly?
Lots of intermediary situations
Quality of Service of a system: how 'good' a system is Multiple dimension: latency, bandwidth, security, availability, reliability, ...
'Failure' usually reserved to when a service becomes unusable Goal: highest QoS in spite of faults (and at the lowest cost)
optimal service complete failure
![Page 12: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/12.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 12
Graceful Degradation Ideally fault tolerance should mask any internal failure/fault
A router crashes but you do not even notice
In practice not always possible Essentially because of cost + fault assumption
Goal is then to minimise impact of faults on system user Avoid complete failure Maintain highest possible QoS in spite of faults
This is called "Graceful Degradation" Extremely important: leaves time to administrators to react Users continue using the service, albeit with a lower QoS
![Page 13: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/13.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 13
Distributed Systems andFault Tolerance
Fault tolerance requires distribution For replication and independence of failure modes I.e. to tolerate earthquakes not all servers in San Francisco
Distribution requires fault-tolerance In large distributed systems, partial node failures unavoidable Single points of failure to be avoided as much as possible
![Page 14: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/14.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 14
What Can Go Wrong in a DS? Nodes (servers, clients, peers)
Faulty hardware crash or data corruption Power failure crash (and sometimes data corruption) Any "physical" accident: fire, flood, earthquake, ...
Network Same as nodes.
Routers gateways: whole subnet impacted Name servers: whole name domain impacted
Congestion (dropped packets)
Users / People Involuntary mistakes Security attacks (external & internal, i.e. from legitimate users)
![Page 15: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/15.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 15
Main Steps of Fault Tolerance Error Detection
Detecting that something is wrong (like water in the keel) The earlier the better The earlier the more difficult
Error Recovery Preventing errors from propagating (watertight door) Removing errors Does not mean root cause (fault) removed
Fault treatment (usually off-line, manually) Fault Diagnostic Fault Removal
![Page 16: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/16.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 16
Error Detection Error detection in the Titanic quite “easy”
No water should leak inside the boat (safety invariant)
Distributed Systems: more difficult What should be the result of a request is usually unknown
In some cases sanity or correctness check possible In many cases some form of redundancy needed
√xC2
1.414213
(1.414213)2=?
![Page 17: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/17.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 17
Error Recovery Backward error recovery
Return system into a previous error-free (hopefully) state Can be achieved in generic manner (replication, checkpointing) Risk of inconsistency in general case Can be minimised (see next week)
Forward error recovery Bring system in a valid future state Highly application dependent Makes sense as soon as real-time is involved Examples: video streaming, flight control software
![Page 18: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/18.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 18
Node FT: Replication Why replicate?
Performance: e.g. replication of heavily loaded web servers Allows error detection (when replicas disagree) Allows backward error recovery
Importance of fault-model Effect of replication depends on how individual replicas fail Crash faults availability: % availability = 1-pn
(where n = no of replicas, p = probability of individual failure)
Requirements for a replication service Replication transparency ... in spite of network partitions,
disconnections, etc.
![Page 19: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/19.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 19
0
0.2
0.4
0.6
0.8
1
1.2
0.5 0.4 0.3 0.2 0.1 0
Probability of failure of 1 replica
Whole systemavailability
1 replica2 replicas3 replicas4 replicas5 replicas
Focus on Availability Assumptions (very important!)
Replicas fail independently Replicas fail silently (crashes, no arbitrary behaviour)
![Page 20: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/20.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 20
A Generic Replication Architecture
C
C
FE
FE
Front endsClients
RM
RM
RM
Replicamanagers
ServiceRequest &
replies
![Page 21: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/21.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 21
Passive Replication What is passive replication (primary backup)?
FEs communicate with a single primary RM, which must then communicate with secondary RMs (slaves)
Requires election of new primary in case of primary failure Only tolerate crash faults (silent failure of replicas) Variant: primary’s state saved to stable storage (cold passive)
saving to stable storage known as “checkpointing”
C FE
FE
RM
RM
RM
Primary
C
Backup
Backup
![Page 22: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/22.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 22
Active Replication What is active replication?
Front Ends multicast requests to every Replica Manager Appropriate reliable group communications needed ordering guarantees crucial (see W4: Group Communication)
Tolerate crash-faults and arbitrary faults
FE
RM
RM
RM CFEC
![Page 23: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/23.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 23
Comparing Active vs. Passive
Passive ActiveComm. overheadProcessing overheadRecovery overheadFault modelDeterminism
LowLowHighOnly crash faultNot required
HighHighLowArbitrary faultsRequired
less expensiveless complexless powerful
more powerfulmore expensivemore complex
![Page 24: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/24.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 24
Network FT: TCP TCP is a fault-tolerant protocol
correct messages arrives even if packets get lost or corrupted
A TCP packet:
QuickTime™ and aTIFF (Uncompressed) decompressor
are needed to see this picture.
Error detection against corruption
Fault-model
C.SC 251: Networking
![Page 25: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/25.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 25
TCP Checksum TCP checksum : 16bit blocks of packet added & complemented
(additional one-complementing here and there) 216 (~65 000) possible combinations TCP packets typically around 1500 bytes 212000 combinations Several packets bounds to have the same checksum
Roughly: 212000 ÷ 216 = 211984 ~ 16 x 103594
The Trick Usual corruption unlikely to produce a consistent checksum Notion of hamming distance Additional checksum at the network (IP) and link layers (Ethernet)
(!)multiple corruptions
![Page 26: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/26.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 26
Between 2 string of bits: the number of bit switches required to go from 1 to the other
Distance between 1011101 and 1001001? 2
Minimum Hamming Distance of a code minimal hamming distance between 2 valid "code words" i.e. minimal number of corruptions to fall back on a valid packet
Simple checksum: packets of 8 bits checksum on 4 bits:
adds the two 4-bit blocks
What is the minimal hamming distance of this code?
Between 2 string of bits: the number of bit switches required to go from 1 to the other
Distance between 1011101 and 1001001?
Hamming Distance
0110 1010 1100
0110
1010+
????
![Page 27: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/27.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 27
0100 110010101000
0110
1000+
0100
Hamming Distance
Hamming distance is 2 Note that the 2 corruptions must obey a particular pattern In a network corruption usually occur in bursts
0110 1010 11001000
0110
1000+
0100
1000+Matches again.Valid packet.Corruption not detected.
Does not match. Invalid packet.Corruption detected.
![Page 28: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/28.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 28
What Is in a Checksum? The TCP checksum is a form of error detection code
Min. Hamming distance reflects strength of the detection Far more complex error detection codes exist For instance CRC: cyclic redundancy check
It is a form of redundancy information redundancy: if packet OK no need for checksum partial redundancy: impossible to reconstruct packet from
checksum
When code used "strong" enough (min. Hamming dist > 2) possible reconstruct "most likely" correct packet Error correction code (not seen in this course) Provides error recovery
![Page 29: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/29.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 29
Environment FT: Data Centres Mission critical systems (your company’s web server)
Little point in addressing only one type of fault Major risks: power failure & overheating
Collocation Provider: rents luxury “parking places” for servers Multiple network operators Secured physical access Multiple power sources (including power generator) Highly robust air conditioning
Example: Redbus Interhouse plc (http://www.interhouse.org/) 3 centres in London, others in Amsterdam, Frankfurt, Milan, Paris Even there things can go wrong
http://www.theregister.co.uk/2004/06/16/redbus_power_fails/
![Page 30: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/30.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 30
What we'll see next week
Refinement of what we have seen today
Replication scheme: the problem of consistencyConsistent check-pointing for primary backupFault-tolerant total ordering for active replication
Contrast replication and distributed transactionsShort flash-back to Week 3 lecture
![Page 31: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/31.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 31
References
Chapter 7 of Tanenbaum & van Steen
On the Titanic http://www.charlespellegrino.com/Time%20Line.htm
Very detailed account http://octopus.gma.org/space1/titanic.html
Fundamental concepts of dependability by Algirdas Avizenis Jean-Claude Laprie Brian Randell http://www.cert.org/research/isw/isw2000/papers/56.pdf
![Page 32: Fundamentals Stream Session 8: Fault Tolerance & Dependability I CSC 253 Gordon Blair, François Taïani Distributed Systems.](https://reader035.fdocuments.us/reader035/viewer/2022062511/551ae5ad550346b2288b6585/html5/thumbnails/32.jpg)
CSC253 / 2005-06 G. Blair/ F. Taiani 32
Expected Learning Outcomes
At the end of this 8th Unit:
You should understand the basic dependability concepts of fault, error, failure
You should know the basic steps of fault-tolerance
You should appreciate fundamental fault-techniques like replication and error-detection codes
You should be able to explain and compare active and passive replication style