Lesson 20. Fault Tolerance and Disaster Recovery.
Objectives
At the end of this presentation, you will be able to:
• Identify the purpose and characteristics of fault tolerance.
• Explain how redundancy is used in servers and networks to eliminate single points of failure.
• Identify several techniques used in servers and network systems to increase fault tolerance.
• Define: Fault tolerance, redundancy, RAID, mirror server, and cluster.
• Plan for disaster recovery.
• Develop a disaster recovery plan.
• Implement a disaster recovery plan.
• Document and regularly test the disaster recovery plan.
• Explain standard backup procedures and backup media storage practices.
• Identify types of backups and restoration schemes.
• Confirm and use off-site storage of backups.
Network+ Domain covered:
• 3.11
• 3.12
Fault Tolerance
• The ability of a network or a computer to go on working in spite of one or more component failures.
• Achieved by eliminating “single points of failure.”
• Achieved primarily through redundancy.
Redundancy in the Server
• Eliminates the most common “single points of failure.”
• Uses multiple components in parallel so that if one component fails another takes over.
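The benefit of running components in parallel can be illustrated with a short calculation. The sketch below assumes that component failures are independent and uses a made-up failure probability; neither assumption comes from the lesson.

```python
def parallel_failure_probability(p_fail: float, copies: int) -> float:
    """Probability that every one of `copies` redundant components fails.

    Assumes failures are independent, which real hardware only approximates.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_fail ** copies


# Hypothetical figure: one power supply fails in a given year with probability 0.05.
single = parallel_failure_probability(0.05, 1)
redundant = parallel_failure_probability(0.05, 2)
print(f"single supply: {single:.4f}")
print(f"two supplies:  {redundant:.4f}")
```

Under these assumptions the system only goes down when both supplies fail, so the second supply cuts the chance of an outage from that cause by a factor of twenty.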
Hardware Failure
• Disk drives: 50%
• Power supply: 28%
• Fan: 8%
• CPU: 5%
• Memory: 4%
• Controller: 4%
• Motherboard: 1%
Source: Intel Corp.
Redundant Array of Inexpensive Drives (RAID)
• RAID is a way of coaxing two or more inexpensive, slow, unreliable drives to perform in concert so that they act like a more expensive, faster, reliable drive.
A disk system with RAID capability:
• Protects its data and provides on-line, immediate access to its data, despite a single disk failure.
• Provides for the on-line reconstruction of the contents of a failed disk to a replacement disk.
RAID Advisory Board (RAB)
Various RAID implementations exist. They are identified as Levels.
• The basic implementations are called level 0 through level 6.
• A higher level is not necessarily better than a lower level.
RAID can be implemented in:
• Software
o Slower
o Less expensive
• Hardware
o Faster
o More expensive
RAID Level 1 - data is written to two separate drives.
[Diagram: blocks 0 1 2 3 are written identically to both drives.]
Provides access to data despite a disk failure.
Provides for reconstruction of the contents of the failed disk.
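The mirroring behavior of RAID Level 1 can be sketched as a toy model in which each drive is a list of blocks. The class and method names are illustrative and do not correspond to any real RAID controller or driver interface.

```python
class MirroredPair:
    """Toy model of RAID level 1: every block is written to both drives."""

    def __init__(self, num_blocks: int):
        # Each drive is a list of blocks; None marks a failed drive.
        self.drives = [[b""] * num_blocks, [b""] * num_blocks]

    def write(self, block: int, data: bytes) -> None:
        for drive in self.drives:
            if drive is not None:
                drive[block] = data

    def read(self, block: int) -> bytes:
        # Serve the read from the first drive that is still working.
        for drive in self.drives:
            if drive is not None:
                return drive[block]
        raise IOError("both drives have failed; data is lost")

    def fail(self, index: int) -> None:
        self.drives[index] = None

    def rebuild(self, index: int) -> None:
        # Copy the surviving drive's contents onto a replacement drive.
        survivor = self.drives[1 - index]
        if survivor is None:
            raise IOError("no surviving drive to rebuild from")
        self.drives[index] = list(survivor)


array = MirroredPair(num_blocks=4)
for i in range(4):
    array.write(i, f"block {i}".encode())

array.fail(0)                               # one disk fails
print(array.read(2))                        # data is still served by the mirror
array.rebuild(0)                            # replacement disk is reconstructed
print(array.drives[0] == array.drives[1])
```

The model tolerates the loss of either drive but not both, which matches the "despite a single disk failure" wording of the definition.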
Redundant Hard Drives in a Server
[Photo: server chassis with five hard drives. Courtesy Intel Corp.]
Redundant Power Supplies
[Photo: spare power supply. Courtesy Intel Corp.]
Hot-Swap Fans
[Photo: hot-swap fan with caution label. Courtesy Intel Corp.]
Dual Processor Slots
[Photo: two CPU sockets. Courtesy Intel Corp.]
Redundant NICs
[Diagram: active NIC and spare NIC.]
Backup Power
• Standby Power Supply (SPS)
• Uninterruptible Power Supply (UPS)
Standby Power Supply (SPS)
• An “off-line” device that functions only when normal power fails.
• A sensor detects AC power failure and switches over to standby power.
• Standby power is provided by a battery and a power inverter.
Standby Power Supply (Normal)
[Diagram: charger and battery pack.]
Standby Power Supply (AC Power Fails)
[Diagram: charger, battery pack, and inverter.]
Uninterruptible Power Supply (UPS)
• An “on-line” device that constantly provides power.
• In the event of an AC power failure there is no switchover to standby power, because the UPS is constantly “on-line.”
• It “conditions” the AC input, isolating the computer equipment from all variations in AC power.
The UPS conditions the AC line against:
• Power outages – Total loss of AC power.
• Surges – Temporary voltage rises.
• Sags – Temporary voltage drops.
• Noise – High frequency voltage spikes, both up and down.
Uninterruptible Power Supply (Normal)
[Diagram: charger, battery pack, and inverter.]
Uninterruptible Power Supply (AC Power Fails)
[Diagram: charger, battery pack, and inverter.]
Increase Fault Tolerance
• RAID
• Multiple power supplies
• Multiple fans
• Multiple CPUs
• Redundant PCI cards
• Backup power sources
Disaster Recovery
Types of Disasters
• Fires
• Floods
• Wind and water damage
• Accidents
• Power outages
• Civil unrest
• Malicious attacks
Disaster Recovery
• The ability to return to an acceptable level of operation after a disaster.
• Requires a well-thought-out disaster recovery plan.
• Requires a comprehensive implementation of the plan.
• Requires frequent testing and updating of the plan.
7 Steps to Disaster Tolerance
• Initiate the project
• Form a project team
• Complete a needs analysis
• Develop a plan that encompasses both protection and recovery
• Implement the plan
• Test the plan
• Constantly update the plan
What’s in the Protection Plan?
• Procedures and policies describing how the facility, its functions, and data are to be protected.
• List of new protective equipment, software, and services needed along with a budget, procurement schedule and installation schedule.
• A step-by-step procedure and timetable for upgrading the data center from its present state to a protected state.
What’s in the Recovery Plan?
• Procedures and policies describing how and under what conditions the recovery plan should be activated.
• Basic protective and recovery information on each major piece of equipment.
• Names and telephone numbers of key corporate officials and the emergency management team members.
• Address of off-site backup facilities, with name and number of contact person.
• Location of backup tapes and disks.
What’s in the Recovery Plan? (continued)
• Names and phone numbers of key hardware, software and services vendors.
• Model numbers, serial numbers, as well as warranty and service agreement information on major pieces of equipment.
• Insurance policy numbers and information.
• Documentation of the equipment, software, configuration and wiring infrastructure of the data center.
24 X 7 X 365
• 24 hours per day
• 7 days per week
• 365 days per year
Backing up the Main System
• Hot Site Backup
• Warm Site Backup
• Cold Site Backup
Hot Site Backup
• A duplicate and running complement of computer hardware and software ready to take over immediately should the main system become unavailable for any reason.
• Data on the main system is backed up to the duplicate system in real time.
• If the main system fails the duplicate can take over operation without any downtime.
Warm Site Backup
• A duplicate complement of computer hardware and software ready to take over in a reasonable length of time should the main system become unavailable.
• Data is not backed up to the duplicate system in real time, but can be restored from backup tapes or other media.
Cold Site Backup
• An off-site location that can be used in case the main site is inoperable.
• Ready to go, but with no equipment installed.
• Least expensive to maintain.
• The recovery time is quite long compared to hot-site or even warm-site backup.
Implement the Plan
• Buying and installing the equipment, software, and services necessary to bring the data center up to a protected state.
• Training in the discipline of new policies and procedures.
• The plan may be so extensive and so expensive that it must be phased in over time.
• Every day that the plan is delayed, the company is at risk.
Test the Plan
• The only way to ensure the plan works.
• Simulate a disaster.
• Test the plan regularly and thoroughly.
Constantly Update the Plan
• New equipment
• New software
• New people
• New tasks
Safeguard the Disaster Recovery Plan
• Make duplicate copies.
• Make certain that duplicate copies are updated when the master copy is updated.
• Make sure key people know where the document can be found.
Need for Frequent and Regular Backup
• The most effective way to prevent data loss.
• Protects all but the most recent data.
• Protects against hardware failure, equipment theft, hackers, viruses and vandals.
• Storing backups in a different location protects against fire, flood and other natural disasters.
Backup Considerations
• What data should be backed up?
• How often should the data be backed up?
• What type of backup media should be used?
• What type of backup scheme should be used?
What data should be backed up?
• Backups require time and media.
• Backups must not exceed the capacity of the backup device.
• Ask yourself: “Can I afford to lose this?”
• It is better to err on the side of caution.
What data need not be routinely backed up?
• Operating systems
• Application software
• Historical data that does not change
How often should the data be backed up?
• Trade-off of risk versus benefit.
• Ask yourself: “How much data can I afford to lose?”
• Daily backups are the most common.
• But circumstances may dictate anything from continuous real-time backups to weekly backups.
What type of backup media should be used?
• Magnetic Disks
• Optical Disks
• Magnetic Tape
• Internet Backup
Types of Backup
• Full
• Incremental
• Differential
Full Backup
• The backup of all files on the drive.
• Takes the longest time to record because every file is copied.
• Takes the shortest time to restore because everything is on a single tape.
Full Backup
[Diagram: a Full Backup is taken every day, Monday through Sunday.]
Restoration From a Full Backup Requires Only One Tape
[Diagram: the Wednesday tape.]
The Full Backup is:
• A straightforward method of ensuring good backups and quick, easy restorations.
• The starting point for the Incremental and the Differential Backups.
Incremental Backup
• Records only those files that have changed since the last Incremental or Full Backup.
• Takes the shortest time to record.
• Generally takes the longest time to restore.
• Generally requires several tapes to restore.
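The selection rule for an incremental backup can be sketched as a comparison of modification times. The file names and timestamps below are hypothetical, and real backup software often relies on an archive bit or a catalog rather than timestamps.

```python
from datetime import datetime


def files_to_back_up(mtimes: dict[str, datetime], last_backup: datetime) -> list[str]:
    """Return the files modified after `last_backup` (the last Full or Incremental)."""
    return sorted(name for name, mtime in mtimes.items() if mtime > last_backup)


# Hypothetical files and modification times.
mtimes = {
    "payroll.xls": datetime(2024, 1, 8, 9, 30),   # changed on Monday
    "orders.db": datetime(2024, 1, 9, 14, 0),     # changed on Tuesday
    "logo.png": datetime(2024, 1, 2, 11, 0),      # unchanged since before the full backup
}

monday_night = datetime(2024, 1, 8, 23, 0)        # time of the previous incremental
print(files_to_back_up(mtimes, monday_night))
```

Because Tuesday's run compares against Monday night, it skips the file that Monday's incremental already captured, which is why recording is fast and why a restore needs every tape in the chain.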
Incremental Backup
[Diagram: Sunday (Full Backup), followed by Incremental Backups Monday through Saturday.]
Restoration From an Incremental Backup May Require Several Tapes
[Diagram: Sunday (Full Backup) plus the Monday, Tuesday, and Wednesday Incremental Backups.]
Differential Backup
• Records only those files that have changed since the last Full backup.
• Takes less time to record than a Full backup.
• Takes less time to restore than an Incremental backup.
• The restore process requires only two tapes.
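The difference in restore effort between the three backup types can be sketched by listing the tapes a restore would need. The sketch assumes a Sunday Full Backup followed by one backup of the chosen type every night; the function name and tape labels are illustrative.

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
SCHEMES = ("full", "incremental", "differential")


def tapes_needed(scheme: str, restore_day: str) -> list[str]:
    """Tapes required to restore the data captured by `restore_day`'s backup."""
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme: {scheme}")
    idx = DAYS.index(restore_day)
    if scheme == "full":
        # A full backup runs every night, so the latest tape is enough.
        return [f"{restore_day} (Full)"]
    tapes = ["Sunday (Full)"]
    if scheme == "incremental":
        # Every incremental since the full backup must be replayed in order.
        tapes += [f"{day} (Incremental)" for day in DAYS[1:idx + 1]]
    elif idx > 0:
        # Only the most recent differential is needed on top of the full backup.
        tapes.append(f"{restore_day} (Differential)")
    return tapes


for scheme in SCHEMES:
    print(scheme, tapes_needed(scheme, "Wednesday"))
```

For a Wednesday restore this yields one tape for the full scheme, four for the incremental scheme, and two for the differential scheme.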
Differential Backup
[Diagram: Sunday (Full Backup), followed by Differential Backups Monday through Saturday.]
Restoration From a Differential Backup Requires Only Two Tapes
[Diagram: Sunday (Full Backup) plus the Wednesday (Differential Backup).]
Grandfather - Father - Son (GFS) Tape Rotation Scheme
• Son – The daily backup tapes.
• Father – The full backup tape for the week.
• Grandfather – The full backup tape for the month.
Reuse Tapes
• Son – After one week
• Father – Five weeks
• Grandfather – Save indefinitely at an off-site location.
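A Grandfather-Father-Son schedule can be sketched as a function that maps a calendar date to a tape role. The sketch assumes the weekly full backup runs on Sunday and that the last Sunday of the month is the Grandfather; the lesson does not fix either choice.

```python
from datetime import date, timedelta

# Reuse periods as listed in the lesson.
REUSE = {
    "Son": "after one week",
    "Father": "after five weeks",
    "Grandfather": "never (kept off-site)",
}


def gfs_role(day: date) -> str:
    """Return which GFS tape set a backup taken on `day` belongs to."""
    if day.weekday() != 6:                # Monday is 0, Sunday is 6
        return "Son"
    # A Sunday is the last of its month if the next Sunday falls in another month.
    if (day + timedelta(days=7)).month != day.month:
        return "Grandfather"
    return "Father"


for d in (date(2024, 1, 24), date(2024, 1, 21), date(2024, 1, 28)):
    role = gfs_role(d)
    print(d.isoformat(), role, "- reuse:", REUSE[role])
```

A site that runs its weekly full backup on a different day would only need to change the weekday test.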
Verify that the Backup Works
• Can you restore from the backup tape?
• Can you still restore from the backup tape if the original tape drive is destroyed?
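One way to check that a restore reproduces the original data is to compare checksums of the original and restored files. The sketch below uses temporary files as stand-ins; the helper names are illustrative, and the copy step is only a placeholder for a real restore from tape.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original: Path, restored: Path) -> bool:
    """True when the restored file is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(restored)


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "ledger.dat"
    restored = Path(tmp) / "ledger_restored.dat"
    original.write_bytes(b"quarterly figures")
    shutil.copyfile(original, restored)       # placeholder for restoring from the backup tape
    print(restore_matches(original, restored))
    restored.write_bytes(b"corrupted")        # simulate a bad restore
    print(restore_matches(original, restored))
```

Running the same comparison after restoring on a different tape drive addresses the second question above.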
Problems with Tapes
• Problem 1: Tape drive heads are dirty.
o Solution: Clean the tape heads.
• Problem 2: The tapes become worn with time and use.
o Solution: Replace the worn tape with a new tape.
Backup Software
• A utility designed to make routine backups as effortless and as effective as possible.
• Suppliers of backup software
o NOS vendor
o Third parties