Hadoop availability


Transcript of Hadoop availability

Page 1: Hadoop availability

Availability

Page 2: Hadoop availability

Reliability in distributed systems

• To be truly reliable, a distributed system must have the following characteristics:

– Fault-Tolerant: It can recover from component failures without performing incorrect actions.

– Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.

– Recoverable: Failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.

– Consistent: The system can coordinate actions by multiple components often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.

– Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we might increase the size of the network on which the system is running. This increases the frequency of network outages and could degrade a "non-scalable" system. Similarly, we might increase the number of users or servers, or overall load on the system. In a scalable system, this should not have a significant effect.

– Predictable Performance: The ability to provide desired responsiveness in a timely manner.

– Secure: The system authenticates access to data and services.

Page 3: Hadoop availability

SPOF

• The combination of

– replicating namenode metadata on multiple file-systems, and

– using the secondary namenode to create checkpoints

• protects against data loss,

• but does not provide high availability of the file-system.

Page 4: Hadoop availability

SPOF

• The namenode is still a single point of failure (SPOF).

• If it did fail, all clients

– including MapReduce jobs

• would be unable to read, write, or list files.

• This is because the namenode is the sole repository of

– the metadata and

– the file-to-block mapping.

Page 5: Hadoop availability

SPOF

• In such an event, the whole Hadoop system would effectively be “out of service” until a new namenode could be brought online.

Page 6: Hadoop availability

Reasons for downtime

• An important part of improving availability and articulating requirements is understanding the causes of downtime.

• There are many types of failures in distributed systems, ways to classify them, and analyses of how failures result in downtime.

Page 7: Hadoop availability

Maintenance

• Maintenance to a master host normally requires a restart of the entire system.

Page 8: Hadoop availability

Hardware failures

• Hosts and their connections may fail.

• Hardware failures on the master host, or a failure in the connection between the master and the majority of the slaves, can cause system downtime.

Page 9: Hadoop availability

Software failures

• Software bugs may cause a component in the system to stop functioning or require a restart.

• For example, a bug in upgrade code could result in downtime due to data corruption.

• A dependent software component may become unavailable (e.g. the Java garbage collector enters a stop-the-world phase).

• A software bug in a master service will likely cause downtime.

Page 10: Hadoop availability

Software failures

• Software failures are a significant issue in distributed systems.

• Even with rigorous testing, software bugs account for a substantial fraction of unplanned downtime (estimated at 25-35%).

• Residual bugs in mature systems can be classified into two main categories: heisenbugs and bohrbugs.

Page 11: Hadoop availability

Heisenbug

• A bug that seems to disappear or alter its characteristics when it is observed or researched.

• A common example is a bug that occurs in a release-mode compile of a program, but not when researched under debug-mode.

• The name "heisenbug" is a pun on the "Heisenberg uncertainty principle," a quantum physics term which is commonly (yet inaccurately) used to refer to the way in which observers affect the measurements of the things that they are observing, by the act of observing alone (this is actually the observer effect, and is commonly confused with the Heisenberg uncertainty principle).

Page 12: Hadoop availability

Bohrbug

• A bug (named after the Bohr atom model) that, in contrast to a heisenbug, does not disappear or alter its characteristics when it is researched.

• A Bohrbug typically manifests itself reliably under a well-defined set of conditions.

Page 13: Hadoop availability

Software failures

• Heisenbugs tend to be more prevalent in distributed systems than in local systems.

• One reason for this is the difficulty programmers have in obtaining a coherent and comprehensive view of the interactions of concurrent processes.

Page 14: Hadoop availability

Operator errors

• People make mistakes.

• Hadoop attempts to limit operator error by simplifying administration, validating its configuration, and providing useful messages in logs and UI components; however, operator mistakes may still cause downtime.

Page 15: Hadoop availability

Strategy

[Figure: a chart relating availability strategies to the severity of database downtime (planned, unplanned, catastrophic) and the latency of database recovery. Labels include: no downtime, high availability, continuous availability, disaster recovery, online maintenance, offline maintenance, high-availability clusters, switching and warm standby, replication, and cold standby.]

Page 16: Hadoop availability

Recall

• The NameNode stores modifications to the file system as a log appended to a native file system file, edits.

• When a NameNode starts up, it reads HDFS state from an image file, fsimage, and then applies edits from the edits log file.

• It then writes new HDFS state to the fsimage and starts normal operation with an empty edits file.

• Since NameNode merges fsimage and edits files only during start up, the edits log file could get very large over time on a busy cluster.

• Another side effect of a larger edits file is that the next restart of the NameNode takes longer.
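
As a rough illustration of the cycle described above, the sketch below loads a toy checkpoint, replays a toy edits log, and writes the merged state back as a new image with an empty edits file. The plain-text file formats and class names here are invented for illustration; they are not HDFS's real on-disk formats or internal APIs.

```java
// Hypothetical sketch of the fsimage + edits startup/checkpoint cycle.
// File formats are toy stand-ins (one "OP path" line per edit).
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class CheckpointSketch {

    // Replay "OP path" lines from the edits log onto the in-memory namespace.
    static void applyEdits(Map<String, String> namespace, Path edits) throws IOException {
        if (!Files.exists(edits)) return;
        for (String line : Files.readAllLines(edits)) {
            String[] op = line.split(" ", 2);                 // e.g. "MKDIR /user/alice"
            if (op[0].equals("MKDIR")) namespace.put(op[1], "dir");
            else if (op[0].equals("DELETE")) namespace.remove(op[1]);
        }
    }

    public static void main(String[] args) throws IOException {
        Path fsimage = Paths.get("fsimage.txt");
        Path edits = Paths.get("edits.txt");

        // 1. Load the last checkpoint of the namespace into memory.
        Map<String, String> namespace = new TreeMap<>();
        if (Files.exists(fsimage)) {
            for (String line : Files.readAllLines(fsimage)) {
                String[] kv = line.split(" ", 2);
                namespace.put(kv[0], kv[1]);
            }
        }

        // 2. Apply every modification logged since that checkpoint.
        applyEdits(namespace, edits);

        // 3. Write the merged state as the new fsimage and start with an empty edits log.
        List<String> merged = new ArrayList<>();
        for (Map.Entry<String, String> e : namespace.entrySet()) {
            merged.add(e.getKey() + " " + e.getValue());
        }
        Files.write(fsimage, merged);
        Files.write(edits, new ArrayList<String>());          // truncate the edits log
    }
}
```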

Page 17: Hadoop availability

Availability – Attempt 1 - Secondary namenode

• Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.

• The secondary namenode usually runs on a separate physical machine, since it requires plenty of CPU and as much memory as the namenode to perform the merge.

• It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing.

• However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain.

• The usual course of action in this case is to copy the namenode’s metadata files that are on NFS to the secondary and run it as the new primary.

• The secondary NameNode stores the latest checkpoint in a directory which is structured the same way as the primary NameNode's directory,

• so that the checkpointed image is always ready to be read by the primary NameNode if necessary.
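
How often the secondary namenode performs this merge is configurable. A minimal sketch, assuming Hadoop 2.x-style property names (dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns); treat the names and values as illustrative rather than authoritative:

```java
// Illustrative knobs governing how often the secondary namenode checkpoints.
// Property names follow Hadoop 2.x conventions (an assumption of this sketch).
import org.apache.hadoop.conf.Configuration;

public class CheckpointIntervalConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Merge fsimage and edits at least once per hour ...
        conf.set("dfs.namenode.checkpoint.period", "3600");
        // ... or earlier, once this many un-checkpointed transactions accumulate.
        conf.set("dfs.namenode.checkpoint.txns", "1000000");
        System.out.println("checkpoint period (s): "
                + conf.get("dfs.namenode.checkpoint.period"));
    }
}
```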

Page 18: Hadoop availability

Long Recovery

• To recover from a failed namenode, an administrator starts a new primary namenode with one of the file-system metadata replicas, and configures datanodes and clients to use this new namenode.

• The new namenode is not able to serve requests until it has

– loaded its namespace image into memory,

– replayed its editlog, and

– received enough block reports from the datanodes to leave safe mode.

• On large clusters with many files and blocks, the time it takes for a namenode to start from cold can be 30 minutes or more.

• The long recovery time is a problem for routine maintenance too.

• In fact, since unexpected failure of the namenode is so rare, the case for planned downtime is actually more important in practice.
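
To make the "leave safe mode" condition above concrete, here is a back-of-the-envelope sketch: the namenode stays read-only until the fraction of blocks reported by datanodes reaches a configured threshold (in the style of dfs.namenode.safemode.threshold-pct). The numbers and arithmetic are purely illustrative.

```java
// Back-of-the-envelope sketch of the safe-mode exit condition: the namenode
// leaves safe mode only after "enough" blocks have been reported by datanodes.
public class SafeModeSketch {
    public static void main(String[] args) {
        long totalBlocks = 50_000_000L;      // blocks recorded in the namespace
        long reportedBlocks = 49_600_000L;   // blocks seen in block reports so far
        double thresholdPct = 0.999;         // in the style of dfs.namenode.safemode.threshold-pct

        long required = (long) (thresholdPct * totalBlocks);
        System.out.println("required: " + required + ", reported: " + reportedBlocks);
        System.out.println("can leave safe mode: " + (reportedBlocks >= required)); // false here
    }
}
```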

Page 19: Hadoop availability

Other roads to availability

• NameNode persists its namespace using two files:

– fsimage, which is the latest checkpoint of the namespace, and

– edits, a journal (log) of changes to the namespace since the checkpoint.

• When a NameNode starts up, it merges the fsimage and edits journal to provide an up-to-date view of the file system metadata.

• The NameNode then overwrites fsimage with the new HDFS state and begins a new edits journal.

• The secondary name-node acts as a mere checkpointer.

• The secondary name-node should be transformed into a standby name-node (SNN).

• Make it a warm standby.

• Provide real-time streaming of edits to the SNN so that it contains the up-to-date namespace state.

Page 20: Hadoop availability

Availability – Attempt 2 - Backup node / Checkpoint node

• The Checkpoint node periodically creates checkpoints of the namespace:

– it downloads fsimage and edits from the active NameNode,

– merges them locally, and uploads the new image back to the active NameNode.

• The Backup node provides

– the same checkpointing functionality as the Checkpoint node,

– as well as maintaining an in-memory, up-to-date copy of the file system namespace,

– which is always synchronized with the active NameNode state.

• The primary and backup node run on different servers,

– since their memory requirements are of the same order.

• The Backup node does not need to download fsimage and edits, since it already has an up-to-date copy of the namespace state in memory.

Page 21: Hadoop availability

Terminology

• Active NN
– the NN that is actively serving read and write operations from the clients.

• Standby NN
– this NN waits and becomes active when the Active dies or is unhealthy.
– The Backup Node as in Hadoop release 0.21 could be used to implement the Standby for the “shared-nothing” storage of the filesystem namespace.

• Cold Standby
– the Standby NN has zero state (e.g. it is started after the Active is declared dead).

• Warm Standby
– the Standby has partial state:
– it has loaded fsImage and edit logs but has not received any block reports, or
– it has loaded fsImage and rolled logs and all block reports.

• Hot Standby
– the Standby has all or most of the Active’s state and can start immediately.

Page 22: Hadoop availability

High Level Use Cases

• Planned Downtime:

– A Hadoop cluster is often shut down in order to upgrade the software or configuration.

– A Hadoop cluster of 4000 nodes takes approximately 2 hours to be restarted.

• Unplanned Downtime or Unresponsive Service:

– Failover of the Namenode service can occur due to a hardware or OS failure, a failure of the Namenode daemon, or because the Namenode daemon becomes unresponsive for a few minutes.

– While this is not as common as one may expect, the failure can occur at unexpected times and may have an impact on meeting the SLAs of some critical applications.

Page 23: Hadoop availability

Specific use cases

1. Single NN configuration; no failover.

2. Active and Standby with manual failover.

a) Standby could be cold/warm/hot.

3. Active and Standby with automatic failover.

a) Both NNs started, one automatically becomes active and the other standby

b) Active and Standby running

c) Active fails, or is unhealthy; Standby takes over.

d) Active and Standby running - Active is shutdown

e) Active and Standby running, Standby fails. Active continues.

f) Active running, Standby down for maintenance. Active dies and cannot start. Standby is started and takes over as active.

g) Both NNs started, only one comes up. It becomes active

h) Active and Standby running; Active state is unknown (e.g. disconnected from heartbeat) and Standby takes over.

Page 24: Hadoop availability

HDFS high-availability (HDFS-HA)

Page 25: Hadoop availability

HDFS-HA

• In this implementation there is a pair of namenodes in an active-standby configuration.

• In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption.

• A few architectural changes are needed to allow this to happen:

– The namenodes must use highly-available shared storage to share the edit log. (In the initial implementation of HA this will require an NFS filer, but in future releases more options will be provided, such as a BookKeeper-based system built on ZooKeeper.)

– When a standby namenode comes up it reads up to the end of the shared edit log to synchronize its state with the active namenode, and then continues to read new entries as they are written by the active namenode.

– Datanodes must send block reports to both namenodes since the block mappings are stored in a namenode’s memory, and not on disk.

– Clients must be configured to handle namenode failover, which uses a mechanism that is transparent to users.
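
A minimal sketch of what such an active-standby pair looks like from the configuration side, assuming Hadoop 2.x HA property names and an NFS-mounted shared edits directory; the nameservice, host names, and mount path ("mycluster", "nn-host1", "/mnt/filer/...") are placeholders.

```java
// Sketch of an HA namenode pair sharing an edit log on NFS.
// Property names follow Hadoop 2.x HA conventions; all hosts/paths are made up.
import org.apache.hadoop.conf.Configuration;

public class HaPairConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // One logical nameservice backed by a pair of namenodes.
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn-host1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn-host2:8020");

        // Highly available shared storage (here an NFS mount) for the edit log,
        // so the standby can tail edits written by the active namenode.
        conf.set("dfs.namenode.shared.edits.dir", "file:///mnt/filer/hdfs/ha-edits");

        System.out.println("namenodes: " + conf.get("dfs.ha.namenodes.mycluster"));
    }
}
```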

Page 26: Hadoop availability

NN HA with Shared Storage and ZooKeeper

Page 27: Hadoop availability

Failover in HDFS-HA

• If the active namenode fails, then the standby can take over very quickly (in a few tens of seconds) since it has the latest state available in memory:

– both the latest edit log entries, and

– an up-to-date block mapping.

• The actual observed failover time will be longer in practice (around a minute or so), since the system needs to be conservative in deciding that the active namenode has failed.

• In the unlikely event of the standby being down when the active fails, the administrator can still start the standby from cold.

• This is no worse than the non-HA case, and from an operational point of view it’s an improvement, since the process is a standard operational procedure built into Hadoop.

• The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller.

• Failover controllers are pluggable, but the first implementation uses ZooKeeper to ensure that only one namenode is active.

• Each namenode runs a lightweight failover controller process whose job it is to monitor its namenode for failures (using a simple heartbeating mechanism) and trigger a failover should a namenode fail.
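
The sketch below shows the core idea behind using ZooKeeper to keep a single namenode active: whichever failover controller manages to create an ephemeral znode becomes active, and if its session dies the znode vanishes so the peer can take over. This is not Hadoop's actual failover controller; the znode path, connect string, and node name are invented, and a real implementation would wait for the session to be established and watch for the znode's deletion.

```java
// Illustrative ZooKeeper-based "single active" election, not the real Hadoop
// failover controller. Connect string, znode path, and data are placeholders.
import org.apache.zookeeper.*;
import org.apache.zookeeper.data.Stat;

public class ActiveElectionSketch {
    private static final String ACTIVE_ZNODE = "/active-namenode";

    public static void main(String[] args) throws Exception {
        // A real controller would block until the SyncConnected event arrives.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 15000, event -> { });
        try {
            // EPHEMERAL: the znode is deleted automatically if this session expires,
            // which is what lets the standby notice that the active has gone away.
            zk.create(ACTIVE_ZNODE, "nn1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("became active");
        } catch (KeeperException.NodeExistsException e) {
            Stat holder = zk.exists(ACTIVE_ZNODE, false);
            System.out.println("standing by; active znode present: " + (holder != null));
        } finally {
            zk.close();
        }
    }
}
```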

Page 28: Hadoop availability

Fencing

• It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time.

• Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results.

• In order to ensure this and prevent the so-called "split-brain scenario," the administrator must configure at least one fencing method for the shared storage.

• The HA implementation goes to great lengths to ensure that the previously active namenode is prevented from doing any damage and causing corruption—a method known as fencing.

• Fencing mechanisms include:

– killing the namenode’s process,

– revoking its access to the shared storage directory (typically by using a vendor-specific NFS command), and

– disabling its network port via a remote management command.

– As a last resort, the previously active namenode can be fenced with a technique rather graphically known as STONITH, or “shoot the other node in the head”, which uses a specialized power distribution unit to forcibly power down the host machine.
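
In configuration terms, fencing methods are typically listed in order of preference, for example an SSH-based kill of the old active's process followed by a site-specific script that revokes storage access or powers the host off. The sketch below uses Hadoop-style property names (dfs.ha.fencing.methods, dfs.ha.fencing.ssh.private-key-files); the script path and key path are placeholders.

```java
// Illustrative fencing configuration: try sshfence first, then fall back to a
// site-specific shell script. Paths are placeholders, not real deployments.
import org.apache.hadoop.conf.Configuration;

public class FencingConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Methods are tried in order until one succeeds.
        conf.set("dfs.ha.fencing.methods",
                 "sshfence\nshell(/usr/local/bin/fence_old_active.sh)");
        conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");
        System.out.println(conf.get("dfs.ha.fencing.methods"));
    }
}
```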

Page 29: Hadoop availability

Client side

• Client failover is handled transparently by the client library.

• The simplest implementation uses client-side configuration to control failover.

• The HDFS URI uses a logical hostname which is mapped to a pair of namenode addresses (in the configuration file), and the client library tries each namenode address until the operation succeeds.
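
A minimal sketch of that client-side view, assuming Hadoop 2.x HA property names: the client addresses a logical nameservice ("mycluster" is a placeholder) and a failover proxy provider maps it to the pair of namenode addresses, retrying against the other namenode when one is unreachable.

```java
// Sketch of a client resolving a logical HA nameservice. The nameservice name
// and host names are placeholders; property names follow Hadoop 2.x conventions.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn-host1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn-host2:8020");
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The URI names the logical service rather than a physical namenode host.
        FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
        System.out.println("root exists: " + fs.exists(new Path("/")));
    }
}
```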

Page 30: Hadoop availability

End of session

Day – 1: Availability