Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |

www.edureka.in/hadoop



Let me test your Hadoop 1.x knowledge.

Annie’s Introduction

Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.

Can you store 1 billion files in a Hadoop 1.x cluster?
- Yes
- No

Annie’s Question

No. Even though you may have hundreds of DataNodes in the cluster, the single NameNode in Hadoop 1.x keeps all of its metadata in memory, so the entire cluster is limited to a maximum of about 50-100 million files.

Annie’s Answer
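
A back-of-the-envelope sketch of why NameNode memory caps the file count. The figure of roughly 150 bytes of heap per namespace object (file or block) is a commonly quoted rule of thumb and not a number from this deck, and one block per file is a best-case assumption.

```python
# Rough NameNode heap estimate. BYTES_PER_OBJECT is an assumed rule of
# thumb (about 150 bytes per file or block object), not a measured value.
BYTES_PER_OBJECT = 150


def namenode_heap_gib(num_files: int, blocks_per_file: int = 1) -> float:
    """Estimate the heap (in GiB) needed to hold the namespace in memory."""
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT / 2**30


if __name__ == "__main__":
    for files in (50_000_000, 100_000_000, 1_000_000_000):
        print(f"{files:>13,} files -> ~{namenode_heap_gib(files):.0f} GiB of heap")
```

Under these assumptions, 100 million files fit in a few tens of GiB of heap, while 1 billion files would need hundreds of GiB on a single machine.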




A Hadoop 1.x cluster can have multiple HDFS Namespaces.
- True
- False

Annie’s Question

False. This is not possible with Hadoop 1.x, which has a single NameNode and a single namespace.

Annie’s Answer




Which of the following is (are) a significant disadvantage in Hadoop 1.0?
- ‘Single Point Of Failure’ of NameNode
- Too much burden on Job Tracker

Annie’s Question

Both. The NameNode is a Single Point of Failure, and there is too much burden on the Job Tracker.

Annie’s Answer




Can you use the hundreds of DataNodes in a Hadoop 1.x cluster for any processing other than MapReduce?
- Yes
- No

Annie’s Question

No. Hadoop 1.x dedicates all the DataNode resources to Map and Reduce slots, with little or no room for processing any other workload.

Annie’s Answer




Can you use Hadoop for real-time processing?
- Yes
- No

Annie’s Question

No. Hadoop is designed and developed for massively parallel batch processing.

Annie’s Answer

Limitations of Hadoop 1.x

No horizontal scalability of NameNode

Does not support NameNode High Availability

Overburdened JobTracker

Not possible to run Non-MapReduce Big Data Applications on HDFS

Does not support Multi-tenancy


Hadoop 1.x – In Summary

[Figure: Hadoop 1.x architecture. A Client talks to two layers. HDFS: a NameNode, a Secondary NameNode, and DataNodes holding data blocks. MapReduce: a Job Tracker, and a Task Tracker running Map and Reduce tasks alongside each DataNode.]



Problem | Description
NameNode – No Horizontal Scalability | Single NameNode and single namespace, limited by NameNode RAM
NameNode – No High Availability (HA) | NameNode is a Single Point of Failure; needs manual recovery using the Secondary NameNode in case of failure
Job Tracker – Overburdened | Spends a significant portion of time and effort managing the life cycle of applications
MRv1 – Only Map and Reduce tasks | Humongous data stored in HDFS remains unutilized and cannot be used for other workloads such as graph processing

Hadoop 1.x - Challenges


[Figure: A single NameNode holds the namespace (NS) and block management. The Client gets block locations from the NameNode and reads data from the DataNodes. Callouts: NameNode - No High Availability; NameNode - No Horizontal Scale.]

NameNode – Scale and HA



NameNode – Single Point of Failure

Secondary NameNode:
- “Not a hot standby” for the NameNode
- Connects to the NameNode every hour*
- Housekeeping, backup of NameNode metadata
- Saved metadata can be used to rebuild a failed NameNode

Secondary NameNode: “You give me metadata every hour, I will make it secure.”

[Figure: The NameNode (single point of failure) sends metadata to the Secondary NameNode.]
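
The “housekeeping” role can be pictured as a periodic checkpoint: the saved namespace image is merged with the edits logged since the last checkpoint. The sketch below is a toy model under that assumption; the `checkpoint` function, the dict-based image, and the operation names are hypothetical and are not Hadoop APIs.

```python
from typing import Dict, List, Tuple

# Toy model: the namespace image maps a path to its list of block IDs.
Image = Dict[str, List[int]]
# An edit is (operation, path, block IDs); the operation names are hypothetical.
Edit = Tuple[str, str, List[int]]


def checkpoint(image: Image, edits: List[Edit]) -> Image:
    """Replay logged edits onto a copy of the image and return the new image."""
    merged = {path: list(blocks) for path, blocks in image.items()}
    for op, path, blocks in edits:
        if op == "create":
            merged[path] = list(blocks)
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return merged


if __name__ == "__main__":
    image = {"/logs/a.log": [1, 2]}
    edits = [("create", "/logs/b.log", [3]), ("delete", "/logs/a.log", [])]
    print(checkpoint(image, edits))
```

Because a checkpoint like this is taken only periodically, edits made after the last one are missing from the saved copy, which is one way to see why the Secondary NameNode is not a hot standby.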



Job Tracker – Overburdened

CPU: Spends a very significant portion of time and effort managing the life cycle of applications.

Network: A single listener thread communicates with thousands of Map and Reduce jobs.

[Figure: One Job Tracker serving many Task Trackers.]



MRv1 – Unpredictability in Large Clusters

As the cluster size grows and approaches 4,000 nodes:

Cascading Failures

DataNode failures result in a serious deterioration of overall cluster performance, because the attempts to re-replicate data overload the live nodes through network flooding.

Multi-tenancy

As clusters increase in size, you may want to employ them for a variety of models. MRv1 dedicates its nodes to Hadoop, and they cannot be re-purposed for other applications and workloads in an organization. With the growing popularity and adoption of cloud computing among enterprises, this becomes more important.


Unutilized Data in HDFS

Terabytes and Petabytes of data in HDFS can only be used for MapReduce processing



Introducing Hadoop 2.0

Features | Hadoop 1.x | Hadoop 2.0
HDFS Federation | One NameNode and one namespace | Multiple NameNodes and namespaces
NameNode High Availability | Not present | Highly available
YARN - Processing Control and Multi-tenancy | JobTracker, TaskTracker | Resource Manager, Node Manager, App Master, Capacity Scheduler

Other important Hadoop 2.0 features:
- HDFS snapshots
- NFSv3 access to data in HDFS
- Support for running Hadoop on MS Windows
- Binary compatibility for MapReduce applications built on Hadoop 1.0
- Substantial amount of integration testing with the rest of the projects (such as Pig and Hive) in the Hadoop ecosystem



[Figure: Hadoop 1.0 vs Hadoop 2.0 storage architecture. Hadoop 1.0: a single NameNode provides the namespace (NS) and block management, with block storage on the DataNodes. Hadoop 2.0: NameNodes NN-1 ... NN-k ... NN-n manage namespaces NS1 ... NSk ... NSn, each with its own block pool (Pool 1 ... Pool k ... Pool n), over common storage on DataNode 1 ... DataNode m.]

Hadoop 2.0 Cluster Architecture - Federation

http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html

Annie’s Question

How does HDFS Federation help HDFS scale horizontally?
A) Reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the file system namespace.
B) Provides cross-data centre (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster.



Annie’s Answer

(A). In order to scale the name service horizontally, HDFS federation uses multiple independent NameNodes. The NameNodes are federated, that is, the NameNodes are independent and do not require coordination with each other.


Annie’s Question

You have configured two NameNodes to manage /marketing and /finance namespaces respectively. What will happen if you try to ‘put’ a file in /accounting directory?



Annie’s Answer

The ‘put’ will fail. Neither namespace manages the /accounting directory, so you will get an IOException with a “No such file or directory” error.
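
The behaviour in this answer can be sketched as a client-side mount table that maps path prefixes to NameNodes. This is a toy model: the `MOUNT_TABLE` dict, the `resolve` function, and the `nn-marketing` and `nn-finance` host names are hypothetical, and Python's `FileNotFoundError` stands in for the IOException mentioned above.

```python
from typing import Dict

# Hypothetical mount table: path prefix -> NameNode that owns that namespace.
MOUNT_TABLE: Dict[str, str] = {
    "/marketing": "nn-marketing:8020",
    "/finance": "nn-finance:8020",
}


def resolve(path: str, mounts: Dict[str, str] = MOUNT_TABLE) -> str:
    """Return the NameNode responsible for `path`, or raise if none is."""
    for prefix, namenode in mounts.items():
        if path == prefix or path.startswith(prefix + "/"):
            return namenode
    raise FileNotFoundError(f"No such file or directory: {path}")


if __name__ == "__main__":
    print(resolve("/finance/q1/report.csv"))
    try:
        resolve("/accounting/ledger.csv")
    except FileNotFoundError as err:
        print("put failed:", err)
```

Each NameNode only knows its own part of the namespace, so a path outside every configured prefix has no owner to accept the write.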



[Figure: Hadoop 2.0 cluster architecture with HA. HDFS (NameNode High Availability): an Active NameNode and a Standby NameNode use shared edit logs. All namespace edits are logged to shared NFS storage by a single writer (fencing); the Standby NameNode reads the edit logs and applies them to its own namespace. YARN (Next Generation MapReduce): a Resource Manager, plus worker machines that each run a DataNode and a Node Manager hosting Containers and App Masters. A Client connects to the cluster.]

HDFS HIGH AVAILABILITY

Hadoop 2.0 Cluster Architecture - HA

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html


[Figure: NameNode recovery in Hadoop 1.0 vs High Availability in Hadoop 2.0. Hadoop 1.0: the NameNode hands its metadata (FSImage and edit logs) to the Secondary NameNode, and a failed NameNode is recovered manually using the Secondary NameNode. Hadoop 2.0: an Active NameNode and a Standby NameNode, with automatic failover to the Standby NameNode.]
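
The shared edit log idea from the HA architecture can be sketched as follows: only the current writer may append edits, and a standby replays them so that it is ready to take over. This is a toy model; `SharedEditLog`, `NameNode`, and the epoch-based fencing rule are illustrative assumptions, not Hadoop classes or its actual fencing mechanism.

```python
from typing import List, Set, Tuple


class SharedEditLog:
    """Toy shared log that accepts appends only from the newest writer epoch."""

    def __init__(self) -> None:
        self.entries: List[Tuple[str, str]] = []
        self.epoch = 0

    def new_writer(self) -> int:
        """Grant write access to a new writer and fence off older ones."""
        self.epoch += 1
        return self.epoch

    def append(self, epoch, edit: Tuple[str, str]) -> None:
        if epoch != self.epoch:
            raise PermissionError("fenced: a newer writer owns the log")
        self.entries.append(edit)


class NameNode:
    def __init__(self, log: SharedEditLog) -> None:
        self.log = log
        self.namespace: Set[str] = set()
        self.applied = 0   # number of log entries already replayed
        self.epoch = None  # None while in standby

    def tail(self) -> None:
        """Replay edits not yet applied to this node's namespace."""
        for op, path in self.log.entries[self.applied:]:
            if op == "create":
                self.namespace.add(path)
            else:
                self.namespace.discard(path)
        self.applied = len(self.log.entries)

    def become_active(self) -> None:
        self.tail()                        # catch up before serving writes
        self.epoch = self.log.new_writer()

    def create(self, path: str) -> None:
        self.log.append(self.epoch, ("create", path))
        self.tail()


if __name__ == "__main__":
    log = SharedEditLog()
    active, standby = NameNode(log), NameNode(log)
    active.become_active()
    active.create("/data/file1")
    standby.become_active()            # failover: the old active is now fenced
    print(sorted(standby.namespace))
    try:
        active.create("/data/file2")
    except PermissionError as err:
        print("old active rejected:", err)
```

The single-writer rule is what keeps two NameNodes from both believing they are active and writing conflicting edits.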


Annie’s Question

NameNode HA was developed to overcome which of the following disadvantages in Hadoop 1.0?
a) Single Point Of Failure of NameNode
b) Too much burden on Job Tracker



Annie’s Answer

Single Point of Failure of NameNode.



[Figure: Hadoop ecosystem stacks. Hadoop 1.x: Apache Oozie (Workflow), Pig Latin (Data Analysis), Hive (DW System), the MapReduce Framework, and HBase, on top of HDFS (Hadoop Distributed File System). Hadoop 2.0: the same components, plus YARN (Cluster Resource Management) and other YARN frameworks (MPI, Giraph).]

YARN adds a more general interface to run non-MapReduce jobs (such as Graph Processing) within the Hadoop framework

YARN and Hadoop Ecosystem
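
One way to picture the “more general interface”: the Resource Manager hands out generic containers and does not care whether a MapReduce job or a graph job runs inside them. The sketch is a toy model; this `ResourceManager` class, its `allocate` method, the application names, and the memory figures are hypothetical and are not YARN's API.

```python
from typing import Dict, List, Tuple


class ResourceManager:
    """Toy resource manager that tracks free memory per Node Manager."""

    def __init__(self, node_memory_mb: Dict[str, int]) -> None:
        self.free_mb = dict(node_memory_mb)

    def allocate(self, app: str, containers: int, memory_mb: int) -> List[Tuple[str, str]]:
        """Place up to `containers` containers for `app`; return (app, node) pairs."""
        granted = []
        for node in self.free_mb:
            while len(granted) < containers and self.free_mb[node] >= memory_mb:
                self.free_mb[node] -= memory_mb
                granted.append((app, node))
        return granted


if __name__ == "__main__":
    rm = ResourceManager({"node1": 4096, "node2": 4096})
    print(rm.allocate("mapreduce-wordcount", containers=3, memory_mb=1024))
    print(rm.allocate("giraph-pagerank", containers=2, memory_mb=2048))
    print(rm.free_mb)
```

Both applications draw from the same pool of nodes, which is the point of separating resource management from any one processing framework.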


[Figure: Workloads that can run on YARN]
- BATCH (MapReduce)
- INTERACTIVE (Tez)
- ONLINE (HBase)
- STREAMING (Storm, S4, …)
- GRAPH (Giraph)
- IN-MEMORY (Spark)
- HPC MPI (OpenMPI)
- OTHER (Search, Weave, ...)

YARN – Moving beyond MapReduce

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html


- Organizes jobs into queues
- Queue shares as percentages of the cluster
- FIFO scheduling within each queue
- Data locality-aware scheduling

Hierarchical Queues: To manage resources within an organization.

Capacity Guarantees: A fraction of the total available capacity is allocated to each queue.

Security: To safeguard applications from other users.

Elasticity: Resources are available to queues in a predictable and elastic manner.

Multi-tenancy: A set of limits to prevent over-utilization of resources by a single application.

Operability: Runtime configuration of queues.

Resource-based scheduling: If needed, applications can request more resources than the default.

Multi-tenancy - Capacity Scheduler
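
A minimal sketch of the first three points: jobs are organized into queues, each queue is guaranteed a percentage of the cluster, and jobs within a queue are served FIFO. This `CapacityScheduler` class, the queue names, and the slot counts are hypothetical, and the toy ignores elasticity, queue hierarchy, and data locality.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class CapacityScheduler:
    """Toy scheduler: per-queue capacity shares, FIFO within each queue."""

    def __init__(self, total_slots: int, shares: Dict[str, int]) -> None:
        if sum(shares.values()) != 100:
            raise ValueError("queue shares must add up to 100 percent")
        # Guaranteed slots per queue, derived from its percentage share.
        self.capacity = {q: total_slots * pct // 100 for q, pct in shares.items()}
        self.used = {q: 0 for q in shares}
        self.pending: Dict[str, Deque[Tuple[str, int]]] = {q: deque() for q in shares}

    def submit(self, queue: str, job: str, slots: int) -> None:
        self.pending[queue].append((job, slots))

    def schedule(self) -> List[str]:
        """Start jobs in FIFO order while each queue stays within its share."""
        started = []
        for queue, jobs in self.pending.items():
            while jobs and self.used[queue] + jobs[0][1] <= self.capacity[queue]:
                job, slots = jobs.popleft()
                self.used[queue] += slots
                started.append(job)
        return started


if __name__ == "__main__":
    sched = CapacityScheduler(total_slots=100, shares={"marketing": 30, "finance": 70})
    sched.submit("marketing", "m1", 20)
    sched.submit("marketing", "m2", 20)   # would exceed the 30-slot share
    sched.submit("finance", "f1", 50)
    print(sched.schedule())
```

In this toy, a job that does not fit simply waits at the head of its queue, so one tenant's backlog never eats into another tenant's guaranteed share.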


Annie’s Question

YARN was developed to overcome which of the following disadvantages in the Hadoop 1.0 MapReduce framework?
a) Single Point Of Failure of NameNode
b) Too much burden on Job Tracker



Annie’s Answer

Too much burden on Job Tracker.



Hadoop 2.0 – In Summary

[Figure: Hadoop 2.0 architecture. The Client talks to HDFS (distributed data storage, with NameNode High Availability) and YARN (distributed data processing, Next Generation MapReduce). Masters: the Active NameNode and Standby NameNode, which use shared edit logs or Journal Nodes, and the Resource Manager, made up of the Scheduler and the Applications Manager (AsM). Slaves: each runs a DataNode and a Node Manager hosting Containers and App Masters.]






Can you use Hadoop 2.0 for real-time processing?
- Yes
- No

Annie’s Question

No. Even though YARN in Hadoop 2.0 supports multiple frameworks for different workloads other than batch, you need Storm or S4 for real-time processing.

Annie’s Answer


What about Real-time Processing?

Hadoop is good for batch, but how do I process Big Data in real time?



Storm is coming….

APACHE STORM

The Real-time Hadoop
• Continuous computation system

Distributed, Reliable, Fault-tolerant, Scalable and Robust
• Suitable for Big Data processing
• Guarantees no data loss

Programming Language agnostic
• JSON-based for Ruby, Python etc.

Use cases
• Stream processing
• Distributed RPC
• Continuous computation


Hadoop Vs. Storm

Differences

Hadoop | Storm
Fundamentally a batch processing system | Real-time processing: processes unterminated streams of data (e.g. Twitter feeds) as the data arrives
MapReduce jobs run to completion | Topologies (computation graphs) run forever
Stateful nodes | Stateless nodes

Similarities

Hadoop | Storm
Scalable | Scalable
Guarantees no data loss | Guarantees no data loss
Open source | Open source
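
To make “process data as it arrives” concrete, here is a toy spout-and-bolt pipeline built from Python generators. It mimics the shape of a topology only; it does not use Storm's API, and the names `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical.

```python
from collections import Counter
from itertools import islice
from typing import Dict, Iterable, Iterator


def sentence_spout(source: Iterable[str]) -> Iterator[str]:
    """Emit tuples one at a time; a real stream would never end."""
    for sentence in source:
        yield sentence


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Split each incoming sentence into lower-cased words."""
    for sentence in sentences:
        for word in sentence.lower().split():
            yield word


def count_bolt(words: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Keep running counts and emit a snapshot after every word."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
        yield dict(counts)


if __name__ == "__main__":
    feed = ["storm processes streams", "hadoop processes batches"]
    topology = count_bolt(split_bolt(sentence_spout(feed)))
    # Results are available after each tuple, not only when the input ends.
    for snapshot in islice(topology, 3):
        print(snapshot)
```

A batch job would produce its counts once, after reading all of the input; here a result is available after every word, which matches the "run forever" row in the table.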

Storm Use Cases

Data Normalization
• Groupon uses Storm to build real-time data integration systems.

Analytics
• Storm powers Twitter’s publisher analytics product, processing every tweet and click that happens on Twitter to provide analytics for Twitter's publisher partners.
• Flipboard uses Storm across a wide range of services, ranging from content search to real-time analytics to generating custom magazine feeds.

Log processing
• Alibaba uses Storm to process application logs and data changes in databases to supply real-time data stats for data apps.
• NaviSite uses Storm in its server log monitoring and auditing system.

Thank You

See you in the next class
