Hadoop 1.0 vs Hadoop 2.0
Transcript of Hadoop 1.0 vs Hadoop 2.0
www.edureka.in/hadoop - Slide 1
-
Slide 2
Topics for Today
Hadoop Architecture
Problems with Hadoop 1.0
Solution: Hadoop 2.0 and YARN
Hadoop 2.0 New Features
HDFS High Availability
HDFS Federation
YARN and MRv2
YARN and Hadoop ecosystem
YARN Components
JobTracker and Job Submission
MR Application Execution in YARN
-
Slide 3
Hadoop is a system for large scale data processing.
It has two main components:
HDFS - Hadoop Distributed File System (Storage)
- Distributed across nodes
- Natively redundant
- NameNode tracks locations
MapReduce (Processing)
- Splits a task across processors near the data and assembles the results
- Self-healing, high bandwidth, clustered storage
- JobTracker manages the TaskTrackers
Hadoop Core Components
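The storage side of these components is wired up through a few configuration properties. A minimal sketch, assuming a Hadoop 1.x cluster; the hostname is a placeholder, not from the slides:

```xml
<!-- core-site.xml: point clients at the NameNode (Hadoop 1.x property name) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>

<!-- hdfs-site.xml: "natively redundant" - each block is stored on 3 DataNodes -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```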
-
Slide 4
[Diagram: the MapReduce engine (JobTracker on the Admin Node, TaskTrackers on the workers) layered over the HDFS cluster (NameNode on the Admin Node, DataNodes on the workers); each DataNode runs a TaskTracker]
Hadoop Core Components (Contd.)
-
Slide 5
[Diagram: the Client talks to HDFS (NameNode, Secondary NameNode, DataNodes holding data blocks) and to MapReduce (JobTracker, with a TaskTracker co-located on each DataNode)]
Hadoop 1.0 In Summary
-
Slide 6
Hadoop 1.0 - Challenges
Problem / Description
- NameNode, no horizontal scalability: a single NameNode and a single namespace, limited by the NameNode's RAM.
- NameNode, no high availability (HA): the NameNode is a single point of failure; a failure requires manual recovery using the Secondary NameNode.
- JobTracker, overburdened: spends a significant portion of its time and effort managing the life cycle of applications.
- MRv1, only Map and Reduce tasks: the humongous data stored in HDFS remains under-utilized and cannot be used for other workloads such as graph processing.
-
Slide 7
NameNode HA and Scale
NameNode - No Horizontal Scale
NameNode - No High Availability
[Diagram: a single NameNode owns the namespace (NS) and block management; clients get block locations from it and read data directly from the DataNodes]
-
Slide 8
Secondary NameNode:
- Not a hot standby for the NameNode
- Connects to the NameNode every hour (by default)
- Performs housekeeping and keeps a backup of the NameNode metadata
- The saved metadata can be used to rebuild a failed NameNode
[Diagram: the NameNode, a single point of failure, ships its metadata to the Secondary NameNode: "You give me metadata every hour, I will make it secure"]
Name Node Single Point of Failure
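The hourly checkpoint above is configurable. A sketch of the relevant Hadoop 1.x properties in core-site.xml; the values shown are the defaults as I recall them, so verify against your distribution:

```xml
<!-- core-site.xml: Secondary NameNode checkpointing (Hadoop 1.x) -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value> <!-- seconds between checkpoints: the "every hour" -->
</property>
<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value> <!-- or sooner, once the edit log reaches 64 MB -->
</property>
```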
-
Slide 9
Job Tracker Overburdened
- CPU: spends a very significant portion of time and effort managing the life cycle of applications
- Network: a single listener thread to communicate with thousands of Map and Reduce jobs
[Diagram: one JobTracker coordinating many TaskTrackers]
-
Slide 10
MRv1 Unpredictability in Large Clusters
Cascading failures: as the cluster grows toward 4,000 nodes, DataNode failures cause a serious deterioration in overall cluster performance, because attempts to re-replicate data overload the live nodes and flood the network.
Multi-tenancy: as clusters increase in size, you may want to employ them for a variety of workloads. MRv1 dedicates its nodes to Hadoop; they cannot be re-purposed for other applications and workloads in an organization. With the growing popularity and adoption of cloud computing among enterprises, this becomes ever more important.
-
Slide 11
Terabytes and Petabytes of data in HDFS can be used only for MapReduce processing
Unutilized Data in HDFS
-
Slide 12
Hadoop 2.0 New Features
Property | Hadoop 1.0 | Hadoop 2.0
Federation | One NameNode and namespace | Multiple NameNodes and namespaces
High Availability | Not present | Highly available
YARN (processing control and multi-tenancy) | JobTracker, TaskTracker | ResourceManager, NodeManager, App Master, Capacity Scheduler
Other important Hadoop 2.0 features:
- HDFS snapshots
- NFSv3 access to data in HDFS
- Support for running Hadoop on MS Windows
- Binary compatibility for MapReduce applications built on Hadoop 1.0
- A substantial amount of integration testing with the rest of the projects in the Hadoop ecosystem (such as Pig and Hive)
-
Slide 13
YARN and Hadoop Ecosystem
[Diagram: the Hadoop 1.0 stack - Apache Oozie (workflow), Pig Latin (data analysis), Hive (DW system), and HBase on the MapReduce framework over HDFS - versus the Hadoop 2.0 stack, where the MapReduce framework, HBase, and other frameworks (MPI, Giraph) all run on YARN (cluster resource management) over HDFS]
YARN adds a more general interface to run non-MapReduce jobs (such as Graph Processing) within the Hadoop framework
-
Slide 14
Hadoop 2.0 Cluster Architecture - Federation
[Diagram: Hadoop 1.0 - a single NameNode owns the namespace (NS) and block management over DataNode block storage. Hadoop 2.0 - multiple NameNodes (NN-1 ... NN-k ... NN-n), each with its own namespace (NS1 ... NSk ... NSn) and its own block pool (Pool 1 ... Pool k ... Pool n), share common storage across DataNodes 1..m]
Hadoop 1.0 vs Hadoop 2.0 block pools
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
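A federated deployment along these lines is configured by declaring the nameservices and one RPC address per NameNode. A minimal sketch, assuming two namespaces; the hostnames are hypothetical:

```xml
<!-- hdfs-site.xml: HDFS Federation with two independent NameNodes -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>
```

Every DataNode reads this list and registers with all the NameNodes, maintaining one block pool per namespace.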
-
Slide 15
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.
How does HDFS Federation help HDFS scale horizontally?
- It reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the filesystem namespace.
- It provides cross-data-center (non-local) support for HDFS, allowing a cluster administrator to split the block storage outside the local cluster.
Annie's Question
-
Slide 16
A. In order to scale the name service horizontally, HDFS Federation uses multiple independent NameNodes. The NameNodes are federated, that is, they are independent and don't require coordination with each other.
Annie's Answer
-
Slide 17
You have configured two NameNodes to manage /marketing and /finance respectively. What will happen if you try to put a file into the /accounting directory?
Annie's Question
-
Slide 18
The put will fail. Neither namespace manages that file, and you will get an IOException with a "No such file or directory" error.
Annie's Answer
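The /marketing and /finance split in this question is typically presented to clients through a ViewFS mount table, which maps client-side paths onto the federated namespaces. A sketch; the mount-table name and hostnames are assumptions:

```xml
<!-- core-site.xml on the client: ViewFS mount table over two federated NameNodes -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://clusterX</value>
</property>
<property>
  <name>fs.viewfs.mounttable.clusterX.link./marketing</name>
  <value>hdfs://nn-marketing.example.com:8020/marketing</value>
</property>
<property>
  <name>fs.viewfs.mounttable.clusterX.link./finance</name>
  <value>hdfs://nn-finance.example.com:8020/finance</value>
</property>
```

A put to /accounting matches no mount link, which is why the operation fails.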
-
Slide 19
[Diagram: Hadoop 2.0 cluster - on the HDFS side, an Active NameNode and a Standby NameNode share edit logs over NFS: all namespace edits are logged to shared NFS storage with a single writer (enforced by fencing), and the Standby reads the edit logs and applies them to its own namespace; on the YARN side, a ResourceManager serves clients, while each worker runs a NodeManager, a DataNode, and containers hosting an App Master]
Hadoop 2.0 Cluster Architecture - HDFS High Availability
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
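The NFS-based HA pictured above boils down to a few hdfs-site.xml properties. A trimmed sketch (the nameservice name, NFS path, and per-NameNode RPC addresses are placeholders, and the RPC addresses are omitted here for brevity):

```xml
<!-- hdfs-site.xml: NameNode HA with shared edits on NFS -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/filer1/dfs/ha-name-dir-shared</value> <!-- the NFS mount -->
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value> <!-- enforces the single-writer rule on failover -->
</property>
```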
-
Slide 20
NameNode Recovery vs. Failover
- NameNode recovery in Hadoop 1.0: manually recover a failed NameNode from the Secondary NameNode's saved metadata (FSImage and edit logs).
- High availability in Hadoop 2.0: automatic failover from the Active NameNode to the Standby NameNode.
-
Slide 21
HDFS HA was developed to overcome which of the following disadvantages in Hadoop 1.0?
- Single point of failure of the NameNode
- Only one version can be run in classic MapReduce
- Too much burden on the JobTracker
Annie's Question
-
Slide 22
Single point of failure of the NameNode.
Annie's Answer
-
Slide 23
BATCH (MapReduce)
INTERACTIVE (Tez)
ONLINE (HBase)
STREAMING (Storm, S4, ...)
GRAPH (Giraph)
IN-MEMORY (Spark)
HPC MPI (OpenMPI)
OTHER (Search, Weave, ...)
YARN Moving beyond MapReduce
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
-
Slide 24
Multi-tenancy - Capacity Scheduler
- Organizes jobs into queues
- Queue shares as percentages of the cluster
- FIFO scheduling within each queue
- Data-locality-aware scheduling
Hierarchical queues: to manage resources within an organization.
Capacity guarantees: a fraction of the total available capacity is allocated to each queue.
Security: to safeguard applications from other users.
Elasticity: resources are available to queues in a predictable and elastic manner.
Multi-tenancy: a set of limits prevents over-utilization of resources by a single application.
Operability: runtime configuration of queues.
Resource-based scheduling: if needed, applications can request more resources than the default.
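The queue shares described above live in capacity-scheduler.xml. A minimal sketch for two queues; the queue names and the 70/30 split are examples, not from the slides:

```xml
<!-- capacity-scheduler.xml: two queues sharing the cluster capacity -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>marketing,finance</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.marketing.capacity</name>
  <value>70</value> <!-- percent of the cluster guaranteed to this queue -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.finance.capacity</name>
  <value>30</value>
</property>
```

Elasticity then lets a queue borrow idle capacity beyond its guarantee, up to any configured maximum.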
-
Slide 25
Executing MapReduce Application on YARN
MapReduce Application Execution
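Running MapReduce applications on YARN hinges on a single switch in mapred-site.xml; everything else in the execution flow follows from it:

```xml
<!-- mapred-site.xml: route job submission to YARN instead of the classic JobTracker -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value> <!-- other accepted values: local, classic -->
</property>
```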
-
Slide 26
[Diagram: job submission in Hadoop 1.0]
1. The user copies the input files into DFS
2. The user submits the job through the Client
3. The Client gets the input files' info from DFS
4. The Client creates the splits
5. The Client uploads the job information (Job.xml, Job.jar) to DFS
6. The Client submits the job to the JobTracker
JobTracker
Job Execution
-
Slide 27
[Diagram: job initialization in Hadoop 1.0]
6. The Client submits the job to the JobTracker
7. The JobTracker initializes the job in the job queue
8. The JobTracker reads the job files (Job.xml, Job.jar) from DFS
9. The JobTracker creates maps and reduces: as many maps as there are input splits
JobTracker (Contd.)
-
Slide 28
[Diagram: task assignment in Hadoop 1.0]
10. The TaskTrackers (on hosts H1-H5) send heartbeats to the JobTracker
11. The JobTracker picks tasks from the job queue (data-local if possible)
12. The JobTracker assigns the tasks to the TaskTrackers
JobTracker (Contd.)
-
Slide 29
YARN = Yet Another Resource Negotiator
[Diagram: the ResourceManager coordinates the NodeManagers; each NodeManager hosts containers, and each application has an App Master running in a container; the flows between components are MapReduce status, job submission, node status, and resource requests]
Resource Manager: cluster-level resource manager; long-lived, runs on high-quality hardware
Node Manager: one per DataNode; monitors resources on the DataNode
Application Master: one per application; short-lived; manages tasks and scheduling
YARN Beyond MapReduce
JobHistoryServer
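Wiring these daemons together takes only a little yarn-site.xml. A minimal sketch; the hostname is a placeholder, and the aux-service name below is the Hadoop 2.2+ spelling, so check your version:

```xml
<!-- yarn-site.xml: minimal ResourceManager / NodeManager wiring -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>rm.example.com</value> <!-- NodeManagers and clients find the RM here -->
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value> <!-- needed so MapReduce can shuffle on YARN -->
</property>
```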
-
Slide 30
YARN MR Application Execution Flow
[Diagram: client-side submission]
1. The user runs the application; a Job object is created in the client JVM
2. The client gets a new application ID from the ResourceManager
3. The client copies the job resources to HDFS
4. The client submits the application to the ResourceManager
-
Slide 31
YARN MR Application Execution Flow (Contd.)
5. The ResourceManager asks a NodeManager (on the management node) to start the MR AppMaster container
6. The NodeManager creates the container and launches the MR AppMaster
7. The MR AppMaster retrieves the input splits from HDFS
8. The MR AppMaster requests resources from the ResourceManager
9. The MR AppMaster asks a NodeManager on a DataNode to start a task container
-
Slide 32
YARN MR Application Execution Flow (Contd.)
10. The NodeManager creates the task container, which runs YarnChild in a separate task JVM
11. YarnChild acquires the job resources from HDFS
12. YarnChild executes the Map or Reduce task
-
Slide 33
YARN MR Application Execution Flow (Contd.)
The Job object in the client JVM polls the MR AppMaster for status, and the running tasks update their status to the MR AppMaster.
-
Slide 34
Hadoop 2.0 : YARN Workflow
[Diagram: the ResourceManager (Scheduler plus Applications Manager (AsM)) coordinates many NodeManagers; MR AppMaster 1 runs containers 1.1 and 1.2 on other nodes, while MR AppMaster 2 runs containers 2.1, 2.2, and 2.3, each on a different NodeManager]
-
Slide 35
YARN MR Application Execution Flow
MapReduce application execution covers submission, initialization, task assignment, task memory, status updates, and failure recovery.
- Client: submits a MapReduce application.
- Resource Manager: manages resource utilization across the Hadoop cluster.
- Node Manager: runs on each DataNode; creates execution containers; monitors container usage.
- MapReduce Application Master: coordinates and manages MapReduce tasks; negotiates with the Resource Manager to schedule tasks; the tasks themselves are started by the NodeManager(s).
- HDFS: shares resources and task artefacts among the YARN components.
Application vs. Job Execution
-
Slide 36
Hadoop 2.0 In Summary
[Diagram: Masters - for HDFS (distributed data storage, NameNode high availability), an Active NameNode and a Standby NameNode with shared edit logs, plus a Secondary NameNode; for YARN (distributed data processing, next-generation MapReduce), a ResourceManager with a Scheduler and an Applications Manager (AsM). Slaves - each node runs a DataNode and a NodeManager hosting containers and an App Master. The App Master manages the application and can request containers and run code in them.]
-
Slide 37
YARN/MRv2 was developed to overcome which of the following disadvantages in MRv1?
- Single point of failure of the NameNode
- Only one version can be run in classic MapReduce
- Too much burden on the JobTracker
Annie's Question
-
Slide 38
Too much burden on the JobTracker.
Annie's Answer
-
Slide 39
In YARN (MRv2), the functionality of the JobTracker has been replaced by which of the following YARN features?
- Job scheduling
- Task monitoring
- Resource management
- Node management
Annie's Question
-
Slide 40
Task monitoring and resource management. The fundamental idea of MRv2 is to split the two major functionalities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager (RM) for resources and a per-application ApplicationMaster (AM) for task monitoring.
Annie's Answer
-
Slide 41
In YARN, which of the following daemons takes care of the containers and the resource utilization by the applications?
- Node Manager
- Job Tracker
- Task Tracker
- Application Master
- Resource Manager
Annie's Question
-
Slide 42
ApplicationMaster.
Annie's Answer
-
Slide 43
Can we run MRv1 jobs on a YARN/MRv2-enabled Hadoop cluster?
- Yes
- No
Annie's Question
-
Slide 44
Yes. You need to recompile the jobs against MRv2 after enabling YARN/MRv2 in order to run them successfully on the MRv2 Hadoop cluster.
Annie's Answer
-
Slide 45
Which of the following YARN/MRv2 daemons is responsible for launching the tasks?
- Task Tracker
- Resource Manager
- ApplicationMaster
- ApplicationMasterServer
Annie's Question
-
Slide 46
ApplicationMaster.
Annie's Answer
-
Thank you! See you in class next week.
-
NN High Availability - Quorum Journal Manager
[Diagram: the Active NameNode writes namespace modifications to edit logs on a quorum of JournalNodes (only the Active NameNode is the writer); the Standby NameNode reads the edit logs and applies them to its own namespace; the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both]
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html
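Switching the shared edits from NFS to the JournalNode quorum pictured above changes only one property relative to the NFS setup; a sketch with hypothetical hostnames:

```xml
<!-- hdfs-site.xml: shared edits on a JournalNode quorum instead of NFS -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
```

The quorum removes the NFS filer as a single point of failure: writes succeed as long as a majority of JournalNodes are up.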