Hadoop 1.0 vs Hadoop 2.0
Transcript of Hadoop 1.0 vs Hadoop 2.0
www.edureka.in/hadoop - Slide 1
-
Slide 2
Topics for Today
Hadoop Architecture
Problems with Hadoop 1.0
Solution: Hadoop 2.0 and YARN
Hadoop 2.0 New Features
HDFS High Availability
HDFS Federation
YARN and MRv2
YARN and Hadoop ecosystem
YARN Components
JobTracker and Job Submission
MR Application Execution in YARN
-
Slide 3
Hadoop is a system for large scale data processing.
It has two main components:
HDFS - Hadoop Distributed File System (Storage)
- Distributed across nodes
- Natively redundant
- NameNode tracks locations
MapReduce (Processing)
- Splits a task across processors near the data and assembles the results
- Self-healing, high bandwidth, clustered storage
- JobTracker manages the TaskTrackers
Hadoop Core Components
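The storage side of these components is wired up through a few configuration properties. A minimal sketch, assuming a Hadoop 1.x cluster; the hostname is a placeholder, not from the slides:

```xml
<!-- core-site.xml: point clients at the NameNode (Hadoop 1.x property name) -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:8020</value>
</property>

<!-- hdfs-site.xml: "natively redundant" - each block is stored on 3 DataNodes -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```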
-
Slide 4
[Diagram: the MapReduce engine (JobTracker on the Admin Node, TaskTrackers on the workers) layered over the HDFS cluster (NameNode on the Admin Node, DataNodes on the workers); each DataNode runs a TaskTracker]
Hadoop Core Components (Contd.)
-
Slide 5
[Diagram: the Client talks to HDFS (NameNode, Secondary NameNode, DataNodes holding data blocks) and to MapReduce (JobTracker, with a TaskTracker co-located on each DataNode)]
Hadoop 1.0 In Summary
-
Slide 6
Hadoop 1.0 - Challenges
Problem / Description
- NameNode, no horizontal scalability: a single NameNode and a single namespace, limited by the NameNode's RAM.
- NameNode, no high availability (HA): the NameNode is a single point of failure; a failure requires manual recovery using the Secondary NameNode.
- JobTracker, overburdened: spends a significant portion of its time and effort managing the life cycle of applications.
- MRv1, only Map and Reduce tasks: the humongous data stored in HDFS remains under-utilized and cannot be used for other workloads such as graph processing.
-
Slide 7
NameNode HA and Scale
NameNode - No Horizontal Scale
NameNode - No High Availability
[Diagram: a single NameNode owns the namespace (NS) and block management; clients get block locations from it and read data directly from the DataNodes]
-
Slide 8
Secondary NameNode:
- Not a hot standby for the NameNode
- Connects to the NameNode every hour (by default)
- Performs housekeeping and keeps a backup of the NameNode metadata
- The saved metadata can be used to rebuild a failed NameNode
[Diagram: the NameNode, a single point of failure, ships its metadata to the Secondary NameNode: "You give me metadata every hour, I will make it secure"]
Name Node Single Point of Failure
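The hourly checkpoint above is configurable. A sketch of the relevant Hadoop 1.x properties in core-site.xml; the values shown are the defaults as I recall them, so verify against your distribution:

```xml
<!-- core-site.xml: Secondary NameNode checkpointing (Hadoop 1.x) -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value> <!-- seconds between checkpoints: the "every hour" -->
</property>
<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value> <!-- or sooner, once the edit log reaches 64 MB -->
</property>
```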
-
Slide 9
Job Tracker Overburdened
- CPU: spends a very significant portion of time and effort managing the life cycle of applications
- Network: a single listener thread to communicate with thousands of Map and Reduce jobs
[Diagram: one JobTracker coordinating many TaskTrackers]
-
Slide 10
MRv1 Unpredictability in Large Clusters
Cascading failures: as the cluster grows toward 4,000 nodes, DataNode failures cause a serious deterioration in overall cluster performance, because attempts to re-replicate data overload the live nodes and flood the network.
Multi-tenancy: as clusters increase in size, you may want to employ them for a variety of workloads. MRv1 dedicates its nodes to Hadoop; they cannot be re-purposed for other applications and workloads in an organization. With the growing popularity and adoption of cloud computing among enterprises, this becomes ever more important.
-
Slide 11
Terabytes and Petabytes of data in HDFS can be used only for MapReduce processing
Unutilized Data in HDFS
-
Slide 12
Hadoop 2.0 New Features
Property | Hadoop 1.0 | Hadoop 2.0
Federation | One NameNode and namespace | Multiple NameNodes and namespaces
High Availability | Not present | Highly available
YARN (processing control and multi-tenancy) | JobTracker, TaskTracker | ResourceManager, NodeManager, App Master, Capacity Scheduler
Other important Hadoop 2.0 features:
- HDFS snapshots
- NFSv3 access to data in HDFS
- Support for running Hadoop on MS Windows
- Binary compatibility for MapReduce applications built on Hadoop 1.0
- A substantial amount of integration testing with the rest of the projects in the Hadoop ecosystem (such as Pig and Hive)
-
Slide 13
YARN and Hadoop Ecosystem
[Diagram: the Hadoop 1.0 stack - Apache Oozie (workflow), Pig Latin (data analysis), Hive (DW system), and HBase on the MapReduce framework over HDFS - versus the Hadoop 2.0 stack, where the MapReduce framework, HBase, and other frameworks (MPI, Giraph) all run on YARN (cluster resource management) over HDFS]
YARN adds a more general interface to run non-MapReduce jobs (such as Graph Processing) within the Hadoop framework
-
Slide 14
Hadoop 2.0 Cluster Architecture - Federation
[Diagram: Hadoop 1.0 - a single NameNode owns the namespace (NS) and block management over DataNode block storage. Hadoop 2.0 - multiple NameNodes (NN-1 ... NN-k ... NN-n), each with its own namespace (NS1 ... NSk ... NSn) and its own block pool (Pool 1 ... Pool k ... Pool n), share common storage across DataNodes 1..m]
Hadoop 1.0 vs Hadoop 2.0 block pools
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
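A federated deployment along these lines is configured by declaring the nameservices and one RPC address per NameNode. A minimal sketch, assuming two namespaces; the hostnames are hypothetical:

```xml
<!-- hdfs-site.xml: HDFS Federation with two independent NameNodes -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>nn2.example.com:8020</value>
</property>
```

Every DataNode reads this list and registers with all the NameNodes, maintaining one block pool per namespace.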
-
Slide 15
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.
How does HDFS Federation help HDFS scale horizontally?
- It reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the filesystem namespace.
- It provides cross-data-center (non-local) support for HDFS, allowing a cluster administrator to split the block storage outside the local cluster.
Annie's Question
-
Slide 16
A. In order to scale the name service horizontally, HDFS Federation uses multiple independent NameNodes. The NameNodes are federated, that is, they are independent and don't require coordination with each other.
Annie's Answer
-
Slide 17
You have configured two NameNodes to manage /marketing and /finance respectively. What will happen if you try to put a file into the /accounting directory?
Annie's Question
-
Slide 18
The put will fail. Neither namespace manages that file, and you will get an IOException with a "No such file or directory" error.
Annie's Answer
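The /marketing and /finance split in this question is typically presented to clients through a ViewFS mount table, which maps client-side paths onto the federated namespaces. A sketch; the mount-table name and hostnames are assumptions:

```xml
<!-- core-site.xml on the client: ViewFS mount table over two federated NameNodes -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://clusterX</value>
</property>
<property>
  <name>fs.viewfs.mounttable.clusterX.link./marketing</name>
  <value>hdfs://nn-marketing.example.com:8020/marketing</value>
</property>
<property>
  <name>fs.viewfs.mounttable.clusterX.link./finance</name>
  <value>hdfs://nn-finance.example.com:8020/finance</value>
</property>
```

A put to /accounting matches no mount link, which is why the operation fails.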
-
Slide 19
[Diagram: Hadoop 2.0 cluster - on the HDFS side, an Active NameNode and a Standby NameNode share edit logs over NFS: all namespace edits are logged to shared NFS storage with a single writer (enforced by fencing), and the Standby reads the edit logs and applies them to its own namespace; on the YARN side, a ResourceManager serves clients, while each worker runs a NodeManager, a DataNode, and containers hosting an App Master]
Hadoop 2.0 Cluster Architecture - HDFS High Availability
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
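The NFS-based HA pictured above boils down to a few hdfs-site.xml properties. A trimmed sketch (the nameservice name, NFS path, and per-NameNode RPC addresses are placeholders, and the RPC addresses are omitted here for brevity):

```xml
<!-- hdfs-site.xml: NameNode HA with shared edits on NFS -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/filer1/dfs/ha-name-dir-shared</value> <!-- the NFS mount -->
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value> <!-- enforces the single-writer rule on failover -->
</property>
```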
-
Slide 20
NameNode Recovery vs. Failover
- NameNode recovery in Hadoop 1.0: manually recover a failed NameNode from the Secondary NameNode's saved metadata (FSImage and edit logs).
- High availability in Hadoop 2.0: automatic failover from the Active NameNode to the Standby NameNode.
-
Slide 21
HDFS HA was developed to overcome which of the following disadvantages in Hadoop 1.0?
- Single point of failure of the NameNode
- Only one version can be run in classic MapReduce
- Too much burden on the JobTracker
Annie's Question
-
Slide 22
Single point of failure of the NameNode.
Annie's Answer
-
Slide 23
BATCH (MapReduce)
INTERACTIVE (Tez)
ONLINE (HBase)
STREAMING (Storm, S4, ...)
GRAPH (Giraph)
IN-MEMORY (Spark)
HPC MPI (OpenMPI)
OTHER (Search, Weave, ...)
YARN Moving beyond MapReduce
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
-
Slide 24
Multi-tenancy - Capacity Scheduler
- Organizes jobs into queues
- Queue shares as percentages of the cluster
- FIFO scheduling within each queue
- Data-locality-aware scheduling
Hierarchical queues: to manage resources within an organization.
Capacity guarantees: a fraction of the total available capacity is allocated to each queue.
Security: to safeguard applications from other users.
Elasticity: resources are available to queues in a predictable and elastic manner.
Multi-tenancy: a set of limits prevents over-utilization of resources by a single application.
Operability: runtime configuration of queues.
Resource-based scheduling: if needed, applications can request more resources than the default.
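The queue shares described above live in capacity-scheduler.xml. A minimal sketch for two queues; the queue names and the 70/30 split are examples, not from the slides:

```xml
<!-- capacity-scheduler.xml: two queues sharing the cluster capacity -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>marketing,finance</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.marketing.capacity</name>
  <value>70</value> <!-- percent of the cluster guaranteed to this queue -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.finance.capacity</name>
  <value>30</value>
</property>
```

Elasticity then lets a queue borrow idle capacity beyond its guarantee, up to any configured maximum.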
-
Slide 25
Executing MapReduce Application on YARN
MapReduce Application Execution
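Running MapReduce applications on YARN hinges on a single switch in mapred-site.xml; everything else in the execution flow follows from it:

```xml
<!-- mapred-site.xml: route job submission to YARN instead of the classic JobTracker -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value> <!-- other accepted values: local, classic -->
</property>
```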
-
Slide 26
[Diagram: job submission in Hadoop 1.0]
1. The user copies the input files into DFS
2. The user submits the job through the Client
3. The Client gets the input files' info from DFS
4. The Client creates the splits
5. The Client uploads the job information (Job.xml, Job.jar) to DFS
6. The Client submits the job to the JobTracker
JobTracker
Job Execution
-
Slide 27
[Diagram: job initialization in Hadoop 1.0]
6. The Client submits the job to the JobTracker
7. The JobTracker initializes the job in the job queue
8. The JobTracker reads the job files (Job.xml, Job.jar) from DFS
9. The JobTracker creates maps and reduces: as many maps as there are input splits
JobTracker (Contd.)
-
Slide 28
[Diagram: task assignment in Hadoop 1.0]
10. The TaskTrackers (on hosts H1-H5) send heartbeats to the JobTracker
11. The JobTracker picks tasks from the job queue (data-local if possible)
12. The JobTracker assigns the tasks to the TaskTrackers
JobTracker (Contd.)
-
Slide 29
YARN = Yet Another Resource Negotiator
[Diagram: the ResourceManager coordinates the NodeManagers; each NodeManager hosts containers, and each application has an App Master running in a container; the flows between components are MapReduce status, job submission, node status, and resource requests]
Resource Manager: cluster-level resource manager; long-lived, runs on high-quality hardware
Node Manager: one per DataNode; monitors resources on the DataNode
Application Master: one per application; short-lived; manages tasks and scheduling
YARN Beyond MapReduce
JobHistoryServer
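Wiring these daemons together takes only a little yarn-site.xml. A minimal sketch; the hostname is a placeholder, and the aux-service name below is the Hadoop 2.2+ spelling, so check your version:

```xml
<!-- yarn-site.xml: minimal ResourceManager / NodeManager wiring -->
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>rm.example.com</value> <!-- NodeManagers and clients find the RM here -->
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value> <!-- needed so MapReduce can shuffle on YARN -->
</property>
```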
-
Slide 30
YARN MR Application Execution Flow
[Diagram: client-side submission]
1. The user runs the application; a Job object is created in the client JVM
2. The client gets a new application ID from the ResourceManager
3. The client copies the job resources to HDFS
4. The client submits the application to the ResourceManager
-
Slide 31
YARN MR Application Execution Flow (Contd.)
5. The ResourceManager asks a NodeManager (on the management node) to start the MR AppMaster container
6. The NodeManager creates the container and launches the MR AppMaster
7. The MR AppMaster retrieves the input splits from HDFS
8. The MR AppMaster requests resources from the ResourceManager
9. The MR AppMaster asks a NodeManager on a DataNode to start a task container
-
Slide 32
YARN MR Application Execution Flow (Contd.)
10. The NodeManager creates the task container, which runs YarnChild in a separate task JVM
11. YarnChild acquires the job resources from HDFS
12. YarnChild executes the Map or Reduce task
-
Slide 33
YARN MR Application Execution Flow (Contd.)
The Job object in the client JVM polls the MR AppMaster for status, and the running tasks update their status to the MR AppMaster.
-
Slide 34
Hadoop 2.0 : YARN Workflow
[Diagram: the ResourceManager (Scheduler plus Applications Manager (AsM)) coordinates many NodeManagers; MR AppMaster 1 runs containers 1.1 and 1.2 on other nodes, while MR AppMaster 2 runs containers 2.1, 2.2, and 2.3, each on a different NodeManager]
-
Slide 35
YARN MR Application Execution Flow
MapReduce application execution covers submission, initialization, task assignment, task memory, status updates, and failure recovery.
- Client: submits a MapReduce application.
- Resource Manager: manages resource utilization across the Hadoop cluster.
- Node Manager: runs on each DataNode; creates execution containers; monitors container usage.
- MapReduce Application Master: coordinates and manages MapReduce tasks; negotiates with the Resource Manager to schedule tasks; the tasks themselves are started by the NodeManager(s).
- HDFS: shares resources and task artefacts among the YARN components.
Application vs. Job Execution
-
Slide 36
Hadoop 2.0 In Summary
[Diagram: Masters - for HDFS (distributed data storage, NameNode high availability), an Active NameNode and a Standby NameNode with shared edit logs, plus a Secondary NameNode; for YARN (distributed data processing, next-generation MapReduce), a ResourceManager with a Scheduler and an Applications Manager (AsM). Slaves - each node runs a DataNode and a NodeManager hosting containers and an App Master. The App Master manages the application and can request containers and run code in them.]
-
Slide 37
YARN/MRv2 was developed to overcome which of the following disadvantages in MRv1?
- Single point of failure of the NameNode
- Only one version can be run in classic MapReduce
- Too much burden on the JobTracker
Annie's Question
-
Slide 38
Too much burden on the JobTracker.
Annie's Answer
-
Slide 39
In YARN (MRv2), the functionality of the JobTracker has been replaced by which of the following YARN features?
- Job scheduling
- Task monitoring
- Resource management
- Node management
Annie's Question
-
Slide 40
Task monitoring and resource management. The fundamental idea of MRv2 is to split the two major functionalities of the JobTracker, i.e. resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager (RM) for resources and a per-application ApplicationMaster (AM) for task monitoring.
Annie's Answer
-
Slide 41
In YARN, which of the following daemons takes care of the containers and the resource utilization by the applications?
- Node Manager
- Job Tracker
- Task Tracker
- Application Master
- Resource Manager
Annie's Question
-
Slide 42
ApplicationMaster.
Annie's Answer
-
Slide 43
Can we run MRv1 jobs on a YARN/MRv2-enabled Hadoop cluster?
- Yes
- No
Annie's Question
-
Slide 44
Yes. You need to recompile the jobs against MRv2 after enabling YARN/MRv2 in order to run them successfully on the MRv2 Hadoop cluster.
Annie's Answer
-
Slide 45
Which of the following YARN/MRv2 daemons is responsible for launching the tasks?
- Task Tracker
- Resource Manager
- ApplicationMaster
- ApplicationMasterServer
Annie's Question
-
Slide 46
ApplicationMaster.
Annie's Answer
-
Thank you! See you in class next week.
-
NN High Availability - Quorum Journal Manager
[Diagram: the Active NameNode writes namespace modifications to edit logs on a quorum of JournalNodes (only the Active NameNode is the writer); the Standby NameNode reads the edit logs and applies them to its own namespace; the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both]
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html
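Switching the shared edits from NFS to the JournalNode quorum pictured above changes only one property relative to the NFS setup; a sketch with hypothetical hostnames:

```xml
<!-- hdfs-site.xml: shared edits on a JournalNode quorum instead of NFS -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
```

The quorum removes the NFS filer as a single point of failure: writes succeed as long as a majority of JournalNodes are up.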