Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
edureka
Transcript of Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |
![Page 2: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/2.jpg)
Slide 2
Hello there! My name is Annie.
Let me test your Hadoop 1.x knowledge.
Annie’s Introduction
![Page 3: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/3.jpg)
Slide 3
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.
Can you store 1 billion files in a Hadoop 1.x cluster?
- Yes
- No
Annie’s Question
![Page 4: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/4.jpg)
Slide 4
No. Even though you may have hundreds of DataNodes in the cluster, the single NameNode in Hadoop 1.x keeps all of its metadata in memory, so the entire cluster is limited to roughly 50-100 million files.
Annie’s Answer
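A back-of-the-envelope sketch of why the file count is bounded by NameNode memory. It assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file, directory or block) and one block per file. Both figures are assumptions for illustration, not numbers from the slides.

```python
# Rough NameNode heap estimate. BYTES_PER_OBJECT is an assumed rule of thumb,
# not a value taken from the slides or measured on a real cluster.
BYTES_PER_OBJECT = 150


def namenode_heap_gb(num_files: int, blocks_per_file: int = 1) -> float:
    """Estimate heap (in GB) needed to hold metadata for num_files files."""
    # Each file contributes one file object plus one object per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1e9


for files in (100_000_000, 1_000_000_000):
    print(f"{files:>13,} files -> ~{namenode_heap_gb(files):.0f} GB of heap")
```

Under these assumptions, 100 million files need about 30 GB of heap, while 1 billion files would need about 300 GB on a single machine.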
![Page 5: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/5.jpg)
Slide 5
A Hadoop 1.x cluster can have multiple HDFS Namespaces.
- True
- False
Annie’s Question
![Page 6: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/6.jpg)
Slide 6
False. A Hadoop 1.x cluster has a single NameNode and therefore a single namespace.
Annie’s Answer
![Page 7: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/7.jpg)
Slide 7
Which of the following is (are) a significant disadvantage of Hadoop 1.0?
- Single point of failure of the NameNode
- Too much burden on the Job Tracker
Annie’s Question
![Page 8: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/8.jpg)
Slide 8
Both: the NameNode is a single point of failure, and there is too much burden on the Job Tracker.
Annie’s Answer
![Page 9: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/9.jpg)
Slide 9
Can you use hundreds of Hadoop DataNodes for any processing other than MapReduce in Hadoop 1.x?
- Yes
- No
Annie’s Question
![Page 10: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/10.jpg)
Slide 10
No. Hadoop 1.x dedicates all DataNode resources to Map and Reduce slots, leaving little or no room for processing any other workload.
Annie’s Answer
![Page 11: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/11.jpg)
Slide 11
Can you use Hadoop for real-time processing?
- Yes
- No
Annie’s Question
![Page 12: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/12.jpg)
Slide 12
No. Hadoop is designed and developed for massively parallel batch processing.
Annie’s Answer
![Page 13: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/13.jpg)
Limitations of Hadoop 1.x
No horizontal scalability of NameNode
Does not support NameNode High Availability
Overburdened JobTracker
Not possible to run Non-MapReduce Big Data Applications on HDFS
Does not support Multi-tenancy
![Page 14: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/14.jpg)
Slide 14 www.edureka.in/hadoop
Hadoop 1.x – In Summary
[Diagram: a Client talks to the master daemons: NameNode and Secondary NameNode on the HDFS side, Job Tracker on the MapReduce side. Each slave node runs a DataNode that stores data blocks and a Task Tracker that runs Map and Reduce tasks.]
![Page 15: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/15.jpg)
Slide 15
Hadoop 1.x - Challenges

| Problem | Description |
| --- | --- |
| NameNode – No Horizontal Scalability | Single NameNode and single namespace, limited by NameNode RAM |
| NameNode – No High Availability (HA) | NameNode is a single point of failure; needs manual recovery using the Secondary NameNode in case of failure |
| Job Tracker – Overburdened | Spends a significant portion of time and effort managing the life cycle of applications |
| MRv1 – Only Map and Reduce tasks | The huge volume of data stored in HDFS remains unutilized and cannot be used for other workloads such as graph processing |

![Page 16: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/16.jpg)
Slide 16
NameNode – Scale and HA
[Diagram: a Client gets block locations from the single NameNode, which holds the namespace (NS) and block management, and then reads data from the DataNodes. Callouts: "NameNode - No High Availability" and "NameNode - No Horizontal Scale".]
![Page 17: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/17.jpg)
Slide 17
Name Node – Single Point of Failure
Secondary NameNode:
- “Not a hot standby” for the NameNode
- Connects to the NameNode every hour*
- Housekeeping, backup of NameNode metadata
- Saved metadata can be used to rebuild a failed NameNode
[Diagram: the NameNode, a single point of failure, sends its metadata to the Secondary NameNode, which says: "You give me metadata every hour, I will make it secure."]
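The housekeeping role can be pictured as a checkpoint: the Secondary NameNode takes the last saved namespace image and replays the edit log over it to produce a fresh image. The sketch below models that idea with plain dictionaries. The operation names (`create`, `delete`) and the data layout are hypothetical simplifications, not the real FSImage or edit log formats.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified model: the image maps a path to its block count,
# and the edit log is a list of (operation, path, block_count) records.
Image = Dict[str, int]
Edit = Tuple[str, str, int]


def checkpoint(fsimage: Image, edits: List[Edit]) -> Image:
    """Replay the edit log over a copy of the image and return the new image."""
    merged = dict(fsimage)  # never mutate the previous checkpoint
    for op, path, blocks in edits:
        if op == "create":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return merged


old_image = {"/logs/a.log": 2}
edit_log = [("create", "/logs/b.log", 1), ("delete", "/logs/a.log", 0)]
new_image = checkpoint(old_image, edit_log)
print(new_image)
```

Because a checkpoint is only taken periodically, edits made after the last one are missing from the saved image, which is one reason the Secondary NameNode is not a hot standby.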
![Page 18: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/18.jpg)
Slide 18
Job Tracker – Overburdened
- CPU: spends a very significant portion of time and effort managing the life cycle of applications
- Network: a single listener thread communicates with thousands of Map and Reduce jobs
[Diagram: one Job Tracker communicating with many Task Trackers.]
![Page 19: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/19.jpg)
Slide 19
MRv1 – Unpredictability in Large Clusters
As the cluster size grows and reaches 4000 nodes:
Cascading Failures
DataNode failures result in a serious deterioration of overall cluster performance, because the attempts to re-replicate data overload the live nodes and flood the network.
Multi-tenancy
As clusters increase in size, you may want to employ them for a variety of models. MRv1 dedicates its nodes to Hadoop, so they cannot be re-purposed for other applications and workloads in an organization. With the growing popularity and adoption of cloud computing among enterprises, this becomes more important.
![Page 20: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/20.jpg)
Slide 20
Unutilized Data in HDFS
Terabytes and petabytes of data in HDFS can only be used for MapReduce processing
![Page 21: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/21.jpg)
Slide 21
Introducing Hadoop 2.0

| Features | Hadoop 1.x | Hadoop 2.0 |
| --- | --- | --- |
| HDFS Federation | One NameNode and one namespace | Multiple NameNodes and namespaces |
| NameNode High Availability | Not present | Highly available |
| YARN - Processing Control and Multi-tenancy | JobTracker, TaskTracker | Resource Manager, Node Manager, App Master, Capacity Scheduler |

Other important Hadoop 2.0 features:
- HDFS Snapshots
- NFSv3 access to data in HDFS
- Support for running Hadoop on MS Windows
- Binary compatibility for MapReduce applications built on Hadoop 1.0
- Substantial amount of integration testing with the rest of the projects (such as Pig, Hive) in the Hadoop ecosystem
![Page 22: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/22.jpg)
Slide 22
Hadoop 2.0 Cluster Architecture - Federation
[Diagram, Hadoop 1.0: a single Namenode holds the namespace (NS) and block management, on top of block storage provided by the Datanodes.]
[Diagram, Hadoop 2.0: NameNodes NN-1 ... NN-k ... NN-n each manage their own namespace (NS1 ... NSk ... NSn) and their own block pool (Pool 1 ... Pool k ... Pool n). All block pools share common storage on Datanode 1, Datanode 2 ... Datanode m.]
http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
![Page 23: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/23.jpg)
Slide 23
Annie’s Question
How does HDFS Federation help HDFS scale horizontally?
A) It reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the file system namespace.
B) It provides cross-data centre (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster.
![Page 24: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/24.jpg)
Slide 24
Annie’s Answer
(A). In order to scale the name service horizontally, HDFS federation uses multiple independent NameNodes. The NameNodes are federated, that is, the NameNodes are independent and do not require coordination with each other.
![Page 25: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/25.jpg)
Slide 25
Annie’s Question
You have configured two NameNodes to manage /marketing and /finance namespaces respectively. What will happen if you try to ‘put’ a file in /accounting directory?
![Page 26: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/26.jpg)
Slide 26
Annie’s Answer
The ‘put’ will fail. Neither namespace manages the /accounting directory, so you will get an IOException with a “No such file or directory” error.
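One way to picture this answer is a client-side mount table that maps path prefixes to NameNodes. The sketch below is a toy model under that assumption. The NameNode addresses and the `resolve` helper are hypothetical and are not the HDFS client API.

```python
# Toy client-side mount table: path prefix -> NameNode that owns the namespace.
# The host names are hypothetical placeholders.
MOUNT_TABLE = {
    "/marketing": "nn1.example.com:8020",
    "/finance": "nn2.example.com:8020",
}


def resolve(path: str) -> str:
    """Return the NameNode responsible for path, or fail if none is mounted."""
    for prefix, namenode in MOUNT_TABLE.items():
        if path == prefix or path.startswith(prefix + "/"):
            return namenode
    # Python's IOError (an alias of OSError) stands in for Java's IOException.
    raise IOError(f"No such file or directory: {path}")


print(resolve("/finance/q1/report.csv"))
try:
    resolve("/accounting/ledger.csv")
except IOError as err:
    print("put failed:", err)
```

Each lookup touches only the NameNode that owns the matching prefix, which is also why the NameNodes do not need to coordinate with each other.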
![Page 27: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/27.jpg)
Slide 27
Hadoop 2.0 Cluster Architecture - HA
[Diagram, HDFS side (NameNode High Availability): a Client talks to the Active NameNode, with a Standby NameNode alongside it and Data Nodes below.]
[Diagram, YARN side (Next Generation MapReduce): a Resource Manager coordinates the Node Managers, which run Containers and App Masters next to the Data Nodes.]
Shared edit logs:
- Active NameNode: all namespace edits are logged to shared NFS storage; single writer (fencing)
- Standby NameNode: reads the edit logs and applies them to its own namespace
HDFS HIGH AVAILABILITY
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
![Page 28: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/28.jpg)
Slide 28
Hadoop 2.0 Cluster Architecture - HA
[Diagram comparing the two designs.]
- NameNode recovery in Hadoop 1.0: NameNode and Secondary NameNode (FSImage, edit logs, metadata); manual recovery using the Secondary NameNode
- High Availability in Hadoop 2.0: Active NameNode and Standby NameNode; automatic failover to the Standby NameNode
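The shared edit log idea can be sketched as follows: the active node appends every namespace edit to shared storage, the standby replays those edits, and a fencing check rejects writes from a node that is no longer the writer. All class and method names here are hypothetical, and the model ignores the real edit log format and the failover controller.

```python
class SharedEditLog:
    """Shared storage that accepts edits from exactly one writer at a time."""

    def __init__(self, writer: str) -> None:
        self.writer = writer
        self.edits = []

    def append(self, node: str, edit: tuple) -> None:
        # Fencing: only the current writer may log namespace edits.
        if node != self.writer:
            raise PermissionError(f"{node} is fenced off from the edit log")
        self.edits.append(edit)


class NameNode:
    def __init__(self, name: str, log: SharedEditLog) -> None:
        self.name, self.log = name, log
        self.namespace = set()
        self.applied = 0  # number of shared edits already replayed

    def create(self, path: str) -> None:
        """Active-side operation: log the edit first, then apply it locally."""
        self.log.append(self.name, ("create", path))
        self.namespace.add(path)
        self.applied = len(self.log.edits)

    def catch_up(self) -> None:
        """Standby-side operation: replay edits not yet applied."""
        for _op, path in self.log.edits[self.applied:]:
            self.namespace.add(path)
        self.applied = len(self.log.edits)


log = SharedEditLog(writer="nn-active")
active, standby = NameNode("nn-active", log), NameNode("nn-standby", log)
active.create("/data/a")
standby.catch_up()

# Failover: the standby becomes the single writer and the old active is fenced.
log.writer = "nn-standby"
standby.create("/data/b")
print(sorted(standby.namespace))
```

Because the standby has already replayed the shared edits, it can take over without the manual rebuild from saved metadata that Hadoop 1.0 requires.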
![Page 29: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/29.jpg)
Slide 29
Annie’s Question
NameNode HA was developed to overcome which of the following disadvantages in Hadoop 1.0?
a) Single point of failure of the NameNode
b) Too much burden on the Job Tracker
![Page 30: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/30.jpg)
Slide 30
Annie’s Answer
Single Point of Failure of NameNode.
![Page 31: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/31.jpg)
Slide 31
YARN and Hadoop Ecosystem
[Diagram, first stack: Apache Oozie (Workflow), Pig Latin (Data Analysis) and Hive (DW System) over the MapReduce Framework, with HBase, on top of HDFS (Hadoop Distributed File System).]
[Diagram, second stack: the same components plus other YARN frameworks (MPI, Giraph), all running over YARN (Cluster Resource Management) on top of HDFS.]
YARN adds a more general interface to run non-MapReduce jobs (such as Graph Processing) within the Hadoop framework
![Page 32: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/32.jpg)
Slide 32
YARN – Moving beyond MapReduce
- BATCH (MapReduce)
- INTERACTIVE (Tez)
- ONLINE (HBase)
- STREAMING (Storm, S4, …)
- GRAPH (Giraph)
- IN-MEMORY (Spark)
- HPC MPI (OpenMPI)
- OTHER (Search, Weave, ..)
http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
![Page 33: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/33.jpg)
Slide 33
Multi-tenancy - Capacity Scheduler
- Organizes jobs into queues
- Queue shares as %’s of cluster
- FIFO scheduling within each queue
- Data locality-aware scheduling
- Hierarchical Queues: to manage the resources within an organization.
- Capacity Guarantees: a fraction of the total available capacity is allocated to each queue.
- Security: to safeguard applications from other users.
- Elasticity: resources are available to queues in a predictable and elastic manner.
- Multi-tenancy: a set of limits to prevent over-utilization of resources by a single application.
- Operability: runtime configuration of queues.
- Resource-based scheduling: if needed, applications can request more resources than the default.
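A minimal sketch of the first three points: jobs are grouped into queues, each queue is guaranteed a percentage of the cluster, and jobs inside a queue start in FIFO order. The queue names, percentages and the `CapacityQueue` class are hypothetical, and the model leaves out hierarchy, elasticity and data locality.

```python
from collections import deque

CLUSTER_SLOTS = 100  # assumed total number of containers in the cluster


class CapacityQueue:
    def __init__(self, name: str, share_pct: int) -> None:
        self.name = name
        # Guaranteed capacity is a fixed fraction of the whole cluster.
        self.capacity = CLUSTER_SLOTS * share_pct // 100
        self.used = 0
        self.pending = deque()  # FIFO order within the queue

    def submit(self, job: str, slots: int) -> None:
        self.pending.append((job, slots))

    def schedule(self) -> list:
        """Start pending jobs in FIFO order while they fit in the guarantee."""
        started = []
        while self.pending and self.used + self.pending[0][1] <= self.capacity:
            job, slots = self.pending.popleft()
            self.used += slots
            started.append(job)
        return started


queues = {"marketing": CapacityQueue("marketing", 30),
          "finance": CapacityQueue("finance", 70)}
queues["marketing"].submit("clicks", 20)
queues["marketing"].submit("campaign", 20)  # must wait: 20 + 20 > 30
queues["finance"].submit("ledger", 50)

for q in queues.values():
    print(q.name, "started", q.schedule())
```

In this toy model a large job at the head of a queue blocks the jobs behind it, which is the strict FIFO behaviour within one queue. Jobs in other queues are unaffected because each queue has its own share.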
![Page 34: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/34.jpg)
Slide 34
Annie’s Question
YARN was developed to overcome which of the following disadvantages of the Hadoop 1.0 MapReduce framework?
a) Single point of failure of the NameNode
b) Too much burden on the Job Tracker
![Page 35: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/35.jpg)
Slide 35
Annie’s Answer
Too much burden on Job Tracker.
![Page 36: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/36.jpg)
Slide 36
Hadoop 2.0 – In Summary
[Diagram: a Client talks to the masters.]
- HDFS (Distributed Data Storage, NameNode High Availability): Active NameNode and Standby NameNode, with shared edit logs OR Journal Node
- YARN (Distributed Data Processing, Next Generation MapReduce): Resource Manager, containing the Scheduler and the Applications Manager (AsM)
- Slaves: each runs a DataNode and a Node Manager hosting Containers and App Masters
![Page 37: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/37.jpg)
Slide 37
Can you use Hadoop 2.0 for real-time processing?
- Yes
- No
Annie’s Question
![Page 38: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/38.jpg)
Slide 38
No. Even though YARN in Hadoop 2.0 supports multiple frameworks for different workloads other than batch, you need Storm or S4 for real-time processing.
Annie’s Answer
![Page 39: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/39.jpg)
Slide 39
What about Real-time Processing?
Hadoop is good for batch processing, but how do I process Big Data in real time?
![Page 40: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/40.jpg)
Slide 40
Storm is coming….
APACHE STORM
The Real-time Hadoop
• Continuous computation system
Distributed, Reliable, Fault-tolerant, Scalable and Robust
• Suitable for Big Data processing
• Guarantees no data loss
Programming Language agnostic
• JSON-based for Ruby, Python etc.
Use case
• Stream processing
• Distributed RPC
• Continuous Computation
![Page 41: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/41.jpg)
Hadoop Vs. Storm
Differences

| Hadoop | Storm |
| --- | --- |
| Fundamentally a batch processing system | Real-time processing; processes unterminated streams of data (e.g. Twitter feeds) as the data arrives |
| MapReduce jobs run to completion | Topologies (computation graphs) run forever |
| Stateful nodes | Stateless nodes |

Similarities

| Hadoop | Storm |
| --- | --- |
| Scalable | Scalable |
| Guarantees no data loss | Guarantees no data loss |
| Open Source | Open Source |

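The first two differences can be illustrated with a word count written both ways. This is plain Python rather than the MapReduce or Storm APIs, and the function names are hypothetical. The batch version needs the whole input before it returns, while the streaming version emits an updated count for every item as it arrives and would keep going for as long as the source produces data.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def batch_count(words: Iterable[str]) -> Dict[str, int]:
    """Batch style: consume the complete input, then return one final result."""
    return dict(Counter(words))


def stream_count(words: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Streaming style: yield the running counts after every incoming word."""
    counts: Counter = Counter()
    for word in words:  # the source may be unbounded
        counts[word] += 1
        yield dict(counts)


feed = ["storm", "hadoop", "storm"]
print(batch_count(feed))
for snapshot in stream_count(feed):
    print(snapshot)
```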
![Page 42: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/42.jpg)
Storm Use Cases
Data Normalization
• Groupon uses Storm to build real-time data integration systems.
Analytics
• Storm powers Twitter’s publisher analytics product, processing every tweet and click that happens on Twitter to provide analytics for Twitter’s publisher partners.
• Flipboard uses Storm across a wide range of services, from content search and real-time analytics to generating custom magazine feeds.
Log processing
• Alibaba uses Storm to process application logs and data changes in databases to supply real-time data stats for data apps.
• NaviSite uses Storm in its server log monitoring and auditing system.
![Page 43: Hadoop 2.0 Architecture | HDFS Federation | NameNode High Availability |](https://reader036.fdocuments.us/reader036/viewer/2022081718/54c67c754a79598d528b45b0/html5/thumbnails/43.jpg)
Thank You
See You in Next Class