
Differences between Hadoop 1.x and Hadoop 2.x, Hadoop 1.x Limitations and Hadoop 2.x YARN Benefits

Hadoop V.1.x Components

Apache Hadoop V.1.x has the following two major components:

1. HDFS (HDFS V1)

2. MapReduce (MR V1)

In Hadoop V.1.x, these two are also known as the Two Pillars of Hadoop.

Hadoop V.2.x Components

Apache Hadoop V.2.x has the following three major components:

1. HDFS V.2

2. YARN

3. MapReduce (MR V2, which runs on YARN)

In Hadoop V.2.x, these three are also known as the Three Pillars of Hadoop.


Hadoop 1.x Limitations

Hadoop 1.x has many limitations or drawbacks. The main drawback of Hadoop 1.x is the tightly coupled MapReduce component in its architecture: it supports only MapReduce-based batch data processing applications.

Hadoop 1.x has the following Limitations/Drawbacks:

It is only suitable for batch processing of huge amounts of data that are already in the Hadoop system.

It is not suitable for real-time data processing.

It is not suitable for data streaming.

It supports up to 4,000 nodes per cluster.

It has a single component, the JobTracker, to perform many activities: resource management, job scheduling, job monitoring, job re-scheduling, etc.

The JobTracker is a single point of failure.

It does not support multi-tenancy.

It supports only one NameNode and one namespace per cluster.

It does not support horizontal scalability.

It runs only Map/Reduce jobs.


It follows a slots concept on the TaskTrackers to allocate resources (memory, CPU). It has static Map and Reduce slots: once it assigns slots to Map or Reduce tasks, it cannot re-use them even though some of those slots are idle.

For example, suppose 10 Map and 10 Reduce tasks are running with 10 + 10 slots to perform a computation. All Map tasks are doing their work, but all Reduce slots are idle. We cannot use these idle Reduce slots for any other purpose.

NOTE: In summary, the Hadoop 1.x system is a single-purpose system; we can use it only for MapReduce-based applications.

Differences between Hadoop 1.x and Hadoop 2.x

If we compare the components of Hadoop 1.x and 2.x, the Hadoop 2.x architecture has one extra, new component: YARN (Yet Another Resource Negotiator).

It is the game-changing component of the Big Data Hadoop system.

New Components and API

As shown in the diagram below, Hadoop 1.x was re-architected and a new component was introduced to solve the Hadoop 1.x limitations.


Hadoop 1.x JobTracker

As shown in the diagram below, the Hadoop 1.x JobTracker component is divided into two components:

1. ResourceManager: to manage resources in the cluster.

2. ApplicationMaster: to manage applications such as MapReduce, Spark, etc.

Hadoop 1.x supports only one namespace for managing the HDFS filesystem, whereas Hadoop 2.x supports multiple namespaces.

Hadoop 1.x supports one and only one programming model: MapReduce. Hadoop 2.x, with the YARN component, supports multiple programming models such as MapReduce, interactive, streaming, graph, Spark, Storm, etc.

Hadoop 1.x has a lot of limitations in scalability. Hadoop 2.x has overcome that limitation with its new architecture.

Hadoop 2.x has multi-tenancy support, but Hadoop 1.x doesn't.

Hadoop 1.x uses a fixed-size slots mechanism for resource allocation, whereas Hadoop 2.x uses variable-sized containers.

Hadoop 1.x supports a maximum of 4,000 nodes per cluster, whereas Hadoop 2.x supports more than 10,000 nodes per cluster.


How Hadoop 2.x solves Hadoop 1.x Limitations

Hadoop 2.x has resolved most of the Hadoop 1.x limitations by using a new architecture:

By decoupling the MapReduce component's responsibilities into different components.

By introducing the new YARN component for resource management.

By decoupling component responsibilities, it supports multiple namespaces, multi-tenancy, higher availability, and higher scalability.

Hadoop 2.x YARN Benefits

Hadoop 2.x YARN has the following benefits.

High Scalability

High Availability

Supports Multiple Programming Models

Supports Multi-Tenancy

Supports Multiple Namespaces

Improved Cluster Utilization

Supports Horizontal Scalability
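To make the "multiple programming models" benefit concrete, the sketch below uses the YARN client API that any framework (not only MapReduce) can use to talk to the ResourceManager. This is only a minimal, illustrative sketch: the class name is made up, a reachable Hadoop 2.x cluster is assumed, and the hadoop-yarn-client library must be on the classpath; it simply queries cluster information rather than submitting a real application.

import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInfo {

    public static void main(String[] args) throws Exception {
        // Connects to the ResourceManager configured in yarn-site.xml
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager (not a JobTracker) about the cluster
        System.out.println("NodeManagers: "
                + yarnClient.getYarnClusterMetrics().getNumNodeManagers());

        // Each NodeManager reports the container resources it can offer
        List<NodeReport> nodes = yarnClient.getNodeReports();
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " capability: " + node.getCapability());
        }

        yarnClient.stop();
    }
}

The point of the sketch is that resource management is now a general-purpose service: the same client API serves MapReduce, Spark, streaming, and other application types.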


Introduction to HDFS and MapReduce

As the name suggests, HDFS is a storage system for very large numbers of files. It has some distinct advantages, like its scalability and distributed nature, that make it well suited to working with Big Data. HDFS stores data on commodity hardware and can run on huge clusters, with the ability to stream data for immediate processing.

Though HDFS is similar to various other distributed file systems, it has some very distinctive features and advantages that make it so universally deployable for Big Data Hadoop projects.

Here are some of the most important:

Working on very large sets of data

HDFS is built for scale, and this is one of its greatest strengths. Where other distributed file systems fail, HDFS succeeds. It can store petabytes of data and retrieve it on demand, making it a strong fit for Big Data applications. It is able to store huge amounts of data by spreading it across hundreds or even thousands of commodity machines that are cheap and readily available. The aggregate data bandwidth of HDFS is unlike that of competing file systems.

Storing on cheap commodity hardware

HDFS is built from the ground up for Big Data applications. One of the biggest constraints when working with Big Data is cost overruns: hardware and infrastructure, if not properly managed, can run into the millions. This is where HDFS is a blessing, since it can successfully run on cheap commodity hardware. Hadoop can be easily installed even on ordinary personal computers, and HDFS works just fine in such an environment. All this drastically reduces costs and gives the power to scale at will.

Ability to write once and read many times

Files in HDFS can be written once and read as many times as needed. The basic premise is that once a file is written it will not be overwritten, so it can be accessed multiple times without a hitch. This directly contributes to HDFS's high throughput and also resolves the issue of data coherency.

Providing access to streaming data

When working on Big Data applications it becomes extremely important to fetch data in a streaming manner, and this is what HDFS does effortlessly through its streaming access to data. The emphasis is on providing high throughput for large amounts of data rather than low latency in accessing a single file. Much importance is given to streaming data at high speed, and less significance is given to how the data is stored.


Extreme throughput

One of the hallmarks of Big Data applications is very high throughput, and HDFS achieves this with some distinct features and capabilities. A task is divided into multiple smaller tasks that are shared across various machines, so the components work independently and in parallel to complete the given task. Since data is read in parallel, the time taken is drastically reduced and high throughput is achieved regardless of the size of the data files.

Moving computation rather than data

This is a distinct feature of the Hadoop Distributed File System: it lets you move the processing of data to the source of the data rather than moving the data around the network. When you are dealing with huge amounts of data, moving it becomes particularly cumbersome, leading to overwhelmed networks and slower processing. HDFS overcomes this problem by providing interfaces that let applications run near where the data is stored, for faster computation.

Fault tolerance and data replication

Since the data is stored on cheap commodity hardware there has to be a trade-off somewhere, and it occurs in the frequent failure of nodes. HDFS gets around this problem by replicating the data on three nodes by default: two replicas are on the same rack while the third is on a different rack, so the system is resilient to node failures. Due to this unique way of storing files, HDFS provides the whole system with a solid fault-recovery mechanism, easy data replication, enhanced scalability, and data accessibility.

File System Namespace:

A traditional hierarchical file organization is followed by HDFS, where any user or application can create directories and store files inside those directories. Thus, HDFS's file system namespace hierarchy is similar to that of most other existing file systems: one can create and delete files, relocate a file from one directory to another, or even rename a file. In general, HDFS does not support hard links or soft links, though these could be implemented if the need arises.
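As a quick illustration of those namespace operations, the following minimal sketch uses the standard Hadoop FileSystem Java API. The class name and the directory/file paths are hypothetical, and a reachable HDFS cluster with the usual client configuration (core-site.xml) on the classpath is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {

    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        FileSystem fs = FileSystem.get(new Configuration());

        Path dir = new Path("/user/demo/reports");         // hypothetical directory
        Path file = new Path("/user/demo/reports/a.txt");  // hypothetical file
        Path renamed = new Path("/user/demo/reports/b.txt");

        fs.mkdirs(dir);            // create a directory
        fs.create(file).close();   // create an (empty) file
        fs.rename(file, renamed);  // rename / relocate a file
        fs.delete(renamed, false); // delete a file (non-recursive)
        fs.delete(dir, true);      // delete the directory recursively

        fs.close();
    }
}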


HDFS

HDFS stands for Hadoop Distributed File System, which is the storage system used by Hadoop.

The following is a high-level architecture that explains how HDFS works.

The following are some of the key points to remember about HDFS:

In the above diagram, there is one NameNode and multiple DataNodes (servers); b1, b2, … indicate data blocks.

When you put a file (or data) into HDFS, it is stored as blocks on the various nodes in the Hadoop cluster. HDFS creates several replicas of the data blocks and distributes them across the cluster in a way that is reliable and allows faster retrieval. A typical HDFS block size is 128 MB, and each data block is replicated to multiple nodes across the cluster.

Hadoop internally makes sure that a node failure never results in data loss.

There is one NameNode that manages the file system metadata.

There are multiple DataNodes (the real, cheap commodity servers) that store the data blocks.

When you execute a query from a client, it first reaches out to the NameNode to get the file metadata information, and then it reaches out to the DataNodes to read the actual data blocks.

Hadoop provides a command-line interface for administrators to work with HDFS.

The NameNode comes with a built-in web server from which you can browse the HDFS filesystem and view some basic cluster statistics.
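The minimal sketch below shows that client-side view in the Hadoop FileSystem Java API: metadata (file status and block locations) comes from the NameNode, while the byte stream is read from the DataNodes. The class name and file path are hypothetical, and a reachable HDFS cluster is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadBlocks {

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/big-dataset.csv"); // hypothetical file

        // Metadata lookups go to the NameNode
        FileStatus status = fs.getFileStatus(file);
        System.out.println("size: " + status.getLen()
                + ", replication: " + status.getReplication()
                + ", block size: " + status.getBlockSize());

        // Each block is replicated on several DataNodes (hosts)
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + " -> " + String.join(",", block.getHosts()));
        }

        // The actual bytes are streamed from the DataNodes
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println("first byte: " + in.read());
        }
        fs.close();
    }
}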

MapReduce


The following are some of the key points to remember about MapReduce:

MapReduce is a parallel programming model that is used to process and retrieve data from the Hadoop cluster.

In this model, the library handles a lot of messy details that programmers don't need to worry about. For example, the library takes care of parallelization, fault tolerance, data distribution, load balancing, etc.

It splits the tasks and executes them on the various nodes in parallel, thus speeding up the computation and retrieving the required data from a huge dataset quickly.

This provides a clean abstraction for programmers: they just have to implement (or use) two functions, map and reduce.

The data is fed into the map function as key/value pairs to produce intermediate key/value pairs.

Once the mapping is done, all the intermediate results from the various nodes are reduced to create the final output.

The JobTracker keeps track of all the MapReduce jobs that are running on the various nodes. It schedules the jobs and keeps track of all the map and reduce tasks running across the nodes; if any of those tasks fails, it reallocates the task to another node. In simple terms, the JobTracker is responsible for making sure that a query on a huge dataset runs successfully and the data is returned to the client in a reliable manner.

The TaskTracker performs the map and reduce tasks assigned by the JobTracker. It also constantly sends a heartbeat message to the JobTracker, which helps the JobTracker decide whether or not to delegate a new task to that particular node.


MapReduce

In the above MapReduce flow:

1. The input data can be divided into n chunks, depending upon the amount of data and the processing capacity of each individual unit.

2. Next, the chunks are passed to the mapper functions. Note that all the chunks are processed simultaneously, which embraces the parallel processing of data.

3. After that, shuffling happens, which leads to the aggregation of similar patterns.

4. Finally, the reducers combine them all to get a consolidated output as per the logic.


5. This algorithm embraces scalability: depending on the size of the input data, we can keep increasing the number of parallel processing units.

WordCount

Word Count MapReduce Program

package PackageDemo;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        // First remaining argument is the input path, second is the output path
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);

        Job j = Job.getInstance(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);

        // Submit the job and wait; exit 0 on success, 1 on failure
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        public void map(LongWritable key, Text value, Context con)
                throws IOException, InterruptedException {
            // Split each input line on commas and emit every word with a count of 1
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text word, Iterable<IntWritable> values, Context con)
                throws IOException, InterruptedException {
            // Sum all the 1s emitted by the mappers for this word
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}
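To run the job, the class can be packaged into a jar and submitted with the hadoop jar command; the jar name and HDFS paths below are only placeholders:

hadoop jar wordcount.jar PackageDemo.WordCount /user/demo/input /user/demo/output

Note that the output directory must not already exist; Hadoop creates it and writes the word counts into part files inside it.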