
REDUCTION OF INTERMEDIATE DATA THROUGH MAP REDUCE FOR PRECISED BIG DATA PROCESSING IN CLOUD COMPUTING ENVIRONMENT

Shivani Sant

Department of Computer Science

Sagar Institute Of Research & Technology-Excellence

Bhopal, India

[email protected]

Prof. Anjana Verma

Department of Computer Science

Sagar Institute Of Research & Technology-Excellence

Bhopal, India

[email protected]

Abstract— Cloud computing has emerged as a platform for huge data because of its ability to provide users with on-demand, reliable, flexible, and low-cost services. With the increasing use of cloud applications, the data generated by those applications also increases, which has become an important issue for the cloud known as the big data problem. There is therefore a need for technology that can manage big data on the cloud efficiently. Apache Hadoop is an effective solution for handling big data: it is an open-source technology that stores big data in a distributed manner over a cluster of heterogeneous systems and provides reliable storage as well as processing for big data on the cloud. But when data is distributed over a cluster of multiple systems, data movement between the systems creates a problem known as load balancing. In this paper, the proposed approach is to run the combine function inside the map method to minimize the volume of emitted intermediate results, so that network congestion on the cloud is reduced, which improves the overall performance of the cluster on the cloud.

Keywords— cloud, big data, Hadoop, heterogeneous cluster, performance, load balancing.

I. INTRODUCTION

A huge amount of data is generated from the internet and various other sources. This rapidly increasing data, known as big data, creates problems of storage and processing. Traditional tools and techniques fail to manage such a huge volume of big data. Of the data generated by growing technologies, only about 20% is in structured form and the remaining 80% is unstructured, which is known as the big data problem [1]. This large amount of unstructured data and incomplete information makes analysis difficult. Traditional systems were used to store and process unstructured data, but handling such data with traditional systems is too time-consuming and too expensive. Big data can be identified by three main attributes:

1. Volume – data is huge and massive.
2. Velocity – data changes rapidly and arrives quickly, so processing it in less time is very difficult.
3. Variety – data has different structures; it may be semi-structured or unstructured.

In the present scenario it is instrumental to blend big data and analytics into a single entity termed big data analytics. Analytics involves examination of data to derive meaningful insights, such as hidden patterns and trends, that can in turn benefit organizations in making important business decisions and developing newer business models. The problem of data deluge imposes potential challenges in processing and extracting useful information from data. It also requires skills for the management and analysis of huge data sets.

Cloud computing is one solution for managing big data, hosting the big data workload on a cluster of multiple machines.

JASC: Journal of Applied Science and Computations, Volume VI, Issue V, May/2019, ISSN NO: 1076-5131

Cloud computing provides on-demand computing resources and systems that can deliver a number of integrated computing services, without being bound by local resources, to facilitate user access. These resources include data storage, backup, and self-synchronization, as well as software processing and task scheduling. Cloud computing is also a kind of shared-resource system that can offer a variety of online services, such as virtual server storage, applications, and licensing for desktop applications [10]. By leveraging common resources, cloud computing is able to scale and provide capacity. Cloud resources can be used privately through a private cloud or shared publicly using a public cloud such as Google Cloud Platform, Amazon EC2, or Microsoft Azure.

Hadoop can be used for managing big data on a cluster of commodity hardware. Hadoop comes with HDFS and MapReduce [8]: it stores big data on HDFS and analyzes or processes it using MapReduce.
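The store-then-process flow described above can be sketched as a minimal single-machine simulation of MapReduce word count (plain Python, not Hadoop's Java API; all function names here are illustrative):

```python
from collections import defaultdict

def map_phase(line):
    # a map task emits one (word, 1) pair per word in its input split
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # the framework groups intermediate pairs by key before reducing
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # a reduce task sums the counts for one word
    return (key, sum(values))

splits = ["hadoop stores big data", "hadoop processes big data"]
intermediate = [pair for line in splits for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1}
```

In a real cluster the map tasks run on the nodes holding the HDFS blocks, and the shuffle step moves the intermediate pairs across the network to the reducer nodes.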

II. LITERATURE REVIEW

According to [1], with the development of smart grid construction and the effective utilization of intermittent energy sources, processing on traditional platforms can no longer satisfy the demands of intermittent energy data, which is a great challenge for the whole platform. The paper proposes a technique in which a cloud platform is integrated with intermittent energy source data, together with multi-factor predictive load balancing for the cloud platform. First, the overall process of intermittent energy data processing is deployed on a new processing platform.

In [2], handling massive data transfer among thousands of interconnected servers plays an important role in the cloud computing environment. Big data is a collection of relational data, unstructured data, semi-structured data, and streaming data from machines, sensors, web applications, and social media. In the existing system, this idea was improved by tuning parameters to overcome bottlenecks in data transfer for scientific cloud applications; the parameters are pipelining, parallelism, and concurrency. The key drawback is that an incorrect parameter combination leads to overloading and under-utilization of the network, which results in congestion and packet loss during data transfer. The authors propose new dynamic workload and queuing methods to improve the network rate while balancing the workload dynamically. They also invoke various scheduling algorithms to predict unbalanced resource utilization in the data centre at the start of every interval, which is used to schedule the unbalanced resources. This rescheduling process recovers from both over-utilization and under-utilization of the network.

According to [3], managing and processing big data in geo-distributed data centres has gained much attention in recent years. Despite the increasing attention on this subject, most efforts are centred on user-centric solutions, and sadly little on the difficulties encountered by cloud providers in improving their profits. A highly efficient framework for geo-distributed big data processing in a cloud federation setting is a crucial answer for maximizing the profit of cloud providers. The authors maximize profit for cloud providers by minimizing costs and penalties. The work proposes to move computation to geo-distributed data and to outsource only the required data to idle resources of federated clouds in order to reduce job costs, and proposes a dynamic job-rearrangement approach to reduce penalty costs. The performance analysis shows that the proposed algorithm can maximize profit, reduce the cost of MapReduce jobs, and improve the utilization of cluster resources.

The paper proposes a profit-maximization algorithm to efficiently maximize the profit of cloud providers running multiple MapReduce jobs on federated clouds under a deadline. Optimizing cloud-provider profit requires, first, efficiently reducing MapReduce job cost, obtained by outsourcing the remaining map tasks to idle VMs across clouds; and second, overcoming the penalty of executing MapReduce jobs after a given deadline by applying a dynamic job-rearrangement approach [11]. The authors show that reducing wasted resources has a direct impact on cost reduction, and that rearranging job execution times has a fairness impact on penalty costs and the accepted-job rate. Results show that the proposed algorithm improved performance regarding cloud-provider profit, job costs, and VM utilization.

According to [4], cloud-based Hadoop has recently gained plenty of interest by providing access to a Hadoop cluster environment for processing massive data, eliminating the operational challenges of on-site hardware investment, IT support, and configuration of Hadoop components such as HDFS and MapReduce. On-demand Hadoop as a service helps industries focus on business growth, with a pay-per-use model for big data processing and auto-scaling of the Hadoop cluster. Various MapReduce jobs such as Pi, TeraSort, and WordCount were implemented on a cloud-based Hadoop deployment using Microsoft Azure cloud services. The performance of the MapReduce jobs was evaluated by execution time with varying Hadoop cluster sizes [12]. From the experimental results, it is found that the execution time to finish the jobs decreases as the number of DataNodes in the HDInsight cluster increases, indicating good latency, increased performance, and greater customer satisfaction.

According to [5], cloud computing leverages the Hadoop framework for processing big data in parallel. Hadoop has certain limitations that could be exploited to execute jobs more efficiently; these limitations are mostly attributable to data locality in the cluster, job and task scheduling, and resource allocation in Hadoop. The authors propose H2Hadoop, an enhanced Hadoop architecture that reduces the computation cost related to big data analysis. The proposed architecture also addresses the issue of resource allocation in native Hadoop. H2Hadoop provides a better solution for text data, such as finding DNA sequences, and provides an efficient data processing approach for cloud computing environments. The H2Hadoop architecture leverages the NameNode's ability to assign jobs to the TaskTrackers within the cluster [14]. By adding management features to the NameNode, H2Hadoop can intelligently direct and assign tasks to the DataNodes that contain the required data.

According to [6], MapReduce is a concurrent processing model for massive data refinement in clusters and datacenters. A MapReduce workload consists of a group of jobs, each containing a number of map tasks and reduce tasks, with the map tasks processed before the reduce tasks. The scheduling of tasks and the slot configuration of a MapReduce cluster have a strong effect on throughput and machine utilization for a given workload. Two kinds of precise rules are used to decrease the makespan and the total completion time of an offline MapReduce workload. The first algorithm concentrates on job-ordering optimization for a MapReduce workload under a given slot configuration [9]. The second algorithm searches for the optimized slot configuration for a MapReduce workload.

III. PROBLEM DEFINITION

In a heterogeneous Hadoop cluster [7] on the cloud, where each cluster node has its own local memory, Hadoop has the advantage of data locality [13], which means the task or processing operation is launched on the machine where the data is located or stored. If the data is not locally available on a processing node, the data has to be migrated via network interconnects to the node that performs the data processing operations. Migrating a huge quantity of data leads to excessive network congestion, which in turn can deteriorate system performance. When map tasks are launched on the multiple machines where the data is located, these map tasks operate on data blocks, and the intermediate output for these blocks is transferred via the network to the reducer machine, which takes the intermediate output as input and produces the final output. The intermediate output of the map tasks is very large, so there is an overhead in moving large intermediate data from mapper nodes to reducer nodes, which becomes an issue limiting Hadoop's performance.
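The scale of this intermediate-data overhead can be seen in a small sketch (plain Python, not Hadoop; the input string is an arbitrary example): a plain word-count mapper emits one (word, 1) pair per word occurrence, so the volume shuffled to the reducers grows linearly with the input even when there are very few distinct keys.

```python
def plain_map(split):
    # one (word, 1) pair per occurrence: shuffle volume ~ input size
    return [(word, 1) for word in split.split()]

split = "the quick fox and the lazy dog and the cat " * 1000
pairs = plain_map(split)
print(len(pairs))                  # 10000 intermediate pairs to shuffle
print(len({k for k, _ in pairs}))  # but only 7 distinct keys
```

All 10000 pairs would cross the network to the reducers, even though the final result has only 7 entries.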

IV. PROPOSED WORK

Hadoop MapReduce is now a popular choice for performing large-scale data analytics on the cloud. We created a heterogeneous cluster of multiple machines on the cloud to store and process a huge amount of data. For managing this huge volume over the cluster we deployed Hadoop on the cloud, which is widely used for storing a huge volume of data across multiple machines, with processing done by MapReduce. The data is processed by multiple machines in parallel, and the outputs are then migrated via the network to the reducer machine, which collects the intermediate data and generates the final output. Migrating a huge quantity of data leads to excessive network congestion, which in turn can deteriorate system performance. The key idea of the proposed approach is to run the combine function inside the map method to minimize the volume of emitted intermediate results, so that network congestion on the cloud is reduced, which improves the overall performance of the cluster on the cloud.

Figure 1. Proposed Flow Diagram
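The combine-inside-map idea can be illustrated with a small single-machine simulation (plain Python rather than Hadoop's Java API; function names are illustrative): by aggregating counts in a map-side dictionary, the mapper emits one pair per distinct word instead of one pair per occurrence, shrinking the intermediate data shuffled to the reducer.

```python
from collections import defaultdict

def plain_mapper(lines):
    # baseline: one (word, 1) pair per word occurrence
    return [(w, 1) for line in lines for w in line.split()]

def mapper_with_combining(lines):
    # combine inside map: aggregate counts in memory, then emit
    # one (word, count) pair per distinct word at task cleanup
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return list(counts.items())

split = ["to be or not to be", "to be is to do"]
print(len(plain_mapper(split)))          # 11 pairs without combining
print(len(mapper_with_combining(split))) # 6 pairs (one per distinct word)
```

The reducer still sums whatever values arrive per key, so the final result is unchanged; only the volume of intermediate data crossing the network is reduced.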

V. EXPERIMENTAL & RESULT ANALYSIS

All the experiments were performed using an i5-2410M CPU @ 2.30 GHz processor and 4 GB of RAM running Ubuntu 14. We then installed Java, which is a prerequisite for Hadoop, and afterwards configured Hadoop on Ubuntu. All the experiments were performed on Google Cloud Platform (GCP), on which we developed a heterogeneous cluster of five nodes. The cluster is implemented on Linux with Hadoop configured on it, and the cluster summary is shown in Table 1.

Table 1. Hardware properties of the experimental environment

To load the dataset into HDFS, we first created a heterogeneous Hadoop cluster, for which Google cloud services can be used. Using Google Cloud Platform, we first created a GCP project, and under Compute Engine we created five virtual machines (one for the master and the remaining ones for slaves); Figure 2 shows the cluster of six virtual machines.

Figure 2. Heterogeneous cluster of six nodes


We then developed a MapReduce job performing a word-count application over a whole file of around 560 MB, and ran this MapReduce job with the data-locality configuration. After development we launched the existing .jar file; after execution of the existing MapReduce job, the final output appears in the output directory along with other performance fields, such as shuffle bytes and time taken for execution. The execution time taken is shown in Figure 3.

Figure 3. Time taken by the existing MapReduce job

In the existing technique, in a cluster on the cloud where each node has a local disk, it is efficient to move data processing operations to the nodes where the application data is located. If the data is not locally available on a processing node, the data has to be migrated via network interconnects to the node that performs the data processing operations. Migrating a huge quantity of data leads to excessive network congestion, which in turn reduces system performance.

So, to increase the performance of the cluster, we efficiently balanced the load across the multiple nodes by performing computation on the nodes where the data is located. Because less data is sent to the other nodes, network congestion is reduced, and we improve performance by using the combine function inside the map method to minimize the volume of emitted intermediate results, through which network congestion on the cloud is lessened, improving the overall performance of the cluster on the cloud. Figure 4 shows the measurements for the proposed technique.

Figure 4. Time taken by the proposed MapReduce job

We then compare the performance of the cluster on the cloud between the existing technique and the proposed technique; the time taken by each technique is shown in Table 2.

Table 2. Execution time taken by the existing and proposed systems

The tabular result shown in Table 2 is represented in the graph shown in Figure 5, which clearly shows that the proposed MapReduce job takes less execution time than the existing MapReduce job.

Figure 5. Graph representation of execution time taken

VI. CONCLUSION

A huge volume of data is generated from various applications deployed in the cloud computing environment, and this huge data needs to be processed. Cloud computing is one solution for managing big data, hosting the big data workload on a cluster of multiple machines. In this work, we create a Hadoop cluster in a cloud computing environment, which is suitable for dealing with these kinds of applications in parallel. In this paper, we use Google Cloud Platform (GCP) for creating the Hadoop cluster, and we fetch the data to the corresponding compute nodes in advance. The results show that the proposed mechanism effectively reduces the data-transmission overhead over the network.

REFERENCES

[1] Tao Lin, Pengfei Zhao, Jing Zhao, "Study on Load Balancing of Intermittent Energy Big Data Cloud Platform", 2018 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS), IEEE, 2018.

[2] C. Jayashri, P. Abitha, S. Subburaj, S. Yamuna Devi, Suthir S., Janakiraman S., "Big Data Transfers through Dynamic and Load Balanced Flow on Cloud Networks", 3rd International Conference on Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB17), IEEE, 2017.

[3] Thouraya Gouasmi, Wajdi Louati, Ahmed Hadj Kacem, "Geo-distributed BigData Processing for Maximizing Profit in Federated Clouds Environment", 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, IEEE, 2018.

[4] Aditya Bhardwaj, Vineet Kumar Singh, Vanraj, Yogendra Narayan, "Analyzing BigData with Hadoop Cluster in HDInsight Azure Cloud", IEEE, 2015.

[5] Hamoud Alshammari, Jeongkyu Lee, Hassan Bajwa, "H2Hadoop: Improving Hadoop Performance Using the Metadata of Related Jobs", IEEE Transactions on Cloud Computing, manuscript ID TCC-2015-11-0399, IEEE, 2015.

[6] Sadhana R., Rabeena S., "Improve Job Ordering and Slot Configuration in Bigdata", IEEE, 2017.

[7] J. Xie, S. Yin, X. Ruan, Z. Ding, Y. Tian, J. Majors, A. Manzanares, X. Qin, "Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters", IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), April 19-23, 2010, Atlanta, USA.

[8] Zhenhua Guo, Geoffrey Fox, "Improving MapReduce Performance in Heterogeneous Network Environments and Resource Utilization", IEEE, 2012.

[9] Qi Chen, Cheng Liu, Zhen Xiao, "Improving MapReduce Performance Using Smart Speculative Execution Strategy", IEEE, 2013.

[10] S. Khalil, S. A. Salem, S. Nassar, E. M. Saad, "MapReduce Performance in Heterogeneous Environments: A Review", International Journal of Scientific & Engineering Research, vol. 4, no. 4, 2013.

[11] Z. Tang, J. Q. Zhou, K. L. Li, R. X. Li, "MTSD: A Task Scheduling Algorithm for MapReduce Based on Deadline Constraints", IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), May 21-25, 2012, Shanghai, China.

[12] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, I. Stoica, "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling", Proceedings of the 5th European Conference on Computer Systems, April 13-16, 2010, Paris, France.

[13] X. Zhang, Z. Zhong, S. Feng, B. Tu, "Improving Data Locality of MapReduce by Scheduling in Homogeneous Computing Environments", IEEE 9th International Symposium on Parallel and Distributed Processing with Applications (ISPA), May 26-28, 2011, Busan, Korea.

[14] C. Abad, Y. Lu, R. Campbell, "DARE: Adaptive Data Replication for Efficient Cluster Scheduling", IEEE International Conference on Cluster Computing (CLUSTER), September 26-30, 2011, Austin, USA.
