Application of MapReduce in Cloud Computing


MapReduce in Cloud Computing

Mohammad Mustaqeem
M.Tech 2nd Year

Reg No: 2011CS17

Computer Science and Engineering Department
Motilal Nehru National Institute of Technology Allahabad

November 8, 2012


Outline

1 Introduction
2 Motivation
3 Description of First Paper
    Issues
    Approach Used
        HDFS
        MapReduce Programming Model
    Example: Word Count
4 Description of Second Paper
    Issues
    Approach Used
        Architecture
        System Mechanism
    Example
5 Comparison
6 Conclusion


Introduction

MapReduce is a general-purpose programming model for data-intensive computing.
It was introduced by Google in 2004 to construct its web index.
It is also used at Yahoo, Facebook, and other companies.
It uses a parallel computing model that distributes computational tasks across a large number of nodes (approximately 1,000 to 10,000).
It is fault-tolerant: it can keep working even when 1,600 of 1,800 nodes fail.


Introduction

In the MapReduce model, the user has to write only two functions: map and reduce.
A few examples that can be easily expressed as MapReduce computations:

    Distributed grep
    Count of URL access frequency
    Inverted index
    Mining


Motivation

Cloud computing refers to services offered by clusters of 1,000 to 10,000 machines [6], e.g. the services offered by Yahoo, Google, etc.

Cloud computing delivers computing resources as a service. It may be:

    Infrastructure as a Service (IaaS)
    Platform as a Service (PaaS)
    Software as a Service (SaaS)
    Storage as a Service (STaaS), etc.


Motivation cont..

A cloud service differs from a traditional hosting service in the following ways [6]:

    It is sold on demand, typically by the minute or the hour.
    It is elastic: a user can have as much or as little of a service as they want at any given time.
    It is fully managed by the provider (the consumer needs nothing but a personal computer and Internet access).

Amazon Web Services is the largest public cloud provider.


Motivation cont..

MapReduce is a programming model for large-scale computing [3].
It uses the distributed environment of the cloud to process large amounts of data in a reasonable amount of time.
It was inspired by the map and reduce functions of functional programming languages (such as Lisp, Scheme, and Racket) [3].
Map and reduce in Racket (a functional programming language) [4]:

    Map:
    (map f list1) → list2
    e.g. (map square '(1 2 3 4 5)) → '(1 4 9 16 25)

    Reduce:
    (foldl f init list1) → any
    e.g. (foldl + 0 '(1 2 3 4 5)) → 15
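For readers more familiar with Python, the two Racket calls can be approximated with the built-in `map` and `functools.reduce`. This is only a rough analogy, and `square` is a helper defined here for illustration.

```python
from functools import reduce
from operator import add


def square(x):
    """Helper for illustration; plays the role of `square` in the Racket example."""
    return x * x


numbers = [1, 2, 3, 4, 5]

# map applies a function to every element and yields a new sequence.
squares = list(map(square, numbers))

# reduce folds the sequence into a single value, starting from the initial value 0.
total = reduce(add, numbers, 0)

print(squares)
print(total)
```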


Motivation cont..

However, the map and reduce functions in the MapReduce model are not exactly the same as in functional programming.
Map and reduce functions in the MapReduce model:

Map: It processes a (key, value) pair and returns a list of (intermediate key, value) pairs:

map(k1, v1) → list(k2, v2)

Reduce: It merges all intermediate values having the same intermediate key:

reduce(k2, list(v2)) → list(v3)
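The two signatures can be made concrete with a small single-process sketch. The names `run_mapreduce`, `index_map`, and `index_reduce` are hypothetical and chosen for this illustration; a real framework distributes the same steps over many machines. The sample job builds an inverted index, one of the computations listed in the introduction.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def run_mapreduce(
    inputs: Iterable[Tuple[str, str]],
    map_fn: Callable[[str, str], List[Tuple[str, str]]],
    reduce_fn: Callable[[str, List[str]], List[str]],
) -> Dict[str, List[str]]:
    """Apply map_fn to every (k1, v1), group by k2, then apply reduce_fn."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):  # map(k1, v1) -> list(k2, v2)
            groups[k2].append(v2)
    # reduce(k2, list(v2)) -> list(v3)
    return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}


def index_map(doc_id: str, text: str) -> List[Tuple[str, str]]:
    """Emit (word, doc_id) for every word in the document."""
    return [(word, doc_id) for word in text.lower().split()]


def index_reduce(word: str, doc_ids: List[str]) -> List[str]:
    """Merge all document ids seen for one word into a sorted, unique list."""
    return sorted(set(doc_ids))


documents = [("d1", "cloud map reduce"), ("d2", "map the cloud")]
index = run_mapreduce(documents, index_map, index_reduce)
print(index)
```

The grouping step in the middle is what the framework performs between the two user-defined functions; the user code never sees it.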


Issues


Gaizhen Yang, "The Application of MapReduce in the Cloud Computing"

It analyzes Hadoop.
Hadoop is an implementation of the MapReduce model.
It processes data in parallel in a distributed manner.
It divides the data into logical blocks, processes these blocks in parallel on different machines, and finally combines all the results to produce the final result [1].
It is fault-tolerant.
One attractive feature of Hadoop is that the user can write the map and reduce functions in any programming language.


Approach Used


Hadoop is an open-source Java framework for processing large amounts of data on clusters of machines [1].
Hadoop is an implementation of Google's MapReduce programming model.
Yahoo is the biggest contributor to Hadoop [5].
Hadoop has two main components:

    Hadoop Distributed File System (HDFS)
    MapReduce


Approach Used

HDFS

HDFS provides support for distributed storage [1].
As in a traditional file system, files can be deleted, renamed, etc.
HDFS has two types of nodes:

    Name Node
    Data Node

Figure: HDFS Architecture



Approach Used

HDFS cont..

Name Node:

    The Name Node provides the main data services.
    It is a process that runs on a separate machine.
    It stores only the metadata of the files and directories.
    Programmers access files through it.
    For reliability of the file system, it ensures that multiple copies of each file block are kept.

Data Node:

    A Data Node is a process that runs on an individual machine of the cluster.
    The file blocks are stored in the local file system of these nodes.
    It periodically sends the metadata of its stored blocks to the Name Node.
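The division of labour between the two node types can be illustrated with a toy in-memory model. The class names, the `replication` parameter, the `write_file` client helper, and the round-robin placement are all assumptions made for this sketch; they are not the HDFS API or its actual placement policy.

```python
from typing import Dict, List, Set


class DataNode:
    """Stores file blocks locally and reports which blocks it holds."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}

    def store(self, block_id: str, data: bytes) -> None:
        self.blocks[block_id] = data

    def block_report(self) -> Set[str]:
        # Metadata only: the ids of the blocks stored on this node.
        return set(self.blocks)


class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self, node_names: List[str], replication: int = 2) -> None:
        self.node_names = node_names
        self.replication = replication
        self.file_blocks: Dict[str, List[str]] = {}
        self.block_locations: Dict[str, Set[str]] = {}

    def allocate(self, path: str, num_blocks: int) -> Dict[str, List[str]]:
        """Pick `replication` data nodes per block (simple round-robin)."""
        plan: Dict[str, List[str]] = {}
        for b in range(num_blocks):
            plan[f"{path}#{b}"] = [
                self.node_names[(b + r) % len(self.node_names)]
                for r in range(self.replication)
            ]
        self.file_blocks[path] = list(plan)
        return plan

    def receive_report(self, node_name: str, block_ids: Set[str]) -> None:
        """Record the blocks a data node says it is storing."""
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(node_name)


def write_file(name_node: NameNode, data_nodes: Dict[str, DataNode],
               path: str, data: bytes, block_size: int = 4) -> None:
    """Client side: ask the name node for a plan, then send blocks to data nodes."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    plan = name_node.allocate(path, len(chunks))
    for chunk, (block_id, targets) in zip(chunks, plan.items()):
        for target in targets:
            data_nodes[target].store(block_id, chunk)


data_nodes = {name: DataNode(name) for name in ("dn1", "dn2", "dn3")}
name_node = NameNode(list(data_nodes), replication=2)
write_file(name_node, data_nodes, "/logs/a.txt", b"0123456789")
for node in data_nodes.values():  # stands in for the periodic block reports
    name_node.receive_report(node.name, node.block_report())
```

Note that the name node object never touches the file bytes; it only records block ids and locations, which mirrors the metadata-only role described on the slide.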


Approach Used

MapReduce Programming Model

MapReduce is the key concept behind Hadoop.
It is a technique for dividing work across a distributed system.
The user has to define only two functions:

Map: It processes a (key, value) pair and returns a list of (intermediate key, value) pairs:

map(k1, v1) → list(k2, v2)

Reduce: It merges all intermediate values having the same intermediate key:

reduce(k2, list(v2)) → list(v3)


Approach Used

MapReduce Programming Model cont..

Execution phases of a MapReduce application:

1 The MapReduce library splits the input files into M pieces and copies these pieces to multiple machines.
2 The master picks idle workers and assigns each of them a map task.
3 Each map worker reads the key-value pairs of its input data, passes each pair to the user-defined map function, and produces the intermediate key-value pairs.
4 The map worker buffers the output key-value pairs in local memory. It passes these memory locations to the master, and the master forwards them to the reducers.
5 After reading the intermediate key-value pairs, the reducer sorts them by intermediate key.
6 For each intermediate key, the user-defined reduce function is applied to the corresponding intermediate values.


Approach Used

MapReduce Programming Model cont..

7 When all map tasks and reduce tasks have been completed, the master gives the final output to the user.

Figure: Execution phase of a generic MapReduce Application

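The seven execution steps can be traced in a single-process sketch. The master, worker scheduling, and the network are left out; the parameters `m` and `r` (number of input splits and of reduce partitions) and all function names are assumptions for illustration. Assigning intermediate keys to reducers by a hash of the key follows the original MapReduce paper [3]. The sample job counts URL accesses.

```python
import zlib
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

KV = Tuple[str, int]


def split_input(records: List[str], m: int) -> List[List[str]]:
    """Step 1: split the input into m pieces."""
    return [records[i::m] for i in range(m)]


def run_job(records: List[str],
            map_fn: Callable[[str], List[KV]],
            reduce_fn: Callable[[str, List[int]], int],
            m: int = 3, r: int = 2) -> Dict[str, int]:
    # Steps 2-3: each split is handled by one map task.
    # Step 4: map output is buffered in r partitions, one per reduce task.
    partitions: List[List[KV]] = [[] for _ in range(r)]
    for split in split_input(records, m):
        for record in split:
            for key, value in map_fn(record):
                partitions[zlib.crc32(key.encode()) % r].append((key, value))

    output: Dict[str, int] = {}
    for partition in partitions:
        # Step 5: the reducer sorts its pairs by intermediate key.
        partition.sort(key=lambda kv: kv[0])
        grouped: Dict[str, List[int]] = defaultdict(list)
        for key, value in partition:
            grouped[key].append(value)
        # Step 6: apply the user-defined reduce function to each key.
        for key, values in grouped.items():
            output[key] = reduce_fn(key, values)
    # Step 7: the combined result is handed back to the user.
    return output


def access_map(log_line: str) -> List[KV]:
    """Emit (url, 1); the URL is assumed to be the last field of the line."""
    return [(log_line.split()[-1], 1)]


def access_reduce(url: str, counts: List[int]) -> int:
    return sum(counts)


log = ["GET /index", "GET /about", "GET /index", "POST /login", "GET /index"]
result = run_job(log, access_map, access_reduce, m=2, r=2)
print(result)
```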


Example: Word Count


The pseudocode of the map and reduce functions for the word count problem is:

Algorithm 3.1: MAPPER(filename, file-contents)

    for each word ∈ file-contents
        do EMIT(word, 1)

Algorithm 3.2: REDUCER(word, values)

    sum ← 0
    for each value ∈ values
        do sum ← sum + value
    EMIT(word, sum)
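A runnable Python rendering of the two algorithms is sketched below. The `word_count` driver is a local stand-in, assumed for this example, for the grouping that the framework performs between the map and reduce phases.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def mapper(filename: str, file_contents: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for each word in the file contents."""
    for word in file_contents.split():
        yield word, 1


def reducer(word: str, values: List[int]) -> Tuple[str, int]:
    """Sum all counts emitted for one word."""
    total = 0
    for value in values:
        total += value
    return word, total


def word_count(files: Dict[str, str]) -> Dict[str, int]:
    """Local driver: group mapper output by word, then call the reducer."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for filename, contents in files.items():
        for word, one in mapper(filename, contents):
            grouped[word].append(one)
    return dict(reducer(word, values) for word, values in grouped.items())


counts = word_count({"a.txt": "hello cloud hello", "b.txt": "cloud map reduce"})
print(counts)
```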


Example: Word Count

Example: Word Count cont..

Figure: Word Count Execution



Issues


Fabrizio Marozzo, Domenico Talia, Paolo Trunfio, "P2P-MapReduce: Parallel data processing in dynamic Cloud environments"

The MapReduce discussed so far is centralized.
It cannot deal with master failure.
Since nodes join and leave the cloud dynamically, a P2P-MapReduce model is needed.
This paper describes an adaptive P2P-MapReduce system that can handle master failure.


Approach Used


P2P-MapReduce is a programming model in which nodes may join and leave the cluster dynamically.
A node acts as either a master or a slave at any given time.
Masters and slaves exchange roles dynamically so that the master/slave ratio remains constant.
To prevent the loss of computation in case of master failure, each master has some backup masters.

The master responsible for a job J is referred to as the primary master for J.
The primary master dynamically updates the job state on its backup nodes, which are referred to as the backup masters for J.
When a primary master fails, its place is taken by one of its backup masters.


Approach Used

Architecture

There are three types of nodes in the P2P-MapReduce architecture:

    User
    Master
    Slave

The master and slave nodes form two logical peer-to-peer networks, M-net and S-net, respectively.
The composition of M-net and S-net changes dynamically.
A user node submits the MapReduce job to one of the available master nodes. The master node is selected according to the current workload of the available master nodes.


Approach Used

Architecture cont..

Master nodes perform three types of operations [2]:

    Management: A master node that is acting as primary master for one or more jobs executes the management operation.
    Recovery: A master node that is acting as backup master for one or more jobs executes the recovery operation.
    Coordination: The coordination operation changes slaves into masters and vice versa, so as to keep the desired master/slave ratio.

A slave executes the tasks assigned to it by one or more primary masters.
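The coordination operation can be pictured with the following centralized sketch. The function name `rebalance`, the `target_ratio` parameter, and the choice of which node to promote or demote are assumptions for illustration; the slides do not specify how the real system makes these decisions across peers.

```python
from typing import List, Tuple


def rebalance(masters: List[str], slaves: List[str],
              target_ratio: float) -> Tuple[List[str], List[str]]:
    """Move nodes between roles until masters/slaves is as close as possible
    to target_ratio. At least one master is always kept."""
    masters, slaves = list(masters), list(slaves)
    total = len(masters) + len(slaves)
    # masters / slaves = ratio  =>  masters = total * ratio / (1 + ratio)
    wanted = max(1, round(total * target_ratio / (1 + target_ratio)))
    while len(masters) < wanted and slaves:
        masters.append(slaves.pop())  # promote a slave to master
    while len(masters) > wanted:
        slaves.append(masters.pop())  # demote a master to slave
    return masters, slaves


# Nine slaves joined while only one master exists; desired ratio is 1:4.
masters, slaves = rebalance(["m1"], [f"s{i}" for i in range(1, 10)], 0.25)
print(masters, slaves)
```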


Approach Used

Architecture cont..

For each managed job, the primary master runs one Job Manager.
Backup masters run a Backup Job Manager.
For each assigned task, the slave runs one Task Manager.
The Task Manager keeps its Job Manager informed. The information includes the status of the slave (ACTIVE or DEAD) and how much computation has been done.
If a master does not receive the signal from a Task Manager, it reschedules the assigned task on another idle slave.
In addition, if a slave works slowly, the master node also reschedules the assigned task on another idle slave, keeps whichever output arrives first, and discards the other.
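A minimal sketch of this rescheduling behaviour follows. Only the missed-signal trigger is modelled; a slow slave would lead to the same `assign` call. The class layout, the `timeout` value, and the explicit `now` arguments (used instead of a real clock so the example is deterministic) are assumptions, not details from the paper.

```python
from typing import Dict, List, Optional


class JobManager:
    """Tracks which slaves run each task and reschedules when they go silent."""

    def __init__(self, idle_slaves: List[str], timeout: float) -> None:
        self.idle_slaves = list(idle_slaves)
        self.timeout = timeout
        self.assigned: Dict[str, List[str]] = {}  # task -> slaves running it
        self.last_seen: Dict[str, float] = {}     # slave -> time of last status
        self.results: Dict[str, str] = {}         # task -> first output received

    def assign(self, task: str, now: float) -> Optional[str]:
        """Give the task to the next idle slave, if there is one."""
        if not self.idle_slaves:
            return None
        slave = self.idle_slaves.pop(0)
        self.assigned.setdefault(task, []).append(slave)
        self.last_seen[slave] = now
        return slave

    def status(self, slave: str, now: float) -> None:
        """Called when a Task Manager reports that its slave is ACTIVE."""
        self.last_seen[slave] = now

    def check(self, now: float) -> List[str]:
        """Reschedule unfinished tasks whose slaves have all gone silent."""
        rescheduled = []
        for task, slaves in self.assigned.items():
            if task in self.results:
                continue
            if all(now - self.last_seen[s] > self.timeout for s in slaves):
                if self.assign(task, now) is not None:
                    rescheduled.append(task)
        return rescheduled

    def complete(self, task: str, slave: str, output: str) -> bool:
        """Keep the first output for a task and discard later duplicates."""
        if task in self.results:
            return False
        self.results[task] = output
        return True


jm = JobManager(["s1", "s2", "s3"], timeout=5.0)
jm.assign("t1", now=0.0)           # runs on s1
jm.assign("t2", now=0.0)           # runs on s2
jm.status("s2", now=4.0)           # s2 reports in, s1 stays silent
rescheduled = jm.check(now=6.0)    # t1 is moved to the idle slave s3
first = jm.complete("t1", "s3", "output from s3")
late = jm.complete("t1", "s1", "output from s1")
```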


Approach Used

System Mechanism

The mechanism of a generic node can be understood from its UML state diagram [2].

Figure: Behaviour of a generic node described by a UML State Diagram


Example


Figure: P2P-MapReduce example



Example

Example cont..

The following recovery procedure takes place when a primary master Node1 fails [2]:

    Backup masters Node2 and Node3 detect the failure of Node1 and start a distributed procedure to elect the new primary master among them.
    Assuming that Node3 is elected as the new primary master, Node2 continues to play the backup function and, to keep the desired number of backup masters active, another backup node is chosen by Node3.
    Node3 uses its local replica of the job state to proceed from where Node1 failed.
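The recovery procedure can be sketched as below. The slides do not state the election rule, so this example simply assumes that the live backup with the highest name wins; that rule, the class names, and the `spare_masters` list are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MasterNode:
    name: str
    alive: bool = True
    job_state: Dict[str, int] = field(default_factory=dict)  # local replica


@dataclass
class Job:
    primary: MasterNode
    backups: List[MasterNode]

    def update(self, key: str, value: int) -> None:
        """The primary records progress and copies it to every backup master."""
        self.primary.job_state[key] = value
        for backup in self.backups:
            backup.job_state[key] = value

    def recover(self, spare_masters: List[MasterNode]) -> None:
        """Replace a failed primary and restore the number of backups."""
        if self.primary.alive:
            return
        wanted_backups = len(self.backups)
        candidates = [b for b in self.backups if b.alive]
        if not candidates:
            raise RuntimeError("no live backup master holds the job state")
        # Assumed election rule: the live backup with the highest name wins.
        new_primary = max(candidates, key=lambda node: node.name)
        self.backups = [b for b in candidates if b is not new_primary]
        self.primary = new_primary
        # The new primary picks further backups and sends them its replica.
        while len(self.backups) < wanted_backups and spare_masters:
            spare = spare_masters.pop(0)
            spare.job_state = dict(new_primary.job_state)
            self.backups.append(spare)


node1, node2, node3, node4 = (MasterNode(f"Node{i}") for i in range(1, 5))
job = Job(primary=node1, backups=[node2, node3])
job.update("map_tasks_done", 7)
node1.alive = False                 # the primary master fails
job.recover(spare_masters=[node4])
```

After `recover`, Node3 is the primary and continues from the replicated state, while Node2 and the newly chosen Node4 act as backups.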


Comparison between two Papers

                  First Paper                               Second Paper

Issues            To perform data-intensive computation     To design a P2P MapReduce system that
                  in a Cloud environment in a reasonable    can handle all node failures, including
                  amount of time.                           master node failure.

Approaches Used   A simple implementation of MapReduce      A peer-to-peer architecture is used to
                  (presented by Google) is used. The        handle all the dynamic churn in a
                  implemented version is known as Hadoop,   cluster.
                  which is based on the master-slave
                  model.

Advantages        Hadoop is scalable, reliable, and         P2P-MapReduce can manage node churn,
                  distributed, and is able to handle        master failures, and job recovery in an
                  enormous amounts of data. It can          effective way.
                  process big data in real time.

Table: Comparison between two Papers.


Conclusion

MapReduce is a scalable, reliable computing model that exploits the distributed environment of the cloud.
MapReduce optimizes system performance by rescheduling slow tasks on multiple slaves.
P2P-MapReduce has all the properties of simple MapReduce.
Because P2P-MapReduce provides fault tolerance against master failures, it is more reliable.
P2P-MapReduce prevents computation loss by keeping the job state at backup masters.


References

[1] Gaizhen Yang, "The Application of MapReduce in the Cloud Computing", International Symposium on Intelligence Information Processing and Trusted Computing (IPTC), October 2011, pp. 154-156, http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6103560.

[2] Fabrizio Marozzo, Domenico Talia, Paolo Trunfio, "P2P-MapReduce: Parallel data processing in dynamic Cloud environments", Journal of Computer and System Sciences, vol. 78, issue 5, September 2012, pp. 1382-1402, http://dl.acm.org/citation.cfm?id=2240494.

[3] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: simplified data processing on large clusters", OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, vol. 6, 2004, pp. 10-10, www.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf and http://dl.acm.org/citation.cfm?id=1251254.1251264.



References

[4] The Racket Guide, http://docs.racket-lang.org/guide/.

[5] Hadoop Tutorial - YDN, http://developer.yahoo.com/hadoop/tutorial/module4.html.

[6] http://readwrite.com/2012/10/15/why-the-future-of-software-and-apps-is-serverless.

[7] F. Marozzo, D. Talia, P. Trunfio, "A Peer-to-Peer Framework for Supporting MapReduce Applications in Dynamic Cloud Environments", In: N. Antonopoulos, L. Gillam (eds.), Cloud Computing: Principles, Systems and Applications, Springer, Chapter 7, 113-125, 2010.

[8] IBM developerWorks, "Using MapReduce and load balancing on the cloud", http://www.ibm.com/developerworks/cloud/library/cl-mapreduce/.



THANK YOU
