Hadoop/MapReduce Computing Paradigm Spring 2015 Taken from: WPI, Mohamed Eltabakh 1.
MapReduce Paradigm
![Page 1: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/1.jpg)
MapReduce Paradigm
Dilip Reddy Kancharla
Spring 2012
![Page 2: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/2.jpg)
Outline
• Introduction
• Motivating example
• Hadoop
  – Hadoop MapReduce
  – HDFS
• Pros & Cons of MapReduce
• Hadoop Applicability to different workflows
• Conclusions and Future work
![Page 3: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/3.jpg)
MapReduce Execution Overview [DG08]
[Figure: the user program forks a master and several workers. The master assigns map and reduce tasks. Map-phase workers read the input file splits (Split 1 to Split 5) and write intermediate key/value pairs to local disk; reduce-phase workers fetch them by remote read and write the output files (Output file 1, Output file 2).]
![Page 4: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/4.jpg)
MapReduce Paradigm
• Splits input files into blocks (typically 64 MB each)
• Operates on key/value pairs
• Mappers filter & transform input data
• Reducers aggregate the mappers' output
• Efficient way to process data on a cluster:
  – Move code to data
  – Run code on all machines
![Page 5: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/5.jpg)
• Map: (k1, v1) → list(k2, v2)
• Reduce: (k2, list(v2)) → list(k3, v3)
[Figure labels: Hash Function, Shufflers & Combiners, Aggregate Function]
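The two signatures above can be illustrated with a small single-process word count. This is only a sketch: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names chosen for illustration, and the in-memory grouping stands in for the shuffle step that a real framework performs across machines.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: (k1, v1) -> list of (k2, v2). Emits (word, 1) for every word."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: (k2, list(v2)) -> list of (k3, v3). Sums the counts per word."""
    return [(word, sum(counts))]

def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    # Map phase: apply map_fn to every input record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle: group all intermediate values by their key.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce phase: apply reduce_fn to each key group (sorted for stable output).
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output

records = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The fox")]
word_counts = run_mapreduce(records, map_fn, reduce_fn)
print(word_counts)
```

Only `map_fn` and `reduce_fn` carry application logic here; the driver is generic, which is the division of labor the paradigm relies on.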
![Page 6: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/6.jpg)
Advanced MapReduce
• Hadoop Streaming
  – Lets you stream mappers and reducers written in other languages such as Python, Ruby, etc.
• Chaining MapReduce jobs
• Joining data
• Bloom filters
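As a rough illustration of the Hadoop Streaming bullet, the sketch below writes a word-count mapper and reducer in Python that exchange tab-separated `key<TAB>value` text lines, with the reducer expecting its input sorted by key. The function names and the `simulate` helper are hypothetical; in an actual streaming job each function would read from standard input and write to standard output, and the framework would perform the sort between the two stages.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key.

    Assumes the input is already sorted by key, as the framework would provide.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

def simulate(lines):
    """Hypothetical local stand-in for the pipeline: map, sort, reduce."""
    return list(reducer(sorted(mapper(lines))))

streaming_result = simulate(["a b a", "b c"])
print(streaming_result)
```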
![Page 7: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/7.jpg)
Hadoop
• Open-source implementation of MapReduce by the Apache Software Foundation.
• Created by Doug Cutting.
• Derived from Google's MapReduce and Google File System (GFS) papers.
• Apache Hadoop is a software framework that supports data-intensive distributed applications under a free license.
• It enables applications to work with thousands of computationally independent computers and petabytes of data.
![Page 8: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/8.jpg)
Hadoop Architecture
• Hadoop MapReduce
  – Single master node, many worker nodes
  – Client submits a job to the master node
  – Master splits each job into tasks (map and reduce), and assigns tasks to worker nodes
• Hadoop Distributed File System (HDFS)
  – Single name node, many data nodes
  – Files stored as large, fixed-size (e.g. 64 MB) blocks
  – HDFS typically holds map input and reduce output
![Page 9: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/9.jpg)
Hadoop Architecture
[Figure: Namenode, Secondary Namenode, and JobTracker, with three Data nodes; each Data node hosts a TaskTracker running Map and Reduce tasks.]
![Page 10: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/10.jpg)
Job Scheduling in Hadoop
• One map task for each block of the input file
  – Applies user-defined map function to each record in the block
  – Record = <key, value>
• User-defined number of reduce tasks
  – Each reduce task is assigned a set of record groups
  – For each group, apply user-defined reduce function to the record values in that group
• Reduce tasks read from every map task
  – Each read returns the record groups for that reduce task
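The scheduling rules above can be sketched as follows: one map task per block, a user-chosen number of reduce tasks, and every reduce task reading its own partition from every map task's output. The names are hypothetical, and the CRC32-based `partition` function is an assumed stand-in for whatever hash partitioner the framework uses; it is chosen here only because it is deterministic across runs.

```python
import zlib
from collections import defaultdict

NUM_REDUCE_TASKS = 2  # user-defined number of reduce tasks

def partition(key, num_reduce_tasks):
    """Assign a key to a reduce task (CRC32 is an assumed stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

def run_map_task(block):
    """One map task per block: emit (word, 1), bucketed by reduce partition."""
    buckets = defaultdict(list)
    for record in block:
        for word in record.split():
            buckets[partition(word, NUM_REDUCE_TASKS)].append((word, 1))
    return buckets

def run_reduce_task(task_id, all_map_outputs):
    """Read this task's partition from every map task, then reduce per group."""
    groups = defaultdict(list)
    for buckets in all_map_outputs:
        for key, value in buckets.get(task_id, []):
            groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

blocks = [["a b", "a c"], ["b b", "c d"]]          # two blocks -> two map tasks
map_outputs = [run_map_task(block) for block in blocks]
reduce_outputs = [run_reduce_task(r, map_outputs) for r in range(NUM_REDUCE_TASKS)]

merged = {}
for output in reduce_outputs:
    merged.update(output)
print(merged)
```

Because the partition depends only on the key, all values for a given key reach the same reduce task, so the per-task results can be merged without conflicts.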
![Page 11: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/11.jpg)
Dataflow in Hadoop
• Map tasks write their output to local disk
  – Output available after map task has completed
• Reduce tasks write their output to HDFS
  – Once job is finished, next job's map tasks can be scheduled, and will read input from HDFS
• Therefore, fault tolerance is simple: simply re-run tasks on failure
  – No consumers see partial operator output
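The re-run-on-failure idea can be sketched in a few lines. This is an assumption-laden toy, not Hadoop's scheduler: `run_with_retries` and `flaky_map_task` are hypothetical names, the failure is simulated, and the retry limit is arbitrary. The point it illustrates is that output is published only after a task attempt completes, so a failed attempt leaves nothing visible to consumers.

```python
def run_with_retries(task, max_attempts=3):
    """Re-run a task from scratch until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = task(attempt)
        except RuntimeError:
            continue  # discard the failed attempt entirely and try again
        return result, attempt
    raise RuntimeError(f"task failed after {max_attempts} attempts")

def flaky_map_task(attempt):
    """Hypothetical map task that crashes midway through its first attempt."""
    partial = []
    for index, record in enumerate(["a", "b", "c"]):
        if attempt == 1 and index == 2:
            raise RuntimeError("simulated worker failure")
        partial.append(record.upper())
    return partial  # only a completed attempt returns output

committed_output = {}
output, attempts = run_with_retries(flaky_map_task)
committed_output["map-0"] = output  # published only after the task completed
print(committed_output, attempts)
```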
![Page 12: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/12.jpg)
Dataflow in Hadoop [CAHER10]
[Figure: a job is submitted and scheduled onto map and reduce tasks.]
![Page 13: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/13.jpg)
Dataflow in Hadoop [CAHER10]
[Figure: map tasks read the input file blocks (Block 1, Block 2) from HDFS.]
![Page 14: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/14.jpg)
Dataflow in Hadoop [CAHER10]
[Figure: map tasks write to the local FS; reduce tasks fetch the map output via HTTP GET.]
![Page 15: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/15.jpg)
Dataflow in Hadoop [CAHER10]
[Figure: reduce tasks write the final answer to HDFS.]
![Page 16: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/16.jpg)
HDFS
• Data is distributed and replicated over multiple machines.
• Files are not stored contiguously on servers; they are broken up into blocks.
• Designed for large files (large means GB or TB)
• Block oriented
• Linux-style commands (e.g. ls, cp, mkdir, mv)
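To make the block-oriented, replicated layout concrete, the sketch below splits a file into fixed-size blocks and assigns each block to several data nodes. The 64 MB block size comes from the slides; the replication factor of 3, the node names, and the round-robin placement are assumptions for illustration only and do not reflect HDFS's actual placement policy.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted in the slides

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering the file; the last may be short."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(num_blocks, data_nodes, replication=3):
    """Assumed round-robin placement: each block goes to `replication` nodes."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(replication)
        ]
    return placement

file_size = 200 * 1024 * 1024                      # a 200 MB file
blocks = split_into_blocks(file_size)              # three full blocks plus one 8 MB block
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for block_id, nodes in placement.items():
    print(block_id, blocks[block_id], nodes)
```

Losing any single data node in this toy layout still leaves at least two copies of every block, which is what makes re-replication after a failure possible.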
![Page 17: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/17.jpg)
Different Workflows [MTAGS11]
![Page 18: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/18.jpg)
Hadoop Applicability by Workflow [MTAGS11]
Score meaning:
• Score 0 implies easily adaptable to the workflow
• Score 0.5 implies moderately adaptable to the workflow
• Score 1 indicates one of the potential workflow areas where Hadoop needs improvement
![Page 19: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/19.jpg)
Relative Merits and Demerits of Hadoop Over DBMS
Pros
• Fault tolerance
• Self-healing: rebalances files across cluster
• Highly scalable
• Highly flexible, as it does not have any dependency on data model and schema
Cons
• No high-level language like SQL in DBMS
• No schema and no index
• Low efficiency
• Very young (since 2004) compared to over 40 years of DBMS

| Hadoop | Relational |
| --- | --- |
| Scale out (add more machines) | Scaling is difficult |
| Key/value pairs | Tables |
| Say how to process the data | Say what you want (SQL) |
| Offline/batch | Online/realtime |
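The "say how" versus "say what" row can be illustrated by computing the same per-customer total twice: once with explicit grouping and summing in the MapReduce style, and once with a declarative SQL query. The table name, column names, and data are made up for the example, and SQLite (from Python's standard library) stands in for a relational DBMS.

```python
import sqlite3
from collections import defaultdict

rows = [("alice", 3), ("bob", 5), ("alice", 4)]  # made-up (customer, amount) data

# "Say how": spell out the grouping and aggregation steps by hand.
groups = defaultdict(list)
for customer, amount in rows:
    groups[customer].append(amount)
how_totals = {customer: sum(amounts) for customer, amounts in groups.items()}

# "Say what": declare the desired result and let the engine plan the work.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
what_totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM purchases GROUP BY customer")
)
conn.close()

print(how_totals, what_totals)
```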
![Page 20: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/20.jpg)
Conclusions and Future Work
• MapReduce is easy to program
• Hadoop = HDFS + MapReduce
• Distributed, parallel processing
• Designed for fault tolerance and high scalability
• MapReduce is unlikely to replace DBMSs in data warehousing; instead, we expect them to complement each other and help in the analysis of scientific data patterns
• Finally, efficiency, and especially I/O cost, needs to be addressed for successful implementations
![Page 21: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/21.jpg)
References
[LLCCM12] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon, "Parallel data processing with MapReduce: a survey," SIGMOD, January 2012, pp. 11-20.
[MTAGS11] Elif Dede, Madhusudhan Govindaraju, Daniel Gunter, and Lavanya Ramakrishnan, "Riding the Elephant: Managing Ensembles with Hadoop," Proceedings of the 2011 ACM international workshop on Many task computing on grids and supercomputers, ACM, New York, NY, USA, pp. 49-58.
[DG08] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: simplified data processing on large clusters," ACM, January 2008, pp. 107-113.
[CAHER10] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears, "MapReduce online," Proceedings of the 7th USENIX conference on Networked systems design and implementation (NSDI'10), USENIX Association, Berkeley, CA, USA, 2010, pp. 21-37.
![Page 22: MapReduce Paradigm](https://reader033.fdocuments.us/reader033/viewer/2022061221/54be2e504a79598c1e8b459f/html5/thumbnails/22.jpg)
Thank You!
Questions?