MapReduce with Hadoop

MapReduce with HADOOP Vitalie Scurtu


A presentation of MapReduce in Hadoop. It shows the results of one experiment.

Transcript of MapReduce with Hadoop

Page 1: MapReduce with Hadoop

MapReduce with HADOOP

Vitalie Scurtu

Page 2: MapReduce with Hadoop

What is hadoop?

Hadoop is a set of open-source frameworks for parallel and distributed computing:

• HDFS: Distributed file system

• MapReduce: A technique and a framework for parallel computation on a cluster.

• ZooKeeper: A configuration service.

• And others: Hive, HBase, Mahout, Pig.

• Yahoo's Hadoop cluster was used to sort 1 terabyte of data in 209 seconds in the Terabyte Sorting Competition.

Page 3: MapReduce with Hadoop

Why distributed computing?

• Reduced costs. Many computers are cheaper than one more powerful computer.

• Scalability. We can add new computers to the cluster at any time.

• Combined processing power and speed.

• Distributed algorithms.

• Stability

• Robust frameworks.

Page 4: MapReduce with Hadoop

Configuring Hadoop

• It is written in Java and uses XML files for configuration.

• Installation is very simple.

• Every computer can become a part of the cluster.

• To try a demo, we need only 30 minutes.

• It uses an advanced configuration system named ZooKeeper.

• cat /usr/local/hadoop/conf/slaves

hadoop-master
hadoop-slave01
hadoop-slave02
hadoop-slave03
hadoop-slave06

Page 5: MapReduce with Hadoop

HDFS: Hadoop Distributed File System

• Distributed file system

• Support for huge files (gigabytes to terabytes)

• Tolerant of hardware failure, thanks to replication

• File access model is “Write-once-read-many”

• Cross-platform (Java)

Page 6: MapReduce with Hadoop

MapReduce

• A unique model for distributed computation: the main algorithm is divided into two stages.

– Map
  • Accepts key-value pairs (a dictionary) as input.
  • Records must be independent (key A does not depend on key B).
  • Performs the intermediate computations and prepares the data for the Reduce stage.

– Reduce
  • Accepts as input collections of key-value pairs holding intermediate results.
  • Parallel sorting and grouping functions.
  • Returns the final result.

– Map -> Reduce
  • It is not only a distributed framework but also a development methodology, thanks to its unique formula. The algorithm's constraints let the developer think about the implementation rather than the parallel computation. Once a problem is transformed into a MapReduce algorithm, the framework is applicable.

– Computation time: max(time_of_each_map) + max(time_of_each_reduce), assuming all tasks of a stage run in parallel
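The computation-time estimate above can be made concrete with a small calculation. This is only an illustrative sketch: the task durations are made-up numbers, the function name `estimate_job_time` is hypothetical, and the estimate assumes that every map task runs in parallel on its own node, that every reduce task does the same, and that shuffle and scheduling overhead can be ignored.

```python
def estimate_job_time(map_times, reduce_times):
    """Idealized job time: slowest map task plus slowest reduce task."""
    return max(map_times) + max(reduce_times)


# Hypothetical task durations in seconds (not measurements).
map_times = [40, 55, 38, 61]
reduce_times = [20, 25]

print(estimate_job_time(map_times, reduce_times))  # 86
print(sum(map_times) + sum(reduce_times))          # 239, the same work done sequentially
```

In practice the overhead ignored here can dominate, so the formula is best read as a lower bound rather than a prediction.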

Page 7: MapReduce with Hadoop

MapReduce

[Diagram: Input -> Map1, Map2, Map3, Map4 -> Reduce -> Output]
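The data flow on this slide can be imitated in a single process. The following is a toy sketch, not Hadoop code: `run_mapreduce`, `map_fn` and `reduce_fn` are hypothetical names, and the input is split into four chunks only to mirror Map1 to Map4.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn, num_maps=4):
    """Toy, single-process imitation of Input -> Map1..MapN -> Reduce -> Output."""
    # Split the input into one chunk per (simulated) map task.
    chunks = [records[i::num_maps] for i in range(num_maps)]

    # Map stage: each chunk is processed independently of the others.
    intermediate = []
    for chunk in chunks:
        for record in chunk:
            intermediate.extend(map_fn(record))

    # Shuffle: group the intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce stage: one call per key, visited in sorted key order.
    return {key: reduce_fn(key, groups[key]) for key in sorted(groups)}


def map_fn(line):
    # Emit a (token, 1) pair for every token in the line.
    return [(word, 1) for word in line.split()]


def reduce_fn(key, values):
    # Combine all values that share the same key.
    return sum(values)


result = run_mapreduce(["a b a", "b c", "a", "c c"], map_fn, reduce_fn)
print(result)  # {'a': 3, 'b': 2, 'c': 3}
```

Because each record is mapped without reference to any other, the chunks could be handed to different machines without changing the result.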

Page 8: MapReduce with Hadoop

Example of Applications

• Problem: Extract all the texts from a database with 1 million posts and compute the number of occurrences of each token.

mapper.py <- Takes an id as input

-> Prints each token with its number of occurrences

reducer.py <- Takes as input a list of tokens with their occurrence counts

-> Sums the occurrences of each token and outputs the final result.

Page 9: MapReduce with Hadoop

Experiment 1, 100K docs, 5 slaves

• Time without MapReduce
– 906.63user
– 4.18system
– 0:14:32 elapsed
– 104%CPU (0avgtext+0avgdata 0maxresident)k

• Time with MapReduce
– 3.79user
– 0.40system
– 0:21:00 elapsed
– 0%CPU (0avgtext+0avgdata 0maxresident)k

– 10/10/25 11:10:36 INFO streaming.StreamJob: map 0% reduce 0%

– 10/10/25 11:10:50 INFO streaming.StreamJob: map 16% reduce 0%

– 10/10/25 11:11:48 INFO streaming.StreamJob: map 33% reduce 0%

– 10/10/25 11:12:10 INFO streaming.StreamJob: map 49% reduce 0%

– 10/10/25 11:14:09 INFO streaming.StreamJob: map 66% reduce 0%

– 10/10/25 11:14:37 INFO streaming.StreamJob: map 82% reduce 0%

– 10/10/25 11:16:26 INFO streaming.StreamJob: map 83% reduce 0%

– 10/10/25 11:18:12 INFO streaming.StreamJob: map 83% reduce 17%

– 10/10/25 11:20:18 INFO streaming.StreamJob: map 99% reduce 17%
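Progress lines like these are easy to turn into numbers, for example to see how long the job spent between two progress reports. The sketch below assumes the timestamp format is year/month/day (two digits each) followed by the time, which matches the lines shown here; `parse_progress` and `LOG_PATTERN` are hypothetical names.

```python
import re
from datetime import datetime

# Matches e.g. "10/10/25 11:10:36 INFO streaming.StreamJob: map 0% reduce 0%"
LOG_PATTERN = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO streaming\.StreamJob:\s+"
    r"map\s+(\d+)%\s+reduce\s+(\d+)%"
)


def parse_progress(lines):
    """Return (timestamp, map_percent, reduce_percent) for each progress line."""
    progress = []
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            # Assumed timestamp layout: yy/mm/dd HH:MM:SS.
            stamp = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
            progress.append((stamp, int(match.group(2)), int(match.group(3))))
    return progress


log = [
    "– 10/10/25 11:10:36 INFO streaming.StreamJob: map 0% reduce 0%",
    "– 10/10/25 11:10:50 INFO streaming.StreamJob: map 16% reduce 0%",
    "– 10/10/25 11:20:18 INFO streaming.StreamJob: map 99% reduce 17%",
]
steps = parse_progress(log)
print(steps[-1][0] - steps[0][0])  # 0:09:42
```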

Page 10: MapReduce with Hadoop

Experiment 2, 1M docs, 5 slaves

• Time without MapReduce
– 6892.08user
– 25.03system
– 1:56:37 elapsed
– 98%CPU (0avgtext+0avgdata 0maxresident)k

• Time with MapReduce
– 6.30user
– 0.98system
– 3:26:18 elapsed
– 0%CPU (0avgtext+0avgdata 0maxresident)k

– 10/10/26 15:04:36 INFO streaming.StreamJob: map 100% reduce 14%

– 10/10/26 15:04:37 INFO streaming.StreamJob: map 100% reduce 16%

– 10/10/26 15:04:39 INFO streaming.StreamJob: map 100% reduce 25%

– 10/10/26 15:04:40 INFO streaming.StreamJob: map 100% reduce 27%

– 10/10/26 15:04:42 INFO streaming.StreamJob: map 100% reduce 30%

– 10/10/26 15:04:44 INFO streaming.StreamJob: map 100% reduce 32%

– 10/10/26 15:04:45 INFO streaming.StreamJob: map 100% reduce 34%

– 10/10/26 15:04:48 INFO streaming.StreamJob: map 100% reduce 35%

– 10/10/26 15:07:29 INFO streaming.StreamJob: map 83% reduce 35%

– 10/10/26 15:07:35 INFO streaming.StreamJob: map 100% reduce 35%

– 10/10/26 15:09:57 INFO streaming.StreamJob: map 100% reduce 36%

– 10/10/26 15:09:59 INFO streaming.StreamJob: map 100% reduce 37%

Page 11: MapReduce with Hadoop

Experiment 3, 1M docs, 3 slaves

• Time without MapReduce
– 6892.08user
– 25.03system
– 1:56:37 elapsed
– 98%CPU (0avgtext+0avgdata 0maxresident)k

• Time with MapReduce
– 5.50user
– 0.97system
– 00:53:20 elapsed
– 0%CPU (0avgtext+0avgdata 0maxresident)k

– 10/10/26 15:04:36 INFO streaming.StreamJob: map 100% reduce 14%

– 10/10/26 15:04:37 INFO streaming.StreamJob: map 100% reduce 16%

– 10/10/26 15:04:39 INFO streaming.StreamJob: map 100% reduce 25%

– 10/10/26 15:04:40 INFO streaming.StreamJob: map 100% reduce 27%

– 10/10/26 15:04:42 INFO streaming.StreamJob: map 100% reduce 30%

– 10/10/26 15:04:44 INFO streaming.StreamJob: map 100% reduce 32%

– 10/10/26 15:04:45 INFO streaming.StreamJob: map 100% reduce 34%

– 10/10/26 15:04:48 INFO streaming.StreamJob: map 100% reduce 35%

– 10/10/26 15:07:29 INFO streaming.StreamJob: map 83% reduce 35%

– 10/10/26 15:07:35 INFO streaming.StreamJob: map 100% reduce 35%

– 10/10/26 15:09:57 INFO streaming.StreamJob: map 100% reduce 36%

– 10/10/26 15:09:59 INFO streaming.StreamJob: map 100% reduce 37%
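The elapsed times reported in the three experiments can be compared directly. The sketch below only does the arithmetic on the figures from the slides; `to_seconds` is a hypothetical helper, and a ratio above 1 means the MapReduce run finished sooner than the run without it.

```python
def to_seconds(elapsed):
    """Convert an 'H:MM:SS' (or 'MM:SS') elapsed time to seconds."""
    seconds = 0
    for part in elapsed.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# (experiment, elapsed without MapReduce, elapsed with MapReduce), from the slides.
experiments = [
    ("Experiment 1, 100K docs, 5 slaves", "0:14:32", "0:21:00"),
    ("Experiment 2, 1M docs, 5 slaves", "1:56:37", "3:26:18"),
    ("Experiment 3, 1M docs, 3 slaves", "1:56:37", "00:53:20"),
]

for name, without_mr, with_mr in experiments:
    ratio = to_seconds(without_mr) / to_seconds(with_mr)
    print(f"{name}: {ratio:.2f}x")
# Experiment 1, 100K docs, 5 slaves: 0.69x
# Experiment 2, 1M docs, 5 slaves: 0.57x
# Experiment 3, 1M docs, 3 slaves: 2.19x
```

On these figures only the third run was faster with MapReduce; the slides do not state what accounts for the difference between the runs.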

Page 12: MapReduce with Hadoop

What’s next?

• MapReduce can be applied to many problems and natural language processing applications. Examples:

– Sentiment analysis.

– Computing probabilities over huge data sets.

– Retrieval problems.

– Statistics and analysis of huge data sets.

• MapReduce is not only a framework; it is also a distributed computing methodology.