MapReduce basics
Harisankar H, PhD student, DOS lab, Dept. CSE, IIT Madras
6-Feb-2013
http://harisankarh.wordpress.com
Distributed processing?
• Processing distributed across multiple machines/servers
Image from: http://installornot.com/wp-content/uploads/google-datacenter-tech-13.jpg
Why distributed processing?
– Reduce execution time of large jobs
• E.g., extracting urls from terabytes of data
• Ideally, 1000 machines could finish the job up to 1000 times faster
– Fault-tolerance
• Other nodes will take over the jobs if some of the nodes fail
– Typically, if you have 10,000 servers, one will fail per day on average
Issues in distributed processing
• Traditionally realized using special-purpose implementations
– E.g., indexer, log processor
• Implementation is really hard at the socket programming level
– Fault-tolerance
• Keep track of failures, reassignment of tasks
– Hand-coded parallelization
– Scheduling across heterogeneous nodes
– Locality
• Minimise movement of data for computation
– How to distribute data?
• Results in:
– Complex, brittle, non-generic code
– Reimplementation of common features like fault-tolerance and distribution
Need for a generic abstraction for distributed processing
• Tradeoff between genericity and performance
– More generic => usually less performance
• MapReduce is probably a sweet spot where you get both to some extent
[Figure: separation of concerns. The abstraction sits between the app programmer (expresses app logic) and the systems developer (performance, fault handling etc.)]
MapReduce abstraction (app programmer’s view)
• Model input and output as <key,value> pairs
• Provide map() and reduce() functions which act on <k,v> pairs
• Input: a set of <k,v> pairs: {<k,v>}
– For each input <k,v>:
map(k1,v1) → list(k2,v2)
– For each unique output key from map (all values for that key are combined into one list):
reduce(k2, list(v2)) → list(v3)
System will take care of distributing the tasks across thousands of machines, handling locality, fault-tolerance etc.
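The contract above can be modelled in a single process to make the types concrete. This is a minimal sketch, not a real framework API: the names MiniMapReduce, MapFn, ReduceFn, pair and charsPerDocument are hypothetical, and the example job (total characters per document) is chosen only for illustration. There is no distribution, locality or fault handling here.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process model of the MapReduce contract (illustration only).
class MiniMapReduce {

    // map(k1, v1) -> list(k2, v2)
    interface MapFn<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    // reduce(k2, list(v2)) -> list(v3)
    interface ReduceFn<K2, V2, V3> {
        List<V3> reduce(K2 key, List<V2> values);
    }

    // Small helper to build a <k,v> pair.
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleEntry<K, V>(key, value);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            List<Map.Entry<K1, V1>> input,
            MapFn<K1, V1, K2, V2> mapFn,
            ReduceFn<K2, V2, V3> reduceFn) {
        // Map phase: call map() once per input pair and group emitted values by key.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> out : mapFn.map(in.getKey(), in.getValue())) {
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }
        // Reduce phase: call reduce() once per unique intermediate key.
        Map<K2, List<V3>> result = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduceFn.reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    // Example job (an assumption for illustration): input <document name, line>,
    // output <document name, total number of characters>.
    static Map<String, List<Integer>> charsPerDocument(List<Map.Entry<String, String>> lines) {
        MapFn<String, String, String, Integer> mapFn = (doc, line) -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            out.add(pair(doc, Integer.valueOf(line.length())));
            return out;
        };
        ReduceFn<String, Integer, Integer> reduceFn = (doc, lengths) -> {
            int total = 0;
            for (int n : lengths) {
                total += n;
            }
            List<Integer> out = new ArrayList<>();
            out.add(total);
            return out;
        };
        return run(lines, mapFn, reduceFn);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> lines = new ArrayList<>();
        lines.add(pair("a.txt", "hello"));
        lines.add(pair("b.txt", "hi"));
        lines.add(pair("a.txt", "world!"));
        System.out.println(charsPerDocument(lines)); // prints {a.txt=[11], b.txt=[2]}
    }
}
```

The app programmer writes only the two functions; the grouping step in run() stands in for what the platform does between map and reduce.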
Example: word count
• Problem:
– Count the number of occurrences of each unique word in a big collection of documents
• Input <k,v> set:
– <document name, document contents>
• Organize the files in this format
• Output:
– <word, count>
• Get it in output files
• Next step:
– Define the map() and reduce() functions
Word count
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, List values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
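The pseudocode can be traced end to end in plain Java. This is a sketch under stated assumptions: it runs in one process, splits words on whitespace, and the class and method names (WordCountSketch, wordCount) are hypothetical. The grouping loop in wordCount() plays the role of the platform, which collects all values emitted for the same word before calling reduce().

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process trace of the word count pseudocode (illustration only).
class WordCountSketch {

    // map(): emit one (word, "1") pair per word in the document contents.
    static List<String[]> map(String docName, String contents) {
        List<String[]> pairs = new ArrayList<>();
        for (String w : contents.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                pairs.add(new String[] {w, "1"});
            }
        }
        return pairs;
    }

    // reduce(): sum the list of counts for a single word.
    static String reduce(String word, List<String> values) {
        int result = 0;
        for (String v : values) {
            result += Integer.parseInt(v);
        }
        return Integer.toString(result);
    }

    // Stand-in for the platform: run map() on every document, group the
    // intermediate values by word, then run reduce() once per unique word.
    static Map<String, String> wordCount(Map<String, String> documents) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (String[] pair : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
            }
        }
        Map<String, String> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1", "a b a");
        docs.put("d2", "b c");
        System.out.println(wordCount(docs)); // prints {a=2, b=2, c=1}
    }
}
```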
Program in Java
// word (a Text) and one (an IntWritable holding 1) are assumed to be fields of the enclosing mapper class
public void map(LongWritable key, Text value, Context context) throws … {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);   // emit <word, 1>
    }
}

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws … {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));   // emit <word, total count>
}
Implementing MapReduce abstraction
• Looked at the application programmer’s view
• Need a platform which implements the MapReduce abstraction
• Hadoop is the popular open-source implementation of the MapReduce abstraction
• Questions for the platform developer
– How to
• parallelize?
• handle faults?
• provide locality?
• distribute the data?
Basics of platform implementation
• parallelize?
– Each map can be executed independently in parallel
– After all maps have finished execution, all reduces can be executed in parallel
• handle faults?
– map() and reduce() have no internal state
• Simply re-execute in case of a failure
• distribute the data?
– Have a distributed file system (HDFS)
• provide locality?
– Prefer to execute map() on the nodes holding the input <k,v> pairs
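The first two answers can be sketched with a thread pool standing in for the cluster. This is an illustration under assumptions, not how Hadoop schedules work: RetryingTaskRunner, runAll and withRetry are hypothetical names, threads play the role of worker nodes, and an exception plays the role of a node failure. Re-execution is safe only because the tasks are assumed to be stateless.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Threads stand in for worker nodes (illustration only).
class RetryingTaskRunner {

    // Runs all tasks in parallel and returns their results in task order.
    // Returning only after every task has finished gives the barrier
    // needed between the map phase and the reduce phase.
    static <T> List<T> runAll(List<Callable<T>> tasks, int workers, int maxAttempts) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (Callable<T> task : tasks) {
                Callable<T> retrying = () -> withRetry(task, maxAttempts);
                futures.add(pool.submit(retrying));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> f : futures) {
                try {
                    results.add(f.get());
                } catch (ExecutionException e) {
                    throw new IllegalStateException(
                            "task failed after " + maxAttempts + " attempts", e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for tasks", e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    // A stateless task can simply be called again after a failure.
    static <T> T withRetry(Callable<T> task, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e; // remember the failure and re-execute
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        List<Callable<Integer>> mapTasks = new ArrayList<>();
        mapTasks.add(() -> "first block".length());
        mapTasks.add(() -> "second block".length());
        System.out.println(runAll(mapTasks, 2, 3));
    }
}
```

A reduce phase would be a second runAll() call issued after the first one returns.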
MapReduce implementation
• Distributed File System (DFS) + MapReduce (MR) engine
– Specifically, the MR engine uses a DFS
• Distributed file system
– Files are split into large chunks and stored in the distributed file system (e.g., HDFS)
– Large chunks: typically 64MB per block
– Can have a master-slave architecture
• Master assigns and manages replicated blocks in the slaves
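The arithmetic behind block splitting, and one possible way a master could assign replicas, can be sketched as follows. The 64MB block size comes from the slide; everything else is an assumption for illustration: the names BlockPlacementSketch, blockCount and place are hypothetical, and the round-robin policy is a deliberate simplification that is not the placement policy of HDFS or any other real file system.

```java
// Block arithmetic and a toy replica placement (illustration only).
class BlockPlacementSketch {

    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64MB per block

    // Number of blocks needed for a file: ceiling of fileSize / BLOCK_SIZE.
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Hypothetical round-robin policy: block b gets replicas on slaves
    // b, b+1, ..., b+replicas-1 (modulo the number of slaves), so the
    // replicas of one block always land on distinct slaves.
    static int[][] place(long fileSizeBytes, int slaves, int replicas) {
        if (replicas < 1 || replicas > slaves) {
            throw new IllegalArgumentException("need 1 <= replicas <= slaves");
        }
        int blocks = (int) blockCount(fileSizeBytes);
        int[][] placement = new int[blocks][replicas];
        for (int b = 0; b < blocks; b++) {
            for (int r = 0; r < replicas; r++) {
                placement[b][r] = (b + r) % slaves;
            }
        }
        return placement;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024; // a 200MB file
        System.out.println("blocks: " + blockCount(fileSize)); // prints blocks: 4
        int[][] placement = place(fileSize, 5, 3);
        for (int b = 0; b < placement.length; b++) {
            System.out.println("block " + b + " -> slaves "
                    + java.util.Arrays.toString(placement[b]));
        }
    }
}
```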
MapReduce engine
• Has a master-slave architecture
– Master co-ordinates the task execution across workers
– Workers perform the map() and reduce() functions
• Read and write blocks from/to the DFS
– Master keeps track of worker failures and reassigns tasks if necessary
• Failure detection is usually done through timeouts
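Timeout-based failure detection on the master can be sketched as a table of last-heartbeat times. This is a minimal model under assumptions: HeartbeatMonitor and its methods are hypothetical names, workers are identified by strings, and the current time is passed in explicitly so the behaviour is deterministic. It is not the protocol of any particular MR engine.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Master-side bookkeeping for timeout-based failure detection (illustration only).
class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();
    private final Map<String, List<String>> tasksByWorker = new HashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a worker reports in.
    void heartbeat(String worker, long nowMillis) {
        lastSeen.put(worker, nowMillis);
    }

    // Records that a task is running on a worker.
    void assign(String worker, String task) {
        tasksByWorker.computeIfAbsent(worker, w -> new ArrayList<>()).add(task);
    }

    // Workers silent for longer than the timeout are considered failed:
    // they are forgotten and their tasks are returned for reassignment.
    List<String> collectTasksToReassign(long nowMillis) {
        List<String> orphaned = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMillis - e.getValue() > timeoutMillis) {
                List<String> tasks = tasksByWorker.remove(e.getKey());
                if (tasks != null) {
                    orphaned.addAll(tasks);
                }
                it.remove();
            }
        }
        return orphaned;
    }

    public static void main(String[] args) {
        HeartbeatMonitor master = new HeartbeatMonitor(1000);
        master.heartbeat("w1", 0);
        master.heartbeat("w2", 0);
        master.assign("w1", "map-0");
        master.assign("w2", "map-1");
        master.heartbeat("w2", 900); // w1 stays silent
        System.out.println(master.collectTasksToReassign(1500)); // prints [map-0]
    }
}
```

Because map() and reduce() are stateless, the returned tasks can be handed to any live worker and run again from scratch.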
Some tips for designing MR jobs
• Reduce network traffic between map and reduce
– Model map() and reduce() jobs appropriately
– Use combine() functions
• combine(<k,[v]>) → <k,[v]>
• combine() executes after all map()s finish in each block
– map() → [same node] → combine() → [network] → reduce()
• Make map jobs of roughly equal expected execution times
• Try to make reduce() jobs less skewed
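The effect of combine() on network traffic can be seen by counting pairs before and after it for one block of the word count job. This is a sketch under assumptions: CombinerSketch, mapBlock and combine are hypothetical names, and the block is a short string. Pre-summing on the map node is valid here because addition is associative and commutative; a combiner for an arbitrary reduce() would need the same property.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local pre-aggregation of map output for word count (illustration only).
class CombinerSketch {

    // map() over one block: one <word, 1> pair per word.
    static List<Map.Entry<String, Integer>> mapBlock(String block) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : block.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                pairs.add(new SimpleEntry<String, Integer>(w, 1));
            }
        }
        return pairs;
    }

    // combine() on the same node: one <word, partial count> pair per distinct word.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> partial = new TreeMap<>();
        for (Map.Entry<String, Integer> p : mapOutput) {
            partial.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return new ArrayList<>(partial.entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped = mapBlock("a b a a b");
        List<Map.Entry<String, Integer>> combined = combine(mapped);
        System.out.println("pairs sent without combine(): " + mapped.size());   // 5
        System.out.println("pairs sent with combine():    " + combined.size()); // 2
        System.out.println(combined); // prints [a=3, b=2]
    }
}
```

reduce() still sums what it receives, so the final counts are unchanged; only the number of pairs crossing the network drops.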
Pros and cons of MapReduce
• Advantages
– Simple, easy-to-use distributed processing system
– Reasonably generic
– Exploits locality for performance
– Simple and less buggy implementation
• Issues
– Not a magic bullet that fits all problems
• Difficult to model iterative and recursive computations
– E.g.: k-means clustering
– Generate-Map-Reduce
• Difficult to model streaming computations
• Centralized entities like the master become bottlenecks
• Most real-world problems require large chains of MR jobs
Summary
• Today
– Distributed processing issues, MR programming model
– Sample MR job
– How MR can be implemented
– Pros and cons of MR, tips for better performance
• Tomorrow
– Details specific to Hadoop
– Downloading and setting up Hadoop on a cluster
Ack: some images from: Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.
Hadoop components
• HDFS
– Master: NameNode
– Slave: DataNode
• MapReduce engine
– Master: JobTracker
– Slave: TaskTracker