MapReduce: Simplified Data Processing on Large Clusters
J. Dean and S. Ghemawat (Google) OSDI 2004
Shimin Chen, DISC Reading Group
Motivation: Large Scale Data Processing
Process lots of data to produce other derived data
  Input: crawled documents, web request logs, etc.
  Output: inverted indices, web page graph structure, top queries in a day, etc.
Want to use hundreds or thousands of CPUs, but want to focus only on the functionality
MapReduce hides the messy details in a library:
  Parallelization
  Data distribution
  Fault tolerance
  Load balancing
Programming model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
  map(in_key, in_value) -> list(out_key, intermediate_value)
    Processes an input key/value pair to generate intermediate pairs
  reduce(out_key, list(intermediate_value)) -> list(out_value)
    Given all intermediate values for a particular key, produces a set of merged output values (usually just one)
Inspired by similar primitives in LISP and other functional languages
Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
Looking at Actual Code (Appendix A)

#include "mapreduce/mapreduce.h"

// User's map function
class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i]))
        i++;
      // Find word end
      int start = i;
      while ((i < n) && !isspace(text[i]))
        i++;
      if (start < i)
        Emit(text.substr(start, i-start), "1");
    }
  }
};
REGISTER_MAPPER(WordCounter);

// User's reduce function
class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);

  MapReduceSpecification spec;

  // Store list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }

  // Specify the output files:
  //   /gfs/test/freq-00000-of-00100
  //   /gfs/test/freq-00001-of-00100
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");

  // Optional: do partial sums within map tasks
  out->set_combiner_class("Adder");

  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();

  // Done: 'result' structure contains info about counters,
  // time taken, number of machines used, etc.
  return 0;
}
More Examples
Inverted index: word → documents (a sketch follows below)
  Map: parse each document, emit <word, document ID> pairs
  Reduce: emit <word, list of document IDs>
Distributed grep
  Map: emit a line if the input document matches a given pattern
  Reduce: identity function
Distributed sort
  Map: extract the key from each record, emit a <key, record> pair
  Reduce: emit all pairs unchanged
  Relies on the partitioning function and ordering guarantees (later in the talk)
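A minimal sketch of the inverted-index example, written in the style of the Appendix A code above. It is illustrative only: it assumes MapInput exposes a key() accessor carrying the document ID (the pseudocode earlier says the input key is the document name, but the slides do not show this accessor), and tokenization is simplified to whitespace splitting.

#include "mapreduce/mapreduce.h"

// Map: parse a document, emit <word, document ID> for each word.
// Assumes input.key() holds the document ID (not shown in the slides).
class IndexMapper : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& doc_id = input.key();
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      while ((i < n) && isspace(text[i])) i++;   // skip whitespace
      int start = i;
      while ((i < n) && !isspace(text[i])) i++;  // find word end
      if (start < i)
        Emit(text.substr(start, i - start), doc_id);
    }
  }
};
REGISTER_MAPPER(IndexMapper);

// Reduce: concatenate all document IDs seen for a word.
class IndexReducer : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    string doc_list;
    while (!input->done()) {
      if (!doc_list.empty()) doc_list += ",";
      doc_list += input->value();
      input->NextValue();
    }
    Emit(doc_list);   // <word, list of document IDs>
  }
};
REGISTER_REDUCER(IndexReducer);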
Implementation Overview
Execution environment: Google cluster
  100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
  100 Mbps or 1 Gbps Ethernet, but limited (average) bisection bandwidth
  Storage on local IDE disks
  GFS: distributed file system manages the data
  Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines
Parallelization
Map:
  Divide the input into M equal-sized splits
  Each split is 16-64 MB
Reduce:
  Partition the intermediate key space into R pieces: hash(intermediate_key) mod R (a minimal sketch follows below)
Typical setting: 2,000 machines, M = 200,000, R = 5,000
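For concreteness, a minimal sketch of the default partitioning rule named on this slide, using std::hash as a stand-in for whatever hash function the real library uses (the slides do not say):

#include <functional>
#include <string>

// Map an intermediate key to one of R reduce partitions:
// hash(intermediate_key) mod R.
int Partition(const std::string& intermediate_key, int R) {
  return static_cast<int>(std::hash<std::string>{}(intermediate_key) % R);
}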
Execution Overview
[Figure: execution overview — (0) the user program calls mapreduce(spec, &result); the input is divided into M splits of 16-64 MB each; map output is divided into R regions by the partitioning function hash(intermediate_key) mod R; each reduce worker reads all of its intermediate data and sorts it by intermediate keys]
More Details
Master state:
  Map task: state (idle / in-progress / completed), R intermediate file locations, worker machine
  Reduce task: state (idle / in-progress / completed), worker machine
  O(M+R) scheduling decisions, O(M*R) space (a rough sketch of this bookkeeping follows below)
Locality-preserving scheduling: schedule a map task close to its input location
Prefer fine-grain tasks (many more tasks than machines):
  Dynamic load balancing
  Speeds up recovery
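A rough sketch of the per-task bookkeeping the master keeps, as described on this slide; the type and field names are illustrative, not from the paper's code:

#include <string>
#include <vector>

enum class TaskState { kIdle, kInProgress, kCompleted };

// Per map task: state, the worker machine, and the locations of the
// R intermediate files it produced (hence the O(M*R) space).
struct MapTaskInfo {
  TaskState state = TaskState::kIdle;
  std::string worker;
  std::vector<std::string> region_files;  // R file locations
};

// Per reduce task: state and the worker machine.
struct ReduceTaskInfo {
  TaskState state = TaskState::kIdle;
  std::string worker;
};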
Fault Tolerance via Re-Execution
On worker failure:
  Detect failure via periodic heartbeats
  Re-execute completed and in-progress map tasks (their output lives on the failed worker's local disk)
  Fine granularity: completed tasks can be re-executed on many machines quickly
  Re-execute in-progress reduce tasks (completed reduce output is already in the global file system)
  Task completion is committed through the master
Master failure:
  Could handle, but don't yet (master failure unlikely)
Backup Tasks
Problem: "stragglers"
  A machine takes an unusually long time to complete one of the last few map or reduce tasks
  E.g. a bad disk with frequent correctable errors, other jobs running on the machine, machine configuration problems, etc.
Solution: near the end of a phase, the master schedules backup executions of the remaining in-progress tasks
  Whichever copy finishes first "wins"
Combiner Function
Purpose: reduce the amount of data sent over the network
Combiner function: performs partial merging of intermediate data at the map worker
Typically, the combiner function is the same as the reduce function
  Requires the function to be commutative and associative
  E.g. word count: the main() listing above registers the same Adder class as both combiner and reducer via set_combiner_class("Adder")
Skipping Bad Records
Map/Reduce functions sometimes fail for particular inputs
Best solution is to debug & fix, but that is not always possible
On a seg fault:
  Send a UDP packet to the master from the signal handler (a sketch follows below)
  Include the sequence number of the record being processed
If the master sees two failures for the same record:
  The next worker is told to skip that record
Effect: can work around bugs in third-party libraries
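A minimal sketch of this "last gasp" mechanism for a POSIX worker. The master's address, the socket setup, and the global record sequence number are all assumptions; the slides do not show the real worker/master protocol.

#include <csignal>
#include <cstdint>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static volatile sig_atomic_t g_current_record = 0;  // set before each user Map/Reduce call
static int g_master_fd = -1;                         // connected UDP socket to the master

// Signal handler: send the master the sequence number of the record
// being processed, then exit. send() and _exit() are async-signal-safe.
static void SegvHandler(int /*signo*/) {
  int64_t seq = g_current_record;
  send(g_master_fd, &seq, sizeof(seq), 0);
  _exit(1);
}

// Called once at worker startup; master_ip and master_port are hypothetical.
void InstallBadRecordHandler(const char* master_ip, uint16_t master_port) {
  g_master_fd = socket(AF_INET, SOCK_DGRAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(master_port);
  inet_pton(AF_INET, master_ip, &addr.sin_addr);
  connect(g_master_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  signal(SIGSEGV, SegvHandler);
}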
Other Refinements
Extensible input and output types
Local execution for debugging
Status web page
User-defined counters (a sketch follows below)
  Counter values are returned to the user code
  Displayed on the status web page
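The slides do not show the counter API, but the paper gives a counter example; a sketch along those lines, with the exact C++ signatures of Counter and GetCounter assumed:

#include "mapreduce/mapreduce.h"

// Word count that also counts capitalized words via a user-defined counter.
class UppercaseCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    Counter* uppercase = GetCounter("uppercase");  // assumed library call, per the paper
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      while ((i < n) && isspace(text[i])) i++;   // skip whitespace
      int start = i;
      while ((i < n) && !isspace(text[i])) i++;  // find word end
      if (start < i) {
        if (isupper(text[start])) uppercase->Increment();
        Emit(text.substr(start, i - start), "1");
      }
    }
  }
};
REGISTER_MAPPER(UppercaseCounter);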
Setup
Tests run on a cluster of 1800 machines:
  4 GB of memory
  Dual-processor 2 GHz Xeons with Hyperthreading
  Dual 160 GB IDE disks
  Gigabit Ethernet per machine
  Bisection bandwidth approximately 100 Gbps
Benchmarks
Two benchmarks:
Grep
  Scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
  M = 15,000 (input split size about 64 MB)
  R = 1
Sort
  Sort 10^10 100-byte records (modeled after the TeraSort benchmark)
  M = 15,000 (input split size about 64 MB)
  R = 4,000
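As a quick sanity check on these parameters: 10^10 records × 100 bytes = 1 TB of input per run, and 1 TB / 15,000 map tasks ≈ 67 MB per split, consistent with the stated ~64 MB split size.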
Locality optimization helps:
  1800 machines read 1 TB of data at a peak of ~31 GB/s
  Without it, rack switches would limit the rate to 10 GB/s
Startup overhead is significant for short jobs:
  Total time about 150 seconds; roughly 1 minute of that is startup
Grep