MapReduce: Simplified Data Processing on Large Clusters
J. Dean and S. Ghemawat (Google) OSDI 2004
Shimin Chen, DISC Reading Group
Motivation: Large Scale Data Processing
Process lots of data to produce other derived data
  Input: crawled documents, web request logs, etc.
  Output: inverted indices, web page graph structure, top queries in a day, etc.
Want to use hundreds or thousands of CPUs, but want to focus only on the functionality
MapReduce hides the messy details in a library:
  Parallelization
  Data distribution
  Fault tolerance
  Load balancing
Programming model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
  map(in_key, in_value) -> list(out_key, intermediate_value)
    Processes an input key/value pair to generate intermediate pairs
  reduce(out_key, list(intermediate_value)) -> list(out_value)
    Given all intermediate values for a particular key, produces a set of merged output values (usually just one)
Inspired by similar primitives in LISP and other functional languages
Example: Count word occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
Looking at Actual Code (Appendix A)

#include "mapreduce/mapreduce.h"

// User's map function
class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i]))
        i++;
      // Find word end
      int start = i;
      while ((i < n) && !isspace(text[i]))
        i++;
      if (start < i)
        Emit(text.substr(start, i-start), "1");
    }
  }
};
REGISTER_MAPPER(WordCounter);

// User's reduce function
class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);

  MapReduceSpecification spec;

  // Store list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }

  // Specify the output files:
  //   /gfs/test/freq-00000-of-00100
  //   /gfs/test/freq-00001-of-00100
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");

  // Optional: do partial sums within map tasks
  out->set_combiner_class("Adder");

  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();

  // Done: 'result' structure contains info about counters,
  // time taken, number of machines used, etc.
  return 0;
}
More Examples
Inverted index: word → documents (a sketch follows below)
  Map: parse each document, emit <word, document ID> pairs
  Reduce: emit <word, list of document IDs>
Distributed grep
  Map: emit a line if the input document matches a given pattern
  Reduce: identity function
Distributed sort
  Map: extract the key from each record, emit a <key, record> pair
  Reduce: emit all pairs unchanged
  Relies on the partitioning function and ordering guarantees (later in the talk)
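A minimal sketch of the inverted-index example, written in the style of the Appendix A code above. It is illustrative only: it assumes MapInput exposes a key() accessor carrying the document ID (the pseudocode earlier says the input key is the document name, but the slides do not show this accessor), and tokenization is simplified to whitespace splitting.

#include "mapreduce/mapreduce.h"

// Map: parse a document, emit <word, document ID> for each word.
// Assumes input.key() holds the document ID (not shown in the slides).
class IndexMapper : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& doc_id = input.key();
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      while ((i < n) && isspace(text[i])) i++;   // skip whitespace
      int start = i;
      while ((i < n) && !isspace(text[i])) i++;  // find word end
      if (start < i)
        Emit(text.substr(start, i - start), doc_id);
    }
  }
};
REGISTER_MAPPER(IndexMapper);

// Reduce: concatenate all document IDs seen for a word.
class IndexReducer : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    string doc_list;
    while (!input->done()) {
      if (!doc_list.empty()) doc_list += ",";
      doc_list += input->value();
      input->NextValue();
    }
    Emit(doc_list);   // <word, list of document IDs>
  }
};
REGISTER_REDUCER(IndexReducer);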
Implementation Overview
Execution environment: Google cluster
  100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
  100 Mbps or 1 Gbps Ethernet, but limited (average) bisection bandwidth
  Storage on local IDE disks
  GFS: distributed file system manages the data
  Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines
Parallelization
Map:
  Divide the input into M equal-sized splits
  Each split is 16-64 MB
Reduce:
  Partition the intermediate key space into R pieces: hash(intermediate_key) mod R (a minimal sketch follows below)
Typical setting: 2,000 machines, M = 200,000, R = 5,000
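For concreteness, a minimal sketch of the default partitioning rule named on this slide, using std::hash as a stand-in for whatever hash function the real library uses (the slides do not say):

#include <functional>
#include <string>

// Map an intermediate key to one of R reduce partitions:
// hash(intermediate_key) mod R.
int Partition(const std::string& intermediate_key, int R) {
  return static_cast<int>(std::hash<std::string>{}(intermediate_key) % R);
}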
Execution Overview
[Figure: execution overview — (0) the user program calls mapreduce(spec, &result); the input is divided into M splits of 16-64 MB each; map output is divided into R regions by the partitioning function hash(intermediate_key) mod R; each reduce worker reads all of its intermediate data and sorts it by intermediate keys]
More Details
Master state:
  Map task: state (idle / in-progress / completed), R intermediate file locations, worker machine
  Reduce task: state (idle / in-progress / completed), worker machine
  O(M+R) scheduling decisions, O(M*R) space (a rough sketch of this bookkeeping follows below)
Locality-preserving scheduling: schedule a map task close to its input location
Prefer fine-grain tasks (many more tasks than machines):
  Dynamic load balancing
  Speeds up recovery
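A rough sketch of the per-task bookkeeping the master keeps, as described on this slide; the type and field names are illustrative, not from the paper's code:

#include <string>
#include <vector>

enum class TaskState { kIdle, kInProgress, kCompleted };

// Per map task: state, the worker machine, and the locations of the
// R intermediate files it produced (hence the O(M*R) space).
struct MapTaskInfo {
  TaskState state = TaskState::kIdle;
  std::string worker;
  std::vector<std::string> region_files;  // R file locations
};

// Per reduce task: state and the worker machine.
struct ReduceTaskInfo {
  TaskState state = TaskState::kIdle;
  std::string worker;
};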
Fault Tolerance via Re-Execution
On worker failure:
  Detect failure via periodic heartbeats
  Re-execute completed and in-progress map tasks (their output lives on the failed worker's local disk)
  Fine granularity: completed tasks can be re-executed on many machines quickly
  Re-execute in-progress reduce tasks (completed reduce output is already in the global file system)
  Task completion is committed through the master
Master failure:
  Could handle, but don't yet (master failure unlikely)
Backup Tasks
Problem: "stragglers"
  A machine takes an unusually long time to complete one of the last few map or reduce tasks
  E.g. a bad disk with frequent correctable errors, other jobs running on the machine, machine configuration problems, etc.
Solution: near the end of a phase, the master schedules backup executions of the remaining in-progress tasks
  Whichever copy finishes first "wins"
Combiner Function
Purpose: reduce the amount of data sent over the network
Combiner function: performs partial merging of intermediate data at the map worker
Typically, the combiner function is the same as the reduce function
  Requires the function to be commutative and associative
  E.g. word count: the main() listing above registers the same Adder class as both combiner and reducer via set_combiner_class("Adder")
Skipping Bad Records
Map/Reduce functions sometimes fail for particular inputs
Best solution is to debug & fix, but that is not always possible
On a seg fault:
  Send a UDP packet to the master from the signal handler (a sketch follows below)
  Include the sequence number of the record being processed
If the master sees two failures for the same record:
  The next worker is told to skip that record
Effect: can work around bugs in third-party libraries
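A minimal sketch of this "last gasp" mechanism for a POSIX worker. The master's address, the socket setup, and the global record sequence number are all assumptions; the slides do not show the real worker/master protocol.

#include <csignal>
#include <cstdint>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static volatile sig_atomic_t g_current_record = 0;  // set before each user Map/Reduce call
static int g_master_fd = -1;                         // connected UDP socket to the master

// Signal handler: send the master the sequence number of the record
// being processed, then exit. send() and _exit() are async-signal-safe.
static void SegvHandler(int /*signo*/) {
  int64_t seq = g_current_record;
  send(g_master_fd, &seq, sizeof(seq), 0);
  _exit(1);
}

// Called once at worker startup; master_ip and master_port are hypothetical.
void InstallBadRecordHandler(const char* master_ip, uint16_t master_port) {
  g_master_fd = socket(AF_INET, SOCK_DGRAM, 0);
  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_port = htons(master_port);
  inet_pton(AF_INET, master_ip, &addr.sin_addr);
  connect(g_master_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
  signal(SIGSEGV, SegvHandler);
}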
Other Refinements
Extensible input and output types
Local execution for debugging
Status web page
User-defined counters (a sketch follows below)
  Counter values are returned to the user code
  Displayed on the status web page
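The slides do not show the counter API, but the paper gives a counter example; a sketch along those lines, with the exact C++ signatures of Counter and GetCounter assumed:

#include "mapreduce/mapreduce.h"

// Word count that also counts capitalized words via a user-defined counter.
class UppercaseCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    Counter* uppercase = GetCounter("uppercase");  // assumed library call, per the paper
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      while ((i < n) && isspace(text[i])) i++;   // skip whitespace
      int start = i;
      while ((i < n) && !isspace(text[i])) i++;  // find word end
      if (start < i) {
        if (isupper(text[start])) uppercase->Increment();
        Emit(text.substr(start, i - start), "1");
      }
    }
  }
};
REGISTER_MAPPER(UppercaseCounter);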
Setup
Tests run on a cluster of 1800 machines:
  4 GB of memory
  Dual-processor 2 GHz Xeons with Hyperthreading
  Dual 160 GB IDE disks
  Gigabit Ethernet per machine
  Bisection bandwidth approximately 100 Gbps
Benchmarks
Two benchmarks:
Grep
  Scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
  M = 15,000 (input split size about 64 MB)
  R = 1
Sort
  Sort 10^10 100-byte records (modeled after the TeraSort benchmark)
  M = 15,000 (input split size about 64 MB)
  R = 4,000
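As a quick sanity check on these parameters: 10^10 records × 100 bytes = 1 TB of input per run, and 1 TB / 15,000 map tasks ≈ 67 MB per split, consistent with the stated ~64 MB split size.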
Locality optimization helps:
  1800 machines read 1 TB of data at a peak of ~31 GB/s
  Without it, rack switches would limit the rate to 10 GB/s
Startup overhead is significant for short jobs:
  Total time about 150 seconds; roughly 1 minute of that is startup
Grep