Mass Data Processing Technology on Large Scale Clusters
Summer, 2007, Tsinghua University
All course material (slides, labs, etc.) is licensed under the Creative Commons Attribution 2.5 License. Many thanks to Aaron Kimball & Sierra Michels-Slettvet for their original version.
Some slides from: Jeff Dean & Sanjay Ghemawat, http://labs.google.com/papers/mapreduce.html
Motivation
200+ processors
200+ terabyte database
10^10 total clock cycles
0.1 second response time
5¢ average advertising revenue
From: www.cs.cmu.edu/~bryant/presentations/DISC-FCRC07.ppt
Motivation: Large-Scale Data Processing
Want to process lots of data (> 1 TB)
Want to parallelize across hundreds/thousands of CPUs
... Want to make this easy
"Google Earth uses 70.5 TB: 70 TB for the raw imagery and 500 GB for the index data."
From: http://googlesystem.blogspot.com/2006/09/how-much-data-does-google-store.html
MapReduce
Automatic parallelization & distribution
Fault-tolerant
Provides status and monitoring tools
Clean abstraction for programmers
Programming Model
Borrows from functional programming. Users implement an interface of two functions:

map (in_key, in_value) -> (out_key, intermediate_value) list

reduce (out_key, intermediate_value list) -> out_value list
map
Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key*value pairs: e.g., (filename, line).
map() produces one or more intermediate values along with an output key from the input.
reduce
After the map phase is over, all the intermediate values for a given output key are combined together into a list.
reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key).
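A minimal sketch of this grouping step in Python (illustrative only; the name shuffle is ours, not part of any MapReduce library):

from collections import defaultdict

def shuffle(intermediate_pairs):
    # The "barrier" between map and reduce: group every
    # intermediate value under its output key.
    groups = defaultdict(list)
    for key, value in intermediate_pairs:
        groups[key].append(value)
    return groups   # {out_key: [intermediate_value, ...]}

pairs = [("good", 1), ("weather", 1), ("good", 1)]
print(dict(shuffle(pairs)))   # {'good': [1, 1], 'weather': [1]}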
Architecture
[Diagram: input key*value pairs flow from data stores 1..n into parallel map tasks, each emitting intermediate (key, values...) pairs; a barrier aggregates intermediate values by output key; reduce tasks then produce the final values for key 1, key 2, key 3, ...]
Parallelism
map() functions run in parallel, creating different intermediate values from different input data sets.
reduce() functions also run in parallel, each working on a different output key.
All values are processed independently.
Bottleneck: the reduce phase can't start until the map phase is completely finished (see the sketch below).
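To make the barrier concrete, here is a small runnable sketch (ours, not Google's code) using the example pages from the slides that follow; Python's multiprocessing stands in for the cluster, and pool.map doubles as the barrier:

from collections import defaultdict
from multiprocessing import Pool

def map_task(document):
    # One map task: emit (word, 1) for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_task(item):
    # One reduce task: sum all intermediate values for one key.
    key, values = item
    return (key, sum(values))

if __name__ == "__main__":
    pages = ["the weather is good", "today is good",
             "good weather is good"]
    with Pool() as pool:
        # Parallel map phase. pool.map is itself the barrier:
        # it returns only after every map task has finished.
        map_outputs = pool.map(map_task, pages)
        # Aggregate intermediate values by output key.
        groups = defaultdict(list)
        for pairs in map_outputs:
            for key, value in pairs:
                groups[key].append(value)
        # Parallel reduce phase, one task per output key.
        print(dict(pool.map(reduce_task, sorted(groups.items()))))
        # {'good': 4, 'is': 3, 'the': 1, 'today': 1, 'weather': 2}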
Example: Count Word Occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
Example vs. Actual Source Code
The example is written in pseudo-code; the actual implementation is in C++, using a MapReduce library.
Bindings for Python and Java exist via interfaces.
True code is somewhat more involved (it defines how the input key/values are divided up and accessed, etc.).
Example
Page 1: the weather is good
Page 2: today is good
Page 3: good weather is good.
Map output
Worker 1: (the 1), (weather 1), (is 1), (good 1).
Worker 2: (today 1), (is 1), (good 1).
Worker 3: (good 1), (weather 1), (is 1), (good 1).
Reduce Input
Worker 1: (the 1)
Worker 2: (is 1), (is 1), (is 1)
Worker 3: (weather 1), (weather 1)
Worker 4: (today 1)
Worker 5: (good 1), (good 1), (good 1), (good 1)
Reduce Output
Worker 1: (the 1)
Worker 2: (is 3)
Worker 3: (weather 2)
Worker 4: (today 1)
Worker 5: (good 4)
Some Other Real Examples
Term frequencies through the whole Web repository
Count of URL access frequency
Reverse web-link graph (sketched below)
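As an illustration, here is a hypothetical map/reduce pair for the link-graph reversal (a sequential sketch; the names map_fn, reduce_fn, and the sample pages are ours):

from collections import defaultdict

def map_fn(source_url, outgoing_links):
    # map: for each link target found in a source page,
    # emit (target, source).
    for target in outgoing_links:
        yield (target, source_url)

def reduce_fn(target, sources):
    # reduce: concatenate the list of all pages linking to target.
    return (target, sorted(sources))

pages = {"a.html": ["b.html", "c.html"],
         "b.html": ["c.html"]}

intermediate = defaultdict(list)
for url, links in pages.items():
    for key, value in map_fn(url, links):
        intermediate[key].append(value)

print(sorted(reduce_fn(k, v) for k, v in intermediate.items()))
# [('b.html', ['a.html']), ('c.html', ['a.html', 'b.html'])]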
Implementation Overview
Typical cluster:
100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
Limited bisection bandwidth
Storage is on local IDE disks
GFS: distributed file system manages data (SOSP '03)
Job scheduling system: jobs made up of tasks; the scheduler assigns tasks to machines
The implementation is a C++ library linked into user programs
Architecture
[Figure: system architecture]

Execution
[Figure: execution overview]

Parallel Execution
[Figure: parallel execution of map and reduce tasks]
Task Granularity and Pipelining
Fine-granularity tasks: many more map tasks than machines
Minimizes time for fault recovery
Can pipeline shuffling with map execution
Better dynamic load balancing
Often use 200,000 map / 5,000 reduce tasks with 2,000 machines (see the arithmetic below)
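As a rough worked example (our arithmetic, not from the slides): 200,000 map tasks spread over 2,000 machines is about 100 map tasks per machine. When one machine fails, its ~100 tasks can be redistributed across the remaining 1,999 machines and re-executed in parallel, adding only on the order of one task's running time to the job instead of re-running that machine's entire share on a single node.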
Locality
The master program divvies up tasks based on the location of the data: it asks GFS for the locations of replicas of the input file blocks and tries to schedule map() tasks on the same machine as the physical file data, or at least on the same rack.
map() task inputs are divided into 64 MB blocks: the same size as Google File System chunks.
Without this, rack switches limit the read rate.
Effect: thousands of machines read input at local disk speed.
Fault Tolerance
The master detects worker failures:
Re-executes completed & in-progress map() tasks (completed map output lives on the failed machine's local disk, so it is lost)
Re-executes in-progress reduce() tasks
The master also notices when particular input key/values cause crashes in map(), and skips those values on re-execution.
Effect: can work around bugs in third-party libraries! (A sketch of this record-skipping idea follows.)
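A toy illustration of record skipping (a hypothetical local driver; the actual mechanism has workers report failing record offsets to the master, which tells the re-execution to skip them):

def run_map_with_skipping(map_fn, records, max_attempts=2):
    # Retry each record; after repeated crashes, skip it.
    output, skipped = [], []
    for record in records:
        for attempt in range(max_attempts):
            try:
                output.extend(list(map_fn(record)))
                break
            except Exception:
                if attempt == max_attempts - 1:
                    skipped.append(record)  # give up on this record

    return output, skipped

def fragile_map(record):
    if record == "bad":
        raise ValueError("crash!")
    return [(record, 1)]

print(run_map_with_skipping(fragile_map, ["a", "bad", "b"]))
# ([('a', 1), ('b', 1)], ['bad'])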
Fault Tolerance
On worker failure:
Detect failure via periodic heartbeats
Re-execute completed and in-progress map tasks
Re-execute in-progress reduce tasks
Task completion committed through master
On master failure:
Could handle, but don't yet (master failure unlikely)
Robust: lost 1600 of 1800 machines once, but finished fine
Optimizations
No reduce can start until the map phase is complete:
A single slow disk controller can rate-limit the whole process
Slow workers significantly lengthen completion time:
Other jobs consuming resources on the machine
Bad disks with soft errors transfer data very slowly
Weird things: processor caches disabled (!!)
The master therefore redundantly executes "slow-moving" map tasks and uses the result of whichever copy finishes first.
Why is it safe to redundantly execute map tasks? Wouldn't this mess up the total computation?
Optimizations
"Combiner" functions can run on the same machine as a mapper.
This causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth.
Under what conditions is it sound to use a combiner? (A sketch follows.)
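A minimal combiner sketch in Python (illustrative; the names are ours). It is sound here because addition is associative and commutative, so pre-aggregating a mapper's local output cannot change the final sums:

from collections import defaultdict

def combine(pairs):
    # Mini-reduce on one mapper's local output before the shuffle:
    # e.g. many local ("the", 1) pairs travel as one ("the", n).
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

mapper_output = [("good", 1), ("weather", 1), ("is", 1), ("good", 1)]
print(combine(mapper_output))
# [('good', 2), ('weather', 1), ('is', 1)]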
Refinement
Sorting guarantees within each reduce partition
Compression of intermediate data
Combiner: useful for saving network bandwidth
Local execution for debugging/testing
User-defined counters
Performance
Tests run on a cluster of 1800 machines:
4 GB of memory
Dual-processor 2 GHz Xeons with Hyperthreading
Dual 160 GB IDE disks
Gigabit Ethernet per machine
Bisection bandwidth approximately 100 Gbps
Two benchmarks:
MR_Grep: scan 10^10 100-byte records (10^10 × 100 bytes = 1 TB) to extract records matching a rare pattern (92K matching records)
MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)
MR_Grep
Locality optimization helps: 1800 machines read 1 TB of data at a peak of ~31 GB/s
Without this, rack switches would limit the rate to 10 GB/s
Startup overhead is significant for short jobs
MR_Sort
Backup tasks reduce job completion time significantly
The system deals well with failures
[Graphs: job progress over time for normal execution, with no backup tasks, and with 200 processes killed]
More and More MapReduce

MapReduce Programs in Google's Source Tree

Example uses:
distributed grep
distributed sort
web link-graph reversal
term-vector per host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation

Real MapReduce: rewrite of the production indexing system
MapReduce Conclusions
MapReduce has proven to be a useful abstraction
Greatly simplifies large-scale computations at Google
The functional programming paradigm can be applied to large-scale applications
Fun to use: focus on the problem, let the library deal with the messy details
Boggle & MapReduce
Discuss: Use MapReduce to find the highest possible Boggle score.
Questions and Answers!!!