Mass Data Processing Technology on Large Scale Clusters
Summer, 2007, Tsinghua University
All course material (slides, labs, etc.) is licensed under the Creative Commons Attribution 2.5 License. Many thanks to Aaron Kimball & Sierra Michels-Slettvet for their original version.
Some slides from: Jeff Dean & Sanjay Ghemawat, http://labs.google.com/papers/mapreduce.html
Motivation
200+ processors
200+ terabyte database
10^10 total clock cycles
0.1 second response time
5¢ average advertising revenue
From: www.cs.cmu.edu/~bryant/presentations/DISC-FCRC07.ppt
Motivation: Large-Scale Data Processing
Want to process lots of data (> 1 TB)
Want to parallelize across hundreds/thousands of CPUs
... Want to make this easy
"Google Earth uses 70.5 TB: 70 TB for the raw imagery and 500 GB for the index data."
From: http://googlesystem.blogspot.com/2006/09/how-much-data-does-google-store.html
MapReduce
Automatic parallelization & distribution
Fault-tolerant
Provides status and monitoring tools
Clean abstraction for programmers
Programming Model
Borrows from functional programming. Users implement an interface of two functions:

map (in_key, in_value) -> (out_key, intermediate_value) list

reduce (out_key, intermediate_value list) -> out_value list
map
Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key*value pairs: e.g., (filename, line).
map() produces one or more intermediate values along with an output key from the input.
reduce
After the map phase is over, all the intermediate values for a given output key are combined together into a list.
reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key).
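A minimal sketch of this grouping step in Python (illustrative only; the name shuffle is ours, not part of any MapReduce library):

from collections import defaultdict

def shuffle(intermediate_pairs):
    # The "barrier" between map and reduce: group every
    # intermediate value under its output key.
    groups = defaultdict(list)
    for key, value in intermediate_pairs:
        groups[key].append(value)
    return groups   # {out_key: [intermediate_value, ...]}

pairs = [("good", 1), ("weather", 1), ("good", 1)]
print(dict(shuffle(pairs)))   # {'good': [1, 1], 'weather': [1]}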
Architecture
[Diagram: input key*value pairs flow from data stores 1..n into parallel map tasks, each emitting intermediate (key, values...) pairs; a barrier aggregates intermediate values by output key; reduce tasks then produce the final values for key 1, key 2, key 3, ...]
Parallelism
map() functions run in parallel, creating different intermediate values from different input data sets.
reduce() functions also run in parallel, each working on a different output key.
All values are processed independently.
Bottleneck: the reduce phase can't start until the map phase is completely finished (see the sketch below).
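To make the barrier concrete, here is a small runnable sketch (ours, not Google's code) using the example pages from the slides that follow; Python's multiprocessing stands in for the cluster, and pool.map doubles as the barrier:

from collections import defaultdict
from multiprocessing import Pool

def map_task(document):
    # One map task: emit (word, 1) for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_task(item):
    # One reduce task: sum all intermediate values for one key.
    key, values = item
    return (key, sum(values))

if __name__ == "__main__":
    pages = ["the weather is good", "today is good",
             "good weather is good"]
    with Pool() as pool:
        # Parallel map phase. pool.map is itself the barrier:
        # it returns only after every map task has finished.
        map_outputs = pool.map(map_task, pages)
        # Aggregate intermediate values by output key.
        groups = defaultdict(list)
        for pairs in map_outputs:
            for key, value in pairs:
                groups[key].append(value)
        # Parallel reduce phase, one task per output key.
        print(dict(pool.map(reduce_task, sorted(groups.items()))))
        # {'good': 4, 'is': 3, 'the': 1, 'today': 1, 'weather': 2}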
Example: Count Word Occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
Example vs. Actual Source Code
The example is written in pseudo-code; the actual implementation is in C++, using a MapReduce library.
Bindings for Python and Java exist via interfaces.
True code is somewhat more involved (it defines how the input key/values are divided up and accessed, etc.).
Example
Page 1: the weather is good
Page 2: today is good
Page 3: good weather is good.
Map output
Worker 1: (the 1), (weather 1), (is 1), (good 1).
Worker 2: (today 1), (is 1), (good 1).
Worker 3: (good 1), (weather 1), (is 1), (good 1).
Reduce Input
Worker 1: (the 1)
Worker 2: (is 1), (is 1), (is 1)
Worker 3: (weather 1), (weather 1)
Worker 4: (today 1)
Worker 5: (good 1), (good 1), (good 1), (good 1)
Reduce Output
Worker 1: (the 1)
Worker 2: (is 3)
Worker 3: (weather 2)
Worker 4: (today 1)
Worker 5: (good 4)
Some Other Real Examples
Term frequencies through the whole Web repository
Count of URL access frequency
Reverse web-link graph (sketched below)
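As an illustration, here is a hypothetical map/reduce pair for the link-graph reversal (a sequential sketch; the names map_fn, reduce_fn, and the sample pages are ours):

from collections import defaultdict

def map_fn(source_url, outgoing_links):
    # map: for each link target found in a source page,
    # emit (target, source).
    for target in outgoing_links:
        yield (target, source_url)

def reduce_fn(target, sources):
    # reduce: concatenate the list of all pages linking to target.
    return (target, sorted(sources))

pages = {"a.html": ["b.html", "c.html"],
         "b.html": ["c.html"]}

intermediate = defaultdict(list)
for url, links in pages.items():
    for key, value in map_fn(url, links):
        intermediate[key].append(value)

print(sorted(reduce_fn(k, v) for k, v in intermediate.items()))
# [('b.html', ['a.html']), ('c.html', ['a.html', 'b.html'])]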
Implementation Overview
Typical cluster:
100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
Limited bisection bandwidth
Storage is on local IDE disks
GFS: distributed file system manages data (SOSP '03)
Job scheduling system: jobs made up of tasks; the scheduler assigns tasks to machines
The implementation is a C++ library linked into user programs
Architecture
[Figure: system architecture]

Execution
[Figure: execution overview]

Parallel Execution
[Figure: parallel execution of map and reduce tasks]
Task Granularity and Pipelining
Fine-granularity tasks: many more map tasks than machines
Minimizes time for fault recovery
Can pipeline shuffling with map execution
Better dynamic load balancing
Often use 200,000 map / 5,000 reduce tasks with 2,000 machines (see the arithmetic below)
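As a rough worked example (our arithmetic, not from the slides): 200,000 map tasks spread over 2,000 machines is about 100 map tasks per machine. When one machine fails, its ~100 tasks can be redistributed across the remaining 1,999 machines and re-executed in parallel, adding only on the order of one task's running time to the job instead of re-running that machine's entire share on a single node.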
Locality
The master program divvies up tasks based on the location of the data: it asks GFS for the locations of replicas of the input file blocks and tries to schedule map() tasks on the same machine as the physical file data, or at least on the same rack.
map() task inputs are divided into 64 MB blocks: the same size as Google File System chunks.
Without this, rack switches limit the read rate.
Effect: thousands of machines read input at local disk speed.
Fault Tolerance
The master detects worker failures:
Re-executes completed & in-progress map() tasks (completed map output lives on the failed machine's local disk, so it is lost)
Re-executes in-progress reduce() tasks
The master also notices when particular input key/values cause crashes in map(), and skips those values on re-execution.
Effect: can work around bugs in third-party libraries! (A sketch of this record-skipping idea follows.)
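A toy illustration of record skipping (a hypothetical local driver; the actual mechanism has workers report failing record offsets to the master, which tells the re-execution to skip them):

def run_map_with_skipping(map_fn, records, max_attempts=2):
    # Retry each record; after repeated crashes, skip it.
    output, skipped = [], []
    for record in records:
        for attempt in range(max_attempts):
            try:
                output.extend(list(map_fn(record)))
                break
            except Exception:
                if attempt == max_attempts - 1:
                    skipped.append(record)  # give up on this record

    return output, skipped

def fragile_map(record):
    if record == "bad":
        raise ValueError("crash!")
    return [(record, 1)]

print(run_map_with_skipping(fragile_map, ["a", "bad", "b"]))
# ([('a', 1), ('b', 1)], ['bad'])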
Fault Tolerance
On worker failure:
Detect failure via periodic heartbeats
Re-execute completed and in-progress map tasks
Re-execute in-progress reduce tasks
Task completion committed through master
On master failure:
Could handle, but don't yet (master failure unlikely)
Robust: lost 1600 of 1800 machines once, but finished fine
Optimizations
No reduce can start until the map phase is complete:
A single slow disk controller can rate-limit the whole process
Slow workers significantly lengthen completion time:
Other jobs consuming resources on the machine
Bad disks with soft errors transfer data very slowly
Weird things: processor caches disabled (!!)
The master therefore redundantly executes "slow-moving" map tasks and uses the result of whichever copy finishes first.
Why is it safe to redundantly execute map tasks? Wouldn't this mess up the total computation?
Optimizations
"Combiner" functions can run on the same machine as a mapper.
This causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth.
Under what conditions is it sound to use a combiner? (A sketch follows.)
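A minimal combiner sketch in Python (illustrative; the names are ours). It is sound here because addition is associative and commutative, so pre-aggregating a mapper's local output cannot change the final sums:

from collections import defaultdict

def combine(pairs):
    # Mini-reduce on one mapper's local output before the shuffle:
    # e.g. many local ("the", 1) pairs travel as one ("the", n).
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())

mapper_output = [("good", 1), ("weather", 1), ("is", 1), ("good", 1)]
print(combine(mapper_output))
# [('good', 2), ('weather', 1), ('is', 1)]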
Refinement
Sorting guarantees within each reduce partition
Compression of intermediate data
Combiner: useful for saving network bandwidth
Local execution for debugging/testing
User-defined counters
Performance
Tests run on a cluster of 1800 machines:
4 GB of memory
Dual-processor 2 GHz Xeons with Hyperthreading
Dual 160 GB IDE disks
Gigabit Ethernet per machine
Bisection bandwidth approximately 100 Gbps
Two benchmarks:
MR_Grep: scan 10^10 100-byte records (10^10 × 100 bytes = 1 TB) to extract records matching a rare pattern (92K matching records)
MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)
MR_Grep
Locality optimization helps: 1800 machines read 1 TB of data at a peak of ~31 GB/s
Without this, rack switches would limit the rate to 10 GB/s
Startup overhead is significant for short jobs
MR_Sort
Backup tasks reduce job completion time significantly
The system deals well with failures
[Graphs: job progress over time for normal execution, with no backup tasks, and with 200 processes killed]
More and More MapReduce

MapReduce Programs in Google's Source Tree

Example uses:
distributed grep
distributed sort
web link-graph reversal
term-vector per host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation

Real MapReduce: rewrite of the production indexing system
MapReduce Conclusions
MapReduce has proven to be a useful abstraction
Greatly simplifies large-scale computations at Google
The functional programming paradigm can be applied to large-scale applications
Fun to use: focus on the problem, let the library deal with the messy details
Boggle & MapReduce
Discuss: Use MapReduce to find the highest possible Boggle score.
Questions and Answers!!!