Mass Data Processing Technology on Large Scale Clusters

Summer 2007, Tsinghua University

All course material (slides, labs, etc.) is licensed under the Creative Commons Attribution 2.5 License. Many thanks to Aaron Kimball & Sierra Michels-Slettvet for their original version.

Page 2:

Some slides from: Jeff Dean, Sanjay Ghemawat, http://labs.google.com/papers/mapreduce.html

Page 3:

Motivation

200+ processors
200+ terabyte database
10^10 total clock cycles
0.1 second response time
5¢ average advertising revenue

From: www.cs.cmu.edu/~bryant/presentations/DISC-FCRC07.ppt

Page 4:

Motivation: Large-Scale Data Processing

Want to process lots of data (> 1 TB)

Want to parallelize across hundreds/thousands of CPUs

… Want to make this easy


"Google Earth uses 70.5 TB: 70 TB for the raw imagery and 500 GB for the index data."

From: http://googlesystem.blogspot.com/2006/09/how-much-data-does-google-store.html

Page 5:

MapReduce

Automatic parallelization & distribution
Fault-tolerant
Provides status and monitoring tools
Clean abstraction for programmers

Page 6:

Programming Model

Borrows from functional programming. Users implement an interface of two functions:

map(in_key, in_value) -> (out_key, intermediate_value) list

reduce(out_key, intermediate_value list) -> out_value list
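As a minimal sketch, the same two-function interface rendered as Python type signatures (illustrative only; the actual library is C++, and the names MapFn and ReduceFn are placeholders, not the library's API):

from typing import Callable, List, Tuple

# map: (in_key, in_value) -> list of (out_key, intermediate_value)
MapFn = Callable[[str, str], List[Tuple[str, str]]]

# reduce: (out_key, list of intermediate_value) -> list of out_value
ReduceFn = Callable[[str, List[str]], List[str]]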

Page 7:

map

Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key*value pairs: e.g., (filename, line).

map() produces one or more intermediate values along with an output key from the input.

Page 8:

reduce

After the map phase is over, all the intermediate values for a given output key are combined together into a list.

reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key).

Page 9:

Architecture

[Dataflow diagram: input key*value pairs from data stores 1..n feed parallel map() tasks, each emitting (key 1, values...), (key 2, values...), (key 3, values...); a barrier aggregates intermediate values by output key; parallel reduce() tasks then turn each (key, intermediate values) group into the final values for that key.]

Page 10:

Parallelism

map() functions run in parallel, creating different intermediate values from different input data sets.

reduce() functions also run in parallel, each working on a different output key.

All values are processed independently.

Bottleneck: the reduce phase can't start until the map phase is completely finished.
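A single-machine sketch of this structure, assuming a hypothetical run_job helper with threads standing in for cluster workers; it makes the barrier between the two phases explicit:

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def run_job(inputs, map_fn, reduce_fn):
    # inputs: iterable of (in_key, in_value) pairs
    with ThreadPoolExecutor() as pool:
        # Map phase: every input record is processed independently, in parallel.
        map_outputs = pool.map(lambda kv: list(map_fn(*kv)), inputs)

        # == Barrier ==: no reduce may start yet, because any still-running
        # map task might emit values for any output key.
        groups = defaultdict(list)
        for pairs in map_outputs:          # consuming this iterator waits for the maps
            for out_key, value in pairs:
                groups[out_key].append(value)

        # Reduce phase: each output key is now reduced independently, in parallel.
        return dict(pool.map(
            lambda kv: (kv[0], list(reduce_fn(kv[0], kv[1]))), groups.items()))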


Page 11:

Example: Count Word Occurrences

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
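The same program as runnable Python, a sketch with an inline sequential driver rather than the real library; on the three example pages from the slides that follow, it reproduces the counts shown on the Reduce Output slide:

from collections import defaultdict

def word_count_map(input_key, input_value):
    # input_key: document name; input_value: document contents
    for w in input_value.split():
        yield (w, "1")

def word_count_reduce(output_key, intermediate_values):
    # output_key: a word; intermediate_values: a list of counts
    yield str(sum(int(v) for v in intermediate_values))

pages = {"Page 1": "the weather is good",
         "Page 2": "today is good",
         "Page 3": "good weather is good"}

groups = defaultdict(list)                    # the shuffle/barrier step
for name, text in pages.items():
    for word, count in word_count_map(name, text):
        groups[word].append(count)

for word in sorted(groups):
    print(word, *word_count_reduce(word, groups[word]))
# good 4, is 3, the 1, today 1, weather 2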


Page 12:

Example vs. Actual Source Code

The example is written in pseudo-code. The actual implementation is in C++, using a MapReduce library. Bindings for Python and Java exist via interfaces. True code is somewhat more involved (it defines how the input key/values are divided up and accessed, etc.).

Page 13:

Example

Page 1: the weather is good
Page 2: today is good
Page 3: good weather is good

Page 14:

Map Output

Worker 1: (the 1), (weather 1), (is 1), (good 1)
Worker 2: (today 1), (is 1), (good 1)
Worker 3: (good 1), (weather 1), (is 1), (good 1)

Page 15:

Reduce Input

Worker 1: (the 1)
Worker 2: (is 1), (is 1), (is 1)
Worker 3: (weather 1), (weather 1)
Worker 4: (today 1)
Worker 5: (good 1), (good 1), (good 1), (good 1)

Page 16:

Reduce Output

Worker 1: (the 1)
Worker 2: (is 3)
Worker 3: (weather 2)
Worker 4: (today 1)
Worker 5: (good 4)

Page 17:

Some Other Real Examples

Term frequencies through the whole Web repository
Count of URL access frequency
Reverse web-link graph (sketched below)
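The reverse web-link graph, for instance, fits the model in a few lines: map emits a (target, source) pair for every link on a page, and reduce collects all sources pointing at a target. A sketch (reverse_links_map/reverse_links_reduce are illustrative names, and the href regex is a naive stand-in for a real link extractor):

import re

LINK = re.compile(r'href="([^"]+)"')

def reverse_links_map(source_url, page_html):
    # Emit (target, source) for every outgoing link on the page.
    for target in LINK.findall(page_html):
        yield (target, source_url)

def reverse_links_reduce(target_url, sources):
    # All pages that link to target_url: one row of the reversed graph.
    yield (target_url, sorted(sources))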


Page 18:

Implementation Overview

Typical cluster:
100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
Limited bisection bandwidth
Storage is on local IDE disks
GFS: distributed file system manages data (SOSP '03)
Job scheduling system: jobs made up of tasks; scheduler assigns tasks to machines

Implementation is a C++ library linked into user programs.

Page 19:

Architecture


Page 20:

Execution


Page 21:

Parallel Execution


Page 22:

Task Granularity and Pipelining

Fine-granularity tasks: many more map tasks than machines
Minimizes time for fault recovery
Can pipeline shuffling with map execution
Better dynamic load balancing

Often use 200,000 map tasks and 5,000 reduce tasks with 2,000 machines (about 100 map tasks per machine, so a failed machine's work can be redistributed in small pieces).

Page 23:

Locality

The master program divvies up tasks based on the location of the data: it asks GFS for the locations of the replicas of the input file blocks and tries to place map() tasks on the same machine as the physical file data, or at least on the same rack.

map() task inputs are divided into 64 MB blocks: the same size as Google File System chunks.

Effect: thousands of machines read input at local disk speed. Without this, rack switches would limit the read rate.
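A sketch of that preference order, assuming a hypothetical pick_worker helper (the real master's policy also weighs machine load and task state):

def pick_worker(replica_hosts, idle_workers, rack_of):
    """Choose a worker for a map task over a block stored on replica_hosts."""
    # 1. Data-local: a worker that already holds a replica of the block.
    for w in idle_workers:
        if w in replica_hosts:
            return w
    # 2. Rack-local: a worker in the same rack as some replica.
    replica_racks = {rack_of[h] for h in replica_hosts}
    for w in idle_workers:
        if rack_of[w] in replica_racks:
            return w
    # 3. Anywhere: this read will have to cross a rack switch.
    return idle_workers[0]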

Page 24:

Fault Tolerance

The master detects worker failures:
Re-executes completed & in-progress map() tasks
Re-executes in-progress reduce() tasks

The master notices when particular input key/values cause crashes in map(), and skips those values on re-execution.
Effect: can work around bugs in third-party libraries!
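One way to picture the skipping mechanism, as a single-process sketch; in the real system the worker signals the master, which tells re-executions of the task which records to skip:

bad_records = set()   # in the real system, the master tracks these per task

def guarded_map(map_fn, records):
    for key, value in records:
        if key in bad_records:
            continue                 # skip a record that crashed a previous attempt
        try:
            yield from map_fn(key, value)
        except Exception:
            bad_records.add(key)     # "report" the offender, then fail the task
            raise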


Page 25:

Fault Tolerance

On worker failure:
Detect failure via periodic heartbeats
Re-execute completed and in-progress map tasks (completed map output lives on the failed worker's local disk, so it is lost with the machine)
Re-execute in-progress reduce tasks
Task completion committed through the master

On master failure:
Could handle, but don't yet (master failure unlikely)

Robust: lost 1600 of 1800 machines once, but finished fine

Page 26:

Optimizations

No reduce can start until the map phase is complete: a single slow disk controller can rate-limit the whole process.

Slow workers significantly lengthen completion time:
Other jobs consuming resources on the machine
Bad disks with soft errors that transfer data very slowly
Weird things: processor caches disabled (!!)

Solution: the master redundantly executes "slow-moving" map tasks and uses the results of whichever copy finishes first.

Why is it safe to redundantly execute map tasks? Wouldn't this mess up the total computation? It is safe because map() is a deterministic function of its input with no side effects, so duplicate executions produce identical output, and task completion is committed atomically through the master.

Page 27:

Optimizations

"Combiner" functions can run on the same machine as a mapper. This causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth.


Under what conditions is it sound to use a combiner?
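For word count the combiner is sound because the reduce operation (integer addition) is commutative and associative, so partial sums can be merged in any order. A sketch of a map task with a built-in combiner (map_with_combiner is an illustrative name):

from collections import Counter

def map_with_combiner(input_key, input_value):
    # Pre-aggregate locally before anything crosses the network.
    counts = Counter(input_value.split())     # the local mini-reduce
    for word, n in counts.items():
        yield (word, str(n))                  # one pair per distinct word,
                                              # instead of one per occurrence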

Page 28:

Refinement

Sorting guarantees within each reduce partition
Compression of intermediate data
Combiner: useful for saving network bandwidth
Local execution for debugging/testing
User-defined counters

Page 29:

Performance

Tests run on a cluster of 1800 machines:
4 GB of memory
Dual-processor 2 GHz Xeons with Hyper-Threading
Dual 160 GB IDE disks
Gigabit Ethernet per machine
Bisection bandwidth approximately 100 Gbps

Two benchmarks:
MR_Grep: scan 10^10 100-byte records (~1 TB) to extract records matching a rare pattern (92K matching records)
MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)

Page 30:

MR_Grep

The locality optimization helps: 1800 machines read 1 TB of data at a peak of ~31 GB/s. Without it, rack switches would limit the rate to 10 GB/s.

Startup overhead is significant for short jobs.

Page 31:

MR_Sort

Backup tasks reduce job completion time significantly.
The system deals well with failures.

[Graphs compare three runs: normal, with no backup tasks, and with 200 processes killed mid-job.]

Page 32:

More and more MapReduce

MapReduce Programs in Google Source Tree

Example uses:
distributed grep
distributed sort
web link-graph reversal
term-vector per host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation

Page 33:

Real MapReduce: Rewrite of the Production Indexing System

Page 34:

MapReduce Conclusions

MapReduce has proven to be a useful abstraction
Greatly simplifies large-scale computations at Google
The functional programming paradigm can be applied to large-scale applications
Fun to use: focus on the problem, let the library deal with messy details

Page 35:

Boggle & MapReduce


Discuss: Use MapReduce to find the highest possible Boggle score.

Page 36:

Questions and Answers!