Transcript of "Distributed Computations: MapReduce / Dryad" (MapReduce slides are adapted from Jeff Dean's)

Page 1

Distributed Computations

MapReduce / Dryad

MapReduce slides are adapted from Jeff Dean’s

Page 2

Why distributed computations?

• How long to sort 1 TB on one computer?
  – One computer can read ~30 MB/s from disk
  – Takes ~2 days!

• Google indexes 20 billion+ web pages
  – 20 * 10^9 pages * 20 KB/page = 400 TB

• The Large Hadron Collider is expected to produce 15 PB every year!
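
A quick back-of-the-envelope check of these figures in Python (the ~30 MB/s read rate and 20 KB/page are the slide's assumptions, not measurements):

    TB = 10 ** 12
    read_rate = 30 * 10 ** 6            # ~30 MB/s sequential read from one disk

    one_pass_hours = TB / read_rate / 3600
    print(f"one read pass over 1 TB: ~{one_pass_hours:.1f} hours")
    # An out-of-core sort needs several read + write passes over the data,
    # which is how a single machine ends up in the "~2 days" range.

    pages = 20 * 10 ** 9                # 20 billion+ web pages
    index_bytes = pages * 20 * 10 ** 3  # 20 KB per page
    print(f"index size: ~{index_bytes / TB:.0f} TB")   # ~400 TB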

Page 3

Solution: use many nodes!

• Cluster computing (cloud computing)
  – Thousands of PCs connected by high-speed LANs

• Grid computing
  – Hundreds of supercomputers connected by a high-speed network

• 1000 nodes potentially give a 1000x speedup

Page 4

Distributed computations are difficult to program

• Sending data to/from nodes

• Coordinating among nodes

• Recovering from node failure

• Optimizing for locality

• Debugging

These challenges are the same for all problems.

Page 5

MapReduce/Dryad

• Both are case studies of distributed systems
• Both address typical challenges of distributed systems
  – Abstraction
  – Fault tolerance
  – Concurrency
  – Consistency
  – …

Page 6

MapReduce

• A programming model for large-scale computations
  – Process large amounts of input, produce output
  – No side-effects or persistent state (unlike a file system)

• MapReduce is implemented as a runtime library:
  – automatic parallelization
  – load balancing
  – locality optimization
  – handling of machine failures

Page 7

MapReduce design

• Input data is partitioned into M splits
• Map: extract information from each split
  – Each Map produces R partitions
• Shuffle and sort
  – Bring the matching partitions from all M maps to the same reducer
• Reduce: aggregate, summarize, filter or transform
• Output is in R result files

Page 8

More specifically…

• Programmer specifies two methods:
  – map(k, v) → <k', v'>*
  – reduce(k', <v'>*) → <k', v'>*

• All v' with same k' are reduced together, in order.

• Usually also specify:
  – partition(k', total partitions) -> partition for k'
    • often a simple hash of the key
    • allows reduce operations for different k' to be parallelized
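
A minimal single-process Python sketch of this interface; the names here (run_mapreduce, map_fn, reduce_fn, partition_fn) are invented for illustration and are not the actual library API:

    from collections import defaultdict
    from typing import Callable, Iterable, Iterator, Tuple

    KV = Tuple[str, str]

    def run_mapreduce(records: Iterable[KV],
                      map_fn: Callable[[str, str], Iterator[KV]],
                      reduce_fn: Callable[[str, Iterable[str]], Iterator[KV]],
                      partition_fn: Callable[[str, int], int],
                      R: int):
        """Single-process simulation: map -> partition -> sort -> reduce."""
        # Map phase: each intermediate key k' is routed to one of R partitions.
        partitions = [defaultdict(list) for _ in range(R)]
        for k, v in records:
            for k2, v2 in map_fn(k, v):
                partitions[partition_fn(k2, R)][k2].append(v2)

        # Reduce phase: within each partition, keys are processed in sorted
        # order and all values with the same k' are reduced together.
        outputs = []
        for part in partitions:
            out = []
            for k2 in sorted(part):
                out.extend(reduce_fn(k2, part[k2]))
            outputs.append(out)          # one "result file" per partition
        return outputs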

Page 9

Example: Count word frequencies in web pages

• Input is files with one doc per record

• Map parses documents into words
  – key = document URL
  – value = document contents

• Output of map:

  input:  “doc1”, “to be or not to be”
  output: “to”, “1”
          “be”, “1”
          “or”, “1”
          …

Page 10

Example: word frequencies

• Reduce: computes sum for a key

  key = “be”,  values = “1”, “1”  →  “2”
  key = “not”, values = “1”       →  “1”
  key = “or”,  values = “1”       →  “1”
  key = “to”,  values = “1”, “1”  →  “2”

• Output of reduce saved:

  “be”, “2”
  “not”, “1”
  “or”, “1”
  “to”, “2”

Page 11

Example: Pseudo-code

Map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce(String key, Iterator intermediate_values):
  // key: a word, same for input and output
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
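
The same example as runnable Python, reusing the run_mapreduce sketch from page 8 (still a toy, single-process version, not the real system):

    def word_count_map(doc_name, contents):
        # doc_name: document name; contents: document contents
        for w in contents.split():
            yield (w, "1")

    def word_count_reduce(word, values):
        # word: an intermediate key; values: all counts emitted for that word
        yield (word, str(sum(int(v) for v in values)))

    docs = [("doc1", "to be or not to be"), ("doc234", "do not be silly")]
    result = run_mapreduce(docs, word_count_map, word_count_reduce,
                           lambda k, R: hash(k) % R, R=2)
    for r, part in enumerate(result):
        print(f"result file {r}: {part}")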

Page 12

MapReduce is widely applicable

• Distributed grep

• Document clustering

• Web link graph reversal

• Detecting approx. duplicate web pages

• …

Page 13

MapReduce implementation

• Input data is partitioned into M splits
• Map: extract information from each split
  – Each Map produces R partitions
• Shuffle and sort
  – Bring the matching partitions from all M maps to the same reducer
• Reduce: aggregate, summarize, filter or transform
• Output is in R result files

Page 14

MapReduce scheduling

• One master, many workers
  – Input data split into M map tasks (e.g. 64 MB)
  – R reduce tasks
  – Tasks are assigned to workers dynamically
  – Often: M = 200,000; R = 4,000; workers = 2,000

Page 15

MapReduce scheduling

• Master assigns a map task to a free worker
  – Prefers “close-by” workers when assigning a task
  – Worker reads task input (often from local disk!)
  – Worker produces R local files containing intermediate k/v pairs

• Master assigns a reduce task to a free worker
  – Worker reads intermediate k/v pairs from map workers
  – Worker sorts & applies user’s Reduce op to produce the output

Page 16

Parallel MapReduce

[Figure: input data is split across parallel Map tasks; each Map’s output is shuffled to the Reduce tasks, which write the partitioned output. A master coordinates all Map and Reduce workers.]

Page 17

WordCount Internals

• Input data is split into M map jobs
• Each map job generates R local partitions

  “doc1”, “to be or not to be”
    map output: “to”, “1”  “be”, “1”  “or”, “1”  “not”, “1”  “to”, “1”
    R local partitions (each key routed by hash(key) % R, e.g. hash(“to”) % R):
      “to”, “1”, “1”
      “be”, “1”   “not”, “1”   “or”, “1”

  “doc234”, “do not be silly”
    map output: “do”, “1”  “not”, “1”  “be”, “1”  “silly”, “1”
    R local partitions:
      “do”, “1”
      “be”, “1”   “not”, “1”
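
A toy illustration of this map-side bucketing: each map task keeps R local partitions and routes every word with a hash of the key, as in the hash(“to”) % R annotation above (Python’s built-in hash stands in for whatever hash function the real system uses):

    from collections import defaultdict

    R = 2  # number of reduce partitions (small, for illustration)

    def map_side_partitions(doc_name, contents):
        # Each map task buckets its own output into R local partitions.
        local = [defaultdict(list) for _ in range(R)]
        for w in contents.split():
            local[hash(w) % R][w].append("1")
        return [dict(p) for p in local]

    for name, text in [("doc1", "to be or not to be"),
                       ("doc234", "do not be silly")]:
        print(name, map_side_partitions(name, text))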

Page 18

WordCount Internals

• Shuffle brings the same partitions to the same reducer

  Map-side local partitions:
    from doc1:   “to”, “1”, “1”   /   “be”, “1”   “not”, “1”   “or”, “1”
    from doc234: “do”, “1”        /   “be”, “1”   “not”, “1”

  After the shuffle, at the reducers:
    “to”, “1”, “1”    “do”, “1”
    “be”, “1”, “1”    “not”, “1”, “1”    “or”, “1”

Page 19

WordCount Internals

• Reduce aggregates sorted key/value pairs

  “be”, “1”, “1”    →  “be”, “2”
  “do”, “1”         →  “do”, “1”
  “not”, “1”, “1”   →  “not”, “2”
  “or”, “1”         →  “or”, “1”
  “to”, “1”, “1”    →  “to”, “2”

Page 20

The importance of the partition function

• partition(k', total partitions) -> partition for k'
  – e.g. hash(k') % R

• What is the partition function for sort?
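
For sort, a plain hash scatters the keys, so the usual answer is a range partition: reducer i gets a contiguous key range (with split points chosen, in practice, by sampling the input), so concatenating the R output files in order yields a globally sorted result. A small sketch with hypothetical split points:

    import bisect

    def range_partition(key, split_points):
        # Reducer i receives every key that falls between split_points[i-1]
        # and split_points[i]; reducer outputs are therefore globally ordered.
        return bisect.bisect_right(split_points, key)

    splits = ["g", "n", "t"]          # hypothetical split points for R = 4
    for k in ["apple", "horse", "melon", "zebra"]:
        print(k, "-> partition", range_partition(k, splits))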

Page 21

Load Balance and Pipelining

• Fine-granularity tasks: many more map tasks than machines
  – Minimizes time for fault recovery
  – Can pipeline shuffling with map execution
  – Better dynamic load balancing

• Often use 200,000 map / 5,000 reduce tasks with 2,000 machines

Page 22

Fault tolerance via re-execution

On worker failure:
• Re-execute completed and in-progress map tasks (their output lives on the failed worker’s local disk)
• Re-execute in-progress reduce tasks
• Task completion is committed through the master

On master failure:
• State is checkpointed to GFS: a new master recovers & continues
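
A toy illustration of the re-execution idea (not the real master protocol; failures are simulated randomly):

    import random

    def run_on_worker(task, worker):
        # Stand-in for dispatching a task to a remote worker.
        if random.random() < 0.3:
            raise RuntimeError(f"{worker} failed")
        return f"{task} done by {worker}"

    def run_with_reexecution(task, workers):
        # On worker failure, the master simply re-executes the task on
        # another free worker; completion is committed only once.
        for worker in workers:
            try:
                return run_on_worker(task, worker)
            except RuntimeError:
                continue
        raise RuntimeError(f"no worker could finish {task}")

    print(run_with_reexecution("map-17", ["w1", "w2", "w3", "w4"]))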

Page 23

Avoid stragglers using backup tasks

• Slow workers significantly lengthen completion time
  – Other jobs consuming resources on the machine
  – Bad disks with soft errors transfer data very slowly
  – Weird things: processor caches disabled (!!)

• Solution: near the end of the phase, spawn backup copies of tasks
  – Whichever one finishes first "wins"

• Effect: Dramatically shortens job completion time
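
A minimal sketch of the backup-task idea using Python threads (the task and its timings are made up; the slow copy stands in for a straggler):

    import concurrent.futures as cf
    import random, time

    def sort_partition(copy_id):
        # One copy may land on a "slow" machine (the straggler).
        time.sleep(random.choice([0.1, 2.0]))
        return f"partition sorted by copy {copy_id}"

    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        # Near the end of the phase, launch a backup copy of the same task;
        # whichever copy finishes first "wins" and the other is ignored.
        copies = [pool.submit(sort_partition, i) for i in range(2)]
        winner = next(cf.as_completed(copies))
        print(winner.result())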

Page 24

MapReduce Sort Performance

Page 25

Dryad

• Similar goals to MapReduce, but a different design

• Computations are expressed as a graph
  – Vertices are computations
  – Edges are communication channels
  – Each vertex has several input and output edges

Page 26

WordCount in Dryad

[Figure: the word-count computation as a Dryad graph, built from Count, Distribute, MergeSort, and Count vertices operating on (Word, n) records.]

Page 27

Dryad Implementation

• Job manager (master)
  – Keeps execution records for each vertex
  – When all of its inputs are available, a vertex becomes runnable

• Daemon
  – Creates processes to run vertices (workers)

• Channels
  – Files, TCP connections, pipes, shared memory

• Copes with stragglers via re-execution
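
A toy sketch of the "runnable when all inputs are available" rule over a small DAG; the vertex names are invented and only loosely mirror the word-count graph on page 26, and this is not Dryad's actual API:

    from collections import defaultdict, deque

    # Vertices are computations; the edge lists are their output channels.
    edges = {
        "Count:1":      ["Distribute:1"],
        "Count:2":      ["Distribute:2"],
        "Distribute:1": ["MergeSort:1", "MergeSort:2"],
        "Distribute:2": ["MergeSort:1", "MergeSort:2"],
        "MergeSort:1":  ["CountTotal:1"],
        "MergeSort:2":  ["CountTotal:2"],
        "CountTotal:1": [],
        "CountTotal:2": [],
    }

    def run_order(edges):
        # A vertex becomes runnable once every upstream vertex has finished.
        waiting_on = defaultdict(int)
        for v, outs in edges.items():
            waiting_on.setdefault(v, 0)
            for w in outs:
                waiting_on[w] += 1
        runnable = deque(v for v, n in waiting_on.items() if n == 0)
        order = []
        while runnable:
            v = runnable.popleft()
            order.append(v)              # the job manager hands v to a daemon
            for w in edges[v]:
                waiting_on[w] -= 1
                if waiting_on[w] == 0:
                    runnable.append(w)
        return order

    print(run_order(edges))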

Page 28

MapReduce vs. Dryad

• MapReduce:
  – Chain multiple MR jobs into a pipeline

• Dryad:
  – Express the computation as an arbitrary DAG

• Minor differences?
  – A Dryad vertex takes an arbitrary # of input/output channels
  – Map takes one input(?), generates R outputs
  – Dryad channels can be TCP, pipes, files, etc.
  – Map outputs are in R files