Large Scale Data Analysis with Map/Reduce, part I
Marin Dimitrov
(technology watch #1)
Feb, 2010
Contents
• Map/Reduce
• Dryad
• Sector/Sphere
• Open source M/R frameworks & tools
– Hadoop (Yahoo/Apache)
– Cloud MapReduce (Accenture)
– Elastic MapReduce (Hadoop on AWS)
– MR.Flow
• Some M/R algorithms
– Graph algorithms, text indexing & retrieval
Large Scale Data Analysis (Map/Reduce), part I #2
Contents
Part I
Distributed computing frameworks
Scalability & Parallelisation
• Scalability approaches
– Scale up (vertical scaling)
• Only one direction of improvement (bigger box)
– Scale out (horizontal scaling)
• Two directions – add more nodes + scale up each node
• Can achieve 4x the performance of a similarly priced scale-up system (ref?)
– Hybrid (“scale out in a box”)
• Not suitable for parallelisation:
– Algorithms with state
– Dependencies from one iteration to another (recurrence, induction)
Parallelisation approaches
• Parallelisation approaches
– Task decomposition
• Distribute coarse-grained (synchronisation wise) and computationally expensive tasks (otherwise too much coordination/management overhead)
• Dependencies - execution order vs. data dependencies
• Move the data to the processing (when needed)
– Data decomposition
• Each parallel task works with a data partition assigned to it (no sharing)
• Data has regular structure, i.e. chunks expected to need the same amount of processing time
• Two criteria: granularity (size of chunk) and shape (data exchange between chunk neighbours)
• Move the processing to the data
Amdahl’s law
• Impossible to achieve linear speedup
• Maximum speedup is always bounded by the overhead for parallelisation and by the serial processing part
• Amdahl’s law
– max_speedup = 1 / ((1 − P) + P/N)
– P: proportion of the program that can be parallelised (1 − P remains serial or overhead)
– N: number of processors / parallel nodes
– Example: P = 75% (i.e. 25% serial or overhead):
N (parallel nodes): 2    | 4    | 8    | 16   | 32   | 1024 | 64K
Max speedup:        1.60 | 2.29 | 2.91 | 3.37 | 3.66 | 3.99 | 3.99
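The bound is easy to verify numerically; the sketch below (Python, not part of the original deck) reproduces the speedup table above:

```python
def max_speedup(p, n):
    """Amdahl's law: maximum speedup for parallel fraction p on n nodes."""
    return 1.0 / ((1.0 - p) + p / n)

# P = 75% parallelisable, as in the example above
for n in (2, 4, 8, 16, 32, 1024):
    print(n, round(max_speedup(0.75, n), 2))
```

Note how the speedup saturates near 1/(1 − P) = 4 no matter how many nodes are added.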
Map/Reduce
• Google (2005), US patent (2010)
• General idea - co-locate data with computation nodes
– Data decomposition (parallelisation) – no data/order dependencies between tasks (except the Map-to-Reduce phase)
– Try to utilise data locality (bandwidth is $$$)
– Implicit data flow (higher abstraction level than MPI)
– Partial failure handling (failed map/reduce tasks are re-scheduled)
• Structure
– Map – for each input (Ki, Vi) produce zero or more output pairs (Km, Vm)
– Combine – optional intermediate aggregation (less Map-to-Reduce data transfer)
– Reduce – for input pair (Km, list(V1, V2, …, Vn)) produce zero or more output pairs (Kr, Vr)
Map/Reduce (2)
(C) Jimmy Lin
Map/Reduce - examples
• In other words…
– Map = partitioning of the data (compute part of a problem across several servers)
– Reduce = processing of the partitions (aggregate the partial results from all servers into a single resultset)
– The M/R framework takes care of grouping of partitions by key
• Example: word count
– Map (1 task per document in the collection)
• In: docx
• Out: (term1, count1,x), (term2, count2,x), …
– Reduce (1 task per term in the collection)
• In: (term1, <count1,x, count1,y, …, count1,z>)
• Out: (term1, SUM(count1,x, count1,y, …, count1,z))
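A minimal local simulation of the word-count job above (a Python sketch, not from the deck; a plain dict stands in for the framework's grouping of partitions by key):

```python
from collections import defaultdict

def map_word_count(doc_id, text):
    # Map: emit one (term, 1) pair per token in the document
    return [(term, 1) for term in text.lower().split()]

def reduce_word_count(term, counts):
    # Reduce: sum the partial counts for a single term
    return (term, sum(counts))

def run_job(docs):
    # The framework's shuffle step: group intermediate pairs by key
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, count in map_word_count(doc_id, text):
            grouped[term].append(count)
    return dict(reduce_word_count(t, c) for t, c in grouped.items())
```

For example, `run_job({"doc1": "the chicken and the egg"})` counts "the" twice and every other term once.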
Map/Reduce - examples (2)
• Example: Shortest path in graph (naïve)
– Map: in (nodein, dist); out (nodeout, dist+1) where nodein -> nodeout
– Reduce: in (noder, <dista,r, distb,r, …, distc,r>); out (noder, MIN(dista,r, distb,r, …, distc,r))
– Multiple M/R iterations required, start with (nodestart, 0)
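One round of the naïve shortest-path scheme above can be sketched locally (Python, not from the deck; dicts stand in for the distributed key/value pairs, and each node also re-emits its own distance so the Reduce MIN preserves it):

```python
from collections import defaultdict

INF = float("inf")

def sssp_iteration(distances, adjacency):
    """One Map/Reduce round: Map emits (neighbour, dist+1), Reduce takes MIN."""
    emitted = defaultdict(list)
    for node, dist in distances.items():
        emitted[node].append(dist)            # keep the node's current distance
        if dist < INF:
            for neighbour in adjacency.get(node, []):
                emitted[neighbour].append(dist + 1)   # Map: (node_out, dist+1)
    # Reduce: MIN over all distances emitted for each node
    return {node: min(dists) for node, dists in emitted.items()}
```

Iterating until the distances stop changing gives the termination criterion the slide alludes to.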
• Example: Inverted indexing (full text search)
– Map
• In: docx
• Out: (term1, (docx, pos′1,x)), (term1, (docx, pos″1,x)), (term2, (docx, pos2,x)), …
– Reduce
• In: (term1, <(docx, pos′1,x), (docx, pos″1,x), (docy, pos1,y), …, (docz, pos1,z)>)
• Out: (term1, <(docx, <pos′1,x, pos″1,x, …>), (docy, <pos1,y>), …, (docz, <pos1,z>)>)
Map/Reduce - examples (3)
• Inverted index example rundown
• Input
– Doc1: “Why did the chicken cross the road?”
– Doc2: “The chicken and egg problem”
– Doc3: “Kentucky Fried Chicken”
• Map phase (3 parallel tasks)
– map1 => (“why”,(doc1,1)), (“did”,(doc1,2)), (“the”,(doc1,3)), (“chicken”,(doc1,4)), (“cross”,(doc1,5)), (“the”,(doc1,6)), (“road”,(doc1,7))
– map2 => (“the”,(doc2,1)), (“chicken”,(doc2,2)), (“and”,(doc2,3)), (“egg”,(doc2,4)), (“problem”,(doc2,5))
– map3 => (“kentucky”,(doc3,1)), (“fried”,(doc3,2)), (“chicken”,(doc3,3))
Map/Reduce - examples (4)
• Inverted index example rundown (cont.)
• Intermediate shuffle & sort phase
– (“why”, <(doc1,1)>)
– (“did”, <(doc1,2)>)
– (“the”, <(doc1,3), (doc1,6), (doc2,1)>)
– (“chicken”, <(doc1,4), (doc2,2), (doc3,3)>)
– (“cross”, <(doc1,5)>)
– (“road”, <(doc1,7)>)
– (“and”, <(doc2,3)>)
– (“egg”, <(doc2,4)>)
– (“problem”, <(doc2,5)>)
– (“kentucky”, <(doc3,1)>)
– (“fried”, <(doc3,2)>)
Map/Reduce - examples (5)
• Inverted index example rundown (cont.)
• Reduce phase (11 parallel tasks)
– (“why”, <(doc1,<1>)>)
– (“did”, <(doc1,<2>)>)
– (“the”, <(doc1, <3,6>), (doc2, <1>)>)
– (“chicken”, <(doc1,<4>), (doc2,<2>), (doc3,<3>)>)
– (“cross”, <(doc1,<5>)>)
– (“road”, <(doc1,<7>)>)
– (“and”, <(doc2,<3>)>)
– (“egg”, <(doc2,<4>)>)
– (“problem”, <(doc2,<5>)>)
– (“kentucky”, <(doc3,<1>)>)
– (“fried”, <(doc3,<2>)>)
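The whole rundown (Map, shuffle & sort, Reduce) can be reproduced with a small local sketch (Python, not part of the original deck):

```python
from collections import defaultdict

def map_invert(doc_id, text):
    # Map: emit (term, (doc_id, position)) for each token, positions from 1
    return [(term, (doc_id, pos))
            for pos, term in enumerate(text.lower().split(), start=1)]

def build_index(docs):
    # Shuffle & sort: group all postings by term
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_invert(doc_id, text):
            grouped[term].append(posting)
    # Reduce: collapse postings into (doc, <positions>) lists, sorted by doc
    index = {}
    for term, postings in grouped.items():
        per_doc = defaultdict(list)
        for doc_id, pos in postings:
            per_doc[doc_id].append(pos)
        index[term] = sorted(per_doc.items())
    return index
```

Running it over the three example documents yields, e.g., `("chicken", [(doc1,<4>), (doc2,<2>), (doc3,<3>)])`, matching the Reduce phase above.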
Map/Reduce – pros & cons
• Good for
– Lots of input, intermediate & output data
– Little or no synchronisation required
– “Read once”, batch oriented datasets (ETL)
• Bad for
– Fast response time
– Large amounts of shared data
– Fine-grained synchronisation required
– CPU intensive operations (as opposed to data intensive)
Dryad
• Microsoft Research (2007), http://research.microsoft.com/en-us/projects/dryad/
• General purpose distributed execution engine
– Focus on throughput, not latency
– Automatic management of scheduling, distribution & fault tolerance
• Simple DAG model
– Vertices -> processes (processing nodes)
– Edges -> communication channels between the processes
• DAG model benefits
– Generic scheduler
– No deadlocks / deterministic
– Easier fault tolerance
Dryad DAG jobs
(C) Michael Isard
Dryad (3)
• The job graph can mutate during execution (?)
• Channel types (one way)
– Files on a DFS
– Temporary file
– Shared memory FIFO
– TCP pipes
• Fault tolerance
– Node fails => re-run
– Input disappears => re-run upstream node
– Node is slow => run a duplicate copy at another node, get first result
Dryad architecture & components
(C) Mihai Budiu
Dryad programming
• C++ API (incl. Map/Reduce interfaces)
• SQL Server Integration Services (SSIS)
– Many parallel SQL Server instances (each is a vertex in the DAG)
• DryadLINQ
– LINQ to Dryad translator
• Distributed shell
– Generalisation of the Unix shell & pipes
– Many inputs/outputs per process!
– Pipes span multiple machines
Dryad vs. Map/Reduce
(C) Mihai Budiu
Contents
Part II
Open Source Map/Reduce frameworks
Hadoop
• Apache Nutch (2004), Yahoo is currently the major contributor
• http://hadoop.apache.org/
• Not only a Map/Reduce implementation!
– HDFS – distributed filesystem
– HBase – distributed column store
– Pig – high level query language (SQL like)
– Hive – Hadoop based data warehouse
– ZooKeeper, Chukwa, Pipes/Streaming, …
• Also available on Amazon EC2
• Largest Hadoop cluster – 25K nodes / 100K cores (Yahoo)
Hadoop - Map/Reduce
• Components
– Job client
– Job Tracker
• Only one
• Scheduling, coordinating, monitoring, failure handling
– Task Tracker
• Many
• Executes tasks received by the Job Tracker
• Sends “heartbeats” and progress reports back to the Job Tracker
– Task Runner
• The actual Map or Reduce task started in a separate JVM
• Crashes & failures do not affect the Task Tracker on the node!
Hadoop - Map/Reduce (2)
(C) Tom White
Hadoop - Map/Reduce (3)
• Integrated with HDFS
– Map tasks executed on the HDFS node where the data is (data locality => reduced traffic)
– Data locality is not possible for Reduce tasks
– Intermediate outputs of Map tasks (nodes) are not stored on HDFS, but locally, and then sent to the proper Reduce task (node)
• Status updates
– Task Runner => Task Tracker, progress updates every 3s
– Task Tracker => Job Tracker, heartbeat + progress for all local tasks every 5s
– If a task has no progress report for too long, it will be considered failed and re-started
Hadoop - Map/Reduce (4)
• Some extras
– Counters
• Gather stats about a task
• Globally aggregated (Task Runner => Task Tracker => Job Tracker)
• M/R counters: M/R input records, M/R output records
• Filesystem counters: bytes read/written
• Job counters: launched M/R tasks, failed M/R tasks, …
– Joins
• Copy the small set on each node and perform joins locally. Useful when one dataset is very large, the other very small (e.g. “Scalable Distributed Reasoning using MapReduce” from VUA)
• Map side join – data is joined before the Map function, very efficient but less flexible (datasets must be partitioned & sorted in a particular way)
• Reduce side join – more general but less efficient (Map generates (K,V) pairs using the join key)
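A local sketch of the reduce-side join described above (Python, not from the deck; the two tagged lists stand in for the (K,V) pairs the Map side would emit with the join key as K):

```python
from collections import defaultdict

def reduce_side_join(left, right):
    """Reduce-side join sketch: both Map sides emit (join_key, record),
    tagged by which dataset they came from; Reduce pairs records per key."""
    grouped = defaultdict(lambda: ([], []))
    for key, record in left:      # "Map" over the first dataset (tag: side 0)
        grouped[key][0].append(record)
    for key, record in right:     # "Map" over the second dataset (tag: side 1)
        grouped[key][1].append(record)
    # "Reduce": cross-product of the two sides for each join key
    return [(key, l, r)
            for key, (ls, rs) in grouped.items()
            for l in ls for r in rs]
```

Keys present on only one side produce no output, which is why this is more general but less efficient than a map-side join: all records must be shuffled to the reducers first.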
Hadoop - Map/Reduce (5)
• Built-in mappers and reducers
– Chain – run a chain/pipe of sequential Maps (M+RM*); the last Map output is the Task output
– FieldSelection – select a list of fields from the input dataset to be used as MR keys/values
– TokenCounterMapper, SumReducer – (remember the “word count” example?)
– RegexMapper – matches a regex in the input key/value pairs
Cloud MapReduce
• Accenture (2010)
• http://code.google.com/p/cloudmapreduce/
• Map/Reduce implementation for AWS (EC2, S3, SimpleDB, SQS)
– fast (reported as up to 60x faster than Hadoop on EC2 in some cases)
– scalable & robust (no single point of bottleneck or failure)
– simple (3 KLOC)
• Features
– No need for a centralised coordinator (JobTracker); job status is kept in the cloud datastore (SimpleDB)
– All data transfer & communication is handled by the cloud
– All I/O and storage is handled by the cloud
Cloud MapReduce (2)
(C) Ricky Ho
Cloud MapReduce (3)
• Job client workflow
1. Store input data (S3)
2. Create a Map task for each data split & put it into the Mapper Queue (SQS)
3. Create multiple Partition Queues (SQS)
4. Create the Reducer Queue (SQS) & put a Reduce task for each Partition Queue
5. Create the Output Queue (SQS)
6. Create a Job Request (with references to all queues) and put it into SimpleDB
7. Start EC2 instances for Mappers & Reducers
8. Poll SimpleDB for job status
9. When the job completes, download results from S3
Cloud MapReduce (4)
• Mapper workflow
1. Dequeue a Map task from the Mapper Queue
2. Fetch data from S3
3. Perform the user-defined map function; add the output (Km, Vm) pairs to a Partition Queue chosen by hash(Km) => several partition keys may share the same partition queue!
4. When done, remove the Map task from the Mapper Queue
• Reducer workflow
1. Dequeue a Reduce task from the Reducer Queue
2. Dequeue the (Km, Vm) pairs from the corresponding Partition Queue => several partitions may share the same queue!
3. Perform the user-defined reduce function and add output pairs (Kr, Vr) to the Output Queue
4. When done, remove the Reduce task from the Reducer Queue
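The mapper/reducer workflows above can be simulated in-process (a Python sketch, not from the deck; plain local queues stand in for SQS, and ordinary dicts for S3/SimpleDB):

```python
import queue
from collections import defaultdict

def run_queue_job(splits, map_fn, reduce_fn, partitions=2):
    """Local sketch of the queue-driven flow: Map tasks and intermediate
    pairs travel through queues, mirroring the Cloud MapReduce design."""
    mapper_q = queue.Queue()
    partition_qs = [queue.Queue() for _ in range(partitions)]
    output = {}

    for split in splits:                       # one Map task per data split
        mapper_q.put(split)

    while not mapper_q.empty():                # Mapper workflow
        split = mapper_q.get()
        for k, v in map_fn(split):
            # route by hash(Km): several keys may share a partition queue
            partition_qs[hash(k) % partitions].put((k, v))

    for pq in partition_qs:                    # Reducer workflow
        grouped = defaultdict(list)            # group the shared queue by key
        while not pq.empty():
            k, v = pq.get()
            grouped[k].append(v)
        for k, vs in grouped.items():
            output[k] = reduce_fn(k, vs)       # stand-in for the Output Queue
    return output
```

For example, a word count runs as `run_queue_job([tokens], lambda s: [(w, 1) for w in s], lambda k, vs: sum(vs))`.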
MR.Flow
• Web based M/R editor
– http://www.mr-flow.com
– Reusable M/R modules
– Execution & status monitoring (Hadoop clusters)
Contents
Part III
Some Map/Reduce algorithms
General considerations
• Map execution order is not deterministic
• Map processing time cannot be predicted
• Reduce tasks cannot start before all Maps have finished (dataset needs to be fully partitioned)
• Not suitable for continuous input streams
• There will be a spike in network utilisation after the Map / before the Reduce phase
• Number & size of key/value pairs
– Object creation & serialisation overhead (Amdahl’s law!)
• Aggregate partial results when possible!
– Use Combiners
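The effect of a Combiner can be seen in a toy sketch (Python, hypothetical helper names; a Counter does the local pre-aggregation a Combiner would):

```python
from collections import Counter

def map_plain(doc_id, text):
    # Without a combiner: one (term, 1) pair per token crosses the network
    return [(t, 1) for t in text.lower().split()]

def map_with_combiner(doc_id, text):
    # With a combiner: partial sums are computed locally, so only one
    # (term, partial_count) pair per *distinct* term is transferred
    return list(Counter(text.lower().split()).items())
```

On documents with many repeated terms the combined output is much smaller, which directly cuts the Map-to-Reduce network spike noted above.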
Graph algorithms
• Very suitable for M/R processing
– Data (graph node) locality
– “spreading activation” type of processing
– Some algorithms with sequential dependency not suitable for M/R
• Breadth-first search algorithms better than depth-first
• General approach
– Graph represented by adjacency lists
– Map task – input: node + its adjacency list; perform some analysis over the node link structure; output: target key + analysis result
– Reduce task – aggregate values by key
– Perform multiple iterations (with a termination criteria)
Social Network Analysis
• Problem: recommend new friends (friend-of-a-friend, FOAF)
• Map task
– U (target user) is fixed and its friend list is copied to all cluster nodes (“copy join”); each cluster node stores part of the social graph
– In: (X, <friendsX>), i.e. the local data for the cluster node
– Out:
• if (U, X) are friends => (U, <friendsX \ friendsU>), i.e. the users who are friends of X but not already friends of U
• nil otherwise
• Reduce task
– In: (U, <<friendsA \ friendsU>, <friendsB \ friendsU>, …>), i.e. the FOAF lists for all users A, B, etc. who are friends with U
– Out: (U, <(X1, N1), (X2, N2), …>), where each X is a FOAF for U, and N is its total number of occurrences in all FOAF lists (sort/rank the result!)
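A local sketch of the FOAF recommendation job above (Python, not from the deck; `friends_of` is a hypothetical in-memory stand-in for the distributed social graph):

```python
from collections import Counter

def recommend_foaf(user, friends_of):
    """For each friend X of U, 'Map' emits friends(X) \\ friends(U);
    'Reduce' counts occurrences and ranks the candidates."""
    my_friends = friends_of[user]
    counts = Counter()
    for x in my_friends:                              # Map task per friend X
        for foaf in friends_of.get(x, set()) - my_friends - {user}:
            counts[foaf] += 1                         # Reduce: total occurrences
    return counts.most_common()                       # sorted/ranked result
```

A candidate recommended by several of U's friends ranks higher, since N counts how many FOAF lists it appears in.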
PageRank with M/R
(C) Jimmy Lin
Text Indexing & Retrieval
• Indexing is very suitable for M/R
– Focus on scalability, not on latency & response time
– Batch oriented
• Map task
– emit (Term, (DocID, position))
• Reduce task
– Group pairs by Term and sort by DocID
Text Indexing & Retrieval (2)
(C) Jimmy Lin
Text Indexing & Retrieval (3)
• Retrieval not suitable for M/R
– Focus on response time
– Startup of Mappers & Reducers is usually prohibitively expensive
• Katta
– http://katta.sourceforge.net/
– Distributed Lucene indexing with Hadoop (HDFS)
– Multicast querying & ranking
Useful links
• "MapReduce: Simplified Data Processing on Large Clusters"
• “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks”
• “Cloud MapReduce Technical Report”
• Data-Intensive Text Processing with MapReduce
• Hadoop - The Definitive Guide
Q & A
Questions?