Large Scale Data Analysis with Map/Reduce, part I
Marin Dimitrov
(technology watch #1)
Feb, 2010
Contents
• Map/Reduce
• Dryad
• Sector/Sphere
• Open source M/R frameworks & tools
– Hadoop (Yahoo/Apache)
– Cloud MapReduce (Accenture)
– Elastic MapReduce (Hadoop on AWS)
– MR.Flow
• Some M/R algorithms
– Graph algorithms, text indexing & retrieval
Large Scale Data Analysis (Map/Reduce), part I #2
Contents
Part I
Distributed computing frameworks
Scalability & Parallelisation
• Scalability approaches
– Scale up (vertical scaling)
• Only one direction of improvement (bigger box)
– Scale out (horizontal scaling)
• Two directions – add more nodes + scale up each node
• Can achieve 4x the performance of a similarly priced scale-up system (ref?)
– Hybrid (“scale out in a box”)
• Not suitable for parallelisation:
– Algorithms with state
– Dependencies from one iteration to another (recurrence, induction)
Parallelisation approaches
• Parallelisation approaches
– Task decomposition
• Distribute coarse-grained (synchronisation wise) and computationally expensive tasks (otherwise too much coordination/management overhead)
• Dependencies - execution order vs. data dependencies
• Move the data to the processing (when needed)
– Data decomposition
• Each parallel task works with a data partition assigned to it (no sharing)
• Data has regular structure, i.e. chunks expected to need the same amount of processing time
• Two criteria: granularity (size of chunk) and shape (data exchange between chunk neighbours)
• Move the processing to the data
Amdahl’s law
• Impossible to achieve linear speedup
• Maximum speedup is always bounded by the overhead for parallelisation and by the serial processing part
• Amdahl’s law
– max_speedup = 1 / ((1 − P) + P/N)
– P: proportion of the program that can be parallelised (1 − P remains serial or overhead)
– N: number of processors / parallel nodes
– Example: P = 75% (i.e. 25% serial or overhead):
N (parallel nodes): 2    | 4    | 8    | 16   | 32   | 1024 | 64K
Max speedup:        1.60 | 2.29 | 2.91 | 3.37 | 3.66 | 3.99 | 3.99
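The bound is easy to verify numerically; the sketch below (Python, not part of the original deck) reproduces the speedup table above:

```python
def max_speedup(p, n):
    """Amdahl's law: maximum speedup for parallel fraction p on n nodes."""
    return 1.0 / ((1.0 - p) + p / n)

# P = 75% parallelisable, as in the example above
for n in (2, 4, 8, 16, 32, 1024):
    print(n, round(max_speedup(0.75, n), 2))
```

Note how the speedup saturates near 1/(1 − P) = 4 no matter how many nodes are added.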
Map/Reduce
• Google (2005), US patent (2010)
• General idea - co-locate data with computation nodes
– Data decomposition (parallelisation) – no data/order dependencies between tasks (except the Map-to-Reduce phase)
– Try to utilise data locality (bandwidth is $$$)
– Implicit data flow (higher abstraction level than MPI)
– Partial failure handling (failed map/reduce tasks are re-scheduled)
• Structure
– Map – for each input (Ki, Vi) produce zero or more output pairs (Km, Vm)
– Combine – optional intermediate aggregation (less Map-to-Reduce data transfer)
– Reduce – for input pair (Km, list(V1, V2, …, Vn)) produce zero or more output pairs (Kr, Vr)
Map/Reduce (2)
(C) Jimmy Lin
Map/Reduce - examples
• In other words…
– Map = partitioning of the data (compute part of a problem across several servers)
– Reduce = processing of the partitions (aggregate the partial results from all servers into a single resultset)
– The M/R framework takes care of grouping of partitions by key
• Example: word count
– Map (1 task per document in the collection)
• In: docx
• Out: (term1, count1,x), (term2, count2,x), …
– Reduce (1 task per term in the collection)
• In: (term1, <count1,x, count1,y, …, count1,z>)
• Out: (term1, SUM(count1,x, count1,y, …, count1,z))
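A minimal local simulation of the word-count job above (a Python sketch, not from the deck; a plain dict stands in for the framework's grouping of partitions by key):

```python
from collections import defaultdict

def map_word_count(doc_id, text):
    # Map: emit one (term, 1) pair per token in the document
    return [(term, 1) for term in text.lower().split()]

def reduce_word_count(term, counts):
    # Reduce: sum the partial counts for a single term
    return (term, sum(counts))

def run_job(docs):
    # The framework's shuffle step: group intermediate pairs by key
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, count in map_word_count(doc_id, text):
            grouped[term].append(count)
    return dict(reduce_word_count(t, c) for t, c in grouped.items())
```

For example, `run_job({"doc1": "the chicken and the egg"})` counts "the" twice and every other term once.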
Map/Reduce - examples (2)
• Example: Shortest path in graph (naïve)
– Map: in (nodein, dist); out (nodeout, dist+1) where nodein -> nodeout
– Reduce: in (noder, <dista,r, distb,r, …, distc,r>); out (noder, MIN(dista,r, distb,r, …, distc,r))
– Multiple M/R iterations required, start with (nodestart, 0)
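One round of the naïve shortest-path scheme above can be sketched locally (Python, not from the deck; dicts stand in for the distributed key/value pairs, and each node also re-emits its own distance so the Reduce MIN preserves it):

```python
from collections import defaultdict

INF = float("inf")

def sssp_iteration(distances, adjacency):
    """One Map/Reduce round: Map emits (neighbour, dist+1), Reduce takes MIN."""
    emitted = defaultdict(list)
    for node, dist in distances.items():
        emitted[node].append(dist)            # keep the node's current distance
        if dist < INF:
            for neighbour in adjacency.get(node, []):
                emitted[neighbour].append(dist + 1)   # Map: (node_out, dist+1)
    # Reduce: MIN over all distances emitted for each node
    return {node: min(dists) for node, dists in emitted.items()}
```

Iterating until the distances stop changing gives the termination criterion the slide alludes to.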
• Example: Inverted indexing (full text search)
– Map
• In: docx
• Out: (term1, (docx, pos′1,x)), (term1, (docx, pos″1,x)), (term2, (docx, pos2,x)), …
– Reduce
• In: (term1, <(docx, pos′1,x), (docx, pos″1,x), (docy, pos1,y), …, (docz, pos1,z)>)
• Out: (term1, <(docx, <pos′1,x, pos″1,x, …>), (docy, <pos1,y>), …, (docz, <pos1,z>)>)
Map/Reduce - examples (3)
• Inverted index example rundown
• Input
– Doc1: “Why did the chicken cross the road?”
– Doc2: “The chicken and egg problem”
– Doc3: “Kentucky Fried Chicken”
• Map phase (3 parallel tasks)
– map1 => (“why”,(doc1,1)), (“did”,(doc1,2)), (“the”,(doc1,3)), (“chicken”,(doc1,4)), (“cross”,(doc1,5)), (“the”,(doc1,6)), (“road”,(doc1,7))
– map2 => (“the”,(doc2,1)), (“chicken”,(doc2,2)), (“and”,(doc2,3)), (“egg”,(doc2,4)), (“problem”,(doc2,5))
– map3 => (“kentucky”,(doc3,1)), (“fried”,(doc3,2)), (“chicken”,(doc3,3))
Map/Reduce - examples (4)
• Inverted index example rundown (cont.)
• Intermediate shuffle & sort phase
– (“why”, <(doc1,1)>)
– (“did”, <(doc1,2)>)
– (“the”, <(doc1,3), (doc1,6), (doc2,1)>)
– (“chicken”, <(doc1,4), (doc2,2), (doc3,3)>)
– (“cross”, <(doc1,5)>)
– (“road”, <(doc1,7)>)
– (“and”, <(doc2,3)>)
– (“egg”, <(doc2,4)>)
– (“problem”, <(doc2,5)>)
– (“kentucky”, <(doc3,1)>)
– (“fried”, <(doc3,2)>)
Map/Reduce - examples (5)
• Inverted index example rundown (cont.)
• Reduce phase (11 parallel tasks)
– (“why”, <(doc1,<1>)>)
– (“did”, <(doc1,<2>)>)
– (“the”, <(doc1, <3,6>), (doc2, <1>)>)
– (“chicken”, <(doc1,<4>), (doc2,<2>), (doc3,<3>)>)
– (“cross”, <(doc1,<5>)>)
– (“road”, <(doc1,<7>)>)
– (“and”, <(doc2,<3>)>)
– (“egg”, <(doc2,<4>)>)
– (“problem”, <(doc2,<5>)>)
– (“kentucky”, <(doc3,<1>)>)
– (“fried”, <(doc3,<2>)>)
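The whole rundown (Map, shuffle & sort, Reduce) can be reproduced with a small local sketch (Python, not part of the original deck):

```python
from collections import defaultdict

def map_invert(doc_id, text):
    # Map: emit (term, (doc_id, position)) for each token, positions from 1
    return [(term, (doc_id, pos))
            for pos, term in enumerate(text.lower().split(), start=1)]

def build_index(docs):
    # Shuffle & sort: group all postings by term
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_invert(doc_id, text):
            grouped[term].append(posting)
    # Reduce: collapse postings into (doc, <positions>) lists, sorted by doc
    index = {}
    for term, postings in grouped.items():
        per_doc = defaultdict(list)
        for doc_id, pos in postings:
            per_doc[doc_id].append(pos)
        index[term] = sorted(per_doc.items())
    return index
```

Running it over the three example documents yields, e.g., `("chicken", [(doc1,<4>), (doc2,<2>), (doc3,<3>)])`, matching the Reduce phase above.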
Map/Reduce – pros & cons
• Good for
– Lots of input, intermediate & output data
– Little or no synchronisation required
– “Read once”, batch oriented datasets (ETL)
• Bad for
– Fast response time
– Large amounts of shared data
– Fine-grained synchronisation required
– CPU intensive operations (as opposed to data intensive)
Dryad
• Microsoft Research (2007), http://research.microsoft.com/en-us/projects/dryad/
• General purpose distributed execution engine
– Focus on throughput, not latency
– Automatic management of scheduling, distribution & fault tolerance
• Simple DAG model
– Vertices -> processes (processing nodes)
– Edges -> communication channels between the processes
• DAG model benefits
– Generic scheduler
– No deadlocks / deterministic
– Easier fault tolerance
Dryad DAG jobs
(C) Michael Isard
Dryad (3)
• The job graph can mutate during execution (?)
• Channel types (one way)
– Files on a DFS
– Temporary file
– Shared memory FIFO
– TCP pipes
• Fault tolerance
– Node fails => re-run
– Input disappears => re-run upstream node
– Node is slow => run a duplicate copy at another node, get first result
Dryad architecture & components
(C) Mihai Budiu
Dryad programming
• C++ API (incl. Map/Reduce interfaces)
• SQL Server Integration Services (SSIS)
– Many parallel SQL Server instances (each is a vertex in the DAG)
• DryadLINQ
– LINQ to Dryad translator
• Distributed shell
– Generalisation of the Unix shell & pipes
– Many inputs/outputs per process!
– Pipes span multiple machines
Dryad vs. Map/Reduce
(C) Mihai Budiu
Contents
Part II
Open Source Map/Reduce frameworks
Hadoop
• Apache Nutch (2004), Yahoo is currently the major contributor
• http://hadoop.apache.org/
• Not only a Map/Reduce implementation!
– HDFS – distributed filesystem
– HBase – distributed column store
– Pig – high level query language (SQL like)
– Hive – Hadoop based data warehouse
– ZooKeeper, Chukwa, Pipes/Streaming, …
• Also available on Amazon EC2
• Largest Hadoop cluster – 25K nodes / 100K cores (Yahoo)
Hadoop - Map/Reduce
• Components
– Job client
– Job Tracker
• Only one
• Scheduling, coordinating, monitoring, failure handling
– Task Tracker
• Many
• Executes tasks received by the Job Tracker
• Sends “heartbeats” and progress reports back to the Job Tracker
– Task Runner
• The actual Map or Reduce task started in a separate JVM
• Crashes & failures do not affect the Task Tracker on the node!
Hadoop - Map/Reduce (2)
(C) Tom White
Hadoop - Map/Reduce (3)
• Integrated with HDFS
– Map tasks executed on the HDFS node where the data is (data locality => reduced traffic)
– Data locality is not possible for Reduce tasks
– Intermediate outputs of Map tasks (nodes) are not stored on HDFS, but locally, and then sent to the proper Reduce task (node)
• Status updates
– Task Runner => Task Tracker, progress updates every 3s
– Task Tracker => Job Tracker, heartbeat + progress for all local tasks every 5s
– If a task has no progress report for too long, it will be considered failed and re-started
Hadoop - Map/Reduce (4)
• Some extras
– Counters
• Gather stats about a task
• Globally aggregated (Task Runner => Task Tracker => Job Tracker)
• M/R counters: M/R input records, M/R output records
• Filesystem counters: bytes read/written
• Job counters: launched M/R tasks, failed M/R tasks, …
– Joins
• Copy the small set on each node and perform joins locally. Useful when one dataset is very large, the other very small (e.g. “Scalable Distributed Reasoning using MapReduce” from VUA)
• Map side join – data is joined before the Map function, very efficient but less flexible (datasets must be partitioned & sorted in a particular way)
• Reduce side join – more general but less efficient (Map generates (K,V) pairs using the join key)
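A local sketch of the reduce-side join described above (Python, not from the deck; the two tagged lists stand in for the (K,V) pairs the Map side would emit with the join key as K):

```python
from collections import defaultdict

def reduce_side_join(left, right):
    """Reduce-side join sketch: both Map sides emit (join_key, record),
    tagged by which dataset they came from; Reduce pairs records per key."""
    grouped = defaultdict(lambda: ([], []))
    for key, record in left:      # "Map" over the first dataset (tag: side 0)
        grouped[key][0].append(record)
    for key, record in right:     # "Map" over the second dataset (tag: side 1)
        grouped[key][1].append(record)
    # "Reduce": cross-product of the two sides for each join key
    return [(key, l, r)
            for key, (ls, rs) in grouped.items()
            for l in ls for r in rs]
```

Keys present on only one side produce no output, which is why this is more general but less efficient than a map-side join: all records must be shuffled to the reducers first.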
Hadoop - Map/Reduce (5)
• Built-in mappers and reducers
– Chain – run a chain/pipe of sequential Maps (M+RM*); the last Map output is the Task output
– FieldSelection – select a list of fields from the input dataset to be used as MR keys/values
– TokenCounterMapper, SumReducer – (remember the “word count” example?)
– RegexMapper – matches a regex in the input key/value pairs
Cloud MapReduce
• Accenture (2010)
• http://code.google.com/p/cloudmapreduce/
• Map/Reduce implementation for AWS (EC2, S3, SimpleDB, SQS)
– fast (reported as up to 60x faster than Hadoop on EC2 in some cases)
– scalable & robust (no single point of bottleneck or failure)
– simple (3 KLOC)
• Features
– No need for a centralised coordinator (JobTracker); job status is kept in the cloud datastore (SimpleDB)
– All data transfer & communication is handled by the cloud
– All I/O and storage is handled by the cloud
Cloud MapReduce (2)
(C) Ricky Ho
Cloud MapReduce (3)
• Job client workflow
1. Store input data (S3)
2. Create a Map task for each data split & put it into the Mapper Queue (SQS)
3. Create multiple Partition Queues (SQS)
4. Create the Reducer Queue (SQS) & put a Reduce task for each Partition Queue
5. Create the Output Queue (SQS)
6. Create a Job Request (with references to all queues) and put it into SimpleDB
7. Start EC2 instances for Mappers & Reducers
8. Poll SimpleDB for job status
9. When the job completes, download results from S3
Cloud MapReduce (4)
• Mapper workflow
1. Dequeue a Map task from the Mapper Queue
2. Fetch data from S3
3. Perform the user-defined map function; add the output (Km, Vm) pairs to a Partition Queue chosen by hash(Km) => several partition keys may share the same partition queue!
4. When done, remove the Map task from the Mapper Queue
• Reducer workflow
1. Dequeue a Reduce task from the Reducer Queue
2. Dequeue the (Km, Vm) pairs from the corresponding Partition Queue => several partitions may share the same queue!
3. Perform the user-defined reduce function and add output pairs (Kr, Vr) to the Output Queue
4. When done, remove the Reduce task from the Reducer Queue
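The mapper/reducer workflows above can be simulated in-process (a Python sketch, not from the deck; plain local queues stand in for SQS, and ordinary dicts for S3/SimpleDB):

```python
import queue
from collections import defaultdict

def run_queue_job(splits, map_fn, reduce_fn, partitions=2):
    """Local sketch of the queue-driven flow: Map tasks and intermediate
    pairs travel through queues, mirroring the Cloud MapReduce design."""
    mapper_q = queue.Queue()
    partition_qs = [queue.Queue() for _ in range(partitions)]
    output = {}

    for split in splits:                       # one Map task per data split
        mapper_q.put(split)

    while not mapper_q.empty():                # Mapper workflow
        split = mapper_q.get()
        for k, v in map_fn(split):
            # route by hash(Km): several keys may share a partition queue
            partition_qs[hash(k) % partitions].put((k, v))

    for pq in partition_qs:                    # Reducer workflow
        grouped = defaultdict(list)            # group the shared queue by key
        while not pq.empty():
            k, v = pq.get()
            grouped[k].append(v)
        for k, vs in grouped.items():
            output[k] = reduce_fn(k, vs)       # stand-in for the Output Queue
    return output
```

For example, a word count runs as `run_queue_job([tokens], lambda s: [(w, 1) for w in s], lambda k, vs: sum(vs))`.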
MR.Flow
• Web based M/R editor
– http://www.mr-flow.com
– Reusable M/R modules
– Execution & status monitoring (Hadoop clusters)
Contents
Part III
Some Map/Reduce algorithms
General considerations
• Map execution order is not deterministic
• Map processing time cannot be predicted
• Reduce tasks cannot start before all Maps have finished (dataset needs to be fully partitioned)
• Not suitable for continuous input streams
• There will be a spike in network utilisation after the Map / before the Reduce phase
• Number & size of key/value pairs
– Object creation & serialisation overhead (Amdahl’s law!)
• Aggregate partial results when possible!
– Use Combiners
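The effect of a Combiner can be seen in a toy sketch (Python, hypothetical helper names; a Counter does the local pre-aggregation a Combiner would):

```python
from collections import Counter

def map_plain(doc_id, text):
    # Without a combiner: one (term, 1) pair per token crosses the network
    return [(t, 1) for t in text.lower().split()]

def map_with_combiner(doc_id, text):
    # With a combiner: partial sums are computed locally, so only one
    # (term, partial_count) pair per *distinct* term is transferred
    return list(Counter(text.lower().split()).items())
```

On documents with many repeated terms the combined output is much smaller, which directly cuts the Map-to-Reduce network spike noted above.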
Graph algorithms
• Very suitable for M/R processing
– Data (graph node) locality
– “spreading activation” type of processing
– Some algorithms with sequential dependency not suitable for M/R
• Breadth-first search algorithms better than depth-first
• General approach
– Graph represented by adjacency lists
– Map task – input: node + its adjacency list; perform some analysis over the node link structure; output: target key + analysis result
– Reduce task – aggregate values by key
– Perform multiple iterations (with a termination criteria)
Social Network Analysis
• Problem: recommend new friends (friend-of-a-friend, FOAF)
• Map task
– U (target user) is fixed and its friend list is copied to all cluster nodes (“copy join”); each cluster node stores part of the social graph
– In: (X, <friendsX>), i.e. the local data for the cluster node
– Out:
• if (U, X) are friends => (U, <friendsX \ friendsU>), i.e. the users who are friends of X but not already friends of U
• nil otherwise
• Reduce task
– In: (U, <<friendsA \ friendsU>, <friendsB \ friendsU>, …>), i.e. the FOAF lists for all users A, B, etc. who are friends with U
– Out: (U, <(X1, N1), (X2, N2), …>), where each X is a FOAF for U, and N is its total number of occurrences in all FOAF lists (sort/rank the result!)
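A local sketch of the FOAF recommendation job above (Python, not from the deck; `friends_of` is a hypothetical in-memory stand-in for the distributed social graph):

```python
from collections import Counter

def recommend_foaf(user, friends_of):
    """For each friend X of U, 'Map' emits friends(X) \\ friends(U);
    'Reduce' counts occurrences and ranks the candidates."""
    my_friends = friends_of[user]
    counts = Counter()
    for x in my_friends:                              # Map task per friend X
        for foaf in friends_of.get(x, set()) - my_friends - {user}:
            counts[foaf] += 1                         # Reduce: total occurrences
    return counts.most_common()                       # sorted/ranked result
```

A candidate recommended by several of U's friends ranks higher, since N counts how many FOAF lists it appears in.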
PageRank with M/R
(C) Jimmy Lin
Text Indexing & Retrieval
• Indexing is very suitable for M/R
– Focus on scalability, not on latency & response time
– Batch oriented
• Map task
– emit (Term, (DocID, position))
• Reduce task
– Group pairs by Term and sort by DocID
Text Indexing & Retrieval (2)
(C) Jimmy Lin
Text Indexing & Retrieval (3)
• Retrieval not suitable for M/R
– Focus on response time
– Startup of Mappers & Reducers is usually prohibitively expensive
• Katta
– http://katta.sourceforge.net/
– Distributed Lucene indexing with Hadoop (HDFS)
– Multicast querying & ranking
Useful links
• "MapReduce: Simplified Data Processing on Large Clusters"
• “Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks”
• “Cloud MapReduce Technical Report”
• Data-Intensive Text Processing with MapReduce
• Hadoop - The Definitive Guide
Q & A
Questions?