Distributed Cluster Computing Platforms


  • Slide 1
  • Distributed Cluster Computing Platforms
  • Slide 2
  • Outline What is the purpose of Data Intensive Super Computing? MapReduce Pregel Dryad Spark/Shark Distributed Graph Computing
  • Slide 3
  • Why DISC? DISC stands for Data Intensive Super Computing. Many applications generate such data: scientific data, web search engines, social networks, economics, GIS. New data are continuously generated, and people want to understand them. Big-data analysis is now considered a very important method for scientific research.
  • Slide 4
  • What are the required features for a platform to handle DISC? Application specific: it is very difficult, or even impossible, to construct one system that fits them all; the POSIX-compatible file system is one example. Each system should be re-configured or even re-designed for a specific application; think about the motivation for building the Google File System for the Google search engine. Programmer-friendly interfaces: the application programmer should not have to deal with infrastructure such as machines and networks. Fault tolerance: the platform should handle faulty components automatically, without any special treatment from the application. Scalability: the platform should run on top of at least thousands of machines and harness the power of all of them; load balancing should be achieved by the platform instead of the application itself. Keep these four features in mind as the concrete platforms are introduced below.
  • Slide 5
  • Google MapReduce: Programming Model, Implementation, Refinements, Evaluation, Conclusion
  • Slide 6
  • Motivation: large scale data processing Process lots of data to produce other derived data Input: crawled documents, web request logs etc. Output: inverted indices, web page graph structure, top queries in a day etc. Want to use hundreds or thousands of CPUs but want to only focus on the functionality MapReduce hides messy details in a library: Parallelization Data distribution Fault-tolerance Load balancing
  • Slide 7
  • Motivation: Large Scale Data Processing Want to process lots of data ( > 1 TB) Want to parallelize across hundreds/thousands of CPUs Want to make this easy "Google Earth uses 70.5 TB: 70 TB for the raw imagery and 500 GB for the index data." From: http://googlesystem.blogspot.com/2006/09/how-much-data-does-google-store.html
  • Slide 8
  • MapReduce Automatic parallelization & distribution Fault-tolerant Provides status and monitoring tools Clean abstraction for programmers
  • Slide 9
  • Programming Model Borrows from functional programming Users implement interface of two functions: map (in_key, in_value) -> (out_key, intermediate_value) list reduce (out_key, intermediate_value list) -> out_value list
  • Slide 10
  • map Records from the data source (lines out of files, rows of a database, etc) are fed into the map function as key*value pairs: e.g., (filename, line). map() produces one or more intermediate values along with an output key from the input.
  • Slide 11
  • reduce After the map phase is over, all the intermediate values for a given output key are combined together into a list reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key)
  • Slide 12
  • Architecture
  • Slide 13
  • Parallelism map() functions run in parallel, creating different intermediate values from different input data sets reduce() functions also run in parallel, each working on a different output key All values are processed independently Bottleneck: the reduce phase can't start until the map phase is completely finished.
  • Slide 14
  • Example: Count word occurrences
      map(String input_key, String input_value):
        // input_key: document name
        // input_value: document contents
        for each word w in input_value:
          EmitIntermediate(w, "1");

      reduce(String output_key, Iterator intermediate_values):
        // output_key: a word
        // intermediate_values: a list of counts
        int result = 0;
        for each v in intermediate_values:
          result += ParseInt(v);
        Emit(AsString(result));
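  • A minimal runnable Python sketch of the same word-count job (an in-process simulation for illustration only, not Google's C++ MapReduce library; the pages are the ones used in the example a few slides below):
      from collections import defaultdict

      def map_fn(doc_name, doc_contents):
          # map: emit (word, "1") for every word in the document
          for word in doc_contents.split():
              yield (word, "1")

      def reduce_fn(word, counts):
          # reduce: sum all the partial counts for one word
          return (word, str(sum(int(c) for c in counts)))

      def run_mapreduce(inputs):
          # shuffle: group every intermediate value by its output key
          groups = defaultdict(list)
          for key, value in inputs:
              for out_key, out_value in map_fn(key, value):
                  groups[out_key].append(out_value)
          # the reduce phase cannot start until every map call has finished
          return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

      pages = [("Page 1", "the weather is good"),
               ("Page 2", "today is good"),
               ("Page 3", "good weather is good")]
      print(run_mapreduce(pages))
      # [('good', '4'), ('is', '3'), ('the', '1'), ('today', '1'), ('weather', '2')]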
  • Slide 15
  • Example vs. Actual Source Code Example is written in pseudo-code Actual implementation is in C++, using a MapReduce library Bindings for Python and Java exist via interfaces True code is somewhat more involved (defines how the input key/values are divided up and accessed, etc.)
  • Slide 16
  • Example Page 1: the weather is good Page 2: today is good Page 3: good weather is good.
  • Slide 17
  • Map output Worker 1: (the 1), (weather 1), (is 1), (good 1). Worker 2: (today 1), (is 1), (good 1). Worker 3: (good 1), (weather 1), (is 1), (good 1).
  • Slide 18
  • Reduce Input Worker 1: (the 1) Worker 2: (is 1), (is 1), (is 1) Worker 3: (weather 1), (weather 1) Worker 4: (today 1) Worker 5: (good 1), (good 1), (good 1), (good 1)
  • Slide 19
  • Reduce Output Worker 1: (the 1) Worker 2: (is 3) Worker 3: (weather 2) Worker 4: (today 1) Worker 5: (good 4)
  • Slide 20
  • Some Other Real Examples Term frequencies through the whole Web repository Count of URL access frequency Reverse web-link graph
  • Slide 21
  • Implementation Overview Typical cluster: 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory Limited bisection bandwidth Storage is on local IDE disks GFS: distributed file system manages data (SOSP'03) Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines Implementation is a C++ library linked into user programs
  • Slide 22
  • Architecture
  • Slide 23
  • Execution
  • Slide 24
  • Parallel Execution
  • Slide 25
  • Task Granularity And Pipelining Fine granularity tasks: many more map tasks than machines Minimizes time for fault recovery Can pipeline shuffling with map execution Better dynamic load balancing Often use 200,000 map/5000 reduce tasks w/ 2000 machines
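  • For scale (simple arithmetic on the numbers above): 200,000 map tasks on 2,000 machines is about 100 map tasks per machine, and 5,000 reduce tasks is about 2-3 per machine, so a failed machine's work is redone in many small pieces rather than one big one.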
  • Slide 26
  • Locality Master program divvies up tasks based on location of data: (Asks GFS for locations of replicas of input file blocks) tries to have map() tasks on same machine as physical file data, or at least same rack map() task inputs are divided into 64 MB blocks: same size as Google File System chunks Without this, rack switches limit read rate Effect: Thousands of machines read input at local disk speed
  • Slide 27
  • Fault Tolerance Master detects worker failures Re-executes completed & in-progress map() tasks Re-executes in-progress reduce() tasks Master notices particular input key/values cause crashes in map(), and skips those values on re-execution. Effect: Can work around bugs in third-party libraries!
  • Slide 28
  • Fault Tolerance On worker failure: Detect failure via periodic heartbeats Re-execute completed and in-progress map tasks Re-execute in progress reduce tasks Task completion committed through master Master failure: Could handle, but don't yet (master failure unlikely) Robust: lost 1600 of 1800 machines once, but finished fine
  • Slide 29
  • Optimizations No reduce can start until map is complete: A single slow disk controller can rate-limit the whole process Master redundantly executes slow-moving map tasks; uses the results of whichever copy finishes first Why is it safe to redundantly execute map tasks? Wouldn't this mess up the total computation? Slow workers significantly lengthen completion time: Other jobs consuming resources on the machine Bad disks with soft errors transfer data very slowly Weird things: processor caches disabled (!!)
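  • A toy Python sketch of the backup-task idea (my own illustration, not Google's scheduler): run a second copy of a slow, deterministic map task and take whichever copy finishes first. This is safe because a map task is a pure function of its input split, so duplicate executions produce identical output:
      import time
      from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

      def map_task(split, delay):
          # a deterministic map task; 'delay' stands in for a slow disk or busy machine
          time.sleep(delay)
          return sorted(set(split.split()))

      with ThreadPoolExecutor(max_workers=2) as pool:
          primary = pool.submit(map_task, "good weather is good", 2.0)   # the straggler
          backup = pool.submit(map_task, "good weather is good", 0.1)    # the backup copy
          done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
          result = done.pop().result()   # use whichever copy finished first
      print(result)   # ['good', 'is', 'weather'] either way, because the task is deterministic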
  • Slide 30
  • Optimizations Combiner functions can run on same machine as a mapper Causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth Under what conditions is it sound to use a combiner?
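  • A sketch of a word-count combiner in Python (illustrative only, not the MapReduce library's combiner interface): the combiner pre-sums counts on the mapper's machine so less data crosses the network. It is sound here because addition is associative and commutative, so the reducer can sum partial sums and still get the same answer:
      from collections import Counter

      def map_fn(doc):
          # one (word, 1) pair per word in the document
          return [(word, 1) for word in doc.split()]

      def combine(pairs):
          # mini-reduce on the mapper's machine: sum counts locally before the shuffle
          local = Counter()
          for word, count in pairs:
              local[word] += count
          return list(local.items())

      def reduce_fn(word, counts):
          # the reducer sums partial sums instead of raw 1s; the result is identical
          return (word, sum(counts))

      raw = map_fn("good weather is good")   # [('good', 1), ('weather', 1), ('is', 1), ('good', 1)]
      combined = combine(raw)                # [('good', 2), ('weather', 1), ('is', 1)]  fewer pairs to ship
      print(reduce_fn("good", [2]))          # ('good', 2)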
  • Slide 31
  • Refinement Sorting guarantees within each reduce partition Compression of intermediate data Combiner: useful for saving network bandwidth Local execution for debugging/testing User-defined counters
  • Slide 32
  • Performance Tests run on a cluster of 1800 machines: 4 GB of memory Dual-processor 2 GHz Xeons with Hyperthreading Dual 160 GB IDE disks Gigabit Ethernet per machine Bisection bandwidth approximately 100 Gbps Two benchmarks: MR_Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records) MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)
  • Slide 33
  • MR_Grep Locality optimization helps: 1800 machines read 1 TB of data at peak of ~31 GB/s Without this, rack switches would limit to 10 GB/s Startup overhead is significant for short jobs
  • Slide 34
  • MR_Sort Backup tasks reduce job completion time significantly System deals well with failures (figure panels: Normal; No backup tasks; 200 processes killed)
  • Slide 35
  • More and more MapReduce MapReduce Programs In Google Source Tree Example uses: distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation
  • Slide 36
  • Real MapReduce: Rewrite of Production Indexing System Rewrote Google's production indexing system using MapReduce Set of 10, 14, 17, 21, 24 MapReduce operations New code is simpler, easier to understand MapReduce takes care of failures, slow machines Easy to make indexing faster by adding more machines
  • Slide 37
  • MapReduce Conclusions MapReduce has proven to be a useful abstraction Greatly simplifies large-scale computations at Google Functional programming paradigm can be applied to large-scale applications Fun to use: focus on problem, let library deal w/ messy details
  • Slide 38
  • MapReduce Programs Sorting Searching Indexing Classification TF-IDF Breadth-First Search / SSSP PageRank Clustering
  • Slide 39
  • MapReduce for PageRank
  • Slide 40
  • PageRank: Random Walks Over The Web If a user starts at a random web page and surfs by clicking links and randomly entering new URLs, what is the probability that s/he will arrive at a given page? The PageRank of a page captures this notion More popular or worthwhile pages get a higher rank
  • Slide 41
  • PageRank: Visually
  • Slide 42
  • PageRank: Formula Given page A, and pages T1 through Tn linking to A, PageRank is defined as: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) C(P) is the cardinality (out-degree) of page P d is the damping (random URL) factor
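  • As a quick numeric check of the formula (with made-up values): if d = 0.85 and page A is linked to only by T1, where PR(T1) = 1.0 and C(T1) = 2, then PR(A) = (1 - 0.85) + 0.85 * (1.0 / 2) = 0.15 + 0.425 = 0.575.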
  • Slide 43
  • PageRank: Intuition Calculation is iterative: PR_(i+1) is based on PR_i Each page distributes its PR_i to all pages it links to. Linkees add up their awarded rank fragments to find their PR_(i+1) d is a tunable parameter (usually = 0.85) encapsulating the random jump factor PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  • Slide 44
  • PageRank: First Implementation Create two tables 'current' and 'next' holding the PageRank for each page. Seed 'current' with initial PR values Iterate over all pages in the graph, distributing PR from 'current' into 'next' of linkees current := next; next := fresh_table(); Go back to iteration step or end if converged
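  • A small runnable Python sketch of this two-table loop (my own illustration; the toy graph, damping factor, and convergence threshold are made up):
      # toy web graph: page -> pages it links to
      links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
      d = 0.85

      current = {page: 1.0 for page in links}              # seed 'current' with initial PR values
      for iteration in range(50):                          # upper bound on iterations
          next_pr = {page: 1 - d for page in links}        # the (1 - d) term for every page
          for page, out_links in links.items():            # distribute PR from 'current' into 'next'
              share = d * current[page] / len(out_links)
              for target in out_links:
                  next_pr[target] += share
          converged = max(abs(next_pr[p] - current[p]) for p in links) < 1e-6
          current = next_pr                                # current := next
          if converged:
              break
      print({p: round(r, 3) for p, r in sorted(current.items())})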
  • Slide 45
  • Distribution of the Algorithm Key insights allowing parallelization: The 'next' table depends on 'current', but not on any other rows of 'next' Individual rows of the adjacency matrix can be processed in parallel Sparse matrix rows are relatively small
  • Slide 46
  • Distribution of the Algorithm Consequences of insights: We can map each row of 'current' to a list of PageRank fragments to assign to linkees These fragments can be reduced into a single PageRank value for a page by summing Graph representation can be even more compact; since each element is simply 0 or 1, only transmit column numbers where it's 1
  • Slide 47
  • Slide 48
  • Phase 1: Parse HTML Map task takes (URL, page content) pairs and maps them to (URL, (PR_init, list-of-urls)) PR_init is the seed PageRank for URL list-of-urls contains all pages pointed to by URL Reduce task is just the identity function
  • Slide 49
  • Phase 2: PageRank Distribution Map task takes (URL, (cur_rank, url_list)) For each u in url_list, emit (u, cur_rank/|url_list|) Emit (URL, url_list) to carry the points-to list along through iterations PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  • Slide 50
  • Phase 2: PageRank Distribution Reduce task gets (URL, url_list) and many (URL, val) values Sum vals and fix up with d Emit (URL, (new_rank, url_list)) PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
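  • A compact Python imitation of this Phase 2 map and reduce (the function names, record layout, and "LINKS" tag are my own, not the exact key/value format in the slides):
      from collections import defaultdict

      d = 0.85

      def phase2_map(url, value):
          cur_rank, url_list = value
          for u in url_list:
              yield (u, cur_rank / len(url_list))   # a rank fragment for each linkee
          yield (url, ("LINKS", url_list))          # carry the points-to list along

      def phase2_reduce(url, values):
          url_list, total = [], 0.0
          for v in values:
              if isinstance(v, tuple) and v[0] == "LINKS":
                  url_list = v[1]                   # recover the structure record
              else:
                  total += v                        # sum the incoming rank fragments
          return (url, ((1 - d) + d * total, url_list))

      graph = {"A": (1.0, ["B", "C"]), "B": (1.0, ["C"]), "C": (1.0, ["A"])}
      grouped = defaultdict(list)
      for url, value in graph.items():              # map phase plus shuffle
          for k, v in phase2_map(url, value):
              grouped[k].append(v)
      print(dict(phase2_reduce(u, vs) for u, vs in grouped.items()))
      # one iteration gives approximately B: 0.575, C: 1.425, A: 1.0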
  • Slide 51
  • Finishing up... A non-parallelizable component determines whether convergence has been achieved (Fixed number of iterations? Comparison of key values?) If so, write out the PageRank lists - done! Otherwise, feed output of Phase 2 into another Phase 2 iteration
  • Slide 52
  • PageRank Conclusions MapReduce isn't the greatest at iterated computation, but still helps run the heavy lifting Key element in parallelization is independent PageRank computations in a given step Parallelization requires thinking about minimum data partitions to transmit (e.g., compact representations of graph rows) Even the implementation shown today doesn't actually scale to the whole Internet; but it works for intermediate-sized graphs So, do you think that MapReduce is suitable for PageRank? (homework, give concrete reason for why and why not.)
  • Slide 53
  • Dryad: Dryad Design, Implementation, Policies as Plug-ins, Building on Dryad
  • Slide 54
  • Design Space (figure: a 2-D chart with axes Throughput vs. Latency and Internet vs. Private data center; regions labeled Data-parallel and Shared memory)
  • Slide 55
  • Data Partitioning (figure labels: RAM, DATA)
  • Slide 56
  • 2-D Piping Unix pipes are 1-D: grep | sed | sort | awk | perl Dryad is 2-D: grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50 (the exponents denote how many parallel instances of each stage run)
  • Slide 57
  • Dryad = Execution Layer (figure analogy: Job ~ Application/Pipeline, Dryad ~ Shell, Cluster ~ Machine)
  • Slide 58
  • Dryad Design, Implementation, Policies as Plug-ins, Building on Dryad
  • Slide 59
  • Virtualized 2-D Pipelines (figure: animation frame)
  • Slide 60
  • Virtualized 2-D Pipelines (figure: animation frame)
  • Slide 61
  • Virtualized 2-D Pipelines (figure: animation frame)
  • Slide 62
  • Virtualized 2-D Pipelines (figure: animation frame)
  • Slide 63
  • Virtualized 2-D Pipelines (figure: the 2-D DAG is virtualized across multiple machines)
  • Slide 64
  • Dryad Job Structure (figure: the pipeline grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50 drawn as a DAG; labels: Input files, Vertices (processes), Channels, Output files, Stage)
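  • A toy Python representation of such a job graph (purely illustrative data structures, not Dryad's actual C++ API; real Dryad jobs would not wire adjacent stages as a full bipartite graph):
      from dataclasses import dataclass, field

      @dataclass
      class Vertex:
          program: str   # e.g. "grep", "sed"
          index: int     # which replica within its stage

      @dataclass
      class Stage:
          program: str
          replicas: int
          vertices: list = field(default_factory=list)

      def build_job(pipeline):
          # pipeline like [("grep", 1000), ("sed", 500), ...]; here channels connect every
          # vertex of one stage to every vertex of the next (real jobs use cheaper patterns)
          stages, channels = [], []
          for program, replicas in pipeline:
              stages.append(Stage(program, replicas,
                                  [Vertex(program, i) for i in range(replicas)]))
          for a, b in zip(stages, stages[1:]):
              channels += [(u, v) for u in a.vertices for v in b.vertices]
          return stages, channels

      stages, channels = build_job([("grep", 3), ("sed", 2), ("perl", 1)])
      print([f"{s.program}^{s.replicas}" for s in stages], len(channels), "channels")
      # ['grep^3', 'sed^2', 'perl^1'] 8 channels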
  • Slide 65
  • Channels: finite streams of items. Channel implementations: distributed filesystem files (persistent), SMB/NTFS files (temporary), TCP pipes (inter-machine), memory FIFOs (intra-machine)
  • Slide 66
  • Architecture (figure: a job manager and the cluster machines running vertices; the data plane moves data through files, TCP, and FIFOs, while the control plane carries the job schedule between the job manager and the cluster)
  • Slide 67
  • Staging: 1. Build 2. Send .exe 3. Start JM 4. Query cluster resources 5. Generate graph 6. Initialize vertices 7. Serialize vertices 8. Monitor vertex execution (figure labels: JM code, vertex code, Cluster services)
  • Slide 68
  • Fault Tolerance
  • Slide 69
  • Dryad Design, Implementation, Policies and Resource Management, Building on Dryad
  • Slide 70
  • Policy Managers (figure: Stage R and Stage X with their vertices; the Job Manager hosts an R manager, an X manager, and an R-X connection manager)
  • Slide 71
  • Duplicate Execution Manager (figure: completed vertices X[0], X[1], X[3]; a slow vertex X[2] gets a duplicate vertex) Duplication Policy = f(running times, data volumes)
  • Slide 72
  • Aggregation Manager (figure: a static aggregation tree vs. a dynamic one built from rack numbers, with S vertices feeding aggregators A and a final vertex T)
  • Slide 73
  • Data Distribution (Group By) (figure: m source vertices connected to n destination vertices, giving m x n channels)
  • Slide 74
  • Range-Distribution Manager (figure: static vs. dynamic range partitioning; a histogram of the data splits the key range [0-100) into [0-30) and [30-100) before the data reach the destination vertices)
  • Slide 75
  • Goal: Declarative Programming (figure: a static S-X-T graph vs. its dynamically expanded version)
  • Slide 76
  • Dryad Design, Implementation, Policies as Plug-ins, Building on Dryad
  • Slide 77
  • Software Stack (figure: layers from Windows Server, cluster services, CIFS/NTFS, a distributed filesystem, and job queueing/monitoring, through Dryad, up to DryadLINQ, Distributed Shell, PSQL, SSIS, SQL Server, Perl, legacy code such as sed/awk/grep, and C#/C++ applications like machine learning)
  • Slide 78
  • SkyServer Query 18: select distinct P.ObjID into results from photoPrimary U, neighbors N, photoPrimary L where U.ObjID = N.ObjID and L.ObjID = N.NeighborObjID and P.ObjID < L.ObjID and abs((U.u-U.g)-(L.u-L.g)) < ...
  • Slide 95
  • Expectation Maximization (Gaussians): 160 lines of code; figure shows 3 iterations
  • Slide 96
  • Conclusions Dryad = distributed execution environment Application-independent (semantics oblivious) Supports a rich software ecosystem: relational algebra, MapReduce, LINQ, etc. DryadLINQ = a Dryad provider for LINQ This is only the beginning!
  • Slide 97
  • Some other systems you should know about for big-data processing: Hadoop (HDFS and MapReduce, the open-source versions of GFS and MapReduce); Hive/Pig/Sawzall (query-language processing); Spark/Shark (efficient use of cluster memory, supporting iterative MapReduce programs)
  • Slide 98
  • Thank you! Any Questions?
  • Slide 99
  • Pregel as backup slides
  • Slide 100
  • Pregel Introduction Computation Model Writing a Pregel Program System Implementation Experiments Conclusion
  • Slide 101
  • Introduction (1/2) Source: SIGMETRICS 09 Tutorial MapReduce: The Programming Model and Practice, by Jerry Zhao
  • Slide 102
  • Introduction (2/2) Many practical computing problems concern large graphs MapReduce is ill-suited for graph processing Many iterations are needed for parallel graph processing Materializations of intermediate results at every MapReduce iteration harm performance Large graph data Web graph Transportation routes Citation relationships Social networks Graph algorithms PageRank Shortest path Connected components Clustering techniques
  • Slide 103
  • Single Source Shortest Path (SSSP) Problem: find the shortest path from a source node to all target nodes Solution on a single-processor machine: Dijkstra's algorithm
  • Slide 104
  • Example: SSSP Dijkstra's Algorithm (figure: the five-node example graph with the source at distance 0 and all other tentative distances at infinity)
  • Slide 105
  • Example: SSSP Dijkstra's Algorithm (animation frame: the nearest unsettled node is settled and its neighbors' tentative distances are updated)
  • Slide 106
  • Example: SSSP Dijkstra's Algorithm (animation frame)
  • Slide 107
  • Example: SSSP Dijkstra's Algorithm (animation frame)
  • Slide 108
  • Example: SSSP Dijkstra's Algorithm (animation frame)
  • Slide 109
  • Example: SSSP Dijkstra's Algorithm (final frame: distances A = 0, B = 8, C = 9, D = 5, E = 7)
  • Slide 110
  • Single Source Shortest Path (SSSP) Problem: find the shortest path from a source node to all target nodes Solution on a single-processor machine: Dijkstra's algorithm; in MapReduce/Pregel: parallel breadth-first search (BFS)
  • Slide 111
  • MapReduce Execution Overview
  • Slide 112
  • Example: SSSP Parallel BFS in MapReduce. The graph is represented as an adjacency list (the equivalent adjacency matrix is also shown in the figure): A: (B, 10), (D, 5); B: (C, 1), (D, 2); C: (E, 4); D: (B, 3), (C, 9), (E, 2); E: (A, 7), (C, 6)
  • Slide 113
  • Example: SSSP Parallel BFS in MapReduce. Map input: one record per node of the form <node, <current distance, adjacency list>>, with A at distance 0 and every other node at infinity. Map output: for each outgoing edge, a pair <neighbor, current distance + edge weight>, plus the node's own record to carry the graph structure along. The map output is flushed to local disk!!
  • Slide 114
  • Example: SSSP Parallel BFS in MapReduce. Reduce input: the map output grouped by node, so each node sees its adjacency record together with the candidate distances emitted by its in-neighbors.
  • Slide 115
  • Example: SSSP Parallel BFS in MapReduce. Reduce input (continued): each reducer keeps the minimum of the candidate distances for its node.
  • Slide 116
  • Example: SSSP Parallel BFS in MapReduce. Reduce output: <node, <new distance, adjacency list>>, which becomes the map input for the next iteration; after the first iteration A = 0, B = 10, D = 5, and C and E are still at infinity. Reduce output is flushed to DFS!! Map output is flushed to local disk!!
  • Slide 117
  • Example: SSSP Parallel BFS in MapReduce. Reduce input for the second iteration: again the map output grouped by node.
  • Slide 118
  • Example: SSSP Parallel BFS in MapReduce. Reduce input (continued).
  • Slide 119
  • Example: SSSP Parallel BFS in MapReduce. Reduce output after the second iteration: A = 0, B = 8, C = 11, D = 5, E = 7 (the remaining iterations are omitted). Flushed to DFS!!
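  • One relaxation round of this parallel BFS, sketched as runnable Python (an in-memory imitation of the rounds above; the tagged record layout is my own):
      from collections import defaultdict

      INF = float("inf")
      adj = {"A": [("B", 10), ("D", 5)], "B": [("C", 1), ("D", 2)], "C": [("E", 4)],
             "D": [("B", 3), ("C", 9), ("E", 2)], "E": [("A", 7), ("C", 6)]}

      def bfs_map(node, record):
          dist, edges = record
          yield (node, ("GRAPH", edges))                 # carry the adjacency list along
          yield (node, ("DIST", dist))                   # keep the node's current distance
          for neighbor, weight in edges:
              yield (neighbor, ("DIST", dist + weight))  # candidate distance via this node

      def bfs_reduce(node, values):
          edges, best = [], INF
          for tag, payload in values:
              if tag == "GRAPH":
                  edges = payload
              else:
                  best = min(best, payload)              # take the shortest candidate
          return (node, (best, edges))

      state = {n: (0 if n == "A" else INF, e) for n, e in adj.items()}
      for _ in range(len(adj) - 1):                      # enough iterations for this graph
          grouped = defaultdict(list)
          for node, record in state.items():             # map phase
              for k, v in bfs_map(node, record):
                  grouped[k].append(v)
          state = dict(bfs_reduce(n, vs) for n, vs in grouped.items())   # reduce phase
      print({n: d for n, (d, _) in sorted(state.items())})
      # {'A': 0, 'B': 8, 'C': 9, 'D': 5, 'E': 7}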
  • Slide 120
  • Computation Model (1/3) Input Output Supersteps (a sequence of iterations)
  • Slide 121
  • Think like a vertex Inspired by Valiant's Bulk Synchronous Parallel model (1990) Computation Model (2/3) Source: http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
  • Slide 122
  • Computation Model (3/3) Superstep: the vertices compute in parallel Each vertex Receives messages sent in the previous superstep Executes the same user-defined function Modifies its value or that of its outgoing edges Sends messages to other vertices (to be received in the next superstep) Mutates the topology of the graph Votes to halt if it has no further work to do Termination condition All vertices are simultaneously inactive There are no messages in transit
  • Slide 123
  • Example: SSSP Parallel BFS in Pregel (figure: initial state, source A at distance 0, all other vertices at infinity)
  • Slide 124
  • Example: SSSP Parallel BFS in Pregel (superstep: A sends candidate distances to its neighbors)
  • Slide 125
  • Example: SSSP Parallel BFS in Pregel (after the first superstep: B = 10, D = 5)
  • Slide 126
  • Example: SSSP Parallel BFS in Pregel (superstep: B and D send updated candidate distances to their neighbors)
  • Slide 127
  • Example: SSSP Parallel BFS in Pregel (after the second superstep: B = 8, C = 11, E = 7)
  • Slide 128
  • Example: SSSP Parallel BFS in Pregel (superstep: the updated vertices send candidate distances again)
  • Slide 129
  • Example: SSSP Parallel BFS in Pregel (after the third superstep: C = 9)
  • Slide 130
  • Example: SSSP Parallel BFS in Pregel (superstep: C sends its updated distance)
  • Slide 131
  • Example: SSSP Parallel BFS in Pregel (final state: A = 0, B = 8, C = 9, D = 5, E = 7; no further messages, all vertices vote to halt)
  • Slide 132
  • Differences from MapReduce Graph algorithms can be written as a series of chained MapReduce invocations Pregel Keeps vertices & edges on the machine that performs computation Uses network transfers only for messages MapReduce Passes the entire state of the graph from one stage to the next Needs to coordinate the steps of a chained MapReduce
  • Slide 133
  • C++ API Writing a Pregel program means subclassing the predefined Vertex class and overriding its Compute() method (figure callouts: incoming messages, outgoing messages)
  • Slide 134
  • Example: Vertex Class for SSSP
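  • Since the code on this slide is an image, here is a hedged stand-in: a toy Pregel-style superstep loop in Python with an SSSP vertex program. The compute logic follows the ShortestPathVertex described in the Pregel paper, but the surrounding framework is my own sketch, not Google's C++ API:
      from collections import defaultdict

      INF = float("inf")

      class ShortestPathVertex:
          def __init__(self, name, edges, is_source=False):
              self.name, self.edges = name, edges        # edges: list of (target, weight)
              self.is_source = is_source
              self.value = INF                           # best known distance so far
              self.active = True

          def compute(self, messages, send):
              # one superstep: take the minimum incoming candidate distance and,
              # if it improves our value, tell the neighbors; then vote to halt
              mindist = 0 if self.is_source else INF
              for m in messages:
                  mindist = min(mindist, m)
              if mindist < self.value:
                  self.value = mindist
                  for target, weight in self.edges:
                      send(target, mindist + weight)
              self.active = False                        # vote to halt; a message reactivates us

      def run(vertices):
          inbox = defaultdict(list)
          while any(v.active for v in vertices.values()) or inbox:
              outbox = defaultdict(list)
              for name, v in vertices.items():
                  msgs = inbox.pop(name, [])
                  if msgs or v.active:
                      v.active = True
                      v.compute(msgs, lambda t, m: outbox[t].append(m))
              inbox = outbox                             # barrier: messages arrive next superstep
          return {name: v.value for name, v in vertices.items()}

      adj = {"A": [("B", 10), ("D", 5)], "B": [("C", 1), ("D", 2)], "C": [("E", 4)],
             "D": [("B", 3), ("C", 9), ("E", 2)], "E": [("A", 7), ("C", 6)]}
      vertices = {n: ShortestPathVertex(n, e, is_source=(n == "A")) for n, e in adj.items()}
      print(run(vertices))   # {'A': 0, 'B': 8, 'C': 9, 'D': 5, 'E': 7}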
  • Slide 135
  • System Architecture The Pregel system also uses the master/worker model Master: maintains the workers, recovers worker faults, provides a Web-UI monitoring tool for job progress Worker: processes its task, communicates with the other workers Persistent data is stored as files on a distributed storage system (such as GFS or BigTable) Temporary data is stored on local disk
  • Slide 136
  • Execution of a Pregel Program 1.Many copies of the program begin executing on a cluster of machines 2.The master assigns a partition of the input to each worker Each worker loads the vertices and marks them as active 3.The master instructs each worker to perform a superstep Each worker loops through its active vertices & computes for each vertex Messages are sent asynchronously, but are delivered before the end of the superstep This step is repeated as long as any vertices are active, or any messages are in transit 4.After the computation halts, the master may instruct each worker to save its portion of the graph
  • Slide 137
  • Fault Tolerance Checkpointing The master periodically instructs the workers to save the state of their partitions to persistent storage e.g., Vertex values, edge values, incoming messages Failure detection Using regular ping messages Recovery The master reassigns graph partitions to the currently available workers The workers all reload their partition state from most recent available checkpoint
  • Slide 138
  • Experiments Environment H/W: a cluster of 300 multicore commodity PCs Data: binary trees, log-normal random graphs (general graphs) Naïve SSSP implementation: the weight of all edges = 1, no checkpointing
  • Slide 139
  • Experiments SSSP 1 billion vertex binary tree: varying # of worker tasks
  • Slide 140
  • Experiments SSSP binary trees: varying graph sizes on 800 worker tasks
  • Slide 141
  • Experiments SSSP Random graphs: varying graph sizes on 800 worker tasks
  • Slide 142
  • Conclusion & Future Work Pregel is a scalable and fault-tolerant platform with an API that is sufficiently flexible to express arbitrary graph algorithms Future work: Relaxing the synchronicity of the model, so as not to wait for slower workers at inter-superstep barriers Assigning vertices to machines to minimize inter-machine communication Handling dense graphs in which most vertices send messages to most other vertices