Distributed Cluster Computing Platforms


  • Slide 1
  • Distributed Cluster Computing Platforms
  • Slide 2
  • Outline What is the purpose of Data Intensive Super Computing? MapReduce Pregel Dryad Spark/Shark Distributed Graph Computing
  • Slide 3
  • Why DISC? DISC stands for Data Intensive Super Computing. Many applications produce such data: scientific instruments, web search engines, social networks, economics, GIS. New data are continuously generated, and people want to understand them. Big Data analysis is now considered a very important method for scientific research.
  • Slide 4
  • What features must a platform have to handle DISC? Application specific: it is very difficult, or even impossible, to build one system that fits all workloads (a POSIX-compatible file system is one example); each system should be re-configured or even re-designed for a specific application. Think of the motivation for building the Google File System for the Google search engine. Programmer-friendly interfaces: the application programmer should not have to manage the infrastructure, such as machines and networks. Fault tolerant: the platform should handle failed components automatically, without any special treatment by the application. Scalability: the platform should run on top of at least thousands of machines and harness the power of all of them; load balancing should be achieved by the platform, not by the application itself. Keep these four features in mind during the introduction of the concrete platforms below.
  • Slide 5
  • Google MapReduce Programming Model Implementation Refinements Evaluation Conclusion
  • Slide 6
  • Motivation: large scale data processing Process lots of data to produce other derived data Input: crawled documents, web request logs etc. Output: inverted indices, web page graph structure, top queries in a day etc. Want to use hundreds or thousands of CPUs but want to only focus on the functionality MapReduce hides messy details in a library: Parallelization Data distribution Fault-tolerance Load balancing
  • Slide 7
  • Motivation: Large Scale Data Processing Want to process lots of data (> 1 TB) Want to parallelize across hundreds/thousands of CPUs Want to make this easy "Google Earth uses 70.5 TB: 70 TB for the raw imagery and 500 GB for the index data." From: http://googlesystem.blogspot.com/2006/09/how-much-data-does-google-store.html
  • Slide 8
  • MapReduce Automatic parallelization & distribution Fault-tolerant Provides status and monitoring tools Clean abstraction for programmers
  • Slide 9
  • Programming Model Borrows from functional programming Users implement interface of two functions: map (in_key, in_value) -> (out_key, intermediate_value) list reduce (out_key, intermediate_value list) -> out_value list
  • Slide 10
  • map Records from the data source (lines out of files, rows of a database, etc) are fed into the map function as key*value pairs: e.g., (filename, line). map() produces one or more intermediate values along with an output key from the input.
  • Slide 11
  • reduce After the map phase is over, all the intermediate values for a given output key are combined into a list. reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually only one final value per key).
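The map/group/reduce flow described on these slides can be sketched as a toy single-process driver. This is an illustration only, not the Google library; `run_mapreduce` and the demo record set are hypothetical names chosen here.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process MapReduce: map, group by key, reduce."""
    # Map phase: each (in_key, in_value) record yields intermediate pairs.
    intermediate = defaultdict(list)
    for in_key, in_value in records:
        for out_key, value in map_fn(in_key, in_value):
            intermediate[out_key].append(value)
    # Reduce phase: combine the full value list for each output key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

# Tiny demo: total line length per file name.
records = [("a.txt", "hello"), ("a.txt", "hi"), ("b.txt", "hey")]
result = run_mapreduce(
    records,
    map_fn=lambda k, v: [(k, len(v))],
    reduce_fn=lambda k, vs: sum(vs),
)
# result == {"a.txt": 7, "b.txt": 3}
```

The real system distributes the map and reduce calls across machines, but the data flow is the same.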
  • Slide 12
  • Architecture
  • Slide 13
  • Parallelism map() functions run in parallel, creating different intermediate values from different input data sets reduce() functions also run in parallel, each working on a different output key All values are processed independently Bottleneck: the reduce phase can't start until the map phase is completely finished.
  • Slide 14
  • Example: Count word occurrences map(String input_key, String input_value): // input_key: document name // input_value: document contents for each word w in input_value: EmitIntermediate(w, "1"); reduce(String output_key, Iterator intermediate_values): // output_key: a word // output_values: a list of counts int result = 0; for each v in intermediate_values: result += ParseInt(v); Emit(AsString(result));
  • Slide 15
  • Example vs. Actual Source Code Example is written in pseudo-code Actual implementation is in C++, using a MapReduce library Bindings for Python and Java exist via interfaces True code is somewhat more involved (defines how the input key/values are divided up and accessed, etc.)
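The word-count pseudo-code above translates almost directly into runnable Python; the grouping step that the MapReduce library performs between the two phases is simulated here with a dictionary (the document names are made up for the demo).

```python
from collections import defaultdict

def map_word_count(input_key, input_value):
    # input_key: document name; input_value: document contents
    for word in input_value.split():
        yield (word, "1")

def reduce_word_count(output_key, intermediate_values):
    # output_key: a word; intermediate_values: a list of counts
    result = 0
    for v in intermediate_values:
        result += int(v)
    yield str(result)

# Simulate the shuffle the library performs between map and reduce.
documents = {"page1": "the weather is good", "page2": "today is good"}
grouped = defaultdict(list)
for name, contents in documents.items():
    for word, count in map_word_count(name, contents):
        grouped[word].append(count)
counts = {w: next(reduce_word_count(w, vs)) for w, vs in grouped.items()}
# counts == {"the": "1", "weather": "1", "is": "2", "good": "2", "today": "1"}
```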
  • Slide 16
  • Example Page 1: the weather is good Page 2: today is good Page 3: good weather is good.
  • Slide 17
  • Map output Worker 1: (the 1), (weather 1), (is 1), (good 1). Worker 2: (today 1), (is 1), (good 1). Worker 3: (good 1), (weather 1), (is 1), (good 1).
  • Slide 18
  • Reduce Input Worker 1: (the 1) Worker 2: (is 1), (is 1), (is 1) Worker 3: (weather 1), (weather 1) Worker 4: (today 1) Worker 5: (good 1), (good 1), (good 1), (good 1)
  • Slide 19
  • Reduce Output Worker 1: (the 1) Worker 2: (is 3) Worker 3: (weather 2) Worker 4: (today 1) Worker 5: (good 4)
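Which reduce worker a key lands on is decided by a partition function, typically `hash(key) mod R`. A sketch, using a stable hash because Python's built-in string hash is randomized per process; the resulting worker numbers are arbitrary and will not match the slide's assignment.

```python
import hashlib

R = 5  # number of reduce workers, as in the example above

def partition(key: str) -> int:
    """Assign an intermediate key to one of R reduce partitions."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % R

pairs = [("the", 1), ("is", 3), ("weather", 2), ("today", 1), ("good", 4)]
by_worker = {}
for key, count in pairs:
    by_worker.setdefault(partition(key), []).append((key, count))
# All pairs sharing a key hash to the same partition, so each reduce
# worker sees the complete value list for every key it owns.
```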
  • Slide 20
  • Some Other Real Examples Term frequencies through the whole Web repository Count of URL access frequency Reverse web-link graph
  • Slide 21
  • Implementation Overview Typical cluster: 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory Limited bisection bandwidth Storage is on local IDE disks GFS: distributed file system manages data (SOSP'03) Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines Implementation is a C++ library linked into user programs
  • Slide 22
  • Architecture
  • Slide 23
  • Execution
  • Slide 24
  • Parallel Execution
  • Slide 25
  • Task Granularity And Pipelining Fine granularity tasks: many more map tasks than machines Minimizes time for fault recovery Can pipeline shuffling with map execution Better dynamic load balancing Often use 200,000 map/5000 reduce tasks w/ 2000 machines
  • Slide 26
  • Locality Master program divvies up tasks based on location of data: (Asks GFS for locations of replicas of input file blocks) tries to have map() tasks on same machine as physical file data, or at least same rack map() task inputs are divided into 64 MB blocks: same size as Google File System chunks Without this, rack switches limit read rate Effect: Thousands of machines read input at local disk speed
  • Slide 27
  • Fault Tolerance Master detects worker failures Re-executes completed & in-progress map() tasks Re-executes in-progress reduce() tasks Master notices particular input key/values cause crashes in map(), and skips those values on re-execution. Effect: can work around bugs in third-party libraries!
  • Slide 28
  • Fault Tolerance On worker failure: Detect failure via periodic heartbeats Re-execute completed and in-progress map tasks Re-execute in progress reduce tasks Task completion committed through master Master failure: Could handle, but don't yet (master failure unlikely) Robust: lost 1600 of 1800 machines once, but finished fine
  • Slide 29
  • Optimizations No reduce can start until map is complete: a single slow disk controller can rate-limit the whole process. The master redundantly executes slow-moving map tasks and uses the result of whichever copy finishes first. Why is it safe to redundantly execute map tasks? Wouldn't this mess up the total computation? Slow workers significantly lengthen completion time: other jobs consuming resources on the machine, bad disks with soft errors that transfer data very slowly, and weird things such as processor caches being disabled (!!)
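Redundant execution is safe because map tasks are pure functions of their input: a backup copy computes the same answer, so taking whichever finishes first cannot change the output. A minimal sketch, with threads standing in for worker machines and a made-up map task:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def map_task(shard: str) -> int:
    """A deterministic, side-effect-free map task: safe to duplicate."""
    return sum(len(word) for word in shard.split())

shard = "the weather is good"
with ThreadPoolExecutor(max_workers=2) as pool:
    # Launch the original task plus a backup copy; keep the first result.
    futures = {pool.submit(map_task, shard) for _ in range(2)}
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    result = done.pop().result()
# Both copies compute the same value, so using the winner is safe.
```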
  • Slide 30
  • Optimizations Combiner functions can run on same machine as a mapper Causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth Under what conditions is it sound to use a combiner?
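A combiner is sound when the reduce function is associative and commutative, as addition is for word count. A sketch of the bandwidth saving, with hypothetical helper names:

```python
from collections import Counter

def map_with_combiner(document: str):
    """Word-count map output, before and after local combining."""
    # Without a combiner: one ("word", 1) pair per occurrence.
    raw_pairs = [(word, 1) for word in document.split()]
    # Combiner: pre-sum counts on the mapper's machine before the
    # pairs are shuffled over the network to the reducers.
    combined = list(Counter(word for word, _ in raw_pairs).items())
    return raw_pairs, combined

raw, combined = map_with_combiner("good weather is good good")
# raw has 5 pairs; combined has only 3 -- less data to shuffle,
# and the final per-key sums are unchanged because + is associative.
```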
  • Slide 31
  • Refinement Sorting guarantees within each reduce partition Compression of intermediate data Combiner: useful for saving network bandwidth Local execution for debugging/testing User-defined counters
  • Slide 32
  • Performance Tests run on a cluster of 1800 machines: 4 GB of memory Dual-processor 2 GHz Xeons with Hyperthreading Dual 160 GB IDE disks Gigabit Ethernet per machine Bisection bandwidth approximately 100 Gbps Two benchmarks: MR_Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records) MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)
  • Slide 33
  • MR_Grep Locality optimization helps: 1800 machines read 1 TB of data at peak of ~31 GB/s Without this, rack switches would limit to 10 GB/s Startup overhead is significant for short jobs
  • Slide 34
  • MR_Sort Backup tasks reduce job completion time significantly System deals well with failures (Figure: completion times for normal execution, execution with no backup tasks, and execution with 200 processes killed.)
  • Slide 35
  • More and more MapReduce MapReduce Programs In Google Source Tree Example uses: distributed grep distributed sort web link-graph reversal term-vector per host web access log stats inverted index construction document clustering machine learning statistical machine translation
  • Slide 36
  • Real MapReduce : Rewrite of Productio