Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters H.Yang, A. Dasdan (Yahoo!), R. Hsiao, D.S.Parker (UCLA) Shimin Chen Big Data Reading Group Presentation

Page 1:

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

H. Yang, A. Dasdan (Yahoo!), R. Hsiao, D. S. Parker (UCLA)

Shimin Chen

Big Data Reading Group Presentation

Page 2:

Motivation

Map-Reduce framework
  Compared to relational DBMS, “simplified” for data processing in search engines

Problem: join multiple heterogeneous datasets
  Does not quite fit into map-reduce
  Ad-hoc solutions: map-reduce on one dataset while reading data from the other dataset on the fly

Page 3:

Contribution

Goal: support relational algebra primitives without sacrificing existing generality and simplicity

Proposal: map-reduce-merge

Page 4:

Outline

Introduction
Map-Reduce
Map-Reduce-Merge
Applications to Relational Data Processing
Optimizations and Enhancements
Case Studies
Conclusion

Page 5:

Let’s Refresh Our Memory

Functional programming model
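The slide's figure is not preserved in this transcript. As a reminder of the functional model, here is a minimal single-process sketch in Python: a mapper turns each (k1, v1) record into (k2, v2) pairs, the framework groups values by k2, and a reducer folds each group. The names `map_reduce`, `word_count_mapper`, and `sum_reducer` and the sample documents are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run mapper over (k1, v1) records, group by k2, then reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    # Visiting keys in sorted order mimics the sorted input a reducer sees.
    return {k2: reducer(k2, values) for k2, values in sorted(groups.items())}

def word_count_mapper(doc_id, text):
    # Emit (word, 1) for every word in the document.
    for word in text.split():
        yield word, 1

def sum_reducer(word, counts):
    # Fold all counts for one word into a single total.
    return sum(counts)

docs = [("d1", "map reduce merge"), ("d2", "map reduce")]
counts = map_reduce(docs, word_count_mapper, sum_reducer)
print(counts)
```

A real deployment distributes the map and reduce calls across machines; the grouping step here stands in for the shuffle.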

Page 6:

Comments

Low-cost unreliable commodity hardware
  Failures often occur during map/reduce tasks
  Coordinator re-runs the failed mapper or reducer

Homogenization: for equi-join
  Transform each dataset into (join key, payload)
  Then apply map-reduce to merge entries from different datasets
  Problem: supports only equi-joins; may take lots of extra disk space and incur excessive communication
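The homogenization approach can be sketched as follows, assuming a single-process stand-in for the cluster. Each dataset is rewritten as (join key, (source tag, payload)), the tagged records are grouped by key, and each group is expanded into joined rows. The helper names, the "L"/"R" tags, and the sample employees and departments are hypothetical.

```python
from collections import defaultdict

def homogenize(dataset, tag, key_of):
    """Rewrite every record as (join key, (source tag, payload))."""
    return [(key_of(rec), (tag, rec)) for rec in dataset]

def equi_join(left, right, left_key, right_key):
    # Both inputs now share one schema, so a single map-reduce pass can group them.
    tagged = homogenize(left, "L", left_key) + homogenize(right, "R", right_key)
    groups = defaultdict(lambda: {"L": [], "R": []})
    for key, (tag, rec) in tagged:
        groups[key][tag].append(rec)
    joined = []
    for key in sorted(groups):
        # The "reducer": pair every left record with every right record of the key.
        for l_rec in groups[key]["L"]:
            for r_rec in groups[key]["R"]:
                joined.append((key, l_rec, r_rec))
    return joined

employees = [("e1", "d1"), ("e2", "d2"), ("e3", "d1")]   # (emp-id, dept-id)
departments = [("d1", "Search"), ("d2", "Ads")]          # (dept-id, name)
rows = equi_join(employees, departments, lambda e: e[1], lambda d: d[0])
print(rows)
```

The tagged copy of both datasets is the extra disk space and traffic the slide refers to, and grouping by key is why only equi-joins fit this pattern.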

Page 7:

Outline

Introduction
Map-Reduce
Map-Reduce-Merge
Applications to Relational Data Processing
Optimizations and Enhancements
Case Studies
Conclusion

Page 8:

Map-Reduce-Merge Primitives

[Figure not preserved; surviving labels: key, join]
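For reference, the three primitives are usually written as type signatures. The block below is a reconstruction from the Map-Reduce-Merge paper rather than from this slide, so treat the exact subscripts as an assumption to check against the paper. Here α, β, and γ denote dataset lineages: map and reduce stay within one lineage, while merge combines two reduced lineages into a third.

```latex
\begin{align*}
\text{map}    &: (k_1, v_1)_{\alpha} \rightarrow [(k_2, v_2)]_{\alpha} \\
\text{reduce} &: (k_2, [v_2])_{\alpha} \rightarrow (k_2, [v_3])_{\alpha} \\
\text{merge}  &: \big((k_2, [v_3])_{\alpha},\ (k_3, [v_4])_{\beta}\big) \rightarrow [(k_4, v_5)]_{\gamma}
\end{align*}
```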

Page 9:

Focusing on Merge

Two sets of inputs generated by multiple reducers:
  Which α reducers and β reducers match?
  How to get the next key-value pair?
  Customized preprocessing for inputs?
  Merging algorithm?

All of these are customizable

Page 10:

Focusing on Merge

Two sets of inputs generated by multiple reducers:
  Partition Selector: Which α reducers and β reducers match?
  Iterator: How to get the next key-value pair?
  Processor: Customized preprocessing for inputs?
  Merger: Merging algorithm?

All of these are customizable
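One way to picture how the four customizable pieces fit together is a small driver that calls them in order. This is a hypothetical single-process sketch, not the framework's API: `run_merge` and the four toy components are invented names, and a single `process` function is applied to both sides for brevity.

```python
def run_merge(lhs_parts, rhs_parts, select, process, iterate, merge):
    """Hypothetical driver wiring the four user-supplied components together."""
    output = []
    for i, left in enumerate(lhs_parts):
        for j, right in enumerate(rhs_parts):
            if not select(i, j):                                # partition selector
                continue
            left_in, right_in = process(left), process(right)  # processor
            for l_pair, r_pair in iterate(left_in, right_in):   # iterator
                output.extend(merge(l_pair, r_pair))             # merger
    return output

# Simplest possible choices, for illustration only.
def same_index(i, j):
    return i == j

def identity(part):
    return part

def all_pairs(left, right):
    for l_pair in left:
        for r_pair in right:
            yield l_pair, r_pair

def join_on_key(l_pair, r_pair):
    # Emit one joined pair when the keys agree, nothing otherwise.
    return [(l_pair[0], (l_pair[1], r_pair[1]))] if l_pair[0] == r_pair[0] else []

lhs = [[("d1", "e1")], [("d2", "e2")]]          # reducer outputs of the first lineage
rhs = [[("d1", "Search")], [("d2", "Ads")]]     # reducer outputs of the second lineage
result = run_merge(lhs, rhs, same_index, identity, all_pairs, join_on_key)
print(result)
```

Swapping any one of the four functions changes the join strategy without touching the driver.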

Page 11:

Example: Emp & Dept

[Figure not preserved: Employee and Department datasets]

Page 12:

Partition Selector

LHS: reduce key: (dept-id, emp-id); partition key: dept-id
RHS: reduce key: dept-id; partition key: dept-id
Assuming the number of reducers is the same on both sides, LHS reducer K matches RHS reducer K
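A minimal sketch of why reducer K lines up with reducer K: if both sides partition on dept-id with the same deterministic hash and the same reducer count, any given dept-id lands at the same index on both sides. The CRC32 partitioner and `NUM_REDUCERS = 4` are assumptions chosen for illustration.

```python
import zlib

NUM_REDUCERS = 4  # assumed to be equal on both sides

def partition(dept_id, num_reducers=NUM_REDUCERS):
    """Deterministic hash partitioner on the partition key (dept-id)."""
    return zlib.crc32(dept_id.encode("utf-8")) % num_reducers

def lhs_reducer(emp_reduce_key):
    # LHS reduce key is (dept-id, emp-id), but only dept-id picks the partition.
    dept_id, _emp_id = emp_reduce_key
    return partition(dept_id)

def rhs_reducer(dept_reduce_key):
    # RHS reduce key is dept-id itself.
    return partition(dept_reduce_key)

def partition_selector(lhs_index, rhs_index):
    # Merger input selection: pair up reducers with the same index.
    return lhs_index == rhs_index

print(lhs_reducer(("d7", "e42")), rhs_reducer("d7"))
```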

Page 13:

Processor

Pre-processing for each input
  E.g. building a hash table for hash join

This example is sort-merge
  Processor is empty

Page 14:

Iterator for sort-merge
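The code on this slide is not preserved. A sort-merge iterator can be sketched as below, assuming both inputs arrive sorted by join key and that keys on the right side are unique (one department row per dept-id). The function name and sample data are hypothetical.

```python
def sort_merge_iterate(left, right):
    """Yield (left_pair, right_pair) for matching keys.

    Assumes both inputs are sorted by key and right-side keys are unique.
    """
    i = j = 0
    while i < len(left) and j < len(right):
        l_key, r_key = left[i][0], right[j][0]
        if l_key < r_key:
            i += 1          # left is behind: advance left
        elif l_key > r_key:
            j += 1          # right is behind: advance right
        else:
            yield left[i], right[j]
            i += 1          # keep the right row for the next left row

# (dept-id, (emp-id, bonus)) sorted by dept-id, then emp-id
employees = [("d1", ("e1", 100.0)), ("d1", ("e3", 50.0)), ("d2", ("e2", 80.0))]
# (dept-id, adjustment) sorted by dept-id
adjustments = [("d1", 1.1), ("d2", 0.9), ("d3", 1.0)]
pairs = list(sort_merge_iterate(employees, adjustments))
print(pairs)
```

Only the left cursor moves on a match, so several employees of one department can reuse the same department row.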

Page 15:

Merger
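The merger code on this slide is also missing. For the Emp & Dept example, a merger might look like the sketch below: it receives one pair from each side and emits output only when the dept-ids agree. The record layout (an employee bonus and a per-department adjustment factor) is an assumption made for illustration.

```python
def bonus_merger(emp_pair, dept_pair):
    """Hypothetical merger: emit (emp-id, adjusted bonus) when dept-ids match."""
    emp_dept, (emp_id, bonus) = emp_pair
    dept_id, adjustment = dept_pair
    if emp_dept != dept_id:
        return []                     # no match, nothing to emit
    return [(emp_id, bonus * adjustment)]

print(bonus_merger(("d1", ("e1", 100.0)), ("d1", 1.5)))
```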

Page 16:

Other Iterators

Nested-loop:
  For each (k,v) of the first input, get all of the second input
  Then rewind the second input and process the next (k,v) of the first input

Hash join:
  Read all of one input, then read all of the other input
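Both access patterns can be sketched as Python generators. The names `nested_loop_iterate` and `hash_iterate` are hypothetical, and the inputs are in-memory lists, so "rewinding" is just iterating the list again.

```python
from collections import defaultdict

def nested_loop_iterate(first, second):
    """For each pair of the first input, scan the whole second input."""
    for f_pair in first:
        for s_pair in second:         # re-iterating the list is the "rewind"
            yield f_pair, s_pair

def hash_iterate(build, probe):
    """Read all of one input into a hash table, then stream the other."""
    table = defaultdict(list)
    for key, value in build:          # read all of one input
        table[key].append((key, value))
    for key, value in probe:          # then read all of the other
        for match in table.get(key, []):
            yield match, (key, value)

first = [("a", 1), ("b", 2)]
second = [("a", 10), ("c", 30)]
print(list(nested_loop_iterate(first, second)))
print(list(hash_iterate(first, second)))
```

The nested-loop iterator hands every combination to the merger, while the hash iterator only hands over pairs whose keys already match.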

Page 17:

Outline

Introduction
Map-Reduce
Map-Reduce-Merge
Applications to Relational Data Processing
Optimizations and Enhancements
Case Studies
Conclusion

Page 18:

Relation

Relation R with an attribute set A
A is broken down into a key part K and a value part V

Page 19:

Relational Operators

Generalized selection: choosing a subset of records
  Filtering can be done in mapper/reducer/merger

Projection: choosing a subset of attributes
  User-defined mapper (k,v) → (k’,v’)

Aggregation
  Group-by is performed before reduce
  Easy to implement aggregation in reducer

Joins (set union, intersection, difference, cartesian product)
  Sort-merge, hash join, nested-loop

Rename
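As a small illustration of the first three operators, the sketch below filters and projects in the mapper and aggregates in the reducer. The relation, the salary threshold, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical relation: key = emp-id, value = (dept-id, salary).
employees = [("e1", ("d1", 100)), ("e2", ("d2", 80)), ("e3", ("d1", 50))]

def mapper(emp_id, value):
    dept_id, salary = value
    if salary >= 60:            # selection: keep a subset of records
        yield dept_id, salary   # projection: keep a subset of attributes

def reducer(dept_id, salaries):
    return sum(salaries)        # aggregation over one group

def run(records):
    groups = defaultdict(list)  # group-by happens before reduce
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return {k: reducer(k, vs) for k, vs in groups.items()}

totals = run(employees)
print(totals)
```

Rename needs no extra machinery in this style: the mapper simply emits the same values under different key or field positions.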

Page 20:

Outline

Introduction
Map-Reduce
Map-Reduce-Merge
Applications to Relational Data Processing
Optimizations and Enhancements
Case Studies
Conclusion

Page 21:

Partition Selector

In general: LHS has R1 reducers, RHS has R2 reducers, performing a cartesian-product-like operator

Suppose R1 ≤ R2; use R1 mergers, where merger j selects:
  Input from LHS reducer j
  Input from all RHS reducers
  Remote reads: R1*(1+R2) = R1 + R1*R2

Natural equi-join case: let R1 == R2 == R; use R mergers, where merger j selects:
  LHS reducer j and RHS reducer j
  Remote reads: 2*R
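The remote-read counts above are easy to check numerically. The function names and the reducer counts used in the example are made up for illustration.

```python
def remote_reads_cartesian(r1, r2):
    """r1 mergers; each reads 1 LHS partition and all r2 RHS partitions."""
    return r1 * (1 + r2)

def remote_reads_equi_join(r):
    """r mergers; each reads one LHS partition and one RHS partition."""
    return 2 * r

print(remote_reads_cartesian(4, 8))   # mergers driven by the 4-reducer side
print(remote_reads_cartesian(8, 4))   # mergers driven by the 8-reducer side
print(remote_reads_equi_join(8))
```

With 4 and 8 reducers, driving the mergers from the smaller side costs 4 + 4*8 = 36 remote reads against 8 + 4*8 = 40 the other way around, while an equi-join over 8 matched partitions needs only 16.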

Page 22:

Combining Phases

Entire workflow consists of multiple map-reduce-merge steps

To avoid remote copying:
  ReduceMap, MergeMap: co-locate the next mapper with the previous reducer or merger
  ReduceMerge: co-locate the merger with one of the reducers
  ReduceMergeMap

Page 23:

Map-Reduce-Merge Library

Put common merge implementations into a library
  Joins
  Common iterators
  etc.

Page 24:

Configuration API for building a Customized Workflow

Map/reduce
Map/reduce/merge
Multiple map/reduce/merge

Page 25:

Outline

Introduction
Map-Reduce
Map-Reduce-Merge
Applications to Relational Data Processing
Optimizations and Enhancements
Case Studies
Conclusion

Page 26:

Webgraphs

Each row: (URL, in-links, out-links)
  Potentially large number of links
  Only a few are needed for many operations

Store each column of the table in a separate file
  Reconstruct the table by join
  E.g. compute the intersection of in-links and out-links
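The column-file idea can be sketched as a merge of two URL-sorted columns, assuming each column file holds (URL, link set) pairs with one entry per URL. The sample URLs and the function name `intersect_links` are hypothetical.

```python
# Two "column files", each sorted by URL; the link sets are made-up examples.
in_links = [("a.com", {"b.com", "c.com"}), ("b.com", {"a.com"})]
out_links = [("a.com", {"c.com", "d.com"}), ("b.com", {"c.com"})]

def intersect_links(in_col, out_col):
    """Sort-merge two URL-sorted columns and intersect the link sets per URL."""
    i = j = 0
    result = []
    while i < len(in_col) and j < len(out_col):
        url_in, url_out = in_col[i][0], out_col[j][0]
        if url_in < url_out:
            i += 1
        elif url_in > url_out:
            j += 1
        else:
            # Same URL on both sides: keep links that are both in- and out-links.
            result.append((url_in, in_col[i][1] & out_col[j][1]))
            i += 1
            j += 1
    return result

mutual = intersect_links(in_links, out_links)
print(mutual)
```

Only the two columns involved are read; any other column of the webgraph table stays untouched on disk.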

Page 27:

TPC-H Query 2

Page 28:

After Combining Phases

Page 29:

Conclusion

Extend map-reduce
Support relational operators
However, the merge step seems quite complicated