Theta join (M-bucket-I algorithm explained)
Processing Theta Joins using MapReduce
by Minsub Yim
Processing pipeline at a reducer
Goal: We want to minimize job completion time. Since it’s a function of both input and output, we need a way to model both inputs and outputs to a reducer.
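Mapper Output -> [ Receive mapper output -> Sort input by key -> Read input ] -> [ Run join algorithm -> Send join output ] -> Join Output
• Receive mapper output, sort input by key, read input: time = f(input size)
• Run join algorithm, send join output: time = f(output size)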
Theta Join Model
Dataset S            Dataset T
S_id  Value          T_id  Value
1     5              1     5
2     6              2     5
3     6              3     6
4     8              4     8
5     8              5     8
6     10             6     10
Assuming join condition: S.value = T.value
[ Join Matrix M ] Rows are the S values (5, 6, 6, 8, 8, 10) and columns are the T values (5, 5, 6, 8, 8, 10). A shaded cell marks a tuple pair satisfying the join condition.
Theta Join Model (Examples)
[ Three join matrices over the same S (rows: 5, 6, 6, 8, 8, 10) and T (columns: 5, 5, 6, 8, 8, 10), one per join condition ]
• Join condition: S.value <= T.value
• Join condition: abs(S.value - T.value) < 2
• Join condition: S.value = T.value
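A minimal sketch of how such join matrices could be built, assuming the S and T values shown on the slides. The helper name `join_matrix` and the dictionary `conditions` are illustrative, not part of the original deck.

```python
S = [5, 6, 6, 8, 8, 10]   # Dataset S values (matrix rows)
T = [5, 5, 6, 8, 8, 10]   # Dataset T values (matrix columns)

def join_matrix(theta):
    """M[i][j] is True when the pair (S_i, T_j) satisfies the join condition."""
    return [[theta(s, t) for t in T] for s in S]

# The three example join conditions, written as predicates.
conditions = {
    "S.value <= T.value": lambda s, t: s <= t,
    "abs(S.value - T.value) < 2": lambda s, t: abs(s - t) < 2,
    "S.value = T.value": lambda s, t: s == t,
}

true_cells = {}
for name, theta in conditions.items():
    M = join_matrix(theta)
    true_cells[name] = sum(cell for row in M for cell in row)
    print(name)
    for s, row in zip(S, M):
        # 'x' marks a true cell, '.' a false one.
        print(f"{s:>3} " + " ".join("x" if cell else "." for cell in row))
```

Any other theta condition can be tried by adding another predicate to `conditions`.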
Goal Revisited
• We want to minimize job completion time
• We need to assign every true cell to exactly one reducer. (find a mapping from M to R)
• Goal: Find a mapping from the join matrix M to reducers that minimizes job completion time
Mappings from join matrix to reducers
Join condition: S.value = T.value (rows: S values 5, 6, 6, 8, 8, 10; columns: T values 5, 5, 6, 8, 8, 10)
Standard equi-join algorithm, regions (1) to (4):
[R1] Input: S1, T1, T2  Output: 2 tuples
[R2] Input: S2, S3, T3  Output: 2 tuples
[R3] Input: S4, S5, T4, T5  Output: 4 tuples
[R4] Input: S6, T6  Output: 1 tuple
Max-Reducer-Input (MRI): 4  Max-Reducer-Output (MRO): 4
Random, regions (1) to (4):
[R1] Input: S1, S4, S5, T1, T4, T5  Output: 3 tuples
[R2] Input: S2, S4, T3, T5  Output: 2 tuples
[R3] Input: S1, S5, T2, T4  Output: 2 tuples
[R4] Input: S3, S6, T3, T6  Output: 2 tuples
MRI: 6  MRO: 3
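The MRI and MRO figures of a mapping can be checked mechanically. The sketch below does so for the standard equi-join mapping only, assuming the S and T values from the slides. The function `evaluate` and the dictionary `reducer_of_value` are hypothetical names introduced for illustration.

```python
from collections import defaultdict

S = {"S1": 5, "S2": 6, "S3": 6, "S4": 8, "S5": 8, "S6": 10}
T = {"T1": 5, "T2": 5, "T3": 6, "T4": 8, "T5": 8, "T6": 10}

def evaluate(assign):
    """assign(s_id, t_id) returns the reducer for a true cell.
    Returns (max-reducer-input, max-reducer-output)."""
    inputs, outputs = defaultdict(set), defaultdict(int)
    for s_id, s_val in S.items():
        for t_id, t_val in T.items():
            if s_val == t_val:                     # true cell of the join matrix
                reducer = assign(s_id, t_id)
                inputs[reducer].update((s_id, t_id))
                outputs[reducer] += 1
    return max(len(v) for v in inputs.values()), max(outputs.values())

# Standard equi-join: all pairs sharing a join value go to the same reducer.
reducer_of_value = {5: 1, 6: 2, 8: 3, 10: 4}
mri, mro = evaluate(lambda s_id, t_id: reducer_of_value[S[s_id]])
print(mri, mro)
```

Passing a different `assign` function evaluates any other cell-to-reducer mapping in the same way.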
Mappings from join matrix to reducers
Join condition: S.value = T.value, regions (1) to (3):
[R1] Input: S1, S2, T1, T2  Output: 2 tuples
[R2] Input: S3, S4, T1, T2, T3  Output: 2 tuples
[R3] Input: S4, S5, S6, T4, T5, T6  Output: 5 tuples
Max-Reducer-Input: 6  Max-Reducer-Output: 5
Mappings from join matrix to reducers
• There can be many possible mappings from the join matrix to reducers.
• For different cases, we will see which mapping is (close to) optimal, and algorithms to compute such a mapping.
Lemma
We will use the following lemma repeatedly to show how (close to) optimal each mapping is.
[ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least $2\sqrt{c}$ input tuples.
[ Proof ] Consider a reducer r that receives m records from T and n records from S. It can cover at most mn cells, so $mn \ge c$ and hence $2\sqrt{mn} \ge 2\sqrt{c}$. By the AM-GM inequality, $m + n \ge 2\sqrt{mn} \ge 2\sqrt{c}$.
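As a sanity check rather than a proof, Lemma 1 can be verified by brute force for small c. The helper `min_input_for_cells` is a hypothetical name; it searches all integer shapes m x n with at least c cells.

```python
import math

def min_input_for_cells(c: int) -> int:
    """Smallest m + n over integers m, n >= 1 with m * n >= c."""
    best = None
    for m in range(1, c + 1):
        n = -(-c // m)                 # integer ceiling of c / m
        if best is None or m + n < best:
            best = m + n
    return best

# No shape covering c cells needs fewer than 2 * sqrt(c) input tuples.
for c in range(1, 200):
    assert min_input_for_cells(c) >= 2 * math.sqrt(c) - 1e-9
```

The bound is tight exactly when c is a perfect square, where a square region attains it.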
Cross Product
• We first consider the cross product, where every pair of tuples from the two datasets satisfies the join condition. The join matrix would look like the following:
[ Join matrix for S X T with S values as rows and T values as columns; every cell is true ]
Cross Product
• Since all entries of the join matrix are true, the maximum-reducer-output satisfies $\mathrm{MRO} \ge |S||T|/r$. (Otherwise, some cells would not be mapped to any reducer.)
• Along with Lemma 1, this gives a lower bound for the maximum-reducer-input: $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$.
[ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least $2\sqrt{c}$ input tuples.
Cross Product
• We will revisit these two properties frequently to judge the quality of join mappings: $\mathrm{MRO} \ge |S||T|/r$ and $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$.
Case 1: Suppose |S| and |T| are multiples of $\sqrt{|S||T|/r}$. Namely, $|S| = c_S\sqrt{|S||T|/r}$ and $|T| = c_T\sqrt{|S||T|/r}$. Then partitioning the join matrix into squares of size $\sqrt{|S||T|/r} \times \sqrt{|S||T|/r}$ is an optimal mapping.
Proof: Trivial. Each region mapped to a reducer has output size $|S||T|/r$ and input size $2\sqrt{|S||T|/r}$, which match the lower bounds.
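Cross Product (example)
Properties: $\mathrm{MRO} \ge |S||T|/r$ and $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$
[ Join matrix with S values as rows and T values as columns, partitioned into squares ]
Suppose |S| = |T| = 6 and r = 9. Then:
$\mathrm{MRO} = 4 = |S||T|/r$
$\mathrm{MRI} = 4 = 2\sqrt{|S||T|/r}$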
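The Case 1 tiling can be sketched as follows. This is an illustration that only handles the exact case where the square side divides both |S| and |T|; `square_partition` is a hypothetical helper name.

```python
import math

def square_partition(s_size: int, t_size: int, r: int):
    """Tile an |S| x |T| cross-product matrix with r equal squares (Case 1 only)."""
    side = math.isqrt(s_size * t_size // r)
    assert side * side * r == s_size * t_size, "|S||T|/r must be a perfect square"
    assert s_size % side == 0 and t_size % side == 0, "side must divide |S| and |T|"
    # Each region is identified by the (row, column) of its top-left cell.
    regions = [(row, col)
               for row in range(0, s_size, side)
               for col in range(0, t_size, side)]
    return side, regions

side, regions = square_partition(6, 6, 9)
mro = side * side      # cells (output tuples) per region
mri = 2 * side         # rows plus columns (input tuples) per region
print(side, len(regions), mro, mri)
```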
Cross Product
Properties: $\mathrm{MRO} \ge |S||T|/r$ and $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$
Case 2: Suppose the cardinality of one dataset is significantly greater than that of the other (WLOG, assume $|S| < |T|/r$). Then the rectangle cover with $|S| \times |T|/r$ rectangles is the optimal mapping.
(e.g., |S| = 3, |T| = 20, r = 5: each reducer covers a 3 x 4 rectangle)
Cross Product
Properties: $\mathrm{MRO} \ge |S||T|/r$ and $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$
Case 3: The remaining case, where $|T|/r \le |S| \le |T|$.
Let $c_S = \left\lfloor |S| / \sqrt{|S||T|/r} \right\rfloor$ and $c_T = \left\lfloor |T| / \sqrt{|S||T|/r} \right\rfloor$.
Then covering M with $\sqrt{|S||T|/r} \times \sqrt{|S||T|/r}$ squares is a mapping worse than an optimal mapping by a factor no greater than 4.
Cross Product
If |S| and/or |T| is not a multiple of $\sqrt{|S||T|/r}$, scale each side by $\left(1 + \frac{1}{c_S}\right)$ and/or $\left(1 + \frac{1}{c_T}\right)$ respectively to cover M. Given $|T|/r \le |S| \le |T|$, we see that
$\left(1 + \frac{1}{c_S}\right)\sqrt{|S||T|/r} \le 2\sqrt{|S||T|/r}$
Cross Product
Hence, $\mathrm{MRI} \le 4\sqrt{|S||T|/r}$ and $\mathrm{MRO} \le 4|S||T|/r$.
Comparing these with the lower bounds given above, we see that the MRO and MRI produced by this mapping are at most 4 times (twice for MRI) the lower bounds.
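A small numeric check of the Case 3 bounds, under the assumption that the matrix is covered by a c_S x c_T grid of equal regions stretched to fill M. The sizes |S| = 7, |T| = 10, r = 4 are made up for illustration, and region sizes are kept fractional to avoid rounding effects.

```python
import math

def case3_regions(s_size: int, t_size: int, r: int):
    """Region shape for Case 3, which requires |T|/r <= |S| <= |T|."""
    assert t_size / r <= s_size <= t_size
    side = math.sqrt(s_size * t_size / r)   # ideal square side
    c_s = math.floor(s_size / side)         # regions along the S dimension
    c_t = math.floor(t_size / side)         # regions along the T dimension
    height = s_size / c_s                   # rows per region after stretching
    width = t_size / c_t                    # columns per region after stretching
    return side, c_s, c_t, height, width

side, c_s, c_t, height, width = case3_regions(7, 10, 4)
assert c_s * c_t <= 4                        # no more regions than reducers
assert height + width <= 4 * side            # MRI <= 4 * sqrt(|S||T|/r)
assert height * width <= 4 * side * side     # MRO <= 4 * |S||T|/r
print(c_s, c_t, height, width)
```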
Implementation
• Now we know how to (nearly) optimally partition the join matrix. So let's run it!
• However, when the map function is given a record (either from S or T), it does NOT have enough information to tell where exactly in the dataset (in which row/column) the record belongs.
• We could run another pre-processing pass to get that information, but it can be avoided by running a randomized algorithm!
Mapping & Randomized Algorithm
Algorithm 1 : Map (Theta-Join)
Input : input tuple $x \in S \cup T$
  if $x \in S$ then
    matrixRow = random(1, |S|)
    for all regionID in lookup.getRegions(matrixRow) do
      Output (regionID, (x, "S"))
  else
    matrixCol = random(1, |T|)
    for all regionID in lookup.getRegions(matrixCol) do
      Output (regionID, (x, "T"))
In words, for an input tuple x:
1. Given a record $x \in S$ (WLOG)
2. Pick a row uniformly at random
3. Get all the regions intersecting that row and output (regID, (x, S)) for each of them
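The map step could look roughly like the sketch below. The lookup tables `ROW_REGIONS` and `COL_REGIONS` are hypothetical and describe an assumed 6 x 6 matrix split into three regions; a real job would build them from the chosen partitioning.

```python
import random
from typing import Dict, Iterator, List, Tuple

# Hypothetical lookup: region 1 = rows 1-4 x cols 1-3, region 2 = rows 1-4 x cols 4-6,
# region 3 = rows 5-6 x all columns.
ROW_REGIONS: Dict[int, List[int]] = {1: [1, 2], 2: [1, 2], 3: [1, 2], 4: [1, 2], 5: [3], 6: [3]}
COL_REGIONS: Dict[int, List[int]] = {1: [1, 3], 2: [1, 3], 3: [1, 3], 4: [2, 3], 5: [2, 3], 6: [2, 3]}

def theta_map(x, origin: str, rng: random.Random) -> Iterator[Tuple[int, Tuple[object, str]]]:
    """Emit (regionID, (x, origin)) for every region crossing a random row or column."""
    if origin == "S":
        row = rng.randint(1, len(ROW_REGIONS))   # random(1, |S|)
        regions = ROW_REGIONS[row]
    else:
        col = rng.randint(1, len(COL_REGIONS))   # random(1, |T|)
        regions = COL_REGIONS[col]
    for region_id in regions:
        yield region_id, (x, origin)

rng = random.Random(0)
pairs = list(theta_map("S1", "S", rng)) + list(theta_map("T1", "T", rng))
print(pairs)
```

Whatever row and column are drawn, the regions of an S tuple and of a T tuple share exactly one region, so every (S, T) pair is compared by exactly one reducer.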
Mapping & Randomized Algorithm (example)
Join condition: S.value = T.value
S1.A = 5  S2.A = 7  S3.A = 7  S4.A = 8  S5.A = 9  S6.A = 9 (matrix rows)
T1.A = 5  T2.A = 7  T3.A = 7  T4.A = 7  T5.A = 8  T6.A = 9 (matrix columns)
[ Join matrix partitioned into regions (1), (2) and (3) ]
Map
Input Tuple  Random Row/Col  Output
S1           3               (1,S1) (2,S1)
S2           5               (3,S2)
S3           1               (1,S3) (2,S3)
S4           5               (3,S4)
S5           1               (1,S5) (2,S5)
S6           2               (1,S6) (2,S6)
T1           6               (2,T1) (3,T1)
T2           2               (1,T2) (3,T2)
T3           2               (1,T3) (3,T3)
T4           3               (1,T4) (3,T4)
T5           6               (2,T5) (3,T5)
T6           4               (2,T6) (3,T6)
Reduce
Reducer 1 : key 1 (regID)  Input: S1, S3, S5, S6, T2, T3, T4  Output: (S3,T2) (S3,T3) (S3,T4)
Reducer 2 : key 2 (regID)  Input: S1, S3, S5, S6, T1, T5, T6  Output: (S1,T1) (S5,T6) (S6,T6)
Reducer 3 : key 3 (regID)  Input: S2, S4, T1, T2, T3, T4, T5, T6  Output: (S2,T2) (S2,T3) (S2,T4) (S4,T5)
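The reduce side of this example can be replayed with a short sketch. The map output and the attribute values are copied from the example; the function `reduce_all` is an illustrative stand-in for the per-reducer join, here a plain nested loop.

```python
from collections import defaultdict

values = {"S1": 5, "S2": 7, "S3": 7, "S4": 8, "S5": 9, "S6": 9,
          "T1": 5, "T2": 7, "T3": 7, "T4": 7, "T5": 8, "T6": 9}

# Map output from the example: (regionID, tuple id).
map_output = [(1, "S1"), (2, "S1"), (3, "S2"), (1, "S3"), (2, "S3"), (3, "S4"),
              (1, "S5"), (2, "S5"), (1, "S6"), (2, "S6"),
              (2, "T1"), (3, "T1"), (1, "T2"), (3, "T2"), (1, "T3"), (3, "T3"),
              (1, "T4"), (3, "T4"), (2, "T5"), (3, "T5"), (2, "T6"), (3, "T6")]

def reduce_all(pairs):
    """Group tuples by region, then join the S and T tuples inside each region."""
    groups = defaultdict(list)
    for region_id, tuple_id in pairs:
        groups[region_id].append(tuple_id)
    result = {}
    for region_id, ids in sorted(groups.items()):
        s_ids = [i for i in ids if i.startswith("S")]
        t_ids = [i for i in ids if i.startswith("T")]
        result[region_id] = [(s, t) for s in s_ids for t in t_ids
                             if values[s] == values[t]]   # join condition
    return result

joined = reduce_all(map_output)
for region_id, out in joined.items():
    print(region_id, out)
```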
Cross Product… NOT!
• We have verified that the 1-Bucket-Theta algorithm is close to optimal when the join is a cross product.
• How does 1-Bucket-Theta perform when the join condition is NOT a cross product?
• We will compare the quality of 1-Bucket-Theta (1BT) to that of any join algorithm.
1BT vs ANY join algorithm
Let $1 \ge x > 0$. Any matrix-to-reducer mapping that has to cover at least $x|S||T|$ of the $|S||T|$ cells of the join matrix must assign at least $x|S||T|/r$ cells to some reducer, so by Lemma 1 it has $\mathrm{MRI} \ge 2\sqrt{x|S||T|/r}$.
[ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least $2\sqrt{c}$ input tuples.
As we have seen, 1BT guarantees that $\mathrm{MRI} \le 4\sqrt{|S||T|/r}$. Hence,
$\frac{\mathrm{MRI}_{1BT}}{\mathrm{MRI}_{AnyJoinAlg}} \le \frac{4\sqrt{|S||T|/r}}{2\sqrt{x|S||T|/r}} = \frac{2}{\sqrt{x}}$
1BT vs ANY join algorithm
When $x = 0.5$, the ratio is $2/\sqrt{0.5} \approx 2.83 < 3$. Hence, compared to ANY join algorithm that assigns more than 50% of the join matrix cells to reducers, the MRI for 1BT is at most 3 times the MRI of that algorithm.
M-Bucket-I
• In the previous slide, we saw that mapping smaller regions instead of covering the entire matrix would yield a better MRI.
• Ideally, we only want to map the cells satisfying the join condition, but that cannot be done without knowing input statistics and/or the join condition.
• M-Bucket-I exploits statistics to improve over the 1-Bucket-Theta join algorithm.
M-Bucket-I
[ Step 1 ] Approximate Equi-Depth Histograms
1) Sample each record of S with probability n/|S|, giving approximately n records
2) Build k-quantiles (k buckets) from the sample, where k < n
3) Iterate through S and count the number of records in each bucket
4) Do the same for T and build the join matrix accordingly
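The steps above could be sketched as follows for one dataset. This is a single-machine illustration, not the MapReduce implementation; `approx_equi_depth` and the way quantile positions are picked from the sorted sample are assumptions.

```python
import random
from bisect import bisect_left
from typing import List

def approx_equi_depth(data: List[int], n: int, k: int, rng: random.Random):
    """Return (boundaries, counts) for roughly k equi-depth buckets."""
    p = n / len(data)
    # Step 1: keep each record with probability n/|data|, about n records.
    sample = sorted(v for v in data if rng.random() < p)
    if len(sample) < k:                  # tiny inputs: fall back to the full data
        sample = sorted(data)
    # Step 2: k-quantiles of the sample give k - 1 interior boundaries.
    boundaries = [sample[(i * len(sample)) // k] for i in range(1, k)]
    # Step 3: one full pass to count the records falling into each bucket.
    counts = [0] * k
    for v in data:
        counts[bisect_left(boundaries, v)] += 1
    return boundaries, counts

data = list(range(1000))
boundaries, counts = approx_equi_depth(data, n=100, k=4, rng=random.Random(7))
print(boundaries, counts)
```

Running the same function on T gives the second histogram; the two sets of boundaries then define the rows and columns of the bucketed join matrix.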
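M-Bucket-I
[ Step 1 ] Approximate Equi-Depth Histograms (example)
Dataset S            Dataset T
S_id  Value          T_id  Value
1     7              1     5
2     2              2     5
3     4              3     6
4     2              4     8
5     1              5     8
6     9              6     10
7     10             7     2
8     2              8     4
9     5              9     1
10    3              10    3
Samples
Sample S: 7, 2, 2, 9, 2, 3
Sample T: 5, 6, 8, 2, 1, 3
Buckets
S bucket boundaries: 0, 2, 3, 9   records per bucket: 4, 1, 4, 1
T bucket boundaries: 0, 1, 5, 8   records per bucket: 1, 5, 3, 1
(The last bucket of each dataset holds the values above the last boundary.)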
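The per-bucket record counts can be reproduced with a few lines, assuming each bucket includes its upper boundary, which is the reading that matches the counts on the slide. `bucket_counts` is an illustrative helper.

```python
from bisect import bisect_left

S = [7, 2, 4, 2, 1, 9, 10, 2, 5, 3]
T = [5, 5, 6, 8, 8, 10, 2, 4, 1, 3]

def bucket_counts(values, boundaries):
    """Bucket i holds values v with boundaries[i-1] < v <= boundaries[i];
    the last bucket holds everything above the last boundary."""
    counts = [0] * (len(boundaries) + 1)
    for v in values:
        counts[bisect_left(boundaries, v)] += 1
    return counts

s_counts = bucket_counts(S, [2, 3, 9])
t_counts = bucket_counts(T, [1, 5, 8])
print(s_counts, t_counts)
```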
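M-Bucket-I
[ Step 1 ] Approximate Equi-Depth Histograms
[ Join matrix over the ten S tuples (bucket boundaries 2, 3, 9) and the ten T tuples (bucket boundaries 1, 5, 8); candidate cells are highlighted ]
Join condition: S.value = T.value
We now have candidate cells. How do we map these cells to reducers?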
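For an equality join, one way to find candidate cells at bucket granularity is to keep only the bucket pairs whose value ranges overlap. The sketch below assumes half-open buckets (low, high] and an unbounded last bucket, which are assumptions about the histogram rather than details given on the slide.

```python
import math

S_BOUNDS = [0, 2, 3, 9, math.inf]   # bucket i covers (S_BOUNDS[i], S_BOUNDS[i + 1]]
T_BOUNDS = [0, 1, 5, 8, math.inf]

def candidate_cells(s_bounds, t_bounds):
    """Bucket pairs whose ranges overlap, so an equality match is possible."""
    cells = []
    for i in range(len(s_bounds) - 1):
        for j in range(len(t_bounds) - 1):
            low = max(s_bounds[i], t_bounds[j])
            high = min(s_bounds[i + 1], t_bounds[j + 1])
            if low < high:               # the two ranges share some values
                cells.append((i, j))
    return cells

cells = candidate_cells(S_BOUNDS, T_BOUNDS)
print(cells)
```

All other bucket pairs are guaranteed to produce no output and never need to be sent to a reducer.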
M-Bucket-I
[ Step 2 ] M-Bucket-I Algorithm
Algorithm : M-Bucket-I
Input : maxInput, r, M
  row = 0
  while row < M.noOfRows do
    (row, r) = CoverSubMatrix(row, maxInput, r, M)
    if r < 0 then
      return false
  return true
Algorithm : CoverSubMatrix
Input : row_s, maxInput, r, M
  maxScore = -1, rUsed = 0
  for i = 1 to maxInput - 1 do
    R_i = CoverRows(row_s, row_s + i, maxInput, M)
    area = totalCandidateArea(row_s, row_s + i, M)
    score = area / R_i.size
    if score >= maxScore then
      maxScore = score
      bestRow = row_s + i
      rUsed = R_i.size
  r = r - rUsed
  return (bestRow + 1, r)
Algorithm : CoverRows
Input : row_f, row_l, maxInput, M
  Regions = ∅; r = newRegion()
  for all c_i in M.getColumns do
    if r.cap < c_i.candidateInputCosts then
      Regions = Regions ∪ r
      r = newRegion()
    r.Cells = r.Cells ∪ c_i.candidateCells
  return Regions
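A simplified, runnable sketch of the same heuristic is given below. It departs from the pseudocode in several labeled ways: the matrix is a plain boolean grid at tuple granularity, every row and column costs one input tuple, and the function returns the regions themselves instead of a true/false flag. All names are illustrative.

```python
from typing import List, Optional, Tuple

Matrix = List[List[bool]]
Region = Tuple[int, int, List[int]]      # (first row, last row, columns)

def cover_rows(first: int, last: int, max_input: int, M: Matrix) -> List[List[int]]:
    """Greedily pack the candidate columns of rows first..last into regions.
    A region's input is (rows in the block) + (its columns) <= max_input."""
    capacity = max_input - (last - first + 1)
    regions, current = [], []
    for col in range(len(M[0])):
        if not any(M[row][col] for row in range(first, last + 1)):
            continue                      # column has no candidate cell here
        if len(current) == capacity:      # region is full: start a new one
            regions.append(current)
            current = []
        current.append(col)
    if current:
        regions.append(current)
    return regions

def cover_sub_matrix(start: int, max_input: int, M: Matrix):
    """Try row blocks start..last and keep the one with the best score."""
    best = None                           # (score, last row, regions)
    for last in range(start, min(start + max_input - 1, len(M))):
        regions = cover_rows(start, last, max_input, M)
        area = sum(M[r][c] for r in range(start, last + 1) for c in range(len(M[0])))
        score = area / len(regions) if regions else float("inf")
        if best is None or score >= best[0]:
            best = (score, last, regions)
    return best

def m_bucket_i(max_input: int, r: int, M: Matrix) -> Optional[List[Region]]:
    """Return the regions, or None if more than r reducers would be needed."""
    row, mapping = 0, []
    while row < len(M):
        best = cover_sub_matrix(row, max_input, M)
        if best is None:
            return None                   # max_input too small to cover a row
        _, last, regions = best
        mapping.extend((row, last, cols) for cols in regions)
        if len(mapping) > r:
            return None
        row = last + 1
    return mapping

S = [5, 6, 6, 8, 8, 10]
T = [5, 5, 6, 8, 8, 10]
M = [[s == t for t in T] for s in S]
mapping = m_bucket_i(max_input=4, r=4, M=M)
print(mapping)
```

On this small equi-join matrix the sketch ends up with one region per join value, each with at most 4 input tuples.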
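M-Bucket-I
[ Step 2 ] M-Bucket-I Algorithm (example)
Run the algorithm with r = 6 and maxInput = 5. Scores (candidate area / number of regions) for the row blocks starting at row 0:
row : 0  score : 4
row : 1  score : 13/3 = 4.33
row : 2  score : 22/4 = 5.5
row : 3  score : 31/7 = 4.43
We choose the mapping with the highest score, which ends at row 2 and uses regions (1) to (4).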
M-Bucket-I
[ Step 2 ] M-Bucket-I Algorithm (example, continued)
The next block starts at row 3 (row : 3  score : 3), and so on and so forth…
Final mapping: regions (1) to (13).
However, we have mapped the candidate cells to more than r reducers (13 > 6). We do a binary search on maxInput until we find a mapping to <= r reducers.
M-Bucket-I
[ Step 3 ] Binary Search
MaxInput = |S|+|T| = 20  Num.Reducers = 1
MaxInput = 5   Num.Reducers = 13
MaxInput = 12  Num.Reducers = 3
MaxInput = 8   Num.Reducers = 5
Since 7 reducers are required when MaxInput = 7, which exceeds r = 6, we stop the binary search here and output the mapping with MRI = 8.
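The search over maxInput can be sketched as a standard binary search, assuming the number of regions needed never increases as maxInput grows. The function `regions_needed` is a stand-in for running the Step 2 heuristic: it is a step function through the data points above, and the values between those points are assumed.

```python
def smallest_max_input(lo, hi, r, regions_needed):
    """Smallest maxInput in [lo, hi] whose mapping needs at most r regions."""
    while lo < hi:
        mid = (lo + hi) // 2
        if regions_needed(mid) <= r:
            hi = mid          # feasible: try a smaller input budget
        else:
            lo = mid + 1      # too many regions: allow more input per reducer
    return lo

def regions_needed(max_input):
    # Known points: 20 -> 1, 12 -> 3, 8 -> 5, 7 -> 7, 5 -> 13 (others assumed).
    for threshold, regions in [(20, 1), (12, 3), (8, 5), (6, 7)]:
        if max_input >= threshold:
            return regions
    return 13

best = smallest_max_input(2, 20, 6, regions_needed)
print(best)
```

The probe order differs from the one on the slide, but the search settles on the same maxInput.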
Performance
Skew Resistance of 1 Bucket Theta
                                     1 Bucket Theta                    Standard Equi Join
Data set     Output size (billion)   Output Imbalance  Runtime (secs)  Output Imbalance  Runtime (secs)
Synth - 0    25.00                   1.0030            657             1.0124            701
Synth - 0.4  24.99                   1.0023            650             1.2541            722
Synth - 0.6  24.98                   1.0033            676             1.7780            923
Synth - 0.8  24.95                   1.0068            678             3.0103            1482
Synth - 1    24.91                   1.0089            667             5.3124            2489
(Rows toward the bottom are more skewed.)
Where Output Imbalance = MRI / Ave.RI
Performance
M-Bucket-I cost details (seconds)
Step       Number of Buckets
           1          10         100       1000    10,000  100,000  1,000,000
Quantiles  0          115        120       117     122     124      122
Histogram  0          140        145       147     157     167      604
Heuristic  74.01      9.21       0.84      1.50    16.67   118.03   111.27
Join       49384      10905      1157      595     548     540      536
Total      49,458.01  11,169.21  1,422.84  860.5   843.67  949.03   1,373.27