Theta join (M-bucket-I algorithm explained)

Processing Theta Joins using MapReduce by Minsub Yim

Transcript of Theta join (M-bucket-I algorithm explained)

Page 1: Theta join (M-bucket-I algorithm explained)

Processing Theta Joins using MapReduce

by Minsub Yim

Page 2: Theta join (M-bucket-I algorithm explained)

Processing pipeline at a reducer

Goal: We want to minimize job completion time. Since it’s a function of both input and output, we need a way to model both inputs and outputs to a reducer.

Mapper Output -> Receive mapper output -> Sort input by key -> Read input -> Run join algorithm -> Send join output -> Reducer Join Output

time = f(input size) for the input-side steps; time = f(output size) for the output-side steps

Page 3: Theta join (M-bucket-I algorithm explained)

Theta Join Model

Dataset S              Dataset T

S_id  Value            T_id  Value
1     5                1     5
2     6                2     5
3     6                3     6
4     8                4     8
5     8                5     8
6     10               6     10

Assuming join condition: S.value = T.value

Page 4: Theta join (M-bucket-I algorithm explained)

Theta Join Model

Join condition: S.value = T.value

[ Join Matrix M ] (rows: S values, columns: T values)

S \ T   5   5   6   8   8   10
5       x   x   .   .   .   .
6       .   .   x   .   .   .
6       .   .   x   .   .   .
8       .   .   .   x   x   .
8       .   .   .   x   x   .
10      .   .   .   .   .   x

x : tuple pair satisfying the join condition

Page 5: Theta join (M-bucket-I algorithm explained)

Theta Join Model (Examples)

S values (rows): 5, 6, 6, 8, 8, 10; T values (columns): 5, 5, 6, 8, 8, 10

[Figure: three join matrices over the same data, one per join condition]

Join condition: S.value <= T.value

Join condition: abs(S.value - T.value) < 2

Join condition: S.value = T.value
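The three example matrices can be rebuilt from the data alone. The following is a minimal sketch, assuming S labels the rows and T the columns; the helper names `join_matrix` and `true_cells` are made up for this illustration and do not come from the slides.

```python
# Values taken from the Dataset S / Dataset T tables on the earlier slide.
S = [5, 6, 6, 8, 8, 10]   # assumed to label the rows of the join matrix
T = [5, 5, 6, 8, 8, 10]   # assumed to label the columns of the join matrix


def join_matrix(s_vals, t_vals, theta):
    """Return a boolean matrix: entry [i][j] is theta(s_vals[i], t_vals[j])."""
    return [[theta(s, t) for t in t_vals] for s in s_vals]


def true_cells(matrix):
    """Count the cells that satisfy the join condition."""
    return sum(cell for row in matrix for cell in row)


conditions = {
    "S.value <= T.value": lambda s, t: s <= t,
    "abs(S.value - T.value) < 2": lambda s, t: abs(s - t) < 2,
    "S.value = T.value": lambda s, t: s == t,
}

for name, theta in conditions.items():
    m = join_matrix(S, T, theta)
    print(f"{name}: {true_cells(m)} true cells")
    for row in m:
        print(" ".join("x" if cell else "." for cell in row))
```

Any predicate over a pair of values can be plugged in as `theta`, which is what makes the join-matrix view independent of the particular join condition.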

Page 9: Theta join (M-bucket-I algorithm explained)

Goal Revisited

• We want to minimize job completion time

• We need to assign every true cell to exactly one reducer. (find a mapping from M to R)

• Goal: Find a mapping from the join matrix M to reducers that minimizes job completion time

Page 10: Theta join (M-bucket-I algorithm explained)

Mappings from join matrix to reducers

Join condition: S.value = T.value (S values: 5, 6, 6, 8, 8, 10; T values: 5, 5, 6, 8, 8, 10)

Standard equi-join algorithm (regions (1) to (4)):

[R1] Input: S1, T1, T2  Output: 2 tuples
[R2] Input: S2, S3, T3  Output: 2 tuples
[R3] Input: S4, S5, T4, T5  Output: 4 tuples
[R4] Input: S6, T6  Output: 1 tuple
Max-Reducer-Input (MRI): 4  Max-Reducer-Output (MRO): 4

Random (regions (1) to (4)):

[R1] Input: S1, S4, S5, T1, T4, T5  Output: 3 tuples
[R2] Input: S2, S4, T3, T5  Output: 2 tuples
[R3] Input: S1, S5, T2, T4  Output: 2 tuples
[R4] Input: S3, S6, T3, T6  Output: 2 tuples
MRI: 6  MRO: 3
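The numbers for the standard equi-join mapping can be checked mechanically. This is a small sketch under the assumption that the standard algorithm sends all tuples with the same join value to one reducer; the function names are hypothetical.

```python
from collections import defaultdict

# Tuple ids and values from the Dataset S / Dataset T tables.
S = {"S1": 5, "S2": 6, "S3": 6, "S4": 8, "S5": 8, "S6": 10}
T = {"T1": 5, "T2": 5, "T3": 6, "T4": 8, "T5": 8, "T6": 10}


def equi_join_reducers(s, t):
    """Group tuple ids by join value: one reducer per distinct value."""
    groups = defaultdict(lambda: ([], []))
    for sid, value in s.items():
        groups[value][0].append(sid)
    for tid, value in t.items():
        groups[value][1].append(tid)
    return groups


def max_input_output(groups):
    """Return (max-reducer-input, max-reducer-output) over all reducers.

    Within one group every S tuple matches every T tuple, so the output
    of a reducer is the product of the two group sizes.
    """
    mri = max(len(ss) + len(ts) for ss, ts in groups.values())
    mro = max(len(ss) * len(ts) for ss, ts in groups.values())
    return mri, mro


groups = equi_join_reducers(S, T)
mri, mro = max_input_output(groups)
print("MRI:", mri, "MRO:", mro)
```

The reducer for value 8 receives S4, S5, T4, T5 and produces four pairs, which is where both maxima come from.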

Page 14: Theta join (M-bucket-I algorithm explained)

Mappings from join matrix to reducers

Join condition: S.value = T.value (S values: 5, 6, 6, 8, 8, 10; T values: 5, 5, 6, 8, 8, 10)

[Figure: join matrix covered by three regions (1), (2), (3)]

[R1] Input: S1, S2, T1, T2  Output: 2 tuples
[R2] Input: S3, S4, T1, T2, T3  Output: 2 tuples
[R3] Input: S4, S5, S6, T4, T5, T6  Output: 5 tuples
Max-Reducer-Input: 6  Max-Reducer-Output: 5

Page 16: Theta join (M-bucket-I algorithm explained)

Mappings from join matrix to reducers

• As we have seen, there are many possible mappings from the join matrix to reducers.

• For different cases, we will see which mapping is (close to) optimal, and algorithms to compute such a mapping.

Page 17: Theta join (M-bucket-I algorithm explained)

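Lemma

We will use the following lemma repeatedly to show how (close to) optimal each mapping is.

[ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least $2\sqrt{c}$ input tuples.

[ Proof ] Consider a reducer r that receives m records from T and n records from S. It can cover at most mn cells, so $mn \ge c$ and therefore $2\sqrt{mn} \ge 2\sqrt{c}$. By the inequality of arithmetic and geometric means, $m + n \ge 2\sqrt{mn}$. Then,

$$m + n \ge 2\sqrt{mn} \ge 2\sqrt{c}$$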
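As a sanity check of Lemma 1, the bound can be compared against a brute-force search over small cases. This sketch is only an illustration; `min_input_for_cells` is a name introduced here, not part of the original material.

```python
import math


def min_input_for_cells(c):
    """Smallest m + n over positive integers m, n with m * n >= c."""
    best = None
    for m in range(1, c + 1):
        n = -(-c // m)  # integer ceiling of c / m
        if best is None or m + n < best:
            best = m + n
    return best


# The brute-force minimum never drops below the 2 * sqrt(c) bound.
for c in range(1, 200):
    assert min_input_for_cells(c) >= 2 * math.sqrt(c) - 1e-9

print(min_input_for_cells(9), 2 * math.sqrt(9))
```

For perfect squares such as c = 9 the bound is met exactly (a 3 x 3 region needs 6 input tuples); otherwise the integer minimum is slightly above it.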

Page 19: Theta join (M-bucket-I algorithm explained)

Cross Product

• We first consider the cross product, where every pair of tuples from the two datasets satisfies the join condition. Every cell of the join matrix is therefore true:

S \ T   5   5   6   8   8   10
5       x   x   x   x   x   x
6       x   x   x   x   x   x
6       x   x   x   x   x   x
8       x   x   x   x   x   x
8       x   x   x   x   x   x
10      x   x   x   x   x   x

Join condition: S X T

Page 21: Theta join (M-bucket-I algorithm explained)

Cross Product

• Since all entries of the join matrix are true, the maximum-reducer-output (MRO) over r reducers satisfies $\mathrm{MRO} \ge |S||T|/r$. (Otherwise, there would be cells not mapped to a reducer.)

• Along with Lemma 1, we have a lower bound for the maximum-reducer-input (MRI):

$$\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$$

[ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least $2\sqrt{c}$ input tuples.

Page 23: Theta join (M-bucket-I algorithm explained)

Cross Product

• We will revisit these two properties frequently to judge the quality of join mappings:

$\mathrm{MRO} \ge |S||T|/r$ and $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$

Page 24: Theta join (M-bucket-I algorithm explained)

Cross Product

Properties: $\mathrm{MRO} \ge |S||T|/r$ and $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$

Case 1: Suppose |S| and |T| are multiples of $\sqrt{|S||T|/r}$. Namely, $|S| = c_S\sqrt{|S||T|/r}$ and $|T| = c_T\sqrt{|S||T|/r}$ for integers $c_S$ and $c_T$.

Then, partitioning the join matrix with squares of size $\sqrt{|S||T|/r} \times \sqrt{|S||T|/r}$ is an optimal mapping.

Proof: Each region mapped to a reducer has output size $|S||T|/r$ and input size $2\sqrt{|S||T|/r}$, so both lower bounds are met.

Page 29: Theta join (M-bucket-I algorithm explained)

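Cross Product

Properties: $\mathrm{MRO} \ge |S||T|/r$ and $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$

[Figure: the 6 x 6 join matrix (S rows 5, 6, 6, 8, 8, 10; T columns 5, 5, 6, 8, 8, 10) partitioned into 2 x 2 squares, one per reducer]

Suppose |S| = |T| = 6 and r = 9

MRO = 4 = $|S||T|/r$

MRI = 4 = $2\sqrt{|S||T|/r}$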
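The worked numbers for |S| = |T| = 6 and r = 9 can be reproduced with a few lines. This is a sketch that only handles Case 1 (both sizes are multiples of the square side); the function names are made up for this example.

```python
import math


def cross_product_bounds(s_size, t_size, r):
    """Lower bounds (MRO, MRI) for a cross product on r reducers."""
    mro_lb = s_size * t_size / r
    mri_lb = 2 * math.sqrt(s_size * t_size / r)
    return mro_lb, mri_lb


def square_tiling(s_size, t_size, r):
    """Tile the matrix with squares of side sqrt(|S||T|/r) (Case 1 only).

    Returns the side length and the top-left (row, col) corner of each square.
    """
    side = math.isqrt(s_size * t_size // r)
    # Case 1 assumptions: the side is an integer that divides both sizes.
    assert side * side * r == s_size * t_size
    assert s_size % side == 0 and t_size % side == 0
    corners = [(row, col)
               for row in range(0, s_size, side)
               for col in range(0, t_size, side)]
    return side, corners


side, corners = square_tiling(6, 6, 9)
print("side:", side, "regions:", len(corners))
print("per-region input:", 2 * side, "per-region output:", side * side)
print("lower bounds (MRO, MRI):", cross_product_bounds(6, 6, 9))
```

Each square receives `side` rows and `side` columns, so its input is 2 * side and its output is side squared, which coincide with the two lower bounds here.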

Page 31: Theta join (M-bucket-I algorithm explained)

Cross Product

Properties: $\mathrm{MRO} \ge |S||T|/r$ and $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$

Case 2: Suppose the cardinality of one dataset is significantly greater than that of the other (WLOG, assume $|S| < |T|/r$). Then, the rectangle cover with rectangles of size $|S| \times |T|/r$ is the optimal mapping.

(e.g., |S| = 3, |T| = 20, r = 5)

Page 32: Theta join (M-bucket-I algorithm explained)

Cross Product

Properties: $\mathrm{MRO} \ge |S||T|/r$ and $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$

Case 3: The remaining case, where $|T|/r \le |S| \le |T|$.

Let $c_S = \left\lfloor |S| / \sqrt{|S||T|/r} \right\rfloor$ and $c_T = \left\lfloor |T| / \sqrt{|S||T|/r} \right\rfloor$.

Then, covering M with squares of size $\sqrt{|S||T|/r} \times \sqrt{|S||T|/r}$ is a mapping worse than an optimal mapping by a factor no greater than 4.

Page 33: Theta join (M-bucket-I algorithm explained)

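Cross Product

Properties: $\mathrm{MRO} \ge |S||T|/r$ and $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$

If |S| and/or |T| is not a multiple of $\sqrt{|S||T|/r}$, scale each side by $\left(1 + \frac{1}{c_S}\right)$ and/or $\left(1 + \frac{1}{c_T}\right)$ respectively to cover M, where $c_S = \left\lfloor |S| / \sqrt{|S||T|/r} \right\rfloor$ and $c_T = \left\lfloor |T| / \sqrt{|S||T|/r} \right\rfloor$. Given $|T|/r \le |S| \le |T|$, we have $c_S \ge 1$ and $c_T \ge 1$, so we see that

$$\left(1 + \frac{1}{c_S}\right)\sqrt{|S||T|/r} \le 2\sqrt{|S||T|/r}$$

and likewise for $c_T$.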

Page 34: Theta join (M-bucket-I algorithm explained)

Cross Product

Properties: $\mathrm{MRO} \ge |S||T|/r$ and $\mathrm{MRI} \ge 2\sqrt{|S||T|/r}$

Hence, $\mathrm{MRI} \le 4\sqrt{|S||T|/r}$ and $\mathrm{MRO} \le 4|S||T|/r$.

Comparing these with the lower bounds given above, we see that the MRO and MRI produced by this mapping are at most 4 times (twice for MRI) the lower bounds.
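The Case 3 construction can be tried on concrete sizes. The sketch below is one possible reading of the scaling step: it splits the rows into c_S equal groups and the columns into c_T equal groups, rounding region sides up to whole tuples. The function name, the returned dictionary, and the sizes 10, 20, 6 are all assumptions made for illustration.

```python
import math


def scaled_square_cover(s_size, t_size, r):
    """Cover an |S| x |T| cross product with c_S * c_T near-square regions."""
    if not (t_size / r <= s_size <= t_size):
        raise ValueError("Case 3 requires |T|/r <= |S| <= |T|")
    opt_side = math.sqrt(s_size * t_size / r)   # side of the ideal square
    c_s = math.floor(s_size / opt_side)         # squares that fit along S
    c_t = math.floor(t_size / opt_side)         # squares that fit along T
    height = math.ceil(s_size / c_s)            # rows per region after scaling
    width = math.ceil(t_size / c_t)             # columns per region after scaling
    return {
        "regions": c_s * c_t,
        "mri": height + width,
        "mro": height * width,
        "mri_upper": 4 * opt_side,
        "mro_upper": 4 * s_size * t_size / r,
    }


cover = scaled_square_cover(10, 20, 6)
print(cover)
```

For these sizes the cover uses 3 regions (no more than r = 6) and stays within the stated upper bounds; because of the rounding to whole tuples, a sketch like this should be checked rather than assumed to meet the bound for every input.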

Page 35: Theta join (M-bucket-I algorithm explained)

Implementation

• Now we know how to (nearly) optimally partition the join matrix. So let’s run it!

• However, when a mapper is given a record (from either S or T), it does NOT have enough information to tell where exactly in the dataset (in which row/column) the record belongs.

• We could run another pre-processing step to get that information, but this can be avoided by running a randomized algorithm!

Page 39: Theta join (M-bucket-I algorithm explained)

Mapping & Randomized Algorithm

Algorithm 1 : Map (Theta-Join)
Input : input tuple x ∈ S ∪ T
if x ∈ S then
    matrixRow = random(1, |S|)
    for all regionID in lookup.getRegions(matrixRow) do
        Output (regionID, (x, “S”))
else
    matrixCol = random(1, |T|)
    for all regionID in lookup.getRegions(matrixCol) do
        Output (regionID, (x, “T”))

1. Given a record x (WLOG x ∈ S)
2. Pick a row uniformly at random
3. Get all the regions intersecting that row and output (regID, (x, “S”)) for each of them

Page 40: Theta join (M-bucket-I algorithm explained)

Mapping & Randomized Algorithm

Join condition: S.value = T.value

S1.A = 5 S2.A = 7 S3.A = 7 S4.A = 8 S5.A = 9 S6.A = 9
T1.A = 5 T2.A = 7 T3.A = 7 T4.A = 7 T5.A = 8 T6.A = 9

[Figure: 6 x 6 join matrix (S rows 5, 7, 7, 8, 9, 9; T columns 5, 7, 7, 7, 8, 9) covered by regions (1), (2), (3)]

Map

Input Tuple   Random Row/Col   Output
S1            3                (1,S1) (2,S1)
S2            5                (3,S2)
S3            1                (1,S3) (2,S3)
S4            5                (3,S4)
S5            1                (1,S5) (2,S5)
S6            2                (1,S6) (2,S6)
T1            6                (2,T1) (3,T1)
T2            2                (1,T2) (3,T2)
T3            2                (1,T3) (3,T3)
T4            3                (1,T4) (3,T4)
T5            6                (2,T5) (3,T5)
T6            4                (2,T6) (3,T6)

Reduce

Reducer 1 : key 1 (regID)  Input: S1, S3, S5, S6, T2, T3, T4  Output: (S3,T2) (S3,T3) (S3,T4)
Reducer 2 : key 2 (regID)  Input: S1, S3, S5, S6, T1, T5, T6  Output: (S1,T1) (S5,T6) (S6,T6)
Reducer 3 : key 3 (regID)  Input: S2, S4, T1, T2, T3, T4, T5, T6  Output: (S2,T2) (S2,T3) (S2,T4) (S4,T5)
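The example above can be replayed in a few lines of Python. This is a sketch, not the original implementation: the random draws are fixed to the values shown on the slide, and the region layout (regions 1 and 2 over rows 1 to 4, region 3 over rows 5 and 6; regions 1 and 3 over columns 1 to 3, regions 2 and 3 over columns 4 to 6) is an assumption that is consistent with the map output listed, since row 4 and columns 1 and 5 are never drawn.

```python
from collections import defaultdict

S = {"S1": 5, "S2": 7, "S3": 7, "S4": 8, "S5": 9, "S6": 9}
T = {"T1": 5, "T2": 7, "T3": 7, "T4": 7, "T5": 8, "T6": 9}

# Assumed lookup tables: which regions intersect each matrix row / column.
ROW_REGIONS = {1: [1, 2], 2: [1, 2], 3: [1, 2], 4: [1, 2], 5: [3], 6: [3]}
COL_REGIONS = {1: [1, 3], 2: [1, 3], 3: [1, 3], 4: [2, 3], 5: [2, 3], 6: [2, 3]}

# Draws copied from the slide; a real mapper would use random.randint(1, 6).
ROW_DRAWS = {"S1": 3, "S2": 5, "S3": 1, "S4": 5, "S5": 1, "S6": 2}
COL_DRAWS = {"T1": 6, "T2": 2, "T3": 2, "T4": 3, "T5": 6, "T6": 4}


def map_phase():
    """Send each tuple to every region intersecting its drawn row/column."""
    shuffled = defaultdict(lambda: {"S": [], "T": []})
    for sid, row in ROW_DRAWS.items():
        for region in ROW_REGIONS[row]:
            shuffled[region]["S"].append(sid)
    for tid, col in COL_DRAWS.items():
        for region in COL_REGIONS[col]:
            shuffled[region]["T"].append(tid)
    return shuffled


def reduce_phase(shuffled, theta):
    """Each reducer joins its local S and T tuples with the join condition."""
    return {
        region: [(s, t) for s in bucket["S"] for t in bucket["T"]
                 if theta(S[s], T[t])]
        for region, bucket in shuffled.items()
    }


shuffled = map_phase()
output = reduce_phase(shuffled, lambda s, t: s == t)
for region in sorted(output):
    print(region, output[region])
```

Because every S tuple lands in a full row of regions and every T tuple in a full column, each (S, T) pair meets in exactly one region, so no result is produced twice.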

Page 41: Theta join (M-bucket-I algorithm explained)

Cross Product… NOT!

• We have verified that the 1 Bucket Theta algorithm is close to optimal when the join is a cross product.

• How does the 1 Bucket Theta algorithm perform when the join is NOT a cross product?

• We will compare the quality of the 1 Bucket Theta algorithm (1BT) to that of any join algorithm.

Page 44: Theta join (M-bucket-I algorithm explained)

1BT vs ANY join algorithm

Let $1 \ge x > 0$. Any matrix-to-reducer mapping that has to cover at least $x|S||T|$ of the $|S||T|$ cells of the join matrix has, by Lemma 1, $\mathrm{MRI} \ge 2\sqrt{x|S||T|/r}$.

[ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least $2\sqrt{c}$ input tuples.

As we have seen, 1BT guarantees that $\mathrm{MRI} \le 4\sqrt{|S||T|/r}$. Hence,

$$\frac{\mathrm{MRI}_{\mathrm{1BT}}}{\mathrm{MRI}_{\mathrm{AnyJoinAlg}}} \le \frac{4\sqrt{|S||T|/r}}{2\sqrt{x|S||T|/r}} = \frac{2}{\sqrt{x}}$$

Page 46: Theta join (M-bucket-I algorithm explained)

1BT vs ANY join algorithm

When $x = 0.5$, the ratio is $2/\sqrt{0.5} \approx 2.83 < 3$. Hence, compared to ANY join algorithm that assigns at least 50% of the matrix cells to reducers, the MRI for 1BT is at most 3 times the MRI of that algorithm.

Page 48: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

• In the previous slide, we saw that mapping smaller regions, instead of covering the entire matrix, would yield a better MRI.

• Ideally, we only want to map the cells satisfying the join condition, but this cannot be done without knowing input statistics and/or the join condition.

• M-Bucket-I exploits input statistics to improve over the 1 Bucket Theta join algorithm.

Page 51: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 1 ] Approximate Equi-Depth Histograms

1) Sample each record of S with probability n/|S|, which yields approximately n records
2) Build k-quantiles (k buckets) from the sample, where k < n
3) Iterate through S and count the number of records in each bucket
4) Do the same for T and build the join matrix accordingly

Page 56: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 1 ] Approximate Equi-Depth Histograms

Dataset S              Dataset T

S_id  Value            T_id  Value
1     7                1     5
2     2                2     5
3     4                3     6
4     2                4     8
5     1                5     8
6     9                6     10
7     10               7     2
8     2                8     4
9     5                9     1
10    3                10    3

Samples

Sample S: 7, 2, 2, 9, 2, 3
Sample T: 5, 6, 8, 2, 1, 3

Buckets

S  bucket boundaries: 0, 2, 3, 9   records per bucket: 4, 1, 4, 1
T  bucket boundaries: 0, 1, 5, 8   records per bucket: 1, 5, 3, 1
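The sampling and counting steps can be sketched as follows. The bucket boundaries are taken from the slide rather than recomputed, each bucket is assumed to include its upper boundary (the only reading that reproduces the counts shown), and the helper names are hypothetical.

```python
import random
from bisect import bisect_left

S_VALUES = [7, 2, 4, 2, 1, 9, 10, 2, 5, 3]
T_VALUES = [5, 5, 6, 8, 8, 10, 2, 4, 1, 3]


def bernoulli_sample(values, n, rng):
    """Keep each record independently with probability n / len(values)."""
    p = n / len(values)
    return [v for v in values if rng.random() < p]


def bucket_counts(values, boundaries):
    """Count records per bucket; a value equal to a boundary falls in the
    bucket that ends at that boundary (assumed convention)."""
    counts = [0] * (len(boundaries) + 1)
    for v in values:
        counts[bisect_left(boundaries, v)] += 1
    return counts


sample = bernoulli_sample(S_VALUES, 6, random.Random(0))
print("sample of S:", sample)
print("S counts:", bucket_counts(S_VALUES, [2, 3, 9]))
print("T counts:", bucket_counts(T_VALUES, [1, 5, 8]))
```

The sample only decides where the boundaries go; the counts come from a full pass over the data, so they are exact even though the quantiles are approximate.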

Page 59: Theta join (M-bucket-I algorithm explained)

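M-Bucket-I

[ Step 1 ] Approximate Equi-Depth Histograms

[Figure: join matrix M over the ten S tuples and ten T tuples, divided by the S bucket boundaries 2, 3, 9 and the T bucket boundaries 1, 5, 8, with the candidate cells highlighted]

Join condition: S.value = T.value

We now have candidate cells. How do we map these cells to reducers?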
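One way to derive candidate cells from the histograms is to keep every pair of buckets whose value ranges can contain equal values. The sketch below assumes half-open bucket ranges (lo, hi] with unbounded first and last buckets; the transcript does not show which cells the original figure highlights, so this is an illustration of the idea rather than a reproduction of the slide.

```python
import math


def bucket_ranges(boundaries):
    """Turn boundaries [b1, b2, ...] into ranges (lo, hi], open-ended at both ends."""
    edges = [-math.inf, *boundaries, math.inf]
    return list(zip(edges[:-1], edges[1:]))


def candidate_bucket_pairs(s_bounds, t_bounds):
    """Bucket pairs (i, j) whose ranges overlap, i.e. may satisfy S.value = T.value."""
    s_ranges = bucket_ranges(s_bounds)
    t_ranges = bucket_ranges(t_bounds)
    return [
        (i, j)
        for i, (s_lo, s_hi) in enumerate(s_ranges)
        for j, (t_lo, t_hi) in enumerate(t_ranges)
        if max(s_lo, t_lo) < min(s_hi, t_hi)
    ]


pairs = candidate_bucket_pairs([2, 3, 9], [1, 5, 8])
print(pairs)
```

Bucket pairs that are not returned cannot contain a matching tuple pair, so the corresponding cells of M never need to be assigned to a reducer.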

Page 60: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 2 ] M-Bucket-I Algorithm

Algorithm : M-Bucket-I
Input : maxInput, r, M
row = 0
while row < M.noOfRows do
    (row, r) = CoverSubMatrix(row, maxInput, r, M)
    if r < 0 then
        return false
return true

Page 61: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 2 ] M-Bucket-I Algorithm

Algorithm : CoverSubMatrix
Input : row_s, maxInput, r, M
maxScore = -1, rUsed = 0
for i = 1 to maxInput-1 do
    R_i = CoverRows(row_s, row_s + i, maxInput, M)
    area = totalCandidateArea(row_s, row_s + i, M)
    score = area / R_i.size
    if score >= maxScore then
        maxScore = score    (remember the best score seen so far)
        bestRow = row_s + i
        rUsed = R_i.size
r = r - rUsed
return (bestRow + 1, r)

Page 62: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 2 ] M-Bucket-I Algorithm

Algorithm : CoverRows
Input : row_f, row_l, maxInput, M
Regions = ∅; r = newRegion()
for all c_i in M.getColumns do
    if r.cap < c_i.candidateInputCosts then
        Regions = Regions ∪ r
        r = newRegion()
    r.Cells = r.Cells ∪ c_i.candidateCells
return Regions

Page 68: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 2 ] M-Bucket-I Algorithm

Run the algorithm with r = 6, maxInput = 5

row : 0  score : 4
row : 1  score : 13/3 = 4.33
row : 2  score : 22/4 = 5.5
row : 3  score : 31/7 = 4.43

We choose the mapping with the highest score (row 2, covered by regions (1) to (4))!

Page 69: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 2 ] M-Bucket-I Algorithm

Run the algorithm with r = 6, maxInput = 5

row : 3  score : 3

[Figure: regions (1) to (4) already placed] So on and so forth…

Page 70: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 2 ] M-Bucket-I Algorithm

Run the algorithm with r = 6, maxInput = 5

Final mapping!

[Figure: candidate cells covered by 13 regions, numbered (1) to (13)]

Page 71: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 2 ] M-Bucket-I Algorithm

Run the algorithm with r = 6, maxInput = 5

[Figure: candidate cells covered by 13 regions, numbered (1) to (13)]

However, we have mapped the candidate cells to more than r reducers. We do a binary search over maxInput until we find a mapping that uses at most r reducers.

Page 74: Theta join (M-bucket-I algorithm explained)

M-Bucket-I

[ Step 3 ] Binary Search

MaxInput = |S|+|T| = 20   Num.Reducers = 1
MaxInput = 5              Num.Reducers = 13
MaxInput = 12             Num.Reducers = 3
MaxInput = 8              Num.Reducers = 5

Since 7 reducers are required when MaxInput = 7, which is more than r = 6, we stop the binary search here and output the mapping with MRI = 8.
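The search over maxInput can be sketched as an ordinary binary search, assuming the number of reducers needed never increases as maxInput grows. The reducer counts for 20, 12, 8, 7 and 5 are the ones quoted on the slides; the count for maxInput = 6 is NOT in the source and is a made-up placeholder so that the sketch can run. The order of probes may differ from the one shown on the slide.

```python
def smallest_feasible_max_input(lo, hi, r, reducers_needed):
    """Smallest maxInput in (lo, hi] whose mapping uses at most r reducers.

    Assumes lo is infeasible, hi is feasible, and reducers_needed is
    non-increasing in maxInput.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reducers_needed(mid) <= r:
            hi = mid   # feasible: try a smaller maxInput
        else:
            lo = mid   # infeasible: maxInput must grow
    return hi


OBSERVED = {20: 1, 12: 3, 8: 5, 7: 7, 5: 13}   # values quoted on the slides
ASSUMED = {6: 9}                                # hypothetical filler value


def reducers_needed(max_input):
    """Stand-in for running M-Bucket-I with the given maxInput."""
    return {**OBSERVED, **ASSUMED}[max_input]


best = smallest_feasible_max_input(5, 20, 6, reducers_needed)
print("chosen maxInput:", best)
```

In a full implementation `reducers_needed` would run the M-Bucket-I heuristic on the join matrix and report how many regions it created.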

Page 75: Theta join (M-bucket-I algorithm explained)

Performance

Skew Resistance of 1 Bucket Theta

                                     1 Bucket Theta                     Standard Equi Join
Data set      Output size (billion)  Output Imbalance  Runtime (secs)   Output Imbalance  Runtime (secs)
Synth - 0     25.00                  1.0030            657              1.0124            701
Synth - 0.4   24.99                  1.0023            650              1.2541            722
Synth - 0.6   24.98                  1.0033            676              1.7780            923
Synth - 0.8   24.95                  1.0068            678              3.0103            1482
Synth - 1     24.91                  1.0089            667              5.3124            2489

(The data sets become more skewed from Synth - 0 down to Synth - 1.)

Where Output Imbalance = MRI / Ave.RI

Page 77: Theta join (M-bucket-I algorithm explained)

Performance

M-Bucket-I cost details (seconds), by number of buckets

Step        1          10         100       1000    10,000   100,000   1,000,000
Quantiles   0          115        120       117     122      124       122
Histogram   0          140        145       147     157      167       604
Heuristic   74.01      9.21       0.84      1.50    16.67    118.03    111.27
Join        49384      10905      1157      595     548      540       536
Total       49,458.01  11,169.21  1,422.84  860.5   843.67   949.03    1,373.27