Efficient Parallel kNN Joins for Large Data in...

108
Efficient Parallel kNN Joins for Large Data in MapReduce Chi Zhang 1 Feifei Li 2 Jeffrey Jestes 2 1 Dept of Computer Science 2 School of Computing Florida State University University of Utah April 4, 2012

Transcript of Efficient Parallel kNN Joins for Large Data in...

Page 1: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Efficient Parallel kNN Joins for Large Data inMapReduce

Chi Zhang1 Feifei Li2 Jeffrey Jestes2

1Dept of Computer Science 2School of ComputingFlorida State University University of Utah

April 4, 2012

Page 2: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Outline

1 Introduction

2 Background: kNN Join

3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join

4 Experiments

5 Conclusions

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 3: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Outline

1 Introduction

2 Background: kNN Join

3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join

4 Experiments

5 Conclusions

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 4: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

k Nearest Neighbor Join

k nearest neighbor join (kNN join)

Given two data sets R and S , for every point q in R, kNN joinreturns k nearest points of q from S .

q

Point in R Point in S

Numerous applications: knowledge discovery, data mining, spatialdatabases, multimedia databases, etc.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 5: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

k Nearest Neighbor Join

k nearest neighbor join (kNN join)

Given two data sets R and S , for every point q in R, kNN joinreturns k nearest points of q from S .

q

p3

(q, p1)

(q, p3)

(q, p4)

3-NN join for qp1

p4

Point in R Point in S

Numerous applications: knowledge discovery, data mining, spatialdatabases, multimedia databases, etc.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 6: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

k Nearest Neighbor Join

k nearest neighbor join (kNN join)

Given two data sets R and S , for every point q in R, kNN joinreturns k nearest points of q from S .

Point in R Point in S

Find kNN in S for all points in R

Numerous applications: knowledge discovery, data mining, spatialdatabases, multimedia databases, etc.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 7: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

k Nearest Neighbor Join

k nearest neighbor join (kNN join)

Given two data sets R and S , for every point q in R, kNN joinreturns k nearest points of q from S .

Point in R Point in S

Find kNN in S for all points in R

Numerous applications: knowledge discovery, data mining, spatialdatabases, multimedia databases, etc.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 8: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Data Growth

Source: IDC

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 9: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Rise of Distributed and Parallel Computing

Data sets are growing at an exponential rate.

A single machine cannot handle large data efficiently.Parallel and distributed computing is the trend.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 10: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Rise of Distributed and Parallel Computing

Data sets are growing at an exponential rate.

A single machine cannot handle large data efficiently.Parallel and distributed computing is the trend.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 11: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Rise of Distributed and Parallel Computing

Challenges:

Minimize communication and computation.Achieve good load balance.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 12: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Outline

1 Introduction

2 Background: kNN Join

3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join

4 Experiments

5 Conclusions

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 13: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

kNN Join

Exact kNN Join

knn(r , S) = set of kNN of r from S .knnJ(R, S) = {(r , knn(r , S))| for all r ∈ R}.

Approximate kNN Joinaknn(r , S) = approximate kNN of r from S .

p = kth NN of r in knn(r , S).p′ = kth NN for r in aknn(r , S)aknn(r , S) is a c-approximation ofknn(r , S) : d(r , p) ≤ d(r , p′) ≤ c · d(r , p).

aknnJ(R,S) = {(r , aknn(r ,S))|∀r ∈ R}.

r

p

Point in R Point in S

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 14: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

kNN Join

Exact kNN Join

knn(r , S) = set of kNN of r from S .knnJ(R, S) = {(r , knn(r , S))| for all r ∈ R}.

Approximate kNN Joinaknn(r , S) = approximate kNN of r from S .

p = kth NN of r in knn(r , S).p′ = kth NN for r in aknn(r , S)aknn(r , S) is a c-approximation ofknn(r , S) : d(r , p) ≤ d(r , p′) ≤ c · d(r , p).

aknnJ(R,S) = {(r , aknn(r ,S))|∀r ∈ R}.

r

p

p′

Point in R Point in S

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 15: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Outline

1 Introduction

2 Background: kNN Join

3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join

4 Experiments

5 Conclusions

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 16: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method

1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks3 Get global kNN results from n local kNN results for every record in R

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 17: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.

2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 18: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks

3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 19: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks

3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 20: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks

3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 21: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks

3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2

BNLJ(R1, S1)

BNLJ(R1, S2)

BNLJ(R2, S1)

BNLJ(R2, S2)

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 22: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2

BNLJ(R1, S1)

BNLJ(R1, S2)

BNLJ(R2, S1)

BNLJ(R2, S2)

BNLJ(R1, S)

BNLJ(R2, S)

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 23: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Exact kNN join: Block Nested Loop Join

Two-round MapReduce algorithm: Round 1

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

4

4

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 24: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Exact kNN join: Block Nested Loop Join

Two-round MapReduce algorithm: Round 1

DFS

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

4

4

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 25: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Exact kNN join: Block Nested Loop Join

Two-round MapReduce algorithm: Round 2

(r1, s1, d1,1)...

File 1

File 2

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

Mapper

partition by record ids

Mapper

(r1, s1, d1,1)

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

...

...

...

...

...

...

...

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 26: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Exact kNN join: Block Nested Loop Join

Two-round MapReduce algorithm: Round 2

(r1, s1, d1,1)...

File 1

File 2

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

Mapper

partition by record ids

Mapper

(r1, s1, d1,1)

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

Shuffle

(r1, s1, d1,1)

(r1, s7, d1,7)

(r3, s1, d3,1)

(r3, s5, d3,5)

sort list(s, d(r, s))get top k(= 2) results for r

Reducer

Reducer

DFS

(r1, s1, d1,1)(r1, s7, d1,7)

(r3, s5, d3,5)(r3, s6, d3,6)

DFS...

...

...

...

...

...

...

...

...

...

...

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 27: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Exact kNN join: Block R-tree Join

Use spatial index (R-tree) to improve performance

Build R-tree index for a block of S in a bucket to speed up kNNcomputations.Similar to BNLJ algorithm, only need to replace BNLJ with blockR-tree join (BRJ) in the first round.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 28: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Exact kNN join: Block R-tree Join

Use spatial index (R-tree) to improve performance

Build R-tree index for a block of S in a bucket to speed up kNNcomputations.Similar to BNLJ algorithm, only need to replace BNLJ with blockR-tree join (BRJ) in the first round.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 29: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Exact kNN join: Block R-tree Join

Use spatial index (R-tree) to improve performance

Build R-tree index for a block of S in a bucket to speed up kNNcomputations.Similar to BNLJ algorithm, only need to replace BNLJ with blockR-tree join (BRJ) in the first round.

DFS

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

4

4

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 30: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Exact kNN join: Block R-tree Join

Use spatial index (R-tree) to improve performance

Build R-tree index for a block of S in a bucket to speed up kNNcomputations.Similar to BNLJ algorithm, only need to replace BNLJ with blockR-tree join (BRJ) in the first round.

DFS

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

BRJ

4

4

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 31: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Outline

1 Introduction

2 Background: kNN Join

3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join

4 Experiments

5 Conclusions

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 32: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join

Problems with exact kNN join solution

Too much communication and computation (n2 buckets required)Find solution requiring O(n) buckets.

We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)

DFS

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

BRJ

4

4

[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. ICDE, 2010.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 33: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join

Problems with exact kNN join solutionToo much communication and computation (n2 buckets required)

Find solution requiring O(n) buckets.

We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)

DFS

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

BRJn2 buckets required, too much cost.

4

4

[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. ICDE, 2010.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 34: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join

Problems with exact kNN join solutionToo much communication and computation (n2 buckets required)

Find solution requiring O(n) buckets.

We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)

DFS

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

BRJn2 buckets required, too much cost.

4

4

[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. ICDE, 2010.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 35: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join

Problems with exact kNN join solutionToo much communication and computation (n2 buckets required)

Find solution requiring O(n) buckets.We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)

DFS

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

BRJn2 buckets required, too much cost.

4

4

[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. ICDE, 2010.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 36: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 37: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.

p3

p1

p5

p6

: points in P

p2

p4

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 38: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.

p3

p1

p5

p6

: points in P

p2

p4

pi,2

pi,6

pi,1

pi,5pi,3

pi,4

: points in Pi

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 39: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.

p3

p1

p5

p6

: points in P

p2

p4

pi,2

pi,6

pi,1

pi,5pi,3

pi,4

zi,5zi,1zi,3

zi,4

zi,2zi,6

: points in Pi

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 40: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.

p3

p1

p5

p6

: points in P

p2

p4

pi,2

pi,6

pi,1

pi,5pi,3

pi,4

zi,5zi,1zi,3

zi,4

zi,2zi,6

: points in Pi

ZPi

qi=q + vi

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 41: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.

p3

p1

p5

p6

: points in P

p2

p4

pi,2

pi,6

pi,1

pi,5pi,3

pi,4

zi,5zi,1zi,3

zi,4

zi,2zi,6

: points in Pi

ZPizqi

z−(zqi, k, Pi) z+(zqi, k, Pi)

qi=q + vi

B+-tree

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 42: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.

p3

p1

p5

p6

: points in P

p2

p4

pi,2

pi,6

pi,1

pi,5pi,3

pi,4

zi,5zi,1zi,3

zi,4

zi,2zi,6

: points in Pi

ZPizqi

z−(zqi, k, Pi) z+(zqi, k, Pi)

Ci(q)

qi=q + vi

B+-tree

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 43: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: Z-order kNN join

In our group’s previous work we derive the following guarantee forthe zkNN join:

Theorem

Given a query point q ∈ Rd , a data set P ⊂ Rd , and a small constantα ∈ Z+. We generate (α− 1) random vectors {v2, . . . , vα}, such that forany i , vi ∈ Rd , and shift P by these vectors to obtain {P1, . . . ,Pα}(P1 = P). Then, the zkNN join returns a constant approximation in anyfixed dimension for knn(q,P) in expectation.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 44: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 45: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 46: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

ZSi

ZRi

zi,1 zi,2

Ri,1 Ri,2 Ri,3

Si,2

zr

Si,1

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 47: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

ZSi

ZRi

zi,1 zi,2

Ri,1 Ri,2 Ri,3

Si,2

zr

? ?Si,1

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 48: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

ZSi

ZRi

zi,1 zi,2

Ri,1 Ri,2 Ri,3

Si,1 Si,2 Si,3

zr

zr

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 49: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

ZSi

ZRi

zi,1 zi,2

Ri,1 Ri,2 Ri,3

Si,1 Si,2 Si,3

zr

Ci(r)

zr

small neighborhoodsearch!!!

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 50: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

ZSi

zr

Si,1 Si,2 Si,4

Ri,1 Ri,2 Ri,4 ZRi

zi,1 zi,2

Si,3

zi,3

Ri,3

zr

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 51: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

ZSi

zr

Si,1 Si,2 Si,4

Ri,1 Ri,2 Ri,4 ZRi

zi,1 zi,2

Si,3

zi,3

Ri,3

zr

Ci(r)

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 52: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

ZSi

zr

Si,1 Si,2 Si,4

Ri,1 Ri,2 Ri,4 ZRi

zi,1 zi,2

Si,3

zi,3

Ri,3

zr

Ci(r)

copy copy

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 53: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

Choice of partitioning values.

Each block of Ri and Si shares the same boundary so we only searcha small neighborhood and minimize communication.Goal: load balance.

Evenly partition Ri or Si .

Evenly partition Ri → O( |Ri |n

log |Si |)Evenly partition Si → O(|Ri |log |Si |)

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 54: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

Choice of partitioning values.

Each block of Ri and Si shares the same boundary so we only searcha small neighborhood and minimize communication.Goal: load balance.Evenly partition Ri or Si .

Evenly partition Ri → O( |Ri |n

log |Si |)Evenly partition Si → O(|Ri |log |Si |)

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 55: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

Choice of partitioning values.

Each block of Ri and Si shares the same boundary so we only searcha small neighborhood and minimize communication.Goal: load balance.Evenly partition Ri or Si .

Evenly partition Ri → O( |Ri |n

log |Si |)Evenly partition Si → O(|Ri |log |Si |)

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 56: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

Computation of partitioning values.

Quantiles can be used for evenly partitioning a data set D.Sort a data set D and retrieve its (n − 1) quantiles (expensive).

We propose sampling based method to estimate quantiles.

We proved that both estimations are close enough (within εN) to theoriginal ranks with a high probability (1-e−2/ε).

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 57: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

Computation of partitioning values.

Quantiles can be used for evenly partitioning a data set D.Sort a data set D and retrieve its (n − 1) quantiles (expensive).

We propose sampling based method to estimate quantiles.

We proved that both estimations are close enough (within εN) to theoriginal ranks with a high probability (1-e−2/ε).

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 58: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.

Round 1: construct random shift copies for R and S , Ri and Si , i ∈ [1, α],and generate partitioning values for Ri and Si

R

S

shift by vicompute z-value

Ri

Si

ith shift

ith shift

DFS

DFS

sample

sample of ith shiftRi

sample of ith shiftSi

Map

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 59: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.

Round 1: construct random shift copies for R and S , Ri and Si , i ∈ [1, α],and generate partitioning values for Ri and Si

R

S

shift by vicompute z-value

Ri

Si

ith shift

ith shift

DFS

DFS

sample

sample of ith shiftRi

sample of ith shiftSi

Map

shuffle

sort&

Ri

Si

estimator 1

estimator 2

DFS

Reduce

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 60: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.

Round 2: partition Ri and Si into blocks and compute the candidatepoints for knn(r , S) for any r ∈ R.

Ri

Si

partition by Si’s ranges

partition by Ri’s ranges

Ri,1 Ri,2 Ri,n. . .

block 1block 2 block n

Si,1 Si,2 Si,n. . .

block 1block 2 block n

Map

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 61: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.

Round 2: partition Ri and Si into blocks and compute the candidatepoints for knn(r , S) for any r ∈ R.

Ri

Si

partition by Si’s ranges

partition by Ri’s ranges

Ri,1 Ri,2 Ri,n. . .

block 1block 2 block n

Si,1 Si,2 Si,n. . .

block 1block 2 block n

Map

shuffle&

sort

Ri,n

Si,n

Ri,2

Si,2

Ri,1

Si,1

. . .

B+-Tree

B+-Tree

B+-Tree

Retrieve Ci(r) for all r ∈ Ri,j, j ∈ [1, n]

DFS

DFS

DFS

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 62: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: H-zkNNJ

H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.

Round 3: determine knn(r ,C(r)) of any r ∈ R from the (r ,Ci (r)) emittedby round 2.

Ri

Si

partition by Si’s ranges

partition by Ri’s ranges

Ri,1 Ri,2 Ri,n. . .

block 1block 2 block n

Si,1 Si,2 Si,n. . .

block 1block 2 block n

Map

shuffle&

sort

Ri,n

Si,n

Ri,2

Si,2

Ri,1

Si,1

. . .

B+-Tree

B+-Tree

B+-Tree

Retrieve Ci(r) for all r ∈ Ri,j, j ∈ [1, n]

DFS

DFS

DFS

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 63: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Outline

1 Introduction

2 Background: kNN Join

3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join

4 Experiments

5 Conclusions

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 64: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: algorithms

We implement the following methods in Hadoop 0.20.2:Exact Methods:

The baseline solution is denoted H-BNLJ,The improvement to the baseline solution is denoted H-BRJ.

Approximate Methods:

Our three-round solution is denoted by H-zkNNJ, (meaning ”Hadoopz-value kNN Join”).

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 65: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: setup

Experiments are performed in a heterogeneous Hadoop cluster with17 machines:

1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 6 machines with 4GB of RAM and an Intel Xeon 2GHz CPU

One is reserved for the master (running JobTracker and NameNode).

3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU

All machines are directly connected to a 1000Mbps switch.

Each slave node has 300GB hard drive space and 1GB of RAM forHadoop daemon.

The chunk size of DFS is set to 128MB.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 66: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: setup

Experiments are performed in a heterogeneous Hadoop cluster with17 machines:

1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 6 machines with 4GB of RAM and an Intel Xeon 2GHz CPU

One is reserved for the master (running JobTracker and NameNode).

3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU

All machines are directly connected to a 1000Mbps switch.

Each slave node has 300GB hard drive space and 1GB of RAM forHadoop daemon.

The chunk size of DFS is set to 128MB.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 67: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: datasets

OpenStreet Map dataset:

the road-networks for 50 states in U.S.160 million records.preprocessed to remove duplicationseach record consists of a 4 bytes integer id, two 4 bytes real typecoordinates representing latitude and longitude, and a descriptioninformation.the coordinates has a positive real domain (0,100000).stored in text format, 6.6GB.

Large synthetic Random-Cluster datasets:

data sets have varying dimensionality (up to 30).each record has a 4-byte id and float type d-dimensional coordinates.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 68: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: datasets

OpenStreet Map dataset:

the road-networks for 50 states in U.S.160 million records.preprocessed to remove duplicationseach record consists of a 4 bytes integer id, two 4 bytes real typecoordinates representing latitude and longitude, and a descriptioninformation.the coordinates has a positive real domain (0,100000).stored in text format, 6.6GB.

Large synthetic Random-Cluster datasets:

data sets have varying dimensionality (up to 30).each record has a 4-byte id and float type d-dimensional coordinates.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 69: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: configurations and defaults

Data set configurations

(MXN) represents a data set configuration containing M records ofR and N record of S (in 10s of millions).

Default values for OpenStreet dataset:

Symbol Definition Default(MXN) data set configuration (4x4)

k # of nearest neighbor 10α # of shift copies 2ε the error rate of sampling 0.003γ the physical number of machines 16

Values for R-Cluster dataset:

(2x2) is set to be the default data set configuration.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 70: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: configurations and defaults

Data set configurations

(MXN) represents a data set configuration containing M records ofR and N record of S (in 10s of millions).

Default values for OpenStreet dataset:

Symbol Definition Default(MXN) data set configuration (4x4)

k # of nearest neighbor 10α # of shift copies 2ε the error rate of sampling 0.003γ the physical number of machines 16

Values for R-Cluster dataset:

(2x2) is set to be the default data set configuration.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 71: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: configurations and defaults

Data set configurations

(MXN) represents a data set configuration containing M records ofR and N record of S (in 10s of millions).

Default values for OpenStreet dataset:

Symbol Definition Default(MXN) data set configuration (4x4)

k # of nearest neighbor 10α # of shift copies 2ε the error rate of sampling 0.003γ the physical number of machines 16

Values for R-Cluster dataset:

(2x2) is set to be the default data set configuration.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 72: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

1

1.2

1.4

1.6

10 20 40 60 80

Appro

xim

atio

n r

atio

k values

OpenStreet

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 73: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

0.6

0.7

0.8

0.9

1

10 20 40 60 80

Rec

all

(Pre

cisi

on)

k values

OpenStreet

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 74: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

1

1.2

1.4

1.6

5 10 15 20 25 30

Appro

xim

atio

n r

atio

Dimensionality

R-Cluster

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 75: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

0

0.2

0.4

0.6

0.8

1

5 10 15 20 25 30

Rec

all

(Pre

cisi

on)

Dimensionality

R-Cluster

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 76: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Running time and communication cost

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

10

20

30

40

50

4x4 6x6 8x8 12x12 16x16

Tim

e (

seconds

×10

3)

|R|×|S|:107×10

7

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 77: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Running time and communication cost

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

20

40

60

80

100

4x4 6x6 8x8 12x12 16x16

Data

shuff

led (

GB

)

|R|×|S|:107×10

7

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 78: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of d

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

103

104

105

5 10 15 20 25 30

Tim

e (s

econds)

Dimensionality

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 79: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of d

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

5

10

15

20

25

5 10 15 20 25 30

Dat

a sh

uff

led (

GB

)

Dimensionality

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 80: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Conclusions

We study efficient methods to perform kNN joins in MapReduce.

Exact (H-BRJ) and approximate (H-zkNNJ) algorithms are proposed.H-zkNNJ performs orders of magnitude better than other methodswith excellent approximation quality.

We plan to investigate kNN joins on very high dimensions in thefuture.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 81: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

The End

Thank You

Q and A

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 82: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Approximate kNN join: Z-order kNN join

zkNN algorithm

Algorithm 1: zkNN(q, P, k , α)

generate {v2, . . . , vα}, v1 =−→0 , vi is a random vector in Rd ;1

Pi = P + vi (i ∈ [1, α]; ∀p ∈ P, insert p + vi in Pi );2

for i = 1, . . . , α do3

let qi = q + vi , Ci (q) = ∅, and zqi be qi ’s z-value;4

insert z−(zqi , k,Pi ) into Ci (q);5

insert z+(zqi , k,Pi ) into Ci (q);6

for any p ∈ Ci (q), update p = p − vi ;7

C (q) =⋃α

i=1 Ci (q) = C1(q) ∪ · · · ∪ Cα(q);8

return knn(q,C (q)).9

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 83: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

1

1.2

1.4

1.6

1.8

4x4 6x6 8x8 12x12 16x16

Appro

xim

atio

n r

atio

|R|×|S|:107×10

7

OpenStreet

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 84: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

1

1.2

1.4

1.6

10 20 40 60 80

Appro

xim

atio

n r

atio

k values

OpenStreet

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 85: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

0.6

0.7

0.8

0.9

1

40x40 80x80 120x120 160x160

Rec

all

(Pre

cisi

on)

|R|×|S|:106×10

6

OpenStreet

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 86: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

0.6

0.7

0.8

0.9

1

10 20 40 60 80

Rec

all

(Pre

cisi

on)

k values

OpenStreet

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 87: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of ε

102

103

104

105

0.6 1 3 10 100

Tim

e (

seconds)

ε (×10-3

)

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 88: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of ε

102

103

0.6 1 3 10 100

Sta

ndard

devia

tion

ε (×10-3

)

R blocks

S blocks

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 89: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Evaluation of H-BNLJ

101

102

103

104

105

106

5 10 15 20 25

Tim

e (

seconds)

# Reducers

H-zkNNJ

H-BRJ

H-BNLJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 90: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Evaluation of H-BNLJ

100

101

102

103

104

H-BNLJ H-BRJ H-zkNNJ

Tim

e (s

econds)

Algorithms

Phase1Phase2Phase3

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 91: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Speedup

0

5

10

15

20

25

5 10 15 20 25

Tim

e (

seco

nd

s ×

10

3)

# Reducers

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 92: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Speedup

0

2

4

6

8

5 10 15 20 25

Sp

eed

up

# Reducers

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 93: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Running time and communication cost

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

1

2

3

4

5

4x4 6x6 8x8 12x12 16x16

Tim

e (

seconds

×10

3)

|R|×|S|:107×10

7

zPhase1zPhase2zPhase3

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 94: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Running time and communication cost

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

10

20

30

40

50

4x4 6x6 8x8 12x12 16x16

Tim

e (

seconds

×10

3)

|R|×|S|:107×10

7

RPhase1RPhase2

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 95: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Running time and communication cost

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

10

20

30

40

50

4x4 6x6 8x8 12x12 16x16

Tim

e (

seconds

×10

3)

|R|×|S|:107×10

7

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 96: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Running time and communication cost

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

20

40

60

80

100

4x4 6x6 8x8 12x12 16x16

Data

shuff

led (

GB

)

|R|×|S|:107×10

7

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 97: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of d

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

103

104

105

5 10 15 20 25 30

Tim

e (s

econds)

Dimensionality

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 98: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of d

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

5

10

15

20

25

5 10 15 20 25 30

Dat

a sh

uff

led (

GB

)

Dimensionality

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 99: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of d

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

1

1.2

1.4

1.6

5 10 15 20 25 30

Appro

xim

atio

n r

atio

Dimensionality

R-Cluster

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 100: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of d

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

0.2

0.4

0.6

0.8

1

5 10 15 20 25 30

Rec

all

(Pre

cisi

on)

Dimensionality

R-Cluster

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 101: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of k

0

1

2

3

4

10 20 40 60 80

Tim

e (

seconds

×10

3)

k values

zPhase1zPhase2zPhase3

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 102: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of k

0

4

8

12

16

10 20 40 60 80

Tim

e (

seconds

×10

3)

k values

RPhase1RPhase2

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 103: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of k

0

10

20

30

40

3 10 20 40 60 80

Tim

e (

seconds

×10

3)

k values

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 104: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of k

0

50

100

150

200

3 10 20 40 60 80

Data

shuff

led (

GB

)

k values

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 105: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of number of shifts α

103

104

105

2 3 4 5 6

Tim

e (

seconds)

α values

H-zkNNJ H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 106: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of number of shifts α

0

5

10

15

2 3 4 5 6

Data

shuff

led (

GB

)

α values

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 107: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of number of shifts α

1

1.2

1.4

1.6

2 3 4 5 6

Appro

xim

atio

n r

atio

α values

R-Cluster

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Page 108: Efficient Parallel kNN Joins for Large Data in MapReduceww2.cs.fsu.edu/~czhang/knnjedbt/knnslides.pdf · 2012-04-04 · E cient Parallel kNN Joins for Large Data in MapReduce Chi

Experiments: Effect of number of shifts α

0

0.2

0.4

0.6

0.8

1

2 3 4 5 6

Recall

(P

recis

ion)

α values

R-Cluster

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce