Efficient Parallel kNN Joins for Large Data in...

Post on 21-May-2020

4 views 0 download

Transcript of Efficient Parallel kNN Joins for Large Data in...

Efficient Parallel kNN Joins for Large Data inMapReduce

Chi Zhang1 Feifei Li2 Jeffrey Jestes2

1Dept of Computer Science 2School of ComputingFlorida State University University of Utah

April 4, 2012

Outline

1 Introduction

2 Background: kNN Join

3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join

4 Experiments

5 Conclusions

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Outline

1 Introduction

2 Background: kNN Join

3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join

4 Experiments

5 Conclusions

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

k Nearest Neighbor Join

k nearest neighbor join (kNN join)

Given two data sets R and S , for every point q in R, kNN joinreturns k nearest points of q from S .

q

Point in R Point in S

Numerous applications: knowledge discovery, data mining, spatialdatabases, multimedia databases, etc.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

k Nearest Neighbor Join

k nearest neighbor join (kNN join)

Given two data sets R and S , for every point q in R, kNN joinreturns k nearest points of q from S .

q

p3

(q, p1)

(q, p3)

(q, p4)

3-NN join for qp1

p4

Point in R Point in S

Numerous applications: knowledge discovery, data mining, spatialdatabases, multimedia databases, etc.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

k Nearest Neighbor Join

k nearest neighbor join (kNN join)

Given two data sets R and S , for every point q in R, kNN joinreturns k nearest points of q from S .

Point in R Point in S

Find kNN in S for all points in R

Numerous applications: knowledge discovery, data mining, spatialdatabases, multimedia databases, etc.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

k Nearest Neighbor Join

k nearest neighbor join (kNN join)

Given two data sets R and S , for every point q in R, kNN joinreturns k nearest points of q from S .

Point in R Point in S

Find kNN in S for all points in R

Numerous applications: knowledge discovery, data mining, spatialdatabases, multimedia databases, etc.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Data Growth

Source: IDC

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Rise of Distributed and Parallel Computing

Data sets are growing at an exponential rate.

A single machine cannot handle large data efficiently.Parallel and distributed computing is the trend.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Rise of Distributed and Parallel Computing

Data sets are growing at an exponential rate.

A single machine cannot handle large data efficiently.Parallel and distributed computing is the trend.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Rise of Distributed and Parallel Computing

Challenges:

Minimize communication and computation.Achieve good load balance.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Outline

1 Introduction

2 Background: kNN Join

3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join

4 Experiments

5 Conclusions

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

kNN Join

Exact kNN Join

knn(r , S) = set of kNN of r from S .knnJ(R, S) = {(r , knn(r , S))| for all r ∈ R}.

Approximate kNN Joinaknn(r , S) = approximate kNN of r from S .

p = kth NN of r in knn(r , S).p′ = kth NN for r in aknn(r , S)aknn(r , S) is a c-approximation ofknn(r , S) : d(r , p) ≤ d(r , p′) ≤ c · d(r , p).

aknnJ(R,S) = {(r , aknn(r ,S))|∀r ∈ R}.

r

p

Point in R Point in S

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

kNN Join

Exact kNN Join

knn(r , S) = set of kNN of r from S .knnJ(R, S) = {(r , knn(r , S))| for all r ∈ R}.

Approximate kNN Joinaknn(r , S) = approximate kNN of r from S .

p = kth NN of r in knn(r , S).p′ = kth NN for r in aknn(r , S)aknn(r , S) is a c-approximation ofknn(r , S) : d(r , p) ≤ d(r , p′) ≤ c · d(r , p).

aknnJ(R,S) = {(r , aknn(r ,S))|∀r ∈ R}.

r

p

p′

Point in R Point in S

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Outline

1 Introduction

2 Background: kNN Join

3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join

4 Experiments

5 Conclusions

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method

1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks3 Get global kNN results from n local kNN results for every record in R

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.

2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks

3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks

3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks

3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks

3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2

BNLJ(R1, S1)

BNLJ(R1, S2)

BNLJ(R2, S1)

BNLJ(R2, S2)

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2

BNLJ(R1, S1)

BNLJ(R1, S2)

BNLJ(R2, S1)

BNLJ(R2, S2)

BNLJ(R1, S)

BNLJ(R2, S)

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Exact kNN join: Block Nested Loop Join

Two-round MapReduce algorithm: Round 1

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

4

4

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Exact kNN join: Block Nested Loop Join

Two-round MapReduce algorithm: Round 1

DFS

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

4

4

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Exact kNN join: Block Nested Loop Join

Two-round MapReduce algorithm: Round 2

(r1, s1, d1,1)...

File 1

File 2

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

Mapper

partition by record ids

Mapper

(r1, s1, d1,1)

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

...

...

...

...

...

...

...

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Exact kNN join: Block Nested Loop Join

Two-round MapReduce algorithm: Round 2

(r1, s1, d1,1)...

File 1

File 2

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

Mapper

partition by record ids

Mapper

(r1, s1, d1,1)

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

Shuffle

(r1, s1, d1,1)

(r1, s7, d1,7)

(r3, s1, d3,1)

(r3, s5, d3,5)

sort list(s, d(r, s))get top k(= 2) results for r

Reducer

Reducer

DFS

(r1, s1, d1,1)(r1, s7, d1,7)

(r3, s5, d3,5)(r3, s6, d3,6)

DFS...

...

...

...

...

...

...

...

...

...

...

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Exact kNN join: Block R-tree Join

Use spatial index (R-tree) to improve performance

Build R-tree index for a block of S in a bucket to speed up kNNcomputations.Similar to BNLJ algorithm, only need to replace BNLJ with blockR-tree join (BRJ) in the first round.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Exact kNN join: Block R-tree Join

Use spatial index (R-tree) to improve performance

Build R-tree index for a block of S in a bucket to speed up kNNcomputations.Similar to BNLJ algorithm, only need to replace BNLJ with blockR-tree join (BRJ) in the first round.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Exact kNN join: Block R-tree Join

Use spatial index (R-tree) to improve performance

Build R-tree index for a block of S in a bucket to speed up kNNcomputations.Similar to BNLJ algorithm, only need to replace BNLJ with blockR-tree join (BRJ) in the first round.

DFS

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

4

4

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Exact kNN join: Block R-tree Join

Use spatial index (R-tree) to improve performance

Build R-tree index for a block of S in a bucket to speed up kNNcomputations.Similar to BNLJ algorithm, only need to replace BNLJ with blockR-tree join (BRJ) in the first round.

DFS

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

BRJ

4

4

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Outline

1 Introduction

2 Background: kNN Join

3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join

4 Experiments

5 Conclusions

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join

Problems with exact kNN join solution

Too much communication and computation (n2 buckets required)Find solution requiring O(n) buckets.

We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)

DFS

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

BRJ

4

4

[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. ICDE, 2010.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join

Problems with exact kNN join solutionToo much communication and computation (n2 buckets required)

Find solution requiring O(n) buckets.

We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)

DFS

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

BRJn2 buckets required, too much cost.

4

4

[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. ICDE, 2010.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join

Problems with exact kNN join solutionToo much communication and computation (n2 buckets required)

Find solution requiring O(n) buckets.

We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)

DFS

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

BRJn2 buckets required, too much cost.

4

4

[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. ICDE, 2010.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join

Problems with exact kNN join solutionToo much communication and computation (n2 buckets required)

Find solution requiring O(n) buckets.We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)

DFS

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

BRJn2 buckets required, too much cost.

4

4

[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. ICDE, 2010.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.

p3

p1

p5

p6

: points in P

p2

p4

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.

p3

p1

p5

p6

: points in P

p2

p4

pi,2

pi,6

pi,1

pi,5pi,3

pi,4

: points in Pi

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.

p3

p1

p5

p6

: points in P

p2

p4

pi,2

pi,6

pi,1

pi,5pi,3

pi,4

zi,5zi,1zi,3

zi,4

zi,2zi,6

: points in Pi

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.

p3

p1

p5

p6

: points in P

p2

p4

pi,2

pi,6

pi,1

pi,5pi,3

pi,4

zi,5zi,1zi,3

zi,4

zi,2zi,6

: points in Pi

ZPi

qi=q + vi

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.

p3

p1

p5

p6

: points in P

p2

p4

pi,2

pi,6

pi,1

pi,5pi,3

pi,4

zi,5zi,1zi,3

zi,4

zi,2zi,6

: points in Pi

ZPizqi

z−(zqi, k, Pi) z+(zqi, k, Pi)

qi=q + vi

B+-tree

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.

p3

p1

p5

p6

: points in P

p2

p4

pi,2

pi,6

pi,1

pi,5pi,3

pi,4

zi,5zi,1zi,3

zi,4

zi,2zi,6

: points in Pi

ZPizqi

z−(zqi, k, Pi) z+(zqi, k, Pi)

Ci(q)

qi=q + vi

B+-tree

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: Z-order kNN join

In our group’s previous work we derive the following guarantee forthe zkNN join:

Theorem

Given a query point q ∈ Rd , a data set P ⊂ Rd , and a small constantα ∈ Z+. We generate (α− 1) random vectors {v2, . . . , vα}, such that forany i , vi ∈ Rd , and shift P by these vectors to obtain {P1, . . . ,Pα}(P1 = P). Then, the zkNN join returns a constant approximation in anyfixed dimension for knn(q,P) in expectation.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

ZSi

ZRi

zi,1 zi,2

Ri,1 Ri,2 Ri,3

Si,2

zr

Si,1

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

ZSi

ZRi

zi,1 zi,2

Ri,1 Ri,2 Ri,3

Si,2

zr

? ?Si,1

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

ZSi

ZRi

zi,1 zi,2

Ri,1 Ri,2 Ri,3

Si,1 Si,2 Si,3

zr

zr

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

ZSi

ZRi

zi,1 zi,2

Ri,1 Ri,2 Ri,3

Si,1 Si,2 Si,3

zr

Ci(r)

zr

small neighborhoodsearch!!!

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

ZSi

zr

Si,1 Si,2 Si,4

Ri,1 Ri,2 Ri,4 ZRi

zi,1 zi,2

Si,3

zi,3

Ri,3

zr

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

ZSi

zr

Si,1 Si,2 Si,4

Ri,1 Ri,2 Ri,4 ZRi

zi,1 zi,2

Si,3

zi,3

Ri,3

zr

Ci(r)

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

ZSi

zr

Si,1 Si,2 Si,4

Ri,1 Ri,2 Ri,4 ZRi

zi,1 zi,2

Si,3

zi,3

Ri,3

zr

Ci(r)

copy copy

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

Choice of partitioning values.

Each block of Ri and Si shares the same boundary so we only searcha small neighborhood and minimize communication.Goal: load balance.

Evenly partition Ri or Si .

Evenly partition Ri → O( |Ri |n

log |Si |)Evenly partition Si → O(|Ri |log |Si |)

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

Choice of partitioning values.

Each block of Ri and Si shares the same boundary so we only searcha small neighborhood and minimize communication.Goal: load balance.Evenly partition Ri or Si .

Evenly partition Ri → O( |Ri |n

log |Si |)Evenly partition Si → O(|Ri |log |Si |)

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

Choice of partitioning values.

Each block of Ri and Si shares the same boundary so we only searcha small neighborhood and minimize communication.Goal: load balance.Evenly partition Ri or Si .

Evenly partition Ri → O( |Ri |n

log |Si |)Evenly partition Si → O(|Ri |log |Si |)

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

Computation of partitioning values.

Quantiles can be used for evenly partitioning a data set D.Sort a data set D and retrieve its (n − 1) quantiles (expensive).

We propose sampling based method to estimate quantiles.

We proved that both estimations are close enough (within εN) to theoriginal ranks with a high probability (1-e−2/ε).

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

Computation of partitioning values.

Quantiles can be used for evenly partitioning a data set D.Sort a data set D and retrieve its (n − 1) quantiles (expensive).

We propose sampling based method to estimate quantiles.

We proved that both estimations are close enough (within εN) to theoriginal ranks with a high probability (1-e−2/ε).

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.

Round 1: construct random shift copies for R and S , Ri and Si , i ∈ [1, α],and generate partitioning values for Ri and Si

R

S

shift by vicompute z-value

Ri

Si

ith shift

ith shift

DFS

DFS

sample

sample of ith shiftRi

sample of ith shiftSi

Map

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.

Round 1: construct random shift copies for R and S , Ri and Si , i ∈ [1, α],and generate partitioning values for Ri and Si

R

S

shift by vicompute z-value

Ri

Si

ith shift

ith shift

DFS

DFS

sample

sample of ith shiftRi

sample of ith shiftSi

Map

shuffle

sort&

Ri

Si

estimator 1

estimator 2

DFS

Reduce

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.

Round 2: partition Ri and Si into blocks and compute the candidatepoints for knn(r , S) for any r ∈ R.

Ri

Si

partition by Si’s ranges

partition by Ri’s ranges

Ri,1 Ri,2 Ri,n. . .

block 1block 2 block n

Si,1 Si,2 Si,n. . .

block 1block 2 block n

Map

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.

Round 2: partition Ri and Si into blocks and compute the candidatepoints for knn(r , S) for any r ∈ R.

Ri

Si

partition by Si’s ranges

partition by Ri’s ranges

Ri,1 Ri,2 Ri,n. . .

block 1block 2 block n

Si,1 Si,2 Si,n. . .

block 1block 2 block n

Map

shuffle&

sort

Ri,n

Si,n

Ri,2

Si,2

Ri,1

Si,1

. . .

B+-Tree

B+-Tree

B+-Tree

Retrieve Ci(r) for all r ∈ Ri,j, j ∈ [1, n]

DFS

DFS

DFS

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: H-zkNNJ

H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.

Round 3: determine knn(r ,C(r)) of any r ∈ R from the (r ,Ci (r)) emittedby round 2.

Ri

Si

partition by Si’s ranges

partition by Ri’s ranges

Ri,1 Ri,2 Ri,n. . .

block 1block 2 block n

Si,1 Si,2 Si,n. . .

block 1block 2 block n

Map

shuffle&

sort

Ri,n

Si,n

Ri,2

Si,2

Ri,1

Si,1

. . .

B+-Tree

B+-Tree

B+-Tree

Retrieve Ci(r) for all r ∈ Ri,j, j ∈ [1, n]

DFS

DFS

DFS

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Outline

1 Introduction

2 Background: kNN Join

3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join

4 Experiments

5 Conclusions

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: algorithms

We implement the following methods in Hadoop 0.20.2:Exact Methods:

The baseline solution is denoted H-BNLJ,The improvement to the baseline solution is denoted H-BRJ.

Approximate Methods:

Our three-round solution is denoted by H-zkNNJ, (meaning ”Hadoopz-value kNN Join”).

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: setup

Experiments are performed in a heterogeneous Hadoop cluster with17 machines:

1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 6 machines with 4GB of RAM and an Intel Xeon 2GHz CPU

One is reserved for the master (running JobTracker and NameNode).

3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU

All machines are directly connected to a 1000Mbps switch.

Each slave node has 300GB hard drive space and 1GB of RAM forHadoop daemon.

The chunk size of DFS is set to 128MB.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: setup

Experiments are performed in a heterogeneous Hadoop cluster with17 machines:

1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 6 machines with 4GB of RAM and an Intel Xeon 2GHz CPU

One is reserved for the master (running JobTracker and NameNode).

3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU

All machines are directly connected to a 1000Mbps switch.

Each slave node has 300GB hard drive space and 1GB of RAM forHadoop daemon.

The chunk size of DFS is set to 128MB.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: datasets

OpenStreet Map dataset:

the road-networks for 50 states in U.S.160 million records.preprocessed to remove duplicationseach record consists of a 4 bytes integer id, two 4 bytes real typecoordinates representing latitude and longitude, and a descriptioninformation.the coordinates has a positive real domain (0,100000).stored in text format, 6.6GB.

Large synthetic Random-Cluster datasets:

data sets have varying dimensionality (up to 30).each record has a 4-byte id and float type d-dimensional coordinates.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: datasets

OpenStreet Map dataset:

the road-networks for 50 states in U.S.160 million records.preprocessed to remove duplicationseach record consists of a 4 bytes integer id, two 4 bytes real typecoordinates representing latitude and longitude, and a descriptioninformation.the coordinates has a positive real domain (0,100000).stored in text format, 6.6GB.

Large synthetic Random-Cluster datasets:

data sets have varying dimensionality (up to 30).each record has a 4-byte id and float type d-dimensional coordinates.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: configurations and defaults

Data set configurations

(MXN) represents a data set configuration containing M records ofR and N record of S (in 10s of millions).

Default values for OpenStreet dataset:

Symbol Definition Default(MXN) data set configuration (4x4)

k # of nearest neighbor 10α # of shift copies 2ε the error rate of sampling 0.003γ the physical number of machines 16

Values for R-Cluster dataset:

(2x2) is set to be the default data set configuration.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: configurations and defaults

Data set configurations

(MXN) represents a data set configuration containing M records ofR and N record of S (in 10s of millions).

Default values for OpenStreet dataset:

Symbol Definition Default(MXN) data set configuration (4x4)

k # of nearest neighbor 10α # of shift copies 2ε the error rate of sampling 0.003γ the physical number of machines 16

Values for R-Cluster dataset:

(2x2) is set to be the default data set configuration.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: configurations and defaults

Data set configurations

(MXN) represents a data set configuration containing M records ofR and N record of S (in 10s of millions).

Default values for OpenStreet dataset:

Symbol Definition Default(MXN) data set configuration (4x4)

k # of nearest neighbor 10α # of shift copies 2ε the error rate of sampling 0.003γ the physical number of machines 16

Values for R-Cluster dataset:

(2x2) is set to be the default data set configuration.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

1

1.2

1.4

1.6

10 20 40 60 80

Appro

xim

atio

n r

atio

k values

OpenStreet

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

0.6

0.7

0.8

0.9

1

10 20 40 60 80

Rec

all

(Pre

cisi

on)

k values

OpenStreet

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

1

1.2

1.4

1.6

5 10 15 20 25 30

Appro

xim

atio

n r

atio

Dimensionality

R-Cluster

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

0

0.2

0.4

0.6

0.8

1

5 10 15 20 25 30

Rec

all

(Pre

cisi

on)

Dimensionality

R-Cluster

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Running time and communication cost

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

10

20

30

40

50

4x4 6x6 8x8 12x12 16x16

Tim

e (

seconds

×10

3)

|R|×|S|:107×10

7

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Running time and communication cost

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

20

40

60

80

100

4x4 6x6 8x8 12x12 16x16

Data

shuff

led (

GB

)

|R|×|S|:107×10

7

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of d

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

103

104

105

5 10 15 20 25 30

Tim

e (s

econds)

Dimensionality

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of d

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

5

10

15

20

25

5 10 15 20 25 30

Dat

a sh

uff

led (

GB

)

Dimensionality

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Conclusions

We study efficient methods to perform kNN joins in MapReduce.

Exact (H-BRJ) and approximate (H-zkNNJ) algorithms are proposed.H-zkNNJ performs orders of magnitude better than other methodswith excellent approximation quality.

We plan to investigate kNN joins on very high dimensions in thefuture.

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

The End

Thank You

Q and A

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Approximate kNN join: Z-order kNN join

zkNN algorithm

Algorithm 1: zkNN(q, P, k , α)

generate {v2, . . . , vα}, v1 =−→0 , vi is a random vector in Rd ;1

Pi = P + vi (i ∈ [1, α]; ∀p ∈ P, insert p + vi in Pi );2

for i = 1, . . . , α do3

let qi = q + vi , Ci (q) = ∅, and zqi be qi ’s z-value;4

insert z−(zqi , k,Pi ) into Ci (q);5

insert z+(zqi , k,Pi ) into Ci (q);6

for any p ∈ Ci (q), update p = p − vi ;7

C (q) =⋃α

i=1 Ci (q) = C1(q) ∪ · · · ∪ Cα(q);8

return knn(q,C (q)).9

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

1

1.2

1.4

1.6

1.8

4x4 6x6 8x8 12x12 16x16

Appro

xim

atio

n r

atio

|R|×|S|:107×10

7

OpenStreet

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

1

1.2

1.4

1.6

10 20 40 60 80

Appro

xim

atio

n r

atio

k values

OpenStreet

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

0.6

0.7

0.8

0.9

1

40x40 80x80 120x120 160x160

Rec

all

(Pre

cisi

on)

|R|×|S|:106×10

6

OpenStreet

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

0.6

0.7

0.8

0.9

1

10 20 40 60 80

Rec

all

(Pre

cisi

on)

k values

OpenStreet

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of ε

102

103

104

105

0.6 1 3 10 100

Tim

e (

seconds)

ε (×10-3

)

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of ε

102

103

0.6 1 3 10 100

Sta

ndard

devia

tion

ε (×10-3

)

R blocks

S blocks

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Evaluation of H-BNLJ

101

102

103

104

105

106

5 10 15 20 25

Tim

e (

seconds)

# Reducers

H-zkNNJ

H-BRJ

H-BNLJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Evaluation of H-BNLJ

100

101

102

103

104

H-BNLJ H-BRJ H-zkNNJ

Tim

e (s

econds)

Algorithms

Phase1Phase2Phase3

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Speedup

0

5

10

15

20

25

5 10 15 20 25

Tim

e (

seco

nd

s ×

10

3)

# Reducers

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Speedup

0

2

4

6

8

5 10 15 20 25

Sp

eed

up

# Reducers

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Running time and communication cost

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

1

2

3

4

5

4x4 6x6 8x8 12x12 16x16

Tim

e (

seconds

×10

3)

|R|×|S|:107×10

7

zPhase1zPhase2zPhase3

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Running time and communication cost

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

10

20

30

40

50

4x4 6x6 8x8 12x12 16x16

Tim

e (

seconds

×10

3)

|R|×|S|:107×10

7

RPhase1RPhase2

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Running time and communication cost

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

10

20

30

40

50

4x4 6x6 8x8 12x12 16x16

Tim

e (

seconds

×10

3)

|R|×|S|:107×10

7

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Running time and communication cost

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

20

40

60

80

100

4x4 6x6 8x8 12x12 16x16

Data

shuff

led (

GB

)

|R|×|S|:107×10

7

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of d

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

103

104

105

5 10 15 20 25 30

Tim

e (s

econds)

Dimensionality

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of d

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

5

10

15

20

25

5 10 15 20 25 30

Dat

a sh

uff

led (

GB

)

Dimensionality

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of d

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

1

1.2

1.4

1.6

5 10 15 20 25 30

Appro

xim

atio

n r

atio

Dimensionality

R-Cluster

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of d

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

0.2

0.4

0.6

0.8

1

5 10 15 20 25 30

Rec

all

(Pre

cisi

on)

Dimensionality

R-Cluster

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of k

0

1

2

3

4

10 20 40 60 80

Tim

e (

seconds

×10

3)

k values

zPhase1zPhase2zPhase3

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of k

0

4

8

12

16

10 20 40 60 80

Tim

e (

seconds

×10

3)

k values

RPhase1RPhase2

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of k

0

10

20

30

40

3 10 20 40 60 80

Tim

e (

seconds

×10

3)

k values

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of k

0

50

100

150

200

3 10 20 40 60 80

Data

shuff

led (

GB

)

k values

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of number of shifts α

103

104

105

2 3 4 5 6

Tim

e (

seconds)

α values

H-zkNNJ H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of number of shifts α

0

5

10

15

2 3 4 5 6

Data

shuff

led (

GB

)

α values

H-zkNNJ

H-BRJ

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of number of shifts α

1

1.2

1.4

1.6

2 3 4 5 6

Appro

xim

atio

n r

atio

α values

R-Cluster

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Experiments: Effect of number of shifts α

0

0.2

0.4

0.6

0.8

1

2 3 4 5 6

Recall

(P

recis

ion)

α values

R-Cluster

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce