Efficient Parallel kNN Joins for Large Data in...
Transcript of Efficient Parallel kNN Joins for Large Data in...
Efficient Parallel kNN Joins for Large Data inMapReduce
Chi Zhang1 Feifei Li2 Jeffrey Jestes2
1Dept of Computer Science 2School of ComputingFlorida State University University of Utah
April 4, 2012
Outline
1 Introduction
2 Background: kNN Join
3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join
4 Experiments
5 Conclusions
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Outline
1 Introduction
2 Background: kNN Join
3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join
4 Experiments
5 Conclusions
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
k Nearest Neighbor Join
k nearest neighbor join (kNN join)
Given two data sets R and S , for every point q in R, kNN joinreturns k nearest points of q from S .
q
Point in R Point in S
Numerous applications: knowledge discovery, data mining, spatialdatabases, multimedia databases, etc.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
k Nearest Neighbor Join
k nearest neighbor join (kNN join)
Given two data sets R and S , for every point q in R, kNN joinreturns k nearest points of q from S .
q
p3
(q, p1)
(q, p3)
(q, p4)
3-NN join for qp1
p4
Point in R Point in S
Numerous applications: knowledge discovery, data mining, spatialdatabases, multimedia databases, etc.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
k Nearest Neighbor Join
k nearest neighbor join (kNN join)
Given two data sets R and S , for every point q in R, kNN joinreturns k nearest points of q from S .
Point in R Point in S
Find kNN in S for all points in R
Numerous applications: knowledge discovery, data mining, spatialdatabases, multimedia databases, etc.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
k Nearest Neighbor Join
k nearest neighbor join (kNN join)
Given two data sets R and S , for every point q in R, kNN joinreturns k nearest points of q from S .
Point in R Point in S
Find kNN in S for all points in R
Numerous applications: knowledge discovery, data mining, spatialdatabases, multimedia databases, etc.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Data Growth
Source: IDC
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Rise of Distributed and Parallel Computing
Data sets are growing at an exponential rate.
A single machine cannot handle large data efficiently.Parallel and distributed computing is the trend.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Rise of Distributed and Parallel Computing
Data sets are growing at an exponential rate.
A single machine cannot handle large data efficiently.Parallel and distributed computing is the trend.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Rise of Distributed and Parallel Computing
Challenges:
Minimize communication and computation.Achieve good load balance.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Outline
1 Introduction
2 Background: kNN Join
3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join
4 Experiments
5 Conclusions
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
kNN Join
Exact kNN Join
knn(r , S) = set of kNN of r from S .knnJ(R, S) = {(r , knn(r , S))| for all r ∈ R}.
Approximate kNN Joinaknn(r , S) = approximate kNN of r from S .
p = kth NN of r in knn(r , S).p′ = kth NN for r in aknn(r , S)aknn(r , S) is a c-approximation ofknn(r , S) : d(r , p) ≤ d(r , p′) ≤ c · d(r , p).
aknnJ(R,S) = {(r , aknn(r ,S))|∀r ∈ R}.
r
p
Point in R Point in S
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
kNN Join
Exact kNN Join
knn(r , S) = set of kNN of r from S .knnJ(R, S) = {(r , knn(r , S))| for all r ∈ R}.
Approximate kNN Joinaknn(r , S) = approximate kNN of r from S .
p = kth NN of r in knn(r , S).p′ = kth NN for r in aknn(r , S)aknn(r , S) is a c-approximation ofknn(r , S) : d(r , p) ≤ d(r , p′) ≤ c · d(r , p).
aknnJ(R,S) = {(r , aknn(r ,S))|∀r ∈ R}.
r
p
p′
Point in R Point in S
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Outline
1 Introduction
2 Background: kNN Join
3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join
4 Experiments
5 Conclusions
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Exact kNN join: Block Nested Loop Join
Block nested loop join (BNLJ) based method
1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks3 Get global kNN results from n local kNN results for every record in R
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Exact kNN join: Block Nested Loop Join
Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.
2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks3 Get global kNN results from n local kNN results for every record in R
R
S
R1
R2
S1
S2
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Exact kNN join: Block Nested Loop Join
Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks
3 Get global kNN results from n local kNN results for every record in R
R
S
R1
R2
S1
S2
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Exact kNN join: Block Nested Loop Join
Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks
3 Get global kNN results from n local kNN results for every record in R
R
S
R1
R2
S1
S2
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Exact kNN join: Block Nested Loop Join
Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks
3 Get global kNN results from n local kNN results for every record in R
R
S
R1
R2
S1
S2
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Exact kNN join: Block Nested Loop Join
Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks
3 Get global kNN results from n local kNN results for every record in R
R
S
R1
R2
S1
S2
BNLJ(R1, S1)
BNLJ(R1, S2)
BNLJ(R2, S1)
BNLJ(R2, S2)
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Exact kNN join: Block Nested Loop Join
Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks3 Get global kNN results from n local kNN results for every record in R
R
S
R1
R2
S1
S2
BNLJ(R1, S1)
BNLJ(R1, S2)
BNLJ(R2, S1)
BNLJ(R2, S2)
BNLJ(R1, S)
BNLJ(R2, S)
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Exact kNN join: Block Nested Loop Join
Two-round MapReduce algorithm: Round 1
R
S
Mapper
Mapper
(1) Divide R and S into blocks
(2) Duplicate each blocks into 2 partitions
R1
R1
1
3
R2
2
R2
S1
S1
2
1
S2
S2
3
4
4
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Exact kNN join: Block Nested Loop Join
Two-round MapReduce algorithm: Round 1
DFS
R
S
Mapper
Mapper
(1) Divide R and S into blocks
(2) Duplicate each blocks into 2 partitions
R1
R1
1
3
R2
2
R2
S1
S1
2
1
S2
S2
3
Shuffle
R1
1S1
R2
2S1
R1
3S2
R2
4S2
Reducer
Reducer
Reducer
Reducer
BNLJ
DFS
DFS
DFS
4
4
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Exact kNN join: Block Nested Loop Join
Two-round MapReduce algorithm: Round 2
(r1, s1, d1,1)...
File 1
File 2
(r3, s1, d3,1)
(r1, s7, d1,8)
(r3, s5, d3,5)
Mapper
partition by record ids
Mapper
(r1, s1, d1,1)
(r3, s1, d3,1)
(r1, s7, d1,8)
(r3, s5, d3,5)
...
...
...
...
...
...
...
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Exact kNN join: Block Nested Loop Join
Two-round MapReduce algorithm: Round 2
(r1, s1, d1,1)...
File 1
File 2
(r3, s1, d3,1)
(r1, s7, d1,8)
(r3, s5, d3,5)
Mapper
partition by record ids
Mapper
(r1, s1, d1,1)
(r3, s1, d3,1)
(r1, s7, d1,8)
(r3, s5, d3,5)
Shuffle
(r1, s1, d1,1)
(r1, s7, d1,7)
(r3, s1, d3,1)
(r3, s5, d3,5)
sort list(s, d(r, s))get top k(= 2) results for r
Reducer
Reducer
DFS
(r1, s1, d1,1)(r1, s7, d1,7)
(r3, s5, d3,5)(r3, s6, d3,6)
DFS...
...
...
...
...
...
...
...
...
...
...
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Exact kNN join: Block R-tree Join
Use spatial index (R-tree) to improve performance
Build R-tree index for a block of S in a bucket to speed up kNNcomputations.Similar to BNLJ algorithm, only need to replace BNLJ with blockR-tree join (BRJ) in the first round.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Exact kNN join: Block R-tree Join
Use spatial index (R-tree) to improve performance
Build R-tree index for a block of S in a bucket to speed up kNNcomputations.Similar to BNLJ algorithm, only need to replace BNLJ with blockR-tree join (BRJ) in the first round.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Exact kNN join: Block R-tree Join
Use spatial index (R-tree) to improve performance
Build R-tree index for a block of S in a bucket to speed up kNNcomputations.Similar to BNLJ algorithm, only need to replace BNLJ with blockR-tree join (BRJ) in the first round.
DFS
R
S
Mapper
Mapper
(1) Divide R and S into blocks
(2) Duplicate each blocks into 2 partitions
R1
R1
1
3
R2
2
R2
S1
S1
2
1
S2
S2
3
Shuffle
R1
1S1
R2
2S1
R1
3S2
R2
4S2
Reducer
Reducer
Reducer
Reducer
BNLJ
DFS
DFS
DFS
4
4
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Exact kNN join: Block R-tree Join
Use spatial index (R-tree) to improve performance
Build R-tree index for a block of S in a bucket to speed up kNNcomputations.Similar to BNLJ algorithm, only need to replace BNLJ with blockR-tree join (BRJ) in the first round.
DFS
R
S
Mapper
Mapper
(1) Divide R and S into blocks
(2) Duplicate each blocks into 2 partitions
R1
R1
1
3
R2
2
R2
S1
S1
2
1
S2
S2
3
Shuffle
R1
1S1
R2
2S1
R1
3S2
R2
4S2
Reducer
Reducer
Reducer
Reducer
BNLJ
DFS
DFS
DFS
BRJ
4
4
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Outline
1 Introduction
2 Background: kNN Join
3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join
4 Experiments
5 Conclusions
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join
Problems with exact kNN join solution
Too much communication and computation (n2 buckets required)Find solution requiring O(n) buckets.
We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)
DFS
R
S
Mapper
Mapper
(1) Divide R and S into blocks
(2) Duplicate each blocks into 2 partitions
R1
R1
1
3
R2
2
R2
S1
S1
2
1
S2
S2
3
Shuffle
R1
1S1
R2
2S1
R1
3S2
R2
4S2
Reducer
Reducer
Reducer
Reducer
BNLJ
DFS
DFS
DFS
BRJ
4
4
[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. ICDE, 2010.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join
Problems with exact kNN join solutionToo much communication and computation (n2 buckets required)
Find solution requiring O(n) buckets.
We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)
DFS
R
S
Mapper
Mapper
(1) Divide R and S into blocks
(2) Duplicate each blocks into 2 partitions
R1
R1
1
3
R2
2
R2
S1
S1
2
1
S2
S2
3
Shuffle
R1
1S1
R2
2S1
R1
3S2
R2
4S2
Reducer
Reducer
Reducer
Reducer
BNLJ
DFS
DFS
DFS
BRJn2 buckets required, too much cost.
4
4
[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. ICDE, 2010.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join
Problems with exact kNN join solutionToo much communication and computation (n2 buckets required)
Find solution requiring O(n) buckets.
We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)
DFS
R
S
Mapper
Mapper
(1) Divide R and S into blocks
(2) Duplicate each blocks into 2 partitions
R1
R1
1
3
R2
2
R2
S1
S1
2
1
S2
S2
3
Shuffle
R1
1S1
R2
2S1
R1
3S2
R2
4S2
Reducer
Reducer
Reducer
Reducer
BNLJ
DFS
DFS
DFS
BRJn2 buckets required, too much cost.
4
4
[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. ICDE, 2010.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join
Problems with exact kNN join solutionToo much communication and computation (n2 buckets required)
Find solution requiring O(n) buckets.We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)
DFS
R
S
Mapper
Mapper
(1) Divide R and S into blocks
(2) Duplicate each blocks into 2 partitions
R1
R1
1
3
R2
2
R2
S1
S1
2
1
S2
S2
3
Shuffle
R1
1S1
R2
2S1
R1
3S2
R2
4S2
Reducer
Reducer
Reducer
Reducer
BNLJ
DFS
DFS
DFS
BRJn2 buckets required, too much cost.
4
4
[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. ICDE, 2010.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: Z-order kNN join
The idea of zkNN
Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.
In practice 2 copies is arleady good enough.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: Z-order kNN join
The idea of zkNN
Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.
In practice 2 copies is arleady good enough.
p3
p1
p5
p6
: points in P
p2
p4
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: Z-order kNN join
The idea of zkNN
Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.
In practice 2 copies is arleady good enough.
p3
p1
p5
p6
: points in P
p2
p4
pi,2
pi,6
pi,1
pi,5pi,3
pi,4
: points in Pi
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: Z-order kNN join
The idea of zkNN
Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.
In practice 2 copies is arleady good enough.
p3
p1
p5
p6
: points in P
p2
p4
pi,2
pi,6
pi,1
pi,5pi,3
pi,4
zi,5zi,1zi,3
zi,4
zi,2zi,6
: points in Pi
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: Z-order kNN join
The idea of zkNN
Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.
In practice 2 copies is arleady good enough.
p3
p1
p5
p6
: points in P
p2
p4
pi,2
pi,6
pi,1
pi,5pi,3
pi,4
zi,5zi,1zi,3
zi,4
zi,2zi,6
: points in Pi
ZPi
qi=q + vi
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: Z-order kNN join
The idea of zkNN
Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.
In practice 2 copies is arleady good enough.
p3
p1
p5
p6
: points in P
p2
p4
pi,2
pi,6
pi,1
pi,5pi,3
pi,4
zi,5zi,1zi,3
zi,4
zi,2zi,6
: points in Pi
ZPizqi
z−(zqi, k, Pi) z+(zqi, k, Pi)
qi=q + vi
B+-tree
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: Z-order kNN join
The idea of zkNN
Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.
In practice 2 copies is arleady good enough.
p3
p1
p5
p6
: points in P
p2
p4
pi,2
pi,6
pi,1
pi,5pi,3
pi,4
zi,5zi,1zi,3
zi,4
zi,2zi,6
: points in Pi
ZPizqi
z−(zqi, k, Pi) z+(zqi, k, Pi)
Ci(q)
qi=q + vi
B+-tree
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: Z-order kNN join
In our group’s previous work we derive the following guarantee forthe zkNN join:
Theorem
Given a query point q ∈ Rd , a data set P ⊂ Rd , and a small constantα ∈ Z+. We generate (α− 1) random vectors {v2, . . . , vα}, such that forany i , vi ∈ Rd , and shift P by these vectors to obtain {P1, . . . ,Pα}(P1 = P). Then, the zkNN join returns a constant approximation in anyfixed dimension for knn(q,P) in expectation.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
Apply zkNN for join in MapReduce (H-zkNNJ)
Partition based algorithm
Partitioning policy:
To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)
Partitioning by z-values:
Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
Apply zkNN for join in MapReduce (H-zkNNJ)
Partition based algorithm
Partitioning policy:
To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)
Partitioning by z-values:
Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
Apply zkNN for join in MapReduce (H-zkNNJ)
Partition based algorithm
Partitioning policy:
To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)
Partitioning by z-values:
Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}
ZSi
ZRi
zi,1 zi,2
Ri,1 Ri,2 Ri,3
Si,2
zr
Si,1
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
Apply zkNN for join in MapReduce (H-zkNNJ)
Partition based algorithm
Partitioning policy:
To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)
Partitioning by z-values:
Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}
ZSi
ZRi
zi,1 zi,2
Ri,1 Ri,2 Ri,3
Si,2
zr
? ?Si,1
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
Apply zkNN for join in MapReduce (H-zkNNJ)
Partition based algorithm
Partitioning policy:
To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)
Partitioning by z-values:
Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}
ZSi
ZRi
zi,1 zi,2
Ri,1 Ri,2 Ri,3
Si,1 Si,2 Si,3
zr
zr
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
Apply zkNN for join in MapReduce (H-zkNNJ)
Partition based algorithm
Partitioning policy:
To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)
Partitioning by z-values:
Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}
ZSi
ZRi
zi,1 zi,2
Ri,1 Ri,2 Ri,3
Si,1 Si,2 Si,3
zr
Ci(r)
zr
small neighborhoodsearch!!!
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
Apply zkNN for join in MapReduce (H-zkNNJ)
Partition based algorithm
Partitioning policy:
To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)
Partitioning by z-values:
Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}
ZSi
zr
Si,1 Si,2 Si,4
Ri,1 Ri,2 Ri,4 ZRi
zi,1 zi,2
Si,3
zi,3
Ri,3
zr
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
Apply zkNN for join in MapReduce (H-zkNNJ)
Partition based algorithm
Partitioning policy:
To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)
Partitioning by z-values:
Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}
ZSi
zr
Si,1 Si,2 Si,4
Ri,1 Ri,2 Ri,4 ZRi
zi,1 zi,2
Si,3
zi,3
Ri,3
zr
Ci(r)
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
Apply zkNN for join in MapReduce (H-zkNNJ)
Partition based algorithm
Partitioning policy:
To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)
Partitioning by z-values:
Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}
ZSi
zr
Si,1 Si,2 Si,4
Ri,1 Ri,2 Ri,4 ZRi
zi,1 zi,2
Si,3
zi,3
Ri,3
zr
Ci(r)
copy copy
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
Choice of partitioning values.
Each block of Ri and Si shares the same boundary so we only searcha small neighborhood and minimize communication.Goal: load balance.
Evenly partition Ri or Si .
Evenly partition Ri → O( |Ri |n
log |Si |)Evenly partition Si → O(|Ri |log |Si |)
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
Choice of partitioning values.
Each block of Ri and Si shares the same boundary so we only searcha small neighborhood and minimize communication.Goal: load balance.Evenly partition Ri or Si .
Evenly partition Ri → O( |Ri |n
log |Si |)Evenly partition Si → O(|Ri |log |Si |)
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
Choice of partitioning values.
Each block of Ri and Si shares the same boundary so we only searcha small neighborhood and minimize communication.Goal: load balance.Evenly partition Ri or Si .
Evenly partition Ri → O( |Ri |n
log |Si |)Evenly partition Si → O(|Ri |log |Si |)
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
Computation of partitioning values.
Quantiles can be used for evenly partitioning a data set D.Sort a data set D and retrieve its (n − 1) quantiles (expensive).
We propose sampling based method to estimate quantiles.
We proved that both estimations are close enough (within εN) to theoriginal ranks with a high probability (1-e−2/ε).
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
Computation of partitioning values.
Quantiles can be used for evenly partitioning a data set D.Sort a data set D and retrieve its (n − 1) quantiles (expensive).
We propose sampling based method to estimate quantiles.
We proved that both estimations are close enough (within εN) to theoriginal ranks with a high probability (1-e−2/ε).
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.
Round 1: construct random shift copies for R and S , Ri and Si , i ∈ [1, α],and generate partitioning values for Ri and Si
R
S
shift by vicompute z-value
Ri
Si
ith shift
ith shift
DFS
DFS
sample
sample of ith shiftRi
sample of ith shiftSi
Map
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.
Round 1: construct random shift copies for R and S , Ri and Si , i ∈ [1, α],and generate partitioning values for Ri and Si
R
S
shift by vicompute z-value
Ri
Si
ith shift
ith shift
DFS
DFS
sample
sample of ith shiftRi
sample of ith shiftSi
Map
shuffle
sort&
Ri
Si
estimator 1
estimator 2
DFS
Reduce
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.
Round 2: partition Ri and Si into blocks and compute the candidatepoints for knn(r , S) for any r ∈ R.
Ri
Si
partition by Si’s ranges
partition by Ri’s ranges
Ri,1 Ri,2 Ri,n. . .
block 1block 2 block n
Si,1 Si,2 Si,n. . .
block 1block 2 block n
Map
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.
Round 2: partition Ri and Si into blocks and compute the candidatepoints for knn(r , S) for any r ∈ R.
Ri
Si
partition by Si’s ranges
partition by Ri’s ranges
Ri,1 Ri,2 Ri,n. . .
block 1block 2 block n
Si,1 Si,2 Si,n. . .
block 1block 2 block n
Map
shuffle&
sort
Ri,n
Si,n
Ri,2
Si,2
Ri,1
Si,1
. . .
B+-Tree
B+-Tree
B+-Tree
Retrieve Ci(r) for all r ∈ Ri,j, j ∈ [1, n]
DFS
DFS
DFS
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: H-zkNNJ
H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.
Round 3: determine knn(r ,C(r)) of any r ∈ R from the (r ,Ci (r)) emittedby round 2.
Ri
Si
partition by Si’s ranges
partition by Ri’s ranges
Ri,1 Ri,2 Ri,n. . .
block 1block 2 block n
Si,1 Si,2 Si,n. . .
block 1block 2 block n
Map
shuffle&
sort
Ri,n
Si,n
Ri,2
Si,2
Ri,1
Si,1
. . .
B+-Tree
B+-Tree
B+-Tree
Retrieve Ci(r) for all r ∈ Ri,j, j ∈ [1, n]
DFS
DFS
DFS
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Outline
1 Introduction
2 Background: kNN Join
3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join
4 Experiments
5 Conclusions
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: algorithms
We implement the following methods in Hadoop 0.20.2:Exact Methods:
The baseline solution is denoted H-BNLJ,The improvement to the baseline solution is denoted H-BRJ.
Approximate Methods:
Our three-round solution is denoted by H-zkNNJ, (meaning ”Hadoopz-value kNN Join”).
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: setup
Experiments are performed in a heterogeneous Hadoop cluster with17 machines:
1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 6 machines with 4GB of RAM and an Intel Xeon 2GHz CPU
One is reserved for the master (running JobTracker and NameNode).
3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU
All machines are directly connected to a 1000Mbps switch.
Each slave node has 300GB hard drive space and 1GB of RAM forHadoop daemon.
The chunk size of DFS is set to 128MB.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: setup
Experiments are performed in a heterogeneous Hadoop cluster with17 machines:
1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 6 machines with 4GB of RAM and an Intel Xeon 2GHz CPU
One is reserved for the master (running JobTracker and NameNode).
3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU
All machines are directly connected to a 1000Mbps switch.
Each slave node has 300GB hard drive space and 1GB of RAM forHadoop daemon.
The chunk size of DFS is set to 128MB.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: datasets
OpenStreet Map dataset:
the road-networks for 50 states in U.S.160 million records.preprocessed to remove duplicationseach record consists of a 4 bytes integer id, two 4 bytes real typecoordinates representing latitude and longitude, and a descriptioninformation.the coordinates has a positive real domain (0,100000).stored in text format, 6.6GB.
Large synthetic Random-Cluster datasets:
data sets have varying dimensionality (up to 30).each record has a 4-byte id and float type d-dimensional coordinates.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: datasets
OpenStreet Map dataset:
the road-networks for 50 states in U.S.160 million records.preprocessed to remove duplicationseach record consists of a 4 bytes integer id, two 4 bytes real typecoordinates representing latitude and longitude, and a descriptioninformation.the coordinates has a positive real domain (0,100000).stored in text format, 6.6GB.
Large synthetic Random-Cluster datasets:
data sets have varying dimensionality (up to 30).each record has a 4-byte id and float type d-dimensional coordinates.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: configurations and defaults
Data set configurations
(MXN) represents a data set configuration containing M records ofR and N record of S (in 10s of millions).
Default values for OpenStreet dataset:
Symbol Definition Default(MXN) data set configuration (4x4)
k # of nearest neighbor 10α # of shift copies 2ε the error rate of sampling 0.003γ the physical number of machines 16
Values for R-Cluster dataset:
(2x2) is set to be the default data set configuration.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: configurations and defaults
Data set configurations
(MXN) represents a data set configuration containing M records ofR and N record of S (in 10s of millions).
Default values for OpenStreet dataset:
Symbol Definition Default(MXN) data set configuration (4x4)
k # of nearest neighbor 10α # of shift copies 2ε the error rate of sampling 0.003γ the physical number of machines 16
Values for R-Cluster dataset:
(2x2) is set to be the default data set configuration.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: configurations and defaults
Data set configurations
(MXN) represents a data set configuration containing M records ofR and N record of S (in 10s of millions).
Default values for OpenStreet dataset:
Symbol Definition Default(MXN) data set configuration (4x4)
k # of nearest neighbor 10α # of shift copies 2ε the error rate of sampling 0.003γ the physical number of machines 16
Values for R-Cluster dataset:
(2x2) is set to be the default data set configuration.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Approximation quality
H-zkNNJ: Hadoop z-value kNN Join
1
1.2
1.4
1.6
10 20 40 60 80
Appro
xim
atio
n r
atio
k values
OpenStreet
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Approximation quality
H-zkNNJ: Hadoop z-value kNN Join
0.6
0.7
0.8
0.9
1
10 20 40 60 80
Rec
all
(Pre
cisi
on)
k values
OpenStreet
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Approximation quality
H-zkNNJ: Hadoop z-value kNN Join
1
1.2
1.4
1.6
5 10 15 20 25 30
Appro
xim
atio
n r
atio
Dimensionality
R-Cluster
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Approximation quality
H-zkNNJ: Hadoop z-value kNN Join
0
0.2
0.4
0.6
0.8
1
5 10 15 20 25 30
Rec
all
(Pre
cisi
on)
Dimensionality
R-Cluster
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Running time and communication cost
H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join
0
10
20
30
40
50
4x4 6x6 8x8 12x12 16x16
Tim
e (
seconds
×10
3)
|R|×|S|:107×10
7
H-zkNNJ
H-BRJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Running time and communication cost
H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join
0
20
40
60
80
100
4x4 6x6 8x8 12x12 16x16
Data
shuff
led (
GB
)
|R|×|S|:107×10
7
H-zkNNJ
H-BRJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of d
H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join
103
104
105
5 10 15 20 25 30
Tim
e (s
econds)
Dimensionality
H-zkNNJ
H-BRJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of d
H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join
0
5
10
15
20
25
5 10 15 20 25 30
Dat
a sh
uff
led (
GB
)
Dimensionality
H-zkNNJ
H-BRJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Conclusions
We study efficient methods to perform kNN joins in MapReduce.
Exact (H-BRJ) and approximate (H-zkNNJ) algorithms are proposed.H-zkNNJ performs orders of magnitude better than other methodswith excellent approximation quality.
We plan to investigate kNN joins on very high dimensions in thefuture.
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
The End
Thank You
Q and A
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Approximate kNN join: Z-order kNN join
zkNN algorithm
Algorithm 1: zkNN(q, P, k , α)
generate {v2, . . . , vα}, v1 =−→0 , vi is a random vector in Rd ;1
Pi = P + vi (i ∈ [1, α]; ∀p ∈ P, insert p + vi in Pi );2
for i = 1, . . . , α do3
let qi = q + vi , Ci (q) = ∅, and zqi be qi ’s z-value;4
insert z−(zqi , k,Pi ) into Ci (q);5
insert z+(zqi , k,Pi ) into Ci (q);6
for any p ∈ Ci (q), update p = p − vi ;7
C (q) =⋃α
i=1 Ci (q) = C1(q) ∪ · · · ∪ Cα(q);8
return knn(q,C (q)).9
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Approximation quality
H-zkNNJ: Hadoop z-value kNN Join
1
1.2
1.4
1.6
1.8
4x4 6x6 8x8 12x12 16x16
Appro
xim
atio
n r
atio
|R|×|S|:107×10
7
OpenStreet
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Approximation quality
H-zkNNJ: Hadoop z-value kNN Join
1
1.2
1.4
1.6
10 20 40 60 80
Appro
xim
atio
n r
atio
k values
OpenStreet
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Approximation quality
H-zkNNJ: Hadoop z-value kNN Join
0.6
0.7
0.8
0.9
1
40x40 80x80 120x120 160x160
Rec
all
(Pre
cisi
on)
|R|×|S|:106×10
6
OpenStreet
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Approximation quality
H-zkNNJ: Hadoop z-value kNN Join
0.6
0.7
0.8
0.9
1
10 20 40 60 80
Rec
all
(Pre
cisi
on)
k values
OpenStreet
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of ε
102
103
104
105
0.6 1 3 10 100
Tim
e (
seconds)
ε (×10-3
)
H-zkNNJ
H-BRJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of ε
102
103
0.6 1 3 10 100
Sta
ndard
devia
tion
ε (×10-3
)
R blocks
S blocks
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Evaluation of H-BNLJ
101
102
103
104
105
106
5 10 15 20 25
Tim
e (
seconds)
# Reducers
H-zkNNJ
H-BRJ
H-BNLJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Evaluation of H-BNLJ
100
101
102
103
104
H-BNLJ H-BRJ H-zkNNJ
Tim
e (s
econds)
Algorithms
Phase1Phase2Phase3
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Speedup
0
5
10
15
20
25
5 10 15 20 25
Tim
e (
seco
nd
s ×
10
3)
# Reducers
H-zkNNJ
H-BRJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Speedup
0
2
4
6
8
5 10 15 20 25
Sp
eed
up
# Reducers
H-zkNNJ
H-BRJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Running time and communication cost
H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join
0
1
2
3
4
5
4x4 6x6 8x8 12x12 16x16
Tim
e (
seconds
×10
3)
|R|×|S|:107×10
7
zPhase1zPhase2zPhase3
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Running time and communication cost
H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join
0
10
20
30
40
50
4x4 6x6 8x8 12x12 16x16
Tim
e (
seconds
×10
3)
|R|×|S|:107×10
7
RPhase1RPhase2
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Running time and communication cost
H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join
0
10
20
30
40
50
4x4 6x6 8x8 12x12 16x16
Tim
e (
seconds
×10
3)
|R|×|S|:107×10
7
H-zkNNJ
H-BRJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Running time and communication cost
H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join
0
20
40
60
80
100
4x4 6x6 8x8 12x12 16x16
Data
shuff
led (
GB
)
|R|×|S|:107×10
7
H-zkNNJ
H-BRJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of d
H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join
103
104
105
5 10 15 20 25 30
Tim
e (s
econds)
Dimensionality
H-zkNNJ
H-BRJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of d
H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join
0
5
10
15
20
25
5 10 15 20 25 30
Dat
a sh
uff
led (
GB
)
Dimensionality
H-zkNNJ
H-BRJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of d
H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join
1
1.2
1.4
1.6
5 10 15 20 25 30
Appro
xim
atio
n r
atio
Dimensionality
R-Cluster
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of d
H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join
0
0.2
0.4
0.6
0.8
1
5 10 15 20 25 30
Rec
all
(Pre
cisi
on)
Dimensionality
R-Cluster
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of k
0
1
2
3
4
10 20 40 60 80
Tim
e (
seconds
×10
3)
k values
zPhase1zPhase2zPhase3
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of k
0
4
8
12
16
10 20 40 60 80
Tim
e (
seconds
×10
3)
k values
RPhase1RPhase2
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of k
0
10
20
30
40
3 10 20 40 60 80
Tim
e (
seconds
×10
3)
k values
H-zkNNJ
H-BRJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of k
0
50
100
150
200
3 10 20 40 60 80
Data
shuff
led (
GB
)
k values
H-zkNNJ
H-BRJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of number of shifts α
103
104
105
2 3 4 5 6
Tim
e (
seconds)
α values
H-zkNNJ H-BRJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of number of shifts α
0
5
10
15
2 3 4 5 6
Data
shuff
led (
GB
)
α values
H-zkNNJ
H-BRJ
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of number of shifts α
1
1.2
1.4
1.6
2 3 4 5 6
Appro
xim
atio
n r
atio
α values
R-Cluster
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce
Experiments: Effect of number of shifts α
0
0.2
0.4
0.6
0.8
1
2 3 4 5 6
Recall
(P
recis
ion)
α values
R-Cluster
Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce