Efficient Parallel kNN Joins for Large Data in...

Efficient Parallel kNN Joins for Large Data inMapReduce

Chi Zhang1 Feifei Li2 Jeffrey Jestes2

1Dept of Computer Science 2School of ComputingFlorida State University University of Utah

April 4, 2012

Outline

1 Introduction

2 Background: kNN Join

3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join

4 Experiments

5 Conclusions

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

k Nearest Neighbor Join

k nearest neighbor join (kNN join)

Given two data sets R and S , for every point q in R, kNN joinreturns k nearest points of q from S .

q

Point in R Point in S

Numerous applications: knowledge discovery, data mining, spatialdatabases, multimedia databases, etc.





q

p3

(q, p1)

(q, p3)

(q, p4)

3-NN join for qp1

p4








Find kNN in S for all points in R



Data Growth

Source: IDC


Rise of Distributed and Parallel Computing

Data sets are growing at an exponential rate.

A single machine cannot handle large data efficiently.Parallel and distributed computing is the trend.


Rise of Distributed and Parallel Computing

Challenges:

Minimize communication and computation.Achieve good load balance.


Outline

1 Introduction



4 Experiments

5 Conclusions


kNN Join

Exact kNN Join

knn(r , S) = set of kNN of r from S .knnJ(R, S) = {(r , knn(r , S))| for all r ∈ R}.

Approximate kNN Joinaknn(r , S) = approximate kNN of r from S .

p = kth NN of r in knn(r , S).p′ = kth NN for r in aknn(r , S)aknn(r , S) is a c-approximation ofknn(r , S) : d(r , p) ≤ d(r , p′) ≤ c · d(r , p).

aknnJ(R,S) = {(r , aknn(r ,S))|∀r ∈ R}.

r

p



kNN Join

Exact kNN Join

knn(r , S) = set of kNN of r from S .knnJ(R, S) = {(r , knn(r , S))| for all r ∈ R}.

Approximate kNN Joinaknn(r , S) = approximate kNN of r from S .

p = kth NN of r in knn(r , S).p′ = kth NN for r in aknn(r , S)aknn(r , S) is a c-approximation ofknn(r , S) : d(r , p) ≤ d(r , p′) ≤ c · d(r , p).

aknnJ(R,S) = {(r , aknn(r ,S))|∀r ∈ R}.

r

p

p′



Outline

1 Introduction



4 Experiments

5 Conclusions


Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method

1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks3 Get global kNN results from n local kNN results for every record in R



Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.

2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2



Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks

3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2



Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks

3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2

BNLJ(R1, S1)

BNLJ(R1, S2)

BNLJ(R2, S1)

BNLJ(R2, S2)



Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks3 Get global kNN results from n local kNN results for every record in R

R

S

R1

R2

S1

S2

BNLJ(R1, S1)

BNLJ(R1, S2)

BNLJ(R2, S1)

BNLJ(R2, S2)

BNLJ(R1, S)

BNLJ(R2, S)



Two-round MapReduce algorithm: Round 1

R

S

Mapper

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

4

4




DFS

R

S

Mapper

Mapper



R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

4

4




(r1, s1, d1,1)...

File 1

File 2

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

Mapper

partition by record ids

Mapper

(r1, s1, d1,1)

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

...

...

...

...

...

...

...




(r1, s1, d1,1)...

File 1

File 2

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

Mapper

partition by record ids

Mapper

(r1, s1, d1,1)

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

Shuffle

(r1, s1, d1,1)

(r1, s7, d1,7)

(r3, s1, d3,1)

(r3, s5, d3,5)

sort list(s, d(r, s))get top k(= 2) results for r

Reducer

Reducer

DFS

(r1, s1, d1,1)(r1, s7, d1,7)

(r3, s5, d3,5)(r3, s6, d3,6)

DFS...

...

...

...

...

...

...

...

...

...

...


Exact kNN join: Block R-tree Join

Use spatial index (R-tree) to improve performance

Build R-tree index for a block of S in a bucket to speed up kNNcomputations.Similar to BNLJ algorithm, only need to replace BNLJ with blockR-tree join (BRJ) in the first round.





DFS

R

S

Mapper

Mapper



R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

4

4





DFS

R

S

Mapper

Mapper



R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

BRJ

4

4


Outline

1 Introduction



4 Experiments

5 Conclusions


Approximate kNN join

Problems with exact kNN join solution

Too much communication and computation (n2 buckets required)Find solution requiring O(n) buckets.

We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)

DFS

R

S

Mapper

Mapper



R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

BRJ

4

4

[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. ICDE, 2010.



Problems with exact kNN join solutionToo much communication and computation (n2 buckets required)

Find solution requiring O(n) buckets.

We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)

DFS

R

S

Mapper

Mapper



R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

BRJn2 buckets required, too much cost.

4

4




Problems with exact kNN join solutionToo much communication and computation (n2 buckets required)

Find solution requiring O(n) buckets.We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)

DFS

R

S

Mapper

Mapper



R1

R1

1

3

R2

2

R2

S1

S1

2

1

S2

S2

3

Shuffle

R1

1S1

R2

2S1

R1

3S2

R2

4S2

Reducer

Reducer

Reducer

Reducer

BNLJ

DFS

DFS

DFS

BRJn2 buckets required, too much cost.

4

4



Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.



The idea of zkNN



p3

p1

p5

p6

: points in P

p2

p4



The idea of zkNN



p3

p1

p5

p6

: points in P

p2

p4

pi,2

pi,6

pi,1

pi,5pi,3

pi,4

: points in Pi



The idea of zkNN



p3

p1

p5

p6

: points in P

p2

p4

pi,2

pi,6

pi,1

pi,5pi,3

pi,4

zi,5zi,1zi,3

zi,4

zi,2zi,6

: points in Pi



The idea of zkNN



p3

p1

p5

p6

: points in P

p2

p4

pi,2

pi,6

pi,1

pi,5pi,3

pi,4

zi,5zi,1zi,3

zi,4

zi,2zi,6

: points in Pi

ZPi

qi=q + vi



The idea of zkNN



p3

p1

p5

p6

: points in P

p2

p4

pi,2

pi,6

pi,1

pi,5pi,3

pi,4

zi,5zi,1zi,3

zi,4

zi,2zi,6

: points in Pi

ZPizqi

z−(zqi, k, Pi) z+(zqi, k, Pi)

qi=q + vi

B+-tree



The idea of zkNN



p3

p1

p5

p6

: points in P

p2

p4

pi,2

pi,6

pi,1

pi,5pi,3

pi,4

zi,5zi,1zi,3

zi,4

zi,2zi,6

: points in Pi

ZPizqi

z−(zqi, k, Pi) z+(zqi, k, Pi)

Ci(q)

qi=q + vi

B+-tree



In our group’s previous work we derive the following guarantee forthe zkNN join:

Theorem

Given a query point q ∈ Rd , a data set P ⊂ Rd , and a small constantα ∈ Z+. We generate (α− 1) random vectors {v2, . . . , vα}, such that forany i , vi ∈ Rd , and shift P by these vectors to obtain {P1, . . . ,Pα}(P1 = P). Then, the zkNN join returns a constant approximation in anyfixed dimension for knn(q,P) in expectation.


Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}









ZSi

ZRi

zi,1 zi,2

Ri,1 Ri,2 Ri,3

Si,2

zr

Si,1









ZSi

ZRi

zi,1 zi,2

Ri,1 Ri,2 Ri,3

Si,2

zr

? ?Si,1









ZSi

ZRi

zi,1 zi,2

Ri,1 Ri,2 Ri,3

Si,1 Si,2 Si,3

zr

zr









ZSi

ZRi

zi,1 zi,2

Ri,1 Ri,2 Ri,3

Si,1 Si,2 Si,3

zr

Ci(r)

zr

small neighborhoodsearch!!!









ZSi

zr

Si,1 Si,2 Si,4

Ri,1 Ri,2 Ri,4 ZRi

zi,1 zi,2

Si,3

zi,3

Ri,3

zr









ZSi

zr

Si,1 Si,2 Si,4

Ri,1 Ri,2 Ri,4 ZRi

zi,1 zi,2

Si,3

zi,3

Ri,3

zr

Ci(r)









ZSi

zr

Si,1 Si,2 Si,4

Ri,1 Ri,2 Ri,4 ZRi

zi,1 zi,2

Si,3

zi,3

Ri,3

zr

Ci(r)

copy copy



Choice of partitioning values.

Each block of Ri and Si shares the same boundary so we only searcha small neighborhood and minimize communication.Goal: load balance.

Evenly partition Ri or Si .

Evenly partition Ri → O( |Ri |n

log |Si |)Evenly partition Si → O(|Ri |log |Si |)



Choice of partitioning values.

Each block of Ri and Si shares the same boundary so we only searcha small neighborhood and minimize communication.Goal: load balance.Evenly partition Ri or Si .

Evenly partition Ri → O( |Ri |n

log |Si |)Evenly partition Si → O(|Ri |log |Si |)



Computation of partitioning values.

Quantiles can be used for evenly partitioning a data set D.Sort a data set D and retrieve its (n − 1) quantiles (expensive).

We propose sampling based method to estimate quantiles.

We proved that both estimations are close enough (within εN) to theoriginal ranks with a high probability (1-e−2/ε).



H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.

Round 1: construct random shift copies for R and S , Ri and Si , i ∈ [1, α],and generate partitioning values for Ri and Si

R

S

shift by vicompute z-value

Ri

Si

ith shift

ith shift

DFS

DFS

sample

sample of ith shiftRi

sample of ith shiftSi

Map




Round 1: construct random shift copies for R and S , Ri and Si , i ∈ [1, α],and generate partitioning values for Ri and Si

R

S

shift by vicompute z-value

Ri

Si

ith shift

ith shift

DFS

DFS

sample

sample of ith shiftRi

sample of ith shiftSi

Map

shuffle

sort&

Ri

Si

estimator 1

estimator 2

DFS

Reduce




Round 2: partition Ri and Si into blocks and compute the candidatepoints for knn(r , S) for any r ∈ R.

Ri

Si

partition by Si’s ranges

partition by Ri’s ranges

Ri,1 Ri,2 Ri,n. . .

block 1block 2 block n

Si,1 Si,2 Si,n. . .


Map




Round 2: partition Ri and Si into blocks and compute the candidatepoints for knn(r , S) for any r ∈ R.

Ri

Si



Ri,1 Ri,2 Ri,n. . .


Si,1 Si,2 Si,n. . .


Map

shuffle&

sort

Ri,n

Si,n

Ri,2

Si,2

Ri,1

Si,1

. . .

B+-Tree

B+-Tree

B+-Tree

Retrieve Ci(r) for all r ∈ Ri,j, j ∈ [1, n]

DFS

DFS

DFS




Round 3: determine knn(r ,C(r)) of any r ∈ R from the (r ,Ci (r)) emittedby round 2.

Ri

Si



Ri,1 Ri,2 Ri,n. . .


Si,1 Si,2 Si,n. . .


Map

shuffle&

sort

Ri,n

Si,n

Ri,2

Si,2

Ri,1

Si,1

. . .

B+-Tree

B+-Tree

B+-Tree

Retrieve Ci(r) for all r ∈ Ri,j, j ∈ [1, n]

DFS

DFS

DFS


Outline

1 Introduction



4 Experiments

5 Conclusions


Experiments: algorithms

We implement the following methods in Hadoop 0.20.2:Exact Methods:

The baseline solution is denoted H-BNLJ,The improvement to the baseline solution is denoted H-BRJ.

Approximate Methods:

Our three-round solution is denoted by H-zkNNJ, (meaning ”Hadoopz-value kNN Join”).


Experiments: setup

Experiments are performed in a heterogeneous Hadoop cluster with17 machines:

1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 6 machines with 4GB of RAM and an Intel Xeon 2GHz CPU

One is reserved for the master (running JobTracker and NameNode).

3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU

All machines are directly connected to a 1000Mbps switch.

Each slave node has 300GB hard drive space and 1GB of RAM forHadoop daemon.

The chunk size of DFS is set to 128MB.


Experiments: datasets

OpenStreet Map dataset:

the road-networks for 50 states in U.S.160 million records.preprocessed to remove duplicationseach record consists of a 4 bytes integer id, two 4 bytes real typecoordinates representing latitude and longitude, and a descriptioninformation.the coordinates has a positive real domain (0,100000).stored in text format, 6.6GB.

Large synthetic Random-Cluster datasets:

data sets have varying dimensionality (up to 30).each record has a 4-byte id and float type d-dimensional coordinates.


Experiments: configurations and defaults

Data set configurations

(MXN) represents a data set configuration containing M records ofR and N record of S (in 10s of millions).

Default values for OpenStreet dataset:

Symbol Definition Default(MXN) data set configuration (4x4)

k # of nearest neighbor 10α # of shift copies 2ε the error rate of sampling 0.003γ the physical number of machines 16

Values for R-Cluster dataset:

(2x2) is set to be the default data set configuration.


Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

1

1.2

1.4

1.6

10 20 40 60 80

Appro

xim

atio

n r

atio

k values

OpenStreet




0.6

0.7

0.8

0.9

1

10 20 40 60 80

Rec

all

(Pre

cisi

on)

k values

OpenStreet




1

1.2

1.4

1.6

5 10 15 20 25 30

Appro

xim

atio

n r

atio

Dimensionality

R-Cluster




0

0.2

0.4

0.6

0.8

1

5 10 15 20 25 30

Rec

all

(Pre

cisi

on)

Dimensionality

R-Cluster


Experiments: Running time and communication cost

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

0

10

20

30

40

50

4x4 6x6 8x8 12x12 16x16

Tim

e (

seconds

×10

3)

|R|×|S|:107×10

7

H-zkNNJ

H-BRJ




0

20

40

60

80

100

4x4 6x6 8x8 12x12 16x16

Data

shuff

led (

GB

)

|R|×|S|:107×10

7

H-zkNNJ

H-BRJ


Experiments: Effect of d


103

104

105

5 10 15 20 25 30

Tim

e (s

econds)

Dimensionality

H-zkNNJ

H-BRJ




0

5

10

15

20

25

5 10 15 20 25 30

Dat

a sh

uff

led (

GB

)

Dimensionality

H-zkNNJ

H-BRJ


Conclusions

We study efficient methods to perform kNN joins in MapReduce.

Exact (H-BRJ) and approximate (H-zkNNJ) algorithms are proposed.H-zkNNJ performs orders of magnitude better than other methodswith excellent approximation quality.

We plan to investigate kNN joins on very high dimensions in thefuture.


The End

Thank You

Q and A



zkNN algorithm

Algorithm 1: zkNN(q, P, k , α)

generate {v2, . . . , vα}, v1 =−→0 , vi is a random vector in Rd ;1

Pi = P + vi (i ∈ [1, α]; ∀p ∈ P, insert p + vi in Pi );2

for i = 1, . . . , α do3

let qi = q + vi , Ci (q) = ∅, and zqi be qi ’s z-value;4

insert z−(zqi , k,Pi ) into Ci (q);5

insert z+(zqi , k,Pi ) into Ci (q);6

for any p ∈ Ci (q), update p = p − vi ;7

C (q) =⋃α

i=1 Ci (q) = C1(q) ∪ · · · ∪ Cα(q);8

return knn(q,C (q)).9




1

1.2

1.4

1.6

1.8

4x4 6x6 8x8 12x12 16x16

Appro

xim

atio

n r

atio

|R|×|S|:107×10

7

OpenStreet




1

1.2

1.4

1.6

10 20 40 60 80

Appro

xim

atio

n r

atio

k values

OpenStreet




0.6

0.7

0.8

0.9

1

40x40 80x80 120x120 160x160

Rec

all

(Pre

cisi

on)

|R|×|S|:106×10

6

OpenStreet




0.6

0.7

0.8

0.9

1

10 20 40 60 80

Rec

all

(Pre

cisi

on)

k values

OpenStreet


Experiments: Effect of ε

102

103

104

105

0.6 1 3 10 100

Tim

e (

seconds)

ε (×10-3

)

H-zkNNJ

H-BRJ


Experiments: Effect of ε

102

103

0.6 1 3 10 100

Sta

ndard

devia

tion

ε (×10-3

)

R blocks

S blocks


Experiments: Evaluation of H-BNLJ

101

102

103

104

105

106

5 10 15 20 25

Tim

e (

seconds)

# Reducers

H-zkNNJ

H-BRJ

H-BNLJ


Experiments: Evaluation of H-BNLJ

100

101

102

103

104

H-BNLJ H-BRJ H-zkNNJ

Tim

e (s

econds)

Algorithms

Phase1Phase2Phase3


Experiments: Speedup

0

5

10

15

20

25

5 10 15 20 25

Tim

e (

seco

nd

s ×

10

3)

# Reducers

H-zkNNJ

H-BRJ


Experiments: Speedup

0

2

4

6

8

5 10 15 20 25

Sp

eed

up

# Reducers

H-zkNNJ

H-BRJ




0

1

2

3

4

5

4x4 6x6 8x8 12x12 16x16

Tim

e (

seconds

×10

3)

|R|×|S|:107×10

7

zPhase1zPhase2zPhase3




0

10

20

30

40

50

4x4 6x6 8x8 12x12 16x16

Tim

e (

seconds

×10

3)

|R|×|S|:107×10

7

RPhase1RPhase2




0

10

20

30

40

50

4x4 6x6 8x8 12x12 16x16

Tim

e (

seconds

×10

3)

|R|×|S|:107×10

7

H-zkNNJ

H-BRJ




0

20

40

60

80

100

4x4 6x6 8x8 12x12 16x16

Data

shuff

led (

GB

)

|R|×|S|:107×10

7

H-zkNNJ

H-BRJ




103

104

105

5 10 15 20 25 30

Tim

e (s

econds)

Dimensionality

H-zkNNJ

H-BRJ




0

5

10

15

20

25

5 10 15 20 25 30

Dat

a sh

uff

led (

GB

)

Dimensionality

H-zkNNJ

H-BRJ




1

1.2

1.4

1.6

5 10 15 20 25 30

Appro

xim

atio

n r

atio

Dimensionality

R-Cluster




0

0.2

0.4

0.6

0.8

1

5 10 15 20 25 30

Rec

all

(Pre

cisi

on)

Dimensionality

R-Cluster


Experiments: Effect of k

0

1

2

3

4

10 20 40 60 80

Tim

e (

seconds

×10

3)

k values

zPhase1zPhase2zPhase3



0

4

8

12

16

10 20 40 60 80

Tim

e (

seconds

×10

3)

k values

RPhase1RPhase2



0

10

20

30

40

3 10 20 40 60 80

Tim

e (

seconds

×10

3)

k values

H-zkNNJ

H-BRJ



0

50

100

150

200

3 10 20 40 60 80

Data

shuff

led (

GB

)

k values

H-zkNNJ

H-BRJ


Experiments: Effect of number of shifts α

103

104

105

2 3 4 5 6

Tim

e (

seconds)

α values

H-zkNNJ H-BRJ



0

5

10

15

2 3 4 5 6

Data

shuff

led (

GB

)

α values

H-zkNNJ

H-BRJ



1

1.2

1.4

1.6

2 3 4 5 6

Appro

xim

atio

n r

atio

α values

R-Cluster



0

0.2

0.4

0.6

0.8

1

2 3 4 5 6

Recall

(P

recis

ion)

α values

R-Cluster


Efficient Parallel kNN Joins for Large Data in...

Documents

Transcript of Efficient Parallel kNN Joins for Large Data in...