Efficient Parallel kNN Joins for Large Data in...

Efficient Parallel kNN Joins for Large Data inMapReduce

Chi Zhang1 Feifei Li2 Jeffrey Jestes2

1Dept of Computer Science 2School of ComputingFlorida State University University of Utah

April 4, 2012

Outline

1 Introduction

2 Background: kNN Join

3 Parallel kNN Join for Multi-dimensional Data Using MapReduceExact kNN JoinApproximate kNN Join

4 Experiments

5 Conclusions

Chi Zhang, Feifei Li, Jeffrey Jestes Efficient Parallel kNN Joins for Large Data in MapReduce

Outline

1 Introduction

4 Experiments

5 Conclusions

k Nearest Neighbor Join

k nearest neighbor join (kNN join)

Given two data sets R and S , for every point q in R, kNN joinreturns k nearest points of q from S .

Point in R Point in S

Numerous applications: knowledge discovery, data mining, spatialdatabases, multimedia databases, etc.

(q, p1)

(q, p3)

(q, p4)

3-NN join for qp1

Find kNN in S for all points in R

Data Growth

Source: IDC

Rise of Distributed and Parallel Computing

Data sets are growing at an exponential rate.

A single machine cannot handle large data efficiently.Parallel and distributed computing is the trend.

Data sets are growing at an exponential rate.

A single machine cannot handle large data efficiently.Parallel and distributed computing is the trend.

Challenges:

Minimize communication and computation.Achieve good load balance.

Outline

1 Introduction

4 Experiments

5 Conclusions

kNN Join

Exact kNN Join

knn(r , S) = set of kNN of r from S .knnJ(R, S) = {(r , knn(r , S))| for all r ∈ R}.

Approximate kNN Joinaknn(r , S) = approximate kNN of r from S .

p = kth NN of r in knn(r , S).p′ = kth NN for r in aknn(r , S)aknn(r , S) is a c-approximation ofknn(r , S) : d(r , p) ≤ d(r , p′) ≤ c · d(r , p).

aknnJ(R,S) = {(r , aknn(r ,S))|∀r ∈ R}.

kNN Join

Exact kNN Join

knn(r , S) = set of kNN of r from S .knnJ(R, S) = {(r , knn(r , S))| for all r ∈ R}.

Approximate kNN Joinaknn(r , S) = approximate kNN of r from S .

p = kth NN of r in knn(r , S).p′ = kth NN for r in aknn(r , S)aknn(r , S) is a c-approximation ofknn(r , S) : d(r , p) ≤ d(r , p′) ≤ c · d(r , p).

aknnJ(R,S) = {(r , aknn(r ,S))|∀r ∈ R}.

Outline

1 Introduction

4 Experiments

5 Conclusions

Exact kNN join: Block Nested Loop Join

Block nested loop join (BNLJ) based method

1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks3 Get global kNN results from n local kNN results for every record in R

Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.

2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks3 Get global kNN results from n local kNN results for every record in R

Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks

3 Get global kNN results from n local kNN results for every record in R

BNLJ(R1, S1)

BNLJ(R1, S2)

BNLJ(R2, S1)

BNLJ(R2, S2)

Block nested loop join (BNLJ) based method1 Partition R and S , each into n equal-sized disjoint blocks.2 Perform (BNLJ) for each possible Ri ,Sj pairs of blocks3 Get global kNN results from n local kNN results for every record in R

BNLJ(R1, S1)

BNLJ(R1, S2)

BNLJ(R2, S1)

BNLJ(R2, S2)

BNLJ(R1, S)

BNLJ(R2, S)

Two-round MapReduce algorithm: Round 1

Mapper

(1) Divide R and S into blocks

(2) Duplicate each blocks into 2 partitions

Mapper

Shuffle

Reducer

(r1, s1, d1,1)...

File 1

File 2

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

Mapper

partition by record ids

Mapper

(r1, s1, d1,1)

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

(r1, s1, d1,1)...

File 1

File 2

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

Mapper

partition by record ids

Mapper

(r1, s1, d1,1)

(r3, s1, d3,1)

(r1, s7, d1,8)

(r3, s5, d3,5)

Shuffle

(r1, s1, d1,1)

(r1, s7, d1,7)

(r3, s1, d3,1)

(r3, s5, d3,5)

sort list(s, d(r, s))get top k(= 2) results for r

Reducer

(r1, s1, d1,1)(r1, s7, d1,7)

(r3, s5, d3,5)(r3, s6, d3,6)

DFS...

Exact kNN join: Block R-tree Join

Use spatial index (R-tree) to improve performance

Build R-tree index for a block of S in a bucket to speed up kNNcomputations.Similar to BNLJ algorithm, only need to replace BNLJ with blockR-tree join (BRJ) in the first round.

Mapper

Shuffle

Reducer

Mapper

Shuffle

Reducer

Outline

1 Introduction

4 Experiments

5 Conclusions

Approximate kNN join

Problems with exact kNN join solution

Too much communication and computation (n2 buckets required)Find solution requiring O(n) buckets.

We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)

Mapper

Shuffle

Reducer

[YLK10] B. Yao, F. Li, P. Kumar. K nearest neighbor queries and knn-joins in large relational databases (almost) for free. ICDE, 2010.

Problems with exact kNN join solutionToo much communication and computation (n2 buckets required)

Find solution requiring O(n) buckets.

Mapper

Shuffle

Reducer

BRJn2 buckets required, too much cost.

Find solution requiring O(n) buckets.

Mapper

Shuffle

Reducer

Find solution requiring O(n) buckets.We search for approximate solutions.Space-filling curve based methods ([YLK10], dubbed zkNN)

Mapper

Shuffle

Reducer

Approximate kNN join: Z-order kNN join

The idea of zkNN

Transform d-dimensional points to 1-D values using Z-value.Map d-dimensional kNN join query to to 1-D range queries.Multiple random shift copies are used to improve spatial locality.

In practice 2 copies is arleady good enough.

The idea of zkNN

: points in P

The idea of zkNN

: points in P

pi,5pi,3

: points in Pi

The idea of zkNN

: points in P

pi,5pi,3

zi,5zi,1zi,3

zi,2zi,6

: points in Pi

The idea of zkNN

: points in P

pi,5pi,3

zi,5zi,1zi,3

zi,2zi,6

: points in Pi

qi=q + vi

The idea of zkNN

: points in P

pi,5pi,3

zi,5zi,1zi,3

zi,2zi,6

: points in Pi

ZPizqi

z−(zqi, k, Pi) z+(zqi, k, Pi)

qi=q + vi

B+-tree

The idea of zkNN

: points in P

pi,5pi,3

zi,5zi,1zi,3

zi,2zi,6

: points in Pi

ZPizqi

z−(zqi, k, Pi) z+(zqi, k, Pi)

qi=q + vi

B+-tree

In our group’s previous work we derive the following guarantee forthe zkNN join:

Theorem

Given a query point q ∈ Rd , a data set P ⊂ Rd , and a small constantα ∈ Z+. We generate (α− 1) random vectors {v2, . . . , vα}, such that forany i , vi ∈ Rd , and shift P by these vectors to obtain {P1, . . . ,Pα}(P1 = P). Then, the zkNN join returns a constant approximation in anyfixed dimension for knn(q,P) in expectation.

Approximate kNN join: H-zkNNJ

Apply zkNN for join in MapReduce (H-zkNNJ)

Partition based algorithm

Partitioning policy:

To achieve linear communication and computation costs (to thenumber of blocks n in each input data set)

Partitioning by z-values:

Partition input data sets Ri and Si into {Ri,1, ...,Ri,n} and{Si,1, ..., Si,n} using (n − 1) z-values {zi,1, ..., zi,n}

zi,1 zi,2

Ri,1 Ri,2 Ri,3

zi,1 zi,2

Ri,1 Ri,2 Ri,3

? ?Si,1

zi,1 zi,2

Ri,1 Ri,2 Ri,3

Si,1 Si,2 Si,3

zi,1 zi,2

Ri,1 Ri,2 Ri,3

Si,1 Si,2 Si,3

small neighborhoodsearch!!!

Si,1 Si,2 Si,4

Ri,1 Ri,2 Ri,4 ZRi

zi,1 zi,2

Si,1 Si,2 Si,4

Ri,1 Ri,2 Ri,4 ZRi

zi,1 zi,2

Si,1 Si,2 Si,4

Ri,1 Ri,2 Ri,4 ZRi

zi,1 zi,2

copy copy

Choice of partitioning values.

Each block of Ri and Si shares the same boundary so we only searcha small neighborhood and minimize communication.Goal: load balance.

Evenly partition Ri or Si .

Evenly partition Ri → O( |Ri |n

log |Si |)Evenly partition Si → O(|Ri |log |Si |)

Each block of Ri and Si shares the same boundary so we only searcha small neighborhood and minimize communication.Goal: load balance.Evenly partition Ri or Si .

Computation of partitioning values.

Quantiles can be used for evenly partitioning a data set D.Sort a data set D and retrieve its (n − 1) quantiles (expensive).

We propose sampling based method to estimate quantiles.

We proved that both estimations are close enough (within εN) to theoriginal ranks with a high probability (1-e−2/ε).

Computation of partitioning values.

Quantiles can be used for evenly partitioning a data set D.Sort a data set D and retrieve its (n − 1) quantiles (expensive).

We propose sampling based method to estimate quantiles.

We proved that both estimations are close enough (within εN) to theoriginal ranks with a high probability (1-e−2/ε).

H−zkNNJ algorithm can be implemented in 3 rounds of MapReduce.

Round 1: construct random shift copies for R and S , Ri and Si , i ∈ [1, α],and generate partitioning values for Ri and Si

shift by vicompute z-value

ith shift

sample

sample of ith shiftRi

sample of ith shiftSi

Round 1: construct random shift copies for R and S , Ri and Si , i ∈ [1, α],and generate partitioning values for Ri and Si

shift by vicompute z-value

ith shift

sample

sample of ith shiftRi

sample of ith shiftSi

shuffle

estimator 1

estimator 2

Reduce

Round 2: partition Ri and Si into blocks and compute the candidatepoints for knn(r , S) for any r ∈ R.

partition by Si’s ranges

partition by Ri’s ranges

Ri,1 Ri,2 Ri,n. . .

block 1block 2 block n

Si,1 Si,2 Si,n. . .

Round 2: partition Ri and Si into blocks and compute the candidatepoints for knn(r , S) for any r ∈ R.

Ri,1 Ri,2 Ri,n. . .

Si,1 Si,2 Si,n. . .

shuffle&

B+-Tree

Retrieve Ci(r) for all r ∈ Ri,j, j ∈ [1, n]

Round 3: determine knn(r ,C(r)) of any r ∈ R from the (r ,Ci (r)) emittedby round 2.

Ri,1 Ri,2 Ri,n. . .

Si,1 Si,2 Si,n. . .

shuffle&

B+-Tree

Retrieve Ci(r) for all r ∈ Ri,j, j ∈ [1, n]

Outline

1 Introduction

4 Experiments

5 Conclusions

Experiments: algorithms

We implement the following methods in Hadoop 0.20.2:Exact Methods:

The baseline solution is denoted H-BNLJ,The improvement to the baseline solution is denoted H-BRJ.

Approximate Methods:

Our three-round solution is denoted by H-zkNNJ, (meaning ”Hadoopz-value kNN Join”).

Experiments: setup

Experiments are performed in a heterogeneous Hadoop cluster with17 machines:

1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 6 machines with 4GB of RAM and an Intel Xeon 2GHz CPU

One is reserved for the master (running JobTracker and NameNode).

3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU

All machines are directly connected to a 1000Mbps switch.

Each slave node has 300GB hard drive space and 1GB of RAM forHadoop daemon.

The chunk size of DFS is set to 128MB.

Experiments: setup

Experiments are performed in a heterogeneous Hadoop cluster with17 machines:

1 9 machines with 2GB of RAM and an Intel Xeon 1.86GHz CPU2 6 machines with 4GB of RAM and an Intel Xeon 2GHz CPU

One is reserved for the master (running JobTracker and NameNode).

3 2 machines with 6GB of RAM and an Intel Xeon 2.13GHz CPU

All machines are directly connected to a 1000Mbps switch.

Each slave node has 300GB hard drive space and 1GB of RAM forHadoop daemon.

The chunk size of DFS is set to 128MB.

Experiments: datasets

OpenStreet Map dataset:

the road-networks for 50 states in U.S.160 million records.preprocessed to remove duplicationseach record consists of a 4 bytes integer id, two 4 bytes real typecoordinates representing latitude and longitude, and a descriptioninformation.the coordinates has a positive real domain (0,100000).stored in text format, 6.6GB.

Large synthetic Random-Cluster datasets:

data sets have varying dimensionality (up to 30).each record has a 4-byte id and float type d-dimensional coordinates.

Experiments: datasets

OpenStreet Map dataset:

the road-networks for 50 states in U.S.160 million records.preprocessed to remove duplicationseach record consists of a 4 bytes integer id, two 4 bytes real typecoordinates representing latitude and longitude, and a descriptioninformation.the coordinates has a positive real domain (0,100000).stored in text format, 6.6GB.

Large synthetic Random-Cluster datasets:

data sets have varying dimensionality (up to 30).each record has a 4-byte id and float type d-dimensional coordinates.

Experiments: configurations and defaults

Data set configurations

(MXN) represents a data set configuration containing M records ofR and N record of S (in 10s of millions).

Default values for OpenStreet dataset:

Symbol Definition Default(MXN) data set configuration (4x4)

k # of nearest neighbor 10α # of shift copies 2ε the error rate of sampling 0.003γ the physical number of machines 16

Values for R-Cluster dataset:

(2x2) is set to be the default data set configuration.

Experiments: Approximation quality

H-zkNNJ: Hadoop z-value kNN Join

10 20 40 60 80

k values

OpenStreet

10 20 40 60 80

k values

OpenStreet

5 10 15 20 25 30

Dimensionality

R-Cluster

5 10 15 20 25 30

Dimensionality

R-Cluster

Experiments: Running time and communication cost

H-zkNNJ: Hadoop z-value kNN JoinH-BRJ: Hadoop Block R-tree Join

4x4 6x6 8x8 12x12 16x16

seconds

|R|×|S|:107×10

H-zkNNJ

4x4 6x6 8x8 12x12 16x16

|R|×|S|:107×10

H-zkNNJ

Experiments: Effect of d

5 10 15 20 25 30

econds)

Dimensionality

H-zkNNJ

5 10 15 20 25 30

Dimensionality

H-zkNNJ

Conclusions

We study efficient methods to perform kNN joins in MapReduce.

Exact (H-BRJ) and approximate (H-zkNNJ) algorithms are proposed.H-zkNNJ performs orders of magnitude better than other methodswith excellent approximation quality.

We plan to investigate kNN joins on very high dimensions in thefuture.

The End

Thank You

Q and A

zkNN algorithm

Algorithm 1: zkNN(q, P, k , α)

generate {v2, . . . , vα}, v1 =−→0 , vi is a random vector in Rd ;1

Pi = P + vi (i ∈ [1, α]; ∀p ∈ P, insert p + vi in Pi );2

for i = 1, . . . , α do3

let qi = q + vi , Ci (q) = ∅, and zqi be qi ’s z-value;4

insert z−(zqi , k,Pi ) into Ci (q);5

insert z+(zqi , k,Pi ) into Ci (q);6

for any p ∈ Ci (q), update p = p − vi ;7

C (q) =⋃α

i=1 Ci (q) = C1(q) ∪ · · · ∪ Cα(q);8

return knn(q,C (q)).9

4x4 6x6 8x8 12x12 16x16

|R|×|S|:107×10

OpenStreet

10 20 40 60 80

k values

OpenStreet

40x40 80x80 120x120 160x160

|R|×|S|:106×10

OpenStreet

10 20 40 60 80

k values

OpenStreet

Experiments: Effect of ε

0.6 1 3 10 100

seconds)

ε (×10-3

H-zkNNJ

Experiments: Effect of ε

0.6 1 3 10 100

ε (×10-3

R blocks

S blocks

Experiments: Evaluation of H-BNLJ

5 10 15 20 25

seconds)

# Reducers

H-zkNNJ

H-BNLJ

Experiments: Evaluation of H-BNLJ

H-BNLJ H-BRJ H-zkNNJ

econds)

Algorithms

Phase1Phase2Phase3

Experiments: Speedup

5 10 15 20 25

# Reducers

H-zkNNJ

Experiments: Speedup

5 10 15 20 25

# Reducers

H-zkNNJ

4x4 6x6 8x8 12x12 16x16

seconds

|R|×|S|:107×10

zPhase1zPhase2zPhase3

4x4 6x6 8x8 12x12 16x16

seconds

|R|×|S|:107×10

RPhase1RPhase2

4x4 6x6 8x8 12x12 16x16

seconds

|R|×|S|:107×10

H-zkNNJ

4x4 6x6 8x8 12x12 16x16

|R|×|S|:107×10

H-zkNNJ

5 10 15 20 25 30

econds)

Dimensionality

H-zkNNJ

5 10 15 20 25 30

Dimensionality

H-zkNNJ

5 10 15 20 25 30

Dimensionality

R-Cluster

5 10 15 20 25 30

Dimensionality

R-Cluster

Experiments: Effect of k

10 20 40 60 80

seconds

k values

zPhase1zPhase2zPhase3

10 20 40 60 80

seconds

k values

RPhase1RPhase2

3 10 20 40 60 80

seconds

k values

H-zkNNJ

3 10 20 40 60 80

k values

H-zkNNJ

Experiments: Effect of number of shifts α

2 3 4 5 6

seconds)

α values

H-zkNNJ H-BRJ

2 3 4 5 6

α values

H-zkNNJ

2 3 4 5 6

α values

R-Cluster

2 3 4 5 6

Recall

α values

R-Cluster

Efficient Parallel kNN Joins for Large Data in...

Documents

Transcript of Efficient Parallel kNN Joins for Large Data in...

KNN, ID Trees, and Neural Nets Intro to Learning Algorithms · PDF fileKNN-ID and Neural Nets KNN, ID Trees, and Neural Nets Intro to Learning Algorithms KNN, Decision trees, Neural

Joins and different types of joins dani

The MST-kNN with Paracliques (Presentation)

KNN Matting

Best Practices Transact-SQL Cont.... Combining Data from Multiple Tables Introduction to Joins Using Inner Joins Using Outer Joins Using Cross Joins.

Chi Zhang czhang@cs.fiuczhang/teaching/cop4225/slides/Process.pdf · Processes and Threads Chi Zhang czhang@cs.fiu.edu. 2 ... in turn create other processes, forming a tree of processes.

Large Relational Databases (Almost) for Free K Nearest Neighbor …lifeifei/papers/KNN.pdf · 2011-08-09 · Introduction KNN queries and KNN-Joins: spatial databases, pattern recognition,

KNN (K-Nearest Neighbors)

Weighted kNN , clustering, more plottong , Bayes

KNN and regression Tree

HOG Feature Extraction and KNN Classification for ...

C:\Documents And Settings\Czhang\Desktop\Roomba 500 Series Manual

KNN News 042715

Running Knn Mapreduce code on Amazon AWS - Virginia Techdmkd.cs.vt.edu/TUTORIAL/Bigdata/Running KNN... · Running Knn Mapreduce code on Amazon AWS Pseudo Code Inputs: Train Data D,

kNN algorithm and CNN data reduction

Lecture 11: Cross validation - Stanford UniversityScenario2 KNN!1 KNN!CV LDA Logistic QDA 0.25 0.30 0.35 0.40 0.45 SCENARIO 1 KNN!1 KNN!CV LDA Logistic QDA 0.15 0.20 0.25 0.30 SCENARIO

Efﬁcient Processing of k Nearest Neighbor Joins using ...ooibc/vldb12knnjoin.pdf · the join operation, kNN join is an expensive operation. Given the increasing volume of data,

A robust ga knn based hypothesis

Index Land Surface for Efficient kNN Query

KNN Classification and Regression using SAS R · KNN Classification and Regression using SAS R Liang Xie, The Travelers Companies, Inc. ABSTRACT K-Nearest Neighbor (KNN) classification