Research Summary: Scalable Algorithms for Nearest-Neighbor Joins on Big Trajectory Data

Scalable Algorithms for Nearest-Neighbor Joins on Big Trajectory Data – Fang, Cheng, Tang, Maniu, Yang (2016)

Presented by Alex Klibisz, University of Tennessee ([email protected])

November 17, 2016

Transcript of Research Summary: Scalable Algorithms for Nearest-Neighbor Joins on Big Trajectory Data


Contents

1. Introduction: Trajectory Joins Introduction, Motivation, MapReduce Introduction, Problem Statement, Trajectory Operations

2. Sub-optimal Solutions

3. Solution: kNN Join: Pre-processing Phase, Querying Phase, Extension: kNN Load Balancing, Extension: hkNN Join

4. Results: Evaluation Setup, kNN Results, hkNN Results, Summary

5. Conclusion


Trajectory Joins Vocabulary

• Trajectory: a series of locations depicting the movement of an entity over time.

• Trajectory Object: a snapshot of time and location; a single trajectory contains many trajectory objects.

• Trajectory Join: given two sets M and R of trajectories, join(M, R) returns trajectory objects from M and R within some proximity in space and time.

• Joining Criterion: the criteria by which objects in M and R are joined. This paper uses the k-nearest-neighbors algorithm to join objects.


Example Use Case

• The Hubble space telescope generates 140GB/week about movements of stars and asteroids. Analyzing proximity among trajectory objects helps to uncover the behavior of outer-space objects, discover meteors, etc. We can use trajectory joins to find objects in some proximity to one another.

• Example: given two groups A and B of asteroids, return the identities of asteroids from B that have been close to those in A.


MapReduce Basics

• Divide-and-conquer for "big data" on share-nothing clusters.

• A master node partitions data and assigns it to map nodes.

• Map performs analysis on local data.

• A shuffle step redistributes data after the map step.

• Reduce performs a summary operation over data from the Map step.

• The MapReduce software handles data partitioning, execution over distributed nodes, and error recovery.¹

¹ https://goo.gl/0nbYhp
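The pipeline above can be sketched in a few lines. This toy, in-process version (not Hadoop; all names are mine) shows the map, shuffle, and reduce roles:

```python
def map_reduce(records, mapper, reducer):
    """Minimal in-process MapReduce sketch: map each record to (key, value)
    pairs, group ("shuffle") by key, then reduce the values per key."""
    shuffled = {}
    for rec in records:
        for key, value in mapper(rec):
            shuffled.setdefault(key, []).append(value)  # shuffle step
    return {key: reducer(key, values) for key, values in shuffled.items()}

# Classic word-count example
counts = map_reduce(
    ["a b", "b c b"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda w, vals: sum(vals),
)
# counts == {'a': 1, 'b': 3, 'c': 1}
```

In a real cluster the shuffle moves data across the network between map and reduce nodes, which is why the paper measures shuffling cost separately.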


Problem Statement

kNN Join

Find the k nearest neighbors from set R for each object in M over a time interval [ts, te] ⊆ [Ts, Te].

(h,k)NN Join

Find a list of h objects from M over a time interval [ts, te] ⊆ [Ts, Te] that minimize a function f. Then return the k nearest neighbors for each of the h objects.
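Read non-distributed, the kNN join definition amounts to the following brute-force sketch (the helper name, the plain-point representation, and Euclidean distance are illustrative assumptions, not the paper's trajectory operators):

```python
from math import dist

def knn_join(M, R, k):
    """Brute-force kNN join on point sets: for each object in M, the k
    objects of R nearest by Euclidean distance."""
    return {i: sorted(R, key=lambda r: dist(m, r))[:k] for i, m in enumerate(M)}

M = [(0.0, 0.0), (5.0, 5.0)]
R = [(1.0, 0.0), (0.0, 2.0), (6.0, 5.0), (9.0, 9.0)]
# knn_join(M, R, 2)[0] == [(1.0, 0.0), (0.0, 2.0)]
```

The paper's contribution is doing this at scale; the definition itself is this simple.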


kNN Example

The figure illustrates a kNN join. An (h,k)NN join with h = 1, k = 2 might use f(m1) = max{d1, d2} = d2 and return the k nearest neighbors {r1, r2}.


Some Fundamental Operations

• Min/max distance from point to line-segment.

• Min/max distance from point to trajectory.

• Min/max distance from trajectory to trajectory.

• kNN from trajectory object to trajectory objects.

² Formulas omitted for brevity; available in section 3 of the paper.
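As one example of these primitives, here is a sketch of the minimum point-to-segment distance using the standard clamped-projection method (the paper's exact formulas are in its section 3; this is my illustration):

```python
import math

def point_segment_min_dist(p, a, b):
    """Minimum Euclidean distance from point p to the segment from a to b,
    computed by clamping the projection of p onto the segment."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:          # degenerate segment: a == b
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))        # clamp projection to [0, 1]
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))
```

Point-to-trajectory and trajectory-to-trajectory distances reduce to taking minima/maxima of this primitive over the trajectories' line segments.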


Sub-optimal Solutions

Single Machine Brute Force (BF)

A nested loop computes the Euclidean distance between every pair of points in M and R. Worst-case O(|M||R|l) for l points in the trajectory of interest tr.

Single Machine Sweep Line (SL)

Pre-sort the data by time and compute distances only for temporally overlapping trajectories. Also worst-case O(|M||R|l).

Naive MapReduce

Map divides objects in M and R randomly into disjoint subsets. Reduce joins all pairs of subsets to compute distances. A second MapReduce job selects the k nearest neighbors.
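The sweep-line idea can be sketched as follows; the eps time window, the (t, x, y) tuple layout, and the function name are illustrative assumptions rather than the paper's exact algorithm:

```python
def sweep_line_join(M, R, eps):
    """Toy sweep-line (SL) flavor: objects are (t, x, y) samples; sort both
    sets by time and pair up only objects whose timestamps lie within eps."""
    M, R = sorted(M), sorted(R)
    pairs, j = [], 0
    for t, x, y in M:
        # advance past R-objects that can no longer overlap in time
        while j < len(R) and R[j][0] < t - eps:
            j += 1
        k = j
        while k < len(R) and R[k][0] <= t + eps:
            pairs.append(((t, x, y), R[k]))
            k += 1
    return pairs
```

The pre-sort avoids distance computations between pairs that never coexist in time, but the worst case (everything overlapping) is still quadratic, which motivates the distributed solution.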


Overview of kNN Join

Each of the steps is composed of its own MapReduce algorithm, for a total of 6 algorithms.



Pre-processing Phase

Algorithm 1

1. Input: non-partitioned trajectories.

2. Map splits trajectories in sets M and R into T temporal partitions. O(l + T), where l is the size of a trajectory.

3. Reduce splits each temporal partition into N spatial partitions. O((|M| + |R|)(l + N)).

4. Output: trajectories partitioned by time and space.
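The temporal-then-spatial partitioning idea might look like this sketch (the 1-D spatial split and uniform bucket widths are my simplifications; the paper partitions 2-D space):

```python
def partition(trajectories, T, N, t_range, x_range):
    """Bucket each trajectory object into one of T temporal partitions,
    then one of N spatial partitions (1-D x split for brevity)."""
    (t0, t1), (x0, x1) = t_range, x_range
    buckets = {}
    for tr_id, objects in trajectories.items():
        for (t, x, y) in objects:
            ti = min(int((t - t0) / (t1 - t0) * T), T - 1)  # temporal bucket
            si = min(int((x - x0) / (x1 - x0) * N), N - 1)  # spatial bucket
            buckets.setdefault((ti, si), []).append((tr_id, t, x, y))
    return buckets
```

Each (temporal, spatial) bucket then maps naturally onto a MapReduce partition that a single node can process.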


Sub-Trajectory Extraction

• An anchor trajectory must span an entire time partition.

• TrLi is object i in trajectory r in set L in time partition T.

Algorithm 2

1. Input: trajectories partitioned by time and space.

2. Map retrieves all sub-trajectories in [ts, te] (the queried time window). Ot(log(l)), Os(l).

3. Reduce finds the anchor trajectories that will be used in the next step. Ot(|TrLi|²l), Os(|TrLi|l).

4. Output: anchor trajectories.


Anchor Trajectories

• An anchor trajectory must span an entire time partition ts to te.


Computing Time-dependent Bound (TDB)

• The TDB is a circle c(t) that bounds the k nearest neighbors of a set S of objects at time t.

• The TDB for a set S of objects can change over time.

Algorithm 4 (containing Algorithm 3)

1. Input: anchor trajectories.

2. Map computes the maximum distance from each anchor trajectory to each central point pi in each temporal partition T. Ot(N · l), Os(l).

3. Reduce computes the TDB of TrMi based on the maximum distances. Ot(|R| log |R|), Os(|R|) for the set of objects R.

4. Output: time-dependent bounds.
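The TDB intuition at a single time instant can be sketched as the k-th smallest distance from a center to the R objects; the function name and the point-based inputs are my illustration:

```python
import math

def tdb_radius(center, R_objs, k):
    """Radius of the smallest circle around `center` containing k of the
    R objects: the k-th smallest center-to-object distance."""
    dists = sorted(math.hypot(x - center[0], y - center[1]) for x, y in R_objs)
    return dists[k - 1]
```

Any R-trajectory that provably never enters this circle over the query window can be pruned before the expensive join.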


Time-dependent Bounds

• The TDB is a circle c(t) that bounds the k nearest neighbors of a set S of objects at time t.

• The TDB for a set S of objects can change over time.

White dots are objects from M; black dots are objects from R. c(t) needs only a small circle to encompass k = 2 points, while c(t′) needs a bigger circle to encompass k = 2 points.


Finding Candidate Trajectories

Algorithm 5

1. Input: a partition of trajectories TrRj.

2. Map classifies each partition of trajectories TrRj as having no candidates, all candidates, or some candidates. Ot(|Tr|Nl), Os(|Tr|l).

3. Reduce gathers the candidates for a join into CRi. Ot(1), Os(|CRi|l).

4. Output: a set of candidate trajectories CRi.
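The three-way classification in the Map step can be sketched against a TDB circle, assuming per-partition min/max distances to the queried objects are available (the names are mine):

```python
def classify_partition(tdb_radius, part_min_dist, part_max_dist):
    """Classify an R-partition against a TDB circle of radius tdb_radius,
    given the min/max distance from the queried objects to the partition."""
    if part_min_dist > tdb_radius:
        return "no candidates"      # Case 1: partition entirely outside
    if part_max_dist <= tdb_radius:
        return "all candidates"     # Case 2: partition entirely inside
    return "some candidates"        # Case 3: partial overlap
```

Only "some candidates" partitions require per-object distance checks, which is where the pruning pays off.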


Candidate Trajectories

Finding candidates for TrRj (red). Case 1: no overlap. Case 2: complete overlap. Case 3: partial overlap.


Trajectory Join

Algorithm 6

1. Input: candidate trajectories.

2. Map joins each partition TrMi with its corresponding candidates CRi using a single machine. O(|Tr||CRi|l).

3. Reduce sorts each object's neighbors and keeps only the k nearest. O(kN).

4. Output: each queried object with its k nearest neighbors.
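The final Reduce's top-k selection can be sketched with a heap; the (distance, neighbor_id) tuple layout is an assumption of mine:

```python
import heapq

def reduce_top_k(neighbor_lists, k):
    """Merge per-partition candidate lists for one queried object and keep
    only the k nearest, where entries are (distance, neighbor_id) pairs."""
    merged = [nb for lst in neighbor_lists for nb in lst]
    return heapq.nsmallest(k, merged)
```

A heap-based selection avoids fully sorting the merged candidate list when k is much smaller than the number of candidates.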


Extension: kNN Load Balancing

1. Hash the trajectory objects by an ID to distribute them more uniformly among compute nodes.

2. Requires modifications in the sub-trajectory extraction, candidate finding, and trajectory join steps.
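A minimal sketch of the hashing idea, assuming string object IDs (md5 is chosen here only for determinism across processes; the paper does not specify a hash function):

```python
import hashlib

def assign_node(object_id, num_nodes):
    """Deterministically map a trajectory-object ID to one of num_nodes
    compute nodes via a hash, spreading objects roughly uniformly."""
    digest = hashlib.md5(str(object_id).encode()).hexdigest()
    return int(digest, 16) % num_nodes
```

Hashing by ID avoids the skew that arises when dense space-time partitions all land on the same node.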


Extension: hkNN Join

1. Review: find the h objects from M that minimize some function f and return each of their k nearest neighbors.

2. Forced to compute a smaller TDB.

3. Smaller query result: h × k entries, versus |M| × k for the kNN query.

4. Time and space complexities remain the same.


Evaluation Setup

• 2 synthetic and 2 real datasets.

• Non-trivial size: up to 1.2B observations and 17.2GB.

• Hadoop cluster with 60 slave nodes; multi-core 3.40GHz CPU and 16GB memory per node.

• Sweep Line (SL) used for the single-node parts.

• Measured query execution time and MapReduce shuffling cost (the amount of data sent from mappers to reducers).

• k = 10 and N = 400 held constant for all datasets; T and tq varied.


Effect of T (number of temporal partitions)

As T grows, query time decreases until it hits an inflection point, which is similar for both datasets. Most of the time is still spent on single-node SL.


kNN Results Summary

• Increasing N (number of spatial partitions) improves performance up to a point of inflection, which differs between the two datasets. Fig. 15.

• Balanced Sweep-Line (BL-SL) is the more efficient single-node algorithm. Fig. 16. (I think they mixed up the figure labels.)

• Adding slave nodes improves performance, but the rate of change is slow, likely due to I/O overhead. Fig. 17.

• As k increases, running time and shuffle cost increase; the TDB makes a difference. Fig. 18.

• Increases in tq show a near-linear increase in running time and shuffling cost; the TDB and load balancing make a difference. Fig. 19.

• Time increases linearly with dataset size, with a sharper increase in shuffling cost than in time. Fig. 20.


hkNN Results Summary

• Time is constant as h grows (probably because k is constant).

• (h,k)NN is about 2x faster than the kNN methods.

• The load-balanced variant is faster than the non-load-balanced one.


Conclusion

Contributions

1. Leverages the share-nothing MapReduce structure for kNN joins, which typically rely on shared indices.

2. Introduces the TDB and load-balancing methods, which yield tangible improvements.

Questions

1. Most of the time is still spent on the single-node computation. What is the theoretical bound for improvement via parallelization?

2. How much time does the partitioning step take?

3. The partitioning step probably has to be re-run when new data arrives. Does this prevent a real-time implementation?

4. Any benefit to localizing data instead of using HDFS?