Research Summary: Scalable Algorithms for Nearest-Neighbor Joins on Big Trajectory Data

Scalable Algorithms for Nearest-Neighbor Joins on Big Trajectory Data – Fang, Cheng, Tang, Maniu, Yang (2016)

Presented by Alex Klibisz, University of Tennessee ([email protected])

November 17, 2016

Transcript of Research Summary: Scalable Algorithms for Nearest-Neighbor Joins on Big Trajectory Data


Contents

1. Introduction: Trajectory Joins Introduction, Motivation, MapReduce Introduction, Problem Statement, Trajectory Operations

2. Sub-optimal Solutions

3. Solution: kNN Join: Pre-processing Phase, Querying Phase, Extension: kNN Load Balancing, Extension: hkNN Join

4. Results: Evaluation Setup, kNN Results, hkNN Results, Summary

5. Conclusion


Trajectory Joins Vocabulary

• Trajectory: a series of locations depicting the movement of an entity over time.

• Trajectory Object: a snapshot of time and location; a single trajectory contains many trajectory objects.

• Trajectory Join: given two sets M and R of trajectories, join(M, R) returns trajectory objects from M and R within some proximity in space and time.

• Joining Criterion: the criteria by which objects in M and R are joined. This paper uses the k-nearest-neighbors algorithm to join objects.


Example Use Case

• The Hubble space telescope generates 140GB/week about movements of stars and asteroids. Analyzing proximity among trajectory objects helps to uncover the behavior of outer-space objects, discover meteors, etc. We can use trajectory joins to find objects in some proximity to one another.

• Example: given two groups A and B of asteroids, return the identities of asteroids from B that have been close to those in A.


MapReduce Basics

• Divide-and-conquer for "big data" on share-nothing clusters.

• A master node partitions data and assigns it to map nodes.

• Map performs analysis on local data.

• A shuffle step redistributes data after the map step.

• Reduce performs a summary operation over data from the Map step.

• The MapReduce software handles data partitioning, execution over distributed nodes, and error recovery.¹

¹ https://goo.gl/0nbYhp
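The pipeline above can be sketched in a few lines. This toy, in-process version (not Hadoop; all names are mine) shows the map, shuffle, and reduce roles:

```python
def map_reduce(records, mapper, reducer):
    """Minimal in-process MapReduce sketch: map each record to (key, value)
    pairs, group ("shuffle") by key, then reduce the values per key."""
    shuffled = {}
    for rec in records:
        for key, value in mapper(rec):
            shuffled.setdefault(key, []).append(value)  # shuffle step
    return {key: reducer(key, values) for key, values in shuffled.items()}

# Classic word-count example
counts = map_reduce(
    ["a b", "b c b"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda w, vals: sum(vals),
)
# counts == {'a': 1, 'b': 3, 'c': 1}
```

In a real cluster the shuffle moves data across the network between map and reduce nodes, which is why the paper measures shuffling cost separately.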


Problem Statement

kNN Join

Find the k nearest neighbors from set R for each object in M over a time interval [ts, te] ⊆ [Ts, Te].

(h,k)NN Join

Find a list of h objects from M over a time interval [ts, te] ⊆ [Ts, Te] that minimize a function f. Then return the k nearest neighbors for each of the h objects.
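Read non-distributed, the kNN join definition amounts to the following brute-force sketch (the helper name, the plain-point representation, and Euclidean distance are illustrative assumptions, not the paper's trajectory operators):

```python
from math import dist

def knn_join(M, R, k):
    """Brute-force kNN join on point sets: for each object in M, the k
    objects of R nearest by Euclidean distance."""
    return {i: sorted(R, key=lambda r: dist(m, r))[:k] for i, m in enumerate(M)}

M = [(0.0, 0.0), (5.0, 5.0)]
R = [(1.0, 0.0), (0.0, 2.0), (6.0, 5.0), (9.0, 9.0)]
# knn_join(M, R, 2)[0] == [(1.0, 0.0), (0.0, 2.0)]
```

The paper's contribution is doing this at scale; the definition itself is this simple.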


kNN Example

The figure illustrates a kNN join. An (h,k)NN join with h = 1, k = 2 might use f(m1) = max{d1, d2} = d2 and return the k nearest neighbors {r1, r2}.


Some Fundamental Operations

• Min/max distance from point to line-segment.

• Min/max distance from point to trajectory.

• Min/max distance from trajectory to trajectory.

• kNN from trajectory object to trajectory objects.

² Formulas omitted for brevity; available in section 3 of the paper.
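As one example of these primitives, here is a sketch of the minimum point-to-segment distance using the standard clamped-projection method (the paper's exact formulas are in its section 3; this is my illustration):

```python
import math

def point_segment_min_dist(p, a, b):
    """Minimum Euclidean distance from point p to the segment from a to b,
    computed by clamping the projection of p onto the segment."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:          # degenerate segment: a == b
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))        # clamp projection to [0, 1]
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))
```

Point-to-trajectory and trajectory-to-trajectory distances reduce to taking minima/maxima of this primitive over the trajectories' line segments.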


Sub-optimal Solutions

Single Machine Brute Force (BF)

A nested loop computes the Euclidean distance between every pair of points in M and R. Worst-case O(|M||R|l) for l points in the trajectory of interest tr.

Single Machine Sweep Line (SL)

Pre-sort the data by time and compute distances only for temporally overlapping trajectories. Also worst-case O(|M||R|l).

Naive MapReduce

Map divides objects in M and R randomly into disjoint subsets. Reduce joins all pairs of subsets to compute distances. A second MapReduce job selects the k nearest neighbors.
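The sweep-line idea can be sketched as follows; the eps time window, the (t, x, y) tuple layout, and the function name are illustrative assumptions rather than the paper's exact algorithm:

```python
def sweep_line_join(M, R, eps):
    """Toy sweep-line (SL) flavor: objects are (t, x, y) samples; sort both
    sets by time and pair up only objects whose timestamps lie within eps."""
    M, R = sorted(M), sorted(R)
    pairs, j = [], 0
    for t, x, y in M:
        # advance past R-objects that can no longer overlap in time
        while j < len(R) and R[j][0] < t - eps:
            j += 1
        k = j
        while k < len(R) and R[k][0] <= t + eps:
            pairs.append(((t, x, y), R[k]))
            k += 1
    return pairs
```

The pre-sort avoids distance computations between pairs that never coexist in time, but the worst case (everything overlapping) is still quadratic, which motivates the distributed solution.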


Overview of kNN Join

Each of the steps is composed of its own MapReduce algorithm, for a total of 6 algorithms.



Pre-processing Phase

Algorithm 1

1. Input: non-partitioned trajectories.

2. Map splits trajectories in sets M and R into T temporal partitions. O(l + T), where l is the size of a trajectory.

3. Reduce splits each temporal partition into N spatial partitions. O((|M| + |R|)(l + N)).

4. Output: trajectories partitioned by time and space.
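The temporal-then-spatial partitioning idea might look like this sketch (the 1-D spatial split and uniform bucket widths are my simplifications; the paper partitions 2-D space):

```python
def partition(trajectories, T, N, t_range, x_range):
    """Bucket each trajectory object into one of T temporal partitions,
    then one of N spatial partitions (1-D x split for brevity)."""
    (t0, t1), (x0, x1) = t_range, x_range
    buckets = {}
    for tr_id, objects in trajectories.items():
        for (t, x, y) in objects:
            ti = min(int((t - t0) / (t1 - t0) * T), T - 1)  # temporal bucket
            si = min(int((x - x0) / (x1 - x0) * N), N - 1)  # spatial bucket
            buckets.setdefault((ti, si), []).append((tr_id, t, x, y))
    return buckets
```

Each (temporal, spatial) bucket then maps naturally onto a MapReduce partition that a single node can process.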


Sub-Trajectory Extraction

• An anchor trajectory must span an entire time partition.

• TrLi is object i in trajectory r in set L in time partition T.

Algorithm 2

1. Input: trajectories partitioned by time and space.

2. Map retrieves all sub-trajectories in [ts, te] (the queried time window). Ot(log(l)), Os(l).

3. Reduce finds the anchor trajectories that will be used in the next step. Ot(|TrLi|²l), Os(|TrLi|l).

4. Output: anchor trajectories.


Anchor Trajectories

• An anchor trajectory must span an entire time partition ts to te.


Computing Time-dependent Bound (TDB)

• The TDB is a circle c(t) that bounds the k nearest neighbors of a set S of objects at time t.

• The TDB for a set S of objects can change over time.

Algorithm 4 (containing Algorithm 3)

1. Input: anchor trajectories.

2. Map computes the maximum distance from each anchor trajectory to each central point pi in each temporal partition T. Ot(N · l), Os(l).

3. Reduce computes the TDB of TrMi based on the maximum distances. Ot(|R| log |R|), Os(|R|) for the set of objects R.

4. Output: time-dependent bounds.
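The TDB intuition at a single time instant can be sketched as the k-th smallest distance from a center to the R objects; the function name and the point-based inputs are my illustration:

```python
import math

def tdb_radius(center, R_objs, k):
    """Radius of the smallest circle around `center` containing k of the
    R objects: the k-th smallest center-to-object distance."""
    dists = sorted(math.hypot(x - center[0], y - center[1]) for x, y in R_objs)
    return dists[k - 1]
```

Any R-trajectory that provably never enters this circle over the query window can be pruned before the expensive join.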


Time-dependent Bounds

• The TDB is a circle c(t) that bounds the k nearest neighbors of a set S of objects at time t.

• The TDB for a set S of objects can change over time.

White dots are objects from M; black dots are objects from R. c(t) needs only a small circle to encompass k = 2 points, while c(t′) needs a bigger circle to encompass k = 2 points.


Finding Candidate Trajectories

Algorithm 5

1. Input: a partition of trajectories TrRj.

2. Map classifies each partition of trajectories TrRj as having no candidates, all candidates, or some candidates. Ot(|Tr|Nl), Os(|Tr|l).

3. Reduce gathers the candidates for a join into CRi. Ot(1), Os(|CRi|l).

4. Output: a set of candidate trajectories CRi.
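The three-way classification in the Map step can be sketched against a TDB circle, assuming per-partition min/max distances to the queried objects are available (the names are mine):

```python
def classify_partition(tdb_radius, part_min_dist, part_max_dist):
    """Classify an R-partition against a TDB circle of radius tdb_radius,
    given the min/max distance from the queried objects to the partition."""
    if part_min_dist > tdb_radius:
        return "no candidates"      # Case 1: partition entirely outside
    if part_max_dist <= tdb_radius:
        return "all candidates"     # Case 2: partition entirely inside
    return "some candidates"        # Case 3: partial overlap
```

Only "some candidates" partitions require per-object distance checks, which is where the pruning pays off.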


Candidate Trajectories

Finding candidates for TrRj (red). Case 1: no overlap. Case 2: complete overlap. Case 3: partial overlap.


Trajectory Join

Algorithm 6

1. Input: candidate trajectories.

2. Map joins each partition TrMi with its corresponding candidates CRi using a single machine. O(|Tr||CRi|l).

3. Reduce sorts each object's neighbors and keeps only the k nearest. O(kN).

4. Output: each queried object with its k nearest neighbors.
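The final Reduce's top-k selection can be sketched with a heap; the (distance, neighbor_id) tuple layout is an assumption of mine:

```python
import heapq

def reduce_top_k(neighbor_lists, k):
    """Merge per-partition candidate lists for one queried object and keep
    only the k nearest, where entries are (distance, neighbor_id) pairs."""
    merged = [nb for lst in neighbor_lists for nb in lst]
    return heapq.nsmallest(k, merged)
```

A heap-based selection avoids fully sorting the merged candidate list when k is much smaller than the number of candidates.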


Extension: kNN Load Balancing

1. Hash the trajectory objects by an ID to distribute them more uniformly among compute nodes.

2. Requires modifications in the sub-trajectory extraction, candidate finding, and trajectory join steps.
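A minimal sketch of the hashing idea, assuming string object IDs (md5 is chosen here only for determinism across processes; the paper does not specify a hash function):

```python
import hashlib

def assign_node(object_id, num_nodes):
    """Deterministically map a trajectory-object ID to one of num_nodes
    compute nodes via a hash, spreading objects roughly uniformly."""
    digest = hashlib.md5(str(object_id).encode()).hexdigest()
    return int(digest, 16) % num_nodes
```

Hashing by ID avoids the skew that arises when dense space-time partitions all land on the same node.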


Extension: hkNN Join

1. Review: find the h objects from M that minimize some function f and return each of their k nearest neighbors.

2. Forced to compute a smaller TDB.

3. Smaller query result: h × k entries, versus |M| × k for the kNN query.

4. Time and space complexities remain the same.


Evaluation Setup

• 2 synthetic and 2 real datasets.

• Non-trivial size: up to 1.2B observations and 17.2GB.

• Hadoop cluster with 60 slave nodes; multi-core 3.40GHz CPU and 16GB memory per node.

• Sweep Line (SL) used for the single-node parts.

• Measured query execution time and MapReduce shuffling cost (the amount of data sent from mappers to reducers).

• k = 10 and N = 400 held constant for all datasets; T and tq varied.


Effect of T (number of temporal partitions)

As T grows, query time decreases until it hits an inflection point, which is similar for both datasets. Most of the time is still spent on single-node SL.


kNN Results Summary

• Increasing N (number of spatial partitions) improves performance up to a point of inflection, which differs between the two datasets. Fig. 15.

• Balanced Sweep-Line (BL-SL) is the more efficient single-node algorithm. Fig. 16. (I think they mixed up the figure labels.)

• Adding slave nodes improves performance, but the rate of change is slow, likely due to I/O overhead. Fig. 17.

• As k increases, running time and shuffle cost increase; the TDB makes a difference. Fig. 18.

• Increases in tq show a near-linear increase in running time and shuffling cost; the TDB and load balancing make a difference. Fig. 19.

• Time increases linearly with dataset size, with a sharper increase in shuffling cost than in time. Fig. 20.


hkNN Results Summary

• Time is constant as h grows (probably because k is constant).

• (h,k)NN is about 2x faster than the kNN methods.

• The load-balanced variant is faster than the non-load-balanced one.


Conclusion

Contributions

1. Leverages the share-nothing MapReduce structure for kNN joins, which typically rely on shared indices.

2. Introduces the TDB and load-balancing methods, which yield tangible improvements.

Questions

1. Most of the time is still spent on the single-node computation. What is the theoretical bound for improvement via parallelization?

2. How much time does the partitioning step take?

3. The partitioning step probably has to be re-run when new data arrives. Does this prevent a real-time implementation?

4. Any benefit to localizing data instead of using HDFS?