Research Summary: Scalable Algorithms for Nearest-Neighbor Joins on Big Trajectory Data
Scalable kNN Joins, Fang
presented by Alex Klibisz
Introduction
Trajectory Joins Introduction
Motivation
MapReduce Introduction
Problem Statement
Trajectory Operations
Sub-optimal Solutions
Solution: kNN Join
Pre-processing Phase
Querying Phase
Extension: kNN Load Balancing
Extension: hkNN Join
Results
Evaluation Setup
kNN Results
hkNN Results Summary
Conclusion
Scalable Algorithms for Nearest-Neighbor Joins on Big Trajectory Data – Fang, Cheng, Tang, Maniu, Yang (2016)
presented by Alex Klibisz
University of Tennessee
November 17, 2016
Contents
1 Introduction: Trajectory Joins Introduction, Motivation, MapReduce Introduction, Problem Statement, Trajectory Operations
2 Sub-optimal Solutions
3 Solution: kNN Join: Pre-processing Phase, Querying Phase, Extension: kNN Load Balancing, Extension: hkNN Join
4 Results: Evaluation Setup, kNN Results, hkNN Results Summary
5 Conclusion
Trajectory Joins Vocabulary
• Trajectory: a series of locations that depicts the movement of an entity over time.
• Trajectory Object: a snapshot of time and location; a single trajectory contains many trajectory objects.
• Trajectory Join: given two sets M and R of trajectories, join(M, R) returns trajectory objects from M and R within some proximity of space and time.
• Joining Criterion: the criterion by which objects in M and R are joined. This paper uses k-nearest-neighbors as the joining criterion.
Example Use Case
• The Hubble Space Telescope generates 140 GB/week of data about the movements of stars and asteroids. Analyzing the proximity among trajectory objects helps to uncover the behavior of outer-space objects, discover meteors, etc. Trajectory joins can find objects within some proximity of one another.
• Given two groups A and B of asteroids, return the identities of asteroids from B that have been close to those in A.
MapReduce Basics
• Divide-and-conquer "big data" on shared-nothing clusters.
• Master node partitions data and assigns it to map nodes.
• Map performs analysis on local data.
• Shuffle step redistributes data after the map step.
• Reduce performs a summary operation over data from the Map step.
• MapReduce software handles the data partitioning, execution over distributed nodes, and error recovery. [1]
[1] https://goo.gl/0nbYhp
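The map, shuffle, and reduce steps can be imitated in a single Python process. This is only a minimal sketch for intuition: a real framework such as Hadoop distributes each step over many nodes, and the record layout and the `count_by_hour` mapper are hypothetical examples, not taken from the paper.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one map, shuffle, and reduce pass over `records` in a single process."""
    # Map: each record yields zero or more (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle: group values by key, as the framework would do across nodes.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: summarize the values collected for each key.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_by_hour(observation):
    # Hypothetical record layout: (object_id, timestamp_in_seconds, x, y).
    _, timestamp, _, _ = observation
    yield timestamp // 3600, 1


observations = [("a", 10, 0.0, 0.0), ("b", 20, 1.0, 1.0), ("a", 3700, 2.0, 2.0)]
counts = map_reduce(observations, count_by_hour, lambda key, values: sum(values))
```

Here `counts` maps each hour bucket to the number of observations that fell into it.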
Problem Statement
kNN Join
Find the k nearest neighbors from set R for each object in M over the time interval [ts, te] ⊆ [Ts, Te].
(h,k)NN Join
Find a list of h objects from M over the time interval [ts, te] ⊆ [Ts, Te] that minimize a function f. Then return the k nearest neighbors of each of the h objects.
kNN Example
The figure illustrates a kNN join. An (h,k)NN join with h = 1, k = 2 might use f(m1) = max{d1, d2} = d2 to rank m1, and then return the k nearest neighbors of the selected object, {r1, r2}.
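A snapshot version of the two joins can be sketched in Python. It assumes every object is a single 2-D point at one instant (the paper joins over a time interval), and the names `knn_join`, `hknn_join`, and `max_distance` are illustrative. The scoring function mirrors the example's choice of f as the largest of the k neighbor distances.

```python
import heapq
import math


def knn(m_point, r_points, k):
    """Return the k (distance, r_id) pairs nearest to m_point."""
    return heapq.nsmallest(
        k, ((math.dist(m_point, p), r_id) for r_id, p in r_points.items())
    )


def knn_join(m_points, r_points, k):
    """For every object in M, find its k nearest neighbors in R."""
    return {m_id: knn(p, r_points, k) for m_id, p in m_points.items()}


def hknn_join(m_points, r_points, h, k, f):
    """Keep the h objects of M with the smallest score f, with their kNN."""
    joined = knn_join(m_points, r_points, k)
    best = heapq.nsmallest(h, joined, key=lambda m_id: f(joined[m_id]))
    return {m_id: joined[m_id] for m_id in best}


def max_distance(neighbors):
    # f = distance to the farthest of the k nearest neighbors.
    return max(distance for distance, _ in neighbors)


M = {"m1": (0.0, 0.0), "m2": (10.0, 10.0)}
R = {"r1": (1.0, 0.0), "r2": (0.0, 2.0), "r3": (5.0, 5.0)}
result = hknn_join(M, R, h=1, k=2, f=max_distance)
```

With this toy data, m1 has the smaller score, so `result` holds only m1 together with r1 and r2.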
Some Fundamental Operations
• Min/max distance from point to line-segment.
• Min/max distance from point to trajectory.
• Min/max distance from trajectory to trajectory.
• kNN from trajectory object to trajectory objects. [2]
[2] Formulas omitted for brevity; they are available in section 3.
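As an illustration of the first operation, the textbook point-to-segment distances can be written as below. This is standard static geometry and not necessarily the paper's formulation, which may treat the point and the segment as moving over time. The function names are assumptions.

```python
import math


def min_dist_point_segment(p, a, b):
    """Minimum Euclidean distance from point p to the segment from a to b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        # Degenerate segment: a and b coincide.
        return math.dist(p, a)
    # Project p onto the line through a and b, clamped to the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.dist(p, (ax + t * dx, ay + t * dy))


def max_dist_point_segment(p, a, b):
    """Maximum distance from p to the segment; it is attained at an endpoint."""
    return max(math.dist(p, a), math.dist(p, b))
```

Point-to-trajectory distances then follow by taking the minimum (or maximum) of these values over a trajectory's consecutive segments.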
Sub-optimal Solutions
Single Machine Brute Force (BF)
A nested loop computes the Euclidean distance between every pair of points in M and R. Worst case O(|M||R|l) for l points in the trajectory of interest tr.
Single Machine Sweep Line (SL)
Pre-sort the data by time and compute distances only for overlapping trajectories. Also worst case O(|M||R|l).
Naive MapReduce
Map divides the objects in M and R randomly into disjoint subsets. Reduce joins all pairs of subsets to compute distances. A second MapReduce job selects the k nearest neighbors.
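The brute-force baseline can be sketched as a nested loop. The sketch assumes each trajectory is a dict from timestamp to (x, y), and it defines the distance between two trajectories as the smallest distance over their shared timestamps. That definition is a simplification chosen for illustration, not the paper's.

```python
import math


def trajectory_distance(tr_m, tr_r):
    """Smallest distance over the timestamps shared by both trajectories."""
    shared = tr_m.keys() & tr_r.keys()
    if not shared:
        return math.inf
    return min(math.dist(tr_m[t], tr_r[t]) for t in shared)


def brute_force_knn_join(M, R, k):
    """Compare every trajectory in M with every trajectory in R."""
    result = {}
    for m_id, tr_m in M.items():  # |M| iterations
        scored = []
        for r_id, tr_r in R.items():  # |R| iterations, up to l points each
            scored.append((trajectory_distance(tr_m, tr_r), r_id))
        scored.sort()
        result[m_id] = [r_id for _, r_id in scored[:k]]
    return result
```

Every pair is examined regardless of how far apart the trajectories are, which is the cost the partitioned approach tries to avoid.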
Overview of kNN Join
Each of the steps is implemented as its own MapReduce algorithm, for a total of 6 algorithms.
Pre-processing Phase
Algorithm 1
1 Input: non-partitioned trajectories.
2 Map splits the trajectories in sets M and R into T temporal partitions. O(l + T), where l is the size of a trajectory.
3 Reduce splits each temporal partition into N spatial partitions. O((|M| + |R|)(l + N)).
4 Output: trajectories partitioned by time and space.
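The two-level partitioning can be sketched in a single function. The slides do not say how the spatial partitions are formed, so the uniform grid here (N = grid × grid cells) is an assumption, as are the tuple layout and parameter names.

```python
from collections import defaultdict


def partition(observations, t_start, t_end, T, bounds, grid):
    """Group (obj_id, t, x, y) tuples by (temporal index, spatial cell)."""
    x_min, y_min, x_max, y_max = bounds
    width = (t_end - t_start) / T
    parts = defaultdict(list)
    for obj_id, t, x, y in observations:
        # Temporal partition: equal-width time slices, clamped at the end.
        ti = min(int((t - t_start) / width), T - 1)
        # Spatial partition: cell of a uniform grid (assumed scheme).
        col = min(int((x - x_min) / (x_max - x_min) * grid), grid - 1)
        row = min(int((y - y_min) / (y_max - y_min) * grid), grid - 1)
        parts[(ti, row * grid + col)].append((obj_id, t, x, y))
    return parts
```

In the MapReduce version the temporal split would run in the mappers and the spatial split in the reducers, as the steps above describe.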
Sub-Trajectory Extraction
• An anchor trajectory must span an entire time partition.
• Tr_i^L is object i in trajectory r in set L in time partition T.
Algorithm 2
1 Input: trajectories partitioned by time and space.
2 Map retrieves all sub-trajectories in [ts, te] [3]. Ot(log(l)) time, Os(l) space.
3 Reduce finds the anchor trajectories that will be used in the next step. Ot(|Tr_i^L|^2 l), Os(|Tr_i^L| l).
4 Output: anchor trajectories.
[3] The queried time window.
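A logarithmic-time lookup like the one in the Map step is what binary search over a time-sorted trajectory gives. The sketch below assumes parallel, time-sorted lists of timestamps and points, and it interprets "spans an entire time partition" as having observations at or before the partition start and at or after its end. Both are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def sub_trajectory(times, points, ts, te):
    """Slice a time-sorted trajectory to the window [ts, te] by binary search."""
    lo = bisect_left(times, ts)
    hi = bisect_right(times, te)
    return times[lo:hi], points[lo:hi]


def is_anchor(times, part_start, part_end):
    """True if the observations cover the whole partition (assumed definition)."""
    return bool(times) and times[0] <= part_start and times[-1] >= part_end
```

Locating the window boundaries costs O(log l); copying the slice is linear in its length.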
Anchor Trajectories
• An anchor trajectory must span an entire time partition ts to te.
Computing Time-dependent Bound (TDB)
• The TDB is a circle c(t) that bounds the k nearest neighbors of a set S of objects at time t.
• The TDB for a set S of objects can change over time.
Algorithm 4, containing Algorithm 3
1 Input: anchor trajectories.
2 Map computes the maximum distance from each anchor trajectory to each central point p_i in each temporal partition T. Ot(N · l), Os(l).
3 Reduce computes the TDB of Tr_i^M based on the maximum distances. Ot(|R| log |R|), Os(|R|) for the set of objects R.
4 Output: time-dependent bounds.
Time-dependent Bounds
White dots are objects from M; black dots are objects from R. c(t) needs only a small circle to encompass k = 2 points, while c(t′) needs a bigger circle to encompass k = 2 points.
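One simple way to obtain such a bounding circle at a single instant is sketched below. It is not the paper's Algorithm 3 or 4, which derive the bound from anchor trajectories. Here the circle is centered on the centroid of S, and the radius comes from the triangle inequality. The function name and the snapshot simplification are assumptions, and R is assumed to hold at least k points.

```python
import heapq
import math


def tdb_circle(s_points, r_points, k):
    """Return (center, radius) of a circle containing the kNN of every point in S."""
    n = len(s_points)
    center = (sum(x for x, _ in s_points) / n, sum(y for _, y in s_points) / n)
    radius = 0.0
    for s in s_points:
        # Distance from s to its k-th nearest neighbor in R.
        kth = heapq.nsmallest(k, (math.dist(s, r) for r in r_points))[-1]
        # Triangle inequality: every kNN of s lies within dist(center, s) + kth.
        radius = max(radius, math.dist(center, s) + kth)
    return center, radius
```

Calling the function with the positions at two different times t and t′ gives two circles of different size, which is the time dependence the figure shows.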
Finding Candidate Trajectories
Algorithm 5
1 Input: partition of trajectories Tr_j^R.
2 Map classifies each partition of trajectories Tr_j^R as having no candidates, all candidates, or some candidates. Ot(|Tr| N l), Os(|Tr| l).
3 Reduce gathers the candidates for a join into C_i^R. Ot(1), Os(|C_i^R| l).
4 Output: a set of candidate trajectories C_i^R.
Candidate Trajectories
Finding candidates for Tr_j^R (red). Case 1 partitions have no overlap, Case 2 partitions have complete overlap, and Case 3 partitions have partial overlap.
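The three cases can be illustrated with a circle-versus-circle test. The sketch assumes the overlap is measured against the time-dependent bound and that each partition is summarized by a bounding circle. The slides do not confirm either representation, and the function name and return labels are hypothetical.

```python
import math


def classify_partition(part_center, part_radius, tdb_center, tdb_radius):
    """Classify a partition's bounding circle against the TDB circle."""
    d = math.dist(part_center, tdb_center)
    if d > part_radius + tdb_radius:
        return "none"  # Case 1: circles are disjoint, no candidates.
    if d + part_radius <= tdb_radius:
        return "all"  # Case 2: partition lies inside the TDB, all are candidates.
    return "some"  # Case 3: partial overlap, each trajectory must be checked.
```

Only the "some" case requires looking at individual trajectories, which is where the pruning saves work.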
Trajectory Join
Algorithm 6
1 Input: candidate trajectories.
2 Map joins each partition Tr_i^M with its corresponding candidates C_i^R using a single machine. O(|Tr| |C_i^R| l).
3 Reduce sorts each object's neighbors and keeps only the k nearest. O(kN).
4 Output: each queried object with its k nearest neighbors.
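The Reduce step amounts to a per-object top-k selection. A minimal sketch, assuming the map step emits hypothetical (m_id, r_id, distance) triples:

```python
import heapq
from collections import defaultdict


def reduce_k_nearest(pairs, k):
    """Keep the k nearest neighbors for each queried object.

    `pairs` is an iterable of (m_id, r_id, distance) triples.
    """
    by_object = defaultdict(list)
    for m_id, r_id, distance in pairs:
        by_object[m_id].append((distance, r_id))
    # nsmallest returns each object's neighbors sorted by increasing distance.
    return {m_id: heapq.nsmallest(k, found) for m_id, found in by_object.items()}
```

Because candidates for one object may arrive from several partitions, the selection has to happen after the shuffle has grouped them by object.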
Extension: kNN Load Balancing
1 Hash the trajectory objects by an ID to distribute them more uniformly among compute nodes.
2 Requires modifications to sub-trajectory extraction, candidate finding, and the trajectory join.
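Hash-based assignment can be sketched as follows. The choice of CRC32 and the string IDs are assumptions for illustration; the slides do not state which hash function or ID the paper uses.

```python
import zlib
from collections import Counter


def assign_node(object_id, num_nodes):
    """Map an object ID to a compute node index in [0, num_nodes)."""
    # crc32 is stable across runs, unlike Python's salted built-in hash().
    return zlib.crc32(object_id.encode("utf-8")) % num_nodes


# Count how many of 1000 hypothetical IDs land on each of 4 nodes.
load = Counter(assign_node(f"trajectory-{i}", 4) for i in range(1000))
```

Unlike spatial partitioning, the assignment ignores where objects are located, so a dense region does not overload a single node.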
Extension: hkNN Join
1 Review: finds the h objects from M that minimize some function f and returns the k nearest neighbors of each.
2 Forced to compute a smaller TDB.
3 Smaller query result of size h × k; the kNN query result has size |M| × k.
4 Time and space complexities remain the same.
Evaluation Setup
• 2 synthetic and 2 real datasets.
• Non-trivial size, up to 1.2B observations and 17.2 GB.
• Hadoop cluster with 60 slave nodes, multi-core 3.40 GHz and 16 GB memory per node.
• Sweep Line (SL) is used for the single-node parts.
• Measured query execution time and MapReduce shuffling cost. [4]
• k = 10 and N = 400 are constant for all datasets; T and tq are varied.
[4] The amount of data sent from mappers to reducers.
Effect of T (number of temporal partitions)
As T grows, the running time decreases until it hits an inflection point. This point happens to be similar for both datasets. Most of the time is still spent on single-node SL.
kNN Results Summary
• Increasing N (number of spatial partitions) improves performance up to a point of inflection. This point is different for the two datasets. Fig. 15.
• Balanced Sweep-Line (BL-SL) is the more efficient single-node algorithm. Fig. 16. [5]
• Adding slave nodes improves performance. The rate of change is slow, likely due to I/O overhead. Fig. 17.
• As k increases, the running time and shuffle cost increase. TDB makes a difference. Fig. 18.
• Increases in tq show a near-linear increase in running time and shuffling cost. TDB and load balancing make a difference. Fig. 19.
• Time increases linearly with dataset size. Shuffling cost increases more sharply than time. Fig. 20.
[5] I think they mixed up the figure labels.
hkNN Results Summary
• Running time is constant as h grows (probably because k is constant).
• (h,k)NN is 2x faster than the kNN methods.
• The load-balanced variant is faster than the non-load-balanced one.
Conclusion
Contributions
1 Leverage the shared-nothing MapReduce structure for kNN joins, which typically rely on shared indices.
2 Introduce the TDB and load-balancing methods, which yield tangible improvements.
Questions
1 Most of the time is still spent on the single-node computation. What is the theoretical bound for improvement via parallelization?
2 How much time does the partitioning step take?
3 The partitioning step probably has to be re-run when new data arrives. Does this prevent a real-time implementation?
4 Is there any benefit to localizing data instead of using HDFS?