Algorithms for analyzing spatio-temporal data
PhD defense
Abhinandan Nath
Department of Computer Science, Duke University
Committee: Pankaj K. Agarwal (supervisor), Kamesh Munagala, Rong Ge, Yusu Wang
2
Introduction
4
The Data Deluge
“Mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes.”
- The Economist, 2010
https://www.economist.com/node/15579717
5
Some (more) numbers ...
USGS National Elevation data (10 metre resolution) [Dewberry, 2012]
NYC taxi pickup and dropoff data, 2009-2016 : 1.3 billion points [towardsdatascience.com]
6
Geometric flavor of data
● Many data sets geometric in nature
● Problems in other domains can be mapped to geometric domain
– e.g., SELECT query in relational databases
NAME     AGE  SALARY
Alice    26   30,000
Bob      30   35,000
Charlie  28   25,000
...      ...  ...
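A minimal sketch of this mapping, assuming a hypothetical helper `range_query` and made-up query bounds: each row becomes a point (AGE, SALARY) in the plane, and a SELECT with range predicates on both columns becomes an orthogonal range query.

```python
# Each row is a point (age, salary) in the plane; the name is the payload.
rows = [
    ("Alice", 26, 30_000),
    ("Bob", 30, 35_000),
    ("Charlie", 28, 25_000),
]

def range_query(rows, age_range, salary_range):
    """Report rows whose (age, salary) point lies in an axis-parallel rectangle.

    Plays the role of:
      SELECT name FROM rows
      WHERE age BETWEEN lo_a AND hi_a AND salary BETWEEN lo_s AND hi_s
    This is a linear scan; an index (e.g., a kd-tree) avoids touching every row.
    """
    (lo_a, hi_a), (lo_s, hi_s) = age_range, salary_range
    return [name for name, age, salary in rows
            if lo_a <= age <= hi_a and lo_s <= salary <= hi_s]

result = range_query(rows, age_range=(25, 29), salary_range=(20_000, 32_000))
print(result)  # ['Alice', 'Charlie']
```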
7
Challenges
Massive data sets that are -
● Noisy [towardsdatascience.com]
● Prone to outliers
● Incomplete
● Time-varying, e.g., trajectories
8
My Research
● Use techniques from computational geometry and topology to tackle some of these challenges in geometric data sets
● Design algorithms that
– Are practical
– Have provable performance guarantees
9
Broad themes
● Distributed algorithms
– Inspired by frameworks like MapReduce [Dean & Ghemawat, 2008] and Spark [Zaharia et al., 2010]
● Succinct descriptors
– Concisely encode desired properties of big data sets
– Noise-robust proxies for data sets
– Clustering
10
At a glance
Distributed algorithms
– Indices to answer range and nearest-neighbor queries [AFMN, 2016]
– Triangulation & contour tree of massive terrains [AFMN, 2016]
Succinct descriptors
– Comparing merge trees of real-valued functions [AFNSW, 2015]
– Common movement patterns from trajectory data [AFMNPT, 2018]
12
Distributed model of computation
● Massively Parallel Communication (MPC) model [Beame et al., 2013]
● Captures salient features of modern frameworks like MapReduce [Dean & Ghemawat, 2008]
13
MPC model of computation
● Multiple machines, connected by a communication medium
● Input of size n is distributed across the machines
● Each machine has O(s) storage
Assume ,
for
[Figure: machines, each with O(s) storage, connected by a communication medium; input size n]
14
MPC model of computation
● Computation proceeds in rounds
– In each round, each machine computes on local data
● Communication between machines occurs between rounds
● No. of messages sent/received by any machine in a round bounded by
15
Performance measures
● No. of rounds of computation :
● Running time : – : running time of machine in round
–
● Total work :
16
At a glance
Distributed algorithms: Indices to answer range and nearest-neighbor queries [AFMN, 2016]
Joint work with Pankaj K. Agarwal, Kyle Fox & Kamesh Munagala
17
Indexing big data
● Query big data sets faster, but how?
– Build an index !
● Consider geometric queries
– Orthogonal range queries
– Nearest-neighbor queries
18
Previous work
● Work on conjunctive and join queries, graph processing in MapReduce and its variants [Lee et al., 2012; Qin et al., 2014; Malewicz et al., 2010; Beame et al., 2013; Koutris et al., 2018; ...]
● Geometric queries - MapReduce implementations for analyzing and querying spatial and geometric data [Eldawy et al., 2013, 2015; Arabi et al., 2014; …] - no provable performance guarantees!!
25
Our work
Build and query distributed variants of the following classical data structures, with provable performance guarantees
– Orthogonal range searching
● Kd-tree [Bentley, 1975]
● Range tree [Bentley, 1980]
– Nearest-neighbor searching
● Balanced Box Decomposition (BBD)-tree [Arya et al., 1998]
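For reference, a minimal sketch of the classical (single-machine) 2D kd-tree with an orthogonal range query. This is not the distributed variant from the talk; the dictionary layout and the function names `build_kdtree` and `range_search` are choices made for illustration.

```python
def build_kdtree(points, depth=0):
    """Build a 2D kd-tree by splitting at the median, alternating x and y."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def range_search(node, lo, hi, out):
    """Append to `out` every point p with lo[i] <= p[i] <= hi[i] for i = 0, 1."""
    if node is None:
        return out
    p, axis = node["point"], node["axis"]
    if all(lo[i] <= p[i] <= hi[i] for i in range(2)):
        out.append(p)
    # Left subtree holds coordinates <= p[axis]; right subtree holds >= p[axis].
    if lo[axis] <= p[axis]:
        range_search(node["left"], lo, hi, out)
    if p[axis] <= hi[axis]:
        range_search(node["right"], lo, hi, out)
    return out

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
found = sorted(range_search(tree, lo=(3, 1), hi=(8, 5), out=[]))
print(found)  # [(5, 4), (7, 2), (8, 1)]
```

Sorting at every level keeps the sketch short; a production build would use linear-time median selection or presorting.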
26
Our results
: total no. of input points in
: total no. of points reported for a range query
: max no. of points reported by a machine for a range query
27
Our results
● Kd-tree :
– Construction : rounds, time, work
– Query : rounds, time, work – optimal if each point can be stored exactly once
Also extends to partition trees [Chan 2012] for simplex range searching
28
Our results
● Range tree :
– Construction : rounds, time, work
– Query : rounds, time, and work
● BBD-tree :
– Construction : rounds, time, work
– Query : rounds, time and work
30
Key idea : random sampling
● Data structures based on balanced hierarchical partitioning of input points represented as a tree
● Approximate this partitioning using a small random sample of input!
34
Balanced partitioning on random sample leads to balanced partitioning on entire set!!
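A small experiment illustrating the claim, as a sketch under simplifying assumptions: points are 1D (a kd-tree splits on one coordinate at a time), the sample size of 1,000 is arbitrary, and `sample_median_split` is a hypothetical helper, not the construction from the talk.

```python
import random

def sample_median_split(points, sample_size, rng):
    """Split `points` at the median of a random sample instead of the true median."""
    sample = sorted(rng.sample(points, sample_size))
    pivot = sample[len(sample) // 2]
    left = [p for p in points if p < pivot]
    right = [p for p in points if p >= pivot]
    return left, right

rng = random.Random(0)
points = [rng.random() for _ in range(100_000)]
left, right = sample_median_split(points, sample_size=1_000, rng=rng)

# Fraction by which the two sides differ; 0 would be a perfect split.
imbalance = abs(len(left) - len(right)) / len(points)
print(len(left), len(right), round(imbalance, 3))
```

Only the sample has to be sorted on one machine; the full input is touched once, to route each point to its side.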
36
At a glance
Distributed algorithms: Triangulation & contour tree of massive terrains [AFMN, 2016]
Joint work with Pankaj K. Agarwal, Kyle Fox & Kamesh Munagala
37
Terrain modeling
Airborne LiDAR scanning [http://www.lgs.ie/airborne-lidar.shtml]
Raw elevation data (3D point cloud) [kellylab.berkeley.edu]
Digital Elevation Model (DEM) [gisgeography.com/free-global-dem-data-sources/]
38
From 3D point cloud to DEM
● Terrain – xy-monotone surface in ℝ³
● Graph of a height function
● Often stored as a triangulated irregular network (TIN)
● How to build TINs and perform terrain analysis in the MPC model ?
39
Our Work
● Build TIN model, using Delaunay triangulation
● Compute the contour tree to succinctly encode all contours of terrain
Pipeline: input points in ℝ³ → build terrain model → build contour tree → use contour tree
Many applications, e.g., waterflow prediction, climate model visualization
40
Prior Work
● Delaunay triangulation
– RAM and I/O model [Crauser et al., 2001]
– PRAM algorithms [Blelloch et al., 1999]
– Goodrich's algorithm [Goodrich, 1997] can be adapted to MPC model – too complicated
– SpatialHadoop [Eldawy et al., 2015] – no theoretical bounds
● Contour tree
– RAM and I/O model [Carr et al., 2003; Pascucci and Cole-McLaughlin, 2002; Agarwal et al., 2010; …]
– Distributed and parallel algorithms [Morozov and Weber, 2013, 2014; Pascucci and Cole-McLaughlin, 2003; Acharya and Natarajan, 2015; ...]
41
Our results
● Given points, compute its Delaunay triangulation in rounds, time, and work, with high probability
● Given a terrain of size , compute its contour tree in rounds, time, and work
42
Build terrain model
Pipeline: input points → build terrain model → build contour tree → use contour tree
43
Delaunay Triangulation
● Given a set of points in the plane, a triangulation of the set is Delaunay if no triangle contains any point of the set in the interior of its circumcircle
● Many useful properties, e.g., avoids skinny triangles
[gamedev.stackexchange.com]
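The empty-circumcircle condition can be checked directly with the standard in-circle determinant. The sketch below is a brute-force checker for small inputs, not a construction algorithm; the function names and the four-point example are made up for illustration, and exact arithmetic is assumed (integer coordinates).

```python
def orientation(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circumcircle of triangle abc."""
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    cx, cy = c[0] - p[0], c[1] - p[1]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
           - (bx * bx + by * by) * (ax * cy - cx * ay)
           + (cx * cx + cy * cy) * (ax * by - bx * ay))
    # The sign of det assumes counter-clockwise order; multiplying by the
    # orientation makes the test independent of vertex order.
    return det * orientation(a, b, c) > 0

def is_delaunay(points, triangles):
    """Brute-force check: no input point inside any triangle's circumcircle."""
    for i, j, k in triangles:
        for idx, p in enumerate(points):
            if idx not in (i, j, k) and in_circumcircle(points[i], points[j], points[k], p):
                return False
    return True

points = [(-2, 0), (2, 0), (0, 1), (0, -1)]   # A, B, C, D
short_diagonal = [(0, 3, 2), (3, 1, 2)]       # triangles ADC and DBC
long_diagonal = [(0, 1, 2), (0, 3, 1)]        # triangles ABC and ADB (skinny)

print(is_delaunay(points, short_diagonal))  # True
print(is_delaunay(points, long_diagonal))   # False
```

The rejected triangulation is the one with the two skinny triangles, matching the "avoids skinny triangles" property.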
44
Basic idea
● Randomly sample a small set of points and compute the triangulation of the sample
● Use the triangulation of the sample to split the input into smaller chunks
● Recurse on each chunk in parallel
45
Algorithm
1. Given points stored across many machines, draw a random sample of the points and send it to one machine
50
Algorithm
2. Compute the triangulation of the sample, and use it to distribute the input to disjoint machines
With slight changes, it can be shown that each chunk has size with high probability
51
Algorithm
3. Recursively compute the triangulation of each chunk in parallel. Unnecessary triangles can be filtered out by simple geometric tests to get the triangulation of the whole input
52
Analysis
● No. of levels of recursion is
● Each level takes rounds, time, and work
53
Build contour tree
Pipeline: input points → build TIN DEM → build contour tree → use contour tree
54
Level sets and contours
● : triangulation of
● Height function
– Defined on each vertex
– Linearly interpolated within each face (triangle)
● Level set
● Contour : connected component of a level set
55
Topology changes at saddle points
Image from [Agarwal et al., 2015]
56
Contour tree
● Obtained by contracting each contour of the terrain to a point
Image from [Agarwal et al., 2015]
57
Our contribution
A simple and efficient divide-and-conquer algorithm to build and store the contour tree of a massive triangulated terrain in MPC model
60
Storage
● Contour tree stored in a distributed fashion
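– Top subtree : a subtree small enough to be stored on one machine
– Remaining subtrees stored on other machines, pointers to which are stored with the top subtree
[Figure: contour tree split into a top subtree and remaining subtrees; nodes labeled x1–x4, y1, y2, α1–α5]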
61
Algorithm (divide step)
1. Split the terrain into smaller chunks
● Each chunk has the same no. of points and goes to a disjoint set of machines
64
Algorithm (conquer step)
2. Compute distributed contour trees of each chunk recursively in parallel
67
Algorithm (merge step)
3. Combine the contour trees of the chunks to get the contour tree of the whole terrain
– Minimize interaction b/w neighboring chunks
– Take advantage of data distribution and triangulation
68
Our main result
Given a terrain of size , designed algorithm to compute its contour tree in rounds, time, and work
● These bounds are worst-case optimal !
70
At a glance
Succinct descriptors: Comparing merge trees of real-valued functions [AFNSW, 2015]
Joint work with Pankaj K. Agarwal, Kyle Fox, Tasos Sidiropoulos & Yusu Wang
Gonna skip!!
71
At a glance
Succinct descriptors: Common movement patterns from trajectory data [AFMNPT, 2018]
Joint work with Pankaj K. Agarwal, Kyle Fox, Kamesh Munagala, Jiangwei Pan & Erin Taylor
72
Trajectory data
● Huge data available
– Improve decision making
– Gain insights
● Noisy and incomplete
● Several computational challenges
[https://www.sundried.com]
[developer.huawei.com]
73
Motivation
● Subtrajectory clusters capture common portions
● Different from clustering trajectories as a whole
76
Motivation
● Extract high-level shared structure from large trajectory data sets
78
Pathlet
Representative pathlet for each cluster
– Cluster “center”
– Pathlet is a curve, not necessarily part of the input
79
Application of pathlets
● Compression of large trajectory data [Chen et al. 2013]
– Hope that each trajectory can be reconstructed with small no. of pathlets
– Small pathlet dictionary - non-linear dimension reduction
● Reconstructing road network from trajectory data [Li et al. 2013; Buchin et al. 2017]
80
Our contribution
● Model for subtrajectory clustering
– Robust to noise and missing data
– Data-driven clusters and pathlets
● NP-hardness of subtrajectory clustering problem
● Provably-efficient approximation algorithms
– Faster algorithms for realistic inputs
● Experimental results
81
Previous work
● Graph setting – no noise or gaps [Chen et al. 2013]
● Based only on point density [Panagiotakis et al. 2012]
● Restricted to line segments [Lee et al. 2007]
● Search for pre-defined patterns [Fan et al. 2016; Tang et al. 2013; Wang et al. 2015; Zheng et al. 2013]
None of these have provable performance guarantees!!
82
Model and problem formulation
Model inputs :
– Trajectories :
– Each trajectory is a sequence of points in the plane
● A subtrajectory is a subsequence of a trajectory
– Let be all trajectory points
87
Objective function
Need small # of pathlets
Measure of cluster quality
Fraction of points unassigned for each trajectory : “gaps”
89
A note on the distance
We use discrete Fréchet distance
Given and
● Correspondence s.t. every pt. in at least one pair
● is monotone if for all ,
90
Discrete Fréchet distance
: Set of all monotone correspondences b/w ,
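In words, the standard definition is: the discrete Fréchet distance is the minimum, over all monotone correspondences, of the largest distance between a matched pair of points. A minimal sketch of the textbook dynamic program follows; the names `P`, `Q` and `discrete_frechet` are chosen for illustration, and Euclidean distance between points is assumed.

```python
from math import dist  # Euclidean distance, Python 3.8+

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between point sequences P and Q."""
    n, m = len(P), len(Q)
    # dp[i][j]: discrete Fréchet distance between prefixes P[:i+1] and Q[:j+1].
    dp = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(P[i], Q[j])
            if i == 0 and j == 0:
                dp[i][j] = d
            elif i == 0:
                dp[i][j] = max(dp[0][j - 1], d)
            elif j == 0:
                dp[i][j] = max(dp[i - 1][0], d)
            else:
                # A monotone correspondence reaches (i, j) by advancing
                # in P, in Q, or in both.
                dp[i][j] = max(min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]), d)
    return dp[-1][-1]

P = [(0, 0), (1, 0), (2, 0)]
Q = [(0, 1), (1, 1), (2, 1)]
print(discrete_frechet(P, Q))  # 1.0
```

The table has one entry per pair of points, so the running time is proportional to the product of the two sequence lengths.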
91
Choosing pathlets
Given , goal is to choose from set of candidate pathlets to minimize objective function
● If is given as input : pathlet-cover problem
● If not given but assumed to be (uncountably) infinite set of all trajectories in plane : subtrajectory-clustering problem
92
Basic idea
● Reduce to set-cover
● Solve using greedy algorithm : gives approximation
● Challenge : implementing greedy step efficiently
93
Set-cover
Input :
● Set system
● Cost
Goal is to find of minimum total cost such that
95
From pathlet-cover to set-cover
●
● has two kinds of sets :– For all , with
where
Corresponds to treating as a gap in pathlet cover
97
From pathlet-cover to set-cover
●
● has two kinds of sets :– For all and for any set of subtraj. ,
with
Corresponds to assigning subtraj. in to
98
From pathlet-cover to set-cover
●
● has two kinds of sets :– For all and for any set of subtraj. ,
with
Exponential # sets : cannot construct explicitly!!
99
From pathlet-cover to set-cover
Theorem : There exists bijection between feasible solutions of and with same cost across bijection
100
Greedy algorithm for set-cover
● Initialize an empty solution
● At each step, add to the solution the set that maximizes the coverage-to-cost ratio
● Stop when all points are covered
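The greedy rule can be sketched on an explicitly listed set system. This is the generic weighted set-cover heuristic only: in the pathlet setting the sets are exponentially many and are never listed, so this sketch does not show how the talk implements the greedy step. The instance and all names below are made up.

```python
def greedy_set_cover(universe, sets, cost):
    """Weighted greedy set cover.

    universe: iterable of elements
    sets:     dict mapping a set's name to its elements
    cost:     dict mapping a set's name to a positive cost
    Returns the names of the chosen sets, in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Coverage-to-cost ratio: newly covered elements per unit of cost.
        best = max(sets, key=lambda s: len(sets[s] & uncovered) / cost[s])
        if not sets[best] & uncovered:
            raise ValueError("the sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

universe = range(1, 7)
sets = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}, "D": {1, 2, 3, 4, 5, 6}}
cost = {"A": 1.5, "B": 2.0, "C": 1.0, "D": 5.0}

chosen = greedy_set_cover(universe, sets, cost)
print(chosen, sum(cost[s] for s in chosen))  # ['A', 'C'] 2.5
```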
102
Coverage-to-cost ratio
● For let denote coverage-to-cost ratio
where is set of uncovered pts. of
103
Coverage-to-cost ratio
● For let denote coverage-to-cost ratio
, if is not yet covered
, otherwise
105
Implementing greedy step
For each candidate pathlet, need to compute the set that maximizes the coverage-to-cost ratio
– Tricky, since we do not construct these sets at all !
● Best set for a pathlet can be found in poly-time without explicitly constructing all the sets !!
– Can decompose into contribution corresponding to each traj.
– Independently choose “best” subtraj. from each traj.
106
Our result
Let ,
● Theorem : The greedy algorithm computes a -approximate solution to the pathlet-cover problem in time
107
Subtrajectory clustering
Set of candidate pathlets not given, assumed to be all possible trajectories
109
Reducing # candidate pathlets
● satisfies triangle inequality :
– Let candidate pathlets be subtraj. of input traj.
– # candidate pathlets is
– Optimal solution cost increases by factor of 2
● :
– Can reduce # candidate pathlets to
– Cost increases by factor of
110
Improved running time
● For realistic inputs can achieve more speed-up
– For each pathlet only subtraj. assigned from each traj.
● Theorem : For realistic curves using Fréchet distance, can compute -approximate solution to the subtrajectory clustering problem in time
111
Experiments : data sets
Real data sets :
● Beijing taxi data [Tsinghua University]
– 28,000 cabs over 4 days
– 9 mil. points
– Incomplete and sparse
112
Experiments : data sets
Real data sets :
● GeoLife [Microsoft Research Asia]
– Pedestrian data of 182 users over 4 years
– ~2,600 traj.
– ~1.5 mil. pts.
● Cycling
– 37 traj.
– 106,000 pts.
– Has self-intersections and loops
113
Experiments : data sets
Synthetic data sets :
● RTP
– Traffic data generated by web-based tool [http://mntg.cs.umn.edu/tg/index.php]
– Research Triangle in NC
– ~20,000 traj.
– ~1 mil. pts.
114
Dense & popular regions
115
Common trajectory portions
116
Handling noise
117
Gaps
118
Data-driven pathlets
119
Summary
● Indexing big data
● Massive terrain analysis
● Comparing merge trees - briefly
● Extracting common movement patterns from trajectories
120
Future directions
● MPC model
– Point location queries, multiway separators for planar graphs ...
– Big open problem – general graph connectivity in rounds
– Other open problems in parallel query processing in databases [Koutris et al. 2018]
● Gromov-Hausdorff distance
– Big gap b/w upper and lower bound :(
– More research into additive distortion of metric embeddings
121
Future directions
● Trajectory clustering
– Efficient -approx. to k-center, k-median, k-means for say Fréchet distance
– Stumbling block – infinite doubling dimension
– Work by [Driemel et al. 2016] on clustering time-series data
● Running time is exponential in complexity of cluster centers – assumed to be constant
● Is it a good assumption??
– What are good assumptions? Perturbation resilience? Stability?
● Can anything interesting be proved ?
122
Acknowledgements
123
Committee
Pankaj, Kamesh, Rong, Yusu
124
Collaborators
Pankaj, Kamesh, Yusu, Kyle, Tasos, Jiangwei, Erin
125
Theory group
126
CS@Duke
● Ergys and Cassie; other students ...
● Marilyn, Pam, Celeste, Alison, Kathleen …
● CS Lab staff
127
Outside Duke