
Algorithms for analyzing spatio-temporal data

PhD defense
Abhinandan Nath

Department of Computer Science
Duke University

Committee : Pankaj K. Agarwal (supervisor), Kamesh Munagala, Rong Ge, Yusu Wang

2

Introduction

4

The Data Deluge

“Mankind created 150 exabytes (billion gigabytes) of data in 2005. This year, it will create 1,200 exabytes.”

- The Economist, 2010

https://www.economist.com/node/15579717

5

Some (more) numbers ...

USGS National Elevation data (10 metre resolution) [Dewberry, 2012]

NYC taxi pickup and dropoff data, 2009-2016 : 1.3 billion points [towardsdatascience.com]

6

Geometric flavor of data

● Many data sets geometric in nature

● Problems in other domains can be mapped to geometric domain

– e.g., SELECT query in relational databases

NAME      AGE   SALARY
Alice     26    30,000
Bob       30    35,000
Charlie   28    25,000
...       ...   ...

7

Challenges

Massive data sets that
● Are noisy [towardsdatascience.com]
● Have outliers
● Are incomplete
● Are time-varying, e.g., trajectories

8

My Research

● Use techniques from computational geometry and topology to tackle some of these challenges in geometric data sets

● Design algorithms that
– Are practical
– Have provable performance guarantees

9

Broad themes

● Distributed algorithms
– Inspired by frameworks like MapReduce [Dean & Ghemawat, 2008] and Spark [Zaharia et al., 2010]

● Succinct descriptors
– Concisely encode desired properties of big data sets
– Noise-robust proxies for data sets
– Clustering

10

At a glance

Distributed algorithms
– Indices to answer range and nearest-neighbor queries [AFMN, 2016]
– Triangulation & contour tree of massive terrains [AFMN, 2016]

Succinct descriptors
– Comparing merge trees of real-valued functions [AFNSW, 2015]
– Common movement patterns from trajectory data [AFMNPT, 2018]

12

Distributed model of computation

● Massively Parallel Communication (MPC) model [Beame et al., 2013]

● Captures salient features of modern frameworks like MapReduce [Dean & Ghemawat, 2008]

13

MPC model of computation

● Input of size n is distributed across a number of machines
● Each machine has O(s) storage

Assume ,

for

[Figure: machines with O(s) storage each, connected by a communication medium; input size n]

14

MPC model of computation

● Computation proceeds in rounds
– In each round, each machine computes on local data

● Communication between machines occurs between rounds

● No. of messages sent/received by any machine in a round bounded by

15

Performance measures

● No. of rounds of computation

● Running time, defined from the running time of each machine in each round

● Total work
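To make the round structure and the three measures concrete, here is a toy single-process simulation. It is only a sketch: `run_mpc`, `local_sum` and `final_sum` are hypothetical names, and because the formulas on the slide did not survive extraction, the code assumes that running time is the sum over rounds of the slowest machine's cost and that total work is the sum of all costs over all machines and rounds.

```python
from collections import defaultdict

def run_mpc(local_data, round_fns):
    """Toy MPC simulation (hypothetical helper, not from the talk).

    local_data[i] holds the items stored on machine i.
    Each round function fn(machine_id, items) returns
    (list of (destination_machine, item) messages, local_cost).
    """
    m = len(local_data)
    rounds, running_time, total_work = 0, 0, 0
    for fn in round_fns:
        inboxes = defaultdict(list)
        costs = []
        for i in range(m):
            # Each machine computes on its local data only.
            messages, cost = fn(i, local_data[i])
            costs.append(cost)
            for dest, item in messages:
                inboxes[dest].append(item)
        # Communication happens between rounds.
        local_data = [inboxes[i] for i in range(m)]
        rounds += 1
        running_time += max(costs)  # assumption: slowest machine per round
        total_work += sum(costs)    # assumption: all machines, all rounds
    stats = {"rounds": rounds, "time": running_time, "work": total_work}
    return local_data, stats

def local_sum(machine_id, items):
    # Round 1: send the local partial sum to machine 0.
    return [(0, sum(items))], len(items)

def final_sum(machine_id, items):
    # Round 2: machine 0 adds up the partial sums it received.
    return ([(0, sum(items))] if items else []), len(items)

data = [[1, 2, 3], [4, 5], [6]]
result, stats = run_mpc(data, [local_sum, final_sum])
```

In this run the global sum ends up on machine 0 after two rounds; the simulation ignores the per-machine storage and message limits of the model.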

16

At a glance


Indices to answer range and nearest-neighbor queries [AFMN, 2016]

Joint work with Pankaj K. Agarwal, Kyle Fox & Kamesh Munagala

17

Indexing big data

● Query big data sets faster, but how?

– Build an index !

● Consider geometric queries
– Orthogonal range queries
– Nearest-neighbor queries

18

Previous work

● Work on conjunctive and join queries, graph processing in MapReduce and its variants [Lee et al., 2012; Qin et al., 2014; Malewicz et al., 2010; Beame et al., 2013; Koutris et al., 2018; ...]

● Geometric queries - MapReduce implementations for analyzing and querying spatial and geometric data [Eldawy et al., 2013, 2015; Arabi et al., 2014; …] - no provable performance guarantees!!

19

Our work

Build and query distributed variants of the following classical data structures, with provable performance guarantees

– Orthogonal range searching
● Kd-tree [Bentley, 1975]
● Range tree [Bentley, 1980]
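For readers unfamiliar with the classical structure, a minimal sequential, in-memory kd-tree with an orthogonal range query might look as follows. This is the textbook single-machine version, not the distributed variant from the talk; the function names and the dictionary-based node layout are illustrative assumptions.

```python
def build_kdtree(points, depth=0):
    """Build a kd-tree over equal-length tuples using median splits."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the coordinates
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid + 1:], depth + 1),
    }

def range_query(node, lo, hi, out=None):
    """Report all points p with lo[i] <= p[i] <= hi[i] in every coordinate."""
    if out is None:
        out = []
    if node is None:
        return out
    p, axis = node["point"], node["axis"]
    if all(l <= x <= h for l, x, h in zip(lo, p, hi)):
        out.append(p)
    # The left subtree holds coordinates <= p[axis], the right one >= p[axis].
    if lo[axis] <= p[axis]:
        range_query(node["left"], lo, hi, out)
    if p[axis] <= hi[axis]:
        range_query(node["right"], lo, hi, out)
    return out

# Rows of the NAME / AGE / SALARY table viewed as points (age, salary).
rows = [(26, 30000), (30, 35000), (28, 25000)]
tree = build_kdtree(rows)
# "SELECT ... WHERE 25 <= age <= 29 AND 20000 <= salary <= 32000"
hits = sorted(range_query(tree, (25, 20000), (29, 32000)))
```

The query rectangle here plays the role of the WHERE clause of the SELECT example from the introduction.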

– Nearest-neighbor searching
● Balanced Box Decomposition (BBD)-tree [Arya et al., 1998]

26

Our results

: total no. of input points in

: total no. of points reported for a range query

: max no. of points reported by a machine for a range query

27

Our results

● Kd-tree :
– Construction : rounds, time, work
– Query : rounds, time, work – optimal if each point can be stored exactly once

Also extends to partition trees [Chan, 2012] for simplex range searching

28

Our results

● Range tree :
– Construction : rounds, time, work
– Query : rounds, time, and work

● BBD-tree :
– Construction : rounds, time, work
– Query : rounds, time and work

29

Key idea : random sampling

● Data structures based on balanced hierarchical partitioning of input points represented as a tree

● Approximate this partitioning using a small random sample of input!

34

Balanced partitioning on random sample leads to balanced partitioning on entire set!!
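The claim can be illustrated with a small one-dimensional experiment: choose splitters from the quantiles of a random sample and check how evenly they divide the full input. The sketch below is only an illustration under assumptions of my own (1D keys, four parts, a sample of 1,000 out of 20,000 points, hypothetical function names); it is not the sampling scheme or the analysis from the talk.

```python
import bisect
import random

def sample_splitters(points, sample_size, num_parts, rng):
    """Pick num_parts - 1 splitters from the quantiles of a random sample."""
    sample = sorted(rng.sample(points, sample_size))
    return [sample[(i * sample_size) // num_parts] for i in range(1, num_parts)]

def partition(points, splitters):
    """Place every point in the bucket delimited by the sorted splitters."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for x in points:
        buckets[bisect.bisect_right(splitters, x)].append(x)
    return buckets

rng = random.Random(0)                      # fixed seed for repeatability
points = [rng.random() for _ in range(20000)]
splitters = sample_splitters(points, sample_size=1000, num_parts=4, rng=rng)
sizes = [len(b) for b in partition(points, splitters)]
```

Each bucket should hold roughly a quarter of the points, even though the splitters were computed from only 5% of the input.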

35

At a glance

Triangulation & contour tree of massive terrains [AFMN, 2016]

Joint work with Pankaj K. Agarwal, Kyle Fox & Kamesh Munagala

37

Terrain modeling

Airborne LiDAR scanning [http://www.lgs.ie/airborne-lidar.shtml]

Raw elevation data (3D point cloud) [kellylab.berkeley.edu]

Digital Elevation Model (DEM) [gisgeography.com/free-global-dem-data-sources/]

38

From 3D point cloud to DEM

● Terrain – xy-monotone surface in ℝ³

● Graph of a height function

● Often stored as a triangulated irregular network (TIN)

● How to build TINs and perform terrain analysis in the MPC model ?

39

Our Work

● Build TIN model, using Delaunay triangulation

● Compute the contour tree to succinctly encode all contours of terrain

Pipeline : input points in ℝ³ → build terrain model → build contour tree → use contour tree (many applications, e.g., waterflow prediction, climate model viz.)

40

Prior Work

● Delaunay triangulation
– RAM and I/O model [Crauser et al., 2001]
– PRAM algorithms [Blelloch et al., 1999]
– Goodrich's algorithm [Goodrich, 1997] can be adapted to MPC model – too complicated
– SpatialHadoop [Eldawy et al., 2015] – no theoretical bounds

● Contour tree
– RAM and I/O model [Carr et al., 2003; Pascucci and Cole-McLaughlin, 2002; Agarwal et al., 2010; …]
– Distributed and parallel algorithms [Morozov and Weber, 2013, 2014; Pascucci and Cole-McLaughlin, 2003; Acharya and Natarajan, 2015; ...]

41

Our results

● Given points, compute its Delaunay triangulation in rounds, time, and work, with high probability

● Given a terrain of size , compute its contour tree in rounds, time, and work

42

Build terrain model

43

Delaunay Triangulation

● Given a set of points in the plane, a triangulation of the set is Delaunay if
– No triangle contains any point of the set in the interior of its circumcircle

● Many useful properties, e.g., avoids skinny triangles

[gamedev.stackexchange.com]
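The empty-circumcircle condition can be checked directly with the standard in-circle determinant. The following is a brute-force sketch for small inputs; the function names and the representation of triangles as index triples are assumptions made for illustration, and it is unrelated to the distributed construction described in the talk.

```python
def orient(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def in_circumcircle(a, b, c, d):
    """True if d lies strictly inside the circumcircle of triangle abc."""
    rows = []
    for p in (a, b, c):
        dx, dy = p[0] - d[0], p[1] - d[1]
        rows.append((dx, dy, dx * dx + dy * dy))
    (ax, ay, az), (bx, by, bz), (cx, cy, cz) = rows
    det = (ax * (by * cz - bz * cy)
           - ay * (bx * cz - bz * cx)
           + az * (bx * cy - by * cx))
    # The determinant is positive for an interior point when abc is
    # counter-clockwise; multiplying by the orientation handles both cases.
    return det * orient(a, b, c) > 0

def is_delaunay(points, triangles):
    """Check the empty-circumcircle property for triangles given as index triples."""
    for i, j, k in triangles:
        for idx, p in enumerate(points):
            if idx not in (i, j, k) and in_circumcircle(
                    points[i], points[j], points[k], p):
                return False
    return True

# A rhombus: splitting along the short diagonal is Delaunay, the long one is not.
pts = [(0, 0), (4, 0), (2, 1), (2, -1)]
good = [(0, 3, 2), (3, 1, 2)]
bad = [(0, 1, 2), (0, 3, 1)]
```

The `bad` triangulation contains the two skinny triangles, which is the kind of shape the Delaunay condition rules out. With integer coordinates the test is exact; with floating-point input a robust predicate would be needed.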

44

Basic idea

● Randomly sample small set of points and compute triangulation of

● Use triangulation of to split input into smaller chunks

● Recurse on each chunk in parallel

45

Algorithm

1. Given points stored across many machines, randomly sample of size and send to one machine

47

Algorithm

2. Compute , and use it to distribute to disjoint machines

With slight changes, it can be shown that each chunk has size with high probability

51

Algorithm

3. Recursively compute for each chunk in parallel. Can filter unnecessary triangles by simple geometric tests to get

52

Analysis

● No. of levels of recursion is

● Each level takes rounds, time, and work

53

Build contour tree

54

Level sets and contours

● : triangulation of

● Height function
– Defined on each vertex
– Linearly interpolated within each face (triangle)

● Level set

● Contour : connected component of a level set
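As a small worked example of the linear interpolation, the height at a point inside a triangle can be computed from barycentric coordinates. The function names below are hypothetical; the second helper only tests whether a triangle meets a given level set, it does not trace contours.

```python
def interpolate_height(tri, heights, q):
    """Height at point q inside triangle tri, by linear (barycentric) interpolation.

    tri is three (x, y) vertices; heights are the values at those vertices.
    """
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (q[0] - x3) + (x3 - x2) * (q[1] - y3)) / det
    w2 = ((y3 - y1) * (q[0] - x3) + (x1 - x3) * (q[1] - y3)) / det
    w3 = 1.0 - w1 - w2
    return w1 * heights[0] + w2 * heights[1] + w3 * heights[2]

def meets_level(heights, level):
    """True if the level set at the given height passes through the triangle."""
    return min(heights) <= level <= max(heights)

tri = [(0, 0), (4, 0), (0, 4)]
heights = [0.0, 4.0, 8.0]       # these values correspond to h(x, y) = x + 2y
h_mid = interpolate_height(tri, heights, (1, 1))
```

Because the interpolant is linear on each face, a level set restricted to one triangle is a single line segment (or empty, or degenerate).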

55

Topology changes at saddle points

Image from [Agarwal et al., 2015]

56

Contour tree

● Obtained by contracting each contour of to a point

Agarwal et al., 2015

57

Our contribution

A simple and efficient divide-and-conquer algorithm to build and store the contour tree of a massive triangulated terrain in MPC model

58

Storage

● Contour tree stored in a distributed fashion

– Top subtree : a sized subtree stored on one machine

– Remaining subtrees stored on other machines, pointers to which stored with

61

Algorithm (divide step)

1. Split into smaller chunks
● Each chunk has same no. of points, goes to disjoint set of machines

64

Algorithm (conquer step)

2. Compute distributed contour trees of each chunk recursively in parallel

67

Algorithm (merge step)

3. Combine contour trees to get
– Minimize interaction b/w neighboring chunks
– Take advantage of data distribution and triangulation

68

Our main result

Given a terrain of size , designed algorithm to compute its contour tree in rounds, time, and work

● These bounds are worst-case optimal !

69

At a glance

Comparing merge trees of real-valued functions [AFNSW, 2015]

Joint work with Pankaj K. Agarwal, Kyle Fox, Tasos Sidiropoulos & Yusu Wang

Gonna skip!!

71

At a glance


Common movement patterns from trajectory data [AFMNPT, 2018]

Joint work with Pankaj K. Agarwal, Kyle Fox, Kamesh Munagala, Jiangwei Pan & Erin Taylor

72

Trajectory data

● Huge data available

– Improve decision making

– Gain insights

● Noisy and incomplete

● Several computational challenges

[https://www.sundried.com]

[developer.huawei.com]

73

Motivation

● Subtrajectory clusters capture common portions
● Different from clustering trajectories as a whole
● Extract high-level shared structure from large trajectory data sets

78

Pathlet

Representative pathlet for each cluster
– Cluster “center”
– Pathlet is a curve, not necessarily part of the input

79

Application of pathlets

● Compression of large trajectory data [Chen et al. 2013]

– Hope that each trajectory can be reconstructed with small no. of pathlets

– Small pathlet dictionary - non-linear dimension reduction

● Reconstructing road network from trajectory data [Li et al. 2013; Buchin et al. 2017]

80

Our contribution

● Model for subtrajectory clustering
– Robust to noise and missing data
– Data-driven clusters and pathlets

● NP-hardness of subtrajectory clustering problem

● Provably-efficient approximation algorithms
– Faster algorithms for realistic inputs

● Experimental results

81

Previous work

● Graph setting – no noise or gaps [Chen et al. 2013]

● Based only on point density [Panagiotakis et al. 2012]

● Restricted to line segments [Lee et al. 2007]

● Search for pre-defined patterns [Fan et al. 2016; Tang et al. 2013; Wang et al. 2015; Zheng et al. 2013]

None of these have provable performance guarantees!!

82

Model and problem formulation

Model inputs :
– Trajectories :
– Each trajectory is a sequence of points in the plane
● A subtrajectory is a subsequence of a trajectory

– Let be all trajectory points

83

Objective function

● Need small # pathlets
● Measure of cluster quality
● Fraction of points unassigned for each trajectory : “gaps”

89

A note on the distance

We use discrete Fréchet distance

Given and

● Correspondence s.t. every pt. in at least one pair

● is monotone if for all ,

90

Discrete Fréchet distance

: Set of all monotone correspondences b/w ,
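The discrete Fréchet distance is the minimum, over all monotone correspondences, of the largest distance between a matched pair of points. A standard dynamic program computes it in time proportional to the product of the two sequence lengths. The sketch below is that textbook recurrence with Euclidean point distances; the function name is a placeholder, and this is not the faster machinery used in the talk's algorithms.

```python
from math import dist

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between point sequences P and Q."""
    n, m = len(P), len(Q)
    # dp[i][j] = discrete Fréchet distance between P[:i+1] and Q[:j+1]
    dp = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(P[i], Q[j])
            if i == 0 and j == 0:
                dp[i][j] = d
            elif i == 0:
                dp[i][j] = max(dp[0][j - 1], d)
            elif j == 0:
                dp[i][j] = max(dp[i - 1][0], d)
            else:
                # Monotonicity: the previous pair advanced in P, in Q, or in both.
                best_prev = min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
                dp[i][j] = max(best_prev, d)
    return dp[n - 1][m - 1]

parallel = discrete_frechet([(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)])
detour = discrete_frechet([(0, 0), (4, 0)], [(0, 0), (2, 3), (4, 0)])
```

In the second example the middle point (2, 3) must be paired with one of the two endpoints of the first curve, so the distance is the square root of 13.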

91

Choosing pathlets

Given , goal is to choose from set of candidate pathlets to minimize objective function

● If is given as input : pathlet-cover problem

● If not given but assumed to be (uncountably) infinite set of all trajectories in plane : subtrajectory-clustering problem

92

Basic idea

● Reduce to set-cover

● Solve using greedy algorithm : gives approximation

● Challenge : implementing greedy step efficiently

93

Set-cover

Input :
● Set system
● Cost

Goal is to find of minimum total cost such that

94

From pathlet-cover to set-cover

● Elements to cover : the trajectory points

● Two kinds of sets, each with an associated cost :
– For every point, a set containing just that point; choosing it corresponds to treating the point as a gap in the pathlet cover
– For every candidate pathlet and for any set of subtraj., a set covering the points of those subtraj.; choosing it corresponds to assigning those subtraj. to the pathlet

Exponential # sets : cannot construct explicitly!!

99


Theorem : There exists a bijection between feasible solutions of the pathlet-cover instance and of the set-cover instance, with the same cost across the bijection

100

Greedy algorithm for set-cover

Initialize

● At each step add to the set in that maximizes the coverage-to-cost ratio

● Stop when all points are covered
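For reference, the generic greedy rule on an explicitly listed weighted set system can be sketched as below. This is the textbook algorithm with hypothetical names; it assumes the sets fit in memory, which is exactly what does not hold for the pathlet instance, where the sets are exponentially many and never built.

```python
def greedy_set_cover(universe, sets, cost):
    """Greedy weighted set cover.

    sets maps a name to a set of elements, cost maps a name to a positive cost.
    Returns the chosen names in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set with the best coverage-to-cost ratio on uncovered elements.
        best = max(
            (name for name in sets if sets[name] & uncovered),
            key=lambda name: len(sets[name] & uncovered) / cost[name],
            default=None,
        )
        if best is None:
            raise ValueError("some elements cannot be covered")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

universe = {1, 2, 3, 4, 5, 6}
sets = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}, "D": {1, 2, 3, 4, 5, 6}}
cost = {"A": 1, "B": 2, "C": 1, "D": 10}
picked = greedy_set_cover(universe, sets, cost)
```

Here the first pick is A (ratio 4), after which C covers the two remaining elements at ratio 2.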

101

Coverage-to-cost ratio

● For let denote coverage-to-cost ratio

where is set of uncovered pts. of

, if is not yet covered

, otherwise

104

Implementing greedy step

For each need to compute that maximizes – Tricky, since we do not construct these sets at all !

● Best set for can be found in poly-time without explicitly constructing all the sets !!

– Can decompose into contribution corresponding to each traj.

– Independently choose “best” subtraj. from each traj.

106

Our result

Let ,

● Theorem : The greedy algorithm computes a -approximate solution to the pathlet-cover problem in time

107

Subtrajectory clustering

Set of candidate pathlets not given, assumed to be all possible trajectories

108

Reducing # candidate pathlets

● Distance satisfies triangle inequality :
– Let candidate pathlets be subtraj. of input traj.
– # candidate pathlets is
– Optimal solution cost increases by factor of 2

● :– Can reduce # candidate pathlets to – Cost increases by factor of

110

Improved running time

● For realistic inputs can achieve more speed-up
– For each pathlet only subtraj. assigned from each traj.

● Theorem : For realistic curves using Fréchet distance, can compute -approximate solution to the subtrajectory clustering problem in time

111

Experiments : data sets

Real data sets :
● Beijing taxi data [Tsinghua University]
– 28,000 cabs over 4 days
– 9 million points
– Incomplete and sparse

112


Real data sets :
● GeoLife [Microsoft Research Asia]
– Pedestrian data of 182 users over 4 years
– ~2,600 trajectories
– ~1.5 million points
● Cycling
– 37 trajectories
– 106,000 points
– Has self-intersections and loops

113


Synthetic data sets :
● RTP
– Traffic data generated by web-based tool [http://mntg.cs.umn.edu/tg/index.php]
– Research Triangle in NC
– ~20,000 trajectories
– ~1 million points

114

Dense & popular regions

115

Common trajectory portions

116

Handling noise

117

Gaps

118

Data-driven pathlets

119

Summary

● Indexing big data

● Massive terrain analysis

● Comparing merge trees - briefly

● Extracting common movement patterns from trajectories

120

Future directions

● MPC model
– Point location queries, multiway separators for planar graphs ...
– Big open problem – general graph connectivity in rounds
– Other open problems in parallel query processing in databases [Koutris et al. 2018]

● Gromov-Hausdorff distance
– Big gap b/w upper and lower bound :(
– More research into additive distortion of metric embeddings

121

Future directions

● Trajectory clustering
– Efficient -approx. to k-center, k-median, k-means for say Fréchet distance
– Stumbling block – infinite doubling dimension
– Work by [Driemel et al. 2016] on clustering time-series data
● Running time is exponential in complexity of cluster centers – assumed to be constant
● Is it a good assumption??
– What are good assumptions? Perturbation resilience? Stability?

● Can anything interesting be proved ?

122

Acknowledgements

123

Committee

Pankaj, Kamesh, Rong, Yusu

124

Collaborators

Pankaj, Kamesh, Yusu, Kyle, Tasos, Jiangwei, Erin

125

Theory group

126

CS@Duke

● Ergys and Cassie; other students ...

● Marilyn, Pam, Celeste, Alison, Kathleen …

● CS Lab staff

127

Outside Duke