Christian jensen advanced routing in spatial networks using big data

Advanced Routing in Spatial

Networks Using Big Data

Christian S. Jensen

www.cs.aau.dk/~csj

Roadmap

• Big data, motivation, setting, overview

• Modeling and computation of weights

Spatial network lifting

EcoMark

Time-varying, uncertain weights

Complete weight annotation using incomplete information

Near-future travel cost inference

• Routing

Stochastic skyline routing

Routing based on local-driver behavior

• Closing

Demos, the future, challenges, acknowledgments, readings

Big Data

Roadmap

• Big data

Hype or substance?

Instrumentation of reality and digitization

The digital universe

Moore’s Law generalized

Big data challenges

Hype or Substance?

• We have been pushing the boundaries for decades

How much data we can handle

How fast

Data integration

• Examples

VLDB: International Conference on Very Large Database

TODS: ACM Transactions on Database Systems

• So is it all hype?

No

Instrumentation and Digitization

• Instrumentation of reality

Notably, smartphones

• Digitization of processes

E.g., e-commerce, public services, communications, social

interactions

The Vatican, 2005

The Vatican, 2013

2005 vs. 2013

Big Data

Every day, we create 2.5 quintillion bytes of data — so much

that 90% of the data in the world today has been created in

the last two years alone. This data comes from everywhere:

sensors used to gather climate information, posts to social

media sites, digital pictures and videos, purchase transaction

records, and cell phone GPS signals to name a few. This

data is big data. http://www-01.ibm.com/software/data/bigdata/

The Digital Universe

• The digital universe

Doubling every ~18-24 months

Grew 60% in 2009, 50% in 2010

2009: 0.8 zettabyte, 2010: 1.2 zettabyte, 2020: 35 zettabytes

2009-20: growth by a factor of 44

http://www.emc.com/collateral/demos/microsites/idc-digital-universe/iview.htm

1 zettabyte = 1024 exabytes

1 exabyte = 1024 petabytes =

260 = 1,152,921,504,606,846,976 ≈ 1018 bytes

Moore’s Law – The Bicycle Analogy

• Moore’s Law: computers double in speed every 24

months.

Applies also to quality-adjusted microprocessor prices, memory

capacity, disks, networks, sensors, and the number and size of

pixels in cameras.

• How fast would a bicyclist be if Moore’s Law applied?

50 years of doubling every 24 months

30 km/h originally

~1 billion km/h now

• Three lessons

Growth rates in computing are dramatic and difficult to imagine.

Hardware advances are important information technology drivers.

Humans don’t really improve – they are the constants.

Big Data – Synthesis

• The result is new opportunity.

• Lots of data and unprecedented computing infrastructure

combine to offer potentials for value creation from data.

• To be competitive, society and businesses must be able to

create value from data.

• Data-based decisions and data-driven processes

Decisions based on good data beat decisions based on feelings or

opinions.

• A finer granularity of services

• Entirely new services

Big Data – Data-Driven Society, Business

Christian S. Jensen

[email protected]

Big Data in Transportation:

Overview

Motivation – ITS

• A safer, greener, and more efficient and cost-effective

transportation infrastructure

• Greenhouse gas emissions reductions via eco-routing

• Congestion, greater Copenhagen region

~10 billion DKK/year (2004)

• Bad setting of signalized intersections in Denmark

~9,3 billion DKK/year (2012)

• The transportation sector is the second largest

greenhouse gas (GHG) emitting sector and also causes

substantial pollution.

By 2025, it’s predicted that 6.2 billion commutes will be made

every day, worldwide.

• The reduction of greenhouse gas (GHG) emissions from

transportation is essential to combat global climate

change.

EU: reduce GHG emissions by 30% by 2020.

G8: a 50% GHG reduction by 2050.

China: a 17% GHG reduction by 2015.

• Eco-routing can reduce vehicular impact by up to 20%.

• General context: Smart City

Motivation – Eco-Routing

System of Sensors Model

• The setting may be modeled as a system of (logical)

streams, one per edge.

Data is emitted from the stream of an edge when a vehicle

traverses the edge

Spatial

Spatio-temporally correlated

Sparse

• Real, unlike early, envisioned smart dust applications!

Methodology

• The same methodology underlies the studies covered.

• Define precisely a problem of (perceived) real-world

interest.

• Develop solutions

Concepts, data structures, algorithms

• Carry out mathematical analyses

Correctness, complexity, storage size

• Prototype the solutions and perform empirical studies

Often, real data is needed

Offers detailed insight in the design properties of the solutions

If you cannot measure it, you cannot improve it

• Iterate!

Data Foundation

• A road network model

E.g., OpenStreetMap

Germany: 10 million edges (rough estimate)

• Aerial laser scan data

For lifting, optional

• Tracking data from vehicles

E.g., GPS data

• Fuel consumption data from vehicles

Often called CAN bus data

GPS Data – DW Statistics

• Number of data sources (and NDAs): ca. 20

Daily data from 4 sources; data in batches from 4 sources (2 small,

2 big); data from ca. 10 finished projects

• Daily data

4 sources

~3 million rows

3500 vehicles

• Number of fact table rows: More than 4 billion

• Number of vehicles: ca. 25,000

• Rows with fuel data: 500 million (estimate; ~140,000 per

day)

• Rows from EVs: 100-200 million

Software and Hardware

• Experimental infrastructure

• Software

Have complete software stack for handling traffic data

Map-matching, data cleansing, multiple map support

• Hardware

Very modern server farm

Newest machine has 2TB main memory

Basic Eco Routing – CPH to the Train

Good 17:08 8.06 km 0.80 l

Bad 18:06 13.64 km 1.23 l

Roadmap to Achieving Eco-Routing

EcoMark 3D Road Networks

1. Understand how to quantify environmental impact

2. Assign Eco-Weights to a Road Network

3. Novel Routing Algorithms

Understand Vehicular Environmental Impact

• EcoMark and EcoMark 2.0

An evaluation framework that compares and analyzes 11 state-of-

the-art vehicular environmental impact models using GPS data

and a 3D road network.

What do the models need to quantify vehicular environmental

impact?

Speeds (instantaneous or average), and accelerations: from GPS data

Road slopes: from 3D road network

Model categorization

Instantaneous models: appropriate for eco-driving

Aggregated models: appropriate for eco-routing

It is possible to assign eco-weights to a large extent using GPS

data, 3D maps, and appropriate models.

An extended evaluation framework, EcoMark 2.0, is developed on

top of EcoMark using actual fuel consumption data that is obtained

from CAN bus data.

a1'a11'

a1

a11

a5'

a5

Elevation matters – 3D Road Network

• Efficient methods that can build a 3D road network from

3D laser scan data and a 2D road network.

• Use 3D laser scan data to formulate a terrain that is

represented as a Triangulated Irregular Network (TIN).

• Project 2D polylines to the TIN in order to obtain 3D

polylines, and thus result in a 3D road network.



Simple vs. Advanced

Eco-Weights

Static vs. Dynamic

Eco-Weights




Eco-Weights – Simple vs. Advanced

• Simple: time dependent, deterministic.

Each interval has a deterministic weight.

Ex: (0:00, 7:00]: 10 mg; (7:00, 9:00]: 18 mg; (9:00, 15:00]: 12 mg; …

• Advanced: time dependent, uncertain.

Each interval has an uncertain weight, i.e., a random variable that

follows some distribution (e.g., a uniform/normal distribution)

Ex: (0:00, 7:00]: 8 to 12 mg, N(9, 10); (7:00, 9:00]: 13 to 23 mg,

N(18, 10); (9:00, 15:00]: 8 to 12 mg, N(10, 20); …

Representation

Coverage Temporal

Granularity

Accuracy

Simple

Weights

Intervals, each

with a single

value

High

(all edges)

Coarse

(e.g., pre-

defined PEAK)

Good (better than

speed limits)

Advanced

Weights

Intervals, each

with a histogram

or a probability

density function

Low

(only ―hot‖

edges)

Fine

(e.g., 15-min

intervals)

High level of

detail (capture

the travel cost

distributions)

Eco-Weights – Static vs. Dynamic

• Static Weights

Obtained based on historical GPS data and remain static

Capture typical traffic patterns

• Dynamic Weights

Updated as the real-time GPS data streams in

Capture dynamic and instant traffic patterns

• Our solutions

Static Dynamic

Simple Graph Weight

Annotation

Spatio-Temporal

Hidden Markov

Model

Advanced Time-Dependent

Histograms

Spatio-Temporal

Hidden Markov

Model

Assigning Eco-Weights

• Static, simple weights, all edges

Challenge: infer weights for ―cold‖ edges – the edges that are not

covered by any GPS data

Graph weight annotation

Quantify edge similarities based on traffic flows that are derived from the

topology of a road network

Propagate the weights on hot edges to similar cold edges

• Static, advanced weights, hot edges

Capturing the weights of hot edges using time dependent histograms

Challenge: compact representation of the histograms

• Dynamic, simple and advanced weights, hot edges

Inferring near-future weights of hot edges as GPS data streams in

Challenge: sparseness, dependency, and heterogenity

Time-Dependent Histograms

• We need to contend with three challenges.

• Multiple travel costs

Consider not only GHG emissions, but also distance, travel time.

• Time-dependence

Some travel costs are time-dependent.

E.g., travel times on the same road may vary during peak vs. off-

peak periods.

• Uncertainty

Some travel costs are uncertain.

E.g., aggressive and moderate driving on the same road may

produce different GHG emissions.

Road Network Model: MTUG

• MTUG: Multi-cost, Time-dependent, Uncertain Graph

• N different costs are of interest.

Distance (DI), travel time (TT), GHG emissions (GE).

• An MTUG is a graph G = (V, E, MM, W)

V is the vertex set, and E is the edge set.

MM = <MM(1), …, MM(N)>

Function MM(i) maps an edge to the minimum and maximum i-th

cost of using the edge.

MM(TT) (ei) = (150 seconds, 500 seconds)

MM(GE) (ei) = (10 ml, 85 ml)

W = <W(1), …, W(N)>

Function W(i) maps an edge to a set of (interval, random variable)

pairs of the i-th cost type.

W(TT) (ej) = {([0:00, 7:15), N(300, 120)), ([7:15, 8:45), N(450, 100)), …}

W(GE) (ej) = {([0:00, 7:00), N(30, 100)), ([7:00, 9:00), N(50, 80)), …}

Instantiation of MM in an MTUG

• GPS records are map matched to edges.

• Each edge is associated with a set of edge records of the

form (e, t, C)

An edge record indicates that a traversal on edge e at time t takes

costs C, where C is a vector of all costs of interest.

(e1, 8:08, <55 seconds, 80 ml>)

(e1, 9:08, < 45 seconds , 63 ml>)

(e1, 9:10, < 43 seconds , 60 ml>)

(e1, 10:03, < 45 seconds , 62 ml>)

• MM(TT) (e1) = (43 seconds, 55 seconds)

• MM(GE) (e1) = (60 ml, 80 ml)

Instantiation of W in an MTUG

• Partition a day into 96 15-min intervals.

• For each edge and interval, we obtain a multi-set containing

the costs on the edge during the interval.

ms = {(10 s, 3), (20 s, 10), (25 s, 20), (30 s, 10), (40 s, 5), …}

• Estimate a random variable (RV) based on the multi-set.

• Combine adjacent intervals with similar RVs and estimate a

new RV. The long intervals along with the RVs instantiate W.

Route Costs in MTUG

• Given a route Ri = <r1, r2, …, rX>, where ri ∈ E is an edge.

• RC(Ri, t) indicates the costs of using route Ri at time t

RC(Ri, t) = <RVDI, RVTT, RVGE> is a vector of RVs, where each RV

corresponds to a travel cost.

• RVDI is deterministic and equals the sum of the length of

each edge in route Ri.

• RVTT is the convolution of the travel time RV of each edge

in route Ri.

The travel time RV of the first edge r1 is dependent on the trip start

time t.

The travel time RV of the k-th edge rk is dependent on the travel

time of the previous k-1 edges, which are generally uncertain.

• RVGE is the convolution of the GHG emissions RV of each

edge in route Ri.



Simple vs. Advanced

Eco-Weights

Static vs. Dynamic

Eco-Weights




Skyline

Eco-Routing

Continuous

Eco-Routing

Personalized

Eco-Routing

Novel Routing Algorithms

• Stochastic skyline route planning under time-varying

uncertainty.

Identify pareto-optimal routes considering multiple travel costs,

e.g., travel times, GHG emissions, distances.

R1 (15 min, 500 ml, 25 km)

R2 (13 min, 520 ml, 26 km)

R3 (16 min, 530 ml, 30 km)

• Continuous routing that supports real-time updates on

eco-weights (i.e., dynamic eco-weights).

Real-time re-navigate vehicles according to recently updated

eco-weights and the current locations of vehicles.

• Context-aware, personalized routing

Identify context-aware preferences for each driver (e.g., time-

efficient driving or fuel-efficient driving).

Choose a single route that best satisfies the driver’s

preferences.

Deterministic Skyline Routes

• Route cost: cost(Ri) = <DI, TT, GE>

A vector of deterministic values.

Each value corresponds to a travel cost.

• Dominance relationship

Ri dominates Rj iff all the costs of Ri are no greater than those of Rj,

and there is at lest one cost of Ri is smaller than that of Rj.

• Consider multiple routes for the same source-destination.

R1: 3.5 km, 230 mg, 10 min;

R2: 5.1 km, 250 mg, 11 min;

R3: 5.1 km, 200 mg, 12 min;

• The skyline routes are the non-dominated routes.

Since R2 is dominated by R1, R1 and R3 are the skyline routes.

Stochastic Dominance

• Route cost: RC(Ri) = <RVDI, RVTT, RVGE>

A vector of random variables (RVs), where each RV represents the

distribution of a travel cost.

• Stochastic Dominance between two RVs

Given two RVs X and Y, if cdfX(a) >= cdfY(a), for all possible value

a in R+, we say ―X stochastically dominates Y‖.

Cost(R1).RVTT stochastically dominates Cost(R2). RVTT.

No stochastic dominance between Cost(R3). RVTT and Cost(R4).

RVTT.

Stochastic Skyline Routes

• Dominance between two routes Ri and Rj

If each RV of cost(Ri) stochastically dominates the corresponding

RV of cost(Rj), then Ri dominates Rj.

• Stochastic Skyline Routes

Given a source-destination pair, and a trip starting time.

Stochastic skyline routes are the routes that are not dominated by

any other routes.

• Brute force stochastic skyline route planning

Enumerate all possible routes, compute the route costs, and check

whether one route dominates another.

Very inefficient and works only for small road networks.

• Efficient stochastic skyline route planning

Prune early some routes that cannot be skyline routes.

Efficient stochastic dominance checking.

Example Result

• Skyline routes R1, R2, and R3, identified by our algorithm

• R1: 94,849 m; R2: 106,216 m; R3: 91,382 m;

DI: R3 dominates R1 and R2.

TT: R1 dominates R2 and R3

GE: R2 dominates R1 and R3.

Personalized Routing

• Different drivers may take different routes

because they may have quite different

preferences.

• The same drivers may take different

routes in different contexts.

Morning: try to save time to avoid being late.

Weekend afternoon: try to save fuel

consumption.

• Challenges

Identify contexts for drivers and identify driving

preference in each context.

Deal with time-dependent uncertain travel

costs, e.g., travel time and fuel consumption,

while considering individual drivers’ driving

behaviors, e.g., aggressive vs. moderate

driving.

Drivers’

Trajectories

Framework

Context and Preference

Identification Contexts &

Preferences

Driver,

Source,

Destination,

Time

Routing M

odule

Op

tim

al R

oute

s

Offline Phase Online Phase

Example Results

Dark, bold routes: actual routes used by drivers.

Red routes: shortest routes.

Green routes: fastest routes.

Blue routes: predicted routes using the identified contexts and

driving preferences.

Summary

• Eco-Routing is able to effectively reduce environmental

impact from transportation section.

• Eco-routing is achieved by

Reliably and accurately quantifying vehicles’ environmental impact

using GPS data and 3D road networks;

Assigning various types of eco-weights to edges;

Developing novel routing algorithms that enable effective and

practical eco-routing service.

Building Accurate 3D

Spatial Networks

Outline

• Motivation

• Solution

Phase A: Filtering

Phase B: Lifting

• Experiments

• Summary

High-Res 3D Model Usage

Eco-Routing •USD 6 Billion per annum in Fuel

savings with 3D model –

TomTom

•EcoMark: Benchmark shows

better CO2 emissions estimation.

The Idea

2D Road Network

Aerial Laser Scan

3D Road Network

Classifying 3D Models

• Need for accurate 3D spatial networks by

applications.

• Ease of availability of rich terrain datasets.

3D Road Model

High-Res Model

•± 20 cm Accuracy

•Higher Storage Cost

Low-Res Model

•± 2 mts Accuracy

•Lower Storage Cost

Current 3D Map Generation Methods

• Van fitted with Differential GPS driven on

roads

Current 3D Map Generation Methods

• Crowd-sourcing from ―Personal

Navigation Devices‖ (PNDs).

Solution

GOAL: Design two methods to

generate High and Low resolution

3D maps.

Problem Definition

Input

A 2D spatial network, where an edge is modeled as a 2D polyline

A laser scan point cloud containing a set of 3D points

Output

A 3D spatial network, where an edge is modeled as a 3D polyline

2D Spatial Network

+ 3D points

3D Spatial Network

3D Points

ε-Neighborhood

Framework

2D Spatial Network Laser Scan Point Cloud

One pass

Filtering

External Memory

based Filtering

Filtering

Delaunay Triangulation

or

Lifting

Interpolation

3D Spatial Network

A: Filtering

One-Pass Filter (OPF)

•3D laser scan points are ―Inherently sorted spatially‖!

•Data Guarantee: A 1m x 1m cell contains two 3-D points.

•Assumption: The road is much smaller and fits in main memory.

Build a grid-index Hg from the road

segments, mapping road points to

cell IDs.

Scan the massive 3D point cloud in a

single pass and build a 3D point to

cell mapping Hp ”seeded” by Hg

If 3D points exceed a threshold cell

occupancy of α x δ x δ where α is

the user defined threshold in [0,1]

and δ is the grid cell width -> Lift cell

contents and release cell memory!

One-Pass Filter (OPF)

Advantages of OPF

Needs no pre-processing of massive 3D point cloud for sorting or

indexing.

Builds parts of the 3D road network as 3D points are streaming

through.

Faster runtimes, lower resolution. (± 2m accuracy)

External Memory based Filtering (EMF)

• No assumption about road network size.

Neither road points nor 3D points fit in

memory. Hence, stored on disk.

• Limited memory budget available to the

method.

Moore Neighborhood: 3,5,8-cell

neighborhood. Ex: grey shaded

region around cell (1,1).


Locality Preserving Space-Filling Curves


Read road point disk blocks Br from

disk into main memory.

For each block Br - read 3D points

Moore neighborhood blocks from

disk into a limited LRU cache

(size given in multiples of the grid

width c).

Continue this process in a space-

filling curve (Z-order or Row-Major)

fashion to avoid re-reading 3D point

blocks from disk.

Z-Order vs. Row-Major Disk Access


Why is Row-Major doing well at 3c?

LRU Cache

Block to process

Neighbor Block needed in memory

Blocks in cache



LRU Cache (Filled up)

Block to process


Blocks in cache



Answer: Disk Blocks are read into memory exactly once !

LRU Cache (Filled up)

Block to process


Blocks in cache

Blocks evicted


Advantages of EMF

Requires little main memory

Slower runtimes, higher resolution (± 20 cm accuracy)

Can work to ―Zoom-in‖ on OPF created maps to improve

resolution on mobile devices

B: Lifting

Triangulation

• Delaunay Triangulation

After filtering, a set of 3D points is associated with a cell of road

segments.

The set of 3D points is triangulated.

3D Points

2D Road Points

OPF vs. EMF

Interpolation

a1'a11'

a1

a11

a5'

a5

Project the TIN triangles

to the x-y plane.

Compute the

intersections of triangles

and road 2D polyline.

Re-project road points +

intersections back to TIN

surface again.

Interpolate road points

on TIN to get elevations.

Experimental Study

Dataset Statistics

•Aalborg (AA) : 7km x 5km

•North Jutland (NJ) : 185km × 130km

•NJ approx. 120 times larger than AA.

Experimental Study (Accuracy)

OPF vs. EMF

Experimental Study (Scalability)

• North Jutland (NJ) covering a region of 185km × 130km

• EMF

Accuracy ±20 cm in 21 hrs with memory use of 230 MB

• OPF

Accuracy ±2 m in 2 hrs, which is an order of magnitude

faster than EMF, using 2.4 GB of main memory

Summary

• Augment a 2D road network with elevation values from 3D

point cloud

• Propose two methods that work with limited memory budgets

• Extensive experiments to offer insight into both methods

** 3D Spatial Network available from

UCI Machine Learning Repository

(http://archive.ics.uci.edu/ml/)

EcoMark: Evaluating Models of Vehicular

Environmental Impact

Outline

• Motivation

• EcoMark overview

• Model analysis

• Experiments

• Context

• Summary

Motivation

• The reduction of greenhouse gas (GHG) emissions from

transportation is essential to combat global climate

change.




• Eco-driving and eco-routing can reduce GHG emissions.

Eco-driving: driving behavior

Eco-routing: routes that minimize fuel usage

• We need to be able to measure vehicular environmental

impact.

Several impact models have been developed, but a

comprehensive comparison of these is missing.

The utility of the models for eco-driving and eco-routing is not well

understood.

EcoMark

• EcoMark aims to enable a better understanding of how to

compute the environmental impact of vehicles.

• EcoMark is a benchmark that compares and analyzes 11

state-of-the-art vehicular environmental impact models.

No existing such comprehensive benchmark.

• Key characteristics of EcoMark

Uses GPS data from vehicles and a 3D spatial network

Categorizes the 11 models into two major categories

Compares and analyzes the models in the same categories, and

the models between the two categories

EcoMark Framework

GPS observations 2D Spatial Network Laser Scan Points

Input

Instantaneous Impact

6 Instantaneous Models 5 Aggregated Models

Aggregated Impact

Comparison and Analysis

Insights

EcoMark

Map Matching

GPS Trajectories

Spatial Network Lifting

3D Spatial Network

3D Spatial Network

• Spatial network lifting

2D spatial network: OpenStreetMap

Aerial laser scan of Denmark (1+ point per m2; 2.5 TB for

Denmark)

• A 3D spatial network is a directed, weighted graph

denoted as G = (V, E, F, H), where

V and E are the vertex and edge sets.

F maps vertices to 3D coordinates.

H maps edges to grades.

Benchmark Experiments

• Exp1: Eco driving: Comparisons of instantaneous models

Compare the behaviors of the 6 models.

Identify models that behave similarly.

• Exp2: Eco routing: Comparisons of aggregated models

Compare the behaviors of the 5 models.

Identify models that behave similarly.

• Exp3: Cross-comparisons of instantaneous and

aggregated models

Identity the similarity of the aggregation of instantaneous models

and aggregated models.

• Exp4: Study of models that support grade information

How important is 3D – effects of using a 3D spatial network.

Insights

• Instantaneous models

Five of six models behaved consistently.

They can be used to measure the impact of driving behaviors.

They can be used to classify good and bad driving behaviors.

• Aggregated models

All the five models behaved consistently.

Not suitable to identify impact at operational levels.

They can be used to assign eco-weights to road segments, and

are helpful for eco-routing.

• Importance of 3D Spatial Networks

Road grades affect all models substantially.

Eco-driving and eco-routing both benefit.

Summary

• EcoMark

Evaluates empirically 11 existing models of vehicular

environmental impact using a large trajectory database and a

lifted, 3D OpenStreetMap

Findings: Instantaneous models support eco driving, while

aggregated models support eco-routing.

• Next steps

Accuracy comparison with ground truth CAN bus data (EcoMark

2.0)

Requires access to corresponding GPS and CAN bus data.

Eco driving and eco routing

Time-Dependent Uncertain Eco-

Weights for Road Networks

Outline

• Motivation

• Problem definition

• Framework

• ERN building

• GHG emissions estimation

• Empirical study

• Summary

Motivation – Advanced Eco-Weights

• Eco-routing is an effective way to reduce vehicular

environmental impact.

• The capture of the environmental costs of traversing

road network edges is key to eco-routing.

Eco-weights are uncertain.

Eco-weights are time-dependent.

Once obtained, existing routing algorithms can be

applied to compute eco-routes.

Problem Definition

• Goal: Obtain an Eco Road Network G = (V, E, F)

V: Vertex set. Each vertex indicates a road intersection.

E: Edge set. Each edge indicates a road segment.

Function F assigns a time-dependent, uncertain eco-weight to

each edge in E.

• Input

A set of map-matched trajectories TR.

An accompanying road network G' = (V, E, Null).

• Output

The Eco Road Network G = (V, E, F).

Approach

• Time-dependent uncertain histograms.

A vector of (period, histogram) tuples <Ti, Hi>.

Hi is the histogram describing the distribution of cost values

observed in period Ti.

• Time-dependent uncertain histograms are used to

represent the eco-weight of each road network edge.

• Several data compression techniques are applied to

reduce the storage space while retaining accuracy.

[10 a.m., 11 a.m.) [9 a.m., 10 a.m.) [8 a.m., 9 a.m.)

0

0,5

1

0

0,5

1

0

0,5

1

Framework

Trajectories TR Road Network

G = (V, E, NULL)

Traversal Record Analysis

Initial Histogram Construction

Histogram Merging

Bucket Reduction

Eco Road Network

G = (V, E, F)

Traversal Record Analysis

• GPS records are map-matched to the corresponding

edges.

• Map-matched records are transformed into traversal

records.

A traversal record r = (e, t, tt, ge) indicates that edge e is traversed

by a trajectory trj starting at time t and has travel time tt and GHG

emissions ge.

The VT-micro environmental impact model is used to estimate the

GHG emissions of each traversal record.

Initial Histogram Building

• Each edge is associated with a set of traversal records

• Divide the time space into intervals with equal width

The default value is 1 hour, (24 intervals in total).

• For each edge e

Build equi-width histograms for each time interval.

The number of buckets per time interval is configurable.

The histograms are isomorphic.

0.1

0.3

0.5

0.7

[0, 20) [20, 40) [40, 60) [60, 80]

Pro

bab

ilit

y

GHG emissions (mL)

[8 a.m., 9 a.m.)

[9 a.m., 10 a.m.)

Histogram Merging

• For each edge, merge two temporally adjacent histograms

if they are sufficiently similar.

• Use cosine similarity to quantify similarity.

• We use a merge threshold Tmerge to decide when to stop

merging.

sim(Hi , H j ) =V(H i ) ÄV(H j )

V(H i ) · V(H j )

0.1

0.3

0.5

0.7

[0,20) [20,40) [40,60) [60,80]

Pro

bab

ilit

y

GHG emissions (mL)

[8 a.m., 9 a.m.)

[9 a.m., 10 a.m.)

0.1

0.3

0.5

0.7

[0,20) [20,40) [40,60) [60,80]

Pro

bab

ilit

y

GHG emissions (mL)

[8 a.m., 10 a.m.)

Histogram Bucket Reduction

• Further reduce the storage size of an individual histogram

by merging adjacent buckets. Use SSE to measure the merge cost (accuracy loss).

Merge buckets when the cost does not exceed threshold Tred.

Iteratively merge adjacent buckets in all the histograms of a road

segment.

0.1

0.3

0.5

0.7

[0,20) [20,40) [40,60) [60,80]

Pro

bab

ilit

y

GHG emissions (mL)

[8 a.m., 10 a.m.)

0.1

0.3

0.5

0.7

[0,40) [40,60) [60,80]

Pro

bab

ilit

y

GHG emissions (mL)

[8 a.m., 10 a.m.)

GHG Emissions Estimation

• For a route

Estimate the distribution of GHG emissions as a histogram.

Aggregate the histograms of the edges in the route.

• Given two histograms H1 and H2 for adjacent edges

A histogram H′ is computed that represents the aggregated GHG

emissions distribution for traversing both edges.

21

' HHH

0.1

0.3

0.5

0.7

[0,20) [20,40) [40,60)

Pro

bab

ilit

y

GHG emissions (mL)

e1e2

0.1

0.3

0.5

0.7

[0,40) [40,80)[80,120)

Pro

bab

ilit

y

GHG emissions (mL)

e1 + e2

Histogram Aggregation

• Given buckets bj in H1 and bk in H2

• A bucket bi in H’ is generated as

• Enumerate all bucket pairs of H1[j] and H2[k].

• Rearrange the buckets in H′ to obtain an equi-width

histogram.

H '[i]. p= H1[ j ]. p×å H2[k]. p

Range(H '[i]) = Range(H1[ j ])+ Range(H2[k])

Experiments

• Experimental settings

Data Set: 200 million GPS records.

Road network of Denmark: 414K vertices and 818K edges.

GHG emissions: VT-micro is applied to evaluate vehicular

environmental impact.

• Evaluation

Baseline: without any compression.

State-of-the-art: T-drive.

Accuracy, storage, and efficiency.

Storage Study

• Histogram Merging

• Bucket Reduction

MCRm =M init - Mmerge

M init

0.6

0.7

0.8

0.9

1

0.9 0.92 0.94 0.96 0.98

MC

R

Merge Threshold

MCRr =Mmerge - M redu

Mmerge 0.6

0.7

0.8

0.9

1

35 40 45 50 55 60

MC

R

Bucket Limit

TMerge = 0.98

TMerge = 0.95

Accuracy Study

• Histogram Aggregation

• MCR vs. Err

0.8

0.9

1

2 6 10 14 18

HS

imil

arit

y

Number of Edges

0

0.1

0.2

0.3

0.4

35 40 45 50 55 60

Err

Bucket Limit

Tmerge = 0.98

Tmerge = 0.95

0.2

0.4

0.6

0.8

0.1 0.12 0.14 0.16 0.18

MC

R

Err

ERN

T-drive

• Error Metric – Err

The relative accumulative deviation of all the buckets in a

histograms from the original distribution.

0

0.1

0.2

0.9 0.92 0.94 0.96 0.98

Err

Merge Threshold

After Merging

Equi-width

• Histogram Merging

• Bucket Reduction

Summary

• We propose technique to estimate the time-dependent

environmental travel costs in a road network based on

histograms.

• An empirical study suggests that we can efficiently

compute eco-weights of road segments while retaining

accuracy.

• The proposed eco-weights can be used for eco routing

and real-time traffic monitoring.

Using Incomplete Information for

Complete Weight Annotation

of Road Networks

Overview

• Motivation

• Problem definition

• Cost modeling

• Objective functions

Residual Sum of Squares

Topology constraint

Adjacency constraint

• Experiments

Motivation

• Why weight annotation?

Vehicle routing relies on a weighted-graph representation of the

underlying road network.

Given a graph with appropriate weights, different types of routes

can be computed.

• Why incomplete information?

Weights that capture travel costs such as travel times can be

extracted from GPS trajectory data.

GPS trajectory data typically does not cover all edges.

• Why complete weight annotation?

To use a graph for routing, all edges must have weights.

Problem Definition

GPS Records

Pre-Processing Module

A set of (trip, cost) pairs

{(t(i), c(i))} G''(V, E, null)

Weight Annotation Module

G(V, E, H)

Temporal Road Network Cost Modeling

A road network is modeled as a directed graph.

Temporal tags: during the same temporal tag, the traffic has similar

behavior, e.g., PEAK and OFFPEAK.

Each edge has 2 cost values, which indicate the cost per unit

length for traversing the corresponding road segment during PEAK

and OFFPEAK, respectively.

d(AB, PEAK) indicates the cost per unit length for traversing the road

segment AB during PEAK tag.

d: a vector of size |E| * |TAGS| (e.g., 6 * 2 =12)

A

B C

D

d(AB, PEAK)

d(AB, OFFPEAK)

d(BC, PEAK)

d(BC, OFFPEAK)

d(BA, PEAK)

d(BA, OFFPEAK)

d(CB, PEAK)

d(CB, OFFPEAK) d(BD, PEAK)

d(BD, OFFPEAK)

E d(CE, PEAK)

d(CE, OFFPEAK)

Trip Cost Modeling

• GPS trajectory

A sequence of records of the form (x,y,t).

• Trip

A sequence of records of the form (edge, ts, te).

• The cost of a trip is the sum of the cost of each record.

E.g., t1= <AB, 7:55, 8:00>, <BC, 8:00, 8:10>

cost_model(t1) = d(AB, PEAK) * length(AB) + d(BC, OFFPEAK) *

length(BC)

• If we know all cost variables in d, we can compute the

estimated cost of any trip.

• For every trip t in T’

The square of difference between the actual cost of t and the

estimated cost using the proposed model, i.e., cost_model(t).

E.g., t1= <AB, 7:55, 8:00>, <BC, 8:00, 8:10>, actual_cost(t1) =

100.

RSS(t1) = (100 – cost_model(t1))2

Problem: Training data T’ may not cover the whole road network.

By minimizing the RSS function, we can only get cost variables for

the road segments traversed by T’.

d(BC, OFFPEAK)

Objective Functions: Residual Sum of Squares

A

B C

D

d(AB, OFFPEAK)

d(BC, PEAK)

d(BA, PEAK)

d(BA, OFFPEAK)

d(CB, PEAK)


d(BD, OFFPEAK)

E d(CE, PEAK)

d(CE, OFFPEAK)

d(AB, PEAK)

Objective Function: Topological Constraint

• If two road segments locate in similar topology in a road

network, they tend to have similar cost during the same

tag.

• Given a pair of road segments ei and ej

The weighted square loss between the cost variables of ei and ej.

weight_TC(ei, ej) * (d(ei, PEAK) – d(ej, PEAK))2

If weight_TC (ei, ej) is big, d(ei, tag) and d(ej, tag) tend to be

similar in order to make (d(ei, PEAK) – d(ej, PEAK))2

small.

A

B C

D

d(AB, PEAK)

d(AB, OFFPEAK)

d(BC, PEAK)

d(BC, OFFPEAK)

d(BA, PEAK)

d(BA, OFFPEAK)

d(CB, PEAK)


d(BD, OFFPEAK)

E d(CE, OFFPEAK)

d(CE, PEAK)

Weighted PageRank Computation

• PageRank gives a value to each vertex in a graph.

• Primal graph -> dual graph

Vertex -> Edge

Edge -> Vertex

A

B C

D

E

BA CB

BD

AB BC

CE

Weighted PageRank Computation

• Original PageRank

A vertex evenly gives its PR value to all its adjacent vertices.

Vertex AB gives ½ * PR(AB) to BC and BD, respectively.

• Weighted PageRank

Weight(AB, BD) = 𝑇𝑟𝑖𝑝𝑁𝑢𝑚 𝐴𝐵,𝐵𝐷 +1

𝑇𝑟𝑖𝑝𝑁𝑢𝑚 𝐴𝐵,𝐵𝐷 +𝑇𝑟𝑖𝑝𝑁𝑢𝑚 𝐴𝐵,𝐵𝐶 +|𝑑𝑒𝑔𝑟𝑒𝑒(𝐴𝐵)|

100 trips from AB to BC, 50 trips from AB to BD

weight(AB, BD) = 50+1

50+100+2 =

51

152

weight(AB, BC) = 100+1

50+100+2 =

101

152

No trips information

weight(AB, BD) = weight(AB, BC) = 1

2

BA CB

BD

AB BC

CE

Weighted PageRank based Topological Constraint

• Use weighted PageRank value to quantify the topological

similarity.

If two edges have similar weighted PageRank values, they also

tend to have similar cost.

weight_TC(ei, ej) = m𝑖𝑛(𝑃𝑎𝑔𝑒𝑅𝑎𝑛𝑘 𝑒𝑖 ,𝑃𝑎𝑔𝑒𝑅𝑎𝑛𝑘(𝑒𝑗))

max(𝑃𝑎𝑔𝑒𝑅𝑎𝑛𝑘 𝑒𝑖 ,𝑃𝑎𝑔𝑒𝑅𝑎𝑛𝑘(𝑒𝑗))

Objective Function: Adjacency Constraint

If two road segments are directionally adjacent in a road network,

they tend to have similar cost during the same tag.

d(AB, PEAK) may be similar to d(BC, PEAK)

But not necessary to be similar to d(CB, PEAK) or d(BC, OFFPEAK)

Given a pair of directionally adjacent road segments ei and ej

The weighted square loss between the cost variables of ei and ej.

weight (ei, ej) * (d(ei, PEAK) – d(ej, PEAK))2

A

B C

D

d(AB, PEAK)

d(AB, OFFPEAK)

d(BC, PEAK)

d(BC, OFFPEAK)

d(BA, PEAK)

d(BA, OFFPEAK)

d(CB, PEAK)

d(CB, OFFPEAK) d(BD, OFFPEAK)

E d(CE, PEAK)

d(CE, OFFPEAK)

d(BD, PEAK)

Objective Functions

• Residual Sum of Squares

RSS(d) = (actual_cost(t) – cost_model(t))2t∈T′

• Topological Constraint

PRTC(d) = 𝑤𝑒𝑖𝑔𝑕𝑡_𝑇𝐶 𝑒𝑖, 𝑒𝑗 ∗ (𝑑 𝑒𝑖, 𝑡𝑎𝑔 − 𝑑(𝑒𝑗, 𝑡𝑎𝑔))2𝑖,𝑗

• Adjacency Constraint

ACTC(d) = 𝑤𝑒𝑖𝑔𝑕𝑡 𝑒𝑖, 𝑒𝑗 ∗ (𝑑 𝑒𝑖, 𝑡𝑎𝑔 − 𝑑(𝑒𝑗, 𝑡𝑎𝑔))2𝑖,𝑗

• Overall objective function

The sum of the above three (plus a regularizer term)

• Differentiating the objective function w.r.t. vector d and

setting it to 0 yields an equation that can be solved using

standard techniques to obtain a solution for d.

Experimental Results

• Two weeks GPS trajectories from SPF

• 18K trips in total, and 20% as training, 80% as testing

• Two road networks: Skagen and North Jutland

• Objective functions used

F1: RSS; F2: RSS + PRTC; F3: RSS + ACTC

F4: RSS + PRTC + ACTC

Travel Cost Inference from Sparse,

Spatio-Temporally Correlated Time

Series Using Markov Models

Outline

• Motivation

• Travel-cost time series

• Modeling travel-cost time series in a road network

• Learning a spatio-temporal hidden Markov model

• Empirical study

• Summary

• Vehicle routing plays an important role in transportation.

E.g., fastest routes, eco-routes.

• Effective vehicle routing relies on a weighted graph

representation of a road network.

The weight of an edge should accurately capture the travel cost of

traversing the edge.

E.g., travel times, greenhouse gas (GHG) emissions.

• Edge weights should be dynamically changing due to the

dynamics of traffic.

To support dynamic edge weights, we propose

techniques that are able to infer near-future travel costs

based on incoming GPS data.

Motivation

Travel-Cost Time Series

• The travel costs on edge ei are modeled as a

time series TSi = <Ci(1), Ci

(2), …, Ci(T)>.

• Travel cost set Ci(j) contains travel costs

observed in the j-th time interval on edge ei.

An interval is the finest time interval of interest.

E.g., 15 minutes, yielding 96 intervals per day.

A travel cost is the travel time or GHG emissions of a

traversal of edge ei during the j-th interval.

E.g., during the j-th interval [8:00, 8:15), if 3 vehicles

traversed ei taking 30, 32, 40 s, Ci(j) = {30, 32, 40}.

e1

e2

e3

TS1

TS2

TS3

Time Series Characteristics

• Dependent

Travel costs on adjacent edges may be dependent.

• Heterogeneous

The traffic on different edges is often different.

Some edges have clear peak/off-peak periods while others do not.

• Sparse

Some intervals do not have any travel cost observations.

Modeling Travel-Cost Time Series

• We use a Spatio-Temporal Hidden Markov Model

(STHMM) to model the interactions among multiple travel-

cost time series in a road network.

State transition

on edge e1

s1t-1=

cong

s1t=

cong s1

t+1

= clr

C1t-1 C1

t C1t+1 Time Series:

TS1

TP TP

OP OP OP

An edge ei has a set of hidden states Si, where each

state indicates a particular traffic status on the edge.

S1={cong, clr}

During an interval, the traffic of an edge is in a state.

Intervals: t-1 (previous) t (current) t+1 (next)

The traffic on an edge evolves over time, transiting

from one state to another according to transition

probabilities (TP).

In an interval, the travel costs observed on an edge

is dependent on the state of the edge according to

output probabilities (OP).

Modeling Multiple Time Series using STHMM

e1

e2

e3

e1

e2

e3

s1t-1=

cong

s1t =

cong

s1t+1=

clr

s2t-1=

clr

s2t =

clr

s2t+1=

cong

s3t-1=

cong

s3t =

clr

s3t+1=

clr

Intervals: t-1 (previous) t (current) t+1 (next)

TP: P(S1t|S1

t-1, S2t-1)

TP: P(S2t|S1

t-1, S2t-1, S3

t-1)

TP: P(S3t|S2

t-1, S3t-1)

Capture the traffic dependency among edges.

Travel-Cost Time

Series Derived from

Historical GPS Data

Framework Overview

State Formulation &

Parameter Learning

Spatio-Temporal

Hidden Markov Model

Routing Algorithms

Near Future Edge

Weights

Real-Time

GPS Data

Eco-routes &

Fastest-routes

State Formulation – Intuitions

• Since edges are heterogeneous, we need to identify a

unique state set for each edge.

Edges with heavy and complicated traffic: more states

Edges with light and simple traffic: fewer states

• Target properties of the state set of an edge

The traffic in the same state behaves similarly.

The current state depends on its previous state.

State Formulation – Methods

• For each edge, identify a Gaussian Mixture Model (GMM)

with K Gaussian components to capture the distribution of

all the travel costs on the edge.

Due to heterogeneity, different edges may have different Ks.

• Transform the travel costs in each interval in the edge’s

time series into a K+1 dimensional point.

The first K dimensions correspond to the travel cost distribution in

the interval, still using the K Gaussian components.

The K+1st dimension captures the difference between the travel

cost distributions in the interval and its previous interval.

• Cluster these K+1 dimensional points into several clusters,

and each cluster is a state.

Parameter Learning

• Output probabilities

Since each state is a cluster, we use the cluster center to derive a

GMM as the output probability for the state.

• Transition probabilities

For each edge, we need the edge’s travel-cost time series, along

with its adjacent edges’ travel-cost time series.

Based on the obtained output probabilities, transition probabilities

are then estimated using maximum likelihood estimation.

Travel Cost Inference based on an STHMM

e1

e2

e3

Est.

s1t

Est.

s2t

Est.

s2t+1

Est.

s3t

Intervals: t (current) t+1 (next)

Real Time

GPS data

On e3

Real Time

GPS data

On e2

Real Time

GPS data

On e1

OP +

Bayes

OP +

Bayes

OP +

Bayes

OP

Est.

Travel

Costs

TP

TP

TP

Empirical Settings

• Data

181 million GPS records from North Jutland, Denmark from April

2007 to March 2008.

The data is split into training and testing sets.

• Periods of interest

Weekdays during [6:00, 20:00).

• Road network

1,916 edges in North Jutland with more than 500 records.

• Travel costs

Travel time and GHG emissions.

• Accuracy measurements

Average sum of squared loss (ASSL) between estimated costs and

real costs.

Experimental Study

• Aspects studied

Effect of the length of interval.

Effect of the sizes of training data

Effect of recent data

Effect of seasonality

• Results

Efficient; able to support real-time weight updates

Accurate; outperforms a baseline and the state-of-the-art

Summary

• Described an STHMM that is able to model multiple

correlated, sparse time series.

Dependent: STHMM couples the states from adjacent edges.

Heterogeneous: State formulation identifies a unique state set for

each edge.

Sparse: Smoothing techniques are used to contend with sparse

data.

• The described framework is able to support real-time

weight updates and thus provide more accurate routing

services, e.g., eco-routing.

Stochastic Skyline Route Planning

Under Time-Varying Uncertainty

Outline

• Motivation and challenges

• Road network modeling

• Stochastic skyline route planning

• Empirical studies

• Summary

• Reduction in greenhouse gas (GHG) emissions is

important to the whole society.




• Eco-routing is a simple yet effective way to achieve the

reduction from transportation section.

Of interest to both fleet owners and individual drivers.

For example, FlexDanmark is interested in using most eco-friendly

routes while considering travel times and distances.

• We aim to provide useful eco-routing services.

Motivation

Challenges

• To provide practically useful eco-routing, we need to

contend with three challenges.

• Multiple travel costs

Consider not only GHG emissions, but also distances and travel

times.

• Time dependence

Some travel costs are time-dependent.

For example, travel times on the same road may vary during peak

vs. off-peak periods.

• Uncertainty

Some travel costs are uncertain.

For example, aggressive and moderate driving on the same road

may produce different GHG emissions.

A novel road network model - MTUG

• MTUG is a Multi-cost, Time-dependent, Uncertain Graph.

• Assume N different costs are of interest in total.

Distance (DI), travel time (TT), GHG emissions (GE).

• G=(V, E, MM, W)

V is the vertex set, and E is the edge set.

MM = <MM(1), … , MM(N)>

Function MM(i) maps an edge to the minimum and maximum i-th

cost of using the edge.

MM(TT) (ea) = (150 seconds, 500 seconds)

MM(GE) (eb) = (10 ml, 85 ml)

W = <W(1), … , W(N)>

Function W(i) maps an edge to a set of (interval, random variable)

pairs of the i-th cost type.

W(TT) (ea) = {([0:00, 7:15), N(300, 120)), ([7:15, 8:45), N(450, 100)), …}

W(GE) (eb) = {([0:00, 7:00), N(30, 100)), ([7:00, 9:00), N(50, 80)), …}

Instantiation of MM in an MTUG

• MM and W are instantiated using GPS records.

• GPS records are map matched to edges.

• Each edge is associated with a set of edge records in the

form of (e, t, C).

An edge record indicates that a traversal on edge e at time t takes

costs C, where C is a vector of all costs of interest.

(e1, 8:08, <55 seconds, 80 ml>)

(e1, 9:18, < 45 seconds , 63 ml>)

(e1, 10:10, < 43 seconds , 60 ml>)

(e1, 21:03, < 45 seconds , 62 ml>)

• Based on the edge records on an edge, functions MM on

the edge can be instantiated.

MM(TT) (e1) = (43 seconds, 55 seconds)

MM(GE) (e1) = (60 ml, 80 ml)

Instantiation of W in an MTUG

• Partition a day into 96 15-min intervals.

• For each (edge, interval) pair, we obtain a multi-set containing

the costs on the edge during the interval.

ms={(10 s, 3), (20 s, 10), (25 s, 20), (30 s, 10), (40 s, 5), …}

• Estimate a random variable (RV) based on the multi-set.

Use a Gaussian Mixture Model (GMM) to represent an RV.

GMMs can approximate arbitrary distributions.

A GMM is a weighted sum of K Gaussian distributions.

GMM(x) = 𝑚𝑘 ∙ 𝑁(𝑥|µ𝑘 , δ𝑘2)𝐾

𝑘=1

Instantiation of W in an MTUG (cont.)

• If two RVs in two adjacent intervals are similar, we combine

the two intervals into a long interval.

Use KL-divergence to measure the similarity between two RVs.

• Re-estimate a new RV for the long interval using the costs in

the long interval.

• The whole procedure works iteratively until no RVs from

consecutive intervals are similar enough to be combined.

• The long intervals along with their RVs instantiate W.

Route Costs in MTUG

• Given a route Ri=<r1, r2, …, rX>, where ri ∈ E is an edge.

• RC(Ri, t) indicates the costs of using route Ri at time t

RC(Ri, t) = <RVDI, RVTT, RVGE> is a vector of RVs, and each RV

corresponds to a travel cost.

• RVDI is a deterministic value, which equals to the sum of

the length of each edge in route Ri.

• RVTT is the convolution of the corresponding travel time

RV of each edge in route Ri.

Deciding the travel time RV of the first edge r1 is dependent on the

trip start time t.

Deciding the travel time RV of the k-th edge rk is dependent on the

travel time of the previous k-1 edges, which may be uncertain.

• RVGE is the convolution of the corresponding GHG

emission RV of each edge in route Ri.

Stochastic Dominance

• Stochastic Dominance between two RVs

Given two RVs X and Y, if cdfX(a) >= cdfY(a), for all possible value

a in R+, we say ―X stochastically dominates Y‖.

RC(R1, t).RVTT stochastically dominates RC(R2, t). RVTT.

No stochastic dominance between RC(R3, t). RVTT and RC(R4, t).

RVTT.

Stochastic Skyline Routes

• Route Ri dominates route Rj iff

No RV of cost(Rj) stochastically dominates the corresponding RV

of RC(Ri).

At least one RV of RC(Ri, t) stochastically dominates the

corresponding RV of RC(Rj, t).

• Stochastic skyline route

Given a source-destination pair, and a trip starting time.

Stochastic skyline routes are the routes that are not dominated by

any other routes.

Trajectories

Framework Overview

Pre-Processing

Road Network

MTUG

Source,

Destination,

Time

Routing M

odule

Sto

cha

stic S

kylin

e R

ou

tes

Cost Records

Offline Phase Online Phase

Instantiating MM and W

Stochastic Skyline Route Planning

• A brute force method

Enumerate all possible routes, compute the route costs, and check

whether one route dominates another

Very inefficient, and only works for small road networks

• An efficient method

Prune some routes that cannot become skyline routes early.

Efficient stochastic dominance checking

Early Pruning Strategy

• Do the following for all travel cost types of interest.

We use travel time as an example.

We maintain a graph where each edge is associated with the

minimum travel time, which is recorded in MM.

From the destination, run Dijkstra’s algorithm on the graph.

As each vertex is associated with the minimum travel time, we get the

―fastest‖ possible route from the source to the destination.

Compute the route cost of the ―fastest‖ possible route and add the

route as a candidate Skyline route.

vs vd

Each vertex vx has the

shortest distance

least travel time

least GHG emissions

to the destination vd

v1

v2

v3

Candidate skyline route 1



Early Pruning Strategy (cont.)

• Explore routes from source, until no more routes can be

explored.

Estimate the least possible travel costs for a partially explored

route R’.

If the partially explored route with its estimated least possible costs

is dominated by an existing candidate skyline route, there is no

need to explore the route any further.

Otherwise, continue exploring.

Update candidate skyline route if necessary.

vs vd

v1

v2

v3



Candidate skyline route 3 R’


Stochastic Dominance Checking

• Naïve way: check according to the definition of stochastic

dominance.

For each value a, check whether cdfX(a) >= cdfY(a).

• An efficient way

Consider one cost type at a time.

Compute the minimum and maximum possible travel costs of a

route.

• Distinguish among three cases based on the min and max

travel costs of two routes.

Disjoint case: dominance.

Stochastic Dominance Checking (cont.)

• Covered case: non-dominance.

• Overlapping case (needs further checking)

Both dominance and none-dominance may occur.

Empirical Settings

• GPS Data

More than 180 million GPS records collected from Denmark in

2007 and 2008.

1 Hz sampling rate: one GPS record per second.

• Road networks (from OpenStreetMap)

Aalborg (AA): 4,981 vertices and 12,614 edges.

North Jutland (NJ): 68,318 vertices and 163,546 edges.

Jutland (JU): 667,876 vertices and 1,622,974 edges.

AA is the largest city in NJ, and NJ is part of JU.

• Travel costs

Distances (DI): obtained from OpenStreetMap.

Travel time (TT): computed from the timestamps in the GPS

records.

GHG emissions (GE): computed from the instantaneous speeds

and accelerations from the GPS records.

MTUG Generation

• Hot edges: edges covered by the GPS data set.

• Instantiate MM and W based on the GPS data.

• Cardinalities of TT and GE intervals.

Largest cardinalities: 65 for TT and 72 for GE.

Average cardinalities:

• Cold edges: edges not covered by the GPS data set.

Derive corresponding travel costs using speed limits.

The traffic in AA is much more

dynamic that that in other parts.

Stochastic Skyline Route Queries

• Query settings:

• 50 random source-destination pairs with a starting time in

each setting.

• Study the cardinality of the stochastic skyline routes and

the run time.

Cardinalities of Skyline Routes

• As the number of travel cost types considered increases,

the cardinality of the skyline routes also increases.

• When distance between source and destination increases,

the cardinality of the skyline routes also increases.

Run Time

• The run-time for a query is below 5 seconds, and most

queries are computed in less than 2 seconds.

• Advanced dominance checking method is effective and

efficient.

Benefits of using the advanced dominance checking is more

substantial on JU since long routes require more dominance

checks.

Summary

• Described a framework that enables stochastic skyline

route planning in road networks with multiple, time-

dependent, and uncertain travel costs.

• Enables eco-routing in a realistic setting.

Vehicle Routing with

User-Generated Trajectory Data

Motivating Example

Fastest route: Route #1

Shopping

Mall

Food

Store

50

70

S

E

B

F

G A

I

H

J

K D Route #1

Motivating Example

Shortest route: Route #2

Shopping

Mall

Food

Store

50

70

S

E

B

F

G A

I

H

J

K D

Route

#2

Motivating Example

Route #3 avoids parking lots

Shopping

Mall

Food

Store

50

70

S

E

B

F

G A

I

H

J

K D

Route

#3

Motivating Example

Route #4 is attractive when stores are closed

Shopping

Mall

Food

Store

50

70

S

E

B

F

G A

I

H

J

K D

Route #4

Motivating Example

Which route from source S to destination D is preferable?

Shopping

Mall

Food

Store

50

70

S

E

B

F

G A

I

H

J

K D Route #1

Route

#3

Route

#2

Route #4

The fastest – takes the least travel time?

The shortest – smallest distance?

The most ―direct‖– fewest turns?

The one most constrained to main roads –

better road quality?

The ―safest,‖ avoiding bad neighborhoods?

…

The one most preferred by locals?

Introduction

• Local travel

Knowledge of the surroundings

Follow familiar routes

• Travel in unfamiliar surroundings to unknown destinations

Depend on available routing services

Expect that the provided route is the best

• Idea: Use GPS data to let those who travel in unfamiliar

surroundings benefit from the insights of local travelers

Motivation

• Can’t we just use a standard routing service?

• How often do drivers follow routes recommended by

routing a service?

V. Čeikutė and C.S. Jensen. Routing service quality—local driver behavior versus routing services. MDM 2013, pp. 97–106

0

500

1000

1500

2000

2500

0.5-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-220

# r

oute

s

route length (km)

Routes

5000 routes

Motivation

• How often do drivers follow routes recommended by

routing a service?

• Breakdown by distance traveled 7

3%

56

%

44

%

36

%

19

%

9%

18

%

20

%

11%

0

20

40

60

80

100

0.5-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-220

on a

vg

ma

tch

(%

)

route length (km)

V. Čeikutė and C.S. Jensen. Routing service quality—local driver behavior versus routing services. MDM 2013, pp. 97–106

• ~60% of routes match the

route provided by the routing

service; ~40% do not!

• Use of previous user’s trips

can improve routing quality

Goal of the Study

• Propose a routing framework that

Utilizes GPS data volunteered by local drivers

Exploits possibly hard-to-formalize insight into local conditions

Takes into account temporal variation in driver behavior

Recommends routes based on popularity and temporal aspects

• Evaluate the quality of proposed routes

Study based on trip length and pre-selected drivers

Quality comparison with existing routing service and route

recommendation approaches

Outline

• Introduction and motivation

• Framework

• Data preparation methodology

• Scoring of routes

• Empirical study

Data foundation

Routing quality evaluation and comparison

• Related work

• Conclusion and research directions

Framework

offline

GPS logs map-

matching trajectory

DB

Routing

service

online

trajectory

data

user’s

input data

top route

user

GPS trace

• source

• destination

• time

Trips used:

• Start or end at, or go through,

source and destination locations

• Trips that start during the issued

time period are preferred

When the source and

destination are not covered

by available GPS data, an

existing routing service is

used.

Data Preparation Methodology

• A trip is a sequence of

timestamped GPS points

between stay points.

• Stay point

A location where the user stays

for a certain period time (no

GPS data is received)

A location reported repeatedly

for a certain period of time (the

engine remained on)

• In practice, trips are

consolidated

Short trips are dropped and

some consecutive trips are

merged

x

y

t

same

location


• Trips are map-matched and

transformed into sequences of

road segments with entry and

exit timestamps

• Example

Trip 𝑡𝑟1is map-matched to a

sequence of road segments:

(𝑠1, 𝑡𝑠1 , 𝑡𝑒1), (𝑠2, 𝑡𝑠2 , 𝑡𝑒2), (𝑠3, 𝑡𝑠3 , 𝑡𝑒3)

𝑡𝑟3

𝑡𝑟2

𝑡𝑟1

𝑠1

𝑠2

𝑠3


The similarity of two trips is

determined by applying

Longest Common

Subsequence (LCSS) to the

trajectories represented as

sequences of road segments.

s1

s2 s3 s4

s1

s2 s3

s4 s11 s9 s10

s10 s9

s5

s6 s7

s8

Similar trips belong to the same route.

𝑆1 𝑝𝑙𝑠, 𝑝𝑙𝑠′ =𝑙𝑒𝑛𝑔𝑡ℎ(𝐿𝐶𝑆𝑆(𝑝𝑙𝑠,𝑝𝑙𝑠′))

𝑙𝑒𝑛𝑔𝑡ℎ(𝑝𝑙𝑠)∙ 100%

𝐶1 𝑝𝑙𝑠, 𝑝𝑙𝑠′ = 𝑡𝑟𝑢𝑒𝑖𝑓𝑆1 𝑝𝑙𝑠, 𝑝𝑙𝑠′ ≥ 𝑔𝑒𝑡𝑀𝑎𝑡𝑐𝑕(𝑝𝑙𝑠, 𝑝𝑙𝑠′)𝑓𝑎𝑙𝑠𝑒𝑜𝑡𝑕𝑒𝑟𝑤𝑖𝑠𝑒

Similarity of two

trajectories

The similarity threshold

depends on the trip length.

Longer trips, higher

similarity thresholds,

interval [90-100].

Do trips follow the same route?


• Trips that follow the same sequence of road segments are

grouped into route usage objects.

user route # traversals

𝑢1 𝑟1: 𝑠𝐵𝐶 , 𝑠𝐶𝐷, 𝑠𝐷𝐹 , 𝑠𝐹𝐺 - 10off

𝑢2 𝑟2: 𝑠𝐶𝐷, 𝑠𝐷𝐹 , 𝑠𝐹𝐺 3peak 5off

𝑢3 𝑟3: 𝑠𝐸𝐷 , 𝑠𝐷𝐹 , 𝑠𝐹𝐼 , 𝑠𝐼𝐺 7peak 4off

𝑢4 𝑟2: 𝑠𝐶𝐷, 𝑠𝐷𝐹 , 𝑠𝐹𝐺 5peak 6off

A B

𝑟1

𝑟2

𝑟3 E

D F

I

G

C

• Route 𝑟2 is taken by users 𝑢2 and 𝑢4 3 and 5 times during

peak hours and 5 and 6 times during off-peak hours.

(𝑟2, 𝑝𝑙, 𝑝𝑒𝑎𝑘, 𝑢2, 3 , 𝑢4, 5 , 𝑜𝑓𝑓, 𝑢2, 5 , 𝑢4, 6 )

Scoring of Routes

Preferred routes

• Popular among drivers

• Taken by many distinct drivers

• Popular on the time of the day

and day of the week of the query

peak hours

off-peak hours

destination

source

B

A

Final route score:

𝑠𝑐𝑜𝑟𝑒 𝑟 = 𝛽 ∙ 𝑝𝑟𝑒𝑓𝑀 𝑟 + (1 − 𝛽) ∙ 𝑝𝑟𝑒𝑓𝑁(𝑟)

Scoring of Routes

Route preference value:

𝑝𝑟𝑒𝑓 𝑟 = 𝛼 ∙ 𝑢𝑠𝑒𝑟𝑠 𝑟 + (1 − 𝛼) ∙ 𝑡𝑟𝑎𝑣𝑒𝑟𝑠𝑎𝑙𝑠(𝑟)

# distinct drivers

taking the route

# of traversals

of the route

Considers trips taken during

the query temporal pattern

Considers trips taken during

other temporal patterns

Empirical Study: Data

• Monitoring period: 2 years

• Number of drivers: 285

• Number of GPS points (raw data): ~182,700,000

• Number of trips: ~275,000

• ―Pay as You Speed‖ project (http://www.sparpaafarten.dk)

0

20000

40000

60000

80000

100000

120000

140000

2-10 10-20 20-30 30-40 40-50 50-350

# tra

ve

rsals

traversal length (km)

Afternoon peak

Morning peak

Off-peak

Routing Quality Evaluation: Data

For this study, we randomly selected equal amounts of trips

for different trip length intervals

0

100

200

300

400

500

600

700

2-10 10-20 20-350

# tra

ve

rsals

traversal length (km)

Off-peak

Morning peak

Afternoon peak

Routing Quality: Match

A match is identified using LCSS

Trajectory

DBtest

Routing

Service

source and

destination

Compare

route with

trajectory

preferred

route

trajectory

Total

number of

matches match! C

A

B

Trajectory

DB

Routing Quality Evaluation: Results

Our Proposal Google Directions API (top-1)

TPMFP (SIGMOD’13)

0

20

40

60

80

100

2-10 10-20 20-350

%

length (km)

Unmatched

Matched

0

20

40

60

80

100

2-10 10-20 20-350

%

length (km)

Unmatched

Matched

0

20

40

60

80

100

2-10 10-20 20-350

%

length (km)

EmptyUnmatchedMatched

Routing Quality Evaluation: Data

For this study, we considered the five drivers with the most

trips.

0

500

1000

1500

2000

2500

3000

3500

1 2 3 4 5

# tra

ve

rsals

driver

2-10 km

10-30 km

30-300 km

0

1000

2000

3000

4000

5000

1 2 3 4 5# tra

ve

rsals

driver

Off-peak

Afternoon peak

Morning peak

Traversals per Length Traversals per Temporal Pattern

Routing Quality Evaluation: Results

Our proposal

0%

20%

40%

60%

80%

100%

1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

source and dest. source destination subroutes

Different Types of Route Calculation

Matched Unmatched Empty

0

20

40

60

80

100

1 2 3 4 5

%

driver

unmatched matched

Google Directions API (top-1)

Related Work

[1] K.-P. Chang, L.-Y. Wei, M.-Y. Yeh, and W.-C. Peng. Discovering personalized routes from trajectories. LBSN 2011, pp. 33–40

[2] Z. Chen, H. T. Shen, and X. Zhou. Discovering popular routes from trajectories. ICDE 2011, pp. 900–911

[3] W. Lou. H. Tan, L. Chen, and L.M. Ni. Finding time period-based most frequent path in big trajectory data. In SIGMOD 2013,

pp. 713–724

[4] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang. T-Drive: Driving directions based on taxi trajectories. In GIS

2010, pp. 99–108

• Four existing routing techniques use drivers’ trajectories

[1],[2],[3],[4]

The road network is formed from the road segments that are

covered by the trajectory data set

A route is formed by

Prioritizing parts of roads that are followed the most by a specific

driver (personalized routes) [1]

Prioritizing parts of roads that are taken by other drivers [2],[4]

Possibly using sub-routes from multiple routes [3]

Trajectories used for scoring must contain the destination and

must start and end during the provided time interval. [3]

Suggested routes are formed from the most popular routes or

route parts in the available data set.

Conclusion and Research Directions

Conclusions

• The proposed framework utilizes trajectory data collected from local

drivers for routing.

• A preferred route is selected using a flexible scoring function that

considers

The number of traversals of the route

The number of distinct drivers taking the route

The time periods when the traversals occurred

• Use of travel histories of local drivers can increase routing quality

• More details in the paper!

Research directions

• Additional aspects of the framework can be considered

Efficiency of route identification process (LCSS technique)

Inclusion of personalized routes

Better support for routes that are constructed from sub-routes

Closing

Demos and Prototype Systems

• EcoTour: http://daisy.aau.dk/its/

Computes and compares the shortest, the fastest, and the most

eco-friendly routes for arbitrary source-destination pairs in DK.

Best demo award at IEEE MDM 2013.

• EcoSky: http://daisy.aau.dk/its/eco/

Supports skyline eco-routing and personalized eco-routing

• Sheafs: http://daisy.aau.dk/its/sheaf

Trajectory based traffic sheafs

• Strict-Path Queries: http://daisy.aau.dk/its/spqdemo

Trajectory based, Strict-Path Queries

Trips (historical travel-time), route choice, Napoleon (road usage)

• iPark: identifying parking spaces from GPS trajectories

On-street parking lanes vs. parking zones

http://daisy.aau.dk/its/



http://daisy.aau.dk/its/eco/



http://daisy.aau.dk/its/sheaf

http://daisy.aau.dk/its/sheaf

http://daisy.aau.dk/its/spqdemo

http://daisy.aau.dk/its/spqdemo

Napoleon’s Russian Campaign

The Future

• Much more data

Inductive loop detectors

Bus data

Rejsekortet

• Much more connected vehicles

• New services

Routing

Safety and warnings

Parking, fees, insurance, road pricing

Car sharing, multi-modality

• Driver-less vehicles

Challenges

• Detect ―black spots‖ before they occur.

• Modeling spatio-temporal congestion from data

• Characterize the effects of events

Accidents, malfunctioning of traffic signals, rain, a concert

• Real-time traffic management

In response to current or predicted situation, actuate traffic signals

and drivers (via their smartphones or navigation devices) to

optimize the use of the infrastructure and driver experience

• Automated trade-off between weight level of detail and

available data.

• Stochastic routing at 20 milliseconds.

• Integrate with ―point‖ data.

Acknowledgments

• Colleagues at Aalborg University, Aarhus University, and

beyond.

• The EU FP7 project, Reduction: http://www.reduction-

project.eu/

• The Obel Family Foundation: http://www.obel.com/en

http://www.reduction-project.eu/




http://www.obel.com/en

Readings

• V. Ceikute, C. S. Jensen: Vehicle Routing with User-Generated Trajectory Data. MDM (1) 2015

• C. Guo, B. Yang, O. Andersen, C. S. Jensen, K. Torp: EcoMark 2.0: empowering eco-routing with vehicular environmental models and actual vehicle fuel consumption data. GeoInformatica 19(3):567-599 (2015)

• C. Guo, Bin Y., O. Andersen, C. S. Jensen, K. Torp: EcoSky: Reducing vehicular environmental impact through eco-routing. ICDE 2015:1412-1415

• B. Yang, C. Guo, Y. Ma, C. S. Jensen: Toward personalized, context-aware routing. VLDB J. 24(2):297-318 (2015)

• Y. Ma, B. Yang, C. S. Jensen: Enabling Time-Dependent Uncertain Eco-Weights For Road Networks. GeoRich@SIGMOD 2014:1:1-1:6

• B. Yang, C. Guo, C. S. Jensen, M. Kaul, S. Shang: Stochastic skyline route planning under time-varying uncertainty. ICDE 2014:136-147

• B.- Yang, M. Kaul, C. S. Jensen: Using Incomplete Information for Complete Weight Annotation of Road Networks. IEEE TKDE 26(5):1267-1279 (2014)

• C. Guo, C. S. Jensen, B. Yang: Towards Total Traffic Awareness. SIGMOD Record 43(3):18-23 (2014)

Readings

• V. Ceikute, C. S. Jensen: Routing Service Quality - Local Driver Behavior Versus Routing Services. MDM (1) 2013: 97-106

• M. Kaul, B. Yang, C. S. Jensen: Building Accurate 3D Spatial Networks to Enable Next Generation Intelligent Transportation Systems. MDM 2013: 137-146, best paper award.

• Ove Andersen, Christian S. Jensen, Kristian Torp, Bin Yang: EcoTour: Reducing the Environmental Footprint of Vehicles Using Eco-routes. MDM 2013: 338-340, Demo paper, best demo award.

• B. Yang, C. Guo, C. S. Jensen: Travel Cost Inference from Sparse, Spatio-Temporally Correlated Time Series Using Markov Models. PVLDB 6(9): 769-780 (2013)

• C. Guo, Y. Ma, B. Yang, C. S. Jensen, Manohar Kaul: EcoMark: evaluating models of vehicular environmental impact. SIGSPATIAL/GIS 2012: 269-278

• L. Speicys, C. S. Jensen: Enabling Location-based Services - Multi-Graph Representation of Transportation Networks. GeoInformatica 12(2): 219-253 (2008)

• A. Brilingaite, C. S. Jensen: Enabling Routes of Road Network Constrained Movements as Mobile Service Context. GeoInformatica 11(1): 55-102 (2007)

• C. Hage, C. S. Jensen, T. B. Pedersen, L. Speicys, I.Timko: Integrated Data Management for Mobile Services in the Real World. VLDB 2003: 1019-1030

Christian jensen advanced routing in spatial networks using big data

Data & Analytics

Transcript of Christian jensen advanced routing in spatial networks using big data