Christian jensen advanced routing in spatial networks using big data
-
Upload
jins0618 -
Category
Data & Analytics
-
view
280 -
download
0
Transcript of Christian jensen advanced routing in spatial networks using big data
Roadmap
• Big data, motivation, setting, overview
• Modeling and computation of weights
Spatial network lifting
EcoMark
Time-varying, uncertain weights
Complete weight annotation using incomplete information
Near-future travel cost inference
• Routing
Stochastic skyline routing
Routing based on local-driver behavior
• Closing
Demos, the future, challenges, acknowledgments, readings
Roadmap
• Big data
Hype or substance?
Instrumentation of reality and digitization
The digital universe
Moore’s Law generalized
Big data challenges
Hype or Substance?
• We have been pushing the boundaries for decades
How much data we can handle
How fast
Data integration
• Examples
VLDB: International Conference on Very Large Database
TODS: ACM Transactions on Database Systems
• So is it all hype?
No
Instrumentation and Digitization
• Instrumentation of reality
Notably, smartphones
• Digitization of processes
E.g., e-commerce, public services, communications, social
interactions
Big Data
Every day, we create 2.5 quintillion bytes of data — so much
that 90% of the data in the world today has been created in
the last two years alone. This data comes from everywhere:
sensors used to gather climate information, posts to social
media sites, digital pictures and videos, purchase transaction
records, and cell phone GPS signals to name a few. This
data is big data. http://www-01.ibm.com/software/data/bigdata/
The Digital Universe
• The digital universe
Doubling every ~18-24 months
Grew 60% in 2009, 50% in 2010
2009: 0.8 zettabyte, 2010: 1.2 zettabyte, 2020: 35 zettabytes
2009-20: growth by a factor of 44
http://www.emc.com/collateral/demos/microsites/idc-digital-universe/iview.htm
1 zettabyte = 1024 exabytes
1 exabyte = 1024 petabytes =
260 = 1,152,921,504,606,846,976 ≈ 1018 bytes
Moore’s Law – The Bicycle Analogy
• Moore’s Law: computers double in speed every 24
months.
Applies also to quality-adjusted microprocessor prices, memory
capacity, disks, networks, sensors, and the number and size of
pixels in cameras.
• How fast would a bicyclist be if Moore’s Law applied?
50 years of doubling every 24 months
30 km/h originally
~1 billion km/h now
• Three lessons
Growth rates in computing are dramatic and difficult to imagine.
Hardware advances are important information technology drivers.
Humans don’t really improve – they are the constants.
Big Data – Synthesis
• The result is new opportunity.
• Lots of data and unprecedented computing infrastructure
combine to offer potentials for value creation from data.
• To be competitive, society and businesses must be able to
create value from data.
• Data-based decisions and data-driven processes
Decisions based on good data beat decisions based on feelings or
opinions.
• A finer granularity of services
• Entirely new services
Motivation – ITS
• A safer, greener, and more efficient and cost-effective
transportation infrastructure
• Greenhouse gas emissions reductions via eco-routing
• Congestion, greater Copenhagen region
~10 billion DKK/year (2004)
• Bad setting of signalized intersections in Denmark
~9,3 billion DKK/year (2012)
• The transportation sector is the second largest
greenhouse gas (GHG) emitting sector and also causes
substantial pollution.
By 2025, it’s predicted that 6.2 billion commutes will be made
every day, worldwide.
• The reduction of greenhouse gas (GHG) emissions from
transportation is essential to combat global climate
change.
EU: reduce GHG emissions by 30% by 2020.
G8: a 50% GHG reduction by 2050.
China: a 17% GHG reduction by 2015.
• Eco-routing can reduce vehicular impact by up to 20%.
• General context: Smart City
Motivation – Eco-Routing
System of Sensors Model
• The setting may be modeled as a system of (logical)
streams, one per edge.
Data is emitted from the stream of an edge when a vehicle
traverses the edge
Spatial
Spatio-temporally correlated
Sparse
• Real, unlike early, envisioned smart dust applications!
Methodology
• The same methodology underlies the studies covered.
• Define precisely a problem of (perceived) real-world
interest.
• Develop solutions
Concepts, data structures, algorithms
• Carry out mathematical analyses
Correctness, complexity, storage size
• Prototype the solutions and perform empirical studies
Often, real data is needed
Offers detailed insight in the design properties of the solutions
If you cannot measure it, you cannot improve it
• Iterate!
Data Foundation
• A road network model
E.g., OpenStreetMap
Germany: 10 million edges (rough estimate)
• Aerial laser scan data
For lifting, optional
• Tracking data from vehicles
E.g., GPS data
• Fuel consumption data from vehicles
Often called CAN bus data
GPS Data – DW Statistics
• Number of data sources (and NDAs): ca. 20
Daily data from 4 sources; data in batches from 4 sources (2 small,
2 big); data from ca. 10 finished projects
• Daily data
4 sources
~3 million rows
3500 vehicles
• Number of fact table rows: More than 4 billion
• Number of vehicles: ca. 25,000
• Rows with fuel data: 500 million (estimate; ~140,000 per
day)
• Rows from EVs: 100-200 million
Software and Hardware
• Experimental infrastructure
• Software
Have complete software stack for handling traffic data
Map-matching, data cleansing, multiple map support
• Hardware
Very modern server farm
Newest machine has 2TB main memory
Roadmap to Achieving Eco-Routing
EcoMark 3D Road Networks
1. Understand how to quantify environmental impact
2. Assign Eco-Weights to a Road Network
3. Novel Routing Algorithms
Understand Vehicular Environmental Impact
• EcoMark and EcoMark 2.0
An evaluation framework that compares and analyzes 11 state-of-
the-art vehicular environmental impact models using GPS data
and a 3D road network.
What do the models need to quantify vehicular environmental
impact?
Speeds (instantaneous or average), and accelerations: from GPS data
Road slopes: from 3D road network
Model categorization
Instantaneous models: appropriate for eco-driving
Aggregated models: appropriate for eco-routing
It is possible to assign eco-weights to a large extent using GPS
data, 3D maps, and appropriate models.
An extended evaluation framework, EcoMark 2.0, is developed on
top of EcoMark using actual fuel consumption data that is obtained
from CAN bus data.
a1'a11'
a1
a11
a5'
a5
Elevation matters – 3D Road Network
• Efficient methods that can build a 3D road network from
3D laser scan data and a 2D road network.
• Use 3D laser scan data to formulate a terrain that is
represented as a Triangulated Irregular Network (TIN).
• Project 2D polylines to the TIN in order to obtain 3D
polylines, and thus result in a 3D road network.
Roadmap to Achieving Eco-Routing
EcoMark 3D Road Networks
Simple vs. Advanced
Eco-Weights
Static vs. Dynamic
Eco-Weights
1. Understand how to quantify environmental impact
2. Assign Eco-Weights to a Road Network
3. Novel Routing Algorithms
Eco-Weights – Simple vs. Advanced
• Simple: time dependent, deterministic.
Each interval has a deterministic weight.
Ex: (0:00, 7:00]: 10 mg; (7:00, 9:00]: 18 mg; (9:00, 15:00]: 12 mg; …
• Advanced: time dependent, uncertain.
Each interval has an uncertain weight, i.e., a random variable that
follows some distribution (e.g., a uniform/normal distribution)
Ex: (0:00, 7:00]: 8 to 12 mg, N(9, 10); (7:00, 9:00]: 13 to 23 mg,
N(18, 10); (9:00, 15:00]: 8 to 12 mg, N(10, 20); …
Representation
Coverage Temporal
Granularity
Accuracy
Simple
Weights
Intervals, each
with a single
value
High
(all edges)
Coarse
(e.g., pre-
defined PEAK)
Good (better than
speed limits)
Advanced
Weights
Intervals, each
with a histogram
or a probability
density function
Low
(only ―hot‖
edges)
Fine
(e.g., 15-min
intervals)
High level of
detail (capture
the travel cost
distributions)
Eco-Weights – Static vs. Dynamic
• Static Weights
Obtained based on historical GPS data and remain static
Capture typical traffic patterns
• Dynamic Weights
Updated as the real-time GPS data streams in
Capture dynamic and instant traffic patterns
• Our solutions
Static Dynamic
Simple Graph Weight
Annotation
Spatio-Temporal
Hidden Markov
Model
Advanced Time-Dependent
Histograms
Spatio-Temporal
Hidden Markov
Model
Assigning Eco-Weights
• Static, simple weights, all edges
Challenge: infer weights for ―cold‖ edges – the edges that are not
covered by any GPS data
Graph weight annotation
Quantify edge similarities based on traffic flows that are derived from the
topology of a road network
Propagate the weights on hot edges to similar cold edges
• Static, advanced weights, hot edges
Capturing the weights of hot edges using time dependent histograms
Challenge: compact representation of the histograms
• Dynamic, simple and advanced weights, hot edges
Inferring near-future weights of hot edges as GPS data streams in
Challenge: sparseness, dependency, and heterogenity
Time-Dependent Histograms
• We need to contend with three challenges.
• Multiple travel costs
Consider not only GHG emissions, but also distance, travel time.
• Time-dependence
Some travel costs are time-dependent.
E.g., travel times on the same road may vary during peak vs. off-
peak periods.
• Uncertainty
Some travel costs are uncertain.
E.g., aggressive and moderate driving on the same road may
produce different GHG emissions.
Road Network Model: MTUG
• MTUG: Multi-cost, Time-dependent, Uncertain Graph
• N different costs are of interest.
Distance (DI), travel time (TT), GHG emissions (GE).
• An MTUG is a graph G = (V, E, MM, W)
V is the vertex set, and E is the edge set.
MM = <MM(1), …, MM(N)>
Function MM(i) maps an edge to the minimum and maximum i-th
cost of using the edge.
MM(TT) (ei) = (150 seconds, 500 seconds)
MM(GE) (ei) = (10 ml, 85 ml)
W = <W(1), …, W(N)>
Function W(i) maps an edge to a set of (interval, random variable)
pairs of the i-th cost type.
W(TT) (ej) = {([0:00, 7:15), N(300, 120)), ([7:15, 8:45), N(450, 100)), …}
W(GE) (ej) = {([0:00, 7:00), N(30, 100)), ([7:00, 9:00), N(50, 80)), …}
Instantiation of MM in an MTUG
• GPS records are map matched to edges.
• Each edge is associated with a set of edge records of the
form (e, t, C)
An edge record indicates that a traversal on edge e at time t takes
costs C, where C is a vector of all costs of interest.
(e1, 8:08, <55 seconds, 80 ml>)
(e1, 9:08, < 45 seconds , 63 ml>)
(e1, 9:10, < 43 seconds , 60 ml>)
(e1, 10:03, < 45 seconds , 62 ml>)
• MM(TT) (e1) = (43 seconds, 55 seconds)
• MM(GE) (e1) = (60 ml, 80 ml)
Instantiation of W in an MTUG
• Partition a day into 96 15-min intervals.
• For each edge and interval, we obtain a multi-set containing
the costs on the edge during the interval.
ms = {(10 s, 3), (20 s, 10), (25 s, 20), (30 s, 10), (40 s, 5), …}
• Estimate a random variable (RV) based on the multi-set.
• Combine adjacent intervals with similar RVs and estimate a
new RV. The long intervals along with the RVs instantiate W.
Route Costs in MTUG
• Given a route Ri = <r1, r2, …, rX>, where ri ∈ E is an edge.
• RC(Ri, t) indicates the costs of using route Ri at time t
RC(Ri, t) = <RVDI, RVTT, RVGE> is a vector of RVs, where each RV
corresponds to a travel cost.
• RVDI is deterministic and equals the sum of the length of
each edge in route Ri.
• RVTT is the convolution of the travel time RV of each edge
in route Ri.
The travel time RV of the first edge r1 is dependent on the trip start
time t.
The travel time RV of the k-th edge rk is dependent on the travel
time of the previous k-1 edges, which are generally uncertain.
• RVGE is the convolution of the GHG emissions RV of each
edge in route Ri.
Roadmap to Achieving Eco-Routing
EcoMark 3D Road Networks
Simple vs. Advanced
Eco-Weights
Static vs. Dynamic
Eco-Weights
1. Understand how to quantify environmental impact
2. Assign Eco-Weights to a Road Network
3. Novel Routing Algorithms
Skyline
Eco-Routing
Continuous
Eco-Routing
Personalized
Eco-Routing
Novel Routing Algorithms
• Stochastic skyline route planning under time-varying
uncertainty.
Identify pareto-optimal routes considering multiple travel costs,
e.g., travel times, GHG emissions, distances.
R1 (15 min, 500 ml, 25 km)
R2 (13 min, 520 ml, 26 km)
R3 (16 min, 530 ml, 30 km)
• Continuous routing that supports real-time updates on
eco-weights (i.e., dynamic eco-weights).
Real-time re-navigate vehicles according to recently updated
eco-weights and the current locations of vehicles.
• Context-aware, personalized routing
Identify context-aware preferences for each driver (e.g., time-
efficient driving or fuel-efficient driving).
Choose a single route that best satisfies the driver’s
preferences.
Deterministic Skyline Routes
• Route cost: cost(Ri) = <DI, TT, GE>
A vector of deterministic values.
Each value corresponds to a travel cost.
• Dominance relationship
Ri dominates Rj iff all the costs of Ri are no greater than those of Rj,
and there is at lest one cost of Ri is smaller than that of Rj.
• Consider multiple routes for the same source-destination.
R1: 3.5 km, 230 mg, 10 min;
R2: 5.1 km, 250 mg, 11 min;
R3: 5.1 km, 200 mg, 12 min;
• The skyline routes are the non-dominated routes.
Since R2 is dominated by R1, R1 and R3 are the skyline routes.
Stochastic Dominance
• Route cost: RC(Ri) = <RVDI, RVTT, RVGE>
A vector of random variables (RVs), where each RV represents the
distribution of a travel cost.
• Stochastic Dominance between two RVs
Given two RVs X and Y, if cdfX(a) >= cdfY(a), for all possible value
a in R+, we say ―X stochastically dominates Y‖.
Cost(R1).RVTT stochastically dominates Cost(R2). RVTT.
No stochastic dominance between Cost(R3). RVTT and Cost(R4).
RVTT.
Stochastic Skyline Routes
• Dominance between two routes Ri and Rj
If each RV of cost(Ri) stochastically dominates the corresponding
RV of cost(Rj), then Ri dominates Rj.
• Stochastic Skyline Routes
Given a source-destination pair, and a trip starting time.
Stochastic skyline routes are the routes that are not dominated by
any other routes.
• Brute force stochastic skyline route planning
Enumerate all possible routes, compute the route costs, and check
whether one route dominates another.
Very inefficient and works only for small road networks.
• Efficient stochastic skyline route planning
Prune early some routes that cannot be skyline routes.
Efficient stochastic dominance checking.
Example Result
• Skyline routes R1, R2, and R3, identified by our algorithm
• R1: 94,849 m; R2: 106,216 m; R3: 91,382 m;
DI: R3 dominates R1 and R2.
TT: R1 dominates R2 and R3
GE: R2 dominates R1 and R3.
Personalized Routing
• Different drivers may take different routes
because they may have quite different
preferences.
• The same drivers may take different
routes in different contexts.
Morning: try to save time to avoid being late.
Weekend afternoon: try to save fuel
consumption.
• Challenges
Identify contexts for drivers and identify driving
preference in each context.
Deal with time-dependent uncertain travel
costs, e.g., travel time and fuel consumption,
while considering individual drivers’ driving
behaviors, e.g., aggressive vs. moderate
driving.
Drivers’
Trajectories
Framework
Context and Preference
Identification Contexts &
Preferences
Driver,
Source,
Destination,
Time
Routing M
odule
Op
tim
al R
oute
s
Offline Phase Online Phase
Example Results
Dark, bold routes: actual routes used by drivers.
Red routes: shortest routes.
Green routes: fastest routes.
Blue routes: predicted routes using the identified contexts and
driving preferences.
Summary
• Eco-Routing is able to effectively reduce environmental
impact from transportation section.
• Eco-routing is achieved by
Reliably and accurately quantifying vehicles’ environmental impact
using GPS data and 3D road networks;
Assigning various types of eco-weights to edges;
Developing novel routing algorithms that enable effective and
practical eco-routing service.
High-Res 3D Model Usage
Eco-Routing •USD 6 Billion per annum in Fuel
savings with 3D model –
TomTom
•EcoMark: Benchmark shows
better CO2 emissions estimation.
Classifying 3D Models
• Need for accurate 3D spatial networks by
applications.
• Ease of availability of rich terrain datasets.
3D Road Model
High-Res Model
•± 20 cm Accuracy
•Higher Storage Cost
Low-Res Model
•± 2 mts Accuracy
•Lower Storage Cost
Problem Definition
Input
A 2D spatial network, where an edge is modeled as a 2D polyline
A laser scan point cloud containing a set of 3D points
Output
A 3D spatial network, where an edge is modeled as a 3D polyline
2D Spatial Network
+ 3D points
3D Spatial Network
3D Points
ε-Neighborhood
Framework
2D Spatial Network Laser Scan Point Cloud
One pass
Filtering
External Memory
based Filtering
Filtering
Delaunay Triangulation
or
Lifting
Interpolation
3D Spatial Network
One-Pass Filter (OPF)
•3D laser scan points are ―Inherently sorted spatially‖!
•Data Guarantee: A 1m x 1m cell contains two 3-D points.
•Assumption: The road is much smaller and fits in main memory.
Build a grid-index Hg from the road
segments, mapping road points to
cell IDs.
Scan the massive 3D point cloud in a
single pass and build a 3D point to
cell mapping Hp ”seeded” by Hg
If 3D points exceed a threshold cell
occupancy of α x δ x δ where α is
the user defined threshold in [0,1]
and δ is the grid cell width -> Lift cell
contents and release cell memory!
One-Pass Filter (OPF)
Advantages of OPF
Needs no pre-processing of massive 3D point cloud for sorting or
indexing.
Builds parts of the 3D road network as 3D points are streaming
through.
Faster runtimes, lower resolution. (± 2m accuracy)
External Memory based Filtering (EMF)
• No assumption about road network size.
Neither road points nor 3D points fit in
memory. Hence, stored on disk.
• Limited memory budget available to the
method.
Moore Neighborhood: 3,5,8-cell
neighborhood. Ex: grey shaded
region around cell (1,1).
External Memory based Filtering (EMF)
Read road point disk blocks Br from
disk into main memory.
For each block Br - read 3D points
Moore neighborhood blocks from
disk into a limited LRU cache
(size given in multiples of the grid
width c).
Continue this process in a space-
filling curve (Z-order or Row-Major)
fashion to avoid re-reading 3D point
blocks from disk.
Z-Order vs. Row-Major Disk Access
Why is Row-Major doing well at 3c?
LRU Cache
Block to process
Neighbor Block needed in memory
Blocks in cache
Z-Order vs. Row-Major Disk Access
Why is Row-Major doing well at 3c?
LRU Cache (Filled up)
Block to process
Neighbor Block needed in memory
Blocks in cache
Z-Order vs. Row-Major Disk Access
Why is Row-Major doing well at 3c?
Answer: Disk Blocks are read into memory exactly once !
LRU Cache (Filled up)
Block to process
Neighbor Block needed in memory
Blocks in cache
Blocks evicted
External Memory based Filtering (EMF)
Advantages of EMF
Requires little main memory
Slower runtimes, higher resolution (± 20 cm accuracy)
Can work to ―Zoom-in‖ on OPF created maps to improve
resolution on mobile devices
Triangulation
• Delaunay Triangulation
After filtering, a set of 3D points is associated with a cell of road
segments.
The set of 3D points is triangulated.
3D Points
2D Road Points
OPF vs. EMF
Interpolation
a1'a11'
a1
a11
a5'
a5
Project the TIN triangles
to the x-y plane.
Compute the
intersections of triangles
and road 2D polyline.
Re-project road points +
intersections back to TIN
surface again.
Interpolate road points
on TIN to get elevations.
Experimental Study
Dataset Statistics
•Aalborg (AA) : 7km x 5km
•North Jutland (NJ) : 185km × 130km
•NJ approx. 120 times larger than AA.
Experimental Study (Scalability)
• North Jutland (NJ) covering a region of 185km × 130km
• EMF
Accuracy ±20 cm in 21 hrs with memory use of 230 MB
• OPF
Accuracy ±2 m in 2 hrs, which is an order of magnitude
faster than EMF, using 2.4 GB of main memory
Summary
• Augment a 2D road network with elevation values from 3D
point cloud
• Propose two methods that work with limited memory budgets
• Extensive experiments to offer insight into both methods
** 3D Spatial Network available from
UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml/)
Motivation
• The reduction of greenhouse gas (GHG) emissions from
transportation is essential to combat global climate
change.
EU: reduce GHG emissions by 30% by 2020.
G8: a 50% GHG reduction by 2050.
China: a 17% GHG reduction by 2015.
• Eco-driving and eco-routing can reduce GHG emissions.
Eco-driving: driving behavior
Eco-routing: routes that minimize fuel usage
• We need to be able to measure vehicular environmental
impact.
Several impact models have been developed, but a
comprehensive comparison of these is missing.
The utility of the models for eco-driving and eco-routing is not well
understood.
EcoMark
• EcoMark aims to enable a better understanding of how to
compute the environmental impact of vehicles.
• EcoMark is a benchmark that compares and analyzes 11
state-of-the-art vehicular environmental impact models.
No existing such comprehensive benchmark.
• Key characteristics of EcoMark
Uses GPS data from vehicles and a 3D spatial network
Categorizes the 11 models into two major categories
Compares and analyzes the models in the same categories, and
the models between the two categories
EcoMark Framework
GPS observations 2D Spatial Network Laser Scan Points
Input
Instantaneous Impact
6 Instantaneous Models 5 Aggregated Models
Aggregated Impact
Comparison and Analysis
Insights
EcoMark
Map Matching
GPS Trajectories
Spatial Network Lifting
3D Spatial Network
3D Spatial Network
• Spatial network lifting
2D spatial network: OpenStreetMap
Aerial laser scan of Denmark (1+ point per m2; 2.5 TB for
Denmark)
• A 3D spatial network is a directed, weighted graph
denoted as G = (V, E, F, H), where
V and E are the vertex and edge sets.
F maps vertices to 3D coordinates.
H maps edges to grades.
Benchmark Experiments
• Exp1: Eco driving: Comparisons of instantaneous models
Compare the behaviors of the 6 models.
Identify models that behave similarly.
• Exp2: Eco routing: Comparisons of aggregated models
Compare the behaviors of the 5 models.
Identify models that behave similarly.
• Exp3: Cross-comparisons of instantaneous and
aggregated models
Identity the similarity of the aggregation of instantaneous models
and aggregated models.
• Exp4: Study of models that support grade information
How important is 3D – effects of using a 3D spatial network.
Insights
• Instantaneous models
Five of six models behaved consistently.
They can be used to measure the impact of driving behaviors.
They can be used to classify good and bad driving behaviors.
• Aggregated models
All the five models behaved consistently.
Not suitable to identify impact at operational levels.
They can be used to assign eco-weights to road segments, and
are helpful for eco-routing.
• Importance of 3D Spatial Networks
Road grades affect all models substantially.
Eco-driving and eco-routing both benefit.
Summary
• EcoMark
Evaluates empirically 11 existing models of vehicular
environmental impact using a large trajectory database and a
lifted, 3D OpenStreetMap
Findings: Instantaneous models support eco driving, while
aggregated models support eco-routing.
• Next steps
Accuracy comparison with ground truth CAN bus data (EcoMark
2.0)
Requires access to corresponding GPS and CAN bus data.
Eco driving and eco routing
Outline
• Motivation
• Problem definition
• Framework
• ERN building
• GHG emissions estimation
• Empirical study
• Summary
Motivation – Advanced Eco-Weights
• Eco-routing is an effective way to reduce vehicular
environmental impact.
• The capture of the environmental costs of traversing
road network edges is key to eco-routing.
Eco-weights are uncertain.
Eco-weights are time-dependent.
Once obtained, existing routing algorithms can be
applied to compute eco-routes.
Problem Definition
• Goal: Obtain an Eco Road Network G = (V, E, F)
V: Vertex set. Each vertex indicates a road intersection.
E: Edge set. Each edge indicates a road segment.
Function F assigns a time-dependent, uncertain eco-weight to
each edge in E.
• Input
A set of map-matched trajectories TR.
An accompanying road network G' = (V, E, Null).
• Output
The Eco Road Network G = (V, E, F).
Approach
• Time-dependent uncertain histograms.
A vector of (period, histogram) tuples <Ti, Hi>.
Hi is the histogram describing the distribution of cost values
observed in period Ti.
• Time-dependent uncertain histograms are used to
represent the eco-weight of each road network edge.
• Several data compression techniques are applied to
reduce the storage space while retaining accuracy.
[10 a.m., 11 a.m.) [9 a.m., 10 a.m.) [8 a.m., 9 a.m.)
0
0,5
1
0
0,5
1
0
0,5
1
Framework
Trajectories TR Road Network
G = (V, E, NULL)
Traversal Record Analysis
Initial Histogram Construction
Histogram Merging
Bucket Reduction
Eco Road Network
G = (V, E, F)
Traversal Record Analysis
• GPS records are map-matched to the corresponding
edges.
• Map-matched records are transformed into traversal
records.
A traversal record r = (e, t, tt, ge) indicates that edge e is traversed
by a trajectory trj starting at time t and has travel time tt and GHG
emissions ge.
The VT-micro environmental impact model is used to estimate the
GHG emissions of each traversal record.
Initial Histogram Building
• Each edge is associated with a set of traversal records
• Divide the time space into intervals with equal width
The default value is 1 hour, (24 intervals in total).
• For each edge e
Build equi-width histograms for each time interval.
The number of buckets per time interval is configurable.
The histograms are isomorphic.
0.1
0.3
0.5
0.7
[0, 20) [20, 40) [40, 60) [60, 80]
Pro
bab
ilit
y
GHG emissions (mL)
[8 a.m., 9 a.m.)
[9 a.m., 10 a.m.)
Histogram Merging
• For each edge, merge two temporally adjacent histograms
if they are sufficiently similar.
• Use cosine similarity to quantify similarity.
• We use a merge threshold Tmerge to decide when to stop
merging.
sim(Hi , H j ) =V(H i ) ÄV(H j )
V(H i ) · V(H j )
0.1
0.3
0.5
0.7
[0,20) [20,40) [40,60) [60,80]
Pro
bab
ilit
y
GHG emissions (mL)
[8 a.m., 9 a.m.)
[9 a.m., 10 a.m.)
0.1
0.3
0.5
0.7
[0,20) [20,40) [40,60) [60,80]
Pro
bab
ilit
y
GHG emissions (mL)
[8 a.m., 10 a.m.)
Histogram Bucket Reduction
• Further reduce the storage size of an individual histogram
by merging adjacent buckets. Use SSE to measure the merge cost (accuracy loss).
Merge buckets when the cost does not exceed threshold Tred.
Iteratively merge adjacent buckets in all the histograms of a road
segment.
0.1
0.3
0.5
0.7
[0,20) [20,40) [40,60) [60,80]
Pro
bab
ilit
y
GHG emissions (mL)
[8 a.m., 10 a.m.)
0.1
0.3
0.5
0.7
[0,40) [40,60) [60,80]
Pro
bab
ilit
y
GHG emissions (mL)
[8 a.m., 10 a.m.)
GHG Emissions Estimation
• For a route
Estimate the distribution of GHG emissions as a histogram.
Aggregate the histograms of the edges in the route.
• Given two histograms H1 and H2 for adjacent edges
A histogram H′ is computed that represents the aggregated GHG
emissions distribution for traversing both edges.
21
' HHH
0.1
0.3
0.5
0.7
[0,20) [20,40) [40,60)
Pro
bab
ilit
y
GHG emissions (mL)
e1e2
0.1
0.3
0.5
0.7
[0,40) [40,80)[80,120)
Pro
bab
ilit
y
GHG emissions (mL)
e1 + e2
Histogram Aggregation
• Given buckets bj in H1 and bk in H2
• A bucket bi in H’ is generated as
• Enumerate all bucket pairs of H1[j] and H2[k].
• Rearrange the buckets in H′ to obtain an equi-width
histogram.
H '[i]. p= H1[ j ]. p×å H2[k]. p
Range(H '[i]) = Range(H1[ j ])+ Range(H2[k])
Experiments
• Experimental settings
Data Set: 200 million GPS records.
Road network of Denmark: 414K vertices and 818K edges.
GHG emissions: VT-micro is applied to evaluate vehicular
environmental impact.
• Evaluation
Baseline: without any compression.
State-of-the-art: T-drive.
Accuracy, storage, and efficiency.
Storage Study
• Histogram Merging
• Bucket Reduction
MCRm =M init - Mmerge
M init
0.6
0.7
0.8
0.9
1
0.9 0.92 0.94 0.96 0.98
MC
R
Merge Threshold
MCRr =Mmerge - M redu
Mmerge 0.6
0.7
0.8
0.9
1
35 40 45 50 55 60
MC
R
Bucket Limit
TMerge = 0.98
TMerge = 0.95
Accuracy Study
• Histogram Aggregation
• MCR vs. Err
0.8
0.9
1
2 6 10 14 18
HS
imil
arit
y
Number of Edges
0
0.1
0.2
0.3
0.4
35 40 45 50 55 60
Err
Bucket Limit
Tmerge = 0.98
Tmerge = 0.95
0.2
0.4
0.6
0.8
0.1 0.12 0.14 0.16 0.18
MC
R
Err
ERN
T-drive
• Error Metric – Err
The relative accumulative deviation of all the buckets in a
histograms from the original distribution.
0
0.1
0.2
0.9 0.92 0.94 0.96 0.98
Err
Merge Threshold
After Merging
Equi-width
• Histogram Merging
• Bucket Reduction
Summary
• We propose technique to estimate the time-dependent
environmental travel costs in a road network based on
histograms.
• An empirical study suggests that we can efficiently
compute eco-weights of road segments while retaining
accuracy.
• The proposed eco-weights can be used for eco routing
and real-time traffic monitoring.
Overview
• Motivation
• Problem definition
• Cost modeling
• Objective functions
Residual Sum of Squares
Topology constraint
Adjacency constraint
• Experiments
Motivation
• Why weight annotation?
Vehicle routing relies on a weighted-graph representation of the
underlying road network.
Given a graph with appropriate weights, different types of routes
can be computed.
• Why incomplete information?
Weights that capture travel costs such as travel times can be
extracted from GPS trajectory data.
GPS trajectory data typically does not cover all edges.
• Why complete weight annotation?
To use a graph for routing, all edges must have weights.
Problem Definition
GPS Records
Pre-Processing Module
A set of (trip, cost) pairs
{(t(i), c(i))} G''(V, E, null)
Weight Annotation Module
G(V, E, H)
Temporal Road Network Cost Modeling
A road network is modeled as a directed graph.
Temporal tags: during the same temporal tag, the traffic has similar
behavior, e.g., PEAK and OFFPEAK.
Each edge has 2 cost values, which indicate the cost per unit
length for traversing the corresponding road segment during PEAK
and OFFPEAK, respectively.
d(AB, PEAK) indicates the cost per unit length for traversing the road
segment AB during PEAK tag.
d: a vector of size |E| * |TAGS| (e.g., 6 * 2 =12)
A
B C
D
d(AB, PEAK)
d(AB, OFFPEAK)
d(BC, PEAK)
d(BC, OFFPEAK)
d(BA, PEAK)
d(BA, OFFPEAK)
d(CB, PEAK)
d(CB, OFFPEAK) d(BD, PEAK)
d(BD, OFFPEAK)
E d(CE, PEAK)
d(CE, OFFPEAK)
Trip Cost Modeling
• GPS trajectory
A sequence of records of the form (x,y,t).
• Trip
A sequence of records of the form (edge, ts, te).
• The cost of a trip is the sum of the cost of each record.
E.g., t1= <AB, 7:55, 8:00>, <BC, 8:00, 8:10>
cost_model(t1) = d(AB, PEAK) * length(AB) + d(BC, OFFPEAK) *
length(BC)
• If we know all cost variables in d, we can compute the
estimated cost of any trip.
• For every trip t in T’
The square of difference between the actual cost of t and the
estimated cost using the proposed model, i.e., cost_model(t).
E.g., t1= <AB, 7:55, 8:00>, <BC, 8:00, 8:10>, actual_cost(t1) =
100.
RSS(t1) = (100 – cost_model(t1))2
Problem: Training data T’ may not cover the whole road network.
By minimizing the RSS function, we can only get cost variables for
the road segments traversed by T’.
d(BC, OFFPEAK)
Objective Functions: Residual Sum of Squares
A
B C
D
d(AB, OFFPEAK)
d(BC, PEAK)
d(BA, PEAK)
d(BA, OFFPEAK)
d(CB, PEAK)
d(CB, OFFPEAK) d(BD, PEAK)
d(BD, OFFPEAK)
E d(CE, PEAK)
d(CE, OFFPEAK)
d(AB, PEAK)
Objective Function: Topological Constraint
• If two road segments locate in similar topology in a road
network, they tend to have similar cost during the same
tag.
• Given a pair of road segments ei and ej
The weighted square loss between the cost variables of ei and ej.
weight_TC(ei, ej) * (d(ei, PEAK) – d(ej, PEAK))2
If weight_TC (ei, ej) is big, d(ei, tag) and d(ej, tag) tend to be
similar in order to make (d(ei, PEAK) – d(ej, PEAK))2
small.
A
B C
D
d(AB, PEAK)
d(AB, OFFPEAK)
d(BC, PEAK)
d(BC, OFFPEAK)
d(BA, PEAK)
d(BA, OFFPEAK)
d(CB, PEAK)
d(CB, OFFPEAK) d(BD, PEAK)
d(BD, OFFPEAK)
E d(CE, OFFPEAK)
d(CE, PEAK)
Weighted PageRank Computation
• PageRank gives a value to each vertex in a graph.
• Primal graph -> dual graph
Vertex -> Edge
Edge -> Vertex
A
B C
D
E
BA CB
BD
AB BC
CE
Weighted PageRank Computation
• Original PageRank
A vertex evenly gives its PR value to all its adjacent vertices.
Vertex AB gives ½ * PR(AB) to BC and BD, respectively.
• Weighted PageRank
Weight(AB, BD) = 𝑇𝑟𝑖𝑝𝑁𝑢𝑚 𝐴𝐵,𝐵𝐷 +1
𝑇𝑟𝑖𝑝𝑁𝑢𝑚 𝐴𝐵,𝐵𝐷 +𝑇𝑟𝑖𝑝𝑁𝑢𝑚 𝐴𝐵,𝐵𝐶 +|𝑑𝑒𝑔𝑟𝑒𝑒(𝐴𝐵)|
100 trips from AB to BC, 50 trips from AB to BD
weight(AB, BD) = 50+1
50+100+2 =
51
152
weight(AB, BC) = 100+1
50+100+2 =
101
152
No trips information
weight(AB, BD) = weight(AB, BC) = 1
2
BA CB
BD
AB BC
CE
Weighted PageRank based Topological Constraint
• Use weighted PageRank value to quantify the topological
similarity.
If two edges have similar weighted PageRank values, they also
tend to have similar cost.
weight_TC(ei, ej) = m𝑖𝑛(𝑃𝑎𝑔𝑒𝑅𝑎𝑛𝑘 𝑒𝑖 ,𝑃𝑎𝑔𝑒𝑅𝑎𝑛𝑘(𝑒𝑗))
max(𝑃𝑎𝑔𝑒𝑅𝑎𝑛𝑘 𝑒𝑖 ,𝑃𝑎𝑔𝑒𝑅𝑎𝑛𝑘(𝑒𝑗))
Objective Function: Adjacency Constraint
If two road segments are directionally adjacent in a road network,
they tend to have similar cost during the same tag.
d(AB, PEAK) may be similar to d(BC, PEAK)
But not necessary to be similar to d(CB, PEAK) or d(BC, OFFPEAK)
Given a pair of directionally adjacent road segments ei and ej
The weighted square loss between the cost variables of ei and ej.
weight (ei, ej) * (d(ei, PEAK) – d(ej, PEAK))2
A
B C
D
d(AB, PEAK)
d(AB, OFFPEAK)
d(BC, PEAK)
d(BC, OFFPEAK)
d(BA, PEAK)
d(BA, OFFPEAK)
d(CB, PEAK)
d(CB, OFFPEAK) d(BD, OFFPEAK)
E d(CE, PEAK)
d(CE, OFFPEAK)
d(BD, PEAK)
Objective Functions
• Residual Sum of Squares
RSS(d) = (actual_cost(t) – cost_model(t))2t∈T′
• Topological Constraint
PRTC(d) = 𝑤𝑒𝑖𝑔𝑡_𝑇𝐶 𝑒𝑖, 𝑒𝑗 ∗ (𝑑 𝑒𝑖, 𝑡𝑎𝑔 − 𝑑(𝑒𝑗, 𝑡𝑎𝑔))2𝑖,𝑗
• Adjacency Constraint
ACTC(d) = 𝑤𝑒𝑖𝑔𝑡 𝑒𝑖, 𝑒𝑗 ∗ (𝑑 𝑒𝑖, 𝑡𝑎𝑔 − 𝑑(𝑒𝑗, 𝑡𝑎𝑔))2𝑖,𝑗
• Overall objective function
The sum of the above three (plus a regularizer term)
• Differentiating the objective function w.r.t. vector d and
setting it to 0 yields an equation that can be solved using
standard techniques to obtain a solution for d.
Experimental Results
• Two weeks GPS trajectories from SPF
• 18K trips in total, and 20% as training, 80% as testing
• Two road networks: Skagen and North Jutland
• Objective functions used
F1: RSS; F2: RSS + PRTC; F3: RSS + ACTC
F4: RSS + PRTC + ACTC
Outline
• Motivation
• Travel-cost time series
• Modeling travel-cost time series in a road network
• Learning a spatio-temporal hidden Markov model
• Empirical study
• Summary
• Vehicle routing plays an important role in transportation.
E.g., fastest routes, eco-routes.
• Effective vehicle routing relies on a weighted graph
representation of a road network.
The weight of an edge should accurately capture the travel cost of
traversing the edge.
E.g., travel times, greenhouse gas (GHG) emissions.
• Edge weights should be dynamically changing due to the
dynamics of traffic.
To support dynamic edge weights, we propose
techniques that are able to infer near-future travel costs
based on incoming GPS data.
Motivation
Travel-Cost Time Series
• The travel costs on edge ei are modeled as a
time series TSi = <Ci(1), Ci
(2), …, Ci(T)>.
• Travel cost set Ci(j) contains travel costs
observed in the j-th time interval on edge ei.
An interval is the finest time interval of interest.
E.g., 15 minutes, yielding 96 intervals per day.
A travel cost is the travel time or GHG emissions of a
traversal of edge ei during the j-th interval.
E.g., during the j-th interval [8:00, 8:15), if 3 vehicles
traversed ei taking 30, 32, 40 s, Ci(j) = {30, 32, 40}.
e1
e2
e3
TS1
TS2
TS3
Time Series Characteristics
• Dependent
Travel costs on adjacent edges may be dependent.
• Heterogeneous
The traffic on different edges is often different.
Some edges have clear peak/off-peak periods while others do not.
• Sparse
Some intervals do not have any travel cost observations.
Modeling Travel-Cost Time Series
• We use a Spatio-Temporal Hidden Markov Model
(STHMM) to model the interactions among multiple travel-
cost time series in a road network.
State transition
on edge e1
s1t-1=
cong
s1t=
cong s1
t+1
= clr
C1t-1 C1
t C1t+1 Time Series:
TS1
TP TP
OP OP OP
An edge ei has a set of hidden states Si, where each
state indicates a particular traffic status on the edge.
S1={cong, clr}
During an interval, the traffic of an edge is in a state.
Intervals: t-1 (previous) t (current) t+1 (next)
The traffic on an edge evolves over time, transiting
from one state to another according to transition
probabilities (TP).
In an interval, the travel costs observed on an edge
is dependent on the state of the edge according to
output probabilities (OP).
Modeling Multiple Time Series using STHMM
e1
e2
e3
e1
e2
e3
s1t-1=
cong
s1t =
cong
s1t+1=
clr
s2t-1=
clr
s2t =
clr
s2t+1=
cong
s3t-1=
cong
s3t =
clr
s3t+1=
clr
Intervals: t-1 (previous) t (current) t+1 (next)
TP: P(S1t|S1
t-1, S2t-1)
TP: P(S2t|S1
t-1, S2t-1, S3
t-1)
TP: P(S3t|S2
t-1, S3t-1)
Capture the traffic dependency among edges.
Travel-Cost Time
Series Derived from
Historical GPS Data
Framework Overview
State Formulation &
Parameter Learning
Spatio-Temporal
Hidden Markov Model
Routing Algorithms
Near Future Edge
Weights
Real-Time
GPS Data
Eco-routes &
Fastest-routes
State Formulation – Intuitions
• Since edges are heterogeneous, we need to identify a
unique state set for each edge.
Edges with heavy and complicated traffic: more states
Edges with light and simple traffic: fewer states
• Target properties of the state set of an edge
The traffic in the same state behaves similarly.
The current state depends on its previous state.
State Formulation – Methods
• For each edge, identify a Gaussian Mixture Model (GMM)
with K Gaussian components to capture the distribution of
all the travel costs on the edge.
Due to heterogeneity, different edges may have different Ks.
• Transform the travel costs in each interval in the edge’s
time series into a K+1 dimensional point.
The first K dimensions correspond to the travel cost distribution in
the interval, still using the K Gaussian components.
The K+1st dimension captures the difference between the travel
cost distributions in the interval and its previous interval.
• Cluster these K+1 dimensional points into several clusters,
and each cluster is a state.
Parameter Learning
• Output probabilities
Since each state is a cluster, we use the cluster center to derive a
GMM as the output probability for the state.
• Transition probabilities
For each edge, we need the edge’s travel-cost time series, along
with its adjacent edges’ travel-cost time series.
Based on the obtained output probabilities, transition probabilities
are then estimated using maximum likelihood estimation.
Travel Cost Inference based on an STHMM
e1
e2
e3
Est.
s1t
Est.
s2t
Est.
s2t+1
Est.
s3t
Intervals: t (current) t+1 (next)
Real Time
GPS data
On e3
Real Time
GPS data
On e2
Real Time
GPS data
On e1
OP +
Bayes
OP +
Bayes
OP +
Bayes
OP
Est.
Travel
Costs
TP
TP
TP
Empirical Settings
• Data
181 million GPS records from North Jutland, Denmark from April
2007 to March 2008.
The data is split into training and testing sets.
• Periods of interest
Weekdays during [6:00, 20:00).
• Road network
1,916 edges in North Jutland with more than 500 records.
• Travel costs
Travel time and GHG emissions.
• Accuracy measurements
Average sum of squared loss (ASSL) between estimated costs and
real costs.
Experimental Study
• Aspects studied
Effect of the length of interval.
Effect of the sizes of training data
Effect of recent data
Effect of seasonality
• Results
Efficient; able to support real-time weight updates
Accurate; outperforms a baseline and the state-of-the-art
Summary
• Described an STHMM that is able to model multiple
correlated, sparse time series.
Dependent: STHMM couples the states from adjacent edges.
Heterogeneous: State formulation identifies a unique state set for
each edge.
Sparse: Smoothing techniques are used to contend with sparse
data.
• The described framework is able to support real-time
weight updates and thus provide more accurate routing
services, e.g., eco-routing.
Outline
• Motivation and challenges
• Road network modeling
• Stochastic skyline route planning
• Empirical studies
• Summary
• Reduction in greenhouse gas (GHG) emissions is
important to the whole society.
EU: reduce GHG emissions by 30% by 2020.
G8: a 50% GHG reduction by 2050.
China: a 17% GHG reduction by 2015.
• Eco-routing is a simple yet effective way to achieve the
reduction from transportation section.
Of interest to both fleet owners and individual drivers.
For example, FlexDanmark is interested in using most eco-friendly
routes while considering travel times and distances.
• We aim to provide useful eco-routing services.
Motivation
Challenges
• To provide practically useful eco-routing, we need to
contend with three challenges.
• Multiple travel costs
Consider not only GHG emissions, but also distances and travel
times.
• Time dependence
Some travel costs are time-dependent.
For example, travel times on the same road may vary during peak
vs. off-peak periods.
• Uncertainty
Some travel costs are uncertain.
For example, aggressive and moderate driving on the same road
may produce different GHG emissions.
A novel road network model - MTUG
• MTUG is a Multi-cost, Time-dependent, Uncertain Graph.
• Assume N different costs are of interest in total.
Distance (DI), travel time (TT), GHG emissions (GE).
• G=(V, E, MM, W)
V is the vertex set, and E is the edge set.
MM = <MM(1), … , MM(N)>
Function MM(i) maps an edge to the minimum and maximum i-th
cost of using the edge.
MM(TT) (ea) = (150 seconds, 500 seconds)
MM(GE) (eb) = (10 ml, 85 ml)
W = <W(1), … , W(N)>
Function W(i) maps an edge to a set of (interval, random variable)
pairs of the i-th cost type.
W(TT) (ea) = {([0:00, 7:15), N(300, 120)), ([7:15, 8:45), N(450, 100)), …}
W(GE) (eb) = {([0:00, 7:00), N(30, 100)), ([7:00, 9:00), N(50, 80)), …}
Instantiation of MM in an MTUG
• MM and W are instantiated using GPS records.
• GPS records are map matched to edges.
• Each edge is associated with a set of edge records in the
form of (e, t, C).
An edge record indicates that a traversal on edge e at time t takes
costs C, where C is a vector of all costs of interest.
(e1, 8:08, <55 seconds, 80 ml>)
(e1, 9:18, < 45 seconds , 63 ml>)
(e1, 10:10, < 43 seconds , 60 ml>)
(e1, 21:03, < 45 seconds , 62 ml>)
• Based on the edge records on an edge, functions MM on
the edge can be instantiated.
MM(TT) (e1) = (43 seconds, 55 seconds)
MM(GE) (e1) = (60 ml, 80 ml)
Instantiation of W in an MTUG
• Partition a day into 96 15-min intervals.
• For each (edge, interval) pair, we obtain a multi-set containing
the costs on the edge during the interval.
ms={(10 s, 3), (20 s, 10), (25 s, 20), (30 s, 10), (40 s, 5), …}
• Estimate a random variable (RV) based on the multi-set.
Use a Gaussian Mixture Model (GMM) to represent an RV.
GMMs can approximate arbitrary distributions.
A GMM is a weighted sum of K Gaussian distributions.
GMM(x) = 𝑚𝑘 ∙ 𝑁(𝑥|µ𝑘 , δ𝑘2)𝐾
𝑘=1
Instantiation of W in an MTUG (cont.)
• If two RVs in two adjacent intervals are similar, we combine
the two intervals into a long interval.
Use KL-divergence to measure the similarity between two RVs.
• Re-estimate a new RV for the long interval using the costs in
the long interval.
• The whole procedure works iteratively until no RVs from
consecutive intervals are similar enough to be combined.
• The long intervals along with their RVs instantiate W.
Route Costs in MTUG
• Given a route Ri=<r1, r2, …, rX>, where ri ∈ E is an edge.
• RC(Ri, t) indicates the costs of using route Ri at time t
RC(Ri, t) = <RVDI, RVTT, RVGE> is a vector of RVs, and each RV
corresponds to a travel cost.
• RVDI is a deterministic value, which equals to the sum of
the length of each edge in route Ri.
• RVTT is the convolution of the corresponding travel time
RV of each edge in route Ri.
Deciding the travel time RV of the first edge r1 is dependent on the
trip start time t.
Deciding the travel time RV of the k-th edge rk is dependent on the
travel time of the previous k-1 edges, which may be uncertain.
• RVGE is the convolution of the corresponding GHG
emission RV of each edge in route Ri.
Stochastic Dominance
• Stochastic Dominance between two RVs
Given two RVs X and Y, if cdfX(a) >= cdfY(a), for all possible value
a in R+, we say ―X stochastically dominates Y‖.
RC(R1, t).RVTT stochastically dominates RC(R2, t). RVTT.
No stochastic dominance between RC(R3, t). RVTT and RC(R4, t).
RVTT.
Stochastic Skyline Routes
• Route Ri dominates route Rj iff
No RV of cost(Rj) stochastically dominates the corresponding RV
of RC(Ri).
At least one RV of RC(Ri, t) stochastically dominates the
corresponding RV of RC(Rj, t).
• Stochastic skyline route
Given a source-destination pair, and a trip starting time.
Stochastic skyline routes are the routes that are not dominated by
any other routes.
Trajectories
Framework Overview
Pre-Processing
Road Network
MTUG
Source,
Destination,
Time
Routing M
odule
Sto
cha
stic S
kylin
e R
ou
tes
Cost Records
Offline Phase Online Phase
Instantiating MM and W
Stochastic Skyline Route Planning
• A brute force method
Enumerate all possible routes, compute the route costs, and check
whether one route dominates another
Very inefficient, and only works for small road networks
• An efficient method
Prune some routes that cannot become skyline routes early.
Efficient stochastic dominance checking
Early Pruning Strategy
• Do the following for all travel cost types of interest.
We use travel time as an example.
We maintain a graph where each edge is associated with the
minimum travel time, which is recorded in MM.
From the destination, run Dijkstra’s algorithm on the graph.
As each vertex is associated with the minimum travel time, we get the
―fastest‖ possible route from the source to the destination.
Compute the route cost of the ―fastest‖ possible route and add the
route as a candidate Skyline route.
vs vd
Each vertex vx has the
shortest distance
least travel time
least GHG emissions
to the destination vd
v1
v2
v3
Candidate skyline route 1
Candidate skyline route 2
Candidate skyline route 3
Early Pruning Strategy (cont.)
• Explore routes from source, until no more routes can be
explored.
Estimate the least possible travel costs for a partially explored
route R’.
If the partially explored route with its estimated least possible costs
is dominated by an existing candidate skyline route, there is no
need to explore the route any further.
Otherwise, continue exploring.
Update candidate skyline route if necessary.
vs vd
v1
v2
v3
Candidate skyline route 1
Candidate skyline route 2
Candidate skyline route 3 R’
Candidate skyline route 4
Stochastic Dominance Checking
• Naïve way: check according to the definition of stochastic
dominance.
For each value a, check whether cdfX(a) >= cdfY(a).
• An efficient way
Consider one cost type at a time.
Compute the minimum and maximum possible travel costs of a
route.
• Distinguish among three cases based on the min and max
travel costs of two routes.
Disjoint case: dominance.
Stochastic Dominance Checking (cont.)
• Covered case: non-dominance.
• Overlapping case (needs further checking)
Both dominance and none-dominance may occur.
Empirical Settings
• GPS Data
More than 180 million GPS records collected from Denmark in
2007 and 2008.
1 Hz sampling rate: one GPS record per second.
• Road networks (from OpenStreetMap)
Aalborg (AA): 4,981 vertices and 12,614 edges.
North Jutland (NJ): 68,318 vertices and 163,546 edges.
Jutland (JU): 667,876 vertices and 1,622,974 edges.
AA is the largest city in NJ, and NJ is part of JU.
• Travel costs
Distances (DI): obtained from OpenStreetMap.
Travel time (TT): computed from the timestamps in the GPS
records.
GHG emissions (GE): computed from the instantaneous speeds
and accelerations from the GPS records.
MTUG Generation
• Hot edges: edges covered by the GPS data set.
• Instantiate MM and W based on the GPS data.
• Cardinalities of TT and GE intervals.
Largest cardinalities: 65 for TT and 72 for GE.
Average cardinalities:
• Cold edges: edges not covered by the GPS data set.
Derive corresponding travel costs using speed limits.
The traffic in AA is much more
dynamic that that in other parts.
Stochastic Skyline Route Queries
• Query settings:
• 50 random source-destination pairs with a starting time in
each setting.
• Study the cardinality of the stochastic skyline routes and
the run time.
Cardinalities of Skyline Routes
• As the number of travel cost types considered increases,
the cardinality of the skyline routes also increases.
• When distance between source and destination increases,
the cardinality of the skyline routes also increases.
Run Time
• The run-time for a query is below 5 seconds, and most
queries are computed in less than 2 seconds.
• Advanced dominance checking method is effective and
efficient.
Benefits of using the advanced dominance checking is more
substantial on JU since long routes require more dominance
checks.
Summary
• Described a framework that enables stochastic skyline
route planning in road networks with multiple, time-
dependent, and uncertain travel costs.
• Enables eco-routing in a realistic setting.
Motivating Example
Fastest route: Route #1
Shopping
Mall
Food
Store
50
70
S
E
B
F
G A
I
H
J
K D Route #1
Motivating Example
Shortest route: Route #2
Shopping
Mall
Food
Store
50
70
S
E
B
F
G A
I
H
J
K D
Route
#2
Motivating Example
Route #3 avoids parking lots
Shopping
Mall
Food
Store
50
70
S
E
B
F
G A
I
H
J
K D
Route
#3
Motivating Example
Route #4 is attractive when stores are closed
Shopping
Mall
Food
Store
50
70
S
E
B
F
G A
I
H
J
K D
Route #4
Motivating Example
Which route from source S to destination D is preferable?
Shopping
Mall
Food
Store
50
70
S
E
B
F
G A
I
H
J
K D Route #1
Route
#3
Route
#2
Route #4
The fastest – takes the least travel time?
The shortest – smallest distance?
The most ―direct‖– fewest turns?
The one most constrained to main roads –
better road quality?
The ―safest,‖ avoiding bad neighborhoods?
…
The one most preferred by locals?
Introduction
• Local travel
Knowledge of the surroundings
Follow familiar routes
• Travel in unfamiliar surroundings to unknown destinations
Depend on available routing services
Expect that the provided route is the best
• Idea: Use GPS data to let those who travel in unfamiliar
surroundings benefit from the insights of local travelers
Motivation
• Can’t we just use a standard routing service?
• How often do drivers follow routes recommended by
routing a service?
V. Čeikutė and C.S. Jensen. Routing service quality—local driver behavior versus routing services. MDM 2013, pp. 97–106
0
500
1000
1500
2000
2500
0.5-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-220
# r
oute
s
route length (km)
Routes
5000 routes
Motivation
• How often do drivers follow routes recommended by
routing a service?
• Breakdown by distance traveled 7
3%
56
%
44
%
36
%
19
%
9%
18
%
20
%
11%
0
20
40
60
80
100
0.5-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-220
on a
vg
ma
tch
(%
)
route length (km)
V. Čeikutė and C.S. Jensen. Routing service quality—local driver behavior versus routing services. MDM 2013, pp. 97–106
• ~60% of routes match the
route provided by the routing
service; ~40% do not!
• Use of previous user’s trips
can improve routing quality
Goal of the Study
• Propose a routing framework that
Utilizes GPS data volunteered by local drivers
Exploits possibly hard-to-formalize insight into local conditions
Takes into account temporal variation in driver behavior
Recommends routes based on popularity and temporal aspects
• Evaluate the quality of proposed routes
Study based on trip length and pre-selected drivers
Quality comparison with existing routing service and route
recommendation approaches
Outline
• Introduction and motivation
• Framework
• Data preparation methodology
• Scoring of routes
• Empirical study
Data foundation
Routing quality evaluation and comparison
• Related work
• Conclusion and research directions
Framework
offline
GPS logs map-
matching trajectory
DB
Routing
service
online
trajectory
data
user’s
input data
top route
user
GPS trace
• source
• destination
• time
Trips used:
• Start or end at, or go through,
source and destination locations
• Trips that start during the issued
time period are preferred
When the source and
destination are not covered
by available GPS data, an
existing routing service is
used.
Data Preparation Methodology
• A trip is a sequence of
timestamped GPS points
between stay points.
• Stay point
A location where the user stays
for a certain period time (no
GPS data is received)
A location reported repeatedly
for a certain period of time (the
engine remained on)
• In practice, trips are
consolidated
Short trips are dropped and
some consecutive trips are
merged
x
y
t
same
location
Data Preparation Methodology
• Trips are map-matched and
transformed into sequences of
road segments with entry and
exit timestamps
• Example
Trip 𝑡𝑟1is map-matched to a
sequence of road segments:
(𝑠1, 𝑡𝑠1 , 𝑡𝑒1), (𝑠2, 𝑡𝑠2 , 𝑡𝑒2), (𝑠3, 𝑡𝑠3 , 𝑡𝑒3)
𝑡𝑟3
𝑡𝑟2
𝑡𝑟1
𝑠1
𝑠2
𝑠3
Data Preparation Methodology
The similarity of two trips is
determined by applying
Longest Common
Subsequence (LCSS) to the
trajectories represented as
sequences of road segments.
s1
s2 s3 s4
s1
s2 s3
s4 s11 s9 s10
s10 s9
s5
s6 s7
s8
Similar trips belong to the same route.
𝑆1 𝑝𝑙𝑠, 𝑝𝑙𝑠′ =𝑙𝑒𝑛𝑔𝑡ℎ(𝐿𝐶𝑆𝑆(𝑝𝑙𝑠,𝑝𝑙𝑠′))
𝑙𝑒𝑛𝑔𝑡ℎ(𝑝𝑙𝑠)∙ 100%
𝐶1 𝑝𝑙𝑠, 𝑝𝑙𝑠′ = 𝑡𝑟𝑢𝑒𝑖𝑓𝑆1 𝑝𝑙𝑠, 𝑝𝑙𝑠′ ≥ 𝑔𝑒𝑡𝑀𝑎𝑡𝑐(𝑝𝑙𝑠, 𝑝𝑙𝑠′)𝑓𝑎𝑙𝑠𝑒𝑜𝑡𝑒𝑟𝑤𝑖𝑠𝑒
Similarity of two
trajectories
The similarity threshold
depends on the trip length.
Longer trips, higher
similarity thresholds,
interval [90-100].
Do trips follow the same route?
Data Preparation Methodology
• Trips that follow the same sequence of road segments are
grouped into route usage objects.
user route # traversals
𝑢1 𝑟1: 𝑠𝐵𝐶 , 𝑠𝐶𝐷, 𝑠𝐷𝐹 , 𝑠𝐹𝐺 - 10off
𝑢2 𝑟2: 𝑠𝐶𝐷, 𝑠𝐷𝐹 , 𝑠𝐹𝐺 3peak 5off
𝑢3 𝑟3: 𝑠𝐸𝐷 , 𝑠𝐷𝐹 , 𝑠𝐹𝐼 , 𝑠𝐼𝐺 7peak 4off
𝑢4 𝑟2: 𝑠𝐶𝐷, 𝑠𝐷𝐹 , 𝑠𝐹𝐺 5peak 6off
A B
𝑟1
𝑟2
𝑟3 E
D F
I
G
C
• Route 𝑟2 is taken by users 𝑢2 and 𝑢4 3 and 5 times during
peak hours and 5 and 6 times during off-peak hours.
(𝑟2, 𝑝𝑙, 𝑝𝑒𝑎𝑘, 𝑢2, 3 , 𝑢4, 5 , 𝑜𝑓𝑓, 𝑢2, 5 , 𝑢4, 6 )
Scoring of Routes
Preferred routes
• Popular among drivers
• Taken by many distinct drivers
• Popular on the time of the day
and day of the week of the query
peak hours
off-peak hours
destination
source
B
A
Final route score:
𝑠𝑐𝑜𝑟𝑒 𝑟 = 𝛽 ∙ 𝑝𝑟𝑒𝑓𝑀 𝑟 + (1 − 𝛽) ∙ 𝑝𝑟𝑒𝑓𝑁(𝑟)
Scoring of Routes
Route preference value:
𝑝𝑟𝑒𝑓 𝑟 = 𝛼 ∙ 𝑢𝑠𝑒𝑟𝑠 𝑟 + (1 − 𝛼) ∙ 𝑡𝑟𝑎𝑣𝑒𝑟𝑠𝑎𝑙𝑠(𝑟)
# distinct drivers
taking the route
# of traversals
of the route
Considers trips taken during
the query temporal pattern
Considers trips taken during
other temporal patterns
Empirical Study: Data
• Monitoring period: 2 years
• Number of drivers: 285
• Number of GPS points (raw data): ~182,700,000
• Number of trips: ~275,000
• ―Pay as You Speed‖ project (http://www.sparpaafarten.dk)
0
20000
40000
60000
80000
100000
120000
140000
2-10 10-20 20-30 30-40 40-50 50-350
# tra
ve
rsals
traversal length (km)
Afternoon peak
Morning peak
Off-peak
Routing Quality Evaluation: Data
For this study, we randomly selected equal amounts of trips
for different trip length intervals
0
100
200
300
400
500
600
700
2-10 10-20 20-350
# tra
ve
rsals
traversal length (km)
Off-peak
Morning peak
Afternoon peak
Routing Quality: Match
A match is identified using LCSS
Trajectory
DBtest
Routing
Service
source and
destination
Compare
route with
trajectory
preferred
route
trajectory
Total
number of
matches match! C
A
B
Trajectory
DB
Routing Quality Evaluation: Results
Our Proposal Google Directions API (top-1)
TPMFP (SIGMOD’13)
0
20
40
60
80
100
2-10 10-20 20-350
%
length (km)
Unmatched
Matched
0
20
40
60
80
100
2-10 10-20 20-350
%
length (km)
Unmatched
Matched
0
20
40
60
80
100
2-10 10-20 20-350
%
length (km)
EmptyUnmatchedMatched
Routing Quality Evaluation: Data
For this study, we considered the five drivers with the most
trips.
0
500
1000
1500
2000
2500
3000
3500
1 2 3 4 5
# tra
ve
rsals
driver
2-10 km
10-30 km
30-300 km
0
1000
2000
3000
4000
5000
1 2 3 4 5# tra
ve
rsals
driver
Off-peak
Afternoon peak
Morning peak
Traversals per Length Traversals per Temporal Pattern
Routing Quality Evaluation: Results
Our proposal
0%
20%
40%
60%
80%
100%
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
source and dest. source destination subroutes
Different Types of Route Calculation
Matched Unmatched Empty
0
20
40
60
80
100
1 2 3 4 5
%
driver
unmatched matched
Google Directions API (top-1)
Related Work
[1] K.-P. Chang, L.-Y. Wei, M.-Y. Yeh, and W.-C. Peng. Discovering personalized routes from trajectories. LBSN 2011, pp. 33–40
[2] Z. Chen, H. T. Shen, and X. Zhou. Discovering popular routes from trajectories. ICDE 2011, pp. 900–911
[3] W. Lou. H. Tan, L. Chen, and L.M. Ni. Finding time period-based most frequent path in big trajectory data. In SIGMOD 2013,
pp. 713–724
[4] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang. T-Drive: Driving directions based on taxi trajectories. In GIS
2010, pp. 99–108
• Four existing routing techniques use drivers’ trajectories
[1],[2],[3],[4]
The road network is formed from the road segments that are
covered by the trajectory data set
A route is formed by
Prioritizing parts of roads that are followed the most by a specific
driver (personalized routes) [1]
Prioritizing parts of roads that are taken by other drivers [2],[4]
Possibly using sub-routes from multiple routes [3]
Trajectories used for scoring must contain the destination and
must start and end during the provided time interval. [3]
Suggested routes are formed from the most popular routes or
route parts in the available data set.
Conclusion and Research Directions
Conclusions
• The proposed framework utilizes trajectory data collected from local
drivers for routing.
• A preferred route is selected using a flexible scoring function that
considers
The number of traversals of the route
The number of distinct drivers taking the route
The time periods when the traversals occurred
• Use of travel histories of local drivers can increase routing quality
• More details in the paper!
Research directions
• Additional aspects of the framework can be considered
Efficiency of route identification process (LCSS technique)
Inclusion of personalized routes
Better support for routes that are constructed from sub-routes
Demos and Prototype Systems
• EcoTour: http://daisy.aau.dk/its/
Computes and compares the shortest, the fastest, and the most
eco-friendly routes for arbitrary source-destination pairs in DK.
Best demo award at IEEE MDM 2013.
• EcoSky: http://daisy.aau.dk/its/eco/
Supports skyline eco-routing and personalized eco-routing
• Sheafs: http://daisy.aau.dk/its/sheaf
Trajectory based traffic sheafs
• Strict-Path Queries: http://daisy.aau.dk/its/spqdemo
Trajectory based, Strict-Path Queries
Trips (historical travel-time), route choice, Napoleon (road usage)
• iPark: identifying parking spaces from GPS trajectories
On-street parking lanes vs. parking zones
The Future
• Much more data
Inductive loop detectors
Bus data
Rejsekortet
• Much more connected vehicles
• New services
Routing
Safety and warnings
Parking, fees, insurance, road pricing
Car sharing, multi-modality
• Driver-less vehicles
Challenges
• Detect ―black spots‖ before they occur.
• Modeling spatio-temporal congestion from data
• Characterize the effects of events
Accidents, malfunctioning of traffic signals, rain, a concert
• Real-time traffic management
In response to current or predicted situation, actuate traffic signals
and drivers (via their smartphones or navigation devices) to
optimize the use of the infrastructure and driver experience
• Automated trade-off between weight level of detail and
available data.
• Stochastic routing at 20 milliseconds.
• Integrate with ―point‖ data.
Acknowledgments
• Colleagues at Aalborg University, Aarhus University, and
beyond.
• The EU FP7 project, Reduction: http://www.reduction-
project.eu/
• The Obel Family Foundation: http://www.obel.com/en
Readings
• V. Ceikute, C. S. Jensen: Vehicle Routing with User-Generated Trajectory Data. MDM (1) 2015
• C. Guo, B. Yang, O. Andersen, C. S. Jensen, K. Torp: EcoMark 2.0: empowering eco-routing with vehicular environmental models and actual vehicle fuel consumption data. GeoInformatica 19(3):567-599 (2015)
• C. Guo, Bin Y., O. Andersen, C. S. Jensen, K. Torp: EcoSky: Reducing vehicular environmental impact through eco-routing. ICDE 2015:1412-1415
• B. Yang, C. Guo, Y. Ma, C. S. Jensen: Toward personalized, context-aware routing. VLDB J. 24(2):297-318 (2015)
• Y. Ma, B. Yang, C. S. Jensen: Enabling Time-Dependent Uncertain Eco-Weights For Road Networks. GeoRich@SIGMOD 2014:1:1-1:6
• B. Yang, C. Guo, C. S. Jensen, M. Kaul, S. Shang: Stochastic skyline route planning under time-varying uncertainty. ICDE 2014:136-147
• B.- Yang, M. Kaul, C. S. Jensen: Using Incomplete Information for Complete Weight Annotation of Road Networks. IEEE TKDE 26(5):1267-1279 (2014)
• C. Guo, C. S. Jensen, B. Yang: Towards Total Traffic Awareness. SIGMOD Record 43(3):18-23 (2014)
Readings
• V. Ceikute, C. S. Jensen: Routing Service Quality - Local Driver Behavior Versus Routing Services. MDM (1) 2013: 97-106
• M. Kaul, B. Yang, C. S. Jensen: Building Accurate 3D Spatial Networks to Enable Next Generation Intelligent Transportation Systems. MDM 2013: 137-146, best paper award.
• Ove Andersen, Christian S. Jensen, Kristian Torp, Bin Yang: EcoTour: Reducing the Environmental Footprint of Vehicles Using Eco-routes. MDM 2013: 338-340, Demo paper, best demo award.
• B. Yang, C. Guo, C. S. Jensen: Travel Cost Inference from Sparse, Spatio-Temporally Correlated Time Series Using Markov Models. PVLDB 6(9): 769-780 (2013)
• C. Guo, Y. Ma, B. Yang, C. S. Jensen, Manohar Kaul: EcoMark: evaluating models of vehicular environmental impact. SIGSPATIAL/GIS 2012: 269-278
• L. Speicys, C. S. Jensen: Enabling Location-based Services - Multi-Graph Representation of Transportation Networks. GeoInformatica 12(2): 219-253 (2008)
• A. Brilingaite, C. S. Jensen: Enabling Routes of Road Network Constrained Movements as Mobile Service Context. GeoInformatica 11(1): 55-102 (2007)
• C. Hage, C. S. Jensen, T. B. Pedersen, L. Speicys, I.Timko: Integrated Data Management for Mobile Services in the Real World. VLDB 2003: 1019-1030