GIS Capabilities on Big Data Systems

Geospatial Analytics and Spatial Capabilities on Big Data Systems

By: Ahmed Jawad (PhD)

Agenda

• Why Analyze Telematics

• Analysis of Movement Data

• Analytical Assets for Telematics

• Operational Requirements on Telematics

• Data flow on big data platforms

• Analytical Challenges and Applications Solved through Machine Learning
  • Snap to road
  • Unifying trajectories to patterns of movements and routines
  • Traffic event detection

Why Analyze Telematics

• We are being recorded everywhere
• Provides great insights into customer routines and movement
• Key players competing in the market

Analyzing Movement Data

Trajectory

• Object in motion (time – space)
• Raw trajectories: coordinate-based recording
• Symbolic trajectories: obtained by discretization into streets, locations, or events
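As a rough illustration of the step from raw to symbolic trajectories, the sketch below discretizes (lat, lon) fixes into grid-cell symbols. The grid-based scheme, the cell size, and the names `discretize` and `cell_deg` are assumptions made for this example; the slides also name streets, locations, and events as possible alphabets.

```python
import math
from typing import List, Tuple


def discretize(points: List[Tuple[float, float]], cell_deg: float = 0.01) -> List[str]:
    """Map raw (lat, lon) fixes to grid-cell symbols.

    Consecutive fixes that fall into the same cell are collapsed into one
    symbol, so the result is a sequence of visited cells.
    """
    symbols: List[str] = []
    for lat, lon in points:
        row = math.floor(lat / cell_deg)
        col = math.floor(lon / cell_deg)
        symbol = f"c{row}_{col}"
        # Only emit a symbol when the object enters a new cell.
        if not symbols or symbols[-1] != symbol:
            symbols.append(symbol)
    return symbols


raw = [(52.5201, 13.4049), (52.5204, 13.4051), (52.5312, 13.4049)]
print(discretize(raw))
```

The first two fixes share a cell and collapse into one symbol, so three raw points become a two-symbol trajectory.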

Traditional Operational Requirements In The World Of Geographic Information Systems (GIS)

• Traditional use cases: cartography, geo-algebra (display of statistical events, hotspots, co-locations on the map)
• Databases used: PostgreSQL, SQL Server
• Mostly static data sources
• Relatively small data sets
• Moderate geometric accuracy
• Offline processing acceptable
• Support for complex geometric data types

Operational Requirements and Design Considerations for Telematics

• Real-time ingestion and analytics on sensor data, distance queries, snap-to-road
• Data at the scale of 100 TB to petabytes
• High variation in geospatial queries (range queries, etc.) and high throughput of CRUD operations (insertion, deletion, read)
• Processing flow and map applications; the nature of the relationships in the data determines the storage technology, the indexing techniques, and their implications

Telematics and Geospatial Data Types

• Spatial data structures:
  • Raster: geographically referenced matrix of uniform size
  • Vector: features on the earth's surface are represented as geographically referenced vector objects
• Hierarchical nature of objects
  • Points: different types (entity, label, area, node)
  • Lines: lines, polylines, arc, link, etc.
  • Polygons: area, polygon, complex polygon
• Requirements: the ability to manipulate geospatial data
  • Databases and libraries are required to manipulate these objects at distributed scale (Spark and Scala, MongoDB, or any other NoSQL database)

Analytical Assets for Telematics

• The analytical assets for telematics can be broadly grouped into:

• Snap-to-road

• Analysis of User Activities (Clustering)

• Traffic Event Detection (Classification)

• Realtime location search

• Set operations on geometry objects and geo-algebra (layering of geospatial information atop each other and algebraic operations on the layers)

Conceptual dataflow and geospatial processing in Telematics

[Figure: conceptual dataflow diagram. Components: PDA (event capture); Kafka (event processing and delivery decision); stream processing engine (PDA geodata and critical events); persistence layer (Mongo / HBase, Cassandra / Elastic, on top of Hadoop); Tomcat app as datafeed client (optional raster processing with GeoTrellis); client (D3 / Ajax / Leaflet). Connections are labelled Push, Pull, Stream, API Push (REST), and WebSocket. Further labels: Risk area, Preload risk area, Preload traffic info.]

The persistence layer should be scalable and should support storage and querying of spatiotemporal objects (points, polygons, lines, line strings; for reference see MongoDB's 2dsphere indexing and geospatial querying). The following low-level queries shall be supported: (1) nearest-neighbor query: given a point (lat, long), find all the line strings that are within a radius of x meters; (2) containment query: return all the points within a polygon, or, given a point, find all the polygons containing it.
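The two low-level queries could be expressed as MongoDB filter documents along the following lines. This is a sketch, not a tested deployment: the field name `geom` and the helper names are hypothetical, and the filters assume a 2dsphere index on the queried field. The operators `$nearSphere`, `$geoWithin`, `$geoIntersects`, `$geometry`, and `$maxDistance` are standard MongoDB query operators; GeoJSON coordinates are ordered longitude, latitude.

```python
from typing import Any, Dict, List


def near_filter(field: str, lon: float, lat: float, radius_m: float) -> Dict[str, Any]:
    """Geometries (e.g. line strings) within radius_m meters of a point."""
    return {field: {"$nearSphere": {
        "$geometry": {"type": "Point", "coordinates": [lon, lat]},
        "$maxDistance": radius_m,
    }}}


def within_polygon_filter(field: str, ring: List[List[float]]) -> Dict[str, Any]:
    """Points contained in a polygon; the ring must be closed (first == last)."""
    if ring[0] != ring[-1]:
        raise ValueError("polygon ring must be closed")
    return {field: {"$geoWithin": {
        "$geometry": {"type": "Polygon", "coordinates": [ring]},
    }}}


def containing_polygons_filter(field: str, lon: float, lat: float) -> Dict[str, Any]:
    """Stored polygons that contain the given point."""
    return {field: {"$geoIntersects": {
        "$geometry": {"type": "Point", "coordinates": [lon, lat]},
    }}}


# Hypothetical usage with a driver such as pymongo (not executed here):
#   db.roads.find(near_filter("geom", 13.405, 52.52, 50))
print(near_filter("geom", 13.405, 52.52, 50))
```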

Client browser, e.g. for a fleet manager. In the current scheme, all the intelligence is deferred to the client: raster processing, displaying the map, and the different layers along with map algebra are done on the client side, for example with Leaflet. An alternative strategy is to use geotrellis.io as a geoprocessing engine for the raster operations and to use the client only for displaying the map.

Stream processing queries: (1) instantaneous speed / angular momentum of the PDA; (2) distance to a traffic event pulled from Bing; (3) running aggregates, e.g. how long the vehicle has spent at the current location.
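A minimal sketch of how such per-event computations might look, independent of any particular stream engine. It assumes each fix is a `(timestamp_s, lat, lon)` tuple and uses the haversine great-circle distance; the function names, the 50 m radius, and the spherical-earth radius are assumptions of this example. The same distance function could serve for the distance to a traffic event.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_M = 6_371_000.0  # mean earth radius, spherical approximation
Fix = Tuple[float, float, float]  # (timestamp in seconds, latitude, longitude)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def instantaneous_speed(prev: Fix, cur: Fix) -> float:
    """Speed in m/s estimated from two consecutive fixes."""
    dt = cur[0] - prev[0]
    if dt <= 0:
        return 0.0
    return haversine_m(prev[1], prev[2], cur[1], cur[2]) / dt


def dwell_seconds(fixes: List[Fix], radius_m: float = 50.0) -> float:
    """How long the latest fixes have stayed within radius_m of the last fix."""
    if not fixes:
        return 0.0
    t_last, lat_last, lon_last = fixes[-1]
    start = t_last
    # Walk backwards until the object was outside the radius.
    for t, lat, lon in reversed(fixes):
        if haversine_m(lat, lon, lat_last, lon_last) > radius_m:
            break
        start = t
    return t_last - start
```

In a streaming setting the list passed to `dwell_seconds` would be a bounded window of recent fixes per device rather than the full history.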

Geocoding Service

OSM / Realtime traffic API

Analytics Cluster GIS Capabilities

[Figure: analytics cluster on a Hadoop cluster. Layers: Data Storage (NoSQL database: MongoDB / HBase / Elastic), Provisioning Layer, Processing Layer (Spark, Scala + RStudio Server & RMR).]

Data Storage - Persistence layer

Name | Index strategy | Geometry | Query types | Ease of use / integration | Scalability / speed | Comments
Elasticsearch | Geohash | Point | Bbox, radius | Good, 3 stars | 10s of TBs | Average writes; reads and search extremely fast
Neo4j | R-tree | Point / line / polygon | Bbox, radius | Moderately good, 2 stars | 10s of TBs | Too granular
HBase | Build your own index | - | - | Moderately good, 2 stars | Petabytes | Writes are fast, reads as well; needs specialization
Cassandra | Build your own index | - | - | Good, 3 stars | Petabytes | Same as HBase
MongoDB / Couchbase | Geohash | Point / line / polygon | 1) geo-within 2) near 3) intersect | Excellent, 5 stars (GeoJSON / Leaflet / OSM) | 10s of TBs, average throughput | Best integration with GeoJSON in all cases

Proposed Solutions:

• Short term: MongoDB
• Long term: Elasticsearch as the indexing engine and HBase / Cassandra as the storage technology on top of Hadoop

Analytical Services on Telematics Cluster

1) Geocoding and reverse geocoding service on the cluster

2) Weather and traffic API (real time and history) to support the use cases related to weather and traffic analytics

3) Street maps (OpenStreetMap at the start, then better map providers in the longer run)

• Required for the following analytics: regular trips, snap-to-road, mode of transport, identification of risky roads, impact of POIs (e.g. schools) on events, location-based gamification, location-based / driving-pattern-based cross-selling and upselling, dangerous parking areas

Analytical Operations/Procedures Useful For Spatial Analysis (RStudio Server With R Packages)

• Having an RStudio Server on the cluster would be useful
• GitHub repository (already established)
• R packages for dealing with vector data (rgdal, rgeos, geojson_io, SpatialTransforms)
• Point pattern analysis: dbscan, glm, gbm
• Describing and analyzing fields, statistical analysis of fields / spatial interpolation: kriging, tps
• Network analysis, snap-to-road, frequent routes, etc. (igraph, sna)
• Visualization of the data: leaflet, shiny

Geospatial processing layer on top of persistence

• The geospatial processing layer integrates map geometry and algebra in order to display the information on the map. On a small scale, this can be done in JavaScript (Leaflet / D3).
• The following operations are required:
  1) Vector operations
  2) Map algebra
• On a larger scale, a software engineering layer for distributed geospatial processing is required, for example Scala, Spark, and GeoTrellis.
• http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/allixender/5676830073815040

Analytical Challenges in Movement Data

• Basic challenges in movement data
  • Matching (snap-to-road, street network matching)
  • Similarity measures
  • Trajectory clustering
  • Event detection (classification)

Example Applications Solved through Machine Learning

• For raw trajectories
  • Snap-to-road
• For symbolic trajectories
  • Analysis of user activities
  • Traffic event detection

Snap-to-road

• Given a trajectory T and a street network G
• Find a path in G that matches T with its real (ground truth) path
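To make the problem statement concrete, here is a naive geometric baseline: each point of T is projected onto the nearest street segment of G. This is not the kernelized method presented in this talk and not a hidden Markov model matcher; it ignores connectivity and noise structure. Planar coordinates in meters and the names `project_to_segment` and `snap_to_road` are assumptions of the sketch.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]      # planar (x, y) in meters
Segment = Tuple[Point, Point]    # street segment given by its two end points


def project_to_segment(p: Point, seg: Segment) -> Tuple[Point, float]:
    """Closest point on seg to p, and the distance to it."""
    (ax, ay), (bx, by) = seg
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        t = 0.0  # degenerate segment: both end points coincide
    else:
        t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / length_sq
        t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    q = (ax + t * dx, ay + t * dy)
    return q, math.hypot(p[0] - q[0], p[1] - q[1])


def snap_to_road(trajectory: List[Point], network: Dict[str, Segment]) -> List[Tuple[str, Point]]:
    """Assign every trajectory point to its nearest segment, point by point."""
    snapped = []
    for p in trajectory:
        best_id = min(network, key=lambda sid: project_to_segment(p, network[sid])[1])
        snapped.append((best_id, project_to_segment(p, network[best_id])[0]))
    return snapped


network = {"main": ((0.0, 0.0), (100.0, 0.0)), "side": ((100.0, 0.0), (100.0, 100.0))}
print(snap_to_road([(10.0, 4.0), (60.0, -3.0), (97.0, 40.0)], network))
```

Point-wise snapping of this kind tends to break down where parallel roads or junctions are closer than the positioning error, which is one motivation for methods that use the structure of the whole trajectory.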

Snap-to-road: Analytical Modeling

• Multiregression view:
  • Task = estimate a noise-free function f from T that preserves the structural information
• Preserving structural correlations in the output:
  • Try a kernelized embedding with a kernel for raw trajectories

Snap-to-road

• An important problem in organizations like HERE, IBM, and Microsoft
• Positioning error between 10 and 100 meters (Wi-Fi, vehicle navigation, mobile devices)
• Degraded sampling rates and sparse GPS data
• Difficult at roundabouts and in tunnels

Solution: Basic steps:

1. Embed the trajectory by kernel methods, but ignore map constraints
   Benefits: noise reduction; captures multi-output, non-linear dependencies
2. 'Round' the resulting 'relaxed assignment' to the street map

Snap-to-road Algorithm

Snap-to-road: Does it Work?

• Performance over challenging real tasks

Grouping Of Trajectories/Stops In Similar Routines

• Requires similarity measures for trajectories
• Unroll a trajectory by defining a mapping

Similarity Measures For Trajectories -- Symbolic Trajectories

• Formed by discretization of the curve through a measurement process or algorithms:
  • Snap-to-road
  • Stay points
  • Regional division
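Stay-point discretization can be sketched with a common heuristic: a stay point is emitted when consecutive fixes remain within a distance threshold of an anchor fix for at least a minimum duration. The thresholds (200 m, 20 minutes), the equirectangular distance approximation, and the function names are assumptions of this example, not values taken from the talk.

```python
import math
from typing import List, Tuple

Fix = Tuple[float, float, float]          # (timestamp_s, lat, lon)
Stay = Tuple[float, float, float, float]  # (mean lat, mean lon, arrival_s, departure_s)


def approx_dist_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Equirectangular distance approximation, adequate for short distances."""
    r = 6_371_000.0
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return r * math.hypot(x, y)


def detect_stay_points(fixes: List[Fix], dist_m: float = 200.0,
                       min_duration_s: float = 20 * 60) -> List[Stay]:
    stays: List[Stay] = []
    i, n = 0, len(fixes)
    while i < n:
        # Extend the window while fixes stay close to the anchor fix i.
        j = i + 1
        while j < n and approx_dist_m(fixes[i][1], fixes[i][2],
                                      fixes[j][1], fixes[j][2]) <= dist_m:
            j += 1
        if fixes[j - 1][0] - fixes[i][0] >= min_duration_s:
            window = fixes[i:j]
            lat = sum(f[1] for f in window) / len(window)
            lon = sum(f[2] for f in window) / len(window)
            stays.append((lat, lon, fixes[i][0], fixes[j - 1][0]))
            i = j      # continue after the stay
        else:
            i += 1     # no stay anchored here, move the anchor forward
    return stays
```

The resulting stay points can then be clustered (for instance with a density-based method such as DBSCAN) to obtain recurring places like home zones.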

Clustering of Stay Points to Find Home Zones

Applications for Symbolic Trajectories: Clustering and Event Detection

• Trajectory clustering
  • User activity analysis
• Traffic event detection
  • Classification of events from non-event data
  • Rerouting of traffic during baseball games
  • Detection of conferences in auditoriums

Applications for Symbolic Trajectories

• Exploit sequence analysis (in particular biological sequence analysis)

1. Discretize the raw trajectories with an appropriate alphabet
2. Use an alignment kernel with traffic-symbol similarity in order to translate traffic invariances to the biological domain
3. Exploit sequence analysis to find discrete sequential patterns

(Where Traffic Meets DNA, Best Poster Award, ACM GIS 2011, Ahmed Jawad)
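The alignment idea in step 2 can be illustrated with a Needleman-Wunsch global alignment score over symbol sequences, where a substitution function plays the role of the traffic-symbol similarity. The symbol alphabet (`H1`, `R1`, ...), the scoring values, and the gap penalty are hypothetical. Note that a raw alignment score is a similarity measure and is not guaranteed to be a valid positive semidefinite kernel; the exact kernel construction used in the cited work is not reproduced here.

```python
from typing import Callable, Sequence


def alignment_score(a: Sequence[str], b: Sequence[str],
                    sim: Callable[[str, str], float], gap: float = -1.0) -> float:
    """Needleman-Wunsch global alignment score of two symbol sequences."""
    m = len(b)
    prev = [j * gap for j in range(m + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = max(
                prev[j - 1] + sim(a[i - 1], b[j - 1]),  # align the two symbols
                prev[j] + gap,                          # gap in b
                cur[j - 1] + gap,                       # gap in a
            )
        prev = cur
    return prev[m]


def road_similarity(x: str, y: str) -> float:
    """Hypothetical similarity: same segment > same road class > different."""
    if x == y:
        return 2.0
    if x[0] == y[0]:  # first letter encodes the road class in this toy alphabet
        return 0.5
    return -1.0


print(alignment_score(["H1", "H2", "R1"], ["H1", "R1"], road_similarity))
```

Here the best alignment matches `H1` and `R1` and leaves `H2` against a gap, so the example prints 3.0.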

Trajectory Clustering

[Figure: space-time view of a daily trajectory with axes X, Y, and Time (0:00 to 24:00) and the locations Home, Work, and Sports. Source: http://iapg.jade-hs.de/personen/brinkhoff/generator/]

Trajectory Clustering: Analysis of User Activities

• Analysis of user activities
  • Frequent routes in trajectories
    • Clustering at map-matched level
  • Frequent routines in trajectories
    • Clustering at stay-point level
    • Visualization of variability in routines (sequence logos)
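Once trajectories are map matched into sequences of road-segment symbols, a simple way to surface frequent routes is to count contiguous sub-sequences across trajectories. The sketch below does exactly that; it is a plain frequency count, offered as an assumption-laden illustration rather than the clustering approach of the talk, and the names `frequent_subroutes`, `length`, and `min_support` are invented for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def frequent_subroutes(trajectories: List[List[str]], length: int = 2,
                       min_support: int = 2) -> Dict[Tuple[str, ...], int]:
    """Contiguous segment sequences that occur in at least min_support trajectories."""
    counts: Counter = Counter()
    for traj in trajectories:
        # Use a set so that a route is counted once per trajectory.
        seen = {tuple(traj[i:i + length]) for i in range(len(traj) - length + 1)}
        counts.update(seen)
    return {route: c for route, c in counts.items() if c >= min_support}


trips = [["a", "b", "c"], ["a", "b", "d"], ["x", "b", "c"]]
print(frequent_subroutes(trips))
```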

Trajectory Clustering: Map Matched Discretization

Trajectory Clustering: Comparison to State-of-the-Art

Trajectory Clustering: Routine Analysis

Application for Symbolic Trajectories: Traffic Event Detection

Using biological sequence methods to model event persistence

• Analysis of Dodgers baseball games from highway sensor data
  • Detecting the presence of a baseball game
  • Visualization
• Analysis of events at the Caltech auditorium entrance
  • Detecting conferences in the auditorium

Traffic Event Detection

• Normalization based classifier

Readings from a traffic sensor
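The slides do not spell out the normalization-based classifier, so the following is only one plausible reading: normalize each sensor count against a per-time-slot baseline learned from historical days and flag slots whose z-score exceeds a threshold. The function names, the threshold of 3, and the use of the population standard deviation are all assumptions.

```python
import statistics
from typing import List, Tuple


def slot_baseline(history: List[List[float]]) -> List[Tuple[float, float]]:
    """Per-slot (mean, std) over historical days; every day has the same slots."""
    return [(statistics.mean(slot), statistics.pstdev(slot)) for slot in zip(*history)]


def detect_events(day: List[float], baseline: List[Tuple[float, float]],
                  z_threshold: float = 3.0) -> List[int]:
    """Indices of slots whose count is unusually high for that time of day."""
    flagged = []
    for idx, (count, (mean, std)) in enumerate(zip(day, baseline)):
        z = (count - mean) / std if std > 0 else 0.0
        if z > z_threshold:
            flagged.append(idx)
    return flagged


history = [[10, 20, 30], [12, 22, 28], [11, 18, 32]]  # three days, three slots each
print(detect_events([11, 40, 31], slot_baseline(history)))
```

In the example only the middle slot deviates strongly from its baseline, so it is the only index flagged.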

Traffic Event Detection: Sequence Analysis

Summary and Conclusions

• Structural information analysis is the connection between machine learning and GIS

• Still, a lot of data engineering and task-specific tricks are needed, e.g. regularization and normalization

Active Directions being pursued

• In snap-to-road
  • Fisher kernels for sparse GPS data
  • Testing KMM with a real-world system
• In clustering and event detection
  • User profiles and diaries
  • Label sequence graph kernels
• In structural information
  • Can doing away with the latitude/longitude pairs and keeping only the structural information help with privacy issues?


Appendix: persistence options

Neo4j Spatial:

• Utilities for importing from ESRI Shapefile as well as OpenStreetMap files
• Support for all the common geometry types
• An R-tree index for fast searches on geometries
• Support for topology operations during the search (contains, within, intersects, covers, disjoint, etc.)
• The possibility to enable spatial operations on any graph of data, regardless of the way the spatial data is stored, as long as an adapter is provided to map from the graph to the geometries
• Ability to split a single layer or dataset into multiple sub-layers or views with pre-configured filters

Appendix: persistence options

HBase/Cassandra: build your own index

• Perform geohashing yourself or use Elasticsearch as a hashing / search engine
• Libraries are available to connect Elasticsearch with Cassandra / HBase
• Besides, geohashing is easy to program
• http://thenewstack.io/building-streaming-data-hub-elasticsearch-kafka-cassandra/
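To back up the remark that geohashing is easy to program, here is a compact encoder following the standard geohash scheme: longitude and latitude intervals are bisected alternately, starting with longitude, and every five bits become one base-32 character. The function name and default precision are choices made for this sketch; how the hash is used as a row key in HBase or Cassandra is left open.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    """Encode a (lat, lon) pair as a geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits per base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


print(geohash_encode(42.605, -5.603, precision=5))  # prints "ezs42"
```

Because nearby points share hash prefixes, a prefix scan over such keys approximates a bounding-box query, although neighbouring cells across a cell boundary still need to be checked separately.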

Appendix: persistence options

MongoDB Geospatial

• Store your location data as GeoJSON objects with this coordinate-axis order: longitude, latitude. The coordinate reference system for GeoJSON uses the WGS84 datum.

http://docs.mongodb.org/manual/reference/glossary/#term-wgs84

MongoDB: Querying Data

db.<collection>.find( { <location field> :
    { $geoWithin :
        { $geometry :
            { type : "Polygon" ,
              coordinates : [ <coordinates> ]
            } } } } )

db.places.find( { loc :
    { $geoWithin :
        { $geometry :
            { type : "Polygon" ,
              coordinates : [ [ [ 0 , 0 ] , [ 3 , 6 ] , [ 6 , 1 ] , [ 0 , 0 ] ] ]
            } } } } )
