Advanced "Big Data" Social Intelligence Capabilities Overview
GIS Capabilities on Big Data Systems
Geospatial Analytics and Spatial Capabilities on Big Data Systems
By: Ahmed Jawad (PhD)
Agenda
• Why Analyze Telematics
• Analysis of Movement Data
• Analytical Assets for Telematics
• Operational Requirements on Telematics
• Data flow on big data platforms
• Analytical Challenges and Applications Solved through Machine Learning
  • Snap-to-road
  • Unifying trajectories to patterns of movements and routines
  • Traffic event detection
Why Analyze Telematics
• We are being recorded everywhere
• Provides great insights into customer routines and movement
• Key players are competing in the market
Analyzing Movement Data
• Trajectory: an object in motion (time – space)
• Raw trajectories: coordinate-based recording
• Symbolic trajectories: discretization into streets, locations, or events
Traditional Operational Requirements in the World of Geographic Information Systems (GIS)
• Traditional use cases: cartography, geo-algebra (display of statistical events, hotspots, co-locations on the map)
• Databases used: PostgreSQL, SQL Server
• Mostly static data sources
• Relatively small data sets
• Moderate geometric accuracy
• Offline processing acceptable
• Support for complex geometric data types
Operational Requirements and Design Considerations for Telematics
• Real-time ingestion and analytics on sensor data, distance queries, snap-to-road
• Data at 100 TB to petabyte scale
• High variation in geospatial queries (range queries, etc.) and in the throughput of CRUD operations: insertion/deletion/read
• The processing flow, the map applications, and the nature of the relationships in the data determine the storage technology, the indexing techniques, and their implications.
Telematics and Geospatial Data Types
• Spatial data structures:
  • Raster: geographically referenced matrix of uniform size
  • Vector: features on the earth’s surface are represented as geographically referenced vector objects
• Hierarchical nature of objects
  • Points: different types: entity, label, area, node
  • Lines: lines, polylines, arc, link, etc.
  • Polygons: area, polygon, complex polygon
• Requirements: the ability to manipulate geospatial data
• Databases and libraries are required to manipulate these objects at distributed scale (Spark and Scala, MongoDB, or any other NoSQL database)
Analytical Assets for Telematics
• The analytical assets for telematics can be broadly grouped into:
  • Snap-to-road
  • Analysis of user activities (clustering)
  • Traffic event detection (classification)
  • Real-time location search
  • Set operations on geometry objects and geo-algebra (layering of geospatial information atop each other and algebraic operations on the layers)
Conceptual dataflow and geospatial processing in Telematics
[Diagram components: PDA (event capture); Kafka (event processing and delivery decision); stream processing engine (PDA geodata and critical events); persistence layer (MongoDB / HBase, Cassandra / Elastic, on top of Hadoop) with risk areas; Tomcat app (optional raster processing with GeoTrellis); datafeed client (preloads risk areas and traffic info); client (D3 / Ajax / Leaflet). Links between components are labeled push, pull, stream, API push (REST), and WebSocket.]
Persistence layer: should be scalable and support storage and querying of spatiotemporal objects (points, polygons, lines, line strings; for reference, see MongoDB’s 2dsphere indexing and geospatial querying). The following low-level queries shall be supported: (1) nearest-neighbor query: given a point (lat, long), find all the line strings that are within a radius of x meters; (2) containment query: find all the points within a polygon or, given a point, find all the polygons containing it.
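The two low-level queries can be illustrated without any database. The following is a minimal sketch under stated assumptions, not MongoDB's implementation: the function names are hypothetical, the radius test only checks line-string vertices (a real index would measure the distance to the segments), and coordinates use (longitude, latitude) order.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean earth radius, an approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def line_strings_within(point, line_strings, radius_m):
    """Query (1): ids of line strings with a vertex within radius_m of point.

    point is (lon, lat); line_strings maps an id to a list of (lon, lat).
    Simplification: only vertices are tested, not the segments between them.
    """
    lon, lat = point
    hits = []
    for ls_id, vertices in line_strings.items():
        if any(haversine_m(lat, lon, v_lat, v_lon) <= radius_m
               for v_lon, v_lat in vertices):
            hits.append(ls_id)
    return hits


def point_in_polygon(point, ring):
    """Query (2): ray-casting containment test.

    ring is a closed list of (x, y) vertices (first vertex == last vertex).
    """
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

A brute-force scan like this is only useful for checking results on small samples; at the data volumes discussed here, the same queries need a spatial index in the persistence layer.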
Client browser, e.g. fleet manager: in the current scheme, we have deferred all the intelligence to the client, i.e. the raster processing, the display of the map and its layers, and the map algebra are all done on the client side. Leaflet is one example. An alternative strategy is to use geotrellis.io as a geoprocessing engine for the raster operations and to use the client only to display the map.
Stream processing queries: (1) instantaneous speed / angular momentum of the PDA; (2) distance to a traffic event pulled from Bing; (3) running aggregates, e.g. how long the vehicle has spent at the current location.
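Queries (1) and (3) can be sketched on a plain list of GPS fixes. This is an illustrative batch version under stated assumptions, not a stream-engine job: the function names and the 50 m radius are hypothetical, and each fix is assumed to be a (t_seconds, lat, lon) tuple sorted by time.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean earth radius, an approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def instantaneous_speeds(fixes):
    """Speed in m/s between each pair of consecutive fixes."""
    speeds = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(fixes, fixes[1:]):
        dt = t1 - t0
        if dt <= 0:  # skip duplicate or out-of-order timestamps
            continue
        speeds.append(haversine_m(la0, lo0, la1, lo1) / dt)
    return speeds


def dwell_seconds(fixes, radius_m=50.0):
    """Running aggregate: time spent within radius_m of the latest fix."""
    if not fixes:
        return 0.0
    t_last, lat, lon = fixes[-1]
    start = t_last
    # Walk backwards until the vehicle was outside the radius.
    for t, lat_i, lon_i in reversed(fixes[:-1]):
        if haversine_m(lat, lon, lat_i, lon_i) > radius_m:
            break
        start = t
    return t_last - start
```

In a streaming setting the same logic would be kept as per-vehicle state and updated on each incoming fix instead of rescanning the history.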
[Diagram labels: geocoding service; OSM / real-time traffic API]
Analytics Cluster GIS Capabilities
[Diagram: Hadoop cluster. Data storage: NoSQL database (MongoDB / HBase / Elastic). Provisioning layer. Processing layer: Spark, Scala, RStudio Server, and RMR.]
Data Storage - Persistence layer
Name | Index strategy | Geometry | Query types | Ease of use/integration | Scalability/Speed | Comments
-----|----------------|----------|-------------|-------------------------|-------------------|---------
Elasticsearch | Geohash | Point | Bbox, radius | Good, 3 stars | 10s of TBs | Average writes; reads and search extremely fast
Neo4j | R-tree | Point/line/polygon | Bbox, radius | Moderately good, 2 stars | 10s of TBs | Too granular
HBase | Build your own index | - | - | Moderately good, 2 stars | Petabytes | Writes are fast, reads as well; needs specialization
Cassandra | Build your own index | - | - | Good, 3 stars | Petabytes | Same as HBase
MongoDB / Couchbase | Geohash | Point/line/polygon | 1) geo-within 2) near 3) intersect | Excellent, 5 stars (GeoJSON / Leaflet / OSM) | 10s of TBs, average throughput | Best integration with GeoJSON in all cases
Proposed solutions:
• Short term: MongoDB
• Long term: Elasticsearch as the indexing engine and HBase/Cassandra as the storage technology on top of Hadoop
Analytical Services on Telematics Cluster
1) Geocoding and reverse geocoding service on the cluster
2) Weather and traffic API (real time and history) to support the use cases related to weather and traffic analytics
3) Street maps (OpenStreetMap at the start, then better map providers in the longer run)
• Required for the following analytics: regular trips, snap-to-road, mode of transport, identification of risky roads, impact of POIs (e.g. schools) on events, location-based gamification, location-based / driving-pattern-based cross-selling and upselling, dangerous parking areas
Analytical Operations/Procedures Useful for Spatial Analysis (RStudio Server with R Packages)
• Having an RStudio Server on the cluster would be useful.
• GitHub repository (already established)
• R packages for dealing with vector data (rgdal, rgeos, geojson_io, SpatialTransforms)
• Point pattern analysis: dbscan, glm, gbm
• Describing and analyzing fields, statistical analysis of fields / spatial interpolation: kriging, tps
• Network analysis, snap-to-road, frequent routes, etc. (igraph, sna)
• Visualization of the data: leaflet, shiny
Geospatial processing layer on top of persistence
• The geospatial processing layer integrates map geometry and map algebra to display the information on a map. On a small scale, this can be performed in JavaScript (Leaflet / D3).
• The following operations are required:
1) Vector operations
2) Map algebra
• On a larger scale, a software engineering layer for distributed geospatial processing is required, for example Scala, Spark, and GeoTrellis.
• http://www.google-melange.com/gsoc/proposal/public/google/gsoc2015/allixender/5676830073815040
Analytical Challenges in Movement Data
• Basic challenges in movement data
  • Matching (snap-to-road, street network matching)
  • Similarity measures
  • Trajectory clustering
  • Event detection (classification)
Example Applications Solved through Machine Learning
• For raw trajectories
  • Snap-to-road
• For symbolic trajectories
  • Analysis of user activities
  • Traffic event detection
Snap-to-road
• Given a trajectory T and a street network G
• Find a path in G that matches T to its real (ground truth) path
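To make the problem statement concrete, here is a naive geometric baseline: each point of T is projected onto its nearest street segment in G. This is a sketch under assumptions, not the kernelized method discussed in this deck: it treats coordinates as planar, ignores path connectivity and noise, and all names are hypothetical.

```python
def project_onto_segment(p, a, b):
    """Closest point to p on the segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:  # degenerate segment
        return (float(ax), float(ay))
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    t = max(0.0, min(1.0, t))  # clamp to the segment ends
    return (ax + t * dx, ay + t * dy)


def snap_to_road(trajectory, edges):
    """Match every trajectory point to its nearest edge.

    trajectory: list of (x, y); edges: dict edge_id -> ((x, y), (x, y)).
    Returns a list of (edge_id, snapped_point), one entry per input point.
    """
    matched = []
    for p in trajectory:
        best = None
        for edge_id, (a, b) in edges.items():
            q = project_onto_segment(p, a, b)
            d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
            if best is None or d2 < best[0]:
                best = (d2, edge_id, q)
        matched.append((best[1], best[2]))
    return matched
```

Because each point is matched independently, this baseline can jump between parallel or crossing streets, which is one reason why methods that use the whole trajectory are of interest.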
Snap-to-road: Analytical Modeling
• Multiregression view:
  • Task = estimate a noise-free function f from T that preserves the structural information
• Preserving structural correlations in the output:
  • Try a kernelized embedding with a kernel for raw trajectories
Snap-to-road
• An important problem in organizations like Here, IBM and Microsoft
• Error between 10 and 100 meters (Wi-Fi, vehicle navigation, mobile devices)
• Deteriorated sampling rate and sparse GPS data
• Difficult at roundabouts and in tunnels
Solution: Basic steps:
• Embed the trajectory by kernel methods but ignore map constraints
  • Benefits: noise reduction; captures multi-output, non-linear dependencies
• ‘Round’ the resulting ‘relaxed assignment’ to the street map
Snap-to-road Algorithm
Snap-to-road: Does it Work?
• Performance over challenging real tasks
Grouping of Trajectories/Stops in Similar Routines
• Basically requires similarity measures for trajectories
• Unroll a trajectory by defining a mapping
Similarity Measures for Trajectories -- Symbolic Trajectories
• Formed by discretization of the curve through the measurement process or through algorithms:
  • Snap-to-road
  • Stay points
  • Regional division
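As one example of discretization, stay points can be extracted with a common distance-and-duration heuristic. The sketch below is an assumption about how this could be done, not the procedure used in this deck: fixes are (t, x, y) tuples in planar coordinates sorted by time, and the thresholds are hypothetical parameters.

```python
import math


def stay_points(fixes, dist_thresh, time_thresh):
    """Detect stay points in a raw trajectory.

    A stay point is a run of consecutive fixes that all lie within
    dist_thresh of the first fix of the run and that spans at least
    time_thresh. Returns (x_mean, y_mean, t_arrive, t_leave) tuples.
    """
    result = []
    i, n = 0, len(fixes)
    while i < n:
        j = i + 1
        # Extend the run while fixes stay close to the anchor fix i.
        while j < n and math.dist(fixes[i][1:], fixes[j][1:]) <= dist_thresh:
            j += 1
        if fixes[j - 1][0] - fixes[i][0] >= time_thresh:
            run = fixes[i:j]
            x_mean = sum(f[1] for f in run) / len(run)
            y_mean = sum(f[2] for f in run) / len(run)
            result.append((x_mean, y_mean, fixes[i][0], fixes[j - 1][0]))
            i = j  # continue after the stay
        else:
            i += 1
    return result
```

Replacing each stay point by the label of the place it falls in (e.g. Home, Work) turns the raw trajectory into a symbolic one.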
Clustering of Stay Points to Find Home Zones
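Density-based clustering such as DBSCAN (Ester et al., listed in the references) is one way to group stay points into zones. The following is a compact, brute-force sketch for illustration; eps and min_pts are hypothetical parameters, the points are assumed to be planar (x, y) tuples, and it is not claimed to be the exact procedure behind this slide.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Brute-force neighbourhood query; includes point i itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nj = neighbours(j)
            if len(nj) >= min_pts:  # core point: expand the cluster
                queue.extend(nj)
        cluster += 1
    return labels
```

Under this sketch, the cluster that collects most night-time stay points would be a natural candidate for the home zone; that labeling rule is an assumption, not something stated on the slide.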
Applications for Symbolic Trajectories: Clustering and Event Detection
• Trajectory clustering
  • User activity analysis
• Traffic event detection
  • Classification of events from non-event data
  • Rerouting of traffic during baseball games
  • Detection of conferences in auditoriums
Applications for Symbolic Trajectories
• Exploit sequence analysis (in particular biological sequence analysis)
1. Discretize the raw trajectories with an appropriate alphabet
2. Use an alignment kernel with traffic symbol similarity in order to translate traffic invariances to the biological domain
3. Exploit sequence analysis to find discrete sequential patterns
(Where Traffic Meets DNA, Best Poster Award, ACM GIS 2011, Ahmed Jawad)
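Step 2 can be illustrated with a plain global alignment score (Needleman-Wunsch style) between two symbolic trajectories. This is a sketch under assumptions: the symbol similarity and gap penalty are hypothetical placeholders, and a raw alignment score is not guaranteed to be a valid (positive semi-definite) kernel, so it only stands in for the alignment kernel named above.

```python
def symbol_similarity(x, y):
    """Hypothetical similarity: reward equal symbols, penalize mismatches."""
    return 2.0 if x == y else -1.0


def alignment_score(a, b, sim=symbol_similarity, gap=-1.0):
    """Global alignment score of two symbol sequences via dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # match/mismatch
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )
    return score[-1][-1]
```

For example, with H = home, W = work and S = sports, the routines "HWSH" and "HWH" align with a single gap for the sports stop, so they score close to two identical routines.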
Trajectory Clustering
[Figure: space-time plot of a daily trajectory, with spatial axes X and Y and a time axis from 0:00 to 24:00, visiting Home, Work, and Sports. Source: http://iapg.jade-hs.de/personen/brinkhoff/generator/]
Trajectory Clustering: Analysis of User Activities
• Analysis of user activities
  • Frequent routes in trajectories
    • Clustering at map-matched level
  • Frequent routines in trajectories
    • Clustering at stay point level
    • Visualization of variability in routines (sequence logos)
Trajectory Clustering: Map Matched Discretization
Trajectory Clustering: Comparison to State-of-the-Art
Trajectory Clustering: Routine Analysis
Application for Symbolic Trajectories: Traffic Event Detection
Using biological sequence methods to model event persistence
• Analysis of Dodgers baseball games from highway sensor data
  • Detecting the presence of a baseball game
  • Visualization
• Analysis of events at the Caltech auditorium entrance
  • Detecting conferences in the auditorium
Traffic Event Detection
• Normalization-based classifier
[Figure: readings from a traffic sensor]
Traffic Event Detection: Sequence Analysis
Summary and Conclusions
• Structural information analysis is the connection between machine learning and GIS
• Still, a lot of data engineering and task-specific tricks are needed, e.g. regularization and normalization
Active Directions Being Pursued
• In snap-to-road
  • Fisher kernels for sparse GPS data
  • Testing KMM with a real-world system
• In clustering and event detection
  • User profiles and diaries
  • Label sequence graph kernels
• In structural information
  • Can doing away with the latitude/longitude pairs and keeping only the structural information help with privacy issues?
Q & A
References (1)
• Thomas Brinkhoff, Generating Network-Based Moving Objects, Proceedings of the 12th International Conference on Scientific and Statistical Database Management, p.253, July 26-28, 2000
• C. Körner, M. May, S. Wrobel. Spatiotemporal Modeling and Analysis - Introduction and Overview. KI, 2012.
• Yi Guo, Junbin Gao, Paul W. Kwan, Twin Kernel Embedding, IEEE Transactions on Pattern Analysis and Machine Intelligence, v.30 n.8, p.1490-1495, August 2008
• Julian J. McAuley, Teofilo de Campos, and Tiberio S. Caetano. Unified graph matching in Euclidean spaces. In CVPR, 2010.
• Tom Mitchell. Mining our reality. Science, 326(5960):1644--1645, 2009.
• Paul Newson, John Krumm, Hidden Markov map matching through noise and sparseness, Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, November 04-06, 2009, Seattle, Washington
• Novi Quadrianto, Le Song, and Alex Smola. Kernelized sorting. In NIPS 21, pages 1289--1296. 2009.
• Mohammed A. Quddus, Washington Y. Ochieng, and Robert B. Noland. Current map-matching algorithms for transport applications: State-of-the art and future research directions. Transportation Research Part C: Emerging Technologies, 15(5):312--328, 2007.
• A. Abbott. A primer on sequence methods. Organization Science, 1(4):375--392, 1990.
• Gennady Andrienko, Natalia Andrienko, Stefan Wrobel, Visual analytics tools for analysis of movement data, ACM SIGKDD Explorations Newsletter, v.9 n.2, December 2007
• Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander, OPTICS: ordering points to identify the clustering structure, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.49-60, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
• Gerben de Vries, Maarten van Someren, Clustering vessel trajectories with alignment kernels under trajectory compression, Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases: Part I, September 20-24, 2010, Barcelona, Spain
• R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis. Cambridge University Press, 1998.
• M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, pages 226--231, 1996.
References (2)
• Alexander Ihler, Jon Hutchins, Padhraic Smyth, Adaptive event detection with time-varying Poisson processes, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, August 20-23, 2006, Philadelphia, PA, USA
• Ahmed Jawad, Kristian Kersting, Kernelized map matching, Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, November 02-05, 2010, San Jose, California
• C. Joh, T. A. Arentze, and H. J. P. Timmermans. Multidimensional sequence alignment methods for activity-travel pattern analysis: A comparison of dynamic programming and genetic algorithms. Geographical Analysis, 33(3):247--270, 2001.
• John A. Lee, Michel Verleysen, Nonlinear Dimensionality Reduction, Springer Publishing Company, Incorporated, 2007
• Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, Junjie Wu, Understanding of Internal Clustering Validation Measures, Proceedings of the 2010 IEEE International Conference on Data Mining, p.911-916, December 13-17, 2010
• Salvatore Rinzivillo, Dino Pedreschi, Mirco Nanni, Fosca Giannotti, Natalia Andrienko, Gennady Andrienko, Visually driven analysis of movement data by progressive clustering, Information Visualization, v.7 n.3, p.225-239, June 2008
• Albrecht Schmidt, Marc Langheinrich, Kristian Kersting, Perception beyond the Here and Now, Computer, v.44 n.2, p.86-88, February 2011
• S. Schonfelder and K. W. Axhausen. Urban Rhythms and Travel Behavior: Spatial and Temporal Phenomena of Daily Travel (Transport and Society). Ashgate, 2010.
• N. Shoval and M. Isaacson. Sequence alignment as a method for human activity analysis in space and time. Annals of the Association of American Geographers, 97(2):282--297, 2007.
• C. Wilson. Analysis of travel behavior using sequence alignment methods. Journal of the Transportation Research Board, 1645(-1):52--59, 1998.
References (3)
• T. Gärtner. Kernels for structured data. World Scientific, Hackensack, N.J., 2008.
• T. Gärtner, P. A. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In Proceedings of Conference on Learning Theory (COLT), pages 129--143, 2003.
• T. Gärtner, T. Horvath, Q. V. Le, A. J. Smola, and S. Wrobel. Kernel methods for graphs. In Mining Graph Data, pages 253--282. John Wiley and Sons, Inc, 2006.
• Intelligence (PAMI), 31(5):944--952, 2009.
• R. O. Duda, D. G. Stork, and P. E. Hart. Pattern classification. Wiley, New York; Chichester, 2nd edition, 2000.
• D. Fox, J. Hightower, L. Liao, D. Schulz, and G. Borriello. Bayesian filtering for location estimation. IEEE Pervasive Computing, 2(3):24--33, 2003.
• S. J. Gaffney, A. W. Robertson, P. Smyth, S. J. Camargo, and M. Ghil. Probabilistic clustering of extratropical cyclones using regression mixture models. Climate Dynamics, 29(4):423--440, 2006.
• M. Gariel, A. N. Srivastava, and E. Feron. Trajectory clustering and an application to airspace monitoring. IEEE Transactions on Intelligent Transportation Systems (TITS), 12(4):1511--1524, 2006.
Appendix: persistence options
• Neo4j Spatial:
  • Utilities for importing from ESRI Shapefile as well as OpenStreetMap files
  • Support for all the common geometry types
  • An R-tree index for fast searches on geometries
  • Support for topology operations during the search (contains, within, intersects, covers, disjoint, etc.)
  • The possibility to enable spatial operations on any graph of data, regardless of the way the spatial data is stored, as long as an adapter is provided to map from the graph to the geometries
  • Ability to split a single layer or dataset into multiple sub-layers or views with pre-configured filters
Appendix: persistence options
HBase/Cassandra: build your own index
• Perform geohashing yourself or use Elasticsearch as a hashing / search engine
• Libraries are available to connect Elasticsearch with Cassandra / HBase
• Besides, geohashing is easy to program
• http://thenewstack.io/building-streaming-data-hub-elasticsearch-kafka-cassandra/
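To show how little code geohashing needs, here is a minimal encoder following the standard public geohash scheme (alternating longitude/latitude bisection, base-32 output). The function name and default precision are illustrative choices; decoding and neighbor lookup, which a row-key design for HBase or Cassandra would also need, are left out.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=9):
    """Encode a (lat, lon) pair as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)
```

Because nearby points share a common prefix in most cases, the hash can serve as a sortable row key, and a bounding-box search becomes a small set of prefix scans.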
Appendix: persistence options
MongoDB Geospatial
• Store your location data as GeoJSON objects with this coordinate-axis order: longitude, latitude. The coordinate reference system for GeoJSON uses the WGS84 datum.
MongoDB: Querying Data
// General form of a containment query
db.<collection>.find( { <location field> :
    { $geoWithin :
        { $geometry :
            { type : "Polygon" ,
              coordinates : [ <coordinates> ]
            } } } } )
// Example: all places whose loc lies within the given polygon
db.places.find( { loc :
    { $geoWithin :
        { $geometry :
            { type : "Polygon" ,
              coordinates : [ [ [ 0 , 0 ] , [ 3 , 6 ] , [ 6 , 1 ] , [ 0 , 0 ] ] ]
            } } } } )