DRAFT Geospatial Artificial Intelligence An introduction to pipelining for automated geospatial analysis, modelling and AI Simon D. Wenkel March 30, 2019




6 Selecting file formats

One of the biggest questions we have to ask ourselves when we start a geospatial project is which file formats we want to use. There are many out there and some have their specific advantages for certain niches. GDAL/OGR [1] lists 96 vector formats and 155 raster formats. That is a lot to choose from.

Comment 6.1 Utilizing normal files and databases for geospatial data

In theory we could use any kind of file or database to store geospatial data as long as we know how we stored it and how the projection is linked to the coordinates. Since the same coordinates will lead to different positions for different projections, this can be dangerous and is therefore not recommended. However, we see this often, especially with text files, and if they are not documented properly we end up with a big mess.
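To illustrate why undocumented coordinates are dangerous, the following minimal sketch reads the same pair of numbers once as WGS 84 longitude/latitude in degrees and once as spherical Web Mercator metres (EPSG:3857). The function name `web_mercator_to_lonlat` and the sample values are assumptions made for this illustration only.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)


def web_mercator_to_lonlat(x, y):
    """Convert spherical Web Mercator metres to longitude/latitude in degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lon, lat


# The same two numbers, stored without any projection information.
x, y = 4.9, 52.37

# Interpretation 1: degrees (WGS 84), a point in Amsterdam.
as_degrees = (x, y)

# Interpretation 2: Web Mercator metres, a point only metres away from 0°, 0°.
as_mercator = web_mercator_to_lonlat(x, y)

print("read as degrees:", as_degrees)
print("read as metres :", as_mercator)
```

Both readings are valid for the file content itself; only the documented projection tells us which one was meant.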

6.1 Databases vs. single files

First, we have to decide whether we want single files or a real database to store our data. The main challenge with geospatial data stored in single files is that we end up having multiple files that make up a “single” file. If one of them is lost or not copied correctly, we may be doomed if it is an essential one. The big advantage of using single files to store geospatial data is that, if we manage to copy them correctly, everyone can work with them without deep knowledge of setting up databases. Moreover, we can do risky things such as editing them manually with non-geospatial tools. This is not recommended but there are some special applications for it. Another disadvantage of single files is that their size is limited by the file system of the device they are stored on. There are still many devices in use that are formatted with FAT32 (with LFS) which limits file sizes to 4 GB.

We could use real databases to store our geospatial data. They have the advantage that they are accessed via centralized APIs and, if set up correctly, make it much easier to collaborate with other team members. Another big advantage is that they usually offer basic data operations on a database level, which are usually better optimized for performance than our geospatial software tools. If we are working with really big datasets and/or slow network connections, we can save ourselves the considerable time it would take to transfer the data to our workstation, perform our calculations and transfer the results back to the server.

However, setting them up correctly can be challenging, especially for complex cases and in certain corporate environments. Furthermore, it is difficult to exchange them with external people since they either have to be dumped and converted, or have to be shipped as a snapshot, or third parties require access to the API, which can be challenging depending on company culture and security.

Preview - ©Simon D. Wenkel (https://www.simonwenkel.com) - p. 27

Comment 6.2 Databases vs. single files

Shipping functioning databases can be challenging, especially if data sizes are really big. Databases such as PostGIS offer the big advantage of performing calculations on a database level and therefore save a lot of time.

Therefore, we should:

• ship single files, if file sizes are reasonable

• ship databases, otherwise.

If we are working on smaller projects, especially if we are the only ones involved, we will save time with single files in the short term, especially if we have never used databases before, and save a lot of time with databases in the long term.

6.2 Single files

If we decide to use single files as the basis for our analyses, we will most commonly encounter:

• Shapefile (Chapter 6.2.1)

• Keyhole Markup Language [KML] (Chapter 6.2.2)

• GeoTIFF (Chapter 6.2.3)

• x,y,z text files (Chapter 6.2.4)

• GeoPackage (Chapter 6.2.5)

• GeoJSON (Chapter 6.2.6)


Comment 6.3 Metadata security flaws

In the field of GIS and, more generally, data science, cybersecurity is often neglected. In our case it is less about securing the workstations and servers themselves and more about the amount of information on our systems that we publish/ship accidentally. Automatically generated metadata may contain not only useful data, such as how analyses were done, but also absolute file paths and therefore a lot of information on our systems and workflow. This seems to be more common with desktop GIS and less so with the way we are working here. Nevertheless, we should clean our metadata before we ship it to increase computer security.

6.2.1 Shapefile

The shapefile format was developed by ESRI with the main purpose of exchanging vector features between ArcGIS and non-ArcGIS users. Therefore, it is compatible with a lot of software, including CAD software packages. It can store only one feature type and one layer. As defined by the shapefile whitepaper [2], it requires the following files to work:

.shp # stores feature geometry

.shx # index of the feature geometry records (offsets into the .shp file)

.dbf # dBase containing all attributes

We will often find more than these three files. The additional files are commonly:

.shp.xml # metadata in xml form

.prj # text file containing the projection

.qpj # additional projection information by QGIS

.sbn/.sbx # spatial index of features to speed up spatial queries

Depending on what software we are using we may end up with more files:

.ain/.aih # attribute index of active fields

.atx # attribute index for dBase file

.cpg # specifies character encoding of the dBase file

.fbn/.fbx # similar to .sbn/.sbx for read-only features

.ixs # geocoding index for shapefiles (read/write)

.mxs # similar to .ixs but for ODB format

.qix # quadtree spatial index


The main disadvantage of shapefiles is that we may end up with unusable data if one of the essential files is not included or if the projection is missing. This happens far more often than it should.
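A simple completeness check before shipping or importing a shapefile can catch this early. The following is a minimal sketch using only the Python standard library; the function name `check_shapefile` and the file name `roads` are assumptions for illustration, and the check compares suffixes exactly as written (it does not handle upper-case extensions).

```python
import tempfile
from pathlib import Path

REQUIRED = (".shp", ".shx", ".dbf")  # mandatory according to the whitepaper [2]
RECOMMENDED = (".prj",)              # projection; optional but rarely dispensable


def check_shapefile(shp_path):
    """Return (missing_required, missing_recommended) sidecar suffixes."""
    shp = Path(shp_path)
    missing_required = [s for s in REQUIRED if not shp.with_suffix(s).exists()]
    missing_recommended = [s for s in RECOMMENDED if not shp.with_suffix(s).exists()]
    return missing_required, missing_recommended


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for suffix in (".shp", ".dbf"):
        (Path(tmp) / f"roads{suffix}").touch()
    result = check_shapefile(Path(tmp) / "roads.shp")

print(result)
```

In this demonstration the index file and the projection file are reported as missing, which is exactly the situation described above.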

6.2.2 Keyhole Markup Language (KML)

When dealing with single file geospatial data we have to mention KML - the Keyhole Markup Language [3]. KML is famous because it is one of the few file types that are readable and writable by Google Earth. Therefore, it is sometimes used to store GPS tracks instead of using the GPS Exchange Format (.gpx).

There are two kinds of KML files:

.kml # keyhole markup language document

.kmz # zipped keyhole markup language document
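Since a .kmz file is a zip archive containing a .kml document, both variants can be read with the Python standard library. The sketch below builds a tiny .kmz in memory and reads the point placemarks back; the sample document, the archive member name `doc.kml` and the helper `read_placemarks` are assumptions chosen for this illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}

# A tiny hand-written KML document (illustrative data only).
KML_TEXT = """<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Sample point</name>
      <Point><coordinates>4.9,52.37,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>
"""


def read_placemarks(kml_bytes):
    """Return a list of (name, lon, lat) tuples for all point placemarks."""
    root = ET.fromstring(kml_bytes)
    placemarks = []
    for pm in root.iterfind(".//kml:Placemark", KML_NS):
        name = pm.findtext("kml:name", default="", namespaces=KML_NS)
        coords = pm.findtext("kml:Point/kml:coordinates", namespaces=KML_NS)
        if coords is None:
            continue  # not a point placemark
        # KML stores coordinates as lon,lat[,altitude]
        lon, lat = (float(v) for v in coords.strip().split(",")[:2])
        placemarks.append((name, lon, lat))
    return placemarks


# Build a .kmz in memory: a zip archive that contains the .kml document.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as kmz:
    kmz.writestr("doc.kml", KML_TEXT)

# Read it back: find the first .kml entry and parse it.
with zipfile.ZipFile(buffer) as kmz:
    kml_name = next(n for n in kmz.namelist() if n.endswith(".kml"))
    placemarks = read_placemarks(kmz.read(kml_name))

print(placemarks)
```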

6.2.3 GeoTIFF

The shapefile equivalent for raster data is GeoTIFF. A GeoTIFF file [4] is similar to a standard TIFF image file; however, it is extended a bit to store information on georeferencing. Hence, we only need one file:

.tif # georeferenced TIFF file containing the raster

Nevertheless, there might be more files for a single raster. We may encounter MapInfo or ESRI world files that store information on georeferencing instead of embedding it directly. Further, we may have our xml metadata files again:

.aux.xml # PAM (Persistent Auxiliary Metadata)

.ovr # storing pyramid layers of the raster

.tab # MapInfo file

.tif.xml # contains metadata

.tfw/.tifw/.tiffw/.wld # ESRI world file
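To make the role of a world file more concrete: it is a plain text file with six numbers, one per line, that define an affine transformation from pixel positions to map coordinates. The sketch below parses such a file and applies the transformation; the function names and the sample coefficients are assumptions for illustration. Note that a world file does not contain the projection itself, so it has the same documentation problem as plain text files.

```python
def parse_world_file(text):
    """Parse the six coefficient lines of a world file into a dict."""
    values = [float(token) for token in text.split()]
    if len(values) != 6:
        raise ValueError("a world file must contain exactly six numbers")
    a, d, b, e, c, f = values  # line order in the file is A, D, B, E, C, F
    return {"A": a, "D": d, "B": b, "E": e, "C": c, "F": f}


def pixel_to_map(coeffs, col, row):
    """Map the centre of pixel (col, row) to map coordinates."""
    x = coeffs["A"] * col + coeffs["B"] * row + coeffs["C"]
    y = coeffs["D"] * col + coeffs["E"] * row + coeffs["F"]
    return x, y


# Illustrative content of a .tfw file: 10 m pixels, no rotation.
# C and F are the map coordinates of the centre of the upper-left pixel.
WORLD_FILE = """10.0
0.0
0.0
-10.0
500005.0
4199995.0
"""

coeffs = parse_world_file(WORLD_FILE)
print(pixel_to_map(coeffs, 0, 0))
print(pixel_to_map(coeffs, 2, 3))
```

The negative value for E reflects that row numbers increase downwards while map y coordinates usually increase upwards.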

6.2.4 x,y,z text files

We can think of x,y,z text files as csv files (comma separated values) that look a bit like this:

0,0,0,'Amsterdam',123,'foo' ...

Without special (manual) processing they can only be used for points and therefore they are often used to ship DEMs (Digital Elevation Models). The main problem is that they have to be accompanied by metadata so that we know which projection has to be assigned to them after importing.
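Reading such a file is straightforward, which is part of its appeal. The following minimal sketch parses a small x,y,z sample with the standard library and derives a bounding box; the sample values and the helper `read_xyz` are assumptions for illustration.

```python
import csv
import io

# Illustrative content of an x,y,z text file (no header, no projection info).
XYZ_TEXT = """\
0.0,0.0,12.5
10.0,0.0,13.1
0.0,10.0,11.8
10.0,10.0,12.9
"""


def read_xyz(handle):
    """Read x,y,z rows into a list of float tuples, ignoring extra columns."""
    return [tuple(float(v) for v in row[:3]) for row in csv.reader(handle) if row]


points = read_xyz(io.StringIO(XYZ_TEXT))
xs, ys, zs = zip(*points)

# The bounding box and elevation range are known, the projection is not.
bbox = (min(xs), min(ys), max(xs), max(ys))
print(len(points), bbox, (min(zs), max(zs)))
```

Nothing in the parsed data tells us whether the coordinates are metres, degrees or something else, which is why the accompanying metadata is essential.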


6.2.5 GeoPackage (GPKG)

GeoPackage is a single file that can store multiple vector features and rasters. According to the current specifications [5] it is a container for a SQLite database with some degree of similarity to SpatiaLite. However, unlike SpatiaLite it is a pure storage format and not a database that allows certain (optimized) operations on a database level. It is designed to store data and leaves the processing to other software.
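Because a GeoPackage is a SQLite database, its table of contents can be inspected with any SQLite client. The sketch below uses Python's `sqlite3` module; to stay self-contained it does not open a real file but creates an in-memory stand-in with a reduced version of the `gpkg_contents` table. The layer names and the reduced column set are assumptions for illustration, and the stand-in is not a valid GeoPackage.

```python
import sqlite3

# Stand-in for sqlite3.connect("some_file.gpkg"); the file name is hypothetical.
con = sqlite3.connect(":memory:")

# Reduced version of the gpkg_contents table (only some of its columns).
con.execute(
    "CREATE TABLE gpkg_contents ("
    " table_name TEXT PRIMARY KEY,"
    " data_type TEXT NOT NULL,"
    " identifier TEXT,"
    " srs_id INTEGER)"
)
con.executemany(
    "INSERT INTO gpkg_contents VALUES (?, ?, ?, ?)",
    [
        ("roads", "features", "Road network", 4326),
        ("elevation", "tiles", "Elevation raster", 3857),
    ],
)

# List the layers of the container with plain SQL, no GIS software needed.
layers = con.execute(
    "SELECT table_name, data_type, srs_id FROM gpkg_contents ORDER BY table_name"
).fetchall()
con.close()

print(layers)
```

With a real file, only the `SELECT` statement would be needed; geometry handling is still left to other software, as described above.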

Comment 6.4 GeoPackage is the default file format for QGIS 3

GeoPackage is used as the default file format for QGIS 3 [6]. Let us hope that it will lead to widespread usage of GeoPackage in times of “open data policies” in Europe. This could end the battle with incomplete shapefiles and GeoTIFFs.

6.2.6 GeoJSON

GeoJSON [7] is the abbreviation of Geographic JavaScript Object Notation. It is a geospatial extension based on the JSON format and therefore aims at web applications.

It is a rather simple format that uses WGS 84 coordinates that are written as decimal degrees. It supports the following geometry types:

• Point, MultiPoint

• LineString, MultiLineString

• Polygon, MultiPolygon

• GeometryCollection

Code 6.1 GeoJSON example

This is the example from the GeoJSON homepage (http://geojson.org/).

{
  "type": "Feature",
  "geometry": {
    "type": "Point",
    "coordinates": [125.6, 10.1]
  },
  "properties": {
    "name": "Dinagat Islands"
  }
}
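Since GeoJSON is plain JSON, we can read the example with any JSON parser. A minimal sketch using Python's standard library follows; the variable names are our own choice.

```python
import json

GEOJSON_TEXT = """
{
  "type": "Feature",
  "geometry": {"type": "Point", "coordinates": [125.6, 10.1]},
  "properties": {"name": "Dinagat Islands"}
}
"""

feature = json.loads(GEOJSON_TEXT)

# GeoJSON positions are ordered longitude, latitude (RFC 7946).
lon, lat = feature["geometry"]["coordinates"]
name = feature["properties"]["name"]

print(f"{name}: lon={lon}, lat={lat}")
```

The coordinate order (longitude first) is a common source of swapped axes when data is passed on to tools that expect latitude first.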

6.3 Databases

In terms of geospatial databases, we mainly run into SQL (Structured Query Language) databases meaning that we are dealing with relational databases.

Relational databases were introduced by Codd [8] and have standardized terminology describing a “table”:

• Row: tuple/record

• Column: attribute/field

• Table: relation

One of the biggest advantages of real databases is the usage of native database tools to perform queries and some analyses directly on the database without transferring data from the database to our workstations and doing the calculations there. The native database tools (DBMS - database management system) are highly optimized for fast calculations with a minimum of overhead.
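The difference between computing on the client and computing in the database can be sketched with SQLite from the Python standard library. The table `samples` and its values are made up for this illustration; with a server database the saving in transferred data would be the decisive factor.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (region TEXT, elevation REAL)")
con.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("north", 10.0), ("north", 14.0), ("south", 3.0), ("south", 5.0), ("south", 7.0)],
)

# Option 1: transfer every row to the client and aggregate in Python.
rows = con.execute("SELECT region, elevation FROM samples").fetchall()
sums = {}
for region, elevation in rows:
    total, count = sums.get(region, (0.0, 0))
    sums[region] = (total + elevation, count + 1)
client_side = {region: total / count for region, (total, count) in sums.items()}

# Option 2: let the DBMS aggregate; only one row per region is transferred.
db_side = dict(
    con.execute("SELECT region, AVG(elevation) FROM samples GROUP BY region")
)
con.close()

print(client_side)
print(db_side)
```

Both options give the same result, but the second one moves two rows instead of the whole table.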

6.3.1 SpatiaLite

SpatiaLite is a database that uses SQLite as its underlying basis. It provides an extension for geospatial data to SQLite and is developed by Furieri [9]. The development is funded by the Tuscany Region - Territorial and Environmental Information System. It is a wonderful one-file solution similar to GeoPackage. However, unlike GeoPackage, we can do basic manipulation and analyses on a database level using its DBMS.

SpatiaLite is not able to store raster data such as images by default. We have to come up with our own solution for storing raster data in SpatiaLite or we have to use the “librasterlite2” extension, which is developed by Furieri [10] as well.

We can use external files and algorithms via so-called “Virtual Interfaces”. These interfaces support:


• External datasources

– VirtualShape: Shapefiles

– VirtualDBF: DBF

– VirtualText: Text files

– VirtualXLS: Spreadsheets

– VirtualPG: PostgreSQL and PostGIS via an additional extension called “VirtualPG”

– VirtualOGR: various GDAL/OGR supported file types

• Non-native geometries

– VirtualFDO: FDO geometries

– VirtualGPKG: GeoPackage geometries

• Geospatial algorithms, search and routing queries

– VirtualRouting: SQL-based routing

– VirtualKNN: search for k Nearest Neighbors

– VirtualElementary: separation of complex geometries

– VirtualXPath: XPath

More precisely, SpatiaLite has become a collection of extensions and tools for the SQLite database. They are centered around “libspatialite” and the development is led by the same person and (mainly) funded by the Tuscany Region - Territorial and Environmental Information System. This ecosystem is developed to be platform independent.

Currently, the following extensions and tools exist:

• libspatialite

– Core extension for SQLite

• librasterlite2

– Raster data extension for SpatiaLite/SQLite

• spatialite-tools

– set of tools that support all of SpatiaLite’s functions

– Command Line Interface (CLI) only


• spatialite-gui

– GUI for spatialite-tools

• VirtualPG

– provides direct SQL access to PostgreSQL and PostGIS via Virtual Interfaces

• FreeXL

– extracts valid data within an Excel spreadsheet

• ReadOSM

– extracts data from OSM files (.osm, .osm.pbf)

• LibreWMS

– WMS (Web Map Service) viewer on top of RasterLite2

• dataSeltzer

– CGI component for web publishing of datasets

6.3.2 PostGIS

PostGIS is a geospatial extension for the PostgreSQL database developed by The PostGIS Development Group [11]. PostGIS is a single solution, unlike SpatiaLite, which consists of many tools. It can store raster data by default and the DBMS covers many more operations. Further, it supports search algorithms for high-speed spatial querying. We will use it for many applications throughout this book.

6.3.3 Other database solutions

We find other database solutions for geospatial data out there. Many of them are proprietary solutions that come with the drawback of vendor lock-in, and previous speed advantages are no longer relevant since the free databases perform on a similar level nowadays. Without going into too many details, we are basically speaking about NoSQL and graph databases.


6.4 Geospatial data in times of Big Data

A major drawback of all “classical” file and database solutions for storage of geospatial data is that they do not scale well with Big Data. While SQL databases are good at storing and processing relatively static data that require consistency, they have severe problems with unstructured and distributed real-time data collections. There are many existing NoSQL solutions for “non-geospatial” data. Some of them have been extended for geospatial purposes.

Before we have a look at two of them, we should ask ourselves why we cannot use “classical” solutions. Big Data is fundamentally different, because we may

• Process a constant stream of data

• Receive our data from other pipelines

• Face increasing computational demands

• Have different storage requirements

• Need distributed storage solutions

Big Data applications change our geospatial products and applications as well. Some example applications of geospatial data in the context of Big Data are:

• Real-time traffic routing optimization for fleets as well as for single entities

• Global (daily) tree counts

• Monitoring of construction sites and mines (with daily satellite images and local company process information)

• ... and everything else that has spatial components...

For these kinds of analyses it is vital to use database functions and not to load our huge amounts of data onto our computer/server, process them there in memory and load them back into the database. In general, such problems are solved by using online algorithms such as those provided by OnlineStats.jl [12]. There are two major ports of "classical big data platforms" for geospatial applications:

• SpatialHadoop [13]

• SpatialSpark [14]
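The idea behind the online algorithms mentioned above is that each value of a stream is processed once and then discarded, so memory usage stays constant. A minimal Python sketch of one such algorithm, Welford's method for a running mean and variance, is shown below. It is an illustration of the principle only and does not reproduce the API of OnlineStats.jl; the class name `RunningStats` and the sample values are assumptions.

```python
class RunningStats:
    """Online mean and variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value):
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance; undefined (NaN) for fewer than two values."""
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")


stats = RunningStats()
for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):  # stands in for a stream
    stats.update(value)

print(stats.n, stats.mean, stats.variance)
```

Only three numbers are kept in memory regardless of how many values arrive, which is what makes this approach suitable for constant data streams.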


Comment 6.5 "Real Big Data" with geospatial properties

So far, we have looked at Big Data in terms of most use cases, or at least in terms of what most people do. This is not the whole picture. Examples of “real” Big Data with geospatial properties are:

• Seismic simulations and reconstruction

• All kind of geophysical processing for reservoirs etc.

• Meteorological forecasting

• Space exploration [15], though different spatial reference systems are used than the ones we are used to

References

[1] GDAL/OGR contributors. GDAL/OGR Geospatial Data Abstraction software Library. Open Source Geospatial Foundation. 2018. url: http://gdal.org.

[2] ESRI. ESRI Shapefile Technical Description. White paper. Environmental Systems Research Institute Inc., 1998. url: http://www.esri.com/library/whitepapers/pdfs/shapefile.pdf.

[3] D. Burggraf, B. McClendon, M. Weiss-Malik, S. Askay, L. Colaiacomo, and R. Martell. OGC KML 2.3 Specification. 2015. url: http://docs.opengeospatial.org/is/12-007r2/12-007r2.html.

[4] N. Ritter and M. Ruth. GeoTIFF Format Specification GeoTIFF Revision 1.0. 2000. url: http://geotiff.maptools.org/spec/geotiffhome.html.

[5] OGC. OGC® (Open Geospatial Consortium) GeoPackage Encoding Standard. 2017. url: http://www.geopackage.org/spec120/index.html.

[6] QGIS Development Team. QGIS 3.0 Girona Release Notes. 2018. url: http://changelog.qgis.org/en/qgis/version/3.0.0/.

[7] H. Butler, M. Daly, A. Doyle, S. Gillies, S. Hagen, and T. Schaub. The GeoJSON Format. Tech. rep. rfc7946. Internet Engineering Task Force (IETF), 2016. url: https://tools.ietf.org/html/rfc7946.

[8] E. F. Codd. “A relational model of data for large shared data banks”. In: Communications of the ACM 13.6 (June 1970), pp. 377–387. doi: 10.1145/362384.362685.


[9] A. Furieri. SpatiaLite - Documentation. 2018. url: https://www.gaia-gis.it/gaia-sins/spatialite_topics.html.

[10] A. Furieri. Rasterlite2 Documentation. 2018. url: https://www.gaia-gis.it/fossil/librasterlite2/wiki?name=rasterlite2-doc.

[11] The PostGIS Development Group. PostGIS 2.5.0rc2dev Manual. 2018. url: https://postgis.net/docs/manual-2.5/.

[12] J. T. Day. “Online Algorithms for Statistics”. PhD thesis. North Carolina State University, 2018. url: https://repository.lib.ncsu.edu/bitstream/handle/1840.20/34945/etd.pdf.

[13] A. Eldawy and M. F. Mokbel. “SpatialHadoop: A MapReduce Framework for Spatial Data”. In: 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015. 2015, pp. 1352–1363. doi: 10.1109/ICDE.2015.7113382.

[14] S. You, J. Zhang, and L. Gruenwald. Large-Scale Spatial Join Query Processing in Cloud. Tech. rep. CUNY Graduate Center, The City College of New York and The University of Oklahoma, 2015. url: www-cs.ccny.cuny.edu/~jzhang/papers/spatial_cc_tr.pdf.

[15] J. Regier, K. Fischer, K. Pamnany, A. Noack, J. Revels, M. Lam, S. Howard, R. Giordano, D. Schlegel, J. McAuliffe, R. Thomas, and Prabhat. “Cataloging the visible universe through Bayesian inference in Julia at petascale”. In: Journal of Parallel and Distributed Computing 127 (May 2019), pp. 89–104. doi: 10.1016/j.jpdc.2018.12.008.
