Cyber-Infrastructure for Data-Intensive Geospatial Computing

Rajasekar Karthik, Alexandre Sorokine, Dilip R. Patlolla, Cheng Liu, Shweta M. Gupte, and Budhendra L. Bhaduri

With the recent advent of heterogeneous High-performance Computing (HPC) to handle EO “Big Data” workloads, there is a need for a unified Cyber-infrastructure (CI) platform that can bridge the best of many HPC worlds. In this chapter, we discuss such a CI platform being developed at the Geographic Information Science and Technology (GIST) group using novel and innovative techniques and emerging technologies that are scalable to large-scale supercomputers. The CI platform utilizes a wide variety of computing, such as GPGPU, distributed, real-time, and cluster computing, which are brought together architecturally to enable data-driven analysis, scientific understanding of earth system models, and research collaboration. This development addresses the need for close integration of EO and other geospatial information in the face of growing data volumes, and facilitates spatio-temporal analysis of disparate and dynamic data streams. Horizontal scalability and linear throughput are supported at the heart of the platform itself. It is being used to support very broad application areas, ranging from high-resolution settlement mapping and national bioenergy infrastructure to urban information and mobility systems. The platform provides spatio-temporal decision support capabilities for the planning, policy, and operational missions of US federal agencies. It is also designed to be functionally and technologically sustainable for continued support of the US energy and environment mission in the coming decades.

R. Karthik (✉) • A. Sorokine • D.R. Patlolla • C. Liu • S.M. Gupte • B.L. Bhaduri
Geographic Information Science and Technology Group, Oak Ridge National Laboratory, Oak Ridge, TN, USA
e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]

© The Author(s) 2018
P.-P. Mathieu, C. Aubrecht (eds.), Earth Observation Open Science and Innovation, ISSI Scientific Report Series 15, https://doi.org/10.1007/978-3-319-65633-5_7



Introduction

In today’s world, there is a multitude of heterogeneous Earth Observation High-performance Computing (EO-HPC) systems, each designed to address specific science or technology missions. These EO-HPC systems often work independently of each other, hindering the flow of information and limiting the ability to achieve interoperability among systems. Bringing these systems together to create a fully and closely knit Cyber-infrastructure (CI) platform provides a shared universe of EO data-driven analysis and discovery capabilities for US federal agencies. Though there have been some promising community efforts to build a CI platform that can integrate various types of EO-HPC systems, no unified CI platform currently exists (Kalidindi 2015; Bhaduri et al. 2015a). With the “Big Data” explosion changing the landscape of software architecture design, there is an increasing need for such a platform that can meet modern, complex needs.

The Geographic Information Science and Technology (GIST) group and the Urban Dynamics Institute (UDI) at Oak Ridge National Laboratory (ORNL) are building such a CI platform by integrating our next-generation EO-HPC systems under one umbrella using novel techniques and emerging technologies. We employ a wide breadth of techniques across the spectrum of scientific computing, such as GPGPU, distributed, real-time, and cluster computing (Bhaduri et al. 2015a; Karthik 2014a; Sorokine et al. 2012). The platform is designed to scale to petabytes of data and handle massive workloads. One of the biggest architectural challenges in developing our CI platform was addressing complex interdependencies among the various systems without compromising the efficiency or functionality of individual systems. The platform achieves foundational, structural, and semantic interoperability to create a harmonized and seamless experience across the various systems. With this platform, the systems are designed to work together as a whole information system, but also retain the ability to operate independently if desired. In this integrated platform, the systems communicate with each other, providing various levels of control of components and functionality while still allowing for independent control. Modularity is designed into the core of the CI platform to make integration of future EO-HPC systems easier and to foster sustainable development that meets US science and energy missions now and into the future. Our CI platform aims to take the next major leap in Data Science and Cyber-infrastructure while making the best use of our existing EO-HPC systems.

In the following sections, we describe the challenges and trends, and how our CI platform plays an important role in our research initiatives, illustrated with settlement mapping, mobility science, and an urban information system.

Settlement Mapping Tool (SMTOOL)

Understanding high-resolution population distribution data is fundamental to reducing disaster risk, eliminating poverty, and fostering sustainable development. Modern censuses fail to cover population in remote, inaccessible areas in many underdeveloped nations. Small settlements are visible in high-resolution satellite imagery, but extracting information from such imagery has historically been computationally expensive. Utilizing GPUs, Oak Ridge National Laboratory is mapping the smallest settlements and associated population across the globe for the first time in our history. The high-resolution settlement data and the resulting 90 m LandScan HD population data have profoundly enhanced our ability to reach and serve vast vulnerable populations from local to planet scales (Bhaduri et al. 2015b).

The past 50 years have witnessed the global population increase by four billion, and with 150 new births every minute, an additional four billion people will settle on this planet in the coming 50. Urban and rural population distribution data are fundamental to preventing and reducing disaster risk, eliminating poverty, and fostering sustainable development. Commonly available population data, collected through modern censuses, do not capture this high-resolution population distribution and dynamics; as a result, global human settlements and population distribution are grossly underestimated. Footprints of our expanding activities are impacting the future of this planet, from the availability of natural resources to a changing climate. Accurate assessment of high-resolution population data is essential for successfully addressing key issues such as good governance, poverty reduction strategies, and prosperity in social, economic, and environmental health. Geospatial data and models offer novel approaches to disaggregate census data to finer spatial units, with land use and land cover (LULC) data being the primary driver. With the increasing availability of LULC data from satellite remote sensing, “developed” pixels have been the nucleus for assessing settlement buildup from human activity. However, the processing and analysis of tera- to peta-scale satellite data has been computationally expensive and challenging.

The availability of moderate- to high-resolution LULC data derived from NASA MODIS (250–500 m) or Landsat TM (30 m) has facilitated the development of population distribution data at higher spatial resolutions, such as Oak Ridge National Laboratory’s (ORNL) LandScan Global (1 km) and LandScan USA (90 m), the two finest-resolution population distribution datasets developed. Although these LULC datasets have somewhat alleviated the difficulty for population distribution models, in order to assess the true magnitude and extent of the human footprint it is critical to understand the distribution and relationships of small and medium-sized human settlements. These structures remain mostly undetectable in medium-resolution satellite-derived LULC data. For humanitarian missions, the truly vulnerable, such as those living in refugee camps, informal settlements, and slums, need to be effectively and comprehensively captured in our global understanding. This is particularly true in suburban and rural areas, where the population is dispersed to a greater degree than in urban areas. Extracting settlement information from very high-resolution (1 m or finer), peta-scale earth observation imagery has been a promising pathway for rapid estimation and revision of settlement and population distribution data. As early as 2005, automated feature extraction algorithms implemented on available CPU-based architectures demonstrated radical improvement in image analysis efficiency, when manual settlement identification for a 100 km2 area was reduced from 10 h to 30 min. However, this scaled inefficiently with the limited resources of a workstation, and at that rate processing the 57 million km2 of habitable area would take decades. The pressing need for identifying population distribution in the smallest human settlements, and for monitoring settlement patterns at local to planet scales as landscape changes are induced by population growth, migration, and disasters, compels a computational solution for processing large volumes of very high-resolution satellite imagery. Such a solution did not exist before this work was accomplished at ORNL.

Extracting settlements from high-resolution satellite images was accomplished by mapping sub-meter pixel data to unique patterns that correlate with the underlying settlements. To account for the variations in settlement structures, spanning from skyscrapers to small dwellings, we generated patterns at different scales. The mapping process involved simultaneously analyzing a set of connected pixels to extract low-level structural features such as building edges, corners, and lines. Next, the spatial arrangements of these structural features at different scales were computed in parallel, allowing us to generate unique settlement signatures efficiently. Furthermore, pattern recognition based on a previously learned model was integrated with the pattern generation step, allowing us to overcome the need for additional storage. Our strategy of mapping pixels to underlying structural patterns was quite different from existing approaches that relied on spectral measurements, such as reflectance at different wavelengths, for settlement detection.

This approach using sub-meter resolution imagery is not only useful for generating accurate human settlement maps, but also allows potential (social and vulnerability) characterization of population from settlement structures (tents, huts, buildings) based on image texture and spectral features. Rapid ingestion and analysis of high-resolution imagery to enhance the quality and timely availability of input spatial data provides a cost- and time-effective solution for developing current and accurate high-resolution population data. Such progress in geospatial science and technology holds tremendous promise for advancing the accuracy and timely flow of critical geospatial information, not only benefiting numerous sustainable development programs but also carrying significant implications for time-critical missions of disaster support.

High-resolution settlement data is foundational information for locating populations and activities in an area. One major use of this data is as input to ORNL’s LandScan HD population distribution model, which combines the settlement data with population density information to generate population distribution at an unprecedented 90 m resolution. For many underdeveloped countries, official censuses never reach remote, difficult-to-access areas; moreover, there had previously been no reason to exploit high-resolution satellite imagery for those areas. Consequently, this capability has enabled us to locate populations in secluded areas of the planet for the first time in our history. High-resolution settlement and population data are being used by the global humanitarian community for missions ranging from planning critical infrastructure and services for deserving populations, to responding to the Ebola crisis in western Africa, eradicating polio in Nigeria, and defining and mitigating disaster risks.


Fig. 1 An overview of SMTOOL architecture

Often, solutions require advanced algorithms capable of extracting, representing, modeling, and interpreting scene features that characterize spatial, structural, and semantic attributes. Furthermore, these solutions should be scalable, enabling analysis of big image datasets: at half-meter pixel resolution the earth’s surface has roughly 600 trillion pixels, and the requirement to process at this scale at repeated intervals demands highly scalable solutions. Thus, we developed a GPU-based computational framework (as illustrated in Fig. 1) designed for identifying critical infrastructure from large-scale satellite or aerial imagery to assess vulnerable populations. We exploit the parallel processing capability of GPUs and present GPU-friendly algorithms for robust and efficient detection of settlements from large-scale high-resolution satellite imagery (Patlolla et al. 2012). Feature descriptor generation is a computationally demanding but key step in automated scene analysis. To address the large-scale data processing needs, we exploited the parallel computing architecture and carefully designed our algorithm to scale with hardware and to use the memory bandwidth (required to transfer the high-resolution image data) efficiently, producing large speedups in the feature descriptor computation (as illustrated in Fig. 2).

We could thus achieve GPU-based high-speed computation of multiple feature descriptors—multiscale Histogram of Oriented Gradients (HOG) (Patlolla et al. 2012), Gray Level Co-Occurrence Matrix (GLCM) Contrast, local pixel intensity statistics, texture response (local texton responses to a set of oriented filters at each pixel) (Patlolla et al. 2015), Dense Scale Invariant Feature Transform (DSIFT), Vegetation Indices (NDVI), Line Support Regions (extraction of straight line segments from an image by grouping spatially contiguous pixels with consistent orientations), band ratios (a digital image-processing technique that enhances contrast between features by dividing a measure of reflectance for the pixels in one image band by the measure of reflectance for the pixels in another image band), and others. Once the features are computed, a linear SVM is used to classify settlement and non-settlement areas. The computational process requires dozens of floating-point computations per pixel, which can result in slow runtimes even for the fastest of CPUs. The slow speed of a CPU is a serious hindrance to productivity for time-critical missions. Our GPU-accelerated computing solution provides an order of magnitude or more in performance by offloading compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU. The implementation further scales linearly with the available nodes, thus enabling the processing of large-scale data on high-end GPU-based cluster computers.

Fig. 2 Settlement mapping process
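To make the per-block feature-and-classify step concrete, here is a minimal CPU-side sketch, assuming scikit-image and scikit-learn; the production SMTOOL pipeline implements these steps as CUDA kernels, and the block size and HOG parameters below are illustrative only.

    # Illustrative sketch (not the SMTOOL code): a HOG descriptor per
    # pixel-block feeds a linear SVM settlement classifier.
    import numpy as np
    from skimage.feature import hog
    from sklearn.svm import LinearSVC

    M = 32  # pixel-block size; the chapter reports experiments with 8-64

    def block_features(image):
        """Split a grayscale image into non-overlapping M x M blocks
        and compute a HOG descriptor for each block."""
        h, w = image.shape
        feats, boxes = [], []
        for r in range(0, h - M + 1, M):
            for c in range(0, w - M + 1, M):
                block = image[r:r + M, c:c + M]
                feats.append(hog(block, orientations=9,
                                 pixels_per_cell=(8, 8),
                                 cells_per_block=(2, 2)))
                boxes.append((r, c))
        return np.array(feats), boxes

    # Training uses labeled blocks (1 = settlement, 0 = non-settlement):
    # clf = LinearSVC().fit(train_feats, train_labels)
    # labels = clf.predict(block_features(test_image)[0])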

With the introduction of very high-resolution satellite imagery, mapping of small or spectrally indistinct settlements became possible on a global scale. However, existing methodologies for extracting and characterizing settlements rely on manual image interpretation or involve computationally intensive object extraction and characterization algorithms that saturate the computational capabilities of conventional CPU-based options used by commercial remote sensing software packages. Many of the existing pixel-based image analysis techniques used for medium-resolution Landsat imagery (~30 m) or coarser MODIS imagery (250–1000 m) are not ideal for interpreting satellite imagery with sub-meter spatial resolution. Advanced modeling of the spatial context is necessary to extract and represent information from such high-resolution overhead imagery. On a global scale, critical computational challenges are posed by the processing of petabytes of sub-meter resolution imagery. For example, an image of Kano city in Nigeria at 0.5 m resolution represents 23 gigabytes of data covering 13,050 km2. Attempts to accurately extract settlements using CPU-based commercial remote sensing packages were unsuccessful, and manually digitizing the settlements for this area was estimated at 870 h. Meanwhile, SMTool on a workstation with four Tesla GPUs is able to process this large dataset in approximately 17 min (as illustrated in Table 1, Figs. 3 and 4).


Table 1 Performance of various features—processing times are based on a workstation with four C2075 GPUs

Feature    Accuracy (%)   Runtime (s)
HOG        93.5           1.6
TEXTONS    92.7           4.7
VEGIND     91.4           1.77
BANDRT     86.1           1.93
DSIFT      90.8           11.33

Fig. 3 TEXTONS performance—33 images of 0.5 m spatial resolution, each covering an area of 2.6 km2, collected from various parts of Kandahar, Afghanistan


Fig. 4 SMTOOL results for Kano, Nigeria—on average ~1% training samples

The Compute Unified Device Architecture (CUDA) has enabled us to use NVIDIA Tesla GPUs to implement, in a practical and efficient manner, the expensive feature descriptor algorithms that would otherwise be complex and impractical.

It is quite important to carefully design the computing strategies to fully exploit the GPU’s parallel computational capabilities. For this we divide the high-resolution imagery into non-overlapping square pixel-blocks consisting of M × M pixels. To set an optimal value for M, we experimented with several values ranging from 8 to 64 pixels. For each pixel-block we compute feature descriptors at several scales based on the low-level image features. Parallel processing is key to efficiently implementing the feature descriptor algorithm, where thousands of threads can run simultaneously. The K20X Tesla GPU in the GEOCLOUD cluster has 2688 CUDA cores per GPU at ~3.95 teraflops of single-precision computing power, which helped us deliver speedups ranging from 100× to 220×, thus cutting the feature descriptor computation time from days to mere minutes.
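The M × M tiling itself can be sketched in a few lines. The snippet below uses CuPy as a stand-in for the hand-written CUDA kernels: a reshape turns the image into non-overlapping tiles so that a per-block statistic runs entirely on the GPU.

    # Sketch of the M x M pixel-block tiling on the GPU, assuming CuPy.
    import cupy as cp

    def per_block_mean(image_gpu, M):
        h, w = image_gpu.shape
        h, w = h - h % M, w - w % M          # trim partial edge blocks
        tiles = image_gpu[:h, :w].reshape(h // M, M, w // M, M)
        return tiles.mean(axis=(1, 3))       # one value per pixel-block

    img = cp.random.randint(0, 256, (4096, 4096), dtype=cp.uint8)
    means = per_block_mean(img.astype(cp.float32), M=32)

The real descriptors (HOG, textons, DSIFT, etc.) replace the simple mean, but the block-parallel structure is the same.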

Efficient memory utilization is key to the optimal performance of our algorithm on the GPU, as it involves transferring large amounts of image raster data to the GPU and the output settlement layer back to the host. We leveraged the Tesla GPU’s PCI-E 3.0 interface to achieve over 10 GB/s transfers between the host (CPUs) and the devices (GPUs). An important factor to consider is the data transfer between CPU and GPU; optimal speedups require reducing the amount of data transferred between the two architectures. In our implementation the data transfer between CPU and GPU is performed only at the beginning and the end of the process. First, the image is read using GDAL and stored in CUDA global memory, which is 6 GB and provides 288 GB/s of memory bandwidth. Though data requests to global memory have higher latency, CUDA provides a number of additional methods for accessing memory (e.g., shared and constant memory) that can remove the majority of data requests to global memory, thus enabling us to keep the Compute to Global Memory Access (CGMA) ratio at high values to achieve fine-grained parallelism.
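A minimal sketch of this transfer-once pattern, assuming GDAL’s Python bindings and CuPy (the actual implementation is CUDA C, and the file name and stand-in computation are illustrative):

    # Transfer-once pattern: one host-to-device copy in, one copy out.
    from osgeo import gdal
    import cupy as cp

    ds = gdal.Open("scene_0.5m.tif")          # hypothetical input raster
    band = ds.GetRasterBand(1).ReadAsArray()  # host-side NumPy array

    img_gpu = cp.asarray(band, dtype=cp.float32)   # single H2D transfer
    # ... all feature-descriptor kernels operate device-side on img_gpu ...
    settlement_mask = img_gpu > img_gpu.mean()     # stand-in computation
    result = cp.asnumpy(settlement_mask)           # single D2H transfer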

This has been an interdisciplinary effort, with team members with academic and professional training in geography, electrical engineering, and computer science and engineering, and expertise in geographical sciences, population and settlement geography, spatial modeling, satellite remote sensing, machine learning, data analysis, and high-performance computing.

Toolbox for Urban Mobility Simulations (TUMS)

TUMS, the Toolbox for Urban Mobility Simulations, is a web-based, high-resolution, quick-response traffic simulation and modeling system for urban transportation studies. TUMS can be used both as a daily commuter traffic simulator and as an emergency evacuation planning tool. TUMS has some unique features compared with other transportation modeling and traffic simulation systems: it uses a high-resolution population distribution and a detailed street network, both covering the entire world. TUMS aims to simulate county-level traffic flow using microscopic traffic simulation modeling, and it offers web applications based on WebGL, so users can take advantage of the client-side Graphics Processing Unit (GPU) when one is present on the client desktop. The transportation engine of TUMS is based on an open-source package called TRANSIMS (TRansportation ANalysis SIMulation System, version 5.0) (Smith et al. 1995).

Global Dataset

The two main datasets used in TUMS are a population distribution dataset called LandScan, developed by ORNL (Bhaduri et al. 2002), and a worldwide open-source street-level transportation network called OSM, OpenStreetMap (OpenStreetMap 2016). Both datasets cover the entire planet.

LandScan has two components, LandScan USA and LandScan Global. As the names indicate, LandScan USA is the population distribution for the USA, and LandScan Global covers the entire world including the USA. Both datasets are updated yearly. LandScan divides the study area into cells, and each cell has a population count. LandScan USA has higher resolution cells than LandScan Global: LandScan USA uses 3 arc-second cells, while LandScan Global uses 30 arc-second cells. Roughly, a 3 arc-second cell has a size of 90 m by 90 m and a 30 arc-second cell has a size of 1 km by 1 km near the equator; cells become smaller at higher latitudes. To make the analysis consistent, TUMS decomposes LandScan Global into 3 arc-second cells using a primitive moving-average method. If the study area is within the USA, TUMS uses the LandScan USA dataset; if the study area is outside the USA, TUMS uses the decomposed LandScan Global 3 arc-second dataset. From now on in this chapter, we will use LandScan to refer to both LandScan USA and the decomposed LandScan Global with 3 arc-second cells.
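A sketch of such a decomposition, assuming NumPy and SciPy: each 30 arc-second cell is split into 10 × 10 finer cells with its population divided evenly, then smoothed with a moving-average (uniform) filter. The exact method TUMS uses may differ.

    # Illustrative 30 arc-second -> 3 arc-second decomposition.
    import numpy as np
    from scipy.ndimage import uniform_filter

    def decompose(global_30s):
        # split each coarse cell into a 10 x 10 block of fine cells
        fine = np.repeat(np.repeat(global_30s, 10, axis=0), 10, axis=1)
        fine = fine / 100.0                  # preserve total population
        # moving average smooths the blocky result (approximately
        # mass-preserving away from the grid edges)
        return uniform_filter(fine, size=10, mode="nearest")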

OSM is updated weekly. The data quality in OSM depends on the geographic region: Europe and North America have much higher quality data than Asia and Africa. Since OSM keeps evolving, the data quality has improved tremendously compared to earlier versions. Do not be deceived by its name: OpenStreetMap not only has a street network, it also has other features such as land use types, administrative boundaries, physical features, and much more. However, TUMS only uses the street network at the current stage.

Framework

There are three major components in the TUMS framework: a pre-processing component, a traffic simulation component, and a web-based visualization component (Fig. 5). The pre-processing component is responsible for preparing the input data for the transportation modeling. The first step is to define a study area, which can be a county (in the USA only), a polygon, or a circle. The next step is to extract the population and street network from LandScan and OSM. After integrating these two data sources, TUMS creates a routable network with correct network topology and generates origin-destination (OD) tables for transportation modeling.

Fig. 5 TUMS framework: OpenStreetMap, LandScan, evacuation areas, and shelters feed a data processing stage (travel demand, trip distribution, traffic assignment) built on the TRANSIMS engine; the traffic simulation models support macroscopic link-based and microscopic vehicle-based analysis, delivered through web-based visualization

The traffic simulation models are based on the TRANSIMS framework. TRANSIMS has more than a dozen executable programs, which are loosely coupled; each executable program can be run separately if its input data is properly set. Roughly, these programs can be grouped into five categories: synthetic population generation, network preparation, origin-destination (OD) table preparation, trip distribution and assignment, and microscopic traffic simulation. TUMS takes advantage of this flexible framework and integrates its own modules into the TRANSIMS framework. For example, the synthetic population generation modules are replaced by the LandScan population module, and TUMS’s own OD table generation modules, built on LandScan and OSM, substitute for the TRANSIMS OD table preparation modules.
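Because the programs are loosely coupled and file-driven, the pipeline can be orchestrated as a chain of external processes. The sketch below shows the idea in Python; the executable and control-file names are illustrative placeholders, not the actual TRANSIMS 5.0 binaries or TUMS module names.

    # Illustrative orchestration of a loosely coupled, file-driven pipeline.
    import subprocess

    steps = [
        ("landscan_demand", "demand.ctl"),   # hypothetical TUMS OD module
        ("router",          "router.ctl"),   # trip distribution/assignment
        ("microsimulator",  "microsim.ctl"), # microscopic simulation
    ]
    for exe, control_file in steps:
        # each program reads and writes the files named in its control
        # file, so any step can also be re-run independently
        subprocess.run([exe, control_file], check=True)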

Since TRANSIMS does not have a Graphical User Interface (GUI), TUMS developed two independent visualization tools for users with different backgrounds (Karthik 2014b). The link-based visualization and analysis tools are for planners who are interested in measures of effectiveness (MOE) for planning purposes. The vehicle-based animation tools are for traffic engineers who are more interested in operations such as intersection traffic control. Figures 6 and 7 are examples of the link-based and vehicle-based GUIs.

Fig. 6 Link-based visualization tool


Fig. 7 Vehicle-based visualization tool

OD Tables

By manipulating the OD tables, TUMS can simulate both daily commuter traffic flow and no-notice emergency evacuations. Although LandScan only reports the total population count for each cell, internally LandScan has five layers: worker, residential, school, shopping, and non-movement groups. With these layers it is possible to generate the OD tables for daily commuter traffic flow. Figure 8 shows an example of daily commuter traffic flows for the years 2015 and 2035.

Fig. 8 Daily traffic flow simulation at Cleveland, TN, for the years 2015 (left) and 2035 (right)
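As one illustration of how production and attraction layers can be turned into an OD table, the sketch below applies a simple gravity model; this is a generic formulation for illustration, not necessarily the model TUMS implements.

    # Illustrative gravity-model OD table from production/attraction layers.
    import numpy as np

    def gravity_od(workers_home, jobs_work, dist, beta=0.1):
        """workers_home[i]: workers residing in cell i (productions);
        jobs_work[j]: jobs in cell j (attractions);
        dist[i, j]: travel distance or time from cell i to cell j."""
        weight = jobs_work * np.exp(-beta * dist)    # (i, j) attraction
        weight /= weight.sum(axis=1, keepdims=True)  # row-normalize
        return workers_home[:, None] * weight        # trips i -> j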

For a no-notice emergency evacuation, TUMS assumes that every evacuee will take a trip to the nearest shelter or exit point (boundary point) to get out of the evacuation area as quickly as possible. The OD tables for no-notice emergency evacuation simulation are generated by finding the nearest shelter for each LandScan Population Cell (LPC).
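The nearest-shelter assignment can be sketched with a k-d tree query. Coordinates are assumed here to be projected to a planar CRS so that Euclidean distances are meaningful; in practice, network (road) distance may be preferred.

    # Nearest-shelter OD assignment for each LandScan Population Cell.
    import numpy as np
    from scipy.spatial import cKDTree

    def evacuation_od(cell_xy, cell_pop, shelter_xy):
        tree = cKDTree(shelter_xy)
        _, nearest = tree.query(cell_xy)     # index of closest shelter
        od = np.zeros((len(cell_xy), len(shelter_xy)))
        od[np.arange(len(cell_xy)), nearest] = cell_pop
        return od                            # each cell's trips go to one shelter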

Resolution

Traditional transportation models, both macroscopic and microscopic, use Traffic Analysis Zones (TAZ) for the OD tables. TAZs are the basic geographic units for demographic data and land use type. The size of a TAZ varies: zones are smaller in urban areas with high population density and larger in rural areas with lower population density. With the rise of agent-based and driver-behaviour traffic modeling in transportation research, large-area TAZs are not suitable for microscopic traffic simulations. For example, Alexandria County, VA, has only 62 TAZs, each covering quite a large area. For microscopic traffic simulation there is no reason that the TAZs could not be as small as a single building, given enough computing resources and available data. Computing resources are cheap now, but unfortunately a global single-building population distribution database is not yet available, so TUMS uses LandScan as an alternative. There are 5657 LandScan Population Cells (LPC) compared with 62 TAZs in Alexandria, VA; the LPC resolution is around 100 times higher than the TAZ resolution.

The size of the TAZ or LPC is related to the network level of detail. A network with principal and minor arterials does not need a high-resolution population dataset, and in traditional traffic modeling the collector and local streets are ignored due to their low traffic volume. But if the OD zones use single buildings as the trip generation unit, then the network should include collector and local streets. For no-notice emergency evacuation simulation, the collector and local streets become very important because evacuees who are close to the boundary of the evacuation region can get out of the evacuation area very quickly by using local streets. If the local streets are excluded from the network, all these evacuees have to travel in the opposite direction to access the major arterials and then travel to the boundary points. This is unrealistic and generates artificial congestion on the arterials.

Unified Network and Population Database

There are many transportation modeling systems and traffic simulation tools available, both open source and in commercial packages, such as TRANSIMS, MITSIM, VISSIM, SUMO, and MATSIM, just to name a few. All of these models have similar basic input requirements, such as network and population, but each of them has its own input and output formats. TUMS has developed a unified database for the entire world and also has utilities to convert the unified database to different formats for different models. Currently TUMS supports TRANSIMS and MITSIM; SUMO and MATSIM will be added in the near future.

Big Data

In the modern era, data is generated by internet activity, environmental sensors, traffic cameras, and satellite imagery, and is referred to as Big Data. Processing, analyzing, and visualizing this data has its own challenges. The OSM dataset has 3+ billion points and 300+ million ways (links) in the planet data file (December 2015 version); among the 300+ million ways there are 80+ million street links. LandScan has 93+ billion cells.

Since all these data are read-only, TUMS chooses flat binary files to store the data, for simplicity and for easy random access. In order to retrieve the data efficiently, both the network and the population data are decomposed into 1 by 1 degree cells. The street network is stored in a shapefile format and the population data is stored in a binary grid format. Since ESRI’s binary grid format is proprietary, TUMS had to develop its own binary grid format. In the TUMS database the OSM street network occupies 21 GB and LandScan occupies 4.6 GB of disk space.
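A minimal sketch of such a tiled binary grid with constant-time random access follows; the 12-byte header layout here is hypothetical, not the actual TUMS format. At 3 arc-seconds there are 1200 cells per degree, so one tile holds 1200 × 1200 values.

    # Illustrative 1 x 1 degree binary population tile with random access.
    import struct
    import numpy as np

    CELLS = 1200          # 3 arc-second cells per degree

    def write_tile(path, lon0, lat0, grid):
        with open(path, "wb") as f:
            f.write(struct.pack("<iii", lon0, lat0, CELLS))  # 12-byte header
            grid.astype("<f4").tofile(f)                     # row-major floats

    def read_cell(path, row, col):
        with open(path, "rb") as f:
            f.seek(12 + 4 * (row * CELLS + col))   # jump straight to the cell
            return struct.unpack("<f", f.read(4))[0]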

The vehicle trajectory data is another challenge. Assuming there are 100K vehicles in a median-size county and the simulation time is 24 h for a daily commuter traffic simulation with a 1-s simulation step, the total trajectory for all vehicles is 8+ billion points per scenario. This data is stored in an ESRI-like point shapefile. Since ESRI’s point shapefile format has a 2 GB file size limit, TUMS developed its own binary format without the 2 GB limit. This vehicle trajectory data is streamed to the TUMS web-based application for animation.

Urban Information System (UrbIS)

Leveraging Big Data to Understand Urban Impact on Environment and Climate

Cities are one of the major contributors to climate change. At the same time, cities themselves are most strongly affected by the changing environmental conditions. The immense complexity of interactions between urban areas, climate, and natural environments presents scientists with a multitude of challenges. First, urban environments are characterized by a very large number of variables, including demographics, energy, quality of the environment, and many others. Second, cities show significant variety and strongly differ in terms of their processes and energy and material fluxes. Third, cities themselves have become major defining forces for their surrounding environments by affecting local topography, air circulation, water-heat balance, and habitats. The resulting human-natural system has a large number of feedback loops and correlations among its variables.

Understanding of urban environments can be improved by tapping into the vast information resources that have become available to researchers thanks to Big Data technologies (Chowdhury et al. 2015). Traditional sources of Big Data, like historical databases of Twitter messages, postings in other social networks, and cell phone locations, can provide valuable insights into the functioning of people in urban environments. However, the majority of the data of interest to urban researchers exists in the form of an “ecosystem of small data”: a large number of disparate datasets created by different communities, government agencies, and research institutions. Finding such data and then merging them together for use in a single analytical workflow has become a major hindrance for such studies. To address these challenges, we at ORNL have embarked on developing the ORNL Urban Information System (UrbIS)—a web-based software tool that allows urban scientists to perform most of their analytical and data processing tasks in the cloud within a unified browser-based user interface.

The goal of UrbIS is to address a number of problems typically faced by researchers in this area. After analyzing ORNL’s experience in a number of projects, including the ones described in this chapter, we were able to identify multiple bottlenecks that impede scientists’ productivity. These challenges can be mitigated by developing software that automates several commonly performed tasks: (1) finding the data necessary to achieve the goals of the study, (2) preparing data from external sources for use in the analytical software, (3) running modeling and analytical programs on high-performance computing systems, and (4) retrieving and understanding the results of the analysis, including representation of the results in visual form as graphs and maps.

Although scientists typically have a good understanding of the kinds of data they need for their research, finding specific datasets without missing relevant ones can be hard and time consuming. Most of the relevant data resides in the “deep web”, i.e., it is not visible to, or not suitably indexed by, general-purpose search engines like Google or DuckDuckGo. Therefore, such search engines often produce noisy results that require lots of manual filtering and verification, or miss relevant data.

Searching through dataset metadata provides a better alternative for finding scientific data. In the recent decade metadata has become a universally used tool for documenting large amounts of data, especially data produced by governmental, international, and other major research organizations. Multiple standardization efforts have generated several specifications that cover many aspects of the domain-specific knowledge necessary to precisely represent information about the data. Metadata search capabilities are currently available in many data archives and repositories such as, for example, NASA’s Data Portal (https://data.nasa.gov/) and the DataONE Earth Observation Network (https://www.dataone.org/) supported by the National Science Foundation.

Metadata search is in most cases more effective than the use of general-purpose search engines because the metadata is structured and curated according to well-defined standards. Users can filter the data not only by keywords or commonly used phrases but also by specific spatial, temporal, or attribute information. For example, it is possible to limit the search to a specific sensor, variable, target area, time interval, or range of values. Certain results can be excluded from the search by using negation criteria, which is not easily achievable in general-purpose search engines. However, typical metadata search requires interaction with multiple metadata search systems and familiarity with a variety of user interfaces and APIs.
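A structured metadata query might look like the sketch below; the endpoint and parameter names are hypothetical placeholders standing in for engines such as Mercury or the DataONE API, and are shown only to illustrate spatial, temporal, attribute, and negation filters.

    # Illustrative structured metadata query (hypothetical endpoint/params).
    import requests

    params = {
        "q": "land surface temperature",
        "variable": "LST",                 # attribute filter
        "bbox": "-84.5,35.7,-83.9,36.2",   # spatial filter (W,S,E,N)
        "time_start": "2010-01-01",        # temporal filter
        "time_end": "2015-12-31",
        "exclude": "sensor:MODIS",         # negation criterion
    }
    hits = requests.get("https://metadata.example.org/search",
                        params=params, timeout=30).json()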

After the necessary data have been identified, users have to extract relevant subsets of the data (i.e., clipping a region of interest and/or limiting the data to a specific time interval) and move the data to their workstations. Sometimes the volume of the data can be very large, as in the case of ensembles of global circulation models, and can reach volumes on the order of terabytes. Present-day hard drive costs are low enough not to be a limiting factor for storing data; still, movement of large volumes of data over the network requires lots of time and special software like the Globus Toolkit GridFTP1. A bigger problem is the maintenance of the harvested data on the workstation or local network storage, which requires not only cataloguing the data but also checking dataset integrity, creating backups, and retrieving updates for corrected errors or newer versions of the datasets.

The next preparation step is converting the harvested data into formats that can be understood by the analytical software. This step includes not only simple format conversion but also other non-analytical operations. Almost all urban data is spatiotemporal, as the overwhelming majority of data records in urban datasets have some kind of geographic and time reference. Thus there is always a need to maintain and convert cartographic projections and other spatial referencing information. The datasets often come in formats that are not understood by the analytical software or in-house developed code, and scientists are forced to spend their time developing format converters or performing lots of manual transformations. In the case of UrbIS we are often faced with data that comes from different scientific communities—urban scientists and climate modellers. Most climate and weather data is stored in NetCDF or HDF5 files, while urban datasets mostly rely on the file formats of commercial GIS software. Many of the open-source and commercial GIS packages are able to read these formats, but the data has to be manually reorganized. A separate problem is semantic misalignment among datasets, especially when the datasets originate from different communities. This includes incompatibilities related to units of measure, variable names, and inconsistent naming of grids and spatial regions. Such differences between datasets are often not reflected in their metadata.

1 http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp/.
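A sketch of one such conversion, assuming the xarray and rioxarray libraries: a NetCDF climate variable is georeferenced and written out as a GeoTIFF that GIS packages read directly. The file, variable, and dimension names are illustrative.

    # Illustrative NetCDF -> GeoTIFF conversion for GIS consumption.
    import xarray as xr
    import rioxarray  # noqa: F401  (registers the .rio accessor)

    ds = xr.open_dataset("tas_monthly.nc")            # hypothetical file
    tas = ds["tas"].isel(time=0)                      # first time slice
    tas = tas.rio.set_spatial_dims(x_dim="lon", y_dim="lat")
    tas = tas.rio.write_crs("EPSG:4326")              # attach spatial reference
    tas.rio.to_raster("tas_2010_01.tif")              # GIS-readable output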


Many of the analytical and modeling projects at ORNL, including the ones in this review, heavily rely on high-performance computing systems and facilities. This includes conventional computing clusters, cloud-based systems, and leadership massively parallel facilities like TITAN and Eos2. Developing, porting, and using scientific code, and managing applications and data on such systems, require special technical skills and experience that are not commonly available.

Finally, the output of the high-performance models has to be presented in a form suitable for understanding and presentation. In our research domain this almost always means visualization in a geographic context with the help of the advanced visualization tools found in geographic information systems. At this point the modeling and analytical results should not only be converted into formats understandable by GIS but also aligned with other pertinent geographic data.

Even though most of the outlined difficulties are technical in nature, they impose a significant toll on scientists’ time and increase the overall costs of research. Removing these burdens from the scientist is one of the main goals of UrbIS. Earlier ORNL experience with similar systems has demonstrated the efficiency of such an approach. In 2010 ORNL developed iGlobe—a desktop application for the geographic analysis of climate simulation data that combined server-side analysis and management of data with geographic visualization in a single workflow (Chandola et al. 2011). iGlobe is built around NASA WorldWind Java3 and allows users to retrieve data from data portals, process it on the server, and visualize analysis results on the desktop. Control of the server-side processing of the data is performed through the desktop GUI using a secure shell connection. Results of the analysis are presented using the NASA WorldWind visualization component as interactive 2D or 3D geographic displays. With the advance of web, cloud, and high-performance computing technologies, we are leveraging our iGlobe experience at the new level of web-centric and cloud-centric applications.

Currently UrbIS is under active development and exists as an early evolving prototype available to ORNL internal users. When completed, UrbIS will allow urban researchers to execute a complete analytical workflow: discovering and obtaining the necessary data from diverse data repositories, analyzing them using high-performance computing capabilities, and then visualizing and publishing the results. Researchers will be able to perform all these operations completely in the cloud and/or on the server through a standardized web interface. When fully implemented, UrbIS will eliminate the need to download and process any input, intermediate, or output data files on the workstation.

Screenshots of the UrbIS prototype are presented in Figs. 9, 10, 11 and 12. The UrbIS workflow starts with a federated metadata search interface (Fig. 9). This interface provides the user with search capabilities over several external metadata search engines and internal ORNL data holdings. For the federated metadata search engine we are using a customized version of Mercury4—an in-house ORNL metadata search engine that enables search over other metadata repositories and archives like DataONE (https://www.dataone.org/) and the ORNL DAAC (https://daac.ornl.gov/). In addition to the external data repositories, UrbIS also provides its users with several frequently used datasets with commonly used geographic information. All the datasets and datastores can be searched through the same interface, completely transparently to the user.

2 https://www.olcf.ornl.gov/titan/.
3 http://worldwind.arc.nasa.gov/java/.
4 http://mercury.ornl.gov/.

Fig. 9 UrbIS federated metadata search interface

Fig. 10 UrbIS workspace manager interface

Fig. 11 HPC computations for analysis and modeling

Fig. 12 UrbIS visualization interface

After finding the needed data, the user defines the region of interest and the spatial resolution of the study area. At the same time, in the background, the system starts retrieval and sub-setting of the requested data from the external data stores. After the data has been placed into UrbIS scratch disk space, the user is notified and records about the downloaded subsets appear in the workspace manager interface (Fig. 10). Here users can check the statistics of the downloaded data and verify completion of the download and conversion processes. All the retrieved data will be stored cloud-side and will not be downloaded to the user’s workstation unless requested. Internally the data will be converted into application-specific representations optimized for further processing and for access through UrbIS web services.

At the next stage of the workflow the user will be able to choose from a library of analytical and modeling functionality (Fig. 11). As a part of the initial UrbIS development we are implementing high-performance clustering algorithms for building typologies of cities based on a large number of input parameters. After specifying input parameters, the user will submit a task to one of the high-performance computers. UrbIS will prepare the data in a form suitable for the selected processing method and create a batch configuration file containing commands for the target high-performance platform. The user will be able to initiate processing on the target system directly from the UrbIS interface. After job completion, UrbIS will retrieve the results and convert them into the formats that are used internally.
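The batch configuration file can be generated as plain text. The sketch below emits a generic PBS script; the directives, job name, and executable are placeholders rather than the actual UrbIS job template or target scheduler.

    # Illustrative batch-file generation for an HPC job submission.
    import textwrap

    job = textwrap.dedent("""\
        #!/bin/bash
        #PBS -N urbis_clustering
        #PBS -l nodes=4:ppn=16
        #PBS -l walltime=02:00:00
        cd $PBS_O_WORKDIR
        mpirun ./city_clustering --input cities.bin --clusters 12
        """)
    with open("urbis_job.pbs", "w") as f:
        f.write(job)
    # UrbIS would then submit the file (e.g., via "qsub urbis_job.pbs")
    # and poll for completion before retrieving the results.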

The final step of the workflow is the visualization of the results in a geographic context. For that purpose we are using WebWorldWind (https://webworldwind.org/)—a modern JavaScript version of NASA WorldWind that utilizes WebGL. It can be launched from within popular browsers without the need to download any plugins or desktop applications. The visualization section of UrbIS (Fig. 12) has a user interface typical of a digital globe like Google Earth or NASA WorldWind. Here the user can visualize the input and output data in a geographic context. The data is fed to the visualization component with WMS and WFS services from the internal UrbIS storage. The user can also pull data from any other data source supported by NASA WebWorldWind, including the default WorldWind layers. The user will have the ability to switch between 3D and 2D views and choose the background and portrayal methods most suitable for his visualization purposes.
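The WMS request that feeds a layer to the globe is an ordinary GetMap call. The sketch below builds one; the server URL and layer name are placeholders for the internal UrbIS services.

    # Illustrative WMS 1.3.0 GetMap request for a result layer.
    from urllib.parse import urlencode

    wms_params = {
        "service": "WMS", "version": "1.3.0", "request": "GetMap",
        "layers": "urbis:cluster_result",      # hypothetical layer name
        "crs": "EPSG:4326",
        "bbox": "35.7,-84.5,36.2,-83.9",       # S,W,N,E (1.3.0 axis order)
        "width": 1024, "height": 512,
        "format": "image/png", "transparent": "true",
    }
    url = "https://urbis.example.org/wms?" + urlencode(wms_params)
    # A WebWorldWind WMS layer on the client can consume this endpoint.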

The current implementation of UrbIS is being developed using Node.js for the server-side components. As spatial data storage we are using PostgreSQL with PostGIS extensions. High-performance processing components are implemented as external modules, and they use the languages and tools most appropriate for the specific algorithms and platforms. UrbIS should be accessible from any modern browser with WebGL support enabled (for the visualization component). Internally UrbIS relies on a service-oriented architecture with most functionality exposed through a RESTful programming interface.

Currently UrbIS is in active development and is available for testing to internal users. Its implementation will enable users to use high-performance and cloud-based infrastructure in their research and reduce the time needed for mundane tasks such as data movement and format conversion. UrbIS will also serve as a testing ground for new cloud-based technologies to facilitate the use of large geodata in scientific research within high-performance and cloud-based environments. After the initial release and testing with the internal user community, we will proceed to implement other sets of functionality and extend the library of high-performance analytical routines with other methods and models. In the future we plan to integrate the UrbIS infrastructure with systems like Jupyter Notebooks (http://jupyter.org/) so that users can develop their own code through a web interface and access UrbIS data using web services.

Conclusions

Efforts to understand and analyze data-enabled science have created a clear need to unite various Earth Observation High-performance Computing (EO-HPC) systems, where the best of these various worlds is brought together in one shared Cyber-Infrastructure (CI) platform. In this chapter, we have discussed such a CI platform being developed at Oak Ridge National Laboratory using data-driven GeoComputation, novel analytical algorithms, and emerging technologies. Systems interoperability, scalability, and sustainability play an ever-increasing role in the data-driven and informed decision-making processes in our platform. We have discussed the architectural and technical challenges in the development of our platform, and its broadening implications as illustrated by our research initiatives for data and science production. With technological roots in HPC, our platform is optimized for Earth Observation Big Data, used to accelerate research efforts and foster knowledge discovery and dissemination more quickly and efficiently for US federal agencies.

Acknowledgements The authors would like to thank a number of US federal agencies for their continued support of the research presented here. Sincere gratitude is due to many of our Geographic Information Science and Technology group colleagues for their collaboration and assistance. This paper has been authored by employees of the US Federal Government and of UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy. Accordingly, the US Government retains, and the publisher, by accepting the article for publication, acknowledges that the US Government retains, a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US Government purposes.

References

Bhaduri B et al (2002) LandScan. Geoinformatics 5(2):34–37
Bhaduri B et al (2015a) Emerging trends in monitoring landscapes and energy infrastructures with big spatial data. SIGSPATIAL Spec 6(3):35–45
Bhaduri BL et al (2015b) Monitoring landscape dynamics at global scale: emerging computational trends and successes. Oak Ridge National Laboratory, Oak Ridge, TN
Chandola V et al (2011) iGlobe: an interactive visualization and analysis framework for geospatial data. Proceedings of the 2nd International Conference on Computing for Geospatial Research & Applications, 23 May 2011, p 21
Chowdhury P et al (2015) A comparison of data storage technologies for remote sensing cyber-infrastructures. The International Conference on Big Data Analysis and Data Mining
Kalidindi SR (2015) Data science and cyberinfrastructure: critical enablers for accelerated development of hierarchical materials. Int Mater Rev 60(3):150–168
Karthik R (2014a) SAME4hpc: a promising approach in building a scalable and mobile environment for high-performance computing. Proceedings of the Third ACM SIGSPATIAL International Workshop on Mobile Geographic Information Systems, 4 November 2014, pp 68–71
Karthik R (2014b) Scaling an urban emergency evacuation framework: challenges and practices. Workshop on Big Data and Urban Informatics
OpenStreetMap (2016) https://www.openstreetmap.org. Accessed 20 May 2016
Patlolla DR et al (2012) Accelerating satellite image based large-scale settlement detection with GPU. Proceedings of the 1st ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data, 6 November 2012, pp 43–51
Patlolla D et al (2015) GPU accelerated textons and dense SIFT features for human settlement detection from high-resolution satellite imagery
Smith L et al (1995) TRANSIMS: transportation analysis and simulation system. Los Alamos National Laboratory, New Mexico
Sorokine A et al (2012) Tackling BigData: strategies for parallelizing and porting geoprocessing algorithms to high-performance computational environments. GIScience

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.