OpenTopography - Scalable Services for Geosciences Data



www.opentopography.org

[Figure: lidar-derived canopy height map, in feet]

@opentopography

[email protected]

DATA USAGE ANALYTICS

Spatiotemporal variations in data access show that certain regions of a dataset can be "cold" while others are "hot". OT collects analytics that include user data selections through time. We have developed tools to mine and visualize this information, and are exploring how to use these analytics to develop storage optimizations based on data value and cost.

For the hottest data, fast I/O and scalable access are required; in these cases, data stored on SSD and accessible through HPC systems such as Gordon are desirable. For "cooler" data that is accessed infrequently, cheaper (and slower) storage systems such as the cloud can lower data facility operating costs. A tiered storage system offers the potential to dynamically manage data storage and associated system performance based on real analytical information about usage.
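
As a minimal sketch of how such activity-based tiering could work (the tile names, time window, and thresholds below are hypothetical, not OT's production logic), tiles can be ranked by recent access counts and mapped to a storage tier:

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical access log mined from OT job analytics: (tile_id, access time).
access_log = [
    ("sanandreas/tile_042", datetime(2014, 5, 1)),
    ("sanandreas/tile_042", datetime(2014, 5, 2)),
    ("yosemite/tile_007", datetime(2013, 11, 20)),
]

def rank_tiles(log, now, window_days=90, hot_threshold=2):
    """Assign each tile a storage tier from its recent access count."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter(tile for tile, ts in log if ts >= cutoff)
    # Hot tiles go to SSD behind the HPC systems; cold tiles to cheap cloud storage.
    return {tile: ("ssd_hpc" if counts[tile] >= hot_threshold else "cloud_archive")
            for tile, _ in log}

print(rank_tiles(access_log, now=datetime(2014, 5, 15)))
```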

In the case of topographic data, earthquakes, floods, landslides, and other geophysical events are likely to increase demand for data that intersect the spatial extent of the event. External feeds (e.g., USGS NEIC) could be monitored to proactively move data into high-performance storage in anticipation of increased demand.

Activity-based data ranking and tiered cloud & HPC integrated storage
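
The USGS publishes real-time earthquake feeds as GeoJSON, so one hedged sketch of the proactive promotion idea is to poll such a feed and flag any hosted dataset whose bounding box contains a recent event (the dataset index below is made up for illustration):

```python
import requests

# Real-time significant-earthquake feed from the USGS (GeoJSON).
FEED = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/significant_week.geojson"

# Hypothetical dataset index: bounding boxes as (min_lon, min_lat, max_lon, max_lat).
datasets = {
    "sanandreas_b4": (-122.0, 33.5, -115.5, 37.5),
    "yellowstone": (-111.2, 44.1, -109.8, 45.1),
}

def datasets_to_promote():
    """Return dataset IDs whose extent contains a recent significant event."""
    events = requests.get(FEED, timeout=30).json()["features"]
    hits = set()
    for ev in events:
        lon, lat = ev["geometry"]["coordinates"][:2]  # GeoJSON: [lon, lat, depth]
        for name, (x0, y0, x1, y1) in datasets.items():
            if x0 <= lon <= x1 and y0 <= lat <= y1:
                hits.add(name)  # candidate for promotion to the fast (SSD/HPC) tier
    return hits

print(datasets_to_promote())
```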

HPC & CLOUD INTEGRATION

1. On-demand job execution on Gordon (XSEDE HPC Resource)

OT has a dedicated Gordon I/O-node XSEDE allocation: 48 GB of memory and 4.8 TB of flash storage, plus 16 compute nodes (256 cores) with 64 GB of memory each, connected by a QDR InfiniBand interconnect.

Performance tests using a DEM generation use case showed 20x job speed-ups when four concurrent jobs are executed on Gordon versus OT's standard compute cluster. Test case: 208 million lidar returns gridded to a 20 cm grid.

USE CASE: TauDEM hydrologic analysis of DEMs

TauDEM is an open-source hydrologic analysis toolkit developed by David Tarboton (USU): http://www.engineering.usu.edu/dtarb/

As part of OT's CyberGIS collaboration, we implemented TauDEM (MPI) on Gordon. We dynamically scale the number of cores allocated to a job as a function of the size of the input DEM; a sketch of such a scaling rule follows.
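
The exact scaling rule is not given on the poster; the sketch below illustrates the idea under assumed parameters (cells per core, 16-core Gordon nodes, a PBS-style batch script), using TauDEM's real pitremove tool as the example job:

```python
def cores_for_dem(n_cells, cells_per_core=25_000_000, max_cores=256):
    """Choose an MPI core count proportional to DEM size (illustrative rule)."""
    return max(1, min(max_cores, -(-n_cells // cells_per_core)))  # ceiling division

def pbs_script(dem_path, n_cells):
    """Emit a PBS-style submission script sized to the input DEM."""
    cores = cores_for_dem(n_cells)
    nodes = -(-cores // 16)  # Gordon compute nodes have 16 cores each
    return f"""#!/bin/bash
#PBS -l nodes={nodes}:ppn=16
#PBS -l walltime=01:00:00
mpirun -np {cores} pitremove -z {dem_path} -fel out_fel.tif
"""

# e.g., a 40,000 x 40,000-cell DEM -> 64 cores on 4 nodes
print(pbs_script("input_dem.tif", 40_000 * 40_000))
```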

2. Integration of cloud-based on-demand geospatial processing services

OT received a Microsoft Azure for Research Award ($40k in allocated Azure resources) to explore integration of cloud resources into our existing infrastructure. A prototype OT image on VM Depot allows us (or others) to quickly deploy the OT software stack on an appropriately sized resource. Data can be pulled from OT's storage on the SDSC Cloud for processing in Azure; a sketch of that transfer follows.
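
Assuming the SDSC Cloud object store exposes an OpenStack Swift-compatible API (an assumption here; the endpoint, credentials, and object names are placeholders), the pull into an Azure VM could look like:

```python
from swiftclient.client import Connection

# Placeholder endpoint and credentials; assumes a Swift-compatible object store.
conn = Connection(
    authurl="https://cloud.sdsc.edu/auth/v1.0",  # hypothetical auth endpoint
    user="ot_account:ot_user",
    key="SECRET",
)

# Fetch one point-cloud tile onto the VM's local disk for processing.
headers, data = conn.get_object("pointclouds", "sanandreas/tile_042.laz")
with open("tile_042.laz", "wb") as f:
    f.write(data)
```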

CYBERINFRASTRUCTURE

The OpenTopography cyberinfrastructure employs a multi-tier service-oriented architecture (SOA) that is highly scalable, permitting upgrades to the infrastructure tier and corresponding algorithms without the need to update the APIs and clients. The SOA has enabled the integration of compute-intensive algorithms, such as the TauDEM hydrology suite running on the Gordon XSEDE resource, as services made available to the OpenTopography user community. The pluggable services architecture allows researchers to integrate their own algorithms into the OpenTopography processing workflow. OpenTopography also interoperates with other CI systems, such as the NSF-funded CyberGIS viewshed analysis application and NASA SSARA.
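
The plugin contract itself is not specified on the poster; a minimal illustration of what a pluggable processing service could look like (the interface and names are hypothetical, not OT's actual API):

```python
from abc import ABC, abstractmethod

class ProcessingService(ABC):
    """Illustrative plugin contract: each algorithm exposes a single run()
    entry point that the workflow tier can invoke uniformly."""

    name: str

    @abstractmethod
    def run(self, dem_path: str, params: dict) -> str:
        """Process the input DEM and return the path (or status) of the product."""

class TauDEMFlowDirection(ProcessingService):
    name = "taudem_d8_flowdir"

    def run(self, dem_path, params):
        # In production this would dispatch an MPI job to Gordon; here we
        # only show where that dispatch would happen.
        cores = params.get("cores", 16)
        return f"submitted d8flowdir on {dem_path} with {cores} cores"

# The workflow tier discovers registered services and calls them by name.
registry = {svc.name: svc for svc in [TauDEMFlowDirection()]}
print(registry["taudem_d8_flowdir"].run("dem.tif", {"cores": 64}))
```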

DOI / OGC CSW

OpenTopography implements Catalog Services for the Web (CSW) using the ISO 19115 metadata standard, so its catalog can be federated with other environments, e.g., NSF EarthCube and Thomson Reuters Web of Science. All datasets served via OpenTopography are assigned a DOI, which provides a persistent identifier for the dataset and makes it citable.
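
Such a catalog can be queried programmatically, for example with OWSLib; the endpoint URL below is a placeholder, not a confirmed OT address:

```python
from owslib.csw import CatalogueServiceWeb
from owslib.fes import PropertyIsLike

# Placeholder endpoint; any CSW 2.0.2 catalog serving ISO 19115 records works.
csw = CatalogueServiceWeb("https://portal.opentopography.org/geoportal/csw")

# Full-text search across the catalog's metadata records.
query = PropertyIsLike("csw:AnyText", "%San Andreas%")
csw.getrecords2(constraints=[query], maxrecords=5, esn="full")

for rec_id, rec in csw.records.items():
    print(rec_id, rec.title)
```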

The cover image of Science featured a 0.25 m digital elevation model (DEM) and hillshade of offset channels along the San Andreas Fault in the Carrizo Plain, produced by OpenTopography.

The OpenTopography facility was funded by the National Science Foundation (NSF) in 2009 to provide efficient online access to Earth science-oriented high-resolution lidar topography data, online processing tools, and derivative products. Currently, OpenTopography serves 183 high-resolution lidar (light detection and ranging) point cloud datasets comprising over 820 billion returns and covering approximately 179,153 sq. km of important geologic features such as the San Andreas Fault, Yellowstone, the Tetons, and Yosemite National Park. Information collected from over 42,250 custom point cloud jobs, which have processed upwards of 1.4 trillion lidar returns, and over 19,800 custom raster jobs is being analyzed to prioritize future development based on usage insights and to identify novel approaches to managing the exponential growth in data.

Collaboration Opportunities

Analysis of user behavior and data usage for optimizing data location in deep storage/memory hierarchies

Pluggable services framework - Tracking software provenance / framework security

New data types - full-waveform lidar and hyperspectral imagery

New processing algorithms - change detection, difference analysis (see the sketch below), and time series analysis; algorithm optimization/parallelization
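
As one example of the difference-analysis direction, a minimal DEM-of-difference sketch (the input file names are hypothetical, and both rasters are assumed to share the same extent and grid):

```python
import numpy as np
import rasterio

# Hypothetical pre- and post-event DEMs on an identical grid.
with rasterio.open("dem_2010.tif") as pre, rasterio.open("dem_2014.tif") as post:
    a = pre.read(1, masked=True).astype("float64")
    b = post.read(1, masked=True).astype("float64")
    profile = pre.profile

dod = b - a  # DEM of difference: positive = deposition, negative = erosion

profile.update(dtype="float64")
with rasterio.open("dod.tif", "w", **profile) as dst:
    dst.write(dod.filled(np.nan), 1)
```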