Extending Climate Analytics-as-a-Service to the …...datasets using advanced technologies like...

1
Powerful computing resources are necessary for on-demand analytic processing across 34+ years of reanalysis data. The MapReduce operations currently leverage a cluster consisting of 36 Dell R720 servers each populated with six 2-terabyte hard drives in addition to their operating system disks (CentOS 6.8). For interconnectivity, there is a 36-port 56 Gbps Ethernet switch plus a 48-port GigE switch. Initial cluster metrics show peaks of 314 gigaflops/server and an overall capability of approximately 11 teraflops. This testbed, which will ultimately replace the current cluster, will incorporate new technologies (e.g., alternate software stacks, parallel HDFS file systems). Extending Climate Analytics-as-a-Service to the Earth System Grid Federation Progress Report on the Reanalysis Ensemble Service ABSTRACT We are extending our Climate Analytics-as-a-Service (CAaaS) capabilities to include the following: (1) A Reanalysis Ensemble Service (RES) offering a basic set of commonly used operations over multiple reanalysis collections that are accessible through NASA's climate data analytics web services and our client-side Climate Data Services Python library, CDSlib; (2) a high-performance Virtual Real-Time Analytics Testbed supporting multiple major datasets using advanced technologies like Apache Spark and Hadoop-based MapReduce analytics over native NetCDF files; and (3) an Open Geospatial Consortium (OGC) WPS-compliant web service interface to CDSLib to accommodate the Earth System Grid Federation (ESGF) web service endpoints. Glenn S. Tamkin 1 , John L. Schnase 1 , Daniel Q. Duffy 2 , Jian Li 1 , John H. Thompson 2 , Savannah L. Strong 1 1 COMPUTATIONAL AND INFORMATION SCIENCES AND TECHNOLOGY OFFICE 2 NASA CENTER FOR CLIMATE SIMULATION NASA GODDARD SPACE FLIGHT CENTER For Additional Information [email protected] [email protected] [email protected] IN13A-1649 ESGF Support RES Jupyter Notebook New Features A Python library (CDSlib) that supports full temporal, spatial, and grid-based resolution services A new reanalysis collections reference model to enable operator design and implementation An enhanced library of sample queries to demonstrate and develop use case scenarios Extended operators that enable single- and multiple- reanalysis area average, vertical average, re-gridding, and trend, climatology, and anomaly computations Full support for the MERRA-2 reanalysis and the initial integration of additional reanalyses (CFSR, ECMWF, 20CR, JRS-55), A Jupyter notebook-based distribution mechanism that combines CDSlib documentation with interactive use case scenarios and personalized project management Prototyped uncertainty quantification services that combine ensemble products with comparative observational products Convenient, one-stop shopping for commonly used data products from multiple reanalyses, including basic subsetting and arithmetic operations over the data and extractions of trends, climatologies, and anomalies Uncertainty analysis pervades all levels of climate impact assessments and climate data analysis. Preparing data for uncertainty quantification is a particularly time-consuming step. The Reanalysis Ensemble Service (RES) is an effort to make data quickly and easily ready for assessment. RES provides an Uncertainty Quantification Package (UQP) that contains the requested RES products along with context-sensitive peer products that can be used to characterize uncertainty. The following example demonstrates a possible use of a UQP for evaluating global precipitation among multiple reanalysis datasets. The workflow consists of the following tasks: 1) Calculate global average precipitation from five reanalysis collections (MERRA2, CFSR, JRA-55, NOAA 20CRv2c, and ERA-Interim). 2) Calculate global average precipitation from two observed climate records (CMAP and GPCP). 3) Calculate global ensemble average for the reanalysis collections. 3) Generate line plots to show long-term variability of global precipitation in each dataset. 4) Generate a Taylor diagram to summarize the similarity of global precipitation pattern between CMAP and other datasets. Example RES Workflow: Uncertainty Quantification Uncertainty Quantification Package Comparison of global mean precipitation values across multiple reanalysis (MERRA2, CFSR, JRA-55, NOAA 20CRv2c, ERA-Interim) and observational (CMAP, GPCP) datasets using CMAP as the reference correlation RES Client Distribution Virtual Real-Time Analytics Testbed Node Manager Resource Manager /merra 5TB /hadoop_fs 1TB /mapred 1TB Data Node 1 /hadoop_fs 16TB /mapred 16TB /hadoop_fs 16TB /mapred 16TB /hadoop_fs 16TB /mapred 16TB Head Nodes Data Nodes FDR IB Data Node 2 Data Node 34 RES Data 180TB Raw LAN The CDSlib Python library makes service-oriented client development easier. CDSlib simplifies access to RES’s underlying RESTful web service, which is based on the International Standards Organization (ISO) Open Archival Information System (OAIS) reference model. The library is provided to users via a Jupyter notebook distribution that includes several example applications ranging from simple web service calls to complex workflows. The Uncertainty Quantification workflow will be included in the latest distribution. See https://cds.nccs.nasa.gov/wp-content/test. We are extending our web service to support the Web Processing Service (WPS) API. The ESGF has adopted OGC’s WPS interface standard for its next-generation architecture. Adding WPS capabilities to our OAIS-based web service will facilitate integration of NASA’s capabilities into the ESGF by increasing machine-to-machine interoperability. Enabling the API to consume WPS web service endpoints will facilitate client software and workflow development. Greater system-to-system interoperability improves connectivity and, in the case of WPS, allows the ESGF community to avail itself of WPS-compliant capabilities that exist within the geospatial community. Having an API makes it easier to create toolkits, workbenches, workflows, and plug-ins tailored to the ESGF that can improve efficiencies and communication within the community. Web Service Interface Adapter Module REST Module Network Client Application Python Library Interactions REST Interface Interactions WPS API Extended Utilities Basic Utilities REST Interface Interactions CDS API Extended Utilities Basic Utilities REST Interface Interactions . . . Data Service (retrieval, sub-setting,…) Aggregation Service (packaging, UQP,…) Analytic Service (workflow,…) Reanalysis Ensemble Service - Notional Architecture Internal Data Service (Reanalysis) MERRA2 JR55 20CR ECMWF CFSR Reanalysis Ensemble Service External Data Service (Observationa l) ... MODIS CMAP LANDSAT GPCP Wrangler ... ESGF EOS NCAR NOAA Other External Data Sources High-Performance Data Analytic Platform (real-time, batch…) Regridding Service (resolution alignment, …) Formatting Service (NetCDF->Geotiff, …) Transformation Service (avg, min, max, sum, var,…) Statistics Service (mean, standard deviation….) Persistence Service (cached results, …)

Transcript of Extending Climate Analytics-as-a-Service to the …...datasets using advanced technologies like...

Page 1: Extending Climate Analytics-as-a-Service to the …...datasets using advanced technologies like Apache Spark and Hadoop-based MapReduce analytics over native NetCDF files; and (3)

Powerful computing resources are necessary for on-demand analytic processing across 34+ years of reanalysis data.

• The MapReduce operations currently leverage a cluster consisting of 36 Dell R720 servers each populated with six 2-terabyte hard drives in addition to their operating system disks (CentOS 6.8).

• For interconnectivity, there is a 36-port 56 Gbps Ethernet switch plus a 48-port GigE switch. Initial cluster metrics show peaks of 314 gigaflops/server and an overall capability of approximately 11 teraflops.

• This testbed, which will ultimately replace the current cluster, will incorporate new technologies (e.g., alternate software stacks, parallel HDFS file systems).

Extending Climate Analytics-as-a-Service to the Earth System Grid FederationProgress Report on the Reanalysis Ensemble Service

ABSTRACTWe are extending our Climate Analytics-as-a-Service (CAaaS) capabilities to include the following: (1) A Reanalysis Ensemble Service (RES) offering abasic set of commonly used operations over multiple reanalysis collections that are accessible through NASA's climate data analytics web services andour client-side Climate Data Services Python library, CDSlib; (2) a high-performance Virtual Real-Time Analytics Testbed supporting multiple majordatasets using advanced technologies like Apache Spark and Hadoop-based MapReduce analytics over native NetCDF files; and (3) an Open GeospatialConsortium (OGC) WPS-compliant web service interface to CDSLib to accommodate the Earth System Grid Federation (ESGF) web service endpoints.

Glenn S. Tamkin1, John L. Schnase1, Daniel Q. Duffy2, Jian Li1, John H. Thompson2, Savannah L. Strong1

1COMPUTATIONAL AND INFORMATION SCIENCES AND TECHNOLOGY OFFICE2 NASA CENTER FOR CLIMATE SIMULATION

NASA GODDARD SPACE FLIGHT CENTER

For Additional [email protected]@[email protected]

IN13A-1649

ESGF Support

RES Jupyter NotebookNew Features

• A Python library (CDSlib) that supports full temporal, spatial, and grid-based resolution services

• A new reanalysis collections reference model to enable operator design and implementation

• An enhanced library of sample queries to demonstrate and develop use case scenarios

• Extended operators that enable single- and multiple-reanalysis area average, vertical average, re-gridding, and trend, climatology, and anomaly computations

• Full support for the MERRA-2 reanalysis and the initial integration of additional reanalyses (CFSR, ECMWF, 20CR, JRS-55),

• A Jupyter notebook-based distribution mechanism that combines CDSlib documentation with interactive use case scenarios and personalized project management

• Prototyped uncertainty quantification services that combine ensemble products with comparative observational products

• Convenient, one-stop shopping for commonly used data products from multiple reanalyses, including basic subsetting and arithmetic operations over the data and extractions of trends, climatologies, and anomalies

Uncertainty analysis pervades all levels of climate impact assessments and climate data analysis. Preparing data for uncertainty quantification is a particularly time-consuming step. The Reanalysis Ensemble Service (RES) is an effort to make data quickly and easily ready for assessment. RES provides an Uncertainty Quantification Package (UQP) thatcontains the requested RES products along with context-sensitive peer products that can be used to characterize uncertainty. The following example demonstrates a possible use of a UQP for evaluating global precipitation among multiple reanalysis datasets. The workflow consists of the following tasks:1) Calculate global average precipitation from five reanalysis collections (MERRA2, CFSR, JRA-55, NOAA 20CRv2c, and ERA-Interim).2) Calculate global average precipitation from two observed climate records (CMAP and GPCP).3) Calculate global ensemble average for the reanalysis collections.3) Generate line plots to show long-term variability of global precipitation in each dataset.4) Generate a Taylor diagram to summarize the similarity of global precipitation pattern between CMAP and other datasets.

Example RES Workflow: Uncertainty Quantification

Uncertainty Quantification PackageComparison of global mean precipitation values across multiple reanalysis (MERRA2, CFSR, JRA-55, NOAA 20CRv2c, ERA-Interim) and

observational (CMAP, GPCP) datasets using CMAP as the reference correlation

RES Client Distribution

Virtual Real-Time Analytics Testbed

NodeManager

ResourceManager

/merra5TB

/hadoop_fs1TB

/mapred1TB

DataNode1

/hadoop_fs16TB

/mapred16TB

DataNode2

/hadoop_fs16TB

/mapred16TB

DataNode8

/hadoop_fs16TB

/mapred16TB

Head Nodes

Data Nodes

FDRIB

DataNode2

DataNode34

RESData

180TB Raw

LAN

The CDSlib Python library makes service-oriented client development easier. CDSlib simplifies access to RES’s underlying RESTful web service, which is based on the International Standards Organization (ISO) Open Archival Information System (OAIS) reference model.

The library is provided to users via a Jupyter notebook distribution that includes several example applications ranging from simple web service calls to complex workflows.

The Uncertainty Quantification workflow will be included in the latest distribution. See https://cds.nccs.nasa.gov/wp-content/test.

We are extending our web service to support the Web Processing Service (WPS) API. The ESGF has adopted OGC’s WPS interface standard for its next-generation architecture. Adding WPS capabilities to our OAIS-based web service will facilitate integration of NASA’s capabilities into the ESGF by increasing machine-to-machine interoperability. Enabling the API to consume WPS web service endpoints will facilitate client software and workflow development.

Greater system-to-system interoperability improves connectivity and, in the case of WPS, allows the ESGF community to avail itself of WPS-compliant capabilities that exist within the geospatial community.

Having an API makes it easier to create toolkits, workbenches, workflows, and plug-ins tailored to the ESGF that can improve efficiencies and communication within the community.

Web Service

Interface

Ada

pter

Mod

ule

RES

T M

odul

e

Network

ClientApplication

PythonLibrary

Interactions

RESTInterfaceInteractions

WPS API

ExtendedUtilities

BasicUtilities

RESTInterfaceInteractions

CDS API

ExtendedUtilities

BasicUtilities

RESTInterfaceInteractions

. . .

Data Service(retrieval, sub-setting,…)

Aggregation Service(packaging, UQP,…)

Analytic Service (workflow,…)

Reanalysis Ensemble Service - Notional Architecture

Internal DataService

(Reanalysis)

MERRA2

JR55

20CR

ECMWF

CFSR

Reanalysis Ensemble Service

External DataService

(Observational)

...

MODIS

CMAP

LANDSAT

GPCP

Wrangler

...

ESGF

EOS

NCAR

NOAA

Other

External Data Sources

High-PerformanceData Analytic Platform

(real-time, batch…)

Regridding Service(resolution alignment, …)

Formatting Service(NetCDF->Geotiff, …)

Transformation Service(avg, min, max, sum,

var,…)Statistics Service

(mean, standard deviation….)

Persistence Service(cached results, …)