Transcript of "Overview of Different Data Formats that a Climate Service Can Use and Manage"
Ananda Kumar Das
NWP Division
Overview of Different Data Formats
that a
Climate Service can Use and Manage
Organization
Introduction – What is data?
Data types in climate services
Data formats for different types of data
Brief description of a few different formats
NetCDF
HDF
GRIB
BUFR
Climate data formats in India
Summary
1. Facts and statistics collected together for reference or analysis. // The quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media. ~ Oxford dictionary
2. Information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer. ~ Cambridge English Dictionary
What is Data ?
The first English use of the word "data" is from the 1640s. Using the word "data" to mean "transmittable and storable computer information" was first done in 1946. The expression "data processing" was first used in 1954.
https://www.etymonline.com/word/data
What is Data ?
How do we:
• Describe the data?
• Read it? Store it? Find it? Share it? Mine it?
• Move it into, out of, and between computers and repositories?
• Achieve storage and I/O efficiency?
• Give applications and tools easy access to our data?
What is Data ?
Point observations: station data
Rawinsonde: a line of observations
Aircraft data: a line of observations
"Raw" satellite data: a swath of observations
Gridded fields (2D or 3D): numerical model output, satellite products, radar products
Data Types
Data from Observing Systems
Ground-based observations e.g. SYNOP, SHIP, BUOY, IVOF, METAR/SPECI, AWS/ARG, and Met. towers
Ocean observations e.g. drifting and moored buoys, ARGO floats, VOS/VOSClim, GLOSS, XBT, etc.
Upper-air Observations e.g. TEMP, PILOT, AIREP, ACARS and Profilers
Remotely sensed Satellite and Radar observations
Ground-based radiometric obs., LIDAR, SODAR, lightning detectors, etc.
Air quality e.g. observations of Ozone, CO2, Other gases and PMs
Data from Modeling Systems
Global / regional reanalysis
Global / regional Forecast Systems (Coupled or Atmospheric)
Ocean Analysis/reanalysis
Ocean Global/regional Modeling system
⇒ Four-dimensional data
Data Types
Various data formats
Character formats: ASCII/EBCDIC, TAC format
Regular text and tabular messages, coded messages, etc. - easily read, but mostly not suitable for archival; used locally
Flat/Packed Binary
Description- and system-dependent; not suitable for transmission; used locally
WMO Formats
GRIB (GRIdded Binary) 1 & 2 - WMO
BUFR (Binary Universal Form for the Representation of meteorological data) - WMO
Other Formats
NetCDF (Network Common Data Form) – Unidata/UCAR, USA
HDF (Hierarchical Data Format) – National Center for Supercomputing Applications (NCSA), USA
XML (Extensible Markup Language) - World Wide Web Consortium (W3C)
McIDAS (Man computer Interactive Data Access System) - University of Wisconsin–Madison, USA
What is NetCDF?
Network Common Data Form (NetCDF)
An interface for (array-oriented) data access
A collection of libraries of data-access routines (for Fortran, C++, etc.)
A machine-independent format for scientific data
Developed at Unidata UCAR
More information at:
www.unidata.ucar.edu/software/netcdf/
Positive Attributes of NetCDF
Self-describing – includes information about the data it contains
Portable - A machine-independent binary data format
Direct-access – can access efficiently a subset of the dataset
NetCDF Classic Model and NetCDF Enhanced Model
[Diagram: variables, each carrying attributes; in the enhanced model these are organized into nested groups]
A netCDF-4 file can organize variables, dimensions, and attributes in groups, which can be nested.
An Example netCDF File
NetCDF APIs
The netCDF core library is written in C and Java.
Fortran 77 is "faked" when netCDF is built – C functions are actually called by the Fortran 77 API.
The C++ API also calls the C API; a new C++ API is under development to support netCDF-4 more fully.
C API
nc_create(FILE_NAME, NC_CLOBBER, &ncid);
nc_def_dim(ncid, "x", NX, &x_dimid);
nc_def_dim(ncid, "y", NY, &y_dimid);
dimids[0] = x_dimid;
dimids[1] = y_dimid;
nc_def_var(ncid, "data", NC_INT, NDIMS, dimids, &varid);
nc_enddef(ncid);
nc_put_var_int(ncid, varid, &data_out[0][0]);
nc_close(ncid);
netCDF Library & Programming Model
Modes: definition mode, data mode
IDs: dataset ID, dimension ID, variable ID, attribute number
Create & write · Read by name · Read sequentially · Add dim, var, att
Fortran API
call check( nf90_create(FILE_NAME, NF90_CLOBBER, ncid) )
call check( nf90_def_dim(ncid, "x", NX, x_dimid) )
call check( nf90_def_dim(ncid, "y", NY, y_dimid) )
dimids = (/ y_dimid, x_dimid /)
call check( nf90_def_var(ncid, "data", NF90_INT, dimids, varid) )
call check( nf90_enddef(ncid) )
call check( nf90_put_var(ncid, varid, data_out) )
call check( nf90_close(ncid) )
New C++ API (cxx4)
The existing C++ API works with netCDF-4 classic model files. It was written before many features of C++ became standard, and thus needed updating. A new C++ API has been partially developed; you can build the new API (which is not complete!) with --enable-cxx4.
Java API
dataFile = NetcdfFileWriteable.createNew(filename, false);
// Create netCDF dimensions,
Dimension xDim = dataFile.addDimension("x", NX );
Dimension yDim = dataFile.addDimension("y", NY );
ArrayList dims = new ArrayList();
// define dimensions
dims.add( xDim);
dims.add( yDim);
...
Tools
ncdump – ASCII or NcML dump of data file.
ncgen – Take ASCII or NcML and create data file.
nccopy – Copy a file, changing format, compression, chunking, etc.
Conventions
The NetCDF User's Guide recommends some conventions (e.g. the "units" and "Conventions" attributes).
Conventions are published agreements about how data of a particular type should be represented to foster interoperability.
Most conventions use attributes.
Use of an existing convention is highly recommended.
A netCDF file should use the global "Conventions" attribute to identify which conventions it uses.
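As an illustration, here is a minimal CDL sketch (the text form that ncdump emits and ncgen reads; the file and variable names are invented for this example) showing per-variable "units" attributes and the global "Conventions" attribute:

```
netcdf example {            // hypothetical file
dimensions:
    time = UNLIMITED ;
    lat = 73 ;
    lon = 144 ;
variables:
    float tas(time, lat, lon) ;
        tas:units = "K" ;
        tas:standard_name = "air_temperature" ;

// global attributes:
        :Conventions = "CF-1.6" ;
}
```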
Climate and Forecast Conventions
The CF Conventions are becoming a widely used standard for atmospheric, ocean, and climate data.
The NetCDF Climate and Forecast (CF) Metadata Conventions, Version 1.3 onwards, describes consensus representations for climate and forecast data using the netCDF-3 data model.
LibCF
The NetCDF CF Library supports the creation of scientific data files conforming to the CF conventions, using the netCDF API.
Distributed with netCDF.
GRIDSPEC: A standard for the description of grids used in Earth System models, developed by V. Balaji, GFDL, proposed as a Climate and Forecast (CF) convention.
UDUNITS
The Unidata units library, udunits, supports conversion of unit specifications between formatted and binary forms, arithmetic manipulation of unit specifications, and conversion of values between compatible scales of measurement.
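The idea of converting values between compatible scales of measurement can be shown with a tiny sketch (this is not the udunits API; the helper below is invented, purely to illustrate that compatible units are related by a linear map y = slope·x + intercept):

```python
# Illustrative sketch only (not the udunits API): conversion between
# compatible units is a linear transformation y = slope * x + intercept.

def make_converter(slope, intercept):
    """Return a function converting values between two compatible units."""
    return lambda x: slope * x + intercept

# degrees Celsius -> degrees Fahrenheit: F = 1.8 * C + 32
c_to_f = make_converter(1.8, 32.0)

# Kelvin -> degrees Celsius: C = K - 273.15
k_to_c = make_converter(1.0, -273.15)

print(c_to_f(100.0))   # 212.0
print(k_to_c(273.15))  # 0.0
```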
Now being distributed with netCDF.
HDF
The Hierarchical Data Format is available in two versions: the original HDF4 and the more recent HDF5. Unfortunately, the HDF4 and HDF5 interfaces and data models are completely incompatible. The HDF5 data model is more flexible and is "a true hierarchical file structure, similar to the Unix file system." HDF5 also has some new features that are appealing to climate research, such as parallel I/O and variable compression.
HDF-EOS5 (HDF5-based Earth Observing System format) defines three additional data types based on HDF objects: grid, point, and swath.
These data types allow the file contents to be referenced to Earth coordinates, such as latitude and longitude, and to time.
For Details: https://support.hdfgroup.org/HDF5/
Primary Objects
Groups
Datasets
Additional ways to organize and annotate data
Attributes
Storage and access properties
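The group/dataset hierarchy can be illustrated with a small stand-alone sketch (this is not the HDF5 library; the class and function names below are invented): groups nest like directories, datasets are the leaves, and both can carry attributes.

```python
# Illustrative sketch of the HDF5 data model (not the HDF5 library):
# groups nest like directories, datasets are leaves, both carry attributes.

class Node:
    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = attrs or {}

class Dataset(Node):
    def __init__(self, name, data, attrs=None):
        super().__init__(name, attrs)
        self.data = data

class Group(Node):
    def __init__(self, name, attrs=None):
        super().__init__(name, attrs)
        self.members = {}

    def add(self, node):
        self.members[node.name] = node
        return node

def walk(group, prefix=""):
    """Yield h5dump-style full paths such as /model/temperature."""
    for name, node in group.members.items():
        path = prefix + "/" + name
        yield path
        if isinstance(node, Group):
            yield from walk(node, path)

root = Group("/")
model = root.add(Group("model"))
model.add(Dataset("temperature", [[1, 2], [3, 4]], {"units": "K"}))
print(list(walk(root)))  # ['/model', '/model/temperature']
```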
HDF Data Model
[Diagram of an HDF dataset: data + metadata. The metadata comprise a dataspace (rank 3; dimensions Dim_1 = 4, Dim_2 = 5, Dim_3 = 7), a datatype (integer), attributes (Time = 32.4, Pressure = 987, Temp = 56), and storage info (chunked, compressed).]
10/15/08 HDF & HDF-EOS Workshop XII 29
Useful Tools For New Users
h5dump:
Tool to “dump” or display contents of HDF5 files
h5cc, h5c++, h5fc:
Scripts to compile applications
HDFView:
Java browser to view HDF4 and HDF5 files
H5dump Command-line Utility To View HDF5 File
h5dump [--header] [-a <names>] [-d <names>] [-g <names>] [-l <names>] [-t <names>] [-p] <file>
--header Display header only; no data is displayed.
-a <names> Display the specified attribute(s).
-d <names> Display the specified dataset(s).
-g <names> Display the specified group(s) and all their members.
-l <names> Display the value(s) of the specified soft link(s).
-t <names> Display the specified named datatype(s).
-p Display properties.
<names> is one or more appropriate object names.
HDF5 "dset.h5" {
GROUP "/" {
DATASET "dset" {
DATATYPE { H5T_STD_I32BE }
DATASPACE { SIMPLE ( 4, 6 ) / ( 4, 6 ) }
DATA {
1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24
}
}
}
}
Example of h5dump output: the root group "/" contains the dataset 'dset'.
WMO Migration: TAC (Traditional Alphanumeric Codes)
to TDCF (Table Driven Code Form)
Need for:
• SELF DESCRIPTION
• FLEXIBILITY
• EXPANDABILITY
• SUSTAINABILITY
• COMPRESSION (PACKING) FOR BUFR
• EASY READABILITY FOR CREX
• DATA OF BETTER QUALITY LEADING TO BETTER PRODUCTS
Blocking factors
• The current (but old) Traditional Alphanumeric Codes (TAC) prevent the exchange of critical environmental information to NMHSs and their customers
• They make for an unnecessarily inefficient and costly use of resources for information exchange
• They will increasingly constrain NMHS capabilities to exchange more accurate and timely information
• They will become increasingly costly to sustain as exchanges expand and scientific needs grow.
BUFR and CREX (Character form for the Representation and EXchange of data) – for point data
GRIB – for gridded data
WMO specifies the format; there are many implementations. Table-based: the tables supply much of the metadata. Want more metadata? Request a new table! Bureaucratic: UN (WMO) + governmental agencies.
Pros: a transmission format; good packing/compression; many messages, not one big chunk of data.
WMO Formats
http://www.wmo.int/pages/prog/www/WMOCodes.html
GRIB is based on messages, and each message is complete.
… (1000 hPa TMP)(900 hPa TMP)(800 hPa TMP)(700 hPa TMP) …
(problem)
… (1000 hPa TMP)????????????(800 hPa TMP)(700 hPa TMP) …
Suppose there was a problem while the 900 hPa TMP was being sent. Since each GRIB message is complete, you have lost only the 900 hPa TMP.
Note: GRIB2 allows messages to be combined into larger messages. This saves some space, because metadata like the grid description would otherwise be duplicated.
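The message framing described above can be sketched in a few lines of Python (a minimal illustration assuming GRIB2 framing: "GRIB" magic, edition number in octet 8, 64-bit total length in octets 9-16, trailing "7777"; the toy messages below are invented, not real products):

```python
import struct

def split_grib2_messages(stream: bytes):
    """Split a byte stream into complete GRIB2 messages.

    GRIB2 Section 0: octets 1-4 "GRIB", octet 8 edition, octets 9-16
    total message length; Section 8 terminates the message with "7777".
    """
    messages = []
    pos = 0
    while True:
        start = stream.find(b"GRIB", pos)
        if start < 0 or start + 16 > len(stream):
            break
        edition = stream[start + 7]
        total_len = struct.unpack(">Q", stream[start + 8:start + 16])[0]
        msg = stream[start:start + total_len]
        if edition == 2 and len(msg) == total_len and msg.endswith(b"7777"):
            messages.append(msg)       # complete, well-terminated message
            pos = start + total_len
        else:
            pos = start + 4            # resync after a damaged message
    return messages

def toy_msg(payload: bytes) -> bytes:
    """Build an invented, minimal GRIB2-framed message (not a real product)."""
    body = payload + b"7777"
    total = 16 + len(body)
    return b"GRIB\x00\x00\x00\x02" + struct.pack(">Q", total) + body

# one good message, a damaged chunk, another good message
stream = toy_msg(b"TMP1000") + b"????damaged????" + toy_msg(b"TMP800")
msgs = split_grib2_messages(stream)
print(len(msgs))  # 2: only the damaged chunk in between is lost
```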
Stream of data
GRIB
Section  Name                               Length              Notes
0        Indicator Section (IS)             16 bytes            starts with "GRIB"
1        Identification Section             21 bytes
2 *      Local Use Section                                      optional - may contain anything
3 **     Grid Definition Section (GDS)      52 bytes?
4 ***    Product Definition Section (PDS)   14-50 bytes?
5 ***    Data Representation Section        typically 21 bytes  depends on compression
6 ***    Bit-map Section                    6 bytes + bitmap    optional
7 ***    Binary Data Section                5 bytes + data
8        End Section                        4 bytes             "7777"
*, **, *** – these sections can be repeated (nested)
GRIB
• Developed by WMO for the exchange of gridded data
• Allows detailed description of a huge variety of grids, parameters, and processes, represented by codes that reference external tables
• Portable, implemented as octets (groups of 8 bits)
GRIB2: packing/compression
Models are getting bigger faster than bandwidth increases. Compression helps, but the full-resolution forecasts still cannot be sent to the users.
[Chart: GFS 0.25-degree forecast, single forecast hour, size of all fields in MB (scale 0-3500), by packing method: Raw (IEEE), Simple (= scaled integers, also used in GRIB1), Complex 1, AEC, Complex 2, Complex 3 with bitmap, Complex 3, Best Complex, JPEG2000]
GRIB2: packing/compression
• JPEG2000: very good compression but very slow; poor when data has undefined values; widely used at NCEP
• Complex: encodes the data, deltas, or deltas of deltas; 6 flavors of complex; good compression but not as good as JPEG2000 (Best = +9% size, c3 = +12%, c3.bitmap = +16%); better than JPEG2000 when there are undefined values; very fast decoding (20x JPEG2000); increasing use at NCEP (speed vs size)
• PNG: compression not as good as JPEG2000
• RLE: run-length encoding; Japanese radar products
• AEC: open-source szip; new, fast
• Simple: big files, very fast to pack/unpack
Packing should be transparent to the user.
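Simple packing (the scaled-integer scheme mentioned above) can be sketched as follows. This is an illustration of the idea, not the GRIB library: D is the decimal scale factor, E the binary scale factor, and R the reference value, following the relation Y = (R + X·2^E) / 10^D; the sample values are invented.

```python
def simple_pack(values, D=2, E=0):
    """Pack floats as scaled integers: X = round((v * 10**D - R) / 2**E),
    where R is the reference value (minimum of the scaled field)."""
    scaled = [v * 10**D for v in values]
    R = min(scaled)
    packed = [round((s - R) / 2**E) for s in scaled]
    return packed, R

def simple_unpack(packed, R, D=2, E=0):
    """Recover the values: Y = (R + X * 2**E) / 10**D."""
    return [(R + x * 2**E) / 10**D for x in packed]

temps = [271.25, 273.5, 280.0]      # invented sample temperatures in K
packed, R = simple_pack(temps)
print(packed)                        # [0, 225, 875] - small non-negative ints
print(simple_unpack(packed, R))      # [271.25, 273.5, 280.0]
```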
GRIB Software
GRIB1:
• GrADS: visualization
• grib2ctl: control-file maker for GrADS
• wgrib: inventory, decode, and basic database functions
• copygb: interpolation
• grb1to2: GRIB1 → GRIB2 converter
• Notable open-source codes: CDO, GDAL, NCL
GRIB2:
• GrADS: visualization
• g2ctl: control-file maker for GrADS
• wgrib2: inventory, decode, and more
• Notable open-source codes: CDO, degrib, GDAL, NCL, rNOMADS
CF-compliant NetCDF (CF-NetCDF)
• NetCDF interface developed by Unidata to facilitate the access and sharing of array-oriented data in a form that is self-describing and portable (the format being machine-independent)
• Highly flexible data management solution for multi-dimensional gridded data
• Very widely used in atmospheric and oceanographic sciences community
• CF (Climate-Forecast) metadata convention offers, arguably, the best metadata standard available within this community
GRIB2
• Developed as WMO standard to provide an efficient, machine-independent format for the exchange of gridded data by National Met Services
• No standard interface, although several have been developed
• Metadata is code-based, needing to cross-reference external tables, with a highly specified metadata ‘vocabulary’ and layout and no indexing of the data
NetCDF vs GRIB
[Diagram: a BUFR message is a continuous binary stream made up of Sections 0-5]
Section  Name                      Contents
0        Indicator section         "BUFR" (coded in the CCITT International Alphabet No. 5, functionally equivalent to ASCII), length of message, BUFR edition number
1        Identification section    length of section, identification of the message
2        Optional section          length of section and any additional items for local use by data processing centres
3        Data description section  length of section, number of data subsets, data category flag, data compression flag, and a collection of data descriptors that define the form and content of individual data elements
4        Data section              length of section and binary data
5        End section               "7777" (coded in the CCITT International Alphabet No. 5)
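As a small illustration of the table above, Section 0 can be decoded with nothing but byte arithmetic (a sketch assuming BUFR editions 2-4, where octets 5-7 carry the 24-bit total message length and octet 8 the edition; the toy message below is invented, not a real observation):

```python
def read_bufr_section0(msg: bytes):
    """Decode BUFR Section 0 (indicator section) for editions 2-4:
    octets 1-4 "BUFR", octets 5-7 total message length (24-bit big-endian),
    octet 8 edition number."""
    if msg[:4] != b"BUFR":
        raise ValueError("not a BUFR message")
    total_len = int.from_bytes(msg[4:7], "big")
    edition = msg[7]
    return total_len, edition

# invented indicator section + zero filler + Section 5 terminator "7777"
toy = b"BUFR" + (60).to_bytes(3, "big") + bytes([4]) + b"\x00" * 48 + b"7777"
length, edition = read_bufr_section0(toy)
print(length, edition)  # 60 4
```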
BUFR
BUFR Tools
bufr_compare – Compare the BUFR messages contained in two files. If differences are found, it fails, returning an error code. Floating-point values are compared exactly by default; a different tolerance can be defined (see -P, -A, -R). Default behaviour: absolute error = 0, bit-by-bit comparison, same order in files.
bufr_copy Copies the content of BUFR files printing values of some keys.
bufr_count Print the total number of BUFR messages in the given files.
bufr_dump Dump the content of a BUFR file in different formats.
bufr_filter Apply the rules defined in rules_file to each BUFR message in the BUFR files provided as arguments. If you specify '-' (a single dash) for the rules_file, the rules will be read from standard input.
bufr_get Get values of some header keys from a BUFR file. It is similar to bufr_ls, but fails returning an error code when an error occurs (e.g. key not found).
bufr_ls List content of BUFR files printing values of some header keys. Only scalar keys can be printed. It does not fail when a key is not found.
bufr_set Sets key/value pairs in the input BUFR file and writes each message to the output_bufr_file. It fails when an error occurs (e.g. key not found).
BUFRtool: Download from http://www.northern-lighthouse.com/
Climate Data Formats in India
Data formats
ASCII
Flat Binary
GRIB1/2
BUFR
NetCDF
HDF
Package Tools
GrADS/Ferret
R/NCL/MATLAB/IDL
NCO/NCVIEW
HDF-Viewer
Panoply
Many More