Implementing a Data Quality Strategy to simplify access to ...€¦ · 11/9/2016 5 nci.org.au...
Transcript of Implementing a Data Quality Strategy to simplify access to ...€¦ · 11/9/2016 5 nci.org.au...
11/9/2016
1
Implementing a Data Quality Strategy to simplify access to data
Kelsey Druken
Implementing a Data Quality Strategy to simplify access to data
Kelsey Druken, Claire Trenham, Lesley Wyborn, Ben Evans
National Computational Infrastructure, CanberraeResearch 2016
11/9/2016
2
nci.org.au
• The diverse data collections areco-located with a Petascale HPC and Cloud facility with a: • Top 50 Supercomputer (1.2Pflops)
• HPC Cloud (3000 node)
• Digital Laboratories
• Dynamic subsets are actively encouraged, and can be accessed via data services
• Processing times have decreased dramatically: new large data sets can be generated or analysed in minutes or hours instead of months
National Computational Infrastructure
nci.org.au
National Computational Infrastructure
• NCI hosts one of Australia’s largest repositories (10+ PBytes) of research data collections
• Spanning datasets from climate, coasts, oceans and geophysics through to astronomy, bioinformatics and the social sciences
11/9/2016
3
nci.org.au
National Computational Infrastructure
• NCI hosts one of Australia’s largest repositories (10+ PBytes) of research data collections
• Spanning datasets from climate, coasts, oceans and geophysics through to astronomy, bioinformatics and the social sciences
nci.org.au
National Computational Infrastructure
1. Climate/ESS Model Assets and Data Products
2. Earth and Marine Observations and Data Products
3. Geoscience Collections
4. Terrestrial Ecosystems Collections
5. Water Management and Hydrology Collections
Data Collections Approx. Capacity
CMIP5, CORDEX, ACCESS Models 5 Pbytes
Satellite Earth Obs: LANDSAT, Himawari-8, Sentinel, MODIS, INSAR 2 Pbytes
Digital Elevation, BathymetryOnshore/Offshore Geophysics
1 Pbytes
Seasonal Climate 700 Tbytes
Bureau of Meteorology Observations 350 Tbytes
Bureau of Meteorology Ocean-Marine 350 Tbytes
Terrestrial Ecosystem 290 Tbytes
Reanalysis products 100 Tbytes
11/9/2016
4
nci.org.au
• Application of community-agreed data standards to the broad set of Earth systems and environmental data that are being used
• Within these disciplines, data span a wide range of:- Gridded- Non-gridded (i.e., trajectories/profiles,
point data)- Coordinate reference projections- Resolutions
Key Challenges
nci.org.au
How data collections are accessed
Collections are being accessed and utilised from a broad range of options• Direct access on filesystem• Web and data services • Data portals• Virtual labs (e.g., virtual desktops)
eReefs online analysis portal
11/9/2016
5
nci.org.au
National Environmental Research Data Interoperability Platform (NERDIP)
HDF5
NetCDF-4
Climate
GDAL
API Layers
HP Data Library Layer
[SEG-Y][Airborne
Geophysics] [FITS] [LAS
LiDAR]
Data Conventions netCDF-CF
[HDF4-
EOS]
ISO 19115, ACDD, RIF-CS, DCAT, etc.
VGLAGDC
VL
Services Layer
Fast “whole-of-library”
catalogue
Lustre Other Storage (e.g., HDFS)
National Environmental Research Data Interoperability Platform (NERDIP)
Climate & Weather Science Lab
Biodiversity & Climate Change VL
OG
C
WFS
OG
C
SWE
OG
C
W*P
S
OG
CW
CS
OG
C
WM
S
OG
CW
*TS
RD
F, LD
VHIRLGlobe Claritas
Workflow Engines, Virtual Laboratories (VL’s), Science Gateways
AuScopePortal
TERNPortal
AODN/IMOSPortal
eMASTSpeddexes
All Sky Virtual Observatory
ANDS/RDAPortal
eReefs
ModelsFortran, C, C++, MPI, OpenMP
Python, R, MatLab, IDL
VisualisationDrishti
Ferret, NCO, GDL, GDAL, GRASS, QGIS
Digital Bathymetry & Elevation Portal
Data.gov.au
Open NavSurface
Tools Data Portals
Direct
Access
CS-W
NetCDF-4
Weather
NetCDF-4
Oceans
NetCDF-4
EONetCDF-4
Bathy
HDF5 ??
Vocab Service
PROV Service
Op
enD
AP
nci.org.au
National Environmental Research Data Interoperability Platform (NERDIP)
HDF5
NetCDF-4
Climate
GDAL
API Layers
HP Data Library Layer
[SEG-Y][Airborne
Geophysics] [FITS] [LAS
LiDAR]
Data Conventions netCDF-CF
[HDF4-
EOS]
ISO 19115, ACDD, RIF-CS, DCAT, etc.
VGLAGDC
VL
Services Layer
Fast “whole-of-library”
catalogue
Lustre Other Storage (e.g., HDFS)
National Environmental Research Data Interoperability Platform (NERDIP)
Climate & Weather Science Lab
Biodiversity & Climate Change VL
OG
C
WFS
OG
C
SWE
OG
C
W*P
S
OG
CW
CS
OG
C
WM
S
OG
CW
*TS
RD
F, LD
VHIRLGlobe Claritas
Workflow Engines, Virtual Laboratories (VL’s), Science Gateways
AuScopePortal
TERNPortal
AODN/IMOSPortal
eMASTSpeddexes
All Sky Virtual Observatory
ANDS/RDAPortal
eReefs
ModelsFortran, C, C++, MPI, OpenMP
Python, R, MatLab, IDL
VisualisationDrishti
Ferret, NCO, GDL, GDAL, GRASS, QGIS
Digital Bathymetry & Elevation Portal
Data.gov.au
Open NavSurface
Tools Data Portals
Direct
Access
CS-W
NetCDF-4
Weather
NetCDF-4
Oceans
NetCDF-4
EONetCDF-4
Bathy
HDF5 ??
Vocab Service
PROV Service
Op
enD
AP
11/9/2016
6
nci.org.au
National Environmental Research Data Interoperability Platform (NERDIP)
HDF5
NetCDF-4
Climate
GDAL
API Layers
HP Data Library Layer
[SEG-Y][Airborne
Geophysics] [FITS] [LAS
LiDAR]
Data Conventions netCDF-CF
[HDF4-
EOS]
ISO 19115, ACDD, RIF-CS, DCAT, etc.
VGLAGDC
VL
Services Layer
Fast “whole-of-library”
catalogue
Lustre Other Storage (e.g., HDFS)
National Environmental Research Data Interoperability Platform (NERDIP)
Climate & Weather Science Lab
Biodiversity & Climate Change VL
OG
C
WFS
OG
C
SWE
OG
C
W*P
S
OG
CW
CS
OG
C
WM
S
OG
CW
*TS
RD
F, LD
VHIRLGlobe Claritas
Workflow Engines, Virtual Laboratories (VL’s), Science Gateways
AuScopePortal
TERNPortal
AODN/IMOSPortal
eMASTSpeddexes
All Sky Virtual Observatory
ANDS/RDAPortal
eReefs
ModelsFortran, C, C++, MPI, OpenMP
Python, R, MatLab, IDL
VisualisationDrishti
Ferret, NCO, GDL, GDAL, GRASS, QGIS
Digital Bathymetry & Elevation Portal
Data.gov.au
Open NavSurface
Tools Data Portals
Direct
Access
CS-W
NetCDF-4
Weather
NetCDF-4
Oceans
NetCDF-4
EONetCDF-4
Bathy
HDF5 ??
Vocab Service
PROV Service
Op
enD
AP
nci.org.au
National Environmental Research Data Interoperability Platform (NERDIP)
HDF5
NetCDF-4
Climate
GDAL
API Layers
HP Data Library Layer
[SEG-Y][Airborne
Geophysics] [FITS] [LAS
LiDAR]
Data Conventions netCDF-CF
[HDF4-
EOS]
ISO 19115, ACDD, RIF-CS, DCAT, etc.
VGLAGDC
VL
Services Layer
Fast “whole-of-library”
catalogue
Lustre Other Storage (e.g., HDFS)
National Environmental Research Data Interoperability Platform (NERDIP)
Climate & Weather Science Lab
Biodiversity & Climate Change VL
OG
C
WFS
OG
C
SWE
OG
C
W*P
S
OG
CW
CS
OG
C
WM
S
OG
CW
*TS
RD
F, LD
VHIRLGlobe Claritas
Workflow Engines, Virtual Laboratories (VL’s), Science Gateways
AuScopePortal
TERNPortal
AODN/IMOSPortal
eMASTSpeddexes
All Sky Virtual Observatory
ANDS/RDAPortal
eReefs
ModelsFortran, C, C++, MPI, OpenMP
Python, R, MatLab, IDL
VisualisationDrishti
Ferret, NCO, GDL, GDAL, GRASS, QGIS
Digital Bathymetry & Elevation Portal
Data.gov.au
Open NavSurface
Tools Data Portals
Direct
Access
CS-W
NetCDF-4
Weather
NetCDF-4
Oceans
NetCDF-4
EONetCDF-4
Bathy
HDF5 ??
Vocab Service
PROV Service
Op
enD
AP
11/9/2016
7
nci.org.au
National Environmental Research Data Interoperability Platform (NERDIP)
HDF5
NetCDF-4
Climate
GDAL
API Layers
HP Data Library Layer
[SEG-Y][Airborne
Geophysics] [FITS] [LAS
LiDAR]
Data Conventions netCDF-CF
[HDF4-
EOS]
ISO 19115, ACDD, RIF-CS, DCAT, etc.
VGLAGDC
VL
Services Layer
Fast “whole-of-library”
catalogue
Lustre Other Storage (e.g., HDFS)
National Environmental Research Data Interoperability Platform (NERDIP)
Climate & Weather Science Lab
Biodiversity & Climate Change VL
OG
C
WFS
OG
C
SWE
OG
C
W*P
S
OG
CW
CS
OG
C
WM
S
OG
CW
*TS
RD
F, LD
VHIRLGlobe Claritas
Workflow Engines, Virtual Laboratories (VL’s), Science Gateways
AuScopePortal
TERNPortal
AODN/IMOSPortal
eMASTSpeddexes
All Sky Virtual Observatory
ANDS/RDAPortal
eReefs
ModelsFortran, C, C++, MPI, OpenMP
Python, R, MatLab, IDL
VisualisationDrishti
Ferret, NCO, GDL, GDAL, GRASS, QGIS
Digital Bathymetry & Elevation Portal
Data.gov.au
Open NavSurface
Tools Data Portals
Direct
Access
CS-W
NetCDF-4
Weather
NetCDF-4
Oceans
NetCDF-4
EONetCDF-4
Bathy
HDF5 ??
Vocab Service
PROV Service
Op
enD
AP
nci.org.au
National Environmental Research Data Interoperability Platform (NERDIP)
HDF5
NetCDF-4
Climate
GDAL
API Layers
HP Data Library Layer
[SEG-Y][Airborne
Geophysics] [FITS] [LAS
LiDAR]
Data Conventions netCDF-CF
[HDF4-
EOS]
ISO 19115, ACDD, RIF-CS, DCAT, etc.
VGLAGDC
VL
Services Layer
Fast “whole-of-library”
catalogue
Lustre Other Storage (e.g., HDFS)
National Environmental Research Data Interoperability Platform (NERDIP)
Climate & Weather Science Lab
Biodiversity & Climate Change VL
OG
C
WFS
OG
C
SWE
OG
C
W*P
S
OG
CW
CS
OG
C
WM
S
OG
CW
*TS
RD
F, LD
VHIRLGlobe Claritas
Workflow Engines, Virtual Laboratories (VL’s), Science Gateways
AuScopePortal
TERNPortal
AODN/IMOSPortal
eMASTSpeddexes
All Sky Virtual Observatory
ANDS/RDAPortal
eReefs
ModelsFortran, C, C++, MPI, OpenMP
Python, R, MatLab, IDL
VisualisationDrishti
Ferret, NCO, GDL, GDAL, GRASS, QGIS
Digital Bathymetry & Elevation Portal
Data.gov.au
Open NavSurface
Tools Data Portals
Direct
Access
CS-W
NetCDF-4
Weather
NetCDF-4
Oceans
NetCDF-4
EONetCDF-4
Bathy
HDF5 ??
Vocab Service
PROV Service
Op
enD
AP
11/9/2016
8
nci.org.au
Data Quality Strategy (DQS): The Goal?
Provide seamless programmatic access through standardisation of both data and
services
Data Quality Strategy (DQS)
nci.org.au
• Combining data• Visualising• How can we make enable this type of
easy access and use?
The Goal
11/9/2016
9
nci.org.au
Motivation: Data Management Maturity Program
DMM Capability – 25 Processes to Perform, Manage, Define
4. Data Operations Process Area13. Data Requirements Definition14. Data Lifecycle Management15. Contribution / Provider Management
5. Platform and Architecture Process Area16. Architectural Standards17. Architectural Approach18. Data Management Platform19. Data Integration / Data Linking20. Data Archiving and Preservation
6. Infrastructure Support Practices21. Measurement and Analysis22. Process Management23. Process Quality Assurance24. Risk Management25. Configuration Management
1. Data Management Strategy Process Area1. Data Management Strategy2. Communications3. Data Management Function4. Grant Strategy/Business Case5. Funding
2. Data Governance Process Area6. Governance Management7. Vocabulary/Glossary8. Metadata Management
3. Data Quality Process Area9. Data Quality Strategy10. Data Profiling11. Data Quality Assessment12. Data Cleansing and Curation
Please see the eResearchpresentation on this work:
Lesley Wyborn (NCI)Tuesday (today) @ 5pGrand Ballroom 1/2
nci.org.au
Data Quality Strategy (DQS)
Data Quality Strategy (DQS): What does it
involve?
1. Underlying HPD file format
2. Close collaboration with data custodians
and managers
• Planning, designing, or reassessing
the data collections
3. Quality control through compliance with
recognised community standards
4. Data assurance through demonstrated
functionality across common platforms,
tools, and services
11/9/2016
10
nci.org.au
National Computational Infrastructure
1. Climate/ESS Model Assets and Data Products
2. Earth and Marine Observations and Data Products
3. Geoscience Collections
4. Terrestrial Ecosystems Collections
5. Water Management and Hydrology Collections
Data Collections Approx. Capacity
CMIP5, CORDEX, ACCESS Models 5 Pbytes
Satellite Earth Obs: LANDSAT, Himawari-8, Sentinel, MODIS, INSAR 2 Pbytes
Digital Elevation, BathymetryOnshore/Offshore Geophysics
1 Pbytes
Seasonal Climate 700 Tbytes
Bureau of Meteorology Observations 350 Tbytes
Bureau of Meteorology Ocean-Marine 350 Tbytes
Terrestrial Ecosystem 290 Tbytes
Reanalysis products 100 Tbytes
NetCDFcommon data format
nci.org.au
NetCDF collection overview
FORMATBy collection
11/9/2016
11
nci.org.au
NetCDF collection overview
FORMATBy collection
The motivation: reduce ‘none’
nci.org.au
CF Conventions• Climate and Forecast Conventions and Metadata: http://cfconventions.org/
ACDD• Attribute Convention for Data Discovery:
http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery
Existing Community Standards
Together, these two standards define several categories of metadata ensuring:
Usage, discoverability, and understanding of the data contents
11/9/2016
12
nci.org.au
Collection
DatasetsDatasetsDatasets
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Data file
Collection & dataset-levels
(e.g., parent-child metadata)
ISO-19115, ANZLIC, etc.
File (granule)-level
Contains 2 types of metadata:
(1) Variable-level (CF-Convention)
(2) Global file-level (ACDD**)
**Can link to collection/dataset metadata
File-levelVariable-level(s)
Many levels of metadata
nci.org.au
Self-contained file format
Traditional metadata information
(i.e., global attributes relevant to all variables/file contents):
• What is the data
• How was it produced
• Who/where/when
• Contact information
• Instruments, sources
• Version history
Data file
Global-level Variable(s)-level
Data content information:
(i.e., specific to each variable and the physical structure of the file)
• Coordinate variable(s) definition (value, name, size)
• Variable attributes (units, description, etc.)
• Type of data (gridded, discrete, ragged, etc.)
STANDARD:
Attribute Convention for Dataset Discovery (ACDD)
STANDARD:
Climate and Forecasts (CF) Convention
11/9/2016
13
nci.org.au
• Want to adopt or utilise existing community checkers if possible
• Two main options:• UK Reading (CF-Convention document links to this one)
• IOOS (growing fast, designed to be modified and extended)
Our own modifications• Needed our own wrapper to enable collection-level scans
• Tailor our output and reporting
Compliance checker
nci.org.au
Compliance checker
Summarised version on the compliance status.
The break down… compliance scores and also measure of consistency across the collection
Providing attack plan for improvements:Make it easy for data managers to efficiently address and meet baseline compliance
11/9/2016
14
nci.org.au
The result: “win-win” for all
• Working with data custodians and managers
• Common goal• Enabling access to organised, performant,
and interoperable data
• Progressive improvement in the quality of the datasets across the different subject domains
• “win-win”: • the ease by which the users can access,
utilise, and combine datasets
nci.org.au
Data Quality Strategy (DQS)
Data Quality Strategy (DQS): What does it
involve?
1. Underlying HPD file format
2. Close collaboration with data custodians
and managers
• Planning, designing, or reassessing
the data collections
3. Quality control through compliance with
recognised community standards
4. Data assurance through demonstrated
functionality across common platforms,
tools, and services
11/9/2016
15
nci.org.au
•Has now evolved to not only be a QC step in our data publishing process but also a Quality Assurance one
• Extend to test “usability” across wide spectrum of scientific tools and data services
• Commonly used libraries (e.g., netCDF, HDF, GDAL, etc.)
• Accessibility by data servers (e.g., THREDDS, Hyrax, GeoServer)
• Validation against scientific analysis and programming platforms (e.g., Python, Matlab, R, QGIS)
• Visualization tools (e.g., ParaView, IDV, WMS-viewers)
Functionality tests
nci.org.au
Functionality tests
Primary motivation:
Positive experience for our users.
Expectation that advertised collections and services are usable.
11/9/2016
16
nci.org.au
Bonus results
Bonus results:• Feedback to the local and international communities
• The more we test and test, the more we learn
• Functionality tests lead to reference and training material for our user community
Please see the eResearch presentation:
“A learner-centred approach to specialised user training”Claire Trenham (NCI)Wednesday (tomorrow) @ 2:25pGrand Ballroom 1/2
nci.org.au
Bonus: User reference material
11/9/2016
17
nci.org.au
Summary/Future Work
What’s next?
• Automating and extending these measures and tests across our full collection
• What about the broader file formats?
• Staying connected and working with international communities
• E.g., NSF Funded “Advancing netCDF-CF for the Geoscience Community” (EarthCube)
https://www.earthcube.org/content/advancing-netcdf-cf-geoscience-community
nci.org.au
Questions?
Questions?Thanks for listening.
Contact information:
Kelsey [email protected]
Claire [email protected]
Lesley [email protected]