CAS2K3, September 2003 BADC: badc.nerc.ac.uk, NERC DataGrid: www.ndg.badc.rl.ac.uk

The NERC DataGrid – Building Bridges for the Environmental Sciences

Bryan Lawrence

Kerstin Kleese, Roy Lowry, Kevin O’Neill, Andrew Woolf & others

Head, NCAS/British Atmospheric Data Centre

Rutherford Appleton Laboratory, CCLRC

NDG Partners

• As funded, a partnership between:
– British Atmospheric Data Centre (BADC, PI: Bryan Lawrence)
– British Oceanographic Data Centre (BODC, Co-I: Roy Lowry)
– CLRC E-science Centre (Co-I: Kerstin Kleese)
– PCMDI at LLNL in the US (Dean Williams, Bob Drach, Mike Fiorino)

• The project has caught the imagination; extra funding now supports:
– A number of groups at the NERC Centre for Ecology and Hydrology (CEH: Ecology DataGrid)
– NERC Earth Observation Data Centre & Plymouth Marine Lab Remote Sensing

• Major collaborators that are not directly funded will include:
– ClimatePrediction.net, GODIVA (NERC e-science projects)
– NCAS/CGAM: The Centre for Global Atmospheric Modelling at the University of Reading (via Lois Stenman-Clark and Katherine Bouton)
– Already required to provide technology to support the major UK project HIGEM (a collaboration between the Hadley Centre and the NERC academic community to develop the next generation of high-resolution GCMs based on HadGEM).

Outline

• Motivation:

– The BADC, BODC, and the Metadata Gateway

• The NDG Goal

• NDG Metadata Structures and Architecture

– Metadata Model

– Data Model

– ISO Context

• NDG Prototype Status

• Summary & Challenges

The British Oceanographic Data Centre

(not for much longer, moving to a site on Liverpool University campus imminently)

BODC Mission Statement

To operate a world class data centre in support of UK marine science by:

• providing data management support for UK marine science projects

• maintaining and developing the UK’s national oceanographic database

• developing innovative marine data products and digital atlases

• collaborating, on behalf of the UK, in the international exchange and management of oceanographic data

• making high quality data readily available to UK research scientists in academia, government and industry

British Atmospheric Data Centre

The Role: Key words: Curation and Facilitation!

BADC Users

3800 registered users in March 03

~300 individual users per month

[Pie chart: Users by Discipline, November 02, 2150 users. Disciplines: Atmospheric, Water, Earth Science, Medical/Bio, Other, Geography, Engineering. Segment counts: 882, 342, 230, 149, 214, 179, 154.]

BADC Storage Capacity

• Approx. 50 TB (Nov 02)

• Projected to quadruple well within the next couple of years given existing commitments

• Planning exercise under way now

• Committed to keeping as much as possible on spinning disk

• Further backup and extra storage at the national archival centre (ATLAS, PB soon)

[Diagram: BADC in the RAL Network. Components: web server, NAS storage (12.6 TB), tape libraries (5 TB and 30 TB), tape servers, SAN and SCSI connections, further volumes of 2.3 TB and 0.5 TB, Gbit switches, routers, TVN, and ATLAS (0.5 PB). Links: Gbit Ethernet, 1 Gbit, 622 Mbit and 2.5 Gb.]

Huge variety of Data Sets

Querying datasets

Complex metadata, held in an Ingres database; exported as DIF and via Z39.50

No possibility of automatic data usage …

Different types of data returned: Wallingford

Supporting very diverse user community: NetCDF is not enough …

NERC Metadata Gateway - SST

No clean handover from discovery to browse and use!

• Geospatial coordinates forgotten. Time reference forgotten. Need to get entire field(s), and find correct time!

• And if I want to compare data from different locations?
– multiple logins
– multiple formats
– discovery?

Outline

• Motivation:

– The BADC, BODC, and the Metadata Gateway

• The NDG Goal

• NDG Metadata Structures and Architecture

– Metadata Model

– Data Model

– ISO Context

• NDG Prototype Status

• Summary & Challenges

The NERC DataGrid

[Architecture diagram. Components shown: BADC, BODC and research-group NDG wrappers, each with online data and an XML database; a tape robot; satellite, supercomputer and research-group data sources; grid users and software agents on the NERC Grid; internet users and ESG (and other) applications on the wider internet, connected by internet links through the NDG web portal and its XML database.]

Metadata Origins

[Diagram: nested levels of research groups (with satellite and supercomputer sources), shared resources with a DB, and the wider internet.]

Consider a hierarchy of data users beginning with an individual scientist, who may herself be part of a research group, itself part of a community sharing resources, lying in the wider internet …

To be well integrated, the metadata should have a role at each level! (The data portal client and server interface may be different at each level.)

At each level “extra” metadata will be required, probably produced by dedicated staff at the research group or data centre.

A google for data; the metadata carrot!

[Diagram: the NERC DataGrid architecture (NDG wrappers, XML databases, online data, web portal, grid and internet users) shown together with the hierarchy of research groups, shared resources and the wider internet.]

Outline

• Motivation:

– The BADC, BODC, and the Metadata Gateway

• The NDG Goal

• NDG Metadata Structures and Architecture

– Metadata Model

– Data Model

– ISO Context

• NDG Prototype Status

• Summary & Challenges

NDG Metadata Taxonomy

Definitions

• A: Usage metadata generated from (or about) the data. Normally generated directly from internal metadata.
• B: Generic complete metadata, semantic, syntactic (A), including discipline specific (E).
• C: Metadata generated to describe both documentation and annotations (as opposed to binary data).
• D: Discovery metadata suitable for harvesting. Probably based on Dublin Core & GEO. Subset of B and C.
• E: Extra metadata, discipline specific.
• S: Summary metadata (overlap between A & D).
• Q: Schema which defines supported queries upon A, B, C, D.

Relationships

[Diagram: the XML metadata documents A, B, C, D, E, S and Q and the relationships between them.]
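The statement that D is a subset of B and C can be illustrated with a minimal Python sketch. The field names and the rule that B takes precedence over C are assumptions made for illustration only; they are not taken from the NDG schemas.

```python
# Hypothetical discovery-level fields; the real D schema is not given on the slide.
DISCOVERY_FIELDS = ("title", "abstract", "spatial_extent", "temporal_extent")


def make_discovery_record(b: dict, c: dict) -> dict:
    """Build a D record by selecting discovery fields from B and C records."""
    merged = {**c, **b}  # assumption: B wins where both define a field
    return {key: merged[key] for key in DISCOVERY_FIELDS if key in merged}


b_record = {"title": "WOCE SR3 section", "units": "psu"}
c_record = {"abstract": "Cruise report and annotations"}
d_record = make_discovery_record(b_record, c_record)
```

Fields that are not discovery-level (here `units`) stay in B and never reach the harvested record.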

Data Ingestion

[Flow diagram, with steps as labelled:]
• Appropriate convention used to develop applications
• Application produces conforming data files
• Create aggregation metadata (XML A)
• Optionally create higher level aggregations
• Add further discovery level metadata (XML B)
• Move B into a local NDG directory structure

Harvesting

[Flow diagram, with steps as labelled:]
• Local instance of NDG directory holds XML C and XML B
• Ingestion into the local NDG Meta DB triggered for new C or B
• Creation of summary metadata (XML D) for harvesting (on demand? nightly?)
• Harvested by the NDG portal(s?)

NDG Query and Data Delivery

[Flow diagram, with elements as labelled:]
• User/software generates query (XML Q)
• Query destination: NDG portal, or one or more local NDG DBs
• Query type Discovery: XML D returned; browse and redefine query
• Query type Documents and Annotations: deliver one or more documents (XML C) to user
• Query type Detailed: XML B
• Query type Data: define data request (Q); "Local NDG DB exists?" (Y branch); ingest A; XML A; extract data from physical data; deliver data

Note that definitions A do not need to match any ingested A.
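The four query types and the metadata documents they return can be sketched as a small lookup in Python. This is only an illustration: the enum, the function name and the pairing of the Data query type with A are assumptions read off the diagram labels, not an NDG interface.

```python
from enum import Enum


class QueryType(Enum):
    DISCOVERY = "discovery"
    DOCUMENTS = "documents and annotations"
    DETAILED = "detailed"
    DATA = "data"


# Which metadata document each query type works with (assumed from the diagram).
RESPONSE_DOCUMENT = {
    QueryType.DISCOVERY: "D",
    QueryType.DOCUMENTS: "C",
    QueryType.DETAILED: "B",
    QueryType.DATA: "A",
}


def response_document(query_type: QueryType) -> str:
    """Return the letter of the metadata document used to answer a query."""
    return RESPONSE_DOCUMENT[query_type]
```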

Separate data (A) and metadata (B) models

• Clear separation of function
– Difference between data use and discovery etc.
– “Tuning” of metadata to include relevant detail

• Allows increased reuse of metadata model
– Avoids tie-in to details of a particular field’s data formats
– Can plug in another data model

[Diagram: the Metadata Model and the Data Model, linked by a data granule ID and a data summary.]

(A) NDG Data Model: Overview

[UML class diagram: Dataset, Variable, Array, Coordinate and GranuleDescriptor, associated with multiplicities 1, * and 0..1.]

• Dataset: named container for a number of variables
• Variable: physical parameters within the dataset; controlled vocabularies, e.g. BODC data dictionary, CF standard names
• Array: multidimensional container for other arrays or numeric data
• Coordinate: may be shared between multiple Arrays; ‘anonymous’ if not georeferenced; MappedCoordinate vs ProductCoordinate; with respect to a Coordinate Reference System (ref ISO 19111, ISO 19115)
• GranuleDescriptor: describes data granule in terms of file storage; enables file aggregation; SQL/OGSA-DAI for RDBMS; physical or logical (e.g. SRB) files

“Profiles” of model defined for important data types
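The five classes can be sketched as Python dataclasses to show how they might nest. This is a rough illustration under stated assumptions: the attribute names and the cardinalities chosen here (one Array per Variable, an optional GranuleDescriptor per Array) are guesses, since the transcript does not show which multiplicity belongs to which association.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Coordinate:
    name: str
    axis_type: str  # e.g. "x", "y", "z"
    units: str


@dataclass
class GranuleDescriptor:
    location: str  # physical file path or logical (e.g. SRB) name


@dataclass
class Array:
    rank: int
    axis_size: int
    coordinates: List[Coordinate] = field(default_factory=list)
    children: List[Array] = field(default_factory=list)  # arrays may contain arrays
    granule: Optional[GranuleDescriptor] = None  # assumed 0..1

    @property
    def has_data(self) -> bool:
        # Assumption: an array holds numeric data only if a granule backs it.
        return self.granule is not None


@dataclass
class Variable:
    name: str
    parameter: str  # controlled vocabulary term, e.g. a CF standard name
    units: str
    array: Array


@dataclass
class Dataset:
    name: str
    variables: List[Variable] = field(default_factory=list)


longitude = Coordinate("longitude", "x", "degrees_east")
cast = Array(rank=1, axis_size=20, granule=GranuleDescriptor("cast_001.dat"))
section = Array(rank=1, axis_size=50, coordinates=[longitude], children=[cast])
dataset = Dataset("WOCE SR3", [Variable("salinity", "sea_water_salinity", "psu", section)])
```

The file name `cast_001.dat` is a placeholder. Sharing one `Coordinate` instance between several arrays mirrors the note that coordinates may be shared.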

NDG Data Model: Array

• WOCE SR3 section : Dataset
– name : string(idl) = WOCE SR3

• Salinity : Variable
– name : string(idl) = salinity
– parameter : CFStandardName = sea_water_salinity
– units : PhysicalUnits = psu
– missingValue : Numeric
– scaleFactor : Numeric
– offset : Numeric

• Salinity Data : Array
– rank : short(idl) = 1
– axisSize : short(idl) = 50
– hasData : boolean(idl) = no
– type : NumericTypeName
– isomorphicChildren : boolean(idl) = no

• Cruise track longitude : MappedCoordinate
– name : string(idl) = longitude
– axisType : AxisType = x
– axisUnits : AxisUnits = degrees_east
– rank : short(idl) = 1
– axisSize : short(idl) = 50
– type : NumericTypeName = float
– arrayDimension : short(idl) = 1

• Cruise track latitude : MappedCoordinate
– name : string(idl) = latitude
– axisType : AxisType = y
– axisUnits : AxisUnits = degrees_north
– rank : short(idl) = 1
– axisSize : short(idl) = 50
– type : NumericTypeName = float
– arrayDimension : short(idl) = 1

• Cast 1 : Array
– rank : short(idl) = 1
– axisSize : short(idl) = 20
– hasData : boolean(idl) = yes
– type : NumericTypeName = float
– isomorphicChildren : boolean(idl)

• Cast 50 : Array
– rank : short(idl) = 1
– axisSize : short(idl) = 40
– hasData : boolean(idl) = yes
– type : NumericTypeName = float
– isomorphicChildren : boolean(idl)

• Cast 1 depth : ProductCoordinate
– name : string(idl) = depth
– axisType : AxisType = z
– axisUnits : AxisUnits = m
– axisSize : short(idl) = 20
– type : NumericTypeName = float
– arrayDimension : short(idl) = 1
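Two details of this example lend themselves to a short Python sketch: the parent array has children of different lengths (20 and 40 levels), and the Variable carries missingValue, scaleFactor and offset. The unpacking rule used below, value = raw * scaleFactor + offset, is the common CF/netCDF convention and is an assumption here; the slide does not state how NDG applies these attributes. The function names are hypothetical.

```python
from typing import Iterable, List, Optional, Sequence


def isomorphic_children(children: Iterable[Sequence[float]]) -> bool:
    """True when every child array has the same length."""
    return len({len(child) for child in children}) <= 1


def unpack(
    raw: Sequence[float],
    scale_factor: float = 1.0,
    offset: float = 0.0,
    missing_value: Optional[float] = None,
) -> List[Optional[float]]:
    """Convert stored values to physical values, mapping missing values to None."""
    return [
        None if value == missing_value else value * scale_factor + offset
        for value in raw
    ]


# Casts with 20 and 40 depth levels, as in the slide's Cast 1 and Cast 50.
casts = {"Cast 1": [0.0] * 20, "Cast 50": [0.0] * 40}
```

With casts of unequal length, `isomorphic_children(casts.values())` is false, which matches `isomorphicChildren = no` on the parent Salinity Data array.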

(B) Metadata Model

[Entity-relationship diagram. Entities: Activity, Observation station types, Data production tools, Basic data entities / Dataset types, Derived data entities, Data Granules. Relationships: Includes/Included-in, Can-be-aggregated-in, Produces/Output-by, Produces/Output-at, Deploys-a/Deployed-on-a.]

Common Data Entities
– dimensions (spatial/temporal)
– grids
– organisations
– people
– places/areas

(B) Metadata Model: an NDG Intermediate Schema, Conceptual Overview

[Conceptual diagram of the tiered schema:]

• Tier-0: Activity
• Tier-1: Observation station types
• Tier-2: Data production tools
• Tier-3: Dataset types
• Tier-4: Basic data entities
• Tier-5: Derived data entities
• Common entities

Entities shown: Instrument, Ensemble, Analysis, Stationary, Moving, Section, Profile, Lagrangian path, Grid, Time, Trajectory, Point, Area, Place, Model, Simulation, Sample, Person, Organisation, Role, Timeseries, Climatology, Measurement, Integration, N-dimensional dataset, Spatial dimensions.

Relationships shown: Includes/Included-in, Can-be-aggregated-in, Integrated-into, Produces/Output-by, Collects/Can-be-collected-in, Deploys-a/Deployed-on-a, Is-time-ordered-series-of, Follows-a, Superset-of/Subset-of, Processed-to-a.

Legend: inter-tier relationship; directed relationship; spatiotemporal entity; entity with DIF record; data owning entity.

Outline

• Motivation:

– The BADC, BODC, and the Metadata Gateway

• The NDG Goal

• NDG Metadata Structures and Architecture

– Metadata Model

– Data Model

• ISO Context

• NDG Prototype Status

• Summary & Challenges

ISO TC211

• ISO 19101: Geographic information – Reference model
• ISO 19103: Geographic information – Conceptual schema language
• ISO 19107: Geographic information – Spatial schema
• ISO 19108: Geographic information – Temporal schema
• ISO 19109: Geographic information – Rules for application schema
• ISO 19111: Geographic information – Spatial referencing by coordinates
• ISO 19115: Geographic information – Metadata
• ISO 19118: Geographic information – Encoding
• ISO 19121: Geographic information – Imagery and gridded data

ISO19115

• Dataset title
• Dataset reference date
• Dataset responsible party
• Metadata point of contact
• Dataset language
• Dataset character set
• Dataset topic category
• Abstract describing dataset
• Spatial resolution of dataset
• Spatial representation type
• Geographic location of dataset
• Vertical/temporal extent for dataset
• Reference system
• Lineage
• Distribution format
• On-line resource
• Metadata character set
• Metadata date stamp
• Metadata standard name
• Metadata standard version
• Metadata file identifier
• Metadata language
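A list of elements like this is often used as a completeness check on a discovery record. The Python sketch below shows one way that might look. The choice of which elements to treat as required is an assumption passed in by the caller; the slide does not say which are mandatory, and the record layout is hypothetical.

```python
from typing import Dict, Iterable, List

# Element names as listed on the slide.
ISO19115_ELEMENTS = [
    "Dataset title",
    "Dataset reference date",
    "Dataset responsible party",
    "Metadata point of contact",
    "Dataset language",
    "Dataset character set",
    "Dataset topic category",
    "Abstract describing dataset",
    "Spatial resolution of dataset",
    "Spatial representation type",
    "Geographic location of dataset",
    "Vertical/temporal extent for dataset",
    "Reference system",
    "Lineage",
    "Distribution format",
    "On-line resource",
    "Metadata character set",
    "Metadata date stamp",
    "Metadata standard name",
    "Metadata standard version",
    "Metadata file identifier",
    "Metadata language",
]


def missing_elements(record: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return the required elements that are absent or empty in the record."""
    return [name for name in required if not record.get(name)]


record = {"Dataset title": "ACSOE", "Abstract describing dataset": ""}
gaps = missing_elements(record, ["Dataset title", "Abstract describing dataset"])
```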

• Metadata extensions and profiles

ISO

Direct relationship between ISO19115 and our (B) Intermediate schema.

ISO19101

• Profiling of ISO 191xx

“The comprehensiveness and large number of options available in various base standards make it difficult to combine them for practical applications. … A profile integrates a set of base standards and/or modules (predefined subsets) of base standards to meet a specific implementation requirement.”

• Registration of profiles

“A profile that is registered through an ISO registration procedure becomes an International Standardized Profile (ISP). National standards that are expressed as profiles of ISO base standards may be registered at a national level.”

Further Application in NERC DataGrid

• e.g. Data model “Coordinates”

[UML class diagram of the NDG data model (Dataset, Variable, Array, Coordinate, GranuleDescriptor), with references to ISO 19111 and ISO 19108.]

Outline

• Motivation:

– The BADC, BODC, and the Metadata Gateway

• The NDG Goal

• NDG Metadata Structures and Architecture

– Metadata Model

– Data Model

– ISO Context

• NDG Prototype Status

• Summary & Challenges

The Data Use Chain

Discovery

Authentication

Authorisation

Extraction

Sub-Sampling

Regridding

Processing

Display

Delivery

Formatting

Time-line

Key Components – need APIs and standards

NERC DataGrid Key Technologies

[Layer diagram. Layers: Applications, NDG Middleware, Grid Fabric, Data Owner. Components shown: GIS, CDAT, VIS, service, portal, annotate; discovery, browse, reformat, delivery, security, ingest, harvest; OGSA/DAI, SRB, RLS, Globus; database, 1..n disc arrays, tape libraries.]

NDG Discovery Service Element

[Diagram: existing metadata passes through an XSLT ingest transformation into Intermediate Schema document(s) (XML) held by the NDG Discovery Service Element. Outputs, via XSLT processors or pass-through: Directory Interchange Format, Dublin Core, GEO Profile (Z39.50), Catalogue Interoperability Protocol (?).]

Traditional and Grid Service (GT3) Interfaces
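The idea of exporting one intermediate document to several discovery formats can be sketched in Python. The slide uses XSLT processors; the version below is a plain `xml.etree.ElementTree` stand-in that applies a fixed field mapping instead of a stylesheet. The intermediate element names (`datasetTitle`, `abstract`, `dataCentre`) are hypothetical; only the Dublin Core namespace URI is a real identifier.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical intermediate-schema element names mapped to Dublin Core terms.
FIELD_MAP = {
    "datasetTitle": "title",
    "abstract": "description",
    "dataCentre": "publisher",
}


def to_dublin_core(intermediate_xml: str) -> ET.Element:
    """Map an intermediate-schema document to a flat Dublin Core record."""
    source = ET.fromstring(intermediate_xml)
    record = ET.Element("record")
    for source_tag, dc_term in FIELD_MAP.items():
        for node in source.iter(source_tag):
            element = ET.SubElement(record, f"{{{DC_NS}}}{dc_term}")
            element.text = (node.text or "").strip()
    return record


example = "<dataset><datasetTitle>ACSOE</datasetTitle><abstract>Field campaign data</abstract></dataset>"
dc_record = to_dublin_core(example)
```

A second mapping table for DIF would follow the same pattern, which is the attraction of keeping a single intermediate schema and transforming outward from it.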

Starting with the LAS

Deployment for UK users within a few weeks (constraint is primarily access control)

LAS – Simple Box fill Output

Work for us to do: Labelling is inadequate as yet ..

ERA40

Cache management in LAS/CDAT

[Flow diagram: internet user, LAS, CDAT, BADC/CDAT with localCache.py, the ERA-40 spectral archive (4 TB) and the ERA-40 grid cache (< 1 TB), together presenting an 18 TB virtual dataset.]

• CDAT calls cdms.open to open a data file.
• BADC/CDAT intercepts the command and checks the cache.
• localCache.py locks access to the cache and checks if the regular gridded file is in the cache list.
• NO: the spectral file is converted on-the-fly and placed in the cache. The cache also checks if there is enough room, deletes the oldest files if necessary and checks against the disk space limit.
• YES: the cache is unlocked. A new cdms.open command is sent to CDAT and the cache file is opened.
• Data object delivered to LAS.
• NetCDF file, plot or animations delivered to user.
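The lock, check, convert and evict cycle can be sketched in Python as follows. This is not the BADC localCache.py: the class name, the `.grid.nc` suffix and the caller-supplied `convert` function are assumptions, the lock is an in-process `threading.Lock` rather than whatever cross-process lock the real service uses, and eviction here runs after a new file is written rather than before.

```python
import os
import threading
from typing import Callable, List


class GridCache:
    """Size-limited cache of gridded files converted from spectral files."""

    def __init__(self, cache_dir: str, max_bytes: int,
                 convert: Callable[[str, str], None]) -> None:
        self.cache_dir = cache_dir
        self.max_bytes = max_bytes
        self.convert = convert  # hypothetical converter: (spectral_path, grid_path)
        self._lock = threading.Lock()
        os.makedirs(cache_dir, exist_ok=True)

    def _files_oldest_first(self) -> List[str]:
        paths = [os.path.join(self.cache_dir, n) for n in os.listdir(self.cache_dir)]
        return sorted(paths, key=os.path.getmtime)

    def _evict(self, keep: str) -> None:
        """Delete oldest files until the cache fits within max_bytes."""
        files = self._files_oldest_first()
        total = sum(os.path.getsize(p) for p in files)
        for path in files:
            if total <= self.max_bytes:
                break
            if path == keep:
                continue  # never delete the file just requested
            total -= os.path.getsize(path)
            os.remove(path)

    def path_for(self, spectral_path: str) -> str:
        """Return the cached gridded file, converting on a cache miss."""
        cached = os.path.join(
            self.cache_dir, os.path.basename(spectral_path) + ".grid.nc")
        with self._lock:  # lock access while checking and filling the cache
            if not os.path.exists(cached):
                self.convert(spectral_path, cached)
                self._evict(keep=cached)
        return cached
```

A caller would pass the returned path to its normal file-opening routine, so the interception stays invisible to the application that asked for the data.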

NERC DataGrid Prototype

• (By hand) ingestion of ACSOE data from BADC and BODC.

• NASA GCMD DIF based discovery
– Exported from Intermediate Schema
– Harvested by hand

• Working on hand-over mechanism to pass dataset info to DataModel based LAS service
– Generate and populate LAS database in response
– Use standard LAS delivery

Next Steps:

• GT3 based services, improve LAS, improve delivery, implement multiple datamodel profiles, implement multiple discovery services.

Summary

• The NDG project has been running for a year now, aiming to provide grid-enabled tools to support:
– a diverse community
– with diverse datasets

• NDG is part of the UK National E-science programme, and will leverage off other projects to implement grid solutions.
– initial prototype web-service based
– GT3 prototype due early in the new year

• Software development is based on plagiarising the maximum amount from other groups, and a standards based approach within the NDG.
– All code will be in the public domain

• The major challenges will not be technical: policy, attitudes, legal issues.

You’ve gone TOO FAR!