CAS2K3, September 2003 BADC: badc.nerc.ac.uk, NERC DataGrid: www.ndg.badc.rl.ac.uk
The NERC DataGrid – Building Bridges for the Environmental Sciences
Bryan Lawrence (with Kerstin Kleese, Roy Lowry, Kevin O’Neill, Andrew Woolf & others)
Head, NCAS/British Atmospheric Data Centre
Rutherford Appleton Laboratory, CCLRC
NDG Partners
• As funded, a partnership between:
– British Atmospheric Data Centre (BADC, PI: Bryan Lawrence)
– British Oceanographic Data Centre (BODC, Co-I: Roy Lowry)
– CLRC E-science Centre (Co-I: Kerstin Kleese)
– PCMDI at LLNL in the US (Dean Williams, Bob Drach, Mike Fiorino)
• The project has caught the imagination; extra funding now supports:
– A number of groups at the NERC Centre for Ecology and Hydrology (CEH: Ecology DataGrid)
– NERC Earth Observation Data Centre & Plymouth Marine Lab Remote Sensing
• Major collaborators that are not directly funded will include:
– ClimatePrediction.net, GODIVA (NERC e-science projects)
– NCAS/CGAM: The Centre for Global Atmospheric Modelling at the University of Reading (via Lois Stenman-Clark and Katherine Bouton)
– Already required to provide technology to support the major UK project HIGEM (a collaboration between the Hadley Centre and the NERC academic community to develop the next generation of high-resolution GCMs based on HadGEM).
Outline
• Motivation:
– The BADC, BODC, and the Metadata Gateway
• The NDG Goal
• NDG Metadata Structures and Architecture
– Metadata Model
– Data Model
– ISO Context
• NDG Prototype Status
• Summary & Challenges
The British Oceanographic Data Centre
(not for much longer, moving to a site on Liverpool University campus imminently)
BODC Mission Statement
To operate a world class data centre in support of UK marine science by:
• providing data management support for UK marine science projects
• maintaining and developing the UK’s national oceanographic database
• developing innovative marine data products and digital atlases
• collaborating, on behalf of the UK, in the international exchange and management of oceanographic data
• making high quality data readily available to UK research scientists in academia, government and industry
British Atmospheric Data Centre
The Role: Key words: Curation and Facilitation!
BADC Users
• 3800 registered in March 03
• ~300 individual users per month
[Pie chart: Users by Discipline, November 02, 2150 users. Categories: Atmospheric, Water, Earth Science, Medical/Bio, Other, Geography, Engineering. Segment counts, in extraction order: 882, 342, 230, 149, 214, 179, 154.]
BADC Storage Capacity
• Approx 50 TB (Nov 02)
• Projected to quadruple well within the next couple of years given existing commitments
• Planning exercise under way now
• Committed to keeping as much as possible on spinning disk
• Further backup and extra storage at national archival centre (ATLAS, PB soon)
BADC in the RAL Network
[Network diagram. Labels: GBit Ethernet; Web Server; NAS Storage 12.6 TB; Tape Library 5 TB; Tape Library 30 TB; Tape Servers; 2.3 TB; 0.5 TB; GB Switches; Routers; ATLAS 0.5 PB; SAN; SCSI; TVN; link speeds 1 Gbit, 622 Mbit, 2.5 Gb.]
Huge variety of Data Sets
Querying datasets
Complex Metadata, held in Ingres database: export DIF and Z39.50
No possibility of automatic data usage …
Different types of data returned: Wallingford
Supporting very diverse user community: NetCDF is not enough …
NERC Metadata Gateway - SST
No clean handover from discovery to browse and use!
• Geospatial coordinates forgotten. Time reference forgotten. Need to get entire field(s), and find correct time!
• And if I want to compare data from different locations?
– multiple logins
– multiple formats
– discovery?
The NERC DataGrid
[Architecture diagram: the NERC Grid within the wider Internet. Labels: BADC NDG Wrapper, BODC NDG Wrapper and Group NDG Wrapper, each with online data and an XML database; tape robot; Research Group Data Sources; Satellite; Supercomputer; Grid User; Software Agent; NDG Web Portal with its own XML database; Internet User (via Internet link); ESG (& other) Applications (via Internet link).]
Metadata Origins
[Diagram: nested hierarchy. Labels: Research Groups; Satellite; SuperComputer; DB; Shared Resources; Wider Internet.]
Consider a hierarchy of data users beginning with an individual scientist, who may herself be part of a research group, itself part of a community sharing resources, lying in the wider internet …
To be well integrated the metadata should have a role at each level! (The data portal client and server interface may be different at each level.)
At each level “extra” metadata will be required, probably produced by dedicated staff at the research group or data centre.
A google for data; the metadata carrot!
[Diagram: the NERC DataGrid architecture and the Metadata Origins hierarchy from the two preceding slides, shown together.]
NDG Metadata Taxonomy
Definitions
• A: Usage metadata generated from (or about) the data. Normally generated directly from internal metadata.
• B: Generic complete metadata, semantic, syntactic (A), including discipline-specific (E).
• C: Metadata generated to describe both documentation and annotations (as opposed to binary data).
• D: Discovery metadata suitable for harvesting. Probably based on Dublin Core & GEO. Subset of B and C.
• E: Extra metadata, discipline specific.
• S: Summary metadata (overlap between A & D).
• Q: Schema which defines supported queries upon A, B, C, D.
Relationships
[Diagram: relationships between the XML metadata documents A, B, C, D, E, S and Q.]
Data Ingestion
[Flow diagram; steps as labelled:]
• Appropriate convention used to develop applications
• Application produces conforming data files
• Create aggregation metadata (XML A)
• Optionally create higher-level aggregations
• Add further discovery-level metadata (XML B)
• Move B into a local NDG directory structure
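Harvesting
[Flow diagram; steps as labelled:]
• XML C and XML B documents sit in the local instance of the NDG directory
• Ingestion is triggered for new C or B, into the local NDG Meta DB
• Creation of summary metadata (XML D) for harvesting (on demand? nightly?)
• D is harvested by the NDG Portal(s?)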
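The harvesting flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the NDG implementation: the `*.b.xml` file naming, the element names `title` and `abstract`, and the helper names `make_discovery_record` and `harvest` are all hypothetical, chosen only to show summary (D) records being derived from new B documents in a local directory.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical subset of B that is copied into the discovery record D.
DISCOVERY_FIELDS = ("title", "abstract")


def make_discovery_record(b_xml: str) -> str:
    """Build a D document holding only the discovery fields found in B."""
    b_root = ET.fromstring(b_xml)
    d_root = ET.Element("D")
    for tag in DISCOVERY_FIELDS:
        node = b_root.find(tag)
        if node is not None:
            ET.SubElement(d_root, tag).text = node.text
    return ET.tostring(d_root, encoding="unicode")


def harvest(ndg_dir: Path, already_seen: set) -> dict:
    """Return {filename: D xml} for B documents not harvested before."""
    records = {}
    for path in sorted(ndg_dir.glob("*.b.xml")):
        if path.name in already_seen:
            continue
        records[path.name] = make_discovery_record(path.read_text())
        already_seen.add(path.name)
    return records


# Usage: one made-up B document, harvested twice.
with tempfile.TemporaryDirectory() as tmp:
    ndg_dir = Path(tmp)
    (ndg_dir / "acsoe.b.xml").write_text(
        "<B><title>ACSOE</title><abstract>placeholder abstract</abstract>"
        "<instrument>detail not needed for discovery</instrument></B>"
    )
    seen = set()
    first = harvest(ndg_dir, seen)
    second = harvest(ndg_dir, seen)  # nothing new on the second pass
```

Tracking harvested names in a set stands in for the "ingestion triggered for new C or B" step; a nightly or on-demand run would simply call `harvest` again.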
NDG Query and Data Delivery
[Flow diagram; steps as labelled:]
• User/software generates a query (XML Q)
• Query destination: the NDG Portal, or one or more local NDG DBs
• Query type: Discovery
– Returns XML D; browse and redefine query
• Query type: Documents and Annotations
– XML C: deliver one or more documents to user
• Query type: Detailed
– XML B
• Query type: Data
– Define data request, Q (XML A)
– Local NDG DB exists? If Y, ingest A
– Extract data from the physical data
– Deliver data
• Note that definitions A do not need to match any ingested A
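One way to read the query flow above is as a lookup from query type to the metadata document that answers it. The sketch below assumes that reading; the `QueryType` enum, the `ROUTES` table and `plan_query` are hypothetical names, and the pairing of each query type with a destination is an interpretation of the diagram, not a documented NDG interface.

```python
from enum import Enum


class QueryType(Enum):
    DISCOVERY = "discovery"
    DOCUMENTS = "documents and annotations"
    DETAILED = "detailed"
    DATA = "data"


# Assumed routing: where each query type is served and which document answers it.
ROUTES = {
    QueryType.DISCOVERY: ("portal", "D"),
    QueryType.DOCUMENTS: ("local NDG DB", "C"),
    QueryType.DETAILED: ("local NDG DB", "B"),
    QueryType.DATA: ("local NDG DB", "A"),
}


def plan_query(query_type: QueryType) -> dict:
    """Describe how a query of the given type would be handled."""
    destination, document = ROUTES[query_type]
    return {
        "destination": destination,
        "document": document,
        # Only data queries go on to extract and deliver physical data.
        "extract_data": query_type is QueryType.DATA,
    }


plan = plan_query(QueryType.DATA)
```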
Separate data (A) and metadata (B) models
• Clear separation of function
– Difference between data use and discovery etc.
– “Tuning” of metadata to include relevant detail
• Allows increased reuse of metadata model
– Avoids tie-in to details of a particular field’s data formats
– Can plug in another data model
[Diagram: the Metadata Model and the Data Model, linked by a data granule ID and a data summary.]
(A) NDG Data Model: Overview
[UML class diagram: Dataset, Variable, Array, Coordinate and GranuleDescriptor, linked by associations with multiplicities 1, * and 0..1.]
• Dataset: named container for a number of variables
• Variable: physical parameters within the dataset; controlled vocabularies, e.g. BODC data dictionary, CF standard names
• Array: multidimensional container for other arrays or numeric data
• Coordinate: may be shared between multiple Arrays; ‘anonymous’ if not georeferenced; MappedCoordinate vs ProductCoordinate; with respect to a Coordinate Reference System (ref ISO 19111, ISO 19115)
• GranuleDescriptor: describes data granule in terms of file storage; enables file aggregation; SQL/OGSA-DAI for RDBMS; physical or logical (e.g. SRB) files
“Profiles” of model defined for important data types
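The five classes above can be sketched as Python dataclasses. This is a rough illustration, not the NDG schema: the field names, the choice to make the granule optional, and the rule that an array "has data" only when a granule backs it are assumptions made for the example, and the file name is made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GranuleDescriptor:
    """Where a granule is stored (hypothetical fields)."""
    location: str        # file path, logical (e.g. SRB) name, or SQL query
    kind: str = "file"   # assumed labels: "file", "logical", "rdbms"


@dataclass
class Coordinate:
    """Axis description; one instance may be shared by several Arrays."""
    name: str
    axis_type: Optional[str] = None  # None models an 'anonymous' coordinate
    units: Optional[str] = None


@dataclass
class Array:
    """Multidimensional container for other arrays or numeric data."""
    axis_sizes: List[int]
    coordinates: List[Coordinate] = field(default_factory=list)
    children: List["Array"] = field(default_factory=list)
    granule: Optional[GranuleDescriptor] = None  # assumed optional

    @property
    def rank(self) -> int:
        return len(self.axis_sizes)

    @property
    def has_data(self) -> bool:
        # Assumption: an array holds numeric data only if a granule backs it.
        return self.granule is not None


@dataclass
class Variable:
    """Physical parameter, named from a controlled vocabulary."""
    name: str
    standard_name: str
    units: str
    array: Array


@dataclass
class Dataset:
    """Named container for a number of variables."""
    name: str
    variables: List[Variable] = field(default_factory=list)


# A salinity section: one parent array over 50 casts, only one cast shown.
longitude = Coordinate("longitude", axis_type="x", units="degrees_east")
cast_1 = Array([20], granule=GranuleDescriptor("cast_001.dat"))  # made-up file
salinity_data = Array([50], coordinates=[longitude], children=[cast_1])
salinity = Variable("salinity", "sea_water_salinity", "psu", salinity_data)
section = Dataset("WOCE SR3", [salinity])
```

The parent array carries no granule of its own and only groups its child casts, which is one way to express "container for other arrays or numeric data".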
NDG Data Model: Array
[UML object diagram; objects and attributes as labelled:]
WOCE SR3 section : Dataset
    name : string(idl) = WOCE SR3
Salinity : Variable
    name : string(idl) = salinity
    parameter : CFStandardName = sea_water_salinity
    units : PhysicalUnits = psu
    missingValue : Numeric
    scaleFactor : Numeric
    offset : Numeric
Salinity Data : Array
    rank : short(idl) = 1
    axisSize : short(idl) = 50
    hasData : boolean(idl) = no
    type : NumericTypeName
    isomorphicChildren : boolean(idl) = no
Cruise track longitude : MappedCoordinate
    name : string(idl) = longitude
    axisType : AxisType = x
    axisUnits : AxisUnits = degrees_east
    rank : short(idl) = 1
    axisSize : short(idl) = 50
    type : NumericTypeName = float
    arrayDimension : short(idl) = 1
Cruise track latitude : MappedCoordinate
    name : string(idl) = latitude
    axisType : AxisType = y
    axisUnits : AxisUnits = degrees_north
    rank : short(idl) = 1
    axisSize : short(idl) = 50
    type : NumericTypeName = float
    arrayDimension : short(idl) = 1
Cast 1 : Array
    rank : short(idl) = 1
    axisSize : short(idl) = 20
    hasData : boolean(idl) = yes
    type : NumericTypeName = float
    isomorphicChildren : boolean(idl)
Cast 50 : Array
    rank : short(idl) = 1
    axisSize : short(idl) = 40
    hasData : boolean(idl) = yes
    type : NumericTypeName = float
    isomorphicChildren : boolean(idl)
Cast 1 depth : ProductCoordinate
    name : string(idl) = depth
    axisType : AxisType = z
    axisUnits : AxisUnits = m
    axisSize : short(idl) = 20
    type : NumericTypeName = float
    arrayDimension : short(idl) = 1
(B) Metadata Model
[Entity-relationship diagram. Entity groups: Activity; Observation station types; Data production tools; Dataset types; Basic data entities; Derived data entities; Data Granules; Common Data Entities (dimensions (spatial/temporal), grids, organisations, people, places/areas). Relationships shown: Includes / Included-in; Deploys-a / Deployed-on-a; Produces / Output-by; Produces / Output-at; Can-be-aggregated-in.]
(B) Metadata Model: an NDG Intermediate Schema, Conceptual Overview
[Diagram: conceptual overview of the tiered schema.
Tiers: Tier-0 Activity; Tier-1 Observation station types; Tier-2 Data production tools; Tier-3 Dataset types; Tier-4 Basic data entities; Tier-5 Derived data entities; plus Common entities.
Entities shown: Instrument, Ensemble, Analysis, Stationary, Moving, Section, Profile, Lagrangian path, Grid, Time, Trajectory, Point, Area, Place, Model, Simulation, Spatiotemporal entity, Entity with DIF record, Data owning entity, Sample, Person, Organisation, Role, Timeseries, Climatology, Measurement, Integration, N-dimensional dataset, Spatial dimensions.
Relationships shown (inter-tier and directed): Includes / Included-in; Can-be-aggregated-in; Integrated-into; Produces / Output-by; Collects / Can-be-collected-in; Deploys-a / Deployed-on-a; Is-time-ordered-series-of; Follows-a; Superset-of / Subset-of; Processed-to-a.]
ISO TC211
• ISO 19101: Geographic information – Reference model
• ISO 19103: Geographic information – Conceptual schema language
• ISO 19107: Geographic information – Spatial schema
• ISO 19108: Geographic information – Temporal schema
• ISO 19109: Geographic information – Rules for application schema
• ISO 19111: Geographic information – Spatial referencing by coordinates
• ISO 19115: Geographic information – Metadata
• ISO 19118: Geographic information – Encoding
• ISO 19121: Geographic information – Imagery and gridded data
ISO19115
Dataset title
Dataset reference date
Dataset responsible party
Metadata point of contact
Dataset language
Dataset character set
Dataset topic category
Abstract describing dataset
Spatial resolution of dataset
Spatial representation type
Geographic location of dataset
Vertical/temporal extent for dataset
Reference system
Lineage
Distribution format
On-line resource
Metadata character set
Metadata date stamp
Metadata standard name
Metadata standard version
Metadata file identifier
Metadata language
ISO
• Metadata extensions and profiles
Direct relationship between ISO19115 and our (B) Intermediate schema.
ISO19101
• Profiling of ISO 191xx
“The comprehensiveness and large number of options available in various base standards make it difficult to combine them for practical applications. … A profile integrates a set of base standards and/or modules (predefined subsets) of base standards to meet a specific implementation requirement.”
• Registration of profiles
“A profile that is registered through an ISO registration procedure becomes an International Standardized Profile (ISP). National standards that are expressed as profiles of ISO base standards may be registered at a national level.”
Further Application in NERC DataGrid
• e.g. Data model “Coordinates”
[UML class diagram of the NDG data model (Dataset, Variable, Array, Coordinate, GranuleDescriptor), annotated with ISO 19111 and ISO 19108.]
The Data Use Chain
[Time-line diagram. Stages as labelled: Discovery, Authentication, Authorisation, Extraction, Sub-Sampling, Regridding, Processing, Display, Delivery, Formatting.]
NERC DataGrid Key Technologies
Key Components – need APIs and standards
[Layer diagram. Layers: Applications; NDG Middleware; Grid Fabric; Data Owner. Components as labelled: GIS, CDAT, VIS, SERVICE, PORTAL, ANNOTATE; DISCOVERY, BROWSE, REFORMAT, DELIVERY, SECURITY, INGEST, Harvest; OGSA/DAI, SRB, RLS, Globus; DataBase, disc arrays (1..n), Tape Libs.]
NDG Discovery Service Element
[Diagram: existing metadata passes through an XSLT ingest transformation into Intermediate Schema document(s) (XML). In the NDG Discovery Service Element, three XSLT processors and a pass-through feed the output formats: Directory Interchange Format, Dublin Core, GEO Profile (Z39.50) and, with a question mark, a Catalogue Interoperability Protocol.]
Traditional and Grid Service (GT3) Interfaces
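The export step of the discovery service can be illustrated with a small field-mapping transform. The slide names XSLT processors; the sketch below deliberately swaps in plain Python with `xml.etree.ElementTree`, because the standard library has no XSLT engine. The source element names (`name`, `abstract`, `owner`) and the `FIELD_MAP` table are hypothetical and do not come from the NDG intermediate schema; the Dublin Core namespace and the element names `title`, `description` and `creator` are the real Dublin Core ones.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical intermediate-schema tag -> Dublin Core element name.
FIELD_MAP = {"name": "title", "abstract": "description", "owner": "creator"}


def to_dublin_core(intermediate_xml: str) -> ET.Element:
    """Copy mapped fields from an intermediate document into a DC record."""
    src = ET.fromstring(intermediate_xml)
    out = ET.Element("record")
    for src_tag, dc_tag in FIELD_MAP.items():
        for node in src.iter(src_tag):
            ET.SubElement(out, f"{{{DC_NS}}}{dc_tag}").text = node.text
    return out


# Usage with a made-up intermediate document.
record = to_dublin_core(
    "<dataset><name>ERA-40</name><abstract>placeholder abstract</abstract></dataset>"
)
title = record.find(f"{{{DC_NS}}}title").text
```

Each additional export format (DIF, GEO profile) would be another mapping of the same shape, which is the appeal of keeping one intermediate schema and transforming outwards from it.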
Starting with the LAS
Deployment for UK users within a few weeks (constraint is primarily access control)
LAS – Simple Box fill Output
Work for us to do: Labelling is inadequate as yet ..
ERA40
Cache management in LAS/CDAT
[Flow diagram: an Internet user works through LAS, which sees an 18 TB virtual dataset backed by the ERA-40 spectral archive (4 TB) and an ERA-40 grid cache (< 1 TB). Steps as labelled:]
• CDAT: calls cdms.open to open data file.
• BADC/CDAT (localCache.py): intercepts the command and checks the cache.
• Locks access to cache. Checks if the regular gridded file is in the cache list.
• NO: spectral file is converted on-the-fly and placed in cache. Cache also checks if there is enough room, deletes oldest files if necessary and checks against the disk space limit.
• YES: cache unlocked. New cdms.open command sent to CDAT and cache file opened.
• Data object delivered to LAS.
• NetCDF file, plot or animations delivered to user.
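The cache logic in the flow above (lock, check the list, convert on a miss, evict the oldest files when over the limit, unlock) can be sketched as follows. This is not the code in localCache.py: the `GridCache` class, the `.grid.nc` naming scheme and the `convert` callback are hypothetical, and the sketch only tracks names and sizes in memory instead of touching files on disk.

```python
import threading
from collections import OrderedDict


class GridCache:
    """Size-limited cache of gridded files, evicting the oldest first."""

    def __init__(self, limit_bytes, convert):
        self._limit = limit_bytes
        self._convert = convert        # callable(spectral_name) -> size in bytes
        self._files = OrderedDict()    # gridded name -> size, oldest first
        self._lock = threading.Lock()

    def _make_room(self, needed):
        if needed > self._limit:
            raise ValueError("file is larger than the whole cache")
        while sum(self._files.values()) + needed > self._limit:
            self._files.popitem(last=False)  # drop the oldest entry

    def open(self, spectral_name):
        gridded = spectral_name + ".grid.nc"  # hypothetical naming scheme
        with self._lock:                      # lock access to the cache
            if gridded not in self._files:    # miss: convert on the fly
                size = self._convert(spectral_name)
                self._make_room(size)
                self._files[gridded] = size
        return gridded                        # caller opens this cached file

    def cached(self):
        with self._lock:
            return list(self._files)


# Usage: a fake converter that records calls and reports 40-byte outputs.
conversions = []


def fake_convert(name):
    conversions.append(name)
    return 40


cache = GridCache(limit_bytes=100, convert=fake_convert)
for name in ["a.spec", "b.spec", "c.spec", "b.spec"]:
    cache.open(name)
```

With a 100-byte limit and 40-byte files, the third request evicts the first file, and the repeated request for `b.spec` is served from the cache without a second conversion.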
NERC DataGrid Prototype
• (By hand) ingestion of ACSOE data from BADC and BODC.
• NASA GCMD DIF-based discovery
– Exported from Intermediate Schema
– Harvested by hand
• Working on hand-over mechanism to pass dataset info to DataModel-based LAS service
– Generate and populate LAS database in response
– Use standard LAS delivery
Next Steps:
• GT3-based services, improve LAS, improve delivery, implement multiple data model profiles, implement multiple discovery services.
Summary
The NDG project has been running for a year now, aiming to provide grid-enabled tools to support:
– a diverse community
– with diverse datasets
NDG is part of the UK National E-science programme, and will leverage other projects to implement grid solutions.
– initial prototype is web-service based
– GT3 prototype due early in the new year
Software development is based on plagiarising the maximum amount from other groups, and a standards-based approach within the NDG.
– All code will be in the public domain
The major challenges will not be technical: they are policy, attitudes and legal issues.