Earth System Modelling & the NDG
ESM Meeting, Cambridge 2003 BADC: badc.nerc.ac.uk, NERC DataGrid: www.ndg.badc.rl.ac.uk
Bryan Lawrence (Kerstin Kleese, Roy Lowry, Kevin O’Neill, Andrew Woolf & others)
NCAS/British Atmospheric Data Centre
Rutherford Appleton Laboratory, CCLRC
NDG Partners
• As funded, a partnership between:
– British Atmospheric Data Centre (BADC, PI: Bryan Lawrence)
– British Oceanographic Data Centre (BODC, Co-I: Roy Lowry)
– CLRC E-science Centre (Co-I: Kerstin Kleese)
– PCMDI at LLNL in the US (Dean Williams, Bob Drach, Mike Fiorino)
• The project has caught the imagination; extra funding now supports:
– A number of groups at the NERC Centre for Ecology and Hydrology (CEH: Ecology DataGrid)
– NERC Earth Observation Data Centre & Plymouth Marine Lab Remote Sensing
• Major collaborators that are not directly funded will include:
– ClimatePrediction.net, GODIVA (NERC e-science projects)
– NCAS/CGAM: The Centre for Global Atmospheric Modelling at the University of Reading (via Lois Stenman-Clark and Katherine Bouton)
• The project will support HIGEM
Outline
• Motivation
• The NDG Goals
• NDG Metadata
• Networks
• Summary
British Atmospheric Data Centre
The Role: Key words: Curation and Facilitation!
Easily catalogued, but successful preservation?
One could argue that the writers of these documents did a brilliant job of preserving the bits-and-bytes of their time …
And yes they’ve both been translated … many times, it’s a shame the meanings are different …
Phaistos Disk, 1700BC
NERC Metadata Gateway - SST
No clean handover from discovery to browse and use!
• Geospatial coordinates forgotten. Time reference forgotten. Need to get entire field(s), and find correct time!
• And if I want to compare data from different locations?
– multiple logins
– multiple formats
– discovery?
A priori would any user know to look in the COAPEC data set?
Earth system-science means we have to remove these boundaries!
• Detailed file-level metadata isn’t visible, so data mining applications are impossible.
NB: Dynamic catalogues!
How good is our metadata?
Finding Data
The Goal: Very simple interface, hide the complex software!
A newer “dataset”
The extreme relevance of this example from Amazon was pointed out by Jon Callahan (LAS project, PMEL)!
PCMDI – Best practice!
(if you know where to look)
Final references are papers!
Is the information coupled to the datasets? What if I take a dataset home, and another, and another … and then forget which is which?
Can I ask the question: what datasets used the Semtner sea ice parameterisation?
Huge variety of Data Sets
Different types of data returned: Wallingford
Supporting very diverse user community: NetCDF is not enough …
Modelling advances: Baseline Numbers
• T42 CCSM (current, 280km)
– 7.5GB/yr, 100 years -> 0.75TB
• T85 CCSM (140km)
– 29GB/yr, 100 years -> 2.9TB
• T170 CCSM (70km)
– 110GB/yr, 100 years -> 11TB
NCAR
Don Middleton
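The baseline volumes above follow directly from the per-year output rates. A minimal sketch of the arithmetic, assuming decimal units (1 TB = 1000 GB); the names `GB_PER_YEAR` and `run_volume_tb` are illustrative and not from the talk:

```python
# Per-year output volumes (GB) quoted on the slide for three CCSM resolutions.
GB_PER_YEAR = {"T42": 7.5, "T85": 29.0, "T170": 110.0}


def run_volume_tb(gb_per_year: float, years: int = 100) -> float:
    """Total output of one run in TB, assuming 1 TB = 1000 GB."""
    return gb_per_year * years / 1000.0


# Volume of a 100-year run at each resolution.
volumes = {res: run_volume_tb(gb) for res, gb in GB_PER_YEAR.items()}

for res, tb in volumes.items():
    print(f"{res}: {tb:.2f} TB per 100-year run")
```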
Capacity-related Improvements
Increased turnaround, model development, ensemble of runs
Increase by a factor of 10, linear data
• Current T42 CCSM
– 7.5GB/yr, 100 years -> 0.75TB * 10 = 7.5TB
Capability-related Improvements
Spatial Resolution: T42 -> T85 -> T170
Increase by factor of ~ 10-20, linear data
Temporal Resolution: Study diurnal cycle, 3 hour data
Increase by factor of ~ 4, linear data
CCM3 at T170 (70km)
Capability-related Improvements
Quality: Improved boundary layer, clouds, convection, ocean physics, land model, river runoff, sea ice
Increase by another factor of 2-3, data flat
Scope: Atmospheric chemistry (sulfates, ozone…), biogeochemistry (carbon cycle, ecosystem dynamics), Middle Atmosphere Model…
Increase by another factor of 10+, linear data
Model Improvement Wishlist
Grand Total:
Increase compute by a Factor O(1000-10000)
NCAR
Don Middleton
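One way to see where a grand total of this order comes from is to multiply the individual factors from the preceding slides. This is a sketch under one reading of those slides: each "increase by a factor of" is taken as a compute multiplier, and the "10+" for scope is taken as exactly 10. Both are assumptions, and the dictionary labels are illustrative.

```python
from math import prod

# (low, high) compute multipliers as read off the preceding slides.
# Treating "10+" for scope as exactly 10 is an assumption.
FACTORS = {
    "capacity (turnaround, ensembles)": (10, 10),
    "spatial resolution": (10, 20),
    "temporal resolution": (4, 4),
    "quality (improved physics)": (2, 3),
    "scope (chemistry, biogeochemistry)": (10, 10),
}

# Multiply the low ends together, then the high ends.
low = prod(lo for lo, _ in FACTORS.values())
high = prod(hi for _, hi in FACTORS.values())

print(f"combined compute factor: {low} to {high}")
```

Under these assumptions the product runs from 8,000 to 24,000, the same order of magnitude as the upper end of the quoted O(1000-10000).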
Climate in 20010 – A graphic Illustration
Figures from Gary Strand, NCAR, ESG website
Summary thus far
Contentions:
• The average atmospheric science project spends about 1/3 of its time on data handling! (Getting, reformatting etc.)
• The problem for earth system model projects is about to get worse – for everyone, from the initiator, to the archiver, to the analyst, to the contributor, to the improver.
• (Remember the documentation problem is growing exponentially too: new sub-components etc)
The NERC DataGrid
[Architecture diagram. Labelled components: BADC, BODC and research-group NDG wrappers, each with online data and an XML database (BADC also with a tape robot); research group data sources (satellite, supercomputer); the NERC Grid with grid users and software agents; an NDG web portal with its own XML database; internet links to internet users and to ESG (& other) applications on the wider internet.]
NDG Query and Data Delivery
[Flowchart; only the labels survive in this transcript.]
• User/software generates a query (XML Q).
• Query destination: the NDG portal, or one or more local NDG DBs.
• Query type Discovery: returns discovery metadata (XML D); browse and redefine query.
• Query type Detailed: deliver one or more documents to the user (XML B; documents and annotations, XML C).
• Query type Data: define data request, Q (XML A); extract data from the physical data; deliver data.
• Local NDG DB exists? If yes, ingest A.
Note that definitions A do not need to match any ingested A.
The Data Use Chain
Discovery
Authentication
Authorisation
Extraction
Sub-Sampling
Regridding
Processing Display
Delivery
Formatting
Time-line
Requirements: Information (1)
Amazon Discovery gives good examples:
• Browse
• Similar datasets
• Details
• Content examples
Learn from the library and book handling community!
Our domain issues require:
• Dealing with Volume
• Formats
• Providing Tools
“Scientists are real people too”
Jon Callahan (from the LAS project at PMEL)
All require documentation (aka metadata);
We need to improve our information handling
NDG Metadata Taxonomy
Definitions
A: Usage metadata generated from (or about) the data. Normally generated directly from internal metadata.
B: Generic complete metadata, semantic, syntactic (A), including discipline specific (E).
C: Metadata generated to describe both documentation and annotations (as opposed to binary data).
D: Discovery metadata suitable for harvesting. Probably based on Dublin Core & GEO. Subset of B and C.
E: Extra metadata, discipline specific.
S: Summary metadata (overlap between A & D).
Q: Schema which defines supported queries upon A, B, C, D.
Relationships
[Diagram of the relationships between the XML document types A, B, C, D, E, S and Q; layout not recoverable from the transcript.]
What is metadata?
The answer depends on who you are!
• Firstly: information to help one use one’s own data: e.g. calibration data (E), netCDF metadata (A).
• It is information passed with the data to enable someone else to use it. It describes the data. (B & E)
• Metadata can help one find other people’s data … and then help one obtain and use it. (D)
• Metadata can be used to enable automatic software to find (D) & manipulate data (A).
• Metadata can be used to enable the preservation of data for posterity (all of ABCD).
NDG A and B metadata in practice
Clear separation of function between use and discovery.
• Standards compliant
• Avoid tie-in to details of particular fields or data formats or even components
Metadata model (B)
• “Intermediate” schema, supports multiple discovery formats
NDG Data Model (A)
• provides an abstract semantic model for the structure of data within NDG,
• enables the specification of concrete instances for use by NDG Data Services
(B) Metadata Model
[Diagram. Entities: Activity, observation station types, data production tools, dataset types, basic data entities, derived data entities, data granules. Relationships: Includes/Included-in, Deploys-a/Deployed-on-a, Produces/Output-by, Produces/Output-at, Can-be-aggregated-in.]
Common Data Entities
– dimensions, spatial/temporal
– grids
– organisations
– people
– places/areas
(B) Metadata Model: an NDG Intermediate Schema, Conceptual Overview
[Diagram; only the labels survive in this transcript.]
Tiers:
• Tier-0: Activity
• Tier-1: Observation station types
• Tier-2: Data production tools
• Tier-3: Dataset types
• Tier-4: Basic data entities
• Tier-5: Derived data entities
Entity labels: Instrument, Model, Simulation, Analysis, Ensemble, Integration, Measurement, Sample, Stationary, Moving, Section, Profile, Lagrangian path, Grid, Time, Trajectory, Point, Area, Place, Timeseries, Climatology, N-dimensional dataset, Spatial dimensions.
Common entities: Spatiotemporal entity, Entity with DIF record, Data-owning entity, Person, Organisation, Role.
Relationships (inter-tier and directed): Includes/Included-in, Deploys-a/Deployed-on-a, Produces/Output-by, Collects/Can-be-collected-in, Can-be-aggregated-in, Integrated-into, Is-time-ordered-series-of, Follows-a, Superset-of/Subset-of, Processed-to-a.
NDG Discovery Service Element
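[Diagram; only the labels survive in this transcript.]
• Existing metadata passes through an XSLT ingest transformation into intermediate schema document(s) (XML).
• XSLT processors produce the output formats: Directory Interchange Format, Dublin Core, GEO Profile (Z39.50).
• A pass-through route is also shown, and a Catalogue Interoperability Protocol is marked with a question mark.
• The NDG Discovery Service Element offers traditional and Grid Service (GT3) interfaces.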
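The idea of deriving a discovery format from an intermediate record can be sketched in a few lines. This is not the NDG implementation: the slide names XSLT processors, whereas the sketch below swaps in a plain element mapping using Python's standard `xml.etree.ElementTree`. The intermediate element names (`record`, `name`, `owner`, `abstract`) and the sample values are invented for illustration; only the Dublin Core namespace and element names are real.

```python
import xml.etree.ElementTree as ET

# Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical intermediate-schema record; element names are invented.
INTERMEDIATE = """
<record>
  <name>Example dataset</name>
  <owner>Example Data Centre</owner>
  <abstract>Placeholder description.</abstract>
</record>
"""

# Assumed mapping from intermediate elements to Dublin Core elements.
TO_DUBLIN_CORE = {"name": "title", "owner": "creator", "abstract": "description"}


def to_dublin_core(xml_text: str) -> ET.Element:
    """Build a Dublin Core style record from an intermediate record."""
    source = ET.fromstring(xml_text)
    out = ET.Element("metadata")
    for src_tag, dc_tag in TO_DUBLIN_CORE.items():
        node = source.find(src_tag)
        if node is not None and node.text:
            # Copy the text across under the namespaced Dublin Core tag.
            ET.SubElement(out, f"{{{DC_NS}}}{dc_tag}").text = node.text.strip()
    return out


dc_record = to_dublin_core(INTERMEDIATE)
print(ET.tostring(dc_record, encoding="unicode"))
```

Keeping one mapping per output format is what lets a single intermediate record feed several discovery formats.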
NDG Semantic Data Model (A)
WOCE SR3 section : Dataset
  name : string(idl) = WOCE SR3
Salinity : Variable
  name : string(idl) = salinity
  parameter : CFStandardName = sea_water_salinity
  units : PhysicalUnits = psu
  missingValue : Numeric
  scaleFactor : Numeric
  offset : Numeric
Salinity Data : Array
  rank : short(idl) = 1
  axisSize : short(idl) = 50
  hasData : boolean(idl) = no
  type : NumericTypeName
  isomorphicChildren : boolean(idl) = no
Cruise track longitude : MappedCoordinate
  name : string(idl) = longitude
  axisType : AxisType = x
  axisUnits : AxisUnits = degrees_east
  rank : short(idl) = 1
  axisSize : short(idl) = 50
  type : NumericTypeName = float
  arrayDimension : short(idl) = 1
Cruise track latitude : MappedCoordinate
  name : string(idl) = latitude
  axisType : AxisType = y
  axisUnits : AxisUnits = degrees_north
  rank : short(idl) = 1
  axisSize : short(idl) = 50
  type : NumericTypeName = float
  arrayDimension : short(idl) = 1
Cast 1 : Array
  rank : short(idl) = 1
  axisSize : short(idl) = 20
  hasData : boolean(idl) = yes
  type : NumericTypeName = float
  isomorphicChildren : boolean(idl)
Cast 50 : Array
  rank : short(idl) = 1
  axisSize : short(idl) = 40
  hasData : boolean(idl) = yes
  type : NumericTypeName = float
  isomorphicChildren : boolean(idl)
Cast 1 depth : ProductCoordinate
  name : string(idl) = depth
  axisType : AxisType = z
  axisUnits : AxisUnits = m
  axisSize : short(idl) = 20
  type : NumericTypeName = float
  arrayDimension : short(idl) = 1
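The object diagram above can be mirrored as plain data classes to make the nesting explicit. This is a sketch, not the NDG data model itself: the class and field names are simplified, and the containment shown here (the empty 50-element salinity array holding the individual casts as children) is an assumption read from `hasData = no` and `isomorphicChildren = no`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Coordinate:
    name: str
    axis_type: str   # "x", "y" or "z"
    axis_units: str
    axis_size: int
    kind: str        # "MappedCoordinate" or "ProductCoordinate"


@dataclass
class Array:
    label: str
    rank: int
    axis_size: int
    has_data: bool
    children: List["Array"] = field(default_factory=list)


@dataclass
class Variable:
    name: str
    parameter: str   # CF standard name
    units: str
    data: Array


@dataclass
class Dataset:
    name: str
    variables: List[Variable]
    coordinates: List[Coordinate]


# Only casts 1 and 50 are spelled out on the slide, so only they appear here.
casts = [Array("Cast 1", 1, 20, True), Array("Cast 50", 1, 40, True)]
salinity = Variable(
    name="salinity",
    parameter="sea_water_salinity",
    units="psu",
    data=Array("Salinity Data", rank=1, axis_size=50, has_data=False, children=casts),
)
section = Dataset(
    name="WOCE SR3",
    variables=[salinity],
    coordinates=[
        Coordinate("longitude", "x", "degrees_east", 50, "MappedCoordinate"),
        Coordinate("latitude", "y", "degrees_north", 50, "MappedCoordinate"),
        Coordinate("depth", "z", "m", 20, "ProductCoordinate"),
    ],
)

# Casts of different lengths (20 and 40 levels) are why the children are not isomorphic.
print([child.axis_size for child in section.variables[0].data.children])
```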
NDG Prototype
Layout not important (yet!)
It’s what’s under the hood that counts …
( … the data is NOT in NetCDF. The original data is available …
… the search covered data that could have been harvested …
… the architecture works!)
NDG Metadata Status
• We have built a SIMPLE prototype based primarily on our data model and used our structures to find, locate, reformat and deliver data typical of BODC and BADC observational data. (This is a first)
• We are about to re-engineer.
• Key issues to address will be
– Vocabularies, and
– Ontologies
– Developing a Model Attribute Language (with CGAM, PRISM, PCMDI and others).
• Populating our metadata; a boring and laborious job!
Metadata Origins
[Diagram. Labels: research groups, satellite, supercomputer, DB, shared resources, wider internet.]
Consider a hierarchy of data users beginning with an individual scientist, who may herself be part of a research group, itself part of a community sharing resources, lying in the wider internet …
To be well integrated the metadata should have a role at each level! (The data portal client and server interface may be different at each level.)
At each level “extra” metadata will be required, probably produced by dedicated staff at the research group or data centre.
Requirements (2)
We need to think about our networks and our tools for moving and keeping track of data!
• We can’t rely on “leave it at the supercomputer site”
– How do we do joint analysis?
– How do we process the data at all?
• Malcolm Atkinson, quoting Jim Gray, pointed out that it takes:
~ O(minute) to grep or ftp a GB
~ O(2 days) to grep or ftp a TB
~ O(3 years) to grep or ftp a PB
• Requires
– sophisticated “fire and forget” file transfer (that has to outperform “sneaker net”).
– Disk and compute resources for processing.
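The quoted times can be turned into the sustained rates they imply. A small sketch, assuming decimal units (1 GB = 1e9 bytes) and a 365-day year; the names are illustrative:

```python
SECONDS = {"minute": 60, "day": 86_400, "year": 365 * 86_400}

# (size in bytes, quoted time in seconds) for each data volume.
QUOTED = {
    "GB": (1e9, 1 * SECONDS["minute"]),
    "TB": (1e12, 2 * SECONDS["day"]),
    "PB": (1e15, 3 * SECONDS["year"]),
}


def implied_rate_mb_s(size_bytes: float, seconds: float) -> float:
    """Sustained throughput in MB/s needed to move size_bytes in the given time."""
    return size_bytes / seconds / 1e6


rates = {unit: implied_rate_mb_s(size, secs) for unit, (size, secs) in QUOTED.items()}

for unit, rate in rates.items():
    print(f"1 {unit}: about {rate:.1f} MB/s sustained")
```

All three figures correspond to sustained rates of roughly 6 to 17 MB/s, which is why the time grows by about a thousand at each step.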
SuperJanet4
We need to address
• local firewall issues (not just at the Met Office)
• spur bandwidths. The limits are not in the backbones!
2 Mbit/s link: 80 minutes to transfer 500 MB, cf. 40 minutes with GridFTP, or less than 1 minute between DL and RAL (1 Gbit/s)
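To put these timings in context, compare them with the ideal time for the nominal link speed. A sketch assuming 1 MB = 8 Mbit (decimal units) and that both the 80-minute and 40-minute figures refer to the 2 Mbit/s link; function names are illustrative:

```python
def ideal_transfer_minutes(size_mb: float, link_mbit_s: float) -> float:
    """Transfer time in minutes if the link ran at its nominal rate."""
    return size_mb * 8 / link_mbit_s / 60


def effective_mbit_s(size_mb: float, minutes: float) -> float:
    """Throughput actually achieved, given an observed transfer time."""
    return size_mb * 8 / (minutes * 60)


ideal_2mbit = ideal_transfer_minutes(500, 2)        # nominal 2 Mbit/s spur
ideal_1gbit = ideal_transfer_minutes(500, 1000)     # 1 Gbit/s DL to RAL link
plain_rate = effective_mbit_s(500, 80)              # observed time, first figure
gridftp_rate = effective_mbit_s(500, 40)            # observed time with GridFTP

print(f"ideal at 2 Mbit/s: {ideal_2mbit:.1f} min")
print(f"ideal at 1 Gbit/s: {ideal_1gbit * 60:.0f} s")
print(f"achieved: {plain_rate:.2f} Mbit/s, with GridFTP {gridftp_rate:.2f} Mbit/s")
```

On these assumptions the ideal time on the 2 Mbit/s link is about 33 minutes, so the 40-minute GridFTP transfer is close to the limit of the link, while the 80-minute transfer uses less than half of it.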
ESG1 Results (Supercomputing, 2001)
Allcock et al. 2001
Dallas to Chicago:
Starting with the LAS
Deployment for UK users within a few weeks (constraint is primarily access control)
LAS – Simple Box fill Output
Work for us to do: Labelling is inadequate as yet ..
ERA40
Cache management in LAS/CDAT
[Flowchart, reconstructed from its labels.]
• LAS calls cdms.open to open a data file (CDAT).
• BADC/CDAT intercepts the command and checks the cache (localCache.py).
• Locks access to the cache. Checks if the regular gridded file is in the cache list.
• NO: the spectral file is converted on-the-fly and placed in the cache. The cache also checks if there is enough room, deletes the oldest files if necessary and checks against the disk space limit.
• YES: cache unlocked. A new cdms.open command is sent to CDAT and the cache file is opened.
• Data object delivered to LAS.
• NetCDF file, plot or animations delivered to the internet user.
Storage: 18 TB virtual dataset; ERA-40 spectral archive, 4 TB; ERA-40 grid cache, < 1 TB.
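The cache logic above can be sketched as a small size-limited file cache. This is an illustration of the pattern, not the contents of localCache.py: the class `GridCache`, the converter `fake_convert` and the `.grid` suffix are all hypothetical, and the real system would hand the returned path on to cdms.open.

```python
import threading
from pathlib import Path
from typing import Callable


class GridCache:
    """Size-limited file cache; the oldest files are evicted first."""

    def __init__(self, cache_dir, limit_bytes: int,
                 convert: Callable[[str, Path], None]):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.limit_bytes = limit_bytes
        self.convert = convert            # hypothetical spectral-to-grid converter
        self._lock = threading.Lock()     # serialises access to the cache

    def _used(self) -> int:
        return sum(p.stat().st_size for p in self.cache_dir.iterdir())

    def _make_room(self, keep: Path) -> None:
        # Delete the oldest files (by modification time) until under the limit.
        files = sorted(self.cache_dir.iterdir(), key=lambda p: p.stat().st_mtime)
        for path in files:
            if self._used() <= self.limit_bytes:
                break
            if path != keep:
                path.unlink()

    def open_path(self, spectral_name: str) -> Path:
        """Return the gridded copy's path, converting on a cache miss."""
        target = self.cache_dir / (spectral_name + ".grid")
        with self._lock:
            if not target.exists():
                self.convert(spectral_name, target)
                self._make_room(keep=target)
        return target


def fake_convert(spectral_name: str, target: Path) -> None:
    """Stand-in for the real conversion: writes a 100-byte placeholder file."""
    target.write_bytes(b"\0" * 100)
```

Holding the lock across both the lookup and the conversion keeps two requests from converting the same file at once, at the cost of serialising all cache misses.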
Summary
• Earth System Modelling extends the data handling challenge.
• We need better information management
• We need better tools for moving things around
• We need better tools for using remote data
• … and we need data manipulation hardware!
The NDG is attempting (with help) to address:
• Information management
• Data movement
• Tools to manipulate large volumes of data.
You’ve gone TOO FAR!