NCAR NCAR Data and Grid Efforts: The Earth System Grid & The Community Data Portal Don Middleton...

NCAR Data and Grid Efforts: The Earth System Grid & The Community Data Portal. Don Middleton, NCAR Scientific Computing Division. CAS2003, September 11, 2003


Page 1:

NCAR Data and Grid Efforts: The Earth System Grid & The Community Data Portal

Don Middleton

NCAR Scientific Computing Division

CAS2003

September 11, 2003

Page 2:

The Earth System Grid

U.S. DOE SciDAC funded R&D effort - a “Collaboratory Pilot Project”

Build an “Earth System Grid” that enables management, discovery, distributed access, processing, & analysis of distributed terascale climate research data

Build upon Globus Toolkit and DataGrid technologies and deploy

Potential broad application to other areas

http://www.earthsystemgrid.org

Page 3:

ESG Team

ANL
– Ian Foster (PI)
– Veronika Nefedova
– (John Bresnahan)
– (Bill Allcock)

LBNL
– Arie Shoshani
– Alex Sim

ORNL
– David Bernholdt
– Kasidit Chanchio
– Line Pouchard

LLNL/PCMDI
– Bob Drach
– Dean Williams (PI)

USC/ISI
– Anne Chervenak
– Carl Kesselman
– (Laura Pearlman)

NCAR
– David Brown
– Luca Cinquini
– Peter Fox
– Jose Garcia
– Don Middleton (PI)
– Gary Strand


Page 5:

Baseline Numbers

T42 CCSM (current, 280km)
– 7.5GB/yr, 100 years -> .75TB

T85 CCSM (140km)
– 29GB/yr, 100 years -> 2.9TB

T170 CCSM (70km)
– 110GB/yr, 100 years -> 11TB
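
The totals on this slide follow from multiplying the yearly output rate by the run length. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB), which is what the slide's rounding implies; the function and dictionary names are illustrative only.

```python
# Yearly output per model resolution, in GB/yr, as quoted on the slide.
GB_PER_YEAR = {"T42": 7.5, "T85": 29.0, "T170": 110.0}


def run_volume_tb(gb_per_year: float, years: int = 100) -> float:
    """Total output of one run in TB, assuming 1 TB = 1000 GB."""
    return gb_per_year * years / 1000.0


for resolution, rate in GB_PER_YEAR.items():
    print(f"{resolution}: {run_volume_tb(rate):.2f} TB per 100-year run")
```

The same function covers the capacity case on the next slide: ten T42 runs of 100 years each come to ten times the single-run figure.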

Page 6:

Capacity-related Improvements

Increased turnaround, model development, ensemble of runs

Increase by a factor of 10, linear data

Current T42 CCSM
– 7.5GB/yr, 100 years -> .75TB * 10 = 7.5TB

Page 7:

Capability-related Improvements

Spatial Resolution: T42 -> T85 -> T170
– Increase by factor of ~ 10-20, linear data

Temporal Resolution: Study diurnal cycle, 3 hour data
– Increase by factor of ~ 4, linear data

CCM3 at T170 (70km)

Page 8:

Capability-related Improvements

Quality: Improved boundary layer, clouds, convection, ocean physics, land model, river runoff, sea ice
– Increase by another factor of 2-3, data flat

Scope: Atmospheric chemistry (sulfates, ozone…), biogeochemistry (carbon cycle, ecosystem dynamics), Middle Atmosphere Model…
– Increase by another factor of 10+, linear data

Page 9:

Model Improvement Wishlist

Grand Total:

Increase compute by a Factor O(1000-10000)
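
One way to see where a total of this size could come from is to multiply the per-category factors quoted on the preceding slides. The slides do not state how the grand total was derived, so treating the factors as independent multipliers is an assumption, as is reading "10+" as exactly 10 and "data flat" as a data factor of 1.

```python
from math import prod

# (compute_low, compute_high, data_grows) per improvement, from the slides.
# "10+" is taken as 10; data_grows=False marks the "data flat" item.
FACTORS = {
    "capacity (ensembles, turnaround)": (10, 10, True),
    "spatial resolution": (10, 20, True),
    "temporal resolution": (4, 4, True),
    "quality (physics)": (2, 3, False),
    "scope (chemistry, biogeochemistry)": (10, 10, True),
}

compute_low = prod(low for low, _, _ in FACTORS.values())
compute_high = prod(high for _, high, _ in FACTORS.values())
data_low = prod(low for low, _, grows in FACTORS.values() if grows)
data_high = prod(high for _, high, grows in FACTORS.values() if grows)

print(f"compute: {compute_low}x to {compute_high}x")
print(f"data:    {data_low}x to {data_high}x")
```

With these inputs the compute product comes out in the same order of magnitude as the upper end of the slide's estimate; dropping the capacity factor would divide both bounds by 10.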

Page 10:

Longer-term Missions - Observation of Key Earth System Interactions

Terra, Aura, Aqua, Landsat 7

Exploratory - Explore Specific Earth System Processes and Parameters and Demonstrate Technologies

GRACE, PICASSO, Cloudsat, QuikScat, EO-1, ICEsat, Jason-1, SRTM, VCL

We Will Examine Practically Every Aspect of the Earth System from Space in This Decade

Triana

Courtesy of Tim Killeen, NCAR

Page 11:

ESG Scenario

End 2002: 1.2 million files comprising ~75TB of data at NCAR, ORNL, LANL, NERSC, and PCMDI

End 2007: As much as 3 PB (3,000 TB) of data (!)

Current practice is already broken; the future will be even worse if something isn’t done…
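
The jump from ~75 TB at the end of 2002 to as much as 3 PB at the end of 2007 can be turned into an implied growth rate. A rough sketch, assuming smooth compound growth over the five years (the slide gives only the two endpoints, so the constant yearly multiplier is an assumption):

```python
start_tb = 75.0    # end of 2002, from the slide
end_tb = 3000.0    # end of 2007 upper estimate (3 PB), from the slide
years = 2007 - 2002

total_growth = end_tb / start_tb                # overall multiplier
annual_growth = total_growth ** (1.0 / years)   # assumed constant per-year multiplier

print(f"overall growth: {total_growth:.0f}x")
print(f"implied annual growth: about {annual_growth:.2f}x per year")
```

Under that assumption the archive slightly more than doubles every year, which is the scale of change behind the "already broken" remark.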

Page 12:

ESG Scenario (cont.)

Data
– Different formats are converted to netCDF
– netCDF is not standardized to the CF model
– Different sites require knowledge of different methods of access

Metadata
– Most kept in online files separate from data and unsearchable unless one is “in the know”
– Some kept in people’s brains

Access control
– Manual
– Not formalized

Data requests
– Beginnings of a formal process (e.g., the PCMDI model)
– Beginnings of web portals
– Far too much done by hand
– Logging nearly non-existent

Page 13:

ESG: Challenges

Enabling the simulation and data management team

Enabling the core research community in analyzing and visualizing results

Enabling broad multidisciplinary communities to access simulation results

We need integrated scientific work environments that enable smooth WORKFLOW for knowledge development: computation, collaboration & collaboratories, data management, access, distribution, analysis, and visualization.

Page 14:

ESG: Strategies

Move data a minimal amount, keep it close to computational point of origin when possible
– Data access protocols, distributed analysis

When we must move data, do it fast and with a minimum amount of human intervention
– Storage Resource Management, fast networks

Keep track of what we have, particularly what’s on deep storage
– Metadata and Replica Catalogs

Harness a federation of sites, web portals
– Globus Toolkit -> The Earth System Grid -> The UltraDataGrid
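
The "keep track of what we have" strategy pairs a metadata catalog with a replica catalog. The sketch below illustrates only the general idea of a replica catalog: a mapping from a logical file name to the physical copies of that file, so a request can be served from the nearest copy and from disk before deep storage. It is not the Globus replica service API; the class, site labels, URLs, and storage tiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Replica:
    site: str   # hypothetical site label
    url: str    # hypothetical physical location
    tier: str   # "disk" or "tape" (deep storage)


@dataclass
class ReplicaCatalog:
    """Toy in-memory catalog: logical file name -> known physical replicas."""
    entries: dict[str, list[Replica]] = field(default_factory=dict)

    def register(self, logical_name: str, replica: Replica) -> None:
        self.entries.setdefault(logical_name, []).append(replica)

    def best_replica(self, logical_name: str, local_site: str) -> Replica:
        """Prefer copies at the local site; among equals, prefer disk over tape."""
        def cost(r: Replica) -> tuple[bool, bool]:
            return (r.site != local_site, r.tier != "disk")
        return min(self.entries[logical_name], key=cost)


catalog = ReplicaCatalog()
catalog.register("ccsm/run1/atm.nc", Replica("siteA", "hpss://siteA/archive/atm.nc", "tape"))
catalog.register("ccsm/run1/atm.nc", Replica("siteB", "gsiftp://siteB/cache/atm.nc", "disk"))
print(catalog.best_replica("ccsm/run1/atm.nc", local_site="siteB").url)
```

Ranking local copies first mirrors the "move data a minimal amount" bullet; a real deployment would also weigh network speed and staging time.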

Page 15:

Storage/Data Management

[Diagram labels: Client (Selection, Control, Monitoring); HRM; two Servers, each with an HRM and a Tera/Peta-scale Archive; Tools for reliable staging, transport, and replication]

Page 16:

HRM aka “DataMover”

Running well across DOE/HPSS systems

New component built that abstracts NCAR Mass Storage System

Defining next generation of requirements with climate production group

First “real” usage

“The bottom line is that it now works fine and is over 100 times faster than what I was doing before. As important as two orders of magnitude increase in throughput is, more importantly I can see a path that will essentially reduce my own time spent on file transfers to zero in the development of the climate model database” – Mike Wehner, LBNL

Page 17:

OPeNDAP

An Open Source Project for a Network Data Access Protocol

(originally DODS, the Distributed Oceanographic Data System)

Page 18:

OPeNDAP-g
– Transparency
– Performance
– Security
– Authorization
– (Processing)

Distributed Data Access Services

[Diagram labels, three access paths: Typical Application (Application, netCDF lib, local data); OPeNDAP via http (Application, OPeNDAP Client, OpenDAP Server, remote data); OPeNDAP via Grid (Distributed Application, ESG client, ESG Server with ESG + DODS, big remote data)]

Page 19:

ESG: NcML Core Schema

For XML encoding of metadata (and data) of any generic netCDF file

Objects: netCDF, dimension, variable, attribute

Beta version reference implementation as Java Library (http://www.scd.ucar.edu/vets/luca/netcdf/extract_metadata.htm)

[Schema diagram labels: netCDF (nc:netCDFType) with nc:dimension, nc:variable, nc:attribute; nc:VariableType with nc:attribute, nc:values]
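
To make the object list concrete, the following sketch builds a small XML document with the four object types named on the slide (netCDF, dimension, variable, attribute) plus a values element. It uses Python's standard xml.etree.ElementTree rather than the Java reference library mentioned on the slide, and the attribute names (name, length, type, shape, value), the sample variable, and the absence of a namespace are illustrative assumptions, not the published NcML schema.

```python
import xml.etree.ElementTree as ET

# Root element standing in for the netCDF file itself.
root = ET.Element("netcdf")

# Global attribute attached to the file (hypothetical content).
ET.SubElement(root, "attribute", name="title", value="hypothetical sample file")

# One dimension and one variable defined over it.
ET.SubElement(root, "dimension", name="time", length="3")
var = ET.SubElement(root, "variable", name="tas", type="float", shape="time")
ET.SubElement(var, "attribute", name="units", value="K")

# Optional inline data for the variable.
values = ET.SubElement(var, "values")
values.text = "280.1 281.4 279.9"

print(ET.tostring(root, encoding="unicode"))
```

Because the metadata is plain XML, it can be stored, searched, and transformed with generic XML tools without opening the binary netCDF file.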

Page 20:

[Class diagram. Legend: Class, AbstractClass, inheritance, association]

Object: [1] id

Activity: [0,1] name, [0,1] description, [0,1] rights, [0,n] date type=, [0,n] note, [0,n] participant role=, [0,n] reference uri=

Project: [0,n] topic type=, [0,1] funding

Simulation: [0,n] simulationInput type=, [0,n] simulationHardware

Dataset: [0,1] type, [0,1] conventions, [0,n] date type=, [0,n] format type= uri=, [0,1] timeCoverage, [0,1] spaceCoverage

Person: [0,1] firstName, [0,1] lastName, [0,1] contact

Institution: [0,1] name, [0,1] type, [0,1] contact

Service: [0,1] name, [0,1] description

Classes shown without attributes: Investigation, Ensemble, Campaign, Observation, Experiment, Analysis

Relationship labels: isA, isPartOf, hasParent, hasChild, hasSibling, generatedBy, worksFor, participant role=, serviceId
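
The class diagram reads naturally as a small object model. The sketch below renders part of it as Python dataclasses: cardinality [0,1] becomes an optional field and [0,n] becomes a list. The flattened diagram does not show which arrows connect which boxes, so the choices here (Activity as the base of Project and Simulation, a Simulation belonging to a Project, a Dataset pointing at the Activity that generated it) are one plausible reading, not a confirmed part of the schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MetadataObject:
    id: str                                    # [1] id


@dataclass
class Activity(MetadataObject):
    name: Optional[str] = None                 # [0,1] name
    description: Optional[str] = None          # [0,1] description
    notes: list[str] = field(default_factory=list)  # [0,n] note


@dataclass
class Project(Activity):
    funding: Optional[str] = None              # [0,1] funding


@dataclass
class Simulation(Activity):
    simulation_hardware: list[str] = field(default_factory=list)  # [0,n]
    part_of: Optional[Project] = None          # assumed isPartOf link


@dataclass
class Dataset(MetadataObject):
    conventions: Optional[str] = None          # [0,1] conventions
    time_coverage: Optional[str] = None        # [0,1] timeCoverage
    generated_by: Optional[Activity] = None    # assumed generatedBy link


project = Project(id="proj-1", name="hypothetical project")
run = Simulation(id="sim-1", name="hypothetical run", part_of=project)
data = Dataset(id="ds-1", conventions="CF", generated_by=run)
print(data.generated_by.part_of.name)
```

Following the links from a dataset back to its simulation and project is the kind of provenance query the association labels suggest.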

Page 21:

ESG Metadata Progress

Co-developed NcML with Unidata
– CF conventions in progress, almost done

Developed & evaluated a prototype metadata system

Finalized an initial schema for PCM/CCSM
– Address interoperability with federal standards and NASA/GCMD via the generation of DIF/FGDC/ISO
– Address interoperability with digital libraries via the creation of Dublin Core

Testing relational and native XML databases, and OGSA-DAI

Exploratory work for first-generation ontology

Authoring of discovery metadata in progress

Page 22:

ESG Web Portal Demonstration

Page 23:

ESG Topology (CAS 2003)

[Diagram labels. Sites: LBNL, ISI, LLNL, NCAR, ORNL, ANL. Components: ESG Web Portal (Tomcat/Struts); OGSA-DAI with MySQL RDBMS (query); four RLS instances (cross-update); HRM in front of MSS, HPSS (twice), and disk; disk cache; gridFTP servers and gridFTP links; MyProxy (authenticate); CAS; GRAM gatekeeper (submit, execute); LAS server (visualize)]

Page 24:

Collaborations & Relationships

CCSM Data Management Group

The Globus Project

Other SciDAC Projects: Climate, Security & Policy for Group Collaboration, Scientific Data Management ISIC, & High-performance DataGrid Toolkit

OPeNDAP/DODS (multi-agency)

NSF National Science Digital Libraries Program (UCAR & Unidata THREDDS Project)

U.K. e-Science and British Atmospheric Data Center

NOAA NOMADS and CEOS-grid

Earth Science Portal group (multi-agency, international)

Page 25:

Immediate Directions

Broaden usage of DataMover and refine

Continue building metadata catalogs

Revisit overall security model and consider simplified approaches

Redesign and implement user interface

Alpha version of OPeNDAP-g
– Test and evaluate with three client applications (ncview, CDAT, & NCL)

Develop automation for data publishing (GT3)

Deploy for IPCC runs

Page 26:

The Community Data Portal (CDP)

Provide a common portal to NCAR, UCAR, and university data

Provide cyberinfrastructure that dramatically lowers the cost of sharing data (there is large interest in this)

Directly couple to simulation and data analysis systems

Begin capturing rich metadata and catalog our scientific experiments for the world

MSS -> A petascale Mass Knowledge System

Federate internationally (ESG, THREDDS, U.K. e-Science, NOMADS, PRISM, GEON, etc.)

“The dataportal has changed my life…” Ben Kirtman, COLA

Page 27:

A Quick Tour of the CDP

[Image]‘ing Our Data

Page 28:

Community Data Portal Metadata Software

[Architecture diagram; components and edge labels as extracted]

– THREDDS catalogs, which reference ESG metadata, DC metadata, NcML metadata, and other metadata
– THREDDS catalog parser application (parses)
– Relational DB (MySQL): shreds XML doc into tables
– XML native DB (Xindice): stores full XML doc
– Search & Discovery web application: simple query (SQL); future advanced query (XPath, XQuery)
– Results: list of triplets (dataset id, metadata schema, metadata URL)
– XML viewer web application with schema-specific stylesheets (displays)
– THREDDS catalogs browser web application
– Remaining edge labels: links to, uses
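
The search path in this diagram (shred XML metadata into relational tables, run a simple SQL query, return triplets) can be sketched in a few lines. The example substitutes Python's built-in sqlite3 for MySQL so that it runs without a server, and the table layout, column names, sample record, and URLs are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical metadata record; the real catalogs use their own schemas.
RECORD = """
<dataset id="ds-001">
  <title>Sample surface temperature run</title>
  <metadata schema="NcML" url="http://example.org/meta/ds-001.ncml"/>
  <metadata schema="DC" url="http://example.org/meta/ds-001.dc.xml"/>
</dataset>
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata (dataset_id TEXT, title TEXT, schema_name TEXT, url TEXT)"
)

# "Shred" the XML document: one table row per metadata reference.
doc = ET.fromstring(RECORD)
for ref in doc.findall("metadata"):
    conn.execute(
        "INSERT INTO metadata VALUES (?, ?, ?, ?)",
        (doc.get("id"), doc.findtext("title"), ref.get("schema"), ref.get("url")),
    )


def search(term: str) -> list[tuple[str, str, str]]:
    """Simple SQL query returning (dataset id, metadata schema, metadata URL)."""
    rows = conn.execute(
        "SELECT dataset_id, schema_name, url FROM metadata "
        "WHERE title LIKE ? ORDER BY schema_name",
        (f"%{term}%",),
    )
    return rows.fetchall()


print(search("temperature"))
```

Each returned triplet points at a full XML document, which is where a native XML store and a stylesheet-driven viewer would take over.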

Page 29:

Data->Knowledge

Mass Storage System (1.3PB)

Petascale Knowledge Repository

Establish new paradigms for managing and accessing scientific data based on semantic organization.

Page 30:

Closing Thoughts

Building an environment for the long-term
– Difficult, expensive, and time-consuming
– Requires longer-term projects

Team-building is a critical process
– Collaboration technologies really help

Managing all the collaborations is a challenge
– But extremely valuable

Good progress, first real usage

Page 31:

Managing Expectations…

Page 32:

Links

Earth System Grid
– www.earthsystemgrid.org

Community Data Portal
– dataportal.ucar.edu

Page 33:

END