SCD Research Data For UCAR Data Management Working Group January 10, 2001 Steven Worley Scientific Computing Division Data Support Section

  • Slide 1
  • SCD Research Data For UCAR Data Management Working Group January 10, 2001 Steven Worley Scientific Computing Division Data Support Section
  • Slide 2
  • Outline: Four Categories of Data Service; User Profile; Data Content; Data Access
  • Slide 3
  • Four Categories of Data Service: (1) Archives directly from the MSS: accessible to all with NCAR computing accounts. (2) Web-accessible online data server: information interface for all data. (3) Individual requests: customized on a per-request basis. (4) Data preparation for large projects, e.g. reanalyses at ECMWF and NCEP.
  • Slide 4
  • User Profile, MSS User Groups
  • Slide 5
  • User profile, online data server: users based on network address domain, data for 1995-1998, ~20K unique addresses per year. Domain (% of total): .com 24; .edu 18; .gov 4; International 17; .mil 1; .net 16; No domain (IP) 24; .org 1.
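A breakdown like the one on this slide can be produced by tallying unique client addresses by their top-level domain. The sketch below is only an illustration: the function names are hypothetical, and it assumes that "International" means any host whose last label is not one of the six generic domains listed on the slide, and that "No domain (IP)" means a bare IP address that did not resolve to a name.

```python
from collections import Counter
import ipaddress

# Generic top-level domains named on the slide.
GENERIC_TLDS = {"com", "edu", "gov", "mil", "net", "org"}


def normalize(host: str) -> str:
    """Lower-case a host string and drop surrounding space and a trailing dot."""
    return host.strip().lower().rstrip(".")


def classify_host(host: str) -> str:
    """Return the reporting category for one client host string."""
    host = normalize(host)
    try:
        ipaddress.ip_address(host)  # succeeds only for a bare IPv4/IPv6 address
        return "No domain (IP)"
    except ValueError:
        pass
    tld = host.rsplit(".", 1)[-1]
    if tld in GENERIC_TLDS:
        return "." + tld
    # Assumption: anything else (e.g. a country-code domain) is "International".
    return "International"


def domain_shares(hosts):
    """Return {category: rounded percent of unique addresses}."""
    unique = {normalize(h) for h in hosts}
    counts = Counter(classify_host(h) for h in unique)
    total = sum(counts.values())
    return {cat: round(100 * n / total) for cat, n in counts.items()}
```

Because each share is rounded independently, the percentages returned by `domain_shares` need not add up to exactly 100.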
  • Slide 6
  • User profile, individual requests: requests excluding CD-ROMs, based on 1998-1999 data. 28% U.S. Univ. (179 of 638); 11% Foreign Univ. (69); 27% Foreign Non-Univ. (171); 34% U.S. Gov. and Commercial (219). (Remarkably, some foreign and government sources find it desirable to acquire their own data from SCD/DSS.)
  • Slide 7
  • User profile, all users: all users by year, excluding the online category
  • Slide 8
  • User profile, finding the data: peer and colleague recommendations; acknowledgements in publications; WWW searches and perusing.
  • Slide 9
  • Quick look at DSS Information Interface: website, dss.ucar.edu; top-level information and dataset groupings; oceanographic datasets by category.
  • Slide 10
  • Important improvements for the Information Interface: more top-level documents to guide users to the best datasets. For improved searches: carefully worded .html..; pages with introductory text that clearly defines the dataset, with keywords that promote discovery; .html.. (note: not all search engines boost ranking based on these).
  • Slide 11
  • User profile, compliments: fast service, requests receive prompt action. Staff with scientific knowledge offer assistance and guidance. A flexible system can adapt to meet users' requirements.
  • Slide 12
  • What makes this system work: the data records and files remain in simple structures. This way the archive should always be accessible to programs written in low-level languages. The data can survive evolutions in operating systems and software; 50 years is not too much. Programs can be written that allow fast and efficient manipulation of large collections. Internal checksum keys can be strategically placed to ensure data integrity at any level.
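To make the idea of simple record structures with internal checksum keys concrete, here is a minimal sketch. The slides do not say which checksum or record layout is used, so everything here is an assumption for illustration: each record is an 8-byte big-endian header (payload length, then CRC-32 of the payload) followed by the raw payload, and the function names are hypothetical.

```python
import struct
import zlib

# Assumed layout: 4-byte payload length + 4-byte CRC-32, both big-endian.
HEADER = struct.Struct(">II")


def pack_record(payload: bytes) -> bytes:
    """Prefix a payload with its length and CRC-32 checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def read_records(blob: bytes):
    """Yield (payload, ok) for each record; ok is False on a checksum mismatch."""
    offset = 0
    while offset < len(blob):
        if len(blob) - offset < HEADER.size:
            raise ValueError("truncated record header")
        length, expected = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        payload = blob[offset:offset + length]
        offset += length
        ok = len(payload) == length and zlib.crc32(payload) == expected
        yield payload, ok
```

A layout this plain can be read with a few lines in almost any language, and because the checksum sits at the record level, a damaged record can be identified without discarding the rest of the file.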
  • Slide 13
  • User profile, complaints: not all the data is online, even though this is quite impractical (12+ TB). Not all the data is in their favorite format: IDL, HDF, netCDF, GrIB, ASCII, GIS, binary, .xls, Matlab, etc. "Can I just get the piece I need?" "Do you mean I need to know some FORTRAN or C?"
  • Slide 14
  • User Profile, skill set: the best skill set for our users includes knowing some FORTRAN and/or C. Trend: more and more people are requesting data in application-environment-specific formats. Will the next-generation scientist know a basic computing language?
  • Slide 15
  • Data Content, size and characteristics: a veritable smorgasbord of data. Overall size, 12+ TB; 500+ distinct datasets. Many historical observations from the atmosphere and ocean. Many operational analyses and reanalyses. Dataset sizes range from < 1 MB to several TB. Many original formats; GrIB is dominant in our analysis and reanalysis datasets.
  • Slide 16
  • Data Content, metadata management: metadata is managed primarily on our online information server. Each dataset has a WWW page. All dataset WWW pages are generated automatically. Corrections, additions, and changes are made to text files maintained under a Unix change and control system. Advantage: a history of all changes and data files associated with the dataset is kept, and the WWW pages are always current.
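The workflow on this slide (plain-text metadata files as the source of record, web pages generated from them) can be sketched as follows. The slides do not describe the actual file format or page template, so this assumes a hypothetical "key: value" text format and a bare HTML table; the field names and function names are illustrative only.

```python
import html


def parse_metadata(text: str) -> dict:
    """Parse 'key: value' lines; blank lines and '#' comments are skipped."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")  # split on the first colon only
        fields[key.strip().lower()] = value.strip()
    return fields


def render_page(fields: dict) -> str:
    """Build a simple HTML page for one dataset from its parsed metadata."""
    title = html.escape(fields.get("title", "Untitled dataset"))
    rows = "\n".join(
        f"<tr><th>{html.escape(k)}</th><td>{html.escape(v)}</td></tr>"
        for k, v in sorted(fields.items())
        if k != "title"
    )
    return (
        f"<html><head><title>{title}</title></head>\n"
        f"<body><h1>{title}</h1>\n<table>\n{rows}\n</table></body></html>"
    )
```

Regenerating every page from the text files after each committed change is what keeps the pages current, while the change-control system retains the history.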
  • Slide 17
  • Data Content, metadata management: we have considerable amounts of hard-copy references and metadata. We are making scanned images of these now.
  • Slide 18
  • Data Content, long-term archive and security: small datasets and irreplaceable observations and analyses have two copies on the MSS, although we cannot guarantee they reside on separate cartridges. Files are write-password protected, which prevents accidental overwrites. We have been fortunate to have a very reliable MSS, and our success will continue to rely on it in the future.
  • Slide 19
  • Data Content, long-term archive and security, areas of concern: we don't have adequate offsite backups. At a minimum, critical observations should be protected from a catastrophe at the Mesa Lab. In the event of the loss of single-copy large datasets, we rely on other centers for replacement. This needs to be discussed more nationally. Redistribution may have restrictions or be costly.
  • Slide 20
  • Data Content, long-term archive and security, areas of concern, continued: we must always remain on guard so that important data are not lost due to short-sighted policy decisions. We must participate in national and international projects so that the archive content is continually refreshed with the most scientifically important data, at low cost.
  • Slide 21
  • Data Access, annual summary
  • Slide 22
  • Data Access, aids to access: we maintain FORTRAN code to read all data files, sometimes for many platforms (Unix, PC). The MSS file location is defined for all datasets and is available online. Staff specialists are assigned and identified for each dataset.
  • Slide 23
  • Data Access, most frequent: NCEP/NCAR Global Atmospheric Reanalysis, 2.6 TB. How? MSS; WWW (monthly means); CD-ROMs; FTP; various tape media (large capacity).
  • Slide 24
  • Data Access, largest barriers: discovering what is available; gaining access to the MSS collection (when users don't have a computing account); not having experience with low-level languages, e.g. FORTRAN and C/C++.
  • Slide 25
  • Data Access, product development: yes we do, and we feel it is very important! Why? We can QC the data and identify problems early. We can reorganize data into logical collections or create popular subsets. It reduces the volume of large collections to a manageable size for users. It saves many users extra work.
  • Slide 26
  • Data Access, improvements for scientific advancement: minimize the barriers that inhibit discovery (the metadata problem). Supply the data in the users' favorite format, or provide tools that can convert the data where it is practical and efficient. Place more data, and valuable higher-level data products, online.
  • Slide 27
  • END