Climate Data Records and Science Data Stewardship: Playing for Keeps Bruce R. Barkstrom National Climatic Data Center NOAA

Climate Data Records and Science Data Stewardship:

Playing for Keeps

Bruce R. Barkstrom

National Climatic Data Center, NOAA

Outline

• What are CDRs?
  – An Example
  – General Characteristics
• What’s Involved in SDS
  – Assuring that the data and context are valuable to the future
  – Making sure data are ready to preserve
  – Making sure data and context will be useful
  – Making sure data and context will survive
  – Being cost effective

An Example CDR – Solar Constant

• Original data cover several decades
• Multiple data sources
• Work needed:
  – Physical model of causes of differences
  – Development of homogeneous data set versions
  – Estimation of detectable variability and trends

CDR Characteristics

• Covers long time period (decades or more if possible)
• Likely to have multiple data sources
• Every attempt to deal with errors on a physical basis
• Every attempt to make errors homogeneous over record
  – Software must have full configuration management
  – Input data sources should be as homogeneous as possible
• Intent is to provide
  – Quantified variability: Cumulative Distribution Functions (CDFs) of parameter variations, not only for global averages, but also regional values and extreme value statistics
  – Quantification of Change Detection: Ability to test observed CDFs against expected CDFs of potential changes
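
One way to picture testing an observed CDF against an expected CDF is the largest vertical gap between the two empirical CDFs (the two-sample Kolmogorov-Smirnov distance). The slides do not name a specific test, so this choice is an assumption, and the sample values and function names below are hypothetical.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F(x): the fraction of sample values less than or equal to x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        return bisect_right(ordered, x) / n

    return F


def ks_distance(observed, expected):
    """Largest vertical gap between the two empirical CDFs."""
    f_obs, f_exp = ecdf(observed), ecdf(expected)
    # The gap can only change at a data point, so checking those is enough.
    points = sorted(set(observed) | set(expected))
    return max(abs(f_obs(x) - f_exp(x)) for x in points)


# Hypothetical parameter values, for illustration only.
baseline = [0.1, 0.2, 0.3, 0.4, 0.5]
shifted = [0.3, 0.4, 0.5, 0.6, 0.7]

print(ks_distance(baseline, baseline))  # identical samples give 0.0
print(ks_distance(baseline, shifted))   # a shift shows up as a larger gap
```

A larger distance indicates a change that is easier to detect; turning the distance into a significance level depends on the sample sizes and is left out of this sketch.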

How Do We Assess the Value of a CDR?

• 3 Approaches:
  – Cost of Acquiring CDR
  – Cost of Reconstruction – if possible
    Need to have original data, need to assemble hardware and software, need to run (maybe 2 or 3 million jobs)
  – Present Value of Future Use
    Economists discount future benefits at 7%
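
The present-value approach can be sketched with the standard discounting formula, in which a benefit received t years from now is divided by (1 + r)**t. The 7% rate comes from the slide; the assumption of a constant annual benefit and the function names are illustrative only.

```python
def discount_factor(year, rate=0.07):
    """Value today of one unit of benefit received `year` years from now."""
    return 1.0 / (1.0 + rate) ** year


def present_value(annual_benefit, years, rate=0.07):
    """Discounted sum of a constant benefit received at the end of each year."""
    return sum(annual_benefit * discount_factor(t, rate)
               for t in range(1, years + 1))


# At 7%, a benefit ten years out is worth roughly half as much today.
print(round(discount_factor(10), 3))

# Hypothetical: one unit of benefit per year over a 50-year record.
print(round(present_value(1.0, 50), 2))
```

Because the discount factor shrinks geometrically, benefits realized many decades from now contribute very little to the total, which is part of why long-term records are hard to justify on this basis alone.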

Valuation is Tough

• OMB Question: Why do we need more than $2B/year for climate?
• CCSP and CEOS both have had trouble prioritizing
• Probably two scales of value
  – Scientific “Value” – represented by “Bretherton Issues”
  – Societal Benefit – represented by reduction in damage, lives saved, new industries created
• Quantifying to OMB’s satisfaction is difficult
• Question 1: Can CI help with justifying priorities?

Good Archival Practice

• ISO Standard for “What an Archive Should Do for Long-Term Preservation”
  – OAIS Reference Model
• Recommendation:
  – Prepare a Submission Agreement between an Archive and a Data Provider
  – Evaluate condition and completeness of candidate data and metadata
  – Plan work required to repair deficiencies
• SDS Preferred Approach – use “Maturity Model”

Maturity Model

• Evaluate Maturity 3 ways:
  – Scientific Maturity
  – Preservation Maturity
  – Societal Benefit
• For Each Axis:
  – Reduce evaluation to non-dimensional scaling of attributes
  – Ask for evaluation from experts
• Question 2: Can CI help with evaluation of maturity?

Work Required to Produce CDRs

• Evaluation of Available Record for Gaps and Understandability
  – Gaps
  – Documentation
• Evaluation of Candidate CDR Uncertainties
  – Error Sources Considered
  – Calibration and Validation
• Evaluation of Record Repair Work
  – Gaps
  – Recalibration
  – Uncertainty Estimation

Roles of Satellite Data and In-Situ Data

• In-situ Data Complement Satellite Data
  – Satellites for coverage – although challenge is getting adequate length of record
  – In-situ for calibration and validation
• For Data Stewardship
  – Need preservation of context: cal-val data preservation, source code, documentation of procedures, metadata
  – Results of intercomparisons should have measurable improvement in uncertainty

Some Thoughts on Quantifying Impact of In-Situ Data

• Errors in satellite measurements
  – Estimates should be based on physical causes
  – Stewardship needs way of making publicly available – and accommodating changes in assessments by community over time
  – Statistical in nature
  – Delimited by time interval and spatial region
  – Most rigorously specified as CDF of error
  – Might be simply specified in terms of std dev of error about “average” measured value
• Cal-Val efforts should improve “error bars”
  – Stringency: ratio of error dispersion about mean before cal-val to dispersion after
    • 1 – no improvement; 2 to 5 – moderate improvement; >10 – really stringent requirement on cal-val
    • Related to number of independent samples in cal-val set
  – Plausibility: significance of improvement
    • Unsuspicious – p of difference ~ 20%; Somewhat convincing – p ~ 5%; Fairly confident – p ~ 1%
• Number of iterations in reprocessing
  – Inversely proportional to experience
  – Increases with required stringency and plausibility
• Question 3: Can CI help evaluate proposed In-Situ Validation Data Sets for Error Reductions, Stringency, and Plausibility?
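
The stringency measure can be sketched as a ratio of standard deviations. This sketch takes the ratio as dispersion before cal-val over dispersion after, so that it matches the slide's scale in which 1 means no improvement and larger numbers mean more. The error samples and function names are hypothetical, and the slide gives no label for ratios between the named bands.

```python
from statistics import stdev


def stringency(errors_before, errors_after):
    """Ratio of error dispersion before cal-val to dispersion after."""
    return stdev(errors_before) / stdev(errors_after)


def describe(ratio):
    """Map a stringency ratio onto the bands named on the slide."""
    if ratio <= 1.0:
        return "no improvement"
    if 2.0 <= ratio <= 5.0:
        return "moderate improvement"
    if ratio > 10.0:
        return "really stringent"
    return "between the bands named on the slide"


# Hypothetical measurement errors about the mean, for illustration only.
errors_before = [-4, -2, 0, 2, 4]
errors_after = [-2, -1, 0, 1, 2]

ratio = stringency(errors_before, errors_after)
print(round(ratio, 2), describe(ratio))
```

Plausibility, the significance of the improvement, would need a statistical test on top of this ratio and is not covered by the sketch.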

The Odds for Long-Term Preservation

• Preservation inclines one toward pessimism
  – If p is annual probability of loss and
  – N is number of years to survive
  – Probability of survival is (1 – p)**N
  – To have 99% probability of survival for 200 years, requires p ≈ 5E-05
• Standard approach to reducing risk
  – Assess mechanisms of loss
  – Quantify annual probability of loss and probable value of loss [note return to valuation issue]
  – Find affordable risk mitigation approach
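
The survival arithmetic can be checked in a few lines. The sketch reads p as the annual probability of loss, which is what the formula (1 – p)**N implies, and assumes each year is independent. The function names are illustrative.

```python
def survival_probability(p_loss, years):
    """Probability of surviving `years` years with annual loss probability p_loss."""
    return (1.0 - p_loss) ** years


def max_annual_loss(target_survival, years):
    """Largest annual loss probability that still meets the survival target."""
    return 1.0 - target_survival ** (1.0 / years)


# 99% probability of survival over 200 years, as on the slide.
p = max_annual_loss(0.99, 200)
print(f"required annual loss probability: {p:.1e}")

# Round trip: plugging p back in recovers the 99% target.
print(round(survival_probability(p, 200), 4))
```

For comparison, an annual loss probability of 1% leaves only about a 13% chance of surviving the same 200 years.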

Science Data Stewardship: What Are the Odds?

• Important Risks
  – IT Security Incidents
    • 10% per year probability; maybe 10% of collection at risk of corruption (p = 1%/yr – need dispersion across systems)
  – Operator Error
    • 10% per year probability; loss depends on time operators work and degree of automation (p = 1%/yr – need QA)
  – Hardware or Software Error
    • 5% per year probability; loss as in operator error
  – Hardware or Software Obsolescence
    • 100% probability of loss in 5 to 10 years (p = 20%/yr)
    • Suggests treating expenses of hardware and software replacement as “insurance expenses” – not assets
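
The per-risk figures can be combined in a rough way. The sketch below assumes that each p on the slide is the incident probability times the fraction of the collection at risk (as in the IT security line, 10% times 10%), and that the risks act independently. Both are assumptions rather than statements from the slides, and the hardware or software error line is left out because no p is given for it.

```python
def annual_loss(p_incident, fraction_at_risk):
    """Expected annual loss: incident probability times fraction at risk."""
    return p_incident * fraction_at_risk


def combined_annual_loss(losses):
    """Annual loss from several independent risks acting together."""
    surviving = 1.0
    for p in losses:
        surviving *= (1.0 - p)
    return 1.0 - surviving


# Annual p values quoted on the slide.
risks = {
    "IT security incidents": annual_loss(0.10, 0.10),  # 1%/yr
    "operator error": 0.01,                            # 1%/yr
    "obsolescence": 0.20,                              # 20%/yr
}

total = combined_annual_loss(risks.values())
print(f"combined annual loss: {total:.1%}")
```

Under these assumptions obsolescence dominates the total, which is consistent with the slide's suggestion to treat hardware and software replacement as an insurance expense.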

Science Data Stewardship: How Do We Improve the Odds?

• SDS will require several new things:
  – Making the history and details of data provenance public (anything proprietary dies)
  – Capturing now-tacit knowledge before it disappears (knowledge not captured dies when the knower retires, gets sick, or dies)
  – Creating methods of tracing the evolution of data, metadata, and assessments of same
• Expectation: SDS grants program provides avenue for bringing in ideas that
  – improve information survivability
  – reduce cost of archival
  – make data and context more useful for those that come after
• If we don’t succeed, we’ve all been publishing in The Journal of Irreproducible Results