Repositories collect lots of technical metadata, but lack tools to use it to better understand the objects in their care, and to apply it precisely in management and operations.

Recent years have seen the development of a number of metadata tools designed to identify and characterize digital objects in preservation workflows: JHOVE, Metadata Extraction Tool, DROID, and most recently FITS. Each is available to the digital preservation community, and the community makes good use of them.

Repositories dutifully collect and store the technical and structural metadata exposed and output by these tools, yet typically repositories have limited or no means to analyze and evaluate object characterization data in ways that would facilitate more effective management of the objects under their care.

According to a recent Planets survey report, fewer than a third of organizations reported that they have “complete control over the formats that they will accept and enter into their archives.” [1] Repository managers have concerns about the risks associated with file formats, and about file format obsolescence in general. To support preservation services for content considered to be encoded in “risky” formats, some repositories are developing policies and profiles that reflect their local concerns and operational contexts. [2] They seek tools to assess the technical metadata gathered and stored in routine repository operations against those policies, in order to make sense of it on local terms and inform decisions such as:

- accept / reject
- determine level of risk
- assign level of service
- take action now / later

Recent efforts to develop and apply assessment methodologies in digital object workflows and repository operations include: 

- AONS II (Automated Obsolescence Notification System), National Library of Australia and APSR [3]
- CIV (Configurable Image Validator), Library of Congress
- Institutional Technology Profiles, National Library of New Zealand [4]

In JHOVE2, a next-generation characterization tool currently in development at California Digital Library, Portico, and Stanford University, the team has designed an approach to facilitate policy-based assessment as an object is processed. The tool produces characterization data through a series of identification, feature extraction, validation, and assessment processes. It tells you what you have, as the starting point for iterative preservation planning and action.

It is possible to assess the properties of an object against a set of “rules” configured by the user. Using logical expressions as its terms, a rule is an “assertion” about prior characterization properties. An assertion may be concerned with:

- the presence or absence of a property;
- constraints on property values;
- combinations of properties or values.

In assessment, the evaluation of the assertion results in a new characterization property. In this sense, the process generates custom metadata that has significance in the context in which the object is being managed.
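The idea that an assertion yields a new characterization property can be sketched in a few lines of Python. This is an illustration only, not the JHOVE2 implementation (which is still in design): the `assess` helper, the property name `MeetsLocalSpec`, and the sample values are all hypothetical.

```python
from typing import Any, Callable, Dict

Properties = Dict[str, Any]


def assess(props: Properties, name: str,
           assertion: Callable[[Properties], bool]) -> Properties:
    """Return a copy of props with the assertion's outcome stored under `name`."""
    assessed = dict(props)          # leave the prior characterization untouched
    assessed[name] = assertion(props)
    return assessed


# One assertion of each kind described above (all names are hypothetical).
def has_bit_depth(p: Properties) -> bool:      # presence or absence
    return "BitDepth" in p


def deep_enough(p: Properties) -> bool:        # constraint on a value
    return p.get("BitDepth", 0) >= 24


def valid_and_deep(p: Properties) -> bool:     # combination of properties
    return p.get("isValid") is True and deep_enough(p)


wave = {"isValid": True, "BitDepth": 16}
assessed = assess(wave, "MeetsLocalSpec", valid_and_deep)
```

Here `assessed` carries every original property plus the locally meaningful `MeetsLocalSpec`, which is the "custom metadata" the text describes.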

The basic formation of a rule is shown below. The user configures the property and value to test, and selects an evaluatory phrase relating the two to form the complete assertion.

Technical implementation of assessment within the JHOVE2 framework is in design; prototyping will begin soon. A leading requirement is that rule configuration is simple. Non-technical staff and technical staff alike must be able to easily configure rules and “run an assessment”. The JHOVE2 release in 2010 will include a small selection of sample rules and a thorough tutorial.

Assessment with JHOVE2 has natural applications in ingest, migration, publishing and digitization workflows. A clean, well-defined, open API will be available in order to extend it to build tools capable of more complex analyses, such as a weighted scoring system or matching of technology profiles.
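As one illustration of the "more complex analyses" mentioned here, a weighted scoring system could be layered over individual assertion results. The sketch below is a guess at what such an extension might look like; it does not use any JHOVE2 API, and the function name, weights, and results are invented for the example.

```python
from typing import Dict


def weighted_risk_score(results: Dict[str, bool],
                        weights: Dict[str, float]) -> float:
    """Share of the total weight carried by failed assertions.

    0.0 means every weighted assertion held; 1.0 means all of them failed.
    An assertion with no recorded result is treated as failed (an assumption).
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    failed = sum(w for name, w in weights.items()
                 if not results.get(name, False))
    return failed / total


# Hypothetical assertion outcomes and locally chosen weights.
results = {"isValid": True, "BitDepth": False, "SamplingFrequency": False}
weights = {"isValid": 2.0, "BitDepth": 1.0, "SamplingFrequency": 1.0}
score = weighted_risk_score(results, weights)
```

A repository could then map score ranges to levels of service, which keeps the policy (the weights and thresholds) separate from the characterization data.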

JHOVE2 can be integrated with other identification tools as well as format and software registries to form robust policy engines and other rules-based systems. Such systems have great potential in supporting and enabling digital preservation activities and services both at the local level and across the community.


About the JHOVE2 Project

JHOVE2 is a collaboration by the California Digital Library, Portico, and Stanford University Libraries with funding from the Library of Congress’ National Digital Information Infrastructure and Preservation Program. The two-year project will conclude, and the open source tool will be released, in September 2010.

* The JHOVE2 Team is …

CDL: Stephen Abrams, Patricia Cruse, John Kunze, Marisa Strong, Perry Willett

Portico: John Meyer, Sheila Morrissey, Evan Owens

Stanford: Richard Anderson, Tom Cramer, Hannah Frost

with Walter Henry, Nancy Hoebelheinrich, Keith Johnson, Justin Littman

http://confluence.ucop.edu/display/JHOVE2Info/


Assertion: <property>, <evaluatory phrase>, <value>

Evaluatory phrases: Is Equal To, Is Not Equal To, Is Greater Than, Is Less Than, Contains, Does Not Contain

Responses: Response If True, Response If False

This response constitutes the customizable metadata that is available for subsequent processing or analysis.

Rules can be executed as atoms, or chained together to form compound statements for more complex assessments.
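A rule of this shape can be modeled roughly as follows. This is a minimal sketch, not the JHOVE2 design: the `Assertion` and `Rule` classes are hypothetical, and two behaviors are assumptions made for the example (a missing property makes an assertion false, and chained assertions are combined with a logical AND).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# The six evaluatory phrases from the rule template.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "Is Equal To": lambda actual, value: actual == value,
    "Is Not Equal To": lambda actual, value: actual != value,
    "Is Greater Than": lambda actual, value: actual > value,
    "Is Less Than": lambda actual, value: actual < value,
    "Contains": lambda actual, value: value in actual,
    "Does Not Contain": lambda actual, value: value not in actual,
}


@dataclass
class Assertion:
    prop: str      # name of a prior characterization property
    phrase: str    # one of the keys of OPERATORS
    value: Any     # value to test against

    def holds(self, props: Dict[str, Any]) -> bool:
        # Assumption: an assertion about a missing property is false.
        if self.prop not in props:
            return False
        return OPERATORS[self.phrase](props[self.prop], self.value)


@dataclass
class Rule:
    assertions: List[Assertion]   # one atom, or several chained together
    if_true: str                  # Response If True
    if_false: str                 # Response If False

    def evaluate(self, props: Dict[str, Any]) -> str:
        # Assumption: chained assertions must all hold (logical AND).
        ok = all(a.holds(props) for a in self.assertions)
        return self.if_true if ok else self.if_false


# An atomic rule, using the configuration from example (2) further on.
pdf_rule = Rule(
    [Assertion("Message [Error]", "Contains", "Malformed dictionary")],
    if_true="At Risk",
    if_false="No Risk",
)
pdf_props = {"Message [Error]": "Malformed dictionary"}  # hypothetical data
response = pdf_rule.evaluate(pdf_props)
```

The string returned by `evaluate` is the customizable response that later processing can act on.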

(1) TIFF with nonaligned byte offset

Rule Configuration
Assertion: Message [Information], Contains, Non-wordAlignedOffset
Response If True: Acceptable
Response If False: Acceptable

Output
Assessment:
Assertion: Message [Information], Contains, Non-wordAlignedOffset
Result: True
Response If True: Acceptable

Assessing Digital Objects with JHOVE2

Hannah Frost and Team*

Stanford University Libraries and Academic Information Resources

In addition, the user provides two responses for each rule: one to report if the assertion is true, and one to report if it is false.

(2) PDF with malformed dictionary

Rule Configuration
Assertion: Message [Error], Contains, Malformed dictionary
Response If True: At Risk
Response If False: No Risk

Output
Assessment:
Assertion: Message [Error], Contains, Malformed dictionary
Result: True
Response If True: At Risk

(3) WAVE does not meet encoding specification

Rule Configuration
Assertion1: isValid, isEqualTo, True
Assertion2: BitDepth, isEqualTo, 24
Assertion3: SamplingFrequency, isEqualTo, 96000
Response If True: Accept
Response If False: Reject

Output
Assessment:
Assertion1: isValid, isEqualTo, True
Assertion2: BitDepth, isEqualTo, 24
Assertion3: SamplingFrequency, isEqualTo, 96000
Result: False
Response If False: Reject
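Example (3) chains three assertions into one compound rule. The sketch below shows one way such a rule could be evaluated so that the report records which assertion failed. It is illustrative only: the dictionary layout, the `run_assessment` function, and the sample property values (a valid 16-bit, 44.1 kHz file) are assumptions, and the assertions are combined with a logical AND.

```python
import operator

# Compound rule from example (3): every assertion must hold to accept.
wave_rule = {
    "assertions": [
        ("isValid", operator.eq, True),
        ("BitDepth", operator.eq, 24),
        ("SamplingFrequency", operator.eq, 96000),
    ],
    "if_true": "Accept",
    "if_false": "Reject",
}


def run_assessment(rule, props):
    """Evaluate each assertion, then combine the outcomes with a logical AND."""
    outcomes = [
        # A property that is absent counts as a failed assertion (assumption).
        (name, name in props and test(props[name], expected))
        for name, test, expected in rule["assertions"]
    ]
    result = all(ok for _, ok in outcomes)
    return {
        "outcomes": outcomes,
        "result": result,
        "response": rule["if_true"] if result else rule["if_false"],
    }


# Hypothetical characterization data for a file that misses the specification.
wave_props = {"isValid": True, "BitDepth": 16, "SamplingFrequency": 44100}
report = run_assessment(wave_rule, wave_props)
```

Keeping the per-assertion outcomes alongside the overall response lets a curator see why a file was rejected, not just that it was.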

Bibliography

1. Planets (2009). Survey Analysis Report, IST-2006-033789, DT11-D1. http://www.planets-project.eu/market-survey/reports/docs/Planets_DT11-D1_SurveyReport.pdf

2. Rog, J. and van Wijk, C. (2008). Evaluating File Formats for Long-term Preservation. National Library of the Netherlands; The Hague, The Netherlands. http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/KB_file_format_evaluation_method_27022008.pdf

3. Pearson, D. and Webb, C. (2008). Defining File Format Obsolescence: A Risky Journey. International Journal of Digital Curation. Vol 3: No 1. http://www.ijdc.net/index.php/ijdc/article/view/76

4. De Vorsey, K. and McKinney, P. (2009). One Man’s Obsoleteness is Another Man’s Innovation: A Risk Analysis Methodology for Digital Collections. Presented at Archiving 2009, Arlington, Virginia, May 2009.
