Working with metadata in digital archives
Erpanet Metadata in Digital Preservation
Marburg, 3-5 September 2003
Bill [email protected]
Tessella Support Services plc
3 Vineyard Chambers
Abingdon OX14 3PX
United Kingdom
www.tessella.com
Metadata functions
Collect
Store
Import
Search
Export
View
Edit
Collect metadata (1)
Some must be manual – assist user, prevent mistakes
Avoid duplication – record hierarchies
automation in user environment (business process, workflow etc.)
automatic analysis of file properties
processing history (virus checking results etc.)
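
The "automatic analysis of file properties" item can be sketched as below. This is a minimal illustration, not the tooling used by any archive named in these slides; the function name and the choice of properties (size, modification time, SHA-256 checksum) are assumptions.

```python
import hashlib
import os
from datetime import datetime, timezone


def collect_file_properties(path):
    """Return metadata that can be read from a file without user input.

    Hypothetical helper: the property names are illustrative only.
    """
    stat = os.stat(path)
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": stat.st_size,
        "modified_utc": modified.isoformat(),
        "sha256": digest.hexdigest(),
    }
```

Values gathered this way need no typing by the user, which is the point of the slide: manual entry is kept for the elements that cannot be derived.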
Collect metadata (2)
UK National Archives Digital Archive – Stellent “OutsideIn”
  analyses file to determine type
  could also form part of approach to extract metadata from content
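
To show what "analyses file to determine type" means in the simplest case, here is a sketch that checks a file's leading bytes against a small signature table. It is a much simpler stand-in and says nothing about how Stellent "OutsideIn" works internally; the table and function name are assumptions.

```python
# Hypothetical signature table: a few well-known leading byte sequences.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify_format(data):
    """Guess a format label from the first bytes of a file's content."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    # The file extension is deliberately ignored: it may be wrong or missing.
    return "unknown"
```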
Collect metadata (3)
Pfizer Central Electronic Archive
Small metadata set
Automatic collection of metadata
Software agents on user servers
Possible to do more
Improve ease of use
Improve accuracy
Pfizer aiming to simplify provenance metadata
Import metadata (1)
Transfer format – XML
link metadata to files during transfer
virus checking, file format analysis etc.
Maintain loose coupling between components of system – agreed interfaces
Import metadata (2)
Efficiency – large transfers
XML can be expensive to process
  speed
  memory – DOM can be 20 times larger than XML file
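
One common way to limit the memory cost of a large transfer file is to stream it instead of building a full DOM. The sketch below uses `xml.etree.ElementTree.iterparse` from the Python standard library; the element names `Record` and `Title` are assumptions, not taken from any schema in these slides.

```python
import xml.etree.ElementTree as ET


def iter_record_titles(source):
    """Yield the title of each record while parsing incrementally.

    `source` is a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Record":
            # The whole <Record> subtree is available at its "end" event.
            yield elem.findtext("Title")
            # Discard the record's children and text once it has been handled.
            elem.clear()
```

Each record is processed and cleared before the next one is read, so memory use depends mainly on the largest single record and not on the size of the whole transfer.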
Storage - requirements
don’t lose it!
maintain links between metadata, records and files
find what you are looking for
retrieve
Storage approaches
encapsulation vs. ease of access
volume of data
speed of searching vs. speed of import/export
typically metadata in database and files on file server
The National Archives (UK) Digital Archive approach
Relational database for metadata, file server for computer files
Metadata stored as XML documents in database
A few key elements stored in tables and indexed (unique identifier, PROCAT reference)
Links between records, files, accessions, metadata managed in database
Subset of metadata identified as searchable – values extracted into text based index
File contents not currently searchable
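
The storage pattern described above (whole XML document kept in the database, a few key elements in indexed columns, a searchable subset extracted into an index) can be sketched with SQLite. This is an illustration of the pattern only; the table names, column names and the `SEARCHABLE` subset are assumptions and do not describe the actual TNA schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical subset of elements treated as searchable.
SEARCHABLE = ("Title", "Creator")


def create_store():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE record (
            identifier    TEXT PRIMARY KEY,  -- key element, indexed
            catalogue_ref TEXT,              -- key element, indexed below
            metadata_xml  TEXT NOT NULL      -- full metadata kept as XML
        );
        CREATE INDEX idx_record_ref ON record (catalogue_ref);
        CREATE TABLE search_index (
            identifier TEXT REFERENCES record (identifier),
            element    TEXT,
            value      TEXT
        );
        CREATE INDEX idx_search_value ON search_index (value);
    """)
    return conn


def store_record(conn, identifier, catalogue_ref, metadata_xml):
    conn.execute(
        "INSERT INTO record VALUES (?, ?, ?)",
        (identifier, catalogue_ref, metadata_xml),
    )
    # Extract only the searchable subset into the text-based index.
    root = ET.fromstring(metadata_xml)
    for name in SEARCHABLE:
        for elem in root.iter(name):
            if elem.text:
                conn.execute(
                    "INSERT INTO search_index VALUES (?, ?, ?)",
                    (identifier, name, elem.text.strip()),
                )
    conn.commit()


def search(conn, term):
    rows = conn.execute(
        "SELECT DISTINCT identifier FROM search_index "
        "WHERE value LIKE ? ORDER BY identifier",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]
```

Elements outside the searchable subset stay inside `metadata_xml`, so adding a new element needs no change to the tables.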
UK Digital Archive (2)
record and file metadata kept separately
flexible relationship between records and computer files
Unlimited depth of record hierarchy (records can contain sub-records)
metadata imported/exported as XML so easier/quicker to store as XML
designed for ease of extension to metadata (disadvantage of extracting metadata into database tables)
<GSMElement name="Title"> rather than <Title>
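
The benefit of a generic `<GSMElement name="...">` over a dedicated `<Title>` tag is that readers of the XML do not need to know the element names in advance. A minimal sketch, assuming a wrapper element called `Metadata` and an extra element called `Retention`, neither of which comes from the slides:

```python
import xml.etree.ElementTree as ET


def read_generic_elements(xml_text):
    """Return {name: value} for every GSMElement, whatever its name is."""
    root = ET.fromstring(xml_text)
    return {
        elem.get("name"): (elem.text or "").strip()
        for elem in root.iter("GSMElement")
    }


# Hypothetical sample document; only the GSMElement form is from the slides.
SAMPLE = """
<Metadata>
  <GSMElement name="Title">Board minutes</GSMElement>
  <GSMElement name="Retention">10 years</GSMElement>
</Metadata>
"""
```

Adding `Retention` required no new tag and no change to the reading code, which is the "ease of extension" the slide refers to.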
Alternatives
VERS approach: metadata and content files encapsulated together within XML file
+ve: record is self-contained
+ve: well-suited to use of digital signatures on both metadata and content
-ve: more denormalisation required for access
-ve: complexity of adding to or editing metadata
-ve: if file is needed for more than one record, must be duplicated
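
The encapsulation idea can be illustrated by embedding a Base64-encoded content file next to its metadata in a single XML document. This is a sketch of the general technique only and does not follow the actual VERS format; the element and attribute names are assumptions.

```python
import base64
import xml.etree.ElementTree as ET


def encapsulate(title, file_name, content):
    """Wrap metadata and file content (bytes) in one self-contained XML string."""
    record = ET.Element("Record")
    ET.SubElement(record, "Title").text = title
    doc = ET.SubElement(
        record, "Content", {"fileName": file_name, "encoding": "base64"}
    )
    doc.text = base64.b64encode(content).decode("ascii")
    return ET.tostring(record, encoding="unicode")


def extract_content(record_xml):
    """Recover the original file name and bytes from an encapsulated record."""
    content = ET.fromstring(record_xml).find("Content")
    return content.get("fileName"), base64.b64decode(content.text or "")
```

The last drawback listed above follows directly: two records that need the same file each carry their own encoded copy.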
Interoperability
Not much experience in practice so far
XML helps - but not much!
Likely to be similar but not identical schemas
Different implementations of same schema
Short term: ad hoc mapping between schemas for specific systems
Longer term: various initiatives, but standardisation and semantics-based approaches are difficult
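
A short-term ad hoc mapping between two similar schemas is often little more than a lookup table. The sketch below assumes two invented sets of element names; it also returns whatever could not be mapped, since silently dropping elements would lose metadata.

```python
# Hypothetical element names for two similar but not identical schemas.
SOURCE_TO_TARGET = {
    "Title": "title",
    "Creator": "author",
    "DateCreated": "created",
}


def map_record(source_record):
    """Split a source record into mapped elements and unmapped leftovers."""
    mapped, unmapped = {}, {}
    for name, value in source_record.items():
        if name in SOURCE_TO_TARGET:
            mapped[SOURCE_TO_TARGET[name]] = value
        else:
            unmapped[name] = value
    return mapped, unmapped
```

A table like this only renames elements. It does not reconcile differences in meaning between the schemas, which is where the longer-term approaches become difficult.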
Extending or changing the schema
Schema may (will!) change in future
No “one size fits all” approach
TNA plans for extensions to core metadata according to file type and according to function
Version control
Preservation metadata
Maintain ability to understand and authentically reproduce content files
PRONOM system – separate database for file formats/accessibility
KB preservation layer model approach
Technology watch
Authentication/Integrity
Digital signatures – has something changed? (also simpler hashing algorithms)
Digital signatures – who signed it?
Control access
Audit logs
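
The "simpler hashing algorithms" option can be sketched as a fixity check: record a digest at ingest and compare it later. The function names and the choice of SHA-256 are assumptions. A plain hash only answers "has something changed?"; answering "who signed it?" needs a digital signature made with a signer's private key, which this sketch does not cover.

```python
import hashlib
import hmac


def fixity(content):
    """Return the SHA-256 digest of the content (bytes) as a hex string."""
    return hashlib.sha256(content).hexdigest()


def is_unchanged(content, recorded_digest):
    """Compare the current digest with the one recorded at ingest."""
    return hmac.compare_digest(fixity(content), recorded_digest)
```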
Conclusions
Digital preservation is still a young discipline, so “best” approach not always clear
Do something! Learn from experience
Design for flexibility/replaceability – records must outlive any implementation