Improving Metadata Quality: Strategies and Services
Diane I. Hillmann
Naomi Dushay
Jon Phipps
National Science Digital Library
Introduction
• Useful services depend on good metadata, but most metadata is not very good
• Human created metadata is expensive
• Automated crawling strategies are limited by:
– Accessibility barriers (rights issues, technical issues)
– Variable results with crawling technologies for non-text
• Best metadata does not rely solely on information contained within the resource itself
– Ex.: Controlled vocabularies, descriptions, links
The NSDL Environment
• Functions to some extent as a metadata aggregator
– Simple, two-level hierarchy (collections and items)
– Based on OAI-PMH harvest model
– Each harvested item associated with a collection
• Collection records managed via internal system that also drives automated harvest/ingest processes
– Harvested records split into elements for storage and reassembled for output
Why Transform Metadata at All?
• Four categories of problems limit metadata usefulness:
– Missing data: elements not present
– Incorrect data: values not conforming to proper usage
– Confusing data: embedded HTML tags, improper separation of multiple elements, etc.
– Insufficient data: no indication of controlled vocabularies, formats, etc.
Transforming Metadata “Safely”
• Enhance original data with no risk of degradation
• Provide a low-cost, scalable way to improve the quality and predictability of data
– Remove “noise”: empty elements, useless values
– Detect and identify controlled vocabularies: DCMIType and IMT values
– Normalize presentation: clean up values, remove double XML encodings, extra whitespace, etc.
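A minimal sketch of what such "safe" transforms might look like, assuming metadata arrives as simple (element, value) pairs. The function names, the list of "useless" values, and the MIME-type pattern are illustrative assumptions, not the NSDL implementation; the type terms are those of the DCMI Type Vocabulary.

```python
import html
import re

# Terms of the DCMI Type Vocabulary, lower-cased for matching.
DCMI_TYPES = {"collection", "dataset", "event", "image", "interactiveresource",
              "movingimage", "physicalobject", "service", "software", "sound",
              "stillimage", "text"}
# Assumed, simplified pattern for Internet Media Types (IMT).
IMT_PATTERN = re.compile(r"^(application|audio|image|text|video)/[\w.+-]+$")
# Hypothetical list of values treated as "noise".
NOISE_VALUES = {"", "n/a", "none", "unknown"}


def normalize_value(raw):
    """Undo one level of double XML encoding and collapse extra whitespace."""
    return " ".join(html.unescape(raw).split())


def safe_transform(statements):
    """Return (element, value, scheme) triples; scheme is None when undetected."""
    result = []
    for element, raw in statements:
        value = normalize_value(raw)
        if value.lower() in NOISE_VALUES:
            continue  # remove noise: empty elements, useless values
        scheme = None
        if element == "dc:type" and value.lower() in DCMI_TYPES:
            scheme = "dct:DCMIType"
        elif element == "dc:format" and IMT_PATTERN.match(value.lower()):
            scheme = "dct:IMT"
        result.append((element, value, scheme))
    return result


harvested = [
    ("dc:title", "  Surface   Chemistry &amp; Catalysis "),
    ("dc:type", "Text"),
    ("dc:format", "text/html"),
    ("dc:subject", "N/A"),
    ("dc:description", ""),
]
cleaned = safe_transform(harvested)
```

The original values are only tidied or labeled, never replaced with guesses, which is what keeps the transform "safe" in the sense described above.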
Beyond “Safe Transforms”
• Managing each "record" separately made automated maintenance and enhancement difficult
• Many sources of data required more tailored quality improvement
• Distinction between improvements to the metadata expression and additional information about the resource itself
• Potential to make the knowledge and expertise of NSDL data managers available to downstream consumers of the data
From Records to Elements
• Metadata record -- “a series of statements about resources” which can be aggregated to build a more complete profile of a resource
• Statements come with source information, and links to details about services and harvests
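The idea of treating a record as a set of sourced statements can be sketched roughly as follows. The `Statement` class and `build_profile` function are hypothetical names used for illustration only; the source labels (A, iVia, ENC) follow the provider and enhancement services named in this presentation, and the subject and audience values are invented placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Statement:
    resource: str                 # identifier of the described resource
    element: str                  # e.g. "dc:title"
    value: str
    source: str                   # provider or service that made the statement
    scheme: Optional[str] = None  # controlled vocabulary, if known


def build_profile(statements):
    """Group statements by resource and element, keeping each one's source."""
    profile = defaultdict(lambda: defaultdict(list))
    for s in statements:
        profile[s.resource][s.element].append(s)
    return profile


url = "http://www.chem.qmw.ac.uk/surfaces/scc/"
statements = [
    Statement(url, "dc:title", "Surface Chemistry", "A"),
    Statement(url, "dc:subject", "Chemistry", "iVia", "LCSH"),
    Statement(url, "dc:subject", "Science", "iVia", "GEM"),
    Statement(url, "dct:audience", "Learner", "ENC"),
]
profile = build_profile(statements)
subject_sources = [(s.scheme, s.source) for s in profile[url]["dc:subject"]]
```

Because each statement carries its own source, statements from several services can be combined into one profile without losing track of who said what.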
NSDL Metadata Repository
[Diagram: components exchange metadata with the NSDL Metadata Repository via OAI]
• Provider A, original metadata: <dc:title>, <dc:identifier>, <dc:creator>, <dc:type>
• NSDL Safe Transforms, safe transform enhancements: <dc:identifier URI>, <dc:type DCMIType>
• iVia Enhancement Service, iVia enhancements: <dc:subject GEM>, <dc:subject LCSH>, <dc:subject LCC>
• ENC Enhancement Service, ENC enhancements: <dct:audience>, <dct:educationLevel>
• NSDL normalized/augmented output:
– <dc:title source=A>, <dc:creator source=A>
– <dc:subject GEM source=iVia>, <dc:subject LCSH source=iVia>, <dc:subject LCC source=iVia>
– <dc:identifier URI source=MR>, <dc:type DCMIType source=MR>
– <dct:audience source=ENC>, <dct:educationLevel source=ENC>
Exposing Quality Information
• Metadata statements vary in quality, and may be subjective
• Quality of statements can be determined to a great extent by knowledge of the source, and knowledge of the methodology used to create the statement
• Detailed provenance itself is a good indicator of quality metadata
Exposing Data to Downstream Users
• Two major issues:
– Linking statements to particular harvested source records (including the datestamp of the harvest)
– Linking records to the services that provided them (including descriptions of those services and the methods used to create the metadata)
• Required the creation and exposure of service records and a service vocabulary to categorize them
<record>
  <metadata>
    <nsdl_dcard_m >
      …
      <dc:identifier sourceRecordID="332518" sourceServiceID="316878">http://www.chem.qmw.ac.uk/surfaces/scc/</dc:identifier>
      <dc:identifier sourceRecordID="993251" sourceServiceID="8957432" xsi:type="dct:URI">http://www.chem.qmw.ac.uk/surfaces/scc/</dc:identifier>
      <dc:language sourceRecordID="332518" sourceServiceID="316878">eng-GB</dc:language>
      <dc:language sourceRecordID="993251" sourceServiceID="8957432" xsi:type="dct:RFC3066">en-GB</dc:language>
      …
    </nsdl_dcard_m >
  </metadata>
</record>
<about>
  <sourceRecords>
    <sourceRecord recordID="332518" sourceServiceID="316878">
      <datestamp>2002-11-11</datestamp>
      <identifier>http://nsdl.org/mr/oai:nsdl.org:316878:oai:asdlib.org:asdl001709</identifier>
    </sourceRecord>
    <sourceRecord recordID="993251" sourceServiceID="8957432">
      <datestamp>2004-15-05T05:11:00Z</datestamp>
      <identifier>http://nsdl.org/mr/oai:nsdl.org:nsdl.service:993251</identifier>
    </sourceRecord>
    …
  </sourceRecords>
</about>
<about>
  <sourceServices>
    <sourceService serviceID="316878">
      <dc:title>Analytical Sciences Digital Library (ASDL)</dc:title>
      <dc:description>The ASDL is an electronic library that collects, catalogs and links web-based information or discovery material an...</dc:description>
      <dc:type xsi:type="nsdl:serviceType">Collection</dc:type>
      <serviceDescription xsi:type="nsdl:xml">http://nsdl.org/mr/xml/316878</serviceDescription>
    </sourceService>
    <sourceService serviceID="8957432">
      <dc:title>NSDL Metadata Normalization Service</dc:title>
      <dc:description>The NSDL Metadata Normalization Service provides the spices that help to create delicious sausage from metadata chicken lips, feathers...</dc:description>
      …
    </sourceService>
  </sourceServices>
</about>
<about xmlns:dc="http://purl.org/dc/elements/1.1/">
  <collectionMembership>
    <collection collectionID="316878">
      <dc:title>Analytical Sciences Digital Library (ASDL)</dc:title>
      <dc:description>The ASDL is an electronic library that collects, catalogs and links web-based information or discovery material an...</dc:description>
      <dc:identifier xsi:type="URI">http://www.asdlib.org/</dc:identifier>
      <dc:identifier>oai:nsdl.org:nsdl.nsdl:00229</dc:identifier>
    </collection>
    <collection collectionID="4718">
      <dc:title>ENC Online: The best selection of K-12 mathematics and science curriculum resources on the Internet!</dc:title>
      …
    </collection>
  </collectionMembership>
</about>
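A downstream consumer could use these attributes to trace each statement back to the service that supplied it. The following is a rough sketch using Python's standard `xml.etree.ElementTree`. The sample document is a cut-down, assumed variant of the record above: it declares the `dc` namespace explicitly, omits the wrapper element and `xsi:type` attributes, and nests `<about>` inside `<record>` as OAI-PMH does. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Simplified sample, assumed for illustration (not the full NSDL format).
SAMPLE = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <metadata>
    <dc:language sourceRecordID="332518" sourceServiceID="316878">eng-GB</dc:language>
    <dc:language sourceRecordID="993251" sourceServiceID="8957432">en-GB</dc:language>
  </metadata>
  <about>
    <sourceServices>
      <sourceService serviceID="316878">
        <dc:title>Analytical Sciences Digital Library (ASDL)</dc:title>
      </sourceService>
      <sourceService serviceID="8957432">
        <dc:title>NSDL Metadata Normalization Service</dc:title>
      </sourceService>
    </sourceServices>
  </about>
</record>
"""


def statements_with_sources(xml_text):
    """Pair each metadata statement with the title of its source service."""
    root = ET.fromstring(xml_text)
    # Map serviceID -> service title from the <about> section.
    titles = {
        svc.get("serviceID"): svc.findtext(f"{{{DC}}}title", default="").strip()
        for svc in root.iter("sourceService")
    }
    rows = []
    for el in root.find("metadata"):
        name = el.tag.split("}")[-1]  # drop the "{namespace}" prefix
        service = titles.get(el.get("sourceServiceID"), "unknown service")
        rows.append((name, (el.text or "").strip(), service))
    return rows


rows = statements_with_sources(SAMPLE)
for name, value, service in rows:
    print(f"{name} = {value!r} (from {service})")
```

With the service title in hand, a consumer can decide, for example, to prefer the normalized `en-GB` value over the provider's original `eng-GB`.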
Service Provision Model: iVia
• A variety of metadata generation services
– Crawling to determine what resources are part of a “collection”
– Metadata creation for each resource
– Augmenting metadata, adding subjects, classification, format
iVia Service Issues
• Human review of results is essential
– Error handling and blacklisting
– Log review
iVia Service Challenge
• Repeatable crawls
– Storing and reusing the crawl parameters
– Repeating the crawl on a schedule
– Incremental updates of the iVia data
– Editor notification of crawl completion
– Initiation of incremental reharvest
Metadata Quality Services
• Metadata generation & augmentation
• Metadata transformation (“safe” and “collection specific”)
• Equivalence
• Crosswalking (schema and vocabulary)
• Persistence/archiving
• Annotation
• Metadata improvement and rating
“Conducting” Service Interactions
• Order, timing, and response important
• Passive and active interactions; human and automated triggers
• Parameters for each interaction stored
• Supporting “freshness” and automated updating
Typical Service Orchestration
Introducing Lenny…
• Editor initiates an iVia Guided Crawl
• Editor reviews results, blacklists
• Editor notifies Lenny that crawl is complete
• Lenny initiates OAI harvest and ingest
• Lenny notified of ingest success
Typical Service Orchestration
• Lenny initiates Safe Transform Service
• Service notifies Lenny that it’s done
• Lenny initiates OAI harvest and ingest
• Lenny notified of ingest success
Typical Service Orchestration
• Lenny initiates Collection-Specific Transform Service
• Service notifies Lenny that it’s done
• Lenny initiates OAI harvest and ingest
• Lenny notified of ingest success
• Lenny rests
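The orchestration steps above can be sketched as a simple loop: each service finishes, then a harvest and ingest follows before the next service starts. This is an illustrative simplification with hypothetical names. It runs everything synchronously, whereas the real interactions are described as notification-driven, and the harvest function is a stub.

```python
def run_pipeline(services, harvest):
    """Run services in order; harvest and ingest after each one completes.

    services: list of (name, run_service) pairs. run_service is None when a
              human editor performs the step and only reports completion.
    harvest:  callable(name) -> bool, True when harvest and ingest succeed.
    """
    log = []
    for name, run_service in services:
        if run_service is not None:
            log.append(f"start {name}")
            run_service()
        log.append(f"done {name}")
        if not harvest(name):
            # Stop so later transforms never run on data that was not ingested.
            log.append(f"ingest failed after {name}")
            return log
        log.append(f"ingested {name}")
    log.append("rest")
    return log


services = [
    ("iVia guided crawl", None),  # editor-driven step
    ("safe transform", lambda: None),  # stub service
    ("collection-specific transform", lambda: None),  # stub service
]
log = run_pipeline(services, harvest=lambda name: True)
```

Stopping at the first failed ingest reflects the point that order, timing, and response matter: each service depends on the results of the one before it.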
The Who and Where of Services
• Many of the services we describe are useful to most metadata aggregators
• No aggregator can afford to create many single purpose services closely coupled with a single aggregator
• Shared, open services can provide a useful basis for improved metadata for all
Conclusions
• New role for “metadata aggregators”: providing enhanced metadata for other services to re-use
– Integrating fragmentary metadata created by automated services
– Improving metadata in standard ways
– Exposing all relevant data in ways that allow consumers to evaluate quality and usefulness
Conclusions
“This model of service provision holds much potential in an environment where persistent metadata quality issues threaten to overwhelm aggregators hoping to build services on top of harvested metadata. No single aggregator can fill in the quality gaps alone, but if metadata services are built to interoperate with a variety of aggregators using low barrier protocols like OAI-PMH, many can benefit from the work, freeing resources for new service development.”