a centre of expertise in digital information management
www.ukoln.ac.uk
The OAI Protocol for Metadata Harvesting
Andy [email protected]
UKOLN, University of Bath
IVOA Registry Meeting, London
March 2003
2
Contents
• a brief history of OAI
• 10 technical things you should know about the OAI-PMH
3
OAI roots
• the roots of OAI lie in the development of eprint archives…
– arXiv, CogPrints, NACA (NASA), RePEc, NDLTD, NCSTRL
• each offered Web interface for deposit of articles and for end-user searches
• difficult for end-users to work across archives without having to learn multiple different interfaces
• recognised need for single search interface to all archives
– Universal Pre-print Service (UPS)
4
Searching vs. harvesting
• two possible approaches to building a single search interface to multiple eprint archives…
– cross-searching multiple archives based on protocol like Z39.50
– harvesting metadata into one or more ‘central’ services – bulk move data to the user-interface
• US digital library experience in this area indicated that cross-searching not preferred approach
– distributed searching of N nodes viable, but only for small values of N
5
Searching vs. harvesting
[Diagram: two alternative architectures, each feeding a search service: cross-searching …or… harvesting]
6
Harvesting requirements
• in order that harvesting approach can work there need to be agreements about…
– transport protocols – HTTP vs. FTP vs. …
– metadata formats – DC vs. MARC vs. …
– quality assurance – mandatory elements, mechanisms for naming of people, subjects, etc., handling duplicated records, best-practice
– intellectual property and usage rights – who can do what with the records
• work in this area resulted in the “Santa Fe Convention”
7
Development of OAI-PMH
• 2 year metamorphosis through various names
– Santa Fe Convention, OAI-PMH versions 1.0, 1.1…
– OAI Protocol for Metadata Harvesting 2.0
• development steered by international technical committee
• inter-version stability helped developer confidence
• move from focus on eprints to more generic protocol
– move from OAI-specific metadata schema to mandatory support for DC
8
Bluffer’s guide to OAI
1. OAI-PMH is a low-cost mechanism for harvesting metadata records
– from ‘data providers’ to ‘service providers’
2. allows ‘service provider’ to say ‘give me some or all of your metadata records’
– where ‘some’ is based on date-stamps, sets, metadata formats
3. not limited to repositories of eprints
– images, museum artefacts, learning objects, …
4. based on HTTP and XML
– simple, Web-friendly, autonomous
– fast, flexible deployment
http://www.openarchives.org/
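A minimal sketch of point 2, showing how a service provider might phrase ‘give me some of your records’ as a single HTTP GET. The base URL and the set name are made up for illustration; `from`, `until`, `set` and `metadataPrefix` are the OAI-PMH 2.0 argument names for selective harvesting.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc",
                     from_date=None, until_date=None, set_spec=None):
    """Build a ListRecords request URL for selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # Each selective-harvesting argument is only sent when supplied.
    if from_date:
        params["from"] = from_date      # lower bound on record datestamp
    if until_date:
        params["until"] = until_date    # upper bound on record datestamp
    if set_spec:
        params["set"] = set_spec        # restrict to one set
    return base_url + "?" + urlencode(params)


# Hypothetical repository and set, for illustration only.
url = list_records_url("http://repository.example.org/oai",
                       from_date="2003-01-01", set_spec="physics")
print(url)
```

Leaving out all three optional arguments asks for ‘all’ of the records in the given format.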
9
Bluffer’s guide to OAI
5. OAI-PMH is not a search protocol
– but its use can underpin search-based services based on Z39.50 or SRW or SOAP or…
6. OAI-PMH carries only metadata
– content (e.g. full-text or image) made available separately – typically at URL in metadata
7. mandates simple DC as record format
– but extensible to any XML format – IMS, ONIX, MARC, METS, etc.
8. extensible framework for metadata about
– repository, resources, ‘items’, sets
– can include rights metadata
10
Bluffer’s guide to OAI
9. metadata and ‘content’ often made freely available – but not a requirement
– OAI-PMH can be used between closed groups
– or, can make metadata available but restrict access to content in some way
10. underlying HTTP protocol provides
– access control – e.g. HTTP BASIC
– compression mechanisms (for improving performance of harvesters)
– could, in theory, also provide encryption if required
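Point 10 can be illustrated with a short sketch of a harvester that asks for a compressed response and, where a repository is restricted, sends HTTP BASIC credentials. It only prepares the request and decodes a response body; the URL, user name and password are placeholders, and a real repository may support neither feature.

```python
import base64
import gzip
import urllib.request


def build_request(url, username=None, password=None):
    """Prepare an OAI-PMH request that accepts gzip and optionally authenticates."""
    req = urllib.request.Request(url)
    # Ask the repository to compress its response if it can.
    req.add_header("Accept-Encoding", "gzip")
    if username is not None:
        # HTTP BASIC: base64 of "user:password" in the Authorization header.
        token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
        req.add_header("Authorization", "Basic " + token.decode("ascii"))
    return req


def decode_body(body, content_encoding):
    """Undo gzip compression when the response says it was applied."""
    return gzip.decompress(body) if content_encoding == "gzip" else body


# Placeholder endpoint and credentials, for illustration only.
req = build_request("http://repository.example.org/oai?verb=Identify",
                    "harvester", "secret")
```

BASIC credentials are only encoded, not encrypted, so the encryption mentioned in the last sub-point would have to come from the transport.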
11
Resources, items and records
[Diagram: a resource; an item holding all available metadata about David; records in Dublin Core metadata, MARC metadata and SPECTRUM metadata; item = identifier]
12
Protocol requests
• six different request types
– Identify
– ListMetadataFormats
– ListSets
– ListIdentifiers
– ListRecords
– GetRecord
• harvester need not use all types
• repository must implement all types
• required and optional arguments
– on request types
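The six request types and their arguments can be sketched as a small lookup table plus a checker. The argument lists follow OAI-PMH 2.0 as far as I am aware, with flow control (resumption tokens) left out for brevity; the repository URL and identifier are hypothetical.

```python
from urllib.parse import urlencode

# verb -> (required arguments, optional arguments); resumptionToken omitted.
ARGUMENTS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
}


def build_request(base_url, verb, args=None):
    """Validate arguments for a verb and return the request URL."""
    args = dict(args or {})
    if verb not in ARGUMENTS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    required, optional = ARGUMENTS[verb]
    missing = required - set(args)
    unexpected = set(args) - required - optional
    if missing:
        raise ValueError(f"{verb} is missing {sorted(missing)}")
    if unexpected:
        raise ValueError(f"{verb} does not accept {sorted(unexpected)}")
    # The verb goes first; remaining arguments are sorted for a stable URL.
    return base_url + "?" + urlencode([("verb", verb)] + sorted(args.items()))


# Hypothetical repository and item identifier.
url = build_request("http://repository.example.org/oai", "GetRecord",
                    {"identifier": "oai:example.org:1",
                     "metadataPrefix": "oai_dc"})
```

A harvester could get by with only Identify and ListRecords; a repository has to answer all six.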
13
Record structure
• metadata about a resource in a particular XML format
• header (mandatory)
– identifier (1)
– datestamp (1)
– setSpec elements (*)
– status attribute for deleted item (?)
• metadata (mandatory)
– XML encoded metadata within root tag which provides namespace and schema
– repositories must support Dublin Core
• about (optional)
– rights statements
– provenance statements
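As a rough illustration of this structure, the sketch below parses one hand-written record and pulls out the header fields. The sample record, its identifier and its set name are invented; the namespace URIs are those of OAI-PMH 2.0 and simple Dublin Core.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/",
      "dc": "http://purl.org/dc/elements/1.1/"}

# Invented sample record, for illustration only.
SAMPLE = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:1</identifier>
    <datestamp>2003-03-01</datestamp>
    <setSpec>physics</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example eprint</dc:title>
    </oai_dc:dc>
  </metadata>
</record>
""".strip()


def parse_record(xml_text):
    """Return the header fields and any DC titles from one record."""
    record = ET.fromstring(xml_text)
    header = record.find("oai:header", NS)
    metadata = record.find("oai:metadata", NS)
    titles = []
    if metadata is not None:
        titles = [e.text for e in metadata.iterfind(".//dc:title", NS)]
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "sets": [e.text for e in header.findall("oai:setSpec", NS)],
        # The optional status attribute marks a deleted item.
        "deleted": header.get("status") == "deleted",
        "titles": titles,
    }


record = parse_record(SAMPLE)
```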
14
Dublin Core
• OAI-PMH mandates use of simple DC as lowest common denominator
• agreed XML schema – ‘oai_dc’
– simple DC – 15 metadata properties
– all DC properties optional and repeatable
Title Contributor Source
Creator Date Language
Subject Type Relation
Description Format Coverage
Publisher Identifier Rights
http://dublincore.org/
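A small sketch of what ‘optional and repeatable’ means in practice: an ‘oai_dc’ container is built from whichever of the 15 properties are to hand, each given as a list of values. The property values are invented, and the schemaLocation attribute that a real repository would add is omitted to keep the example short.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

# The 15 simple DC properties, as lowercase element names.
ELEMENTS = ["title", "creator", "subject", "description", "publisher",
            "contributor", "date", "type", "format", "identifier",
            "source", "language", "relation", "coverage", "rights"]


def make_oai_dc(properties):
    """Build an oai_dc:dc element from a dict of property -> list of values."""
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for name, values in properties.items():
        if name not in ELEMENTS:
            raise ValueError(f"not a simple DC property: {name}")
        # Repeatable: one child element per value.
        for value in values:
            ET.SubElement(root, f"{{{DC}}}{name}").text = value
    return root


# Invented values; any property may be absent or repeated.
dc = make_oai_dc({"title": ["An example eprint"],
                  "creator": ["Author, A.", "Author, B."]})
xml_text = ET.tostring(dc, encoding="unicode")
```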
15
OAI demonstration
• repository explorer demo
16
OAI and Google
[Diagram: Website(s), multimedia database(s), eprint archive(s); DP9 gateway; OAI gateway makes harvested metadata available to Google…]
17
Implementing OAI
• OAI protocol is relatively simple
• implementation and deployment tends to be very fast
• lots of available toolkits
– Java, Perl, PHP, etc.
• complete tools also available
– e.g. tools that sit in front of existing databases
• see ‘tools’ area on the OAI Web site…
18
Creative Commons
• CC is “devoted to expanding the range of creative work available for others to build upon and share”
• provides ‘standard’ licences for content
– attribution
– noncommercial
– no derivative works
– share alike
• mechanisms for indicating licence on Web pages
• need similar mechanism in OAI
http://www.creativecommons.org/
19
Questions…