A centre of expertise in digital information management The OAI Protocol for Metadata Harvesting...

a centre of expertise in digital information management www.ukoln.ac.uk The OAI Protocol for Metadata Harvesting Andy Powell a.powell@ukoln.ac.uk UKOLN, University of Bath IVOA Registry Meeting, London March 2003



a centre of expertise in digital information management www.ukoln.ac.uk

The OAI Protocol for Metadata Harvesting

Andy Powell a.powell@ukoln.ac.uk

UKOLN, University of Bath

IVOA Registry Meeting, London

March 2003


Contents

• a brief history of OAI
• 10 technical things you should know about the OAI-PMH


OAI roots

• the roots of OAI lie in the development of eprint archives…
– arXiv, CogPrints, NACA (NASA), RePEc, NDLTD, NCSTRL

• each offered Web interface for deposit of articles and for end-user searches

• difficult for end-users to work across archives without having to learn multiple different interfaces

• recognised need for single search interface to all archives
– Universal Pre-print Service (UPS)


Searching vs. harvesting

• two possible approaches to building a single search interface to multiple eprint archives…
– cross-searching multiple archives based on protocol like Z39.50
– harvesting metadata into one or more ‘central’ services – bulk move data to the user-interface

• US digital library experience in this area indicated that cross-searching is not the preferred approach
– distributed searching of N nodes is viable, but only for small values of N


Searching vs. harvesting

[Figure: two alternative architectures, each with a ‘search service’: cross-searching … or … harvesting]


Harvesting requirements

• in order that harvesting approach can work there need to be agreements about…
– transport protocols – HTTP vs. FTP vs. …
– metadata formats – DC vs. MARC vs. …
– quality assurance – mandatory elements, mechanisms for naming of people, subjects, etc., handling duplicated records, best-practice
– intellectual property and usage rights – who can do what with the records

• work in this area resulted in the “Santa Fe Convention”


Development of OAI-PMH

• 2 year metamorphosis through various names
– Santa Fe Convention, OAI-PMH versions 1.0, 1.1…
– OAI Protocol for Metadata Harvesting 2.0

• development steered by international technical committee

• inter-version stability helped developer confidence

• move from focus on eprints to more generic protocol
– move from OAI-specific metadata schema to mandatory support for DC


Bluffer’s guide to OAI

1. OAI-PMH is a low-cost mechanism for harvesting metadata records
– from ‘data providers’ to ‘service providers’

2. allows ‘service provider’ to say ‘give me some or all of your metadata records’
– where ‘some’ is based on date-stamps, sets, metadata formats

3. not limited to repositories of eprints
– images, museum artefacts, learning objects, …

4. based on HTTP and XML
– simple, Web-friendly, autonomous
– fast, flexible deployment

http://www.openarchives.org/
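Point 2 above (‘some or all of your metadata records’) can be sketched as a request URL. This is a minimal illustration, assuming the OAI-PMH 2.0 argument names `from`, `until`, `set` and `metadataPrefix`; the base URL and the set name `physics` are hypothetical placeholders, not a real repository.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           from_date=None, until_date=None, set_spec=None):
    """Return a ListRecords request URL for a full or selective harvest."""
    # verb and metadataPrefix are always sent; the rest narrow the harvest.
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date      # lower bound on record date-stamps
    if until_date:
        params["until"] = until_date    # upper bound on record date-stamps
    if set_spec:
        params["set"] = set_spec        # restrict the harvest to one set
    return base_url + "?" + urlencode(params)


# Hypothetical repository base URL and set name, for illustration only.
url = build_list_records_url("http://repository.example.org/oai",
                             from_date="2003-01-01", set_spec="physics")
print(url)
```

Leaving out `from_date`, `until_date` and `set_spec` asks for all records in the chosen format; supplying them asks for ‘some’.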


Bluffer’s guide to OAI

5. OAI-PMH is not a search protocol
– but its use can underpin search-based services based on Z39.50 or SRW or SOAP or…

6. OAI-PMH carries only metadata
– content (e.g. full-text or image) made available separately – typically at URL in metadata

7. mandates simple DC as record format
– but extensible to any XML format – IMS, ONIX, MARC, METS, etc.

8. extensible framework for metadata about
– repository, resources, ‘items’, sets
– can include rights metadata


Bluffer’s guide to OAI

9. metadata and ‘content’ often made freely available – but not a requirement
– OAI-PMH can be used between closed groups
– or, can make metadata available but restrict access to content in some way

10. underlying HTTP protocol provides
– access control – e.g. HTTP BASIC
– compression mechanisms (for improving performance of harvesters)
– could, in theory, also provide encryption if required
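Point 10 can be sketched from the harvester's side. This is a minimal illustration using only the Python standard library, assuming HTTP Basic credentials and gzip compression; the function names, URL and credentials are hypothetical, and no network call is made.

```python
import base64
import gzip
import urllib.request


def make_request(url, username=None, password=None):
    """Build a request that asks for gzip and optionally sends HTTP Basic credentials."""
    headers = {"Accept-Encoding": "gzip"}
    if username is not None:
        raw = f"{username}:{password}".encode("utf-8")
        headers["Authorization"] = "Basic " + base64.b64encode(raw).decode("ascii")
    return urllib.request.Request(url, headers=headers)


def decode_body(body, content_encoding):
    """Decompress the response body if the server says it is gzip-encoded."""
    if content_encoding == "gzip":
        body = gzip.decompress(body)
    return body.decode("utf-8")


# Hypothetical URL and credentials; an https:// URL would add encryption.
req = make_request("https://repository.example.org/oai?verb=Identify",
                   username="harvester", password="secret")
```

A harvester would pass `req` to `urllib.request.urlopen` and hand the response body and its `Content-Encoding` header to `decode_body`.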


Resources, items and records

[Figure: a resource; an item (all available metadata about David; item = identifier); records (Dublin Core metadata, MARC metadata, SPECTRUM metadata)]
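The resource / item / record distinction can be sketched as a small data model. This is an illustrative assumption rather than anything the protocol prescribes: the class name, the identifier and the format prefixes `marc` and `spectrum` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """All available metadata about one resource, addressed by an identifier."""
    identifier: str
    # Maps a metadata format prefix to the record serialised in that format.
    records: dict = field(default_factory=dict)

    def formats(self):
        """Metadata formats in which this item can be disseminated."""
        return sorted(self.records)

    def get_record(self, metadata_prefix):
        """Return the record for one format, or None if it is not available."""
        return self.records.get(metadata_prefix)


# Hypothetical item with one record per metadata format.
david = Item("oai:example.org:david", {
    "oai_dc": "<oai_dc:dc>...</oai_dc:dc>",
    "marc": "<record>...</record>",
    "spectrum": "<object>...</object>",
})
print(david.formats())
```

The resource itself never appears in the model; the protocol only exposes the item's identifier and its records.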


Protocol requests

• six different request types
– Identify
– ListMetadataFormats
– ListSets
– ListIdentifiers
– ListRecords
– GetRecord

• harvester need not use all types
• repository must implement all types
• required and optional arguments
– on request types
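The six request types and their arguments can be sketched as a lookup table with a simple check. The argument names follow the OAI-PMH 2.0 specification; the sketch leaves out the `resumptionToken` argument used to page through long lists, and the function name and messages are hypothetical.

```python
# Required and optional arguments for each of the six request types.
VERBS = {
    "Identify":            {"required": set(), "optional": set()},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"}},
    "ListSets":            {"required": set(), "optional": set()},
    "ListIdentifiers":     {"required": {"metadataPrefix"},
                            "optional": {"from", "until", "set"}},
    "ListRecords":         {"required": {"metadataPrefix"},
                            "optional": {"from", "until", "set"}},
    "GetRecord":           {"required": {"identifier", "metadataPrefix"},
                            "optional": set()},
}


def check_request(verb, args):
    """Return a list of problems with a request; an empty list means it is acceptable."""
    if verb not in VERBS:
        return [f"unknown request type: {verb}"]
    spec = VERBS[verb]
    given = set(args)
    problems = [f"missing required argument: {name}"
                for name in sorted(spec["required"] - given)]
    problems += [f"unexpected argument: {name}"
                 for name in sorted(given - spec["required"] - spec["optional"])]
    return problems


print(check_request("GetRecord", {"identifier": "oai:example.org:1"}))
```

A repository would run a check of this kind on every incoming request, since it must implement all six types.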


Record structure

• metadata about a resource in a particular XML format

• header (mandatory)
– identifier (1)
– datestamp (1)
– setSpec elements (*)
– status attribute for deleted item (?)

• metadata (mandatory)
– XML encoded metadata within root tag which provides namespace and schema
– repositories must support Dublin Core

• about (optional)
– rights statements
– provenance statements
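The record structure above can be sketched by parsing one record. The namespaces are those of OAI-PMH 2.0, oai_dc and Dublin Core; the sample record, its identifier, set and values are invented for illustration, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample record; identifier, set and values are placeholders.
SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:1234</identifier>
    <datestamp>2003-03-01</datestamp>
    <setSpec>physics</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An example eprint</dc:title>
      <dc:identifier>http://example.org/eprints/1234.pdf</dc:identifier>
    </oai_dc:dc>
  </metadata>
</record>"""


def parse_record(xml_text):
    """Pull the header fields and DC titles out of one record."""
    record = ET.fromstring(xml_text)
    header = record.find("oai:header", NS)
    metadata = record.find("oai:metadata", NS)  # absent for deleted records
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "sets": [e.text for e in header.findall("oai:setSpec", NS)],
        "deleted": header.get("status") == "deleted",
        "titles": ([] if metadata is None
                   else [e.text for e in metadata.findall(".//dc:title", NS)]),
        "has_about": record.find("oai:about", NS) is not None,
    }


rec = parse_record(SAMPLE)
print(rec["identifier"], rec["datestamp"], rec["sets"])
```

The `dc:identifier` in the sample shows the usual way of pointing at the content itself, which the protocol does not carry.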


Dublin Core

• OAI-PMH mandates use of simple DC as lowest common denominator

• agreed XML schema – ‘oai_dc’
– simple DC – 15 metadata properties
– all DC properties optional and repeatable

Title Contributor Source

Creator Date Language

Subject Type Relation

Description Format Coverage

Publisher Identifier Rights

http://dublincore.org/
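‘Optional and repeatable’ can be sketched by building an oai_dc metadata element. The two namespace URIs are the published oai_dc and Dublin Core ones; the function name and sample values are hypothetical, and the sketch omits the schema location attribute a real repository would add.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 simple DC properties listed on the slide.
DC_ELEMENTS = ("title", "creator", "subject", "description", "publisher",
               "contributor", "date", "type", "format", "identifier",
               "source", "language", "relation", "coverage", "rights")


def build_oai_dc(properties):
    """Build an oai_dc element; properties maps a DC name to a list of values."""
    unknown = set(properties) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not simple DC properties: {sorted(unknown)}")
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name in DC_ELEMENTS:
        # Every property may be missing (optional) or occur many times (repeatable).
        for value in properties.get(name, []):
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return root


# Hypothetical values: one title, two creators, nothing else.
root = build_oai_dc({"title": ["An example eprint"],
                     "creator": ["Author, A.", "Author, B."]})
print(ET.tostring(root, encoding="unicode"))
```

An empty dictionary is also valid input, since no property is mandatory.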


OAI demonstration

• repository explorer demo


OAI and Google

[Figure labels: Web site(s); multimedia database(s); eprint archive(s); DP9 gateway. Caption: OAI gateway makes harvested metadata available to Google…]


Implementing OAI

• OAI protocol is relatively simple
• implementation and deployment tends to be very fast
• lots of available toolkits
– Java, Perl, PHP, etc.

• complete tools also available
– e.g. tools that sit in front of existing databases

• see ‘tools’ area on the OAI Web site…


Creative Commons

• CC is “devoted to expanding the range of creative work available for others to build upon and share”

• provides ‘standard’ licences for content
– attribution
– noncommercial
– no derivative works
– share alike

• mechanisms for indicating licence on Web pages

• need similar mechanism in OAI

http://www.creativecommons.org/


Questions…
