Metadata Harvesting The Hague, 13 & 14 January 2009 Julie Verleyen Scientific Coordinator, Europeana Office EuropeanaLocal Knowledge Sharing Workshop

Metadata Harvesting

The Hague, 13 & 14 January 2009

Julie Verleyen

Scientific Coordinator, Europeana Office

EuropeanaLocal Knowledge Sharing Workshop

Table of Contents

• Harvesting in Europeana: workflow and requirements
• Best practices
• Recommendations
• Common issues
• Tools / Software
• Resources
• Documentation

Harvesting in Europeana

1. Determine collections to be contributed
   • Questionnaire

Harvesting in Europeana

2. Obtain OAI-PMH repository parameters:
   – Absolute minimum (enough for fully implemented, tested and documented OAI repositories)
      • Server base URL
   – Very useful to have:
      • Mapping between described collection(s) and OAI-PMH set(s)
      • Prefix of the metadata format to use preferably for Europeana (if not described in the ListMetadataFormats response), e.g. oai_dc, mods, tel, ese
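As an illustration of how these parameters are used, the sketch below builds OAI-PMH request URLs from a base URL, a set and a metadata prefix. The base URL and the set name "maps" are hypothetical placeholders, and the helper `oai_request_url` is not part of any Europeana tool.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL from a base URL, a verb and optional arguments."""
    params = {"verb": verb}
    # Skip arguments that were not supplied so they do not appear as empty parameters.
    params.update({k: v for k, v in arguments.items() if v is not None})
    return base_url + "?" + urlencode(params)


BASE_URL = "http://repository.example.org/oai"  # hypothetical base URL

# The base URL alone is enough to discover what the repository offers.
print(oai_request_url(BASE_URL, "Identify"))
print(oai_request_url(BASE_URL, "ListMetadataFormats"))
print(oai_request_url(BASE_URL, "ListSets"))

# With a set and a metadata prefix, a selective harvest can be requested.
print(oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc", set="maps"))
```

Knowing the collection-to-set mapping and the preferred prefix in advance saves the harvester from guessing the last two arguments.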

Harvesting in Europeana

3. Configuration of harvester

4. Full harvest with ListRecords request
   – Records collected in XML files ≤ 10MB
   – Harvest stored in SVN
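A full harvest can be sketched roughly as follows. This is not the Europeana harvester itself, only a minimal illustration using the Python standard library: it follows resumptionTokens until the list is complete and groups the serialized records into batches below a size limit. The function names and the batching strategy are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
MAX_FILE_BYTES = 10 * 1024 * 1024  # the 10MB limit per output file


def fetch(url):
    """Fetch one OAI-PMH response as bytes (network access; replaceable in tests)."""
    with urlopen(url) as response:
        return response.read()


def harvest_records(base_url, metadata_prefix, set_spec=None, fetch=fetch):
    """Yield every <record> element of a full ListRecords harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        error = root.find(OAI_NS + "error")
        if error is not None:
            raise RuntimeError("OAI-PMH error: " + error.get("code", "unknown"))
        yield from root.iter(OAI_NS + "record")
        token = root.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one: the list is complete
        # Follow-up requests carry only the verb and the resumptionToken.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}


def group_into_files(records, max_bytes=MAX_FILE_BYTES):
    """Group serialized records into batches whose total size stays under max_bytes."""
    batch, size = [], 0
    for record in records:
        data = ET.tostring(record, encoding="utf-8")
        if batch and size + len(data) > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(data)
        size += len(data)
    if batch:
        yield batch
```

Each batch returned by `group_into_files` could then be written to its own XML file. A single record larger than the limit still ends up alone in its batch, which the sketch accepts rather than splitting a record.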

Best practices: implementation

• Compliance with the OAI-PMH 2.0 protocol specification
  http://www.openarchives.org/OAI/openarchivesprotocol.html
• Follow the OAI-PMH v2 implementation guidelines for repository implementers
  http://www.openarchives.org/OAI/2.0/guidelines-repository.htm
• Full functional tests!!

Best practices: OAI validation

OAI validation = your OAI repository correctly implements the OAI-PMH!

Correct response to all OAI-PMH requests: with arguments, under various error conditions, with every OAI response valid against its XML schema, ...
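The error conditions can be probed with deliberately wrong requests. The sketch below lists a few such probes together with the error code that the OAI-PMH 2.0 specification defines for each, and checks a response for the expected code. The probe values (for example "no_such_format") are made up for illustration, and this is in no way a replacement for a full validator.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Query strings that a conforming repository must reject, paired with the
# error code defined by the OAI-PMH 2.0 specification for that situation.
ERROR_PROBES = [
    ("verb=NoSuchVerb", "badVerb"),
    ("verb=ListRecords", "badArgument"),  # metadataPrefix is missing
    ("verb=ListRecords&metadataPrefix=no_such_format", "cannotDisseminateFormat"),
    ("verb=ListRecords&resumptionToken=not-a-real-token", "badResumptionToken"),
]


def error_codes(response_xml):
    """Return the code attribute of every <error> element in an OAI-PMH response."""
    root = ET.fromstring(response_xml)
    return [error.get("code") for error in root.findall(OAI_NS + "error")]


def check_probe(response_xml, expected_code):
    """True if the response reports the error code expected for the probe."""
    return expected_code in error_codes(response_xml)
```

Appending each query string to the base URL and passing the response to `check_probe` gives a quick first impression of how a repository handles bad requests.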

Recommended approach to OAI validation

• Follow the Open Archives Initiative Protocol Testing
• Validate your server using the validator supplied by the OAI:
  http://www.openarchives.org/data/registerasprovider.html
  You can validate without registering by ticking the checkbox "only validate and do not register (you may then register later)."

http://www.openarchives.org/data/registerasprovider.html

#Protocol_Conformance_Testing

http://www.openarchives.org/data/registerasprovider.html => bottom of the page

Issues and recommendations: sets

• Set = "an optional construct for grouping items for the purpose of selective harvesting."

Number of obstacles related to sets:

• Interpreting how a repository has organized sets and determining which sets to harvest
   – Issue: setName not human understandable and/or no setDescription provided.
   – Issue: Large number of sets to sort through.
• Knowing when there are records that belong to no sets
   – Issue: Items that belong to no sets are included in the OAI repository.
• Knowing when there are empty sets
   – Issue: Data provider exposes sets with no records.
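A harvester can at least detect the first obstacle automatically. The sketch below reads a ListSets response and flags sets that have no setDescription or whose setName simply repeats the setSpec. That second test is only a rough heuristic chosen for this example, not a rule from the protocol.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def describe_sets(list_sets_xml):
    """Return one dict per <set> with its spec, name and whether it is described."""
    root = ET.fromstring(list_sets_xml)
    return [
        {
            "spec": s.findtext(OAI_NS + "setSpec"),
            "name": s.findtext(OAI_NS + "setName"),
            "described": s.find(OAI_NS + "setDescription") is not None,
        }
        for s in root.iter(OAI_NS + "set")
    ]


def sets_needing_attention(list_sets_xml):
    """List the setSpecs that are undescribed or named after their own spec."""
    return [
        s["spec"]
        for s in describe_sets(list_sets_xml)
        if not s["described"] or s["name"] == s["spec"]
    ]
```

The length of the list returned by `describe_sets` also shows at a glance how many sets there are to sort through.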

Number of obstacles related to sets:

• Understanding relationships between sets
   – Issue: Relationships between sets are not expressed.
      • There is a mechanism to express relationships between hierarchical sets
      • But no mechanism to express relationships between overlapping sets!
      • The only way to know: harvest the identifiers or records, whose headers list the sets each record belongs to
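The header-based approach can be sketched as follows: collect the setSpec values from every header of a ListIdentifiers response, then count which pairs of sets share records. Hierarchical sets are recognised by the colon-separated setSpec syntax of OAI-PMH. The function names are invented for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import combinations

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def set_membership(list_identifiers_xml):
    """Map each record identifier to the set of setSpec values in its header."""
    root = ET.fromstring(list_identifiers_xml)
    membership = {}
    for header in root.iter(OAI_NS + "header"):
        identifier = header.findtext(OAI_NS + "identifier")
        membership[identifier] = {e.text for e in header.findall(OAI_NS + "setSpec")}
    return membership


def overlapping_sets(membership):
    """Count, for each pair of unrelated sets, the records that belong to both."""
    shared = defaultdict(int)
    for specs in membership.values():
        for a, b in combinations(sorted(specs), 2):
            # "maps" and "maps:old" are parent and child, not an overlap.
            if not b.startswith(a + ":"):
                shared[(a, b)] += 1
    return dict(shared)


def records_in_no_set(membership):
    """Return the identifiers whose header carries no setSpec at all."""
    return sorted(i for i, specs in membership.items() if not specs)
```

Comparing the sets seen in the headers with the sets announced by ListSets would also reveal empty sets, since a set that never appears in any header contains no records.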

Number of obstacles related to sets:

• Knowing how many records there are within a set before harvesting
   – Issue: The number of records within a set is not expressed, although it can be given via a completeListSize attribute in a resumptionToken or within the set description.
• Knowing when a set structure has been substantially changed
   – Issue: Changes in a set structure have not been communicated.
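Where a repository does provide completeListSize, a harvester can read it from the first response of a selective ListIdentifiers or ListRecords request. A minimal sketch, assuming the response is available as bytes (the function name is invented for this example):

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def complete_list_size(response_xml):
    """Return completeListSize from the resumptionToken, or None if it is absent."""
    root = ET.fromstring(response_xml)
    token = root.find(".//" + OAI_NS + "resumptionToken")
    if token is None:
        return None  # short list delivered in one response, or tokens unsupported
    size = token.get("completeListSize")
    # The attribute is optional, so tolerate it being missing or malformed.
    return int(size) if size is not None and size.isdigit() else None
```

A result of None means the size simply was not stated, in which case the only way to count the records is to harvest them.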

Sets: recommendations

• No single best practice for the organization of sets.
• Realistically: data providers organize sets in the way that best meets the needs of their primary service provider and can easily be done within their own internal workflows.
• Useful to organize the metadata items into sets according to the collections of resources they represent.
   – The concept of a collection varies and is not completely clear in Europeana.
   – Useful for the harvester to understand the notion of collection for data providers

Basic requirements

• Repository implementation following OAI-PMH v2.0 + tested
• Inform the person responsible for Europeana harvesting of any repository changes / maintenance
• No regular harvesting scheme determined yet
• “SLA” between data providers and harvesters

Common issues

• Unavailability / unreliability of repository server
• Implementation of OAI-PMH v2 incomplete
   – resumptionToken not supported
   – Only ListIdentifiers
• XML syntax errors
• Character encoding errors
• Short lifetime of resumptionToken
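Two of these issues, XML syntax errors and character encoding errors, can be caught by the data provider before any harvester sees them. The sketch below checks a raw response for both, relying on the fact that OAI-PMH responses must be UTF-8 encoded XML. The function name and message wording are invented for this example.

```python
import xml.etree.ElementTree as ET


def diagnose_response(raw_bytes):
    """Return a list of problems found in one raw OAI-PMH response."""
    problems = []
    # OAI-PMH responses must be encoded in UTF-8.
    try:
        raw_bytes.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"character encoding error at byte {exc.start}")
    # The response must also be well-formed XML.
    try:
        ET.fromstring(raw_bytes)
    except ET.ParseError as exc:
        problems.append(f"XML syntax error: {exc}")
    return problems
```

An empty list means the response passed both checks. It does not mean the response is valid against the OAI-PMH schema, which needs a schema-aware validator.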

Tools / Software

TEL/Europeana OAI-PMH Harvester
   – Offline documentation
   – Harvester
   – Java standalone application with GUI
   – Multiple harvesting jobs
   – Resuming unfinished jobs
   – Logging
   – No scheduling, no configuration interface

Tools / Software

REPOX - http://repox.ist.utl.pt/
• Repository + Harvester
• Java standalone application with web GUI
• Multiple harvesting jobs, Scheduler
• Statistics
• Management of XML metadata repository
   – Versioning and identification of records
   – Different metadata formats
   – User interface to create metadata crosswalks: Schema mapper

Tools / Software

OAIcat from OCLC - http://www.oclc.org/research/software/oai/cat.htm
• Framework conforming to the OAI-PMH v2.0
• Repository + Harvesting
• Java web application
• Scheduling, logging
• Limited scalability (~2M records)

Tools / Software (TELplus D2.1)

Other implementations in different languages to plug into a Library Management System:

– PHP: OAIbiblio, a data provider implementation of the OAI-PMH, version 2.0. This toolkit can be easily customized to communicate with an already existing, multi-table MySQL database.
– Perl: CelestialOAI, an aggregator/cache application that imports OAI metadata from version 1.0, 1.1 and 2.0 OAI-compliant repositories, and re-exposes that metadata through either an aggregated or per-repository OAI-compliant 2.0 interface. Celestial requires oai-perl v2, MySQL, Perl 5.6.x and a CGI-capable web server.
– Ruby: ruby-oai, which includes a client library, a server/provider library and an interactive harvesting shell.
– Python: the pyoai package, which enables high-level access to an OAI-PMH metadata repository and also implements a framework for quickly creating OAI-PMH compliant servers.

Tools / Software

• ESE XML validation schemas developed by partners

Resources

• The Open Archives Initiative Protocol for Metadata Harvesting v2.0
  http://www.openarchives.org/OAI/openarchivesprotocol.html
• TELplus D2.1, “OAI-PMH implementation and tools guidelines”, 21 pages
   – Protocol overview and description of main concepts
   – OAI-PMH implementation in libraries
   – References

Resources

• Wiki “Best Practices for OAI Data Provider Implementations and Shareable Metadata”: excellent source of guidelines, tutorials, recommendations, implementation software and tools, references, etc.
  http://webservices.itcs.umich.edu/mediawiki/oaibp/index.php/Main_Page

Documentation in Europeana context

• Requirements:
   – Europeana OAI-PMH Harvesting
   – Europeana OAI-PMH Repositories
• ESE XML validation schema
• Europeana OAI-PMH data providers registry & forum/mailing list
   – Local systems
   – OAI-PMH repository solution
   – Contact

Thank you

Questions? Remarks?...

[email protected]