Metadata Harvesting
The Hague, 13 & 14 January 2009
Julie Verleyen
Scientific Coordinator, Europeana Office
EuropeanaLocal Knowledge Sharing Workshop
• Harvesting in Europeana: workflow and requirements
• Best-practices
• Recommendations
• Common issues
• Tools / Software
• Resources
• Documentation
Table of Contents
1. Determine collections to be contributed
• Questionnaire
Harvesting in Europeana
2. Obtain OAI-PMH repository parameters:
– Absolute minimum (enough for fully implemented, tested and documented OAI repositories)
• Server base URL
– Very useful to have:
• Mapping between described collection(s) and OAI-PMH set(s)
• Prefix of the metadata format to use preferably for Europeana (if not described in the ListMetadataFormats response), e.g. oai_dc, mods, tel, ese
Harvesting in Europeana
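As an illustration of how these parameters fit together, here is a minimal Python sketch that assembles a ListRecords request from a base URL, a metadata prefix and an optional set. The function name and the example.org URL are placeholders invented for this example, not part of any Europeana tool.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # "set" is optional and restricts the harvest to one set
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Placeholder base URL and set name, for illustration only
print(build_list_records_url("http://example.org/oai", "ese", "maps"))
# http://example.org/oai?verb=ListRecords&metadataPrefix=ese&set=maps
```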
3. Configuration of harvester
4. Full harvest with ListRecords request
– Records collected in XML files ≤ 10MB
– Harvest stored in SVN
Harvesting in Europeana
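A full harvest with ListRecords can be sketched roughly as below. This is not the Europeana harvester itself: the function names are invented for the example, the download function is passed in so the loop can be tried without a live server, and splitting the output into files of at most 10MB and storing them in SVN are left out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def download(url):
    """Fetch one OAI-PMH response and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix, set_spec=None, http_get=download):
    """Yield every <record> element, following resumptionTokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        root = ET.fromstring(http_get(base_url + "?" + urlencode(params)))
        error = root.find(OAI + "error")
        if error is not None:
            if error.get("code") == "noRecordsMatch":
                return  # an empty result is not a failure
            raise RuntimeError(f"{error.get('code')}: {error.text}")
        yield from root.iter(OAI + "record")
        token = root.find(f"{OAI}ListRecords/{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one, marks the end of the list
        # resumptionToken is an exclusive argument: send it with the verb only
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
```

A caller would iterate over `harvest("http://example.org/oai", "oai_dc")`, where the URL is again a placeholder.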
• Compliance with the OAI-PMH 2.0 protocol specification
http://www.openarchives.org/OAI/openarchivesprotocol.html
• Follow the OAI-PMH v2 implementation guidelines for repository implementers
http://www.openarchives.org/OAI/2.0/guidelines-repository.htm
• Full functional tests!
Best-practices: implementation
OAI validation = your OAI repository correctly implements the OAI-PMH!
Correct responses to all OAI-PMH requests: with arguments, under various error conditions, and with every OAI response valid against its XML schema.
Best-practices: OAI validation
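The error-condition part of such validation can be approximated with a few deliberately invalid requests. The sketch below is far less thorough than the OAI validator and does not replace it; the probe list and function names are assumptions made for this example, while the expected error codes come from the OAI-PMH 2.0 specification.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# (query string, error code the protocol requires in the response)
PROBES = [
    ("verb=NoSuchVerb", "badVerb"),
    ("verb=ListRecords", "badArgument"),  # metadataPrefix is missing
    ("verb=ListRecords&resumptionToken=invalid-token", "badResumptionToken"),
]


def error_codes(response_xml):
    """Return the code attribute of every <error> element in a response."""
    root = ET.fromstring(response_xml)
    return [error.get("code") for error in root.findall(OAI + "error")]


def failed_probes(base_url, http_get):
    """Send each probe and list those answered without the expected error.

    http_get is any callable taking a URL and returning the response bytes.
    """
    failures = []
    for query, expected in PROBES:
        codes = error_codes(http_get(base_url + "?" + query))
        if expected not in codes:
            failures.append((query, expected, codes))
    return failures
```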
• Follow the Open Archives Initiative protocol conformance testing
• Validate your server using the validator supplied by the OAI:
http://www.openarchives.org/data/registerasprovider.html
To validate without registering, tick the checkbox "only validate and do not register (you may then register later)."
Recommended approach to OAI validation
http://www.openarchives.org/data/registerasprovider.html#Protocol_Conformance_Testing
http://www.openarchives.org/data/registerasprovider.html => bottom of the page
• Set = "an optional construct for grouping items for the purpose of selective harvesting."
Issues and recommendations: sets
Number of obstacles related to sets:
• Interpreting how a repository has organized sets and determining which sets to harvest
– Issue: setName not human understandable and/or no setDescription provided.
– Issue: Large number of sets to sort through.
• Knowing when there are records that belong to no sets
– Issue: Items that belong to no sets are included in the OAI repository.
• Knowing when there are empty sets
– Issue: Data provider exposes sets with no records.
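The first of these obstacles can be partly detected from a ListSets response. The following sketch, with an invented function name, lists the sets that carry no setDescription; judging whether a setName is understandable to a human still has to be done by eye.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def sets_without_description(list_sets_xml):
    """Return (setSpec, setName) for each set lacking a setDescription."""
    root = ET.fromstring(list_sets_xml)
    missing = []
    for set_element in root.iter(OAI + "set"):
        if set_element.find(OAI + "setDescription") is None:
            missing.append((
                set_element.findtext(OAI + "setSpec"),
                set_element.findtext(OAI + "setName"),
            ))
    return missing
```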
Number of obstacles related to sets:
• Understanding relationships between sets
– Issue: Relationships between sets are not expressed.
• There is a mechanism to express relationships between hierarchical sets
• But no mechanism to express relationships between overlapping sets!
• The only way to know: harvest the identifiers or records, whose headers list the sets each record belongs to
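Working out set relationships from harvested headers might look like the sketch below, which reads one ListIdentifiers (or ListRecords) response. The function names are invented for the example. It relies on the protocol's convention that a colon in a setSpec expresses hierarchy, so a parent and its child set are not reported as overlapping. As a side effect it also lists records that belong to no set.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import combinations

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def set_membership(response_xml):
    """Map each setSpec to its record identifiers; also return set-less records."""
    root = ET.fromstring(response_xml)
    members = defaultdict(set)
    orphans = []
    for header in root.iter(OAI + "header"):
        identifier = header.findtext(OAI + "identifier")
        specs = [element.text for element in header.findall(OAI + "setSpec")]
        if not specs:
            orphans.append(identifier)
        for spec in specs:
            members[spec].add(identifier)
    return members, orphans


def overlapping_sets(members):
    """Return pairs of non-hierarchical sets that share at least one record."""
    pairs = []
    for a, b in combinations(sorted(members), 2):
        hierarchical = a.startswith(b + ":") or b.startswith(a + ":")
        if not hierarchical and members[a] & members[b]:
            pairs.append((a, b))
    return pairs
```

For a complete picture the membership map has to be accumulated over every page of the harvest, not a single response.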
Number of obstacles related to sets:
• Knowing how many records there are within a set before harvesting
– Issue: The number of records within a set is not expressed, although it can be given via the completeListSize attribute of a resumptionToken or within the set description.
• Knowing when a set structure has been substantially changed
– Issue: Changes in a set structure have not been communicated.
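Where a repository does supply completeListSize, a harvester can read it from the first response of a set-restricted request. A minimal sketch, with an invented function name, that returns None when the attribute or the resumptionToken is absent (which also happens when the whole list fits in one response):

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def complete_list_size(response_xml):
    """Return completeListSize from the resumptionToken, or None if absent."""
    root = ET.fromstring(response_xml)
    token = next(root.iter(OAI + "resumptionToken"), None)
    if token is None:
        return None
    size = token.get("completeListSize")
    return int(size) if size is not None else None
```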
• No single best practice for the organization of sets.
• Realistically, data providers organize sets in the way that best meets the needs of their primary service provider and that fits easily into their own internal workflows.
• Useful to organize the metadata items into sets according to the collections of resources they represent.
– Concept of collections varies and is not completely clear in Europeana.
– Useful for the harvester to understand the data providers' notion of collection
Sets: recommendations
• Repository implementation following OAI-PMH v2.0 + tested
• Inform the person responsible for harvesting at Europeana of any repository changes / maintenance
• No regular harvesting scheme determined yet
• “SLA” between data providers and harvesters
Basic requirements
• Unavailability / unreliability of repository server
• Implementation of OAI-PMH v2 incomplete
– resumptionToken not supported
– Only ListIdentifiers
• XML syntax errors
• Character encoding errors
• Short lifetime of resumptionToken
Common issues
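Two of these issues, XML syntax errors and character encoding errors, can be caught with a quick check of each raw response before it is stored. The sketch below uses an invented function name and assumes UTF-8, which the OAI-PMH specification requires for responses; it tests well-formedness only, not validity against the OAI schemas.

```python
import xml.etree.ElementTree as ET


def check_response(raw_bytes):
    """Return a list of problems found in one raw OAI-PMH response."""
    problems = []
    try:
        raw_bytes.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"not valid UTF-8 at byte {exc.start}")
    try:
        ET.fromstring(raw_bytes)
    except ET.ParseError as exc:
        problems.append(f"XML not well-formed: {exc}")
    return problems
```

An empty list means neither check found a problem.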
TEL/Europeana OAI-PMH Harvester
– Offline documentation
– Harvester
– Java standalone application with GUI
– Multiple harvesting jobs
– Resuming unfinished jobs
– Logging
– No scheduling, No configuration interface
Tools / Software
REPOX - http://repox.ist.utl.pt/
• Repository + Harvester
• Java standalone application with web GUI
• Multiple harvesting jobs, Scheduler
• Statistics
• Management of XML metadata repository
– Versioning and identification of records
– Different metadata formats
– User interface to create metadata crosswalks: Schema mapper
Tools / Software
OAIcat from OCLC - http://www.oclc.org/research/software/oai/cat.htm
• Framework conforming to the OAI-PMH v2.0
• Repository + Harvesting
• Java web application
• Scheduling, logging
• Limited scalability (~2M records)
Tools / Software
Other implementations in different languages to plug into a Library Management System:
– PHP: OAIbiblio, a data provider implementation of the OAI-PMH, version 2.0. This toolkit can be easily customized to communicate with an existing multi-table MySQL database.
– Perl: CelestialOAI, an aggregator/cache application that imports OAI metadata from repositories compliant with OAI versions 1.0, 1.1 and 2.0, and re-exposes that metadata through either an aggregated or a per-repository OAI 2.0-compliant interface. Celestial requires oai-perl v2, MySQL, Perl 5.6.x and a CGI-capable web server.
– Ruby: ruby-oai, which includes a client library, a server/provider library and an interactive harvesting shell.
– Python: the pyoai package, which enables high-level access to an OAI-PMH metadata repository and also implements a framework for quickly creating OAI-PMH compliant servers.
Tools / Software (TELplus D2.1)
• ESE XML validation schemas developed by partners
Tools / Software
• The Open Archives Initiative Protocol for Metadata Harvesting v2.0
http://www.openarchives.org/OAI/openarchivesprotocol.html
• TELplus D2.1, “OAI-PMH implementation and tools guidelines”, 21 pages
– Protocol overview and description of main concepts
– OAI-PMH implementation in libraries
– References
Resources
• Wiki “Best Practices for OAI Data Provider Implementations and Shareable Metadata”: excellent source of guidelines, tutorials, recommendations, implementation software and tools, references, etc.
http://webservices.itcs.umich.edu/mediawiki/oaibp/index.php/Main_Page
Resources
• Requirements:
– Europeana OAI-PMH Harvesting
– Europeana OAI-PMH Repositories
• ESE XML validation schema
• Europeana OAI-PMH data providers registry & forum/mailing list
– Local systems
– OAI-PMH repository solution
– Contact
Documentation in Europeana context
Thank you
Questions? Remarks?