Models, Architectures, and Technologies of Digital Libraries (2) Session 4 LIS 60639 Implementation...

Models, Architectures, and Technologies of Digital Libraries (2) Session 4 LIS 60639 Implementation of Digital Libraries Dr. Yin Zhang


Models, Architectures, and Technologies of Digital Libraries

(2)

Session 4

LIS 60639 Implementation of Digital Libraries

Dr. Yin Zhang


1. Important protocols for digital libraries

Rhyno (2004): Ch. 2 Important protocols for digital libraries and OSS options for using them


What is a protocol and why?

• Digital libraries usually are called on to communicate with many different external systems.

– These duties can range from delivering Web-based interfaces for remote users to exposing content to third-party applications.

– Certain interactions are so common or have so many requirements that a protocol has been established for standardizing and streamlining the process.

• A protocol is a set of ground rules for how systems carry out specific activities.

• Protocols often define which format and syntax systems use for exchanging information and what one system must indicate to another before any data is made available.


Core protocols for DL projects (1)

• The Hypertext Transfer Protocol (HTTP) powers the Web and is the protocol that most Web users interact with when using a Web browser.

• HTTP's ability to be plugged into many different types of technologies is shown in Figure 2.3.

• Most Web users are unaware of how many hoops the content delivered to their browsers has been through. With the use of a gateway, HTTP also can be the basis for interacting with many other types of protocols.

• A gateway takes the results of one protocol and translates them to fit the requirements of a different protocol or application; for example, taking the results of an HTML form and using the values to formulate a query to a remote database.

• For example, CGI (Common Gateway Interface) is a specification introduced in 1994 to allow HTML content to be created dynamically.
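The gateway idea above (HTML form values turned into a query against a database) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `records` table, the `title` and `author` columns, and the function name `form_to_query` are hypothetical and do not come from the reading.

```python
from urllib.parse import parse_qs

# Hypothetical searchable columns; a real gateway would take these from its own schema.
ALLOWED_FIELDS = ("title", "author")

def form_to_query(form_body):
    """Translate URL-encoded HTML form values into a parameterized SQL query."""
    fields = parse_qs(form_body)  # decodes "+" and %-escapes, one list of values per name
    clauses, params = [], []
    for name in ALLOWED_FIELDS:
        if name in fields:
            clauses.append(f"{name} LIKE ?")
            params.append(f"%{fields[name][0]}%")
    where = " AND ".join(clauses) if clauses else "1=1"
    # The SQL text never contains user input; values travel separately as parameters.
    return f"SELECT id, title FROM records WHERE {where}", params

sql, params = form_to_query("title=unix+programming&author=ritchie")
print(sql)
print(params)
```

Keeping the user's values out of the SQL string is a design choice of this sketch, not something the reading prescribes.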

The ubiquitous nature of HTTP is a testimony to both its simplicity and extensibility. A more complex protocol would be harder to map to other applications.

As a result, HTTP became firmly entrenched in the toolkits of application developers at an early stage of the Web's development and remains there today.

http://www.w3.org/Protocols/


HTTP Software Examples

• Web server software guide:

http://webdesign.about.com/cs/webservers/bb/abwebservers.htm

• Free web server software

http://en.wikipedia.org/wiki/Category:Free_web_server_software

• Apache:

– Apache exists to provide a robust and commercial-grade reference implementation of the HTTP protocol

– Apache dominates the Web server world


Core protocols for DL projects (2)

• OAI-PMH - Open Archives Initiative (OAI) Protocol for Metadata Harvesting (PMH)

– It has been called the "HTTP of digital libraries" even though the protocol actually uses HTTP as a transport mechanism between digital collections.

– OAI-PMH is several years younger than HTTP, with origins in a 1999 meeting in Santa Fe, New Mexico, to address a series of problems that were occurring in the e-print server world.

– As disciplinary e-print servers became more common, it was difficult to support searching across multiple repositories.

– Repositories needed greater capabilities to automatically identify and copy papers that had been deposited in other repositories

– The solution was the definition of an interface to permit an e-print server to expose metadata for the papers it held. This would allow the metadata to be picked up by programs on the Web called harvesters.

– Harvesting programs travel around a network gathering, or harvesting, content by copying it to a central site.

More in Reading 4.


Core protocols for DL projects (3)

• Z39.50 has roots that stretch back to the early 1970s and the Linked Systems Project for searching bibliographic databases and transferring records among the major library institutions (e.g., Library of Congress, OCLC, etc.).

• Z39.50 is a protocol that allows a client machine (called an origin) to search a server machine (called a target).

• Despite its close association with the library community, Z39.50 is a relatively generic protocol with a rich set of functions for search and retrieval, including the ability to sort result sets and registries of objects such as attribute sets that specify search points.

• These search points can be mapped onto the indexes and search capabilities of the underlying server.

• Perhaps the best-known attribute set is Bib-1, originally designed for bibliographic resources but now commonly used for a wide range of applications.


Z39.50 (cont. 1)

• Bib-1 Attribute Set:

– http://www.loc.gov/z3950/agency/defns/bib1.html

– http://www.loc.gov/z3950/agency/bib1.html

• Bib-1 comprises six types of groupings of attributes, or attribute types, that define a deep level of precision in putting together queries:

– Use attributes (type = 1) define the access point:

1 Personal-name 2 Corporate-name 3 Conference-name 4 Title 5 Title-series 6 Title-uniform 7 ISBN 8 ISSN …

– Relation attributes (type = 2) define the relation of the search term to the values in the database:

1 Less than 2 Less than or equal 3 Equal 4 Greater or equal 5 Greater than 6 Not equal …

– Position attributes (type = 3) specify the location of the search term within the field or subfield in which it appears:

1 First in field 2 First in subfield 3 Any position in field

– Structure attributes (type = 4) specify the type of search term:

1 Phrase 2 Word 3 Key 4 Year 5 Date (normalized) 6 Word list 100 Date (un-normalized) …

– Truncation attributes (type = 5) specify whether one or more characters may be omitted in matching the search term in the target system at the position specified by the Truncation attribute:

1 Right truncation 2 Left truncation 3 Left and right truncation 100 Do not truncate …

– Completeness attributes (type = 6) specify that the contents of the search term represent a complete or incomplete subfield or a complete field:

1 Incomplete subfield 2 Complete subfield 3 Complete field


Z39.50 (cont. 2)

• Z39.50-compliant systems can use these attributes, which correspond to numbers in the standard, to deconstruct queries.

• For a search query: FIND TITLE PROGRAM* OR SUBJECT UNIX

Use attributes (type = 1)

1 Personal-name 2 Corporate-name 3 Conference-name 4 Title 5 Title-series 6 Title-uniform .. 21 Subject heading

Relation attributes (type = 2)

1 Less than 2 Less than or equal 3 Equal 4 Greater or equal 5 Greater than 6 Not equal …

Position attributes (type = 3)

1 First in field 2 First in subfield 3 Any position in field

Structure attributes (type = 4)

1 Phrase 2 Word 3 Key 4 Year 5 Date (normalized) 6 Word list 100 Date (un-normalized) …

Truncation attributes (type = 5)

1 Right truncation 2 Left truncation 3 Left and right truncation 100 Do not truncate ….

Completeness attributes (type = 6)

1 Incomplete subfield 2 Complete subfield 3 Complete field
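As a rough illustration of how the query FIND TITLE PROGRAM* OR SUBJECT UNIX could be deconstructed into Bib-1 numbers, the sketch below maps each field name to a Use attribute and treats a trailing asterisk as right truncation. The toy parser and the function names are assumptions made for this example. The output is written in Prefix Query Format (PQF), a notation used by the YAZ toolkit that the slides do not mention; it is used here only as a convenient way to print attribute/term pairs.

```python
# Bib-1 numbers taken from the attribute lists above.
USE_ATTRIBUTES = {"TITLE": 4, "SUBJECT": 21}  # type 1: access point
RIGHT_TRUNCATION = 1                           # type 5
DO_NOT_TRUNCATE = 100                          # type 5

def term_to_attributes(field, term):
    """Map one FIELD TERM pair onto Bib-1 attribute numbers."""
    truncation = RIGHT_TRUNCATION if term.endswith("*") else DO_NOT_TRUNCATE
    return {"use": USE_ATTRIBUTES[field.upper()],
            "truncation": truncation,
            "term": term.rstrip("*")}

def deconstruct(query):
    """Split 'FIND F1 T1 OP F2 T2 ...' into attribute clauses and Boolean operators."""
    tokens = query.split()
    if not tokens or tokens[0].upper() != "FIND":
        raise ValueError("query must start with FIND")
    tokens = tokens[1:]
    clauses, operators = [], []
    position = 0
    while position + 1 < len(tokens):
        clauses.append(term_to_attributes(tokens[position], tokens[position + 1]))
        position += 2
        if position < len(tokens):
            operators.append(tokens[position].lower())
            position += 1
    return clauses, operators

def to_pqf(clauses, operators):
    """Render the clauses in prefix notation, combining them left to right."""
    rendered = [f'@attr 1={c["use"]} @attr 5={c["truncation"]} "{c["term"]}"'
                for c in clauses]
    pqf = rendered[0]
    for operator, clause in zip(operators, rendered[1:]):
        pqf = f"@{operator} {pqf} {clause}"
    return pqf

clauses, operators = deconstruct("FIND TITLE PROGRAM* OR SUBJECT UNIX")
print(to_pqf(clauses, operators))
# @or @attr 1=4 @attr 5=1 "PROGRAM" @attr 1=21 @attr 5=100 "UNIX"
```

The Relation, Position, Structure, and Completeness types are left out to keep the sketch short; a fuller mapping would add one attribute per type in the same way.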


Z39.50 (cont. 3)

• Z39.50 is more complex than either HTTP or OAI and is an important protocol for digital libraries because it is designed to meet the very real complexities of information retrieval.

• It also can be used as a tool to build distributed search services, also known as federated search systems:

– The client in a federated system sends a search to all of the servers comprising the federation.

– It can then gather the results and attempt to eliminate duplicates or perform value-added services such as clustering the results under topics, unlike the harvesting approach used with OAI that takes entire sets of records (see Figure 2.6).
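The federated pattern just described (send one search to every server, then merge and de-duplicate) might look roughly like this. The sketch uses in-memory catalogs instead of real Z39.50 targets, and it assumes that an ISBN is enough to recognise duplicates; both are simplifications for illustration, and all names are hypothetical.

```python
def federated_search(term, targets):
    """Send one query to every target, then merge the results and drop duplicates."""
    merged, seen = [], set()
    for name, search in targets.items():
        for record in search(term):
            key = record["isbn"]  # assumption: ISBN identifies duplicate records
            if key not in seen:
                seen.add(key)
                merged.append({**record, "source": name})
    return merged

def make_target(catalog):
    """Wrap a list of records as a searchable target (stand-in for a remote server)."""
    return lambda term: [r for r in catalog if term.lower() in r["title"].lower()]

# Hypothetical catalogs held by two members of the federation.
CATALOG_A = [{"isbn": "111", "title": "UNIX Programming"},
             {"isbn": "222", "title": "Digital Libraries"}]
CATALOG_B = [{"isbn": "111", "title": "UNIX Programming"},
             {"isbn": "333", "title": "UNIX Internals"}]

targets = {"library_a": make_target(CATALOG_A), "library_b": make_target(CATALOG_B)}
results = federated_search("unix", targets)
for record in results:
    print(record["source"], record["isbn"], record["title"])
```

Unlike OAI harvesting, nothing is copied ahead of time here: every search reaches each target at query time.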


Z39.50 Software

• Z39.50 is an abstract layer on top of an existing system, so it isn't surprising that most Z39.50 tools are architected to work on top of other applications.

• Suggested by Library of Congress:

http://www.loc.gov/z3950/agency/resources/software.html

– Free Software

– Commercial Software

• The chapter also suggests a few open source applications (see Table 2.5).


Other protocols for DL projects (4)

• Some protocols are widely supported outside of the digital library community.

• SOAP: Simple Object Access Protocol (SOAP)

– It combines XML with HTTP for accessing services, objects, and servers.

– It is a lynchpin of a suite of technologies called Web Services that leverages the Web for delivering application functions in a well-defined manner.

– SOAP allows a great deal of information to be passed to an application, and it leverages XML for laying out the data that goes between DL applications.

• RSS: RDF Site Summary (RSS)

– It is an XML-based format that allows simultaneous publication, or syndication, of lists of hyperlinks, along with other information or metadata, that help viewers decide whether they want to follow a link.

• Shibboleth: http://shibboleth.internet2.edu/

– Shibboleth is an authentication and authorization project under the auspices of Internet2, a consortium of universities working in partnership with industry vendors and government agencies to develop and deploy advanced network applications and technologies.

– The Shibboleth System is a standards based, open source software package for web single sign-on across or within organizational boundaries. It allows sites to make informed authorization decisions for individual access of protected online resources in a privacy-preserving manner.


Discussion and Reflection

• Summary:

– Protocols make network systems work together and are the basis of many formal communications.

– Digital libraries depend on protocols, particularly HTTP, OAI-PMH, and Z39.50, to provide services. Think of

• HTTP as the highway between digital libraries, with

• OAI as a friendly but comprehensive census taker that periodically turns up on the highway for updates on changes in the collection, and

• Z39.50 as a sometimes more demanding visitor asking for less predictable and more specific information on the collection.

• SOAP, RSS, and Shibboleth promise to enhance further and expand the boundaries of digital library services.

• Issues raised in this reading

• How such issues are addressed in your DL case


2. Interoperability: Standards and protocols

Witten & Bainbridge (2003): 8.5-8.7 in Ch. 8 Interoperability: Standards and protocols


Interoperability

• Interoperability is the name of the game for libraries. An important part of traditional library culture is the ability to locate copies of information in other libraries and receive them on loan (interlibrary loan). Libraries work together to provide a truly universal international information service. The degree of cooperation is enormous and laudable.

• For digital libraries to communicate with one another, standards are needed for representing documents, metadata, and queries.

• The components are in place. What we need are protocols that put them all together to achieve effective and widespread communication.

• Different protocols have sprung from the two different cultures upon which digital libraries are founded. Two principal ones:

– the Z39.50 protocol developed by the library community and maintained by the Library of Congress, and

– the Open Archives Initiative (OAI) protocol, developed by members of various communities concerned with electronic documents.


Supporting the Z39.50 protocol

• A particular Z39.50 system need not implement all parts of the protocol. The protocol is so complex that full implementation is a daunting undertaking and may in any case be inappropriate for a particular digital library site.

• For this reason the standard specifies a minimal implementation, which comprises the

– Initialization Facility,

– Search Facility,

– Present Service (part of the Retrieval Facility), and

– Type 1 Queries (part of the registry).

• Using this baseline implementation, a typical client-server exchange works as follows:

– First the client uses the Initialization Facility to establish contact with the server and negotiate values for certain resource limits.

– This puts the client in a position to transmit a Type 1 query using the Search Facility.

– The number of matching documents is returned, and the client then interacts with the Present Service to access the contents of desired documents.

• Greenstone DL software supports Z39.50
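The baseline exchange described above (Initialization, then Search, then Present) can be mimicked with a toy in-memory target. This is not a Z39.50 implementation: there is no network layer or protocol encoding, the single attribute-plus-term query stands in for a full Type 1 query, and the class and method names are invented for this sketch.

```python
class ToyTarget:
    """In-memory stand-in for a Z39.50 server (target). Hypothetical, for illustration."""

    FIELDS = {4: "title", 21: "subject"}  # Bib-1 Use attributes: 4 Title, 21 Subject heading

    def __init__(self, records, max_records_per_present=2):
        self.records = records
        self.limit = max_records_per_present
        self.result_set = []
        self.initialized = False

    def init(self, preferred_limit):
        """Initialization: negotiate a resource limit; here the smaller value wins."""
        self.initialized = True
        self.limit = min(self.limit, preferred_limit)
        return {"accepted": True, "records_per_present": self.limit}

    def search(self, use_attribute, term):
        """Search: build a result set on the server and return only the hit count."""
        if not self.initialized:
            raise RuntimeError("Init must precede Search")
        field = self.FIELDS[use_attribute]
        self.result_set = [r for r in self.records if term.lower() in r[field].lower()]
        return len(self.result_set)

    def present(self, start, count):
        """Present: return part of the stored result set (start is 1-based)."""
        count = min(count, self.limit)
        return self.result_set[start - 1:start - 1 + count]

target = ToyTarget([
    {"title": "UNIX Programming", "subject": "UNIX"},
    {"title": "Programming Pearls", "subject": "Algorithms"},
    {"title": "Program Design", "subject": "Software"},
])
session = target.init(preferred_limit=5)
hits = target.search(4, "program")
first_page = target.present(start=1, count=10)
print(session["records_per_present"], hits, [r["title"] for r in first_page])
```

The point to notice is that Search returns only a count; the records themselves arrive through separate Present requests, within the limits negotiated at initialization.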


Supporting the Open Archives Initiative (OAI)

• For a given digital library site to become an OAI data provider, software needs to be written that can respond to CGI requests and access the database system that stores the documents.

• Many programming languages have library support for implementing CGI scripts (Perl, Python, Java, and C++, among others), although the database itself will probably dictate the most suitable choice.

• Greenstone can support the construction of a digital library collection based on OAI-exported data in the following two steps:

1. obtaining the raw material from a data provider and configuring a suitable collection

2. augmenting the collection configuration file with a built-in OAI plugin

– With the issuing of the appropriate import.pl and buildcol.pl commands, the end result of these two stages is a searchable, browsable Greenstone collection based on the exported content.

– Further configuration of indexes and classifiers is possible depending on the metadata available.


Research protocols – (1) Dienst

• Dienst and SDLIP are two long-standing digital library protocols from the research community that are designed to promote interoperability.

• The trouble with interoperability though is that the purpose is defeated if several groups promote different interoperability schemes.

• Dienst, at Cornell University, is one of the longest-running digital library projects in the research community: its origins stretch back to 1992. It has three facets:

– a conceptual architecture for distributed digital libraries,

– an open protocol for service communication, and

– a software system that implements the protocol.


Research protocols – (1) Dienst (cont.)

• The protocol supports

– search and retrieval of documents,

– browsing documents,

– adding new documents, and

– registering users. Each of these is an independent

• There are six categories of DL collection services:

– repository services store digital documents and associated metadata;

– index services accept queries and return lists of document identifiers;

– query mediator services dispatch queries to the relevant index servers;

– info services return information about the state of a server;

– collection services provide information on how a set of services interact;

– registry services store user information.


(2) Simple digital library interoperability protocol (SDLIP)

• Interoperation among distributed objects has been a central plank of Stanford University's digital library project, the Infobus.

• Many Infobus objects are in fact proxies to established information sources and services.

• The original Digital Library Interoperation Protocol (DLIP) has since been superseded by the Simple Digital Library Interoperability Protocol (SDLIP), designed in collaboration with other U.S. research projects.

• SDLIP places emphasis on a design that is scalable, permitting the development of digital library applications that run on handheld devices (such as Palm Pilots) as well as workstation- and mainframe-based systems.

• There are four parts (called interfaces) to the protocol: searching, accessing results, metadata, and delivery.


Translating between protocols

• The Stanford research group provides a Java-based software development kit to support SDLIP.

• The translator runs as a server in its own right.

• For example, the translator server implements the intersection of the Greenstone protocol and SDLIP's search and source metadata interfaces.


Discussion and Reflection

• Summary:

– Four digital library protocols: Z39.50, Open Archives Initiative (OAI), Dienst, and SDLIP

– All support browsing and document retrieval, and all but OAI support searching.

– Text searching is relatively well understood: all support ranked and Boolean queries, with a rich array of options (fielded search, stemming, case matching, and so forth).

• Issues raised in this reading

• How such issues are addressed in your DL case


3. General purpose technologies useful for digital repositories

Reese & Banerjee (2008): Ch. 4 General purpose technologies useful for digital repositories


The Changing Face of Metadata

• The foundation of any digital repository is the underlying metadata structures that provide meaning to the information objects that it stores.

• Libraries have traditionally treated the creation and maintenance of bibliographic metadata as one of the core values of the profession.

• For libraries to truly integrate their digital content, their bibliographic infrastructure must change dramatically. This change must include both the metadata creation and delivery methods of bibliographic content.

• The days of a homogenous bibliographic standard for all content are coming to an end as more specialized descriptive formats are needed to describe the various types of materials being produced today and into the future.

• This chapter will focus on the technologies that make up today's digital repository systems:

– XML (eXtensible Markup Language), and

– SOAP (Simple Object Access Protocol)


XML in Libraries

• The library community has been one of the early implementers of XML-based descriptive schemas.

• Issues of document delivery, indexing, and display have pushed the library community to consider XML-based markup languages as a method of preserving digital and bibliographical information

• Today, libraries make use of XML nearly every day. We can find XML in the ILS systems, in image management tools, and in many other facets of the library.


XML in digital repositories

• The ability to provide XML-formatted data from one's digital repository is a valuable access method.

• When making decisions regarding a digital repository, one must look at how well the digital repository supports XML and XML-related technologies.

• One should ask the following questions:

– Does the digital repository support XML-structured bibliographic and administrative metadata? Does the digital repository support structural XML-based metadata schemas like METS (Metadata Encoding and Transmission Standard)?

– Can the metadata be harvested or extracted? And can the data be extracted in XML?

– Does the digital repository support SOAP or other XML query syntaxes?

– Can my digital repository support multiple metadata formats?


Why Use XML-based Metadata?

• XML is human readable

– One of the primary benefits associated with XML is that the generated metadata is human readable.

– This characteristic of XML (1) makes data more transparent, (2) makes the data less susceptible to data corruption, and (3) reduces the likelihood of data lockup.

• XML offers a quicker cataloging strategy

– In many cases, XML-based metadata schemas will lower many of the barriers organizations currently face when creating bibliographic metadata.

• XML can represent multi-formatted and embedded documents

– One of XML's strengths is its ability to represent hierarchical data structures and relationships.

– An XML record could be generated that contains information on a single document available in multiple physical formats with the unique features of each item captured within the XML data structure.
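To make the last point concrete, here is a small sketch of one record describing a single document that is available in two formats, with format-specific details nested under each. The element names (`record`, `manifestation`, `pages`, `resolution`) are made up for this example and do not follow MODS, METS, or any other schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: one document, two physical formats, each with its own details.
RECORD = """<record>
  <title>Annual Report 2007</title>
  <manifestation format="application/pdf">
    <pages>42</pages>
  </manifestation>
  <manifestation format="image/tiff">
    <resolution unit="dpi">600</resolution>
  </manifestation>
</record>"""

root = ET.fromstring(RECORD)
title = root.findtext("title")

# Collect the features recorded for each format into a plain dictionary.
formats = {
    m.get("format"): {child.tag: child.text for child in m}
    for m in root.findall("manifestation")
}
print(title)
print(formats)
```

Because the hierarchy carries the relationships, each format keeps its own fields without needing a separate record per format.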


Why Use XML-based Metadata? (continued)

• XML metadata becomes “smarter”

– In an XML document, metadata fields can have attributes and properties that can be acted upon.

– Data can be manipulated and reordered without having to rework the source XML document.

– The ability to illustrate relationships and interlinks between documents - the ability to store content or links to content within the metadata

• XML is not just a library standard

– While the LIS community has created XML-based schemas like MODS, METS, and Dublin Core, the fact that these schemas are in XML allows libraries to look outside the traditional library vendors to a broader development community.


Web Services and SOAP

• SOAP: the Simple Object Access Protocol

• SOAP is a standard method for generating APIs for Web-based applications.

• As a digital repository's content and traffic grow, users of the repository may want to access the repository's content outside the traditional user interface.

• A digital repository that lacks Web services support greatly reduces the amount of integration that an organization can accomplish with its content.

• Technologies like SOAP hold the keys to opening a digital repository beyond the "walls" of the application platform, allowing other services like search engines or users to search, harvest, or integrate data from one digital repository into their own context or workflow.
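For readers who have not seen one, a SOAP message is an XML envelope wrapping an application-specific request. The sketch below builds such an envelope for an imagined repository search. The envelope namespace is the SOAP 1.1 one; the `searchRepository` operation, its parameters, and the `http://example.org/repository` namespace are hypothetical and do not describe any particular repository's API.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "http://example.org/repository"               # hypothetical service namespace

def build_search_request(keyword, max_results):
    """Wrap a hypothetical 'searchRepository' call in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    operation = ET.SubElement(body, f"{{{APP_NS}}}searchRepository")
    ET.SubElement(operation, f"{{{APP_NS}}}keyword").text = keyword
    ET.SubElement(operation, f"{{{APP_NS}}}maxResults").text = str(max_results)
    return ET.tostring(envelope, encoding="unicode")

request_xml = build_search_request("metadata", 10)
print(request_xml)
```

In practice the resulting XML would be sent to the service in the body of an HTTP POST; that step is omitted here so the example runs without a network.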


Discussion and Reflection

• Issues raised in this reading

• How such issues are addressed in your DL case


4. Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

• http://www.oaforum.org/tutorial/english/intro.htm

• Rhyno (2004): Ch. 2 Important protocols for digital libraries and OSS options for using them


As one of the core protocols for DL projects

• OAI-PMH - Open Archives Initiative (OAI) Protocol for Metadata Harvesting (PMH)

– It has been called the "HTTP of digital libraries" even though the protocol actually uses HTTP as a transport mechanism between digital collections.

– OAI-PMH originated in a 1999 meeting in Santa Fe, New Mexico, to address a series of problems that were occurring in the e-print server world.

– As disciplinary e-print servers became more common, it was difficult to support searching across multiple repositories.

– Repositories needed greater capabilities to automatically identify and copy papers that had been deposited in other repositories

– The solution was the definition of an interface to permit an e-print server to expose metadata for the papers it held. This would allow the metadata to be picked up by programs on the Web called harvesters.

– Harvesting programs travel around a network gathering, or harvesting, content by copying it to a central site.


OAI-PMH (continued 1)

• OAI-PMH divides the world into data providers and service providers

• Registered OAI-PMH data providers: http://www.openarchives.org/Register/BrowseSites

• Data providers who support the OAI-PMH may choose to list their repository in the OAI registry, which serves to

– Provide a publicly accessible list of OAI conformant repositories, making it easy for service providers to discover repositories from which metadata can be harvested. Repositories may also wish to expose a friends container as part of their Identify response as a parallel means for guiding service providers towards repositories from which metadata can be harvested.

– Provide a mechanism for data providers to ensure their conformance with the OAI-PMH specification.

– Provide a means for the OAI to monitor use of the protocol and plan future activities and strategies.


OAI-PMH (continued 2)

• Registered OAI-PMH service providers: http://www.openarchives.org/Register/BrowseSites

– As of Feb 9, 2009, there are 959 OAI conforming repositories.

• The concept is that service providers add value to the data they harvest by defining search engines and other applications.

• Although other metadata schemes can be specified, OAI-PMH mandates that Dublin Core be available.

• OAI is purposely designed to be "low barrier" to developers. Relatively simple criteria are used for harvesting:

– date stamps, which identify when resources have last been modified, and

– sets, which group together records based on criteria defined by the data provider.


Main Technical Ideas of OAI-PMH (1)

• The main ideas of OAI

– world-wide consolidation of scholarly archives

– free access to the archives (at least: metadata)

– consistent interfaces for archives and service provider

– low barrier protocol / effortless implementation (e.g., because based on HTTP, XML, DC)

• Basic functioning of OAI-PMH

– Data Providers (open archives, repositories) provide free access to metadata, and may, but do not necessarily, offer free access to full texts or other resources. OAI-PMH provides an easy to implement, low barrier solution for Data Providers.

– Service Providers use the OAI interfaces of the Data Providers to harvest and store metadata. Note that this means that

• there are no live search requests to the Data Providers; rather, services are based on the harvested data via OAI-PMH.

• Service Providers may select certain subsets from Data Providers (e.g., by set hierarchy or date stamp).

• Service Providers offer (value-added) services on the basis of the metadata harvested, and they may enrich the harvested metadata in order to do so.


Main Technical Ideas of OAI-PMH (2)

• OAI-PMH: overview and structure model

– OAI-PMH supports six request types (known as "verbs"), e.g., http://archive.org?verb=ListRecords&from=2002-11-01.

– Responses are encoded in XML syntax. OAI-PMH supports any metadata format encoded in XML. Dublin Core is the minimal format specified for basic interoperability.
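Since every OAI-PMH request is an ordinary HTTP URL with a verb and a few arguments, building one takes very little code. The sketch below lists the six verbs defined by OAI-PMH 2.0 and assembles a request URL; the base URL `http://repository.example.org/oai` and the helper name `build_request` are placeholders invented for this example.

```python
from urllib.parse import urlencode

# The six request types ("verbs") defined by OAI-PMH 2.0.
VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord"}

def build_request(base_url, verb, **arguments):
    """Assemble an OAI-PMH request URL from a verb and its arguments."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for name, value in arguments.items():
        # "from" is a Python keyword, so callers pass from_ and the underscore is dropped.
        params[name.rstrip("_")] = value
    return f"{base_url}?{urlencode(params)}"

url = build_request("http://repository.example.org/oai", "ListRecords",
                    from_="2002-11-01", metadataPrefix="oai_dc")
print(url)
# http://repository.example.org/oai?verb=ListRecords&from=2002-11-01&metadataPrefix=oai_dc
```

The `metadataPrefix=oai_dc` argument asks for unqualified Dublin Core, the minimal format mentioned above.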


Data Provider: prerequisites

These are the things you must, should, or may have in place in order to implement OAI-PMH as a Data Provider:

– metadata on resources ("items"). These should be stored in a database (such as an SQL database). A file system may be necessary. It is necessary to have a unique identifier for each item.

– Web server, accessible via the Internet, e.g. Apache, IIS

– programming interface / API

• e.g. Perl, PHP, Java-Servlet

• web server extension

• access to database (or filesystem)

• not needed: session management

– archive identifier / base URL

– unique identifier for each item

– metadata format (one or more; at least: unqualified Dublin Core)

– datestamps for metadata (created / last modified)

– logical set hierarchy (may have). This is most usefully defined by agreement within communities, especially subject communities.

– flow control by implementation of resumption token (optional, but 'larger' repositories should have it)


Data Provider: components and architecture

Components:

• Argument Parser validates OAI requests.

• Error Generator creates XML responses with encoded error messages.

• Database Query / Local Metadata Extraction retrieves metadata from the repository, according to the required metadata format.

• XML Generator / Response Creation creates XML responses with encoded metadata information.

• Flow Control splits long responses into incomplete lists for 'larger' repositories. It uses the resumption token as the control mechanism.

[Diagram: example architecture for a Data Provider]
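The first two components can be sketched in a few lines of Python. This is a simplified illustration, not a complete implementation: the function names and message texts are assumptions, each argument is assumed to appear at most once, and argument values (dates, identifiers) are not checked. The verb/argument table and the badVerb and badArgument error codes follow OAI-PMH 2.0.

```python
from xml.sax.saxutils import escape

# verb -> (required arguments, optional arguments), per OAI-PMH 2.0
VERB_ARGUMENTS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
}
# Verbs that may be continued with an exclusive resumptionToken argument.
LIST_VERBS = {"ListSets", "ListIdentifiers", "ListRecords"}


def parse_arguments(query):
    """Argument Parser: return (code, message) on error, None if valid."""
    verb = query.get("verb")
    if verb not in VERB_ARGUMENTS:
        return ("badVerb", "Illegal or missing verb")
    given = set(query) - {"verb"}
    if "resumptionToken" in given:
        if verb in LIST_VERBS and given == {"resumptionToken"}:
            return None
        return ("badArgument", "resumptionToken is not allowed here or is not exclusive")
    required, optional = VERB_ARGUMENTS[verb]
    missing = required - given
    if missing:
        return ("badArgument", "Missing argument: " + ", ".join(sorted(missing)))
    illegal = given - required - optional
    if illegal:
        return ("badArgument", "Illegal argument: " + ", ".join(sorted(illegal)))
    return None


def error_element(code, message):
    """Error Generator: the <error> element placed inside the XML response."""
    return f'<error code="{code}">{escape(message)}</error>'


# Example: ListRecords without the required metadataPrefix.
problem = parse_arguments({"verb": "ListRecords", "from": "2002-11-01"})
print(error_element(*problem))
```

A valid request would then be handed on to the database query and XML generator components.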


Service Provider: prerequisites

• There are three technical infrastructure prerequisites for implementing an OAI-PMH Service Provider that will harvest metadata from Data Providers via OAI-PMH:

– an Internet-connected server

– a database system (relational or XML)

– a programming environment. (The programming environment must be one that can issue HTTP requests to web servers, can issue database requests, and includes an XML parser.)


Service Provider: components and architecture

• Archive management involves the selection of repositories to be harvested. Entries to your list of repositories to be harvested may be made manually or you can automatically add or remove archives using the official registry.

• Request Component creates HTTP requests and sends them to OAI repositories (Data Provider). It demands metadata using the allowed verbs of the OAI-PMH. It may do selective harvesting using the set parameter.

• Scheduler realises timed and regular retrieval of the associated archives. The simplest case would be manual initiation of the jobs, but this can be automated, e.g., as a cron job.

• Flow Control is implemented via the resumption token: the result list is partitioned into incomplete sections, and a new request retrieves the next section. On an HTTP 503 (Service Unavailable) error, the harvester can read the "Retry-After" period from the response and wait before retrying.

• Update Mechanism realises the consolidation of metadata which have been harvested earlier (merge old and new data). The easiest case would be to delete all ‘old’ metadata from each repository before harvesting it again. A reasonable alternative is to do an incremental update (from parameter) – insert new metadata and overwrite changed / deleted metadata (assignment using the unique identifiers).

• XML Parser analyses the responses received from the repositories, with validation using the XML schema, and transforms the metadata encoded in XML into the internal data structure.

• Normaliser transforms data in different metadata formats into a homogenous structure. It harmonises representation of, for example, date, author, language code. It may map between or translate different languages.

• Database receives the output of the normaliser, mapping the XML structure of the metadata into a relational database that can handle multiple values of elements. An alternative is to use an XML database.

• Duplication Checker merges identical records from different data providers. One possibility for implementing this is by the unique identifier for each item (for example, by URN). However, this solution is often not easily practicable and is not risk or error free.

• Service Module provides the actual service to the 'public'. The basis for a service provided is the harvested and stored records of the associated archives. That is, it uses only the local database for requests etc., and thus it does not make calls on the Data Providers during operation.
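A few of these components (Request Component, Flow Control, XML Parser, Update Mechanism) can be sketched together in Python. This is an illustration under several assumptions: fetch is a hypothetical callable that takes the request arguments and returns the XML response as text, the canned responses below are heavily abbreviated (real responses also carry responseDate, request, and the metadata itself), and the "database" is a plain dict keyed by identifier. Schema validation, 503 handling, normalisation, and duplicate checking are left out.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest(fetch, metadata_prefix="oai_dc", from_=None):
    """Yield (identifier, datestamp, deleted) for every record header."""
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_:
        args["from"] = from_  # incremental harvest
    while True:
        root = ET.fromstring(fetch(args))
        for header in root.iter(OAI + "header"):
            yield (
                header.findtext(OAI + "identifier"),
                header.findtext(OAI + "datestamp"),
                header.get("status") == "deleted",
            )
        token = root.findtext(f".//{OAI}resumptionToken")
        if not token:
            break  # an absent or empty token marks the last chunk
        # A resumption token must be the only argument besides the verb.
        args = {"verb": "ListRecords", "resumptionToken": token}


def update_store(store, harvested):
    """Incremental update: overwrite changed records, drop deleted ones."""
    for identifier, datestamp, deleted in harvested:
        if deleted:
            store.pop(identifier, None)
        else:
            store[identifier] = datestamp
    return store


# Canned, abbreviated responses standing in for a remote Data Provider.
PAGES = {
    None: """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>
      <record><header><identifier>oai:example.org:1</identifier>
        <datestamp>2002-11-03</datestamp></header></record>
      <record><header status="deleted"><identifier>oai:example.org:2</identifier>
        <datestamp>2002-11-04</datestamp></header></record>
      <resumptionToken>page2</resumptionToken></ListRecords></OAI-PMH>""",
    "page2": """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListRecords>
      <record><header><identifier>oai:example.org:3</identifier>
        <datestamp>2002-11-20</datestamp></header></record>
      <resumptionToken/></ListRecords></OAI-PMH>""",
}


def fake_fetch(args):
    """Hypothetical stand-in for an HTTP request to the repository."""
    return PAGES[args.get("resumptionToken")]


store = update_store(
    {"oai:example.org:2": "2002-01-01"},  # previously harvested record
    harvest(fake_fetch, from_="2002-11-01"),
)
print(store)
```

Passing the fetch function in as a parameter keeps the sketch runnable offline; in practice it would wrap an HTTP client, and a scheduler (for example, a cron job) would call harvest with the date of the previous run as from_.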


Basics of XML schemas for OAI-PMH

• OAI-PMH uses XML Schemas to define record formats.

• OAI-PMH allows for any metadata format, so long as it is encoded in XML with an XML Schema.

• You can exchange any metadata you like using OAI-PMH as long as you can encode it as XML and define an XML Schema for it.

• OAI-PMH mandates the oai_dc schema as a minimum standard for interoperability.

• All repositories must support oai_dc for a minimum level of interoperability.

– If oai_dc does not have enough elements, you can extend it.

– If oai_dc is not precise enough, a qualified Dublin Core schema can be used.

– If oai_dc is not the right schema for your community or purpose, then use something else as well.
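For illustration, the Python sketch below builds the metadata part of a record in the oai_dc format. The two namespace URIs are those used by oai_dc and Dublin Core; the helper name make_oai_dc and the sample values are assumptions, and the xsi:schemaLocation attribute that a full record normally carries is omitted to keep the example short.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
# Use the conventional prefixes when serialising.
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)


def make_oai_dc(fields):
    """Build an <oai_dc:dc> element from (element_name, value) pairs.

    Dublin Core elements are optional and repeatable, so `fields` is a
    list of pairs rather than a dict.
    """
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


# Hypothetical sample record.
record = make_oai_dc([
    ("title", "A hypothetical e-print"),
    ("creator", "Doe, Jane"),
    ("creator", "Roe, Richard"),
    ("date", "2002-11-01"),
])
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

A repository offering a second metadata format would generate that format alongside oai_dc, each defined by its own XML Schema.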


OAI Software and Tools

• There are many OAI tools available. The following table contains links to tools implemented by members of the Open Archives Initiative community:

http://www.openarchives.org/pmh/tools/tools.php

– DSpace (HP Labs and MIT Libraries): an open source digital asset management software platform that enables institutions to capture and describe digital content. It runs on a variety of hardware platforms and supports OAI-PMH version 2.0.

– eprints.org (University of Southampton): software to run centralised, discipline-based as well as distributed, institution-based archives of scholarly publications. The software is OAI compliant, i.e. metadata can be harvested from repositories running the software using the OAI metadata harvesting protocol.

– Fedora (Cornell University): an open source digital repository architecture that allows packaging of content and distributed services associated with that content. Fedora supports OAI-PMH requests on content in the repository.

– MARCXML framework (Library of Congress): a suite of tools, stylesheets, guidelines and XML documents to support MARC21 records in the XML environment. Includes tools to support transformation/migration from oai_marc to MARCXML, including an XML schema for MARC21 records.


OAI Software and Tools (cont)

• The tools you choose will depend on such considerations as the type of repository or service you are implementing and the technical skills available to you in-house:

– if you are setting up an e-print archive you may want to consider using the EPrints software package,

– DSpace provides a digital asset management framework that includes preservation considerations, and

– the advantage offered by PHP OAI Data Provider is support for on-the-fly output compression aiming at a significant reduction in data transfer load.

• In addition, about thirty OAI-related tools are described in the OA-Forum Final Report on Technical Issues (download from http://www.oaforum.org/documents/). This report also includes a detailed comparison of GNU EPrints and DSpace.


Discussion and Reflection

• Issues raised in this reading

• How such issues are addressed in your DL case