Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell...
-
Upload
harvey-webster -
Category
Documents
-
view
213 -
download
0
Transcript of Digital Library Interoperability Architecture CS 502 – 20030305 Carl Lagoze – Cornell...
Digital Library Interoperability Architecture
CS 502 – 20030305Carl Lagoze – Cornell University
Interoperability is multidimensional
• Syntax– XML
• Semantics– RDF/RDFS/OWL
• Vocabularies/Ontologies– Dublin Core/ABC/CIDOC-CRM
• Search and discovery– Z39.50– SDLIP– ZING
• Document models– METS– FEDORA
Contrast to Distributed Systems
• Distributed systems– Collections of components at different sites that are
carefully designed to work with each other
• Heterogeneous or federated systems– Cooperating systems in which individual components
are designed or operated automously
Measuring success of interoperability solutions
• Degree of component automony• Cost of infrastructure• Ease of contributing components• Ease of using components• Breadth of task complexity supported by the
solution• Scalability in the number of components
Families of interoperability solutions
S tro ng S tand ard s
F am ilies o f S tand ard s
External M ed iato rs(W rap p ers , gatew ays ,s c hem a trans alato rs )
S em antic S p ec ific atio n(O p eratio nal O nto lo gies )
A gents , A p p letsM o b ile C o d e
Interoperability Trade-offs
Cost
Functionality
HTTPGoogle
Z39.50SGML
DublinCore
MetadataHarvesting Dienst
Cornell CS 502 20020307 7
Dienst
• is a protocol and reference implementation of a distributed digital library service
• where a network of services provide• World Wide Web browser access,• uniform search over distributed indexes,• and access to structured documents.
Why a service based protocol?
• Expose the operational semantics of the services through an API,
• to permit flexible integration of the services,• and use of the services by other
clients/consumers/services.
Defining the services
• Repository – deposit, storage, and access to structured documents.
• Index – process queries on documents and returned handles
• Query Mediator – route queries to appropriate indexes• Collection – define services and content in logical
collections• User Interface – human-oriented front-end for services.• Name Server – Resolves URN’s (handles) to document
location(s)
Dienst Services
WWWbrowser
UserInterface
RepositoryIndexIndex Index
Repository Repository
QM
user query
generic search
request
specific searchrequest
NS
user documentrequest
URI
documentrequest
Collection
Collectionmetadata
Defining the protocol
• Structured messages– Service– Version– Verb– Arguments
• Template/Dienst/<service>/<version>/<verb>[?/]<arguments>
• Example/Dienst/Repository/4.0/Formats/ncstrl.cornell/TR94-1418
Why a Document Model?
• “Documents” in current web are both:– Unstructured (GET)– Chaotic (CGI)
• Different views and pieces of contents are needed for:– Bandwidth reduction– Rights management– Usability
Dienst Document Model
• Metadata – support for multiple descriptive formats
• Views – alternative expression or structural representation of the content encapsulated in the digital object
• Divs – hierarchically nested structure contained in a view
Expressing the document model in the protocol
• Structure – expose the views and structure for the digital object
• Disseminate – select the structural component (and packaging of it) to disseminate
• List-Meta-Formats – list available descriptive formats
Protocol Demonstration
• http://techreports.library.cornell.edu:8081/Dienst/Repository/4.0/List-Contents?file-after=2003-01-01
• http://techreports.library.cornell.edu:8081/Dienst/Repository/1.0/Disseminate/cul.cs/TR90-1160/%23oams/xml
• http://techreports.library.cornell.edu:8081/Dienst/Repository/2.0/Structure/cul.cs/TR90-1160
• http://techreports.library.cornell.edu:8081/Dienst/Repository/4.0/Formats/cul.cs/TR90-1160?part=body
• http://techreports.library.cornell.edu:8081/Dienst/Repository/1.0/Disseminate/cul.cs/TR90-1160/body/inline?pageimage=3
Cornell CS 502 20020307 16
Collection Service
• Periodically polled by each user interface server for– elements of the
collection– index servers for the
collectionUser Interface
Servers
IndexServers
Deploying Collection Globally
• Internet connectivity varies considerably• Good connectivity between nodes often does not
correspond to geographic proximity• Connectivity Region - a group of nodes on the network
that among them have good connectivity, relative to nodes outside of the region.
Connectivity Regions
• When possible route queries within region• In case of failure, use an alternate either within the
region or in a “nearby” region
Origins of the OAI
• Increasing interest in alternative scholarly publishing solutions – e.g., LANL arXiv
• Increasing impact through federation• UPS Mtg., Sante Fe, October 1999
– Representatives of various ePrint, library, publishing, communities
– Goal: definition of an interoperability framework among ePrint providers
– Reality: Rich interoperability protocols like Dienst are too complicated for widespread deployment
– Result: Santa Fe Convention, interoperability through metadata harvesting
DiscoveryCurrent
AwarenessPreservation
Service Providers
Data Providers
Meta
data
harv
estin
g
The World According to OAI
Yes, its about resource discovery over distributed collections
metadata
AuthorTitleAbstractIdentifer
Facilitating/Monitoring Longevity of Distributed Content
W e b S ite W e b S i teM anage d
R e po s i to ry
Se le c t ive W e b C rawling
E ve ntR e c ords
P1 A1
P2 A2
P3 A3
P o lic y E n f o r c er
acti
ons
M anage dR e po s i to ry
P r e se r v a t io n M e t a da t aP r e se r v a t io n M e t a da t a
M e tadata H arve s t ing
PreservationService
DigitalObject
Realaudio video
Powerpoint presentation
SMIL synchronization metadata
structuralmetadata
Portal A Portal B
View A:• View Slides• View Video• View synchronized presentation using applet
View B:• Get Transcript of Audio• Search for keyword• Get Slides translated to French
ToolRepository
Personalization of Content
Cross-Repository Reference Linking
citationmetadata
citationmetadata
citationmetadata
citationmetadata
citationmetadata
LinkageService
OAI Technical Infrastructure Key technical features
• Deploy now technology – 80/20 rule• Two-party model – providers (data providers) and
consumers (service providers)• Simple HTTP encoding• XML schema for some degree of protocol conformance• Extensibility
– Multiple item-level metadata– Collection level metadata
Content and Metadata
resource
Item (metadata)
repository010010
record
http://www.openarchives.org/OAI/openarchivesprotocol.html
record
<record><header>
<identifier>oai:eg:001</identifier><datestamp>1999-01-01</datestamp>
</header><metadata>
<dc xmlns=“http://purl.org/dc”><title>My Example</title>
</dc></metadata><about>
<ea xmlns=“http://www.arXiv.org/ea”<usage>No restrictions</usage>
</ea></about>
</record>
protocol support
format-specificmetadata
community-specific
record data
selective harvesting - datestamps
repos i tory
harvest withindate range
record
record
selective harvesting - sets
repos i tory
harvest within setS1
recordrecord
record
S2
set specifics
• repositories define hierarchical organization• each item in a repository may be organized in
one set, several sets, or no sets at all• meaning of sets or of set hierarchy is not
defined in protocol• individual communities may formulate
common set configurations
HTTP encoding - requests
BASE-URL -----------> an.oa.org/OAI-scriptkeyword arguments -->verb=ListIdentifers&set=S1
GET http://an.oa.org/OAI-script?verb=ListIdentifers&set=S1POST POST http://an.oa.org/OAI-script HTTP/1.0 Content-Length: 78 Content-Type: application/x-www-form-urlencoded verb=ListIdentifers&set=S1
HTTP encoding - responses
<xml version=1.0 encoding=“UTF-9” ?><GetRecord
xmlns=“http://oai.namespace.uri”xmlns:xsi=“http://w3.namespace.uri”xsi:schemaLocation=“http://oai.namespace.uri
http://oai.schemaURL”><responseDate>2000-19-01T19:30:30-04:00</responseDate><requestURL>http://an.oa.org/OAI-script?verb=GetRecord
&identifier=oai%3AarXiv%3A0001&metadataPrefix=oai_dc</requestURL>
<record>record contents
</recordadditional records
</GetRecord>
responseheader
xml namespace
s
responsedata
metadata prefix and schema
• support for harvesting multiple metadata formats– metadata schema: each format must have a validating
XML schema at a publicly accessible URL (communities may define shared formats and schema.
– metadata prefix: each repository maps a prefix to the schema it supports, which is used in protocol requests.
• support for unqualified Dublin Core mandatory– DC OAI record syntax that builds on base DCMI schema– reserved prefix oai_dc.
flow control
protocol requestharves ter
repos i tory
flow control specifics
• applies to all protocol requests that return lists: ListRecords, ListIdentifiers, ListSets
• resumptionToken is opaque• semantics of partitioning of responses within
resumption requests is undefined
Extensibility Feature Summary
• Multiple metadata formats• Collection level metadata
– Identify “about” container
• Record data– Terms and conditions– Provenance
• Set structure– Pre-configured “queries”
Supporting protocol requests:• Identify• ListMetadataFormats• ListSets
Harvesting protocol requests:• ListRecords• ListIdentifiers• GetRecord
repos i tory
harves ter
service provider data provider
OAI Protocol
Challenges and Questions
• Utility of lowest common denominator metadata such as DC
• Quality of metadata from non-professional contributors
• Machines processing to reduce and compliment human effort
• Functionality of service structure