the UPS protoproto project
herbert van de sompel, michael nelson, thomas krichel
UPS 1 Meeting, Santa Fe - October 21st, 1999
project description
demo the UPS protoproto
dex the data exchange framework
project why a protoproto?
• UPS: enable cross-archive end-user services
• protoproto:
  – facilitate discussions
  – identify issues involved in creating cross-archive services
  – experiment with digital object concepts for archive material
  – does not claim to be a solution
• protoproto is multi-disciplinary
  – a special instance of cross-archive
  – there is a market
  – promotional value
project who?
• coordination: herbert van de sompel, michael nelson, thomas krichel
• involvement of:
  – Old Dominion U & NASA Langley
  – U of Surrey
  – U of Ghent
  – Los Alamos National Laboratory - Library
  – Russian Academy of Science - Siberian branch
project sponsors
• Los Alamos National Laboratory - Research Library
• JISC eLib WoPEc project
project datasets
– metadata only
– full text remains at archives
– static dumps obtained ca. July 99

archive      objects   full-text
the arXiv     85,223      85,223
CogPrints        742         659
NACA           3,036       3,036
NCSTRL        29,184       9,084
NDLTD          1,590         951
RePEc         73,367      13,582
Total        193,142     112,535
!organization17,983
14100931
2,453
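The object and full-text counts can be cross-checked against the stated totals. The sketch below assumes the run-together digits split per archive as shown in the dictionaries (that split is a reading of the slide, supported only by the fact that the sums match); the variable names are illustrative.

```python
# Per-archive counts as read from the datasets slide (the split of the
# run-together digits is an assumption, validated by the totals below).
objects = {
    "the arXiv": 85_223, "CogPrints": 742, "NACA": 3_036,
    "NCSTRL": 29_184, "NDLTD": 1_590, "RePEc": 73_367,
}
full_text = {
    "the arXiv": 85_223, "CogPrints": 659, "NACA": 3_036,
    "NCSTRL": 9_084, "NDLTD": 951, "RePEc": 13_582,
}

total_objects = sum(objects.values())
total_full_text = sum(full_text.values())
full_text_share = total_full_text / total_objects

# Totals stated on the slide.
assert total_objects == 193_142
assert total_full_text == 112_535

print(f"{total_objects:,} objects, {total_full_text:,} with full text "
      f"({full_text_share:.1%})")
```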
project metadata formats
archive      format
the arXiv    internal
CogPrints    internal
NACA         Refer
NCSTRL       RFC1807
NDLTD        MARC
RePEc        ReDIF
• Getting metadata out of archives
  – not all archives support metadata extraction
    • some archives have undocumented metadata extraction procedures
  – not all archives support rich criteria for extraction
    • single dump concept only
• Intellectual property and use rights not always clear
project metadata extraction
• Metadata has problems with:
  – record duplication
  – crucial missing fields
  – internal errors
  – ambiguous references to people, places and publications
project metadata quality
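Two of the problems listed here, record duplication and missing crucial fields, lend themselves to simple automated checks. The following is a minimal sketch, not the procedure used for UPS: the choice of "crucial" fields, the title-normalization rule, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumption: which fields count as "crucial" is not stated in the slides.
REQUIRED_FIELDS = ("title", "authors", "date")


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return " ".join(cleaned.split())


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def duplicate_groups(records: list) -> list:
    """Group record ids whose normalized titles are identical."""
    by_title = defaultdict(list)
    for record in records:
        by_title[normalize_title(record.get("title", ""))].append(record["id"])
    return [ids for key, ids in by_title.items() if key and len(ids) > 1]


# Invented sample records, for illustration only.
records = [
    {"id": "a1", "title": "An Example Preprint: Part I",
     "authors": ["A. Author"], "date": "1999"},
    {"id": "b7", "title": "an example preprint  part i",
     "authors": ["Author, A."], "date": ""},
    {"id": "c3", "title": "A Different Report", "authors": [], "date": "1998"},
]

print(duplicate_groups(records))
for record in records:
    print(record["id"], missing_fields(record))
```

Exact-match grouping on a normalized title only catches the easy cases; ambiguous references to people and places need more than string cleanup.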
project metadata conversion
• data enhancements:
  • creation of unique identifier
  • addition of raw subject-classification
  • normalization of publication types
• all datasets converted to ReDIF:
  • essential to have a single format for the creation of services
  • supply by archives in a single format was not realistic
  • no downgrading of data
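The conversion step can be pictured as a function that takes an archive-specific record and emits a ReDIF-style template with the three enhancements applied. This is a sketch under stated assumptions: the `ups:<archive>:<id>` identifier scheme, the `TYPE_MAP` vocabulary, and the `X-Raw-Subject` / `X-Publication-Type` field names are placeholders invented here, not the ones used in the project and not claimed to be ReDIF attributes.

```python
# Assumed mapping from archive-specific type labels to a normalized vocabulary.
TYPE_MAP = {
    "tech report": "report",
    "technical report": "report",
    "e-print": "preprint",
    "preprint": "preprint",
}


def to_redif(archive: str, record: dict) -> str:
    """Render one source record as a ReDIF-style paper template."""
    raw_type = record.get("type", "").strip().lower()
    lines = ["Template-Type: ReDIF-Paper 1.0"]
    lines += [f"Author-Name: {name}" for name in record.get("authors", [])]
    lines.append(f"Title: {record['title']}")
    # Placeholder field names (assumptions, see the note above).
    if record.get("subject"):
        # The raw subject classification is carried over verbatim.
        lines.append(f"X-Raw-Subject: {record['subject']}")
    lines.append(f"X-Publication-Type: {TYPE_MAP.get(raw_type, 'other')}")
    # Hypothetical identifier scheme: archive name plus the archive's local id.
    lines.append(f"Handle: ups:{archive}:{record['local_id']}")
    return "\n".join(lines)


# Invented sample record, for illustration only.
sample = {
    "local_id": "tr-0001",
    "title": "An Example Technical Report",
    "authors": ["A. Author", "B. Writer"],
    "type": "Tech Report",
    "subject": "cs.DL",
}
print(to_redif("ncstrl", sample))
```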
project re-creation of archives
• creation of archives for ReDIF-ed metadata
• using intelligent digital objects: “buckets”
[figure: arXiv, RePEc, NCSTRL]
• Buckets were chosen to study the implications of using rich, intelligent objects in UPS
• Buckets are:
  – DL protocol / system independent
  – self-contained and mobile
  – handle their own display, enforcement of terms and conditions, and dissemination of their contents
  – designed for bundling multiple data representations and data instance types
• The aggregative nature of buckets is well suited for adding value-added services at the object level
project buckets
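To make the bucket properties concrete, here is a toy model of an object that bundles several representations, renders its own display, and enforces terms and conditions before handing out content. It is a sketch only: the class, its method names, and the package/element layout are assumptions for illustration and do not reproduce the actual bucket implementation or its interface.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    identifier: str
    metadata: dict
    packages: dict = field(default_factory=dict)   # package -> {element -> content}
    restricted: set = field(default_factory=set)   # packages behind terms and conditions

    def add_element(self, package: str, name: str, content: bytes) -> None:
        """Bundle another representation or data instance into the bucket."""
        self.packages.setdefault(package, {})[name] = content

    def display(self) -> str:
        """The object renders its own listing, independent of any archive."""
        lines = [f"{self.identifier}: {self.metadata.get('title', '(untitled)')}"]
        for package, elements in sorted(self.packages.items()):
            lock = " [restricted]" if package in self.restricted else ""
            lines.append(f"  {package}{lock}: {', '.join(sorted(elements))}")
        return "\n".join(lines)

    def get(self, package: str, name: str, accepted_terms: bool = False) -> bytes:
        """Disseminate content, enforcing terms and conditions at the object."""
        if package in self.restricted and not accepted_terms:
            raise PermissionError(f"terms not accepted for package {package!r}")
        return self.packages[package][name]


# Invented example content.
bucket = Bucket("example:0001", {"title": "An Example Report"})
bucket.add_element("paper", "report.pdf", b"%PDF...")
bucket.add_element("paper", "report.ps", b"%!PS...")
bucket.add_element("data", "run1.csv", b"t,x\n0,1\n")
bucket.restricted.add("data")
print(bucket.display())
```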
project creation of end-user service
• NCSTRL+ digital library service
• indexing buckets in archives by requesting their metadata
• enhanced user-interface
• NCSTRL+ search results point at buckets
• buckets auto-display
• buckets provide link to full-text in native archive
• UPS contains 193K objects
  – using buckets consumed inodes (~60 inodes per bucket)
    • filesystem reformatted with more generous amount of inodes
  – Solaris and Dienst conflict
    • Dienst wants each object in a publishing authority to be in a single directory
    • Solaris has a hard limit of 32K objects in a directory
    • resolution: use many (100+) authorities for UPS
project scaling problems
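The figures on this slide imply some quick arithmetic, sketched below. Two assumptions are made: "32K" is read as 32 × 1024, and the object count is taken as the 193,142 total from the datasets slide; the "~60 inodes per bucket" figure is approximate in the source, so the results are rough.

```python
import math

OBJECTS = 193_142            # total objects in UPS
INODES_PER_BUCKET = 60       # approximate ("~60") per the slide
DIR_LIMIT = 32 * 1024        # assumption: "32K" read as 32,768 entries

# Inodes consumed by storing every object as a bucket.
inodes_needed = OBJECTS * INODES_PER_BUCKET

# Smallest number of authorities (one directory each) that stays under the limit.
min_authorities = math.ceil(OBJECTS / DIR_LIMIT)

# Directory size if objects were spread evenly over 100 authorities.
per_authority_at_100 = math.ceil(OBJECTS / 100)

print(f"inodes needed:           {inodes_needed:,}")
print(f"minimum authorities:     {min_authorities}")
print(f"objects per authority:   {per_authority_at_100:,} (with 100 authorities)")
```

On these numbers the directory limit alone would be satisfied by a handful of authorities; the 100+ actually used keeps each directory far below the limit.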
project addition of linking service
• integrate the archives with the traditional communication mechanism
• context-sensitive linking to deliver extended services via SFX technology
project SFX linking service
[figure: system A, system B; labels: metadata, evaluate metadata, extended services]
project SFX linking database
• buckets for arXiv, NCSTRL and RePEc are SFX-aware
• CogPrints, NACA, NDLTD not SFX-aware
• SLAC/SPIRES is SFX-aware
• linking services for preprint metadata + for published version
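Context-sensitive linking means the set of services offered depends on what the record's metadata contains, for example offering a link to the published version only when journal details are present. The sketch below illustrates that idea generically; it is not the SFX interface, and the link server URL, the service labels, and the metadata keys are all hypothetical.

```python
from urllib.parse import urlencode

BASE = "https://linker.example.org"  # hypothetical link server


def extended_services(metadata: dict) -> list:
    """Return (label, url) pairs chosen according to what the metadata contains."""
    services = []
    # Only a record with journal details can be linked to a published version.
    if metadata.get("journal") and metadata.get("volume"):
        query = urlencode({"journal": metadata["journal"],
                           "volume": metadata["volume"]})
        services.append(("published version", f"{BASE}/article?{query}"))
    if metadata.get("authors"):
        query = urlencode({"author": metadata["authors"][0]})
        services.append(("other works by first author", f"{BASE}/author?{query}"))
    if metadata.get("title"):
        query = urlencode({"title": metadata["title"]})
        services.append(("look up in library catalogue", f"{BASE}/catalogue?{query}"))
    return services


# Invented records: a bare preprint and one that also has journal details.
preprint = {"title": "An Example Preprint", "authors": ["A. Author"]}
published = dict(preprint, journal="Example Journal", volume="12")

for label, url in extended_services(published):
    print(f"{label}: {url}")
```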
project addition of linking service
demo the UPS protoproto
http://ups.cs.odu.edu:8000/
• will be available from the beginning of November
• UPS list will be notified
• disclaimer: “not a production system”
http://ups.cs.odu.edu
dex some issues (I)
• data exchange framework
• data provision vs. data implementation
• central searching, distributed archives
• need for a framework by which archives can describe themselves:
  • content
  • terms and conditions
  • protocols, criteria supported to extract (meta)data
  • metadata scheme, subject classification scheme, material-type scheme, ...
• need for an identifier scheme for archives and archive objects
  • (cf. ISSN, ISBN, DOI)
• metadata quality obstructs the creation of services
• desirable to extend metadata with citation information
• smart objects
  • archived objects that are active, not passive
dex some issues (II)
• Providing data:
  – publishing into an archive
  – providing methods for metadata “harvesting”
    • provide non-technical context for sharing information also
• Implementing Data:
  – harvest metadata from providers
  – implement user interface to data
• Even if provided by the same DL, these are distinct functions
dex providing vs. implementing data
[figure: a Provider with input and native end-user interfaces; a Provider with input, native end-user and native harvesting interfaces]
No machine-based way to extract metadata…
Machine and user interfaces for extracting metadata…
dex providing vs. implementing data
[figure: a Provider with input and native harvesting interfaces; a Provider with input, native end-user and native harvesting interfaces; an Implementor with a native end-user interface]
Input and harvesting interfaces optional
Native end-user interface optional (e.g., RePEc)
dex providing vs. implementing data
• Much of the learning about the constituent UPS archives occurred out of band…
• Given an unknown archive, we should be able to algorithmically determine the archive’s metadata...
[figure: a Provider with input, native end-user and native harvesting interfaces]
Where possible, the harvesting interface should provide the same criteria as the end-user interface
dex self-describing archives
• Recommended criteria for metadata extraction:
  – subject classification
  – accession date
  – publication date
• Criteria for archive description
  – metadata formats employed
  – contact information for archive
  – publication type scheme
  – identifier scheme
  – subject classification scheme
dex self-describing archives
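A self-describing archive could expose both pieces listed above in machine-readable form: a description record, and a harvesting call that accepts the recommended extraction criteria. The following is a minimal sketch of that shape; the class names, field names, the `harvest` signature, and the sample holdings are all assumptions, not an existing protocol.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class ArchiveDescription:
    """What an archive says about itself (fields follow the slide's list)."""
    name: str
    contact: str
    metadata_formats: tuple
    publication_type_scheme: str
    identifier_scheme: str
    subject_scheme: str


@dataclass(frozen=True)
class Record:
    identifier: str
    subject: str
    accessioned: date
    published: date


def harvest(records: Iterable[Record],
            subject: Optional[str] = None,
            accessioned_since: Optional[date] = None,
            published_since: Optional[date] = None) -> Iterator[Record]:
    """Yield records matching every criterion that was supplied."""
    for record in records:
        if subject is not None and record.subject != subject:
            continue
        if accessioned_since is not None and record.accessioned < accessioned_since:
            continue
        if published_since is not None and record.published < published_since:
            continue
        yield record


# Invented archive and holdings, for illustration only.
description = ArchiveDescription(
    name="Example Archive",
    contact="admin@archive.example.org",
    metadata_formats=("ReDIF",),
    publication_type_scheme="example-types",
    identifier_scheme="ex:<number>",
    subject_scheme="example-subjects",
)
holdings = [
    Record("ex:1", "physics", date(1999, 3, 1), date(1998, 12, 1)),
    Record("ex:2", "economics", date(1999, 7, 15), date(1999, 6, 1)),
    Record("ex:3", "physics", date(1999, 8, 2), date(1999, 7, 20)),
]

new_physics = list(harvest(holdings, subject="physics",
                           accessioned_since=date(1999, 7, 1)))
print([record.identifier for record in new_physics])
```

Filtering on accession date is what lets a service pick up only what is new since its last visit, rather than relying on a single static dump.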
• Useful in:
  – reference linking
  – can be used in citations
  – resolving duplications
    • UPS duplications were removed by hand
  – tracking publication lifecycle
• Need the ability for an object to have multiple unique identifiers
  – organization, discipline, etc.
dex identifiers
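If each object can carry several unique identifiers (one per organization, discipline, and so on), duplicates across archives can be resolved by merging any records that share at least one identifier. The sketch below shows that merge; the identifier strings are invented and the function is an illustration, not the method used in UPS, where duplicates were removed by hand.

```python
def merge_by_identifier(records: list) -> list:
    """Merge identifier sets that overlap; each result is one distinct object."""
    merged = []
    for ids in records:
        ids = set(ids)
        # Absorb every existing group that shares an identifier with this record.
        overlapping = [group for group in merged if group & ids]
        for group in overlapping:
            ids |= group
            merged.remove(group)
        merged.append(ids)
    return merged


# Invented identifiers: the third record links the first two together.
records = [
    {"orgA:0001"},
    {"econ:42", "orgB:7"},
    {"orgA:0001", "orgB:7"},
    {"physics:9901"},
]

for group in merge_by_identifier(records):
    print(sorted(group))
```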
• Premise: Objects are more important than the archives that hold them
• SODA: Smart Objects, Dumb Archives
• Objects should be the canonical authority for
  • metadata
  • contents
  • use
• Objects should be able to grow and change
  • correct metadata
  • add new formats
  • add new services
  • reflect the lifecycle of the object
dex smart objects
• It would be beneficial if the archived objects could be heterogeneous:
  • with their own “look-and-feel”
  • unique functionality / services
    – e.g., the data archiving needs of an atmospheric scientist can be different from those of a computer scientist, engineer or medical researcher
• yet maintain a standard API for:
  • extracting metadata
  • content retrieval
  • resource discovery on the object
  • terms and conditions
dex smart objects
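The combination of heterogeneous objects behind a standard API maps naturally onto an abstract interface with discipline-specific implementations. The sketch below assumes four method names for the four API areas on the slide; those names, both example classes, and the `subset` extra are hypothetical.

```python
from abc import ABC, abstractmethod


class SmartObject(ABC):
    """The standard API every archived object is assumed to offer."""

    @abstractmethod
    def metadata(self) -> dict: ...            # extracting metadata

    @abstractmethod
    def list_contents(self) -> list: ...       # resource discovery on the object

    @abstractmethod
    def retrieve(self, name: str) -> bytes: ...  # content retrieval

    @abstractmethod
    def terms(self) -> str: ...                # terms and conditions


class TechReport(SmartObject):
    def __init__(self, title: str, pdf: bytes):
        self._title, self._pdf = title, pdf

    def metadata(self) -> dict:
        return {"title": self._title, "type": "report"}

    def list_contents(self) -> list:
        return ["report.pdf"]

    def retrieve(self, name: str) -> bytes:
        if name != "report.pdf":
            raise KeyError(name)
        return self._pdf

    def terms(self) -> str:
        return "free to read"


class AtmosphericDataset(SmartObject):
    def __init__(self, title: str, samples: dict):
        self._title, self._samples = title, samples

    def metadata(self) -> dict:
        return {"title": self._title, "type": "dataset"}

    def list_contents(self) -> list:
        return sorted(self._samples)

    def retrieve(self, name: str) -> bytes:
        return self._samples[name]

    def terms(self) -> str:
        return "registration required"

    def subset(self, prefix: str) -> list:
        """Discipline-specific extra that other object types need not offer."""
        return [name for name in sorted(self._samples) if name.startswith(prefix)]


def summarize(obj: SmartObject) -> str:
    """A service that relies only on the standard API."""
    meta = obj.metadata()
    return f"{meta['title']} ({len(obj.list_contents())} items; {obj.terms()})"


# Invented example objects.
objects = [
    TechReport("An Example Report", b"%PDF..."),
    AtmosphericDataset("Example Soundings", {"1999-07.dat": b"...", "1999-08.dat": b"..."}),
]
for obj in objects:
    print(summarize(obj))
```

A cross-archive service written against `SmartObject` works with both classes unchanged, while each class stays free to add its own functionality.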
• A strong distinction between the provision of data, and the implementation of data
  – also, a socio-legal context for sharing metadata
• Open, “self-describing” archives
• A universal, unique identifier name space
• Archived objects with more intelligence and flexibility
dex lessons learned