
the UPS protoproto project

herbert van de sompel, michael nelson, thomas krichel

UPS 1 Meeting, Santa Fe - October 21st 1999

project description

demo the UPS protoproto

dex the data exchange framework

project why a protoproto?

• UPS: enable cross-archive end-user services
• protoproto:
  – facilitate discussions
  – identify issues involved in creating cross-archive services
  – experiment with digital object concepts for archive material
  – does not claim to be a solution
• protoproto is multi-disciplinary
  – a special instance of cross-archive
  – there is a market
  – promotional value

project who?

• coordination: herbert van de sompel, michael nelson, thomas krichel

• involvement of:
  – Old Dominion U & NASA Langley
  – U of Surrey
  – U of Ghent
  – Los Alamos National Laboratory - Library
  – Russian Academy of Science - Siberian branch

project sponsors

• Los Alamos National Laboratory - Research Library
• JISC eLib WoPEc project

project datasets

– metadata only
– full text remains at archives
– static dumps obtained ca. July 99

            arXiv    CogPrints  NACA    NCSTRL   NDLTD   RePEc    total
objects     85,223   742        3,036   29,184   1,590   73,367   193,142
full-text   85,223   659        3,036   9,084    951     13,582   112,535

!organization17,983

14100931
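
As a quick arithmetic check on the dataset figures above, the sketch below sums the per-archive counts and computes the share of records with full text available. The split of the run-together digits into per-archive values is an assumption; it is the reading that reproduces the stated totals of 193,142 and 112,535.

```python
# Per-archive counts as read from the dataset slide (dumps from ca. July 1999).
# The split of the run-together digits is an assumption: it is the reading
# that reproduces the totals printed on the slide.
OBJECTS = {
    "arXiv": 85_223, "CogPrints": 742, "NACA": 3_036,
    "NCSTRL": 29_184, "NDLTD": 1_590, "RePEc": 73_367,
}
FULL_TEXT = {
    "arXiv": 85_223, "CogPrints": 659, "NACA": 3_036,
    "NCSTRL": 9_084, "NDLTD": 951, "RePEc": 13_582,
}


def full_text_share(archive: str) -> float:
    """Fraction of an archive's records that have full text available."""
    return FULL_TEXT[archive] / OBJECTS[archive]


total_objects = sum(OBJECTS.values())
total_full_text = sum(FULL_TEXT.values())
overall_share = total_full_text / total_objects
```

Under this reading, a little under 60% of the records overall have full text, with arXiv and NACA at 100%.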

project metadata formats

          arXiv      CogPrints  NACA    NCSTRL    NDLTD   RePEc
format    internal   internal   Refer   RFC1807   MARC    ReDIF

• Getting metadata out of archives
  – not all archives support metadata extraction
    • some archives have undocumented metadata extraction procedures
  – not all archives support rich criteria for extraction
    • single dump concept only
• Intellectual property and use rights not always clear

project metadata extraction

• Metadata has problems with:
  – record duplication
  – crucial missing fields
  – internal errors
  – ambiguous references to people, places, and publications

project metadata quality

project metadata conversion

• data enhancements:
  • creation of unique identifier
  • addition of raw subject-classification
  • normalization of publication types
• all datasets converted to ReDIF:
  • essential to have a single format for the creation of services
  • supply by archives in a single format was not realistic
  • no downgrading of data
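
The conversion steps listed above (unique identifier, raw subject classification, normalized publication types, no downgrading) might be sketched as follows. The field names, the type mapping and the identifier layout are all hypothetical; this is not ReDIF syntax and not the scripts used for UPS.

```python
# Hypothetical mapping from archive-specific type labels to a shared vocabulary.
TYPE_MAP = {
    "preprint": "paper",
    "tech report": "paper",
    "technical report": "paper",
    "thesis": "thesis",
    "dissertation": "thesis",
}


def convert_record(archive: str, local_id: str, record: dict) -> dict:
    """Convert one native record into a common, enhanced form."""
    converted = dict(record)  # keep every original field: no downgrading
    # Unique identifier: archive name plus the archive's own id (assumed layout).
    converted["id"] = f"{archive}:{local_id}"
    # Carry the subject classification over unmodified ("raw").
    converted["subject_raw"] = record.get("subject", "")
    # Normalize the publication type; unknown labels are kept as they are.
    raw_type = record.get("type", "").strip().lower()
    converted["type_normalized"] = TYPE_MAP.get(raw_type, raw_type)
    return converted


# Made-up example record; the identifier and values are placeholders.
example = convert_record(
    "NCSTRL",
    "example-TR-99-01",
    {"title": "An example report", "type": "Tech Report", "subject": "H.3.7"},
)
```

Copying the native record first and only adding fields is one simple way to honour "no downgrading of data": nothing the archive supplied is lost in the conversion.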

project re-creation of archives

• creation of archives for ReDIF-ed metadata
• using intelligent digital objects: “buckets”

NCSTRL

• Buckets were chosen to study the implications of using rich, intelligent objects in UPS

• Buckets are:
  – DL protocol / system independent
  – self-contained and mobile
  – handle their own display, enforcement of terms and conditions, and dissemination of their contents
  – designed for bundling multiple data representations and data instance types

• The aggregative nature of buckets is well suited for adding value-added services at the object level

project buckets

project creation of end-user service

• NCSTRL+ digital library service
• indexing buckets in archives by requesting their metadata
• enhanced user-interface
• NCSTRL+ search results point at buckets
• buckets auto-display
• buckets provide link to full-text in native archive

• UPS contains 193K objects
  – using buckets consumed inodes (~60 inodes per bucket)
    • filesystem reformatted with more generous amount of inodes
  – Solaris and Dienst conflict
    • Dienst wants each object in a publishing authority to be in a single directory
    • Solaris has a hard limit of 32K objects in a directory
    • resolution: use many (100+) authorities for UPS

project scaling problems
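
The scaling figures can be worked through with a short calculation. This is a sketch under stated assumptions: "32K" is read as 32,000, the object count is taken from the dataset totals, and the hash-based assignment of objects to authorities is a hypothetical scheme (the slides do not say how UPS distributed objects over its 100+ authorities).

```python
import math
import zlib

OBJECTS = 193_142        # total objects in UPS ("193K")
INODES_PER_BUCKET = 60   # approximate figure from the slide
DIR_LIMIT = 32_000       # "32K" read as 32,000; the exact limit is an assumption


def inodes_needed(objects: int = OBJECTS) -> int:
    """Approximate inode consumption when every object is a bucket."""
    return objects * INODES_PER_BUCKET


def min_authorities(objects: int = OBJECTS, limit: int = DIR_LIMIT) -> int:
    """Smallest number of authorities (directories) that stays under the limit."""
    return math.ceil(objects / limit)


def authority_for(object_id: str, n_authorities: int = 128) -> str:
    """Assign an object to an authority with a stable hash (hypothetical scheme)."""
    shard = zlib.crc32(object_id.encode("utf-8")) % n_authorities
    return f"ups{shard:03d}"
```

With these numbers the buckets need roughly 11.6 million inodes, and at least 7 authorities would be required to stay under the directory limit; using 100+ leaves ample headroom for growth.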

project addition of linking service

• integrate the archives with the traditional communication mechanism
• context-sensitive linking to deliver extended services via SFX technology

project SFX linking service

[Diagram: SFX linking service. Labels: system A, system B, metadata, evaluate metadata, extended services]

project SFX linking database

• buckets for arXiv, NCSTRL and RePEc are SFX-aware
• CogPrints, NACA, NDLTD not SFX-aware
• SLAC/SPIRES is SFX-aware
• linking services for preprint metadata + for published version

project addition of linking service

demo the UPS protoproto

http://ups.cs.odu.edu:8000/

• will be available from the beginning of November
• UPS list will be notified
• disclaimer “not a production system”

http://ups.cs.odu.edu

dex some issues (I)

• data exchange framework
• data provision vs. data implementation
• central searching, distributed archives

• need for a framework by which archives can describe themselves:
  • content
  • terms and conditions
  • protocols, criteria supported to extract (meta)data
  • metadata scheme, subject classification scheme, material-type scheme, ...
• need for an identifier scheme for archives and archive objects
  • (cf. ISSN, ISBN, DOI)
• metadata quality obstructs the creation of services
• desirable to extend metadata with citation information
• smart objects
  • archived objects that are active, not passive

dex some issues (II)

• Providing data:
  – publishing into an archive
  – providing methods for metadata “harvesting”
    • provide non-technical context for sharing information also
• Implementing data:
  – harvest metadata from providers
  – implement user interface to data
• Even if provided by the same DL, these are distinct functions

dex providing vs. implementing data

Provider: input interface, native end-user interface, native harvesting interface

No machine-based way to extract metadata…

Machine and user interfaces for extracting metadata…

Implementor: native end-user interface

Input and harvesting interfaces optional

Native end-user interface optional (e.g., RePEc)

• Much of the learning about the constituent UPS archives occurred out of band…

• Given an unknown archive, we should be able to algorithmically determine the archive’s metadata...

Native harvesting interface: where possible, the harvesting interface should provide the same criteria as the end-user interface

dex self-describing archives

• Recommended criteria for metadata extraction:
  – subject classification
  – accession date
  – publication date
• Criteria for archive description
  – metadata formats employed
  – contact information for archive
  – publication type scheme
  – identifier scheme
  – subject classification scheme

dex self-describing archives
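
A self-describing archive along these lines could expose a small machine-readable description plus a harvesting call that accepts the recommended criteria. The sketch below is only an illustration: the class names, field names and the `harvest` function are hypothetical, and the sample values are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ArchiveDescription:
    """What an archive would publish about itself (hypothetical layout)."""
    name: str
    contact: str
    metadata_formats: List[str]
    publication_type_scheme: str
    identifier_scheme: str
    subject_classification_scheme: str


@dataclass
class Record:
    identifier: str
    subject: str
    accession_date: date
    publication_date: date


def harvest(records: List[Record],
            subject: Optional[str] = None,
            accessioned_since: Optional[date] = None,
            published_since: Optional[date] = None) -> List[Record]:
    """Return the records that match every criterion that was supplied."""
    selected = []
    for r in records:
        if subject is not None and r.subject != subject:
            continue
        if accessioned_since is not None and r.accession_date < accessioned_since:
            continue
        if published_since is not None and r.publication_date < published_since:
            continue
        selected.append(r)
    return selected


# Placeholder description and records, for illustration only.
description = ArchiveDescription(
    name="example archive", contact="admin@example.org",
    metadata_formats=["ReDIF"], publication_type_scheme="local",
    identifier_scheme="local", subject_classification_scheme="local",
)
r1 = Record("example:1", "econ", date(1999, 6, 1), date(1998, 1, 1))
r2 = Record("example:2", "cs", date(1999, 7, 15), date(1999, 5, 1))
```

Harvesting by accession date, for example `harvest([r1, r2], accessioned_since=date(1999, 7, 1))`, is what would let a service pick up only new records instead of relying on a single full dump.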

• Useful in:
  – reference linking
  – can be used in citations
  – resolving duplications
    • UPS duplications were removed by hand
  – tracking publication lifecycle
• Need the ability for an object to have multiple unique identifiers
  – organization, discipline, etc.

dex identifiers
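
If objects carry several unique identifiers, duplicates could be found mechanically instead of by hand. The sketch below groups records that share at least one identifier, directly or through a chain, using a small union-find. The identifier strings are made up, and this is not how UPS handled duplicates (those were removed manually).

```python
from collections import defaultdict


def group_duplicates(records: dict) -> list:
    """Group record keys that share an identifier.

    `records` maps a record key to the set of identifiers the record carries.
    Returns a sorted list of sorted groups.
    """
    parent = {key: key for key in records}

    def find(key):
        # Follow parent links to the group representative (with path halving).
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    first_seen = {}  # identifier -> first record key that carried it
    for key, identifiers in records.items():
        for ident in identifiers:
            if ident in first_seen:
                parent[find(key)] = find(first_seen[ident])
            else:
                first_seen[ident] = key

    groups = defaultdict(set)
    for key in records:
        groups[find(key)].add(key)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Made-up identifiers: "a" and "b" share an organizational identifier.
sample = {
    "a": {"arXiv:example/1", "org:X-1"},
    "b": {"org:X-1"},
    "c": {"RePEc:example:9"},
}
groups = group_duplicates(sample)
```

Here `groups` puts "a" and "b" together and leaves "c" on its own, which is the kind of grouping that would also support tracking one work through its publication lifecycle.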

• Premise: Objects are more important than the archives that hold them

• SODA: Smart Objects, Dumb Archives

• Objects should be the canonical authority for
  • metadata
  • contents
  • use
• Objects should be able to grow and change
  • correct metadata
  • add new formats
  • add new services
  • reflect the lifecycle of the object

dex smart objects

• It would be beneficial if the archived objects could be heterogeneous:
  • with their own “look-and-feel”
  • unique functionality / services
    – e.g., the data archiving needs of an atmospheric scientist can be different from those of a computer scientist, engineer or medical researcher
• yet maintain a standard API for:
  • extracting metadata
  • content retrieval
  • resource discovery on the object
  • terms and conditions

dex smart objects
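
One way to picture "heterogeneous objects behind a standard API" is an abstract interface with the four operations named above, which each kind of object implements in its own way. The class and method names below are hypothetical, and the toy implementation is not the bucket implementation used in UPS.

```python
from abc import ABC, abstractmethod


class SmartObject(ABC):
    """Hypothetical standard API that every archived object would expose."""

    @abstractmethod
    def metadata(self) -> dict:
        """Extract the object's metadata."""

    @abstractmethod
    def content(self, fmt: str) -> bytes:
        """Retrieve the content in one of the object's formats."""

    @abstractmethod
    def discover(self) -> list:
        """Resource discovery on the object: list the formats it holds."""

    @abstractmethod
    def terms(self) -> str:
        """Return the object's terms and conditions."""


class SimpleObject(SmartObject):
    """Toy in-memory object that can grow and change over time."""

    def __init__(self, metadata: dict, terms: str):
        self._metadata = dict(metadata)
        self._terms = terms
        self._formats = {}

    def metadata(self) -> dict:
        return dict(self._metadata)

    def content(self, fmt: str) -> bytes:
        return self._formats[fmt]

    def discover(self) -> list:
        return sorted(self._formats)

    def terms(self) -> str:
        return self._terms

    def correct_metadata(self, field: str, value: str) -> None:
        self._metadata[field] = value

    def add_format(self, fmt: str, data: bytes) -> None:
        self._formats[fmt] = data


obj = SimpleObject({"title": "An exmaple"}, "free for non-commercial use")
obj.add_format("pdf", b"%PDF-placeholder")
obj.correct_metadata("title", "An example")
```

A service that only calls the four abstract methods works with any object type, while each discipline remains free to add its own display and services on top.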

• A strong distinction between the provision of data, and the implementation of data
  – also, a socio-legal context for sharing metadata
• Open, “self-describing” archives
• A universal, unique identifier name space
• Archived objects with more intelligence and flexibility

dex lessons learned
